Deep Learning Flashcards

1
Q

What is an activation function? Why is it important?

A

The activation function is the function a neuron applies to the weighted sum of its inputs. It is important because it is how you introduce nonlinearity into the network: without a nonlinear activation, any stack of layers collapses into a single linear transformation.

2
Q

What is a step function?

A

A function that produces binary output based on some threshold.
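
A minimal sketch in Python (function name hypothetical):

```python
def step(z, threshold=0.0):
    # Binary output: 1 above the threshold, 0 otherwise.
    return 1 if z > threshold else 0

print(step(0.7))   # 1
print(step(-0.3))  # 0
```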

3
Q

What is a perceptron?

A

A perceptron is a neuron whose activation is a step function on the weighted sum z of its inputs: it outputs 1 if z > 0 and 0 otherwise.
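
As a sketch (numpy assumed, names hypothetical), a perceptron's forward pass, here wired up as an AND gate:

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of inputs, then the step activation: fire iff z > 0.
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Weights hand-picked so the unit computes logical AND of two binary inputs.
w, b = np.array([1.0, 1.0]), -1.5
print(perceptron(np.array([1.0, 1.0]), w, b))  # 1
print(perceptron(np.array([1.0, 0.0]), w, b))  # 0
```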

4
Q

What does MLP stand for?

A

Multi-layer perceptron.

5
Q

What is a multi-layer perceptron?

A

It’s a network built by chaining together multiple layers of perceptron-style units, with each layer feeding the next. Viewing perceptrons as logic gates, layering lets you represent complex logical functions that a single perceptron cannot, as sketched below.
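
A minimal sketch of that idea (numpy assumed, weights hand-picked for illustration): two layers of step units computing XOR, which no single perceptron can represent:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)

def layer(x, W, b):
    # One layer of perceptron-style units applied in parallel.
    return step(W @ x + b)

# Hidden layer computes [OR, NAND]; the output unit ANDs them -> XOR.
W1, b1 = np.array([[1.0, 1.0], [-1.0, -1.0]]), np.array([-0.5, 1.5])
W2, b2 = np.array([[1.0, 1.0]]), np.array([-1.5])

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, layer(layer(np.array(x), W1, b1), W2, b2))  # XOR truth table
```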

6
Q

What is “universality”?

A

If a network has universality, then it can approximate any continuous function arbitrarily well with just one hidden layer, given enough hidden units.

7
Q

The definition of universality talks about one hidden layer. If that’s the case, why would you want multiple layers?

A

You can represent things more compactly with multiple layers: some functions that a deep network expresses with few units would require exponentially many units in a single hidden layer.

8
Q

What does “differentiable” mean?

A

A function is differentiable if its derivative exists everywhere, i.e. you can find the slope of its tangent line at any point.

9
Q

What is ReLU?

A

ReLU (rectified linear unit) is the activation function ReLU(x) = max(0, x): it outputs 0 for x ≤ 0 and x for x > 0.
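
In code (numpy assumed):

```python
import numpy as np

def relu(x):
    # max(0, x), applied elementwise.
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```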

10
Q

Why do we like to use ReLU instead of the step function?

A

The step function has no useful gradient: it is flat everywhere except at the threshold, so gradient descent gets no signal from it. ReLU has a useful nonzero gradient for all positive inputs.

11
Q

What is risk? What is empirical risk?

A

Risk is the actual expected loss over the entire real distribution of data. We don’t know the real distribution of data; we just have our training set, so we can only compute empirical risk: the expected (average) loss over the training set.
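
A minimal sketch of the distinction (numpy assumed; the model and loss are hypothetical toys). The true risk is uncomputable, so we average over the samples we have:

```python
import numpy as np

def empirical_risk(loss_fn, model, X, y):
    # Average loss over the training set: our computable stand-in for
    # the true risk, the expectation over the unknown real distribution.
    return np.mean([loss_fn(model(x_i), y_i) for x_i, y_i in zip(X, y)])

squared_error = lambda pred, target: (pred - target) ** 2
model = lambda x: 2.0 * x  # a toy "trained" model
X, y = np.array([1.0, 2.0, 3.0]), np.array([2.1, 3.9, 6.2])
print(empirical_risk(squared_error, model, X, y))
```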

12
Q

What’s the point of the VC Dimension?

A

The VC dimension shows that the error term in the generalization bound is finite, even for an infinite hypothesis set. The generalization bound has the form risk ≤ empirical risk + error, and a finite VC dimension guarantees that the error term is finitely bounded.

13
Q

What is the objective function?

A

It’s the loss function that we want to minimize.

14
Q

What is the vanishing gradient problem?

A

It’s when the gradient shrinks toward zero as it is backpropagated – which often happens because the chain rule multiplies many small factors (weights and activation derivatives) together across layers.

15
Q

What is it called when the gradient goes to zero?

A

The vanishing gradient problem.

16
Q

What is “over-saturation”?

A

It’s when the activation function is stuck in a flat region – such as the region x < 0 for ReLU – so the gradient is zero, which causes the vanishing gradient problem.

17
Q

Give three solutions to the vanishing gradient problem

A

Any three of these four: better initialization, a better activation function, regularization that rescales the gradient, or gradient clipping (sketched below).
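
As one example, a minimal numpy sketch of gradient clipping by norm (names hypothetical):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    # If the gradient's L2 norm exceeds max_norm, rescale it down to
    # max_norm; otherwise leave it untouched.
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

print(clip_gradient(np.array([3.0, 4.0])))  # [0.6 0.8] (norm 5 -> 1)
```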

18
Q

What does it mean if a model is “non-identifiable”?

A

It means that the model can get the same minimum loss with multiple different settings of the weights.

19
Q

Why is it a problem if a model is non-identifiable?

A

If we were using the weights to draw conclusions about the model, having two different sets of weights that are both optimal would make that difficult.

20
Q

What does it mean if a function is “convex”?

A

It means the function is bowl-shaped with a single basin: any local minimum is a global minimum. Specifically, if you pick any two points on the graph and draw the line segment between them, the segment never dips below the function: f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all λ in [0, 1].

21
Q

Are deep neural networks convex?

A

Absolutely not.

22
Q

Give three examples of problems caused by nonconvexity

A

Any three of: local minima, saddle points, cliff structures, asymptotes.

23
Q

Why don’t you want to just initialize all the weights to zero?

A

For ReLU, zero weights yield zero gradients, so learning never starts. More generally, it creates a symmetry problem: every hidden unit with the same activation function computes the same output and receives the same gradient update, so the units never differentiate from one another.

24
Q

What is the most common way to initialize?

A

Sampling each weight from a distribution – typically a zero-centered Gaussian or uniform distribution, with the scale often chosen based on layer size (e.g., Xavier or He initialization).
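
For example, a sketch of He-style initialization (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # Zero-mean Gaussian samples scaled by sqrt(2 / fan_in) -- a common
    # scheme for ReLU layers; the scale keeps activations from blowing
    # up or vanishing as depth grows.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W = he_init(fan_in=256, fan_out=128)
print(W.shape, round(W.std(), 3))  # std near sqrt(2/256) ~ 0.088
```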

25
Q

What is “warm start”?

A

It’s where you first train the model on a different objective just to obtain some initial weights, and then switch to the actual objective.

26
Q

What is dropout regularization?

A

Randomly remove hidden units by hardcoding their activations to 0, making a fresh random choice in each iteration of gradient descent. This simulates training a bunch of different (thinned) models.
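
A minimal sketch of (inverted) dropout during training (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    # Zero each unit with probability p; scale survivors by 1/(1 - p)
    # so the expected activation matches test time, when dropout is off.
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.array([0.2, -1.3, 0.7, 2.1])
print(dropout(h))  # a fresh random mask is drawn each iteration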

27
Q

What is ensemble regularization?

A

Train a bunch of different models and average their predictions.

28
Q

What is variational noise and why would you do it?

A

It’s when you add noise drawn from a Gaussian distribution during the forward pass while training. It’s a way to regularize.

29
Q

What is a convolution?

A

It’s a window-based operation where you slide one vector across another, combining them at each position to extract spatial or contextual information.
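
A minimal 1-D sketch (numpy assumed; like most deep learning libraries, this is strictly cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv1d(x, kernel):
    # Slide the kernel along x, taking a dot product at each window,
    # so each output summarizes a small neighborhood of the input.
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(conv1d(x, np.array([0.5, 0.5])))  # moving average: [1.5 2.5 3.5 4.5]
```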

30
Q

What is the vector you use to do the convolution called?

A

The kernel

31
Q

What is the kernel, in the context of CNNs?

A

It’s the vector of weights that defines the convolution; you slide it across your input to apply it

32
Q

What does CNN stand for?

A

Convolutional Neural Network

33
Q

What is a Convolutional Neural Network?

A

It’s a neural network with at least one convolutional layer.

34
Q

Why would you want to use a CNN?

A

It’s able to find local structure and interactions between nearby components.

35
Q

What is “pooling”?

A

It’s where you combine neighboring outputs of a convolutional layer with an aggregate like the max, average, or L2 norm, to zoom out on the convolution region.
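
For example, average pooling over non-overlapping windows (numpy assumed):

```python
import numpy as np

def avg_pool1d(x, window=2):
    # Average each non-overlapping window, shrinking the feature map --
    # the "zoom out" on the convolution region.
    trimmed = x[:len(x) - len(x) % window]  # drop any ragged tail
    return trimmed.reshape(-1, window).mean(axis=1)

print(avg_pool1d(np.array([1.0, 3.0, 2.0, 4.0, 5.0, 7.0])))  # [2. 3. 6.]
```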

36
Q

What does RNN stand for?

A

Recurrent Neural Network

37
Q

What is unique about a Recurrent Neural Network?

A

It captures information about a sequence of values by adding a loop to the hidden unit.

38
Q

What is the “transition function” in the context of an RNN?

A

The transition function is the function (defined by the weights) that maps the hidden state at one time step to the next; the same weights are reused at every step, so it is constant and doesn’t vary with time.

39
Q

What is the “unrolled network”?

A

It’s an RNN where you’ve expanded the loop into a bunch of iterations side by side – so the network at time 1, then feeding into itself at time 2, and then time 3. It ends up looking like a ladder.
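
A minimal sketch (numpy assumed; the tanh transition is chosen for illustration) – note the same weights are reused at every rung of the ladder:

```python
import numpy as np

def rnn_unrolled(xs, W_h, W_x, b, h0):
    # Unrolling just writes the recurrent loop out across the sequence:
    # the hidden state at each time step feeds the next.
    h = h0
    for x in xs:  # time 1, time 2, time 3, ...
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

h = rnn_unrolled(xs=[np.array([1.0]), np.array([0.5])],
                 W_h=np.array([[0.9]]), W_x=np.array([[0.4]]),
                 b=np.array([0.0]), h0=np.array([0.0]))
print(h)
```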

40
Q

What does LSTM stand for?

A

Long short-term memory

41
Q

What is unique about an LSTM?

A

It has memory cells (with gates) that can hang onto information over long durations.

42
Q

What does GAN stand for?

A

Generative Adversarial Network

43
Q

What are the two parts of a GAN?

A

A Generator, which creates fake examples to try to fool the discriminator, and a Discriminator, which learns to discern real examples from generated fakes. The two are trained together, each improving against the other.

44
Q

What is an “adversarial example”?

A

It’s one that has been minimally changed from a real example, but that causes the network to produce a very different prediction.

45
Q

What is “dataset augmentation”?

A

It’s when you add a bunch of new manually concocted examples to your training set to try to help the model generalize.

46
Q

What are examples of how people do dataset augmentation?

A

They might crop the data, or rotate it, or add noise, or erase random bits.
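
A minimal sketch of a few of these transforms on an image stored as a 2-D numpy array (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    flipped = np.fliplr(image)                         # mirror horizontally
    noisy = image + rng.normal(0.0, 0.1, image.shape)  # add Gaussian noise
    cropped = image[2:-2, 2:-2]                        # crop the borders
    return flipped, noisy, cropped

variants = augment(rng.random((28, 28)))
print([v.shape for v in variants])  # [(28, 28), (28, 28), (24, 24)]
```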