Deep Learning Flashcards
What is an activation function? Why is it important?
The activation function is the function a neuron applies to the weighted sum of its inputs. It is important because it is how you introduce nonlinearity: without a nonlinear activation, any stack of layers collapses into a single linear transformation.
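A minimal sketch of where the activation function sits in a neuron’s computation; the sigmoid choice and the weights here are arbitrary, purely for illustration:

    import numpy as np

    def neuron(x, w, b, activation):
        z = np.dot(w, x) + b     # weighted sum of inputs plus bias
        return activation(z)     # nonlinearity applied to the weighted sum

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1, sigmoid))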
What is a step function?
A function that produces binary output based on some threshold.
What is a perceptron?
A perceptron is a neuron whose activation is a step function on the weighted sum z of its inputs: it outputs 1 if z > 0 and 0 otherwise.
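A sketch of a perceptron under the z > 0 convention above; with these hand-picked (not learned) weights it behaves like an AND gate:

    import numpy as np

    def perceptron(x, w, b):
        z = np.dot(w, x) + b       # weighted sum of inputs
        return 1 if z > 0 else 0   # step function: binary output

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, perceptron(np.array(x), np.array([1.0, 1.0]), -1.5))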
What does MLP stand for?
Multi-layer perceptron.
What is a multi-layer perceptron?
It’s a network that chains together multiple layers of neurons. For example, logic gates built from perceptrons can be composed across layers to represent logical functions, such as XOR, that a single perceptron cannot.
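As an illustration of chaining perceptron gates, here is a hand-wired two-layer network that computes XOR; the weights are hand-picked rather than learned:

    def step(z):
        return 1 if z > 0 else 0

    def xor(x1, x2):
        # hidden layer: an OR gate and a NAND gate
        h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)
        h_nand = step(-1.0 * x1 - 1.0 * x2 + 1.5)
        # output layer: AND of the two hidden units
        return step(1.0 * h_or + 1.0 * h_nand - 1.5)

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((a, b), xor(a, b))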
What is “universality”?
If a network family has universality, it can approximate any continuous function (on a bounded domain) arbitrarily well with just one hidden layer, given enough hidden units.
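One common, schematic way to write this in LaTeX, for a fixed non-polynomial activation \sigma, a continuous target f on a compact set K, and any tolerance \varepsilon:

    \forall \varepsilon > 0 \;\; \exists N, \{v_i, w_i, b_i\} : \quad
    \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} v_i \, \sigma(w_i^{\top} x + b_i) \Big| < \varepsilon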
The definition of universality talks about one hidden layer. If that’s the case, why would you want multiple layers?
Depth can buy compactness: many functions that a deep network represents with relatively few units would need a much larger (sometimes exponentially larger) single hidden layer.
What does “differentiable” mean?
A function is differentiable if its derivative, the slope of its tangent line, exists at every point of its domain.
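In symbols, that tangent slope at a point x is the derivative, which must exist for the function to be differentiable there:

    f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}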
What is ReLU?
ReLU (rectified linear unit) is the activation function max(0, x): it outputs 0 for x ≤ 0 and x for x > 0.
Why do we like to use ReLU instead of step function?
The step function’s gradient is zero everywhere it is defined (and undefined at the jump), so gradient-based training gets no signal from it. ReLU has a useful gradient: it is 1 wherever the input is positive.
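A sketch of ReLU and its gradient, using the common convention that the gradient at exactly 0 is taken to be 0:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)      # 0 for x <= 0, x for x > 0

    def relu_grad(x):
        return (x > 0).astype(float)   # 1 where the unit is active, 0 elsewhere

    xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(xs), relu_grad(xs))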
What is risk? What is empirical risk?
Risk is the expected loss over the true, underlying data distribution. We don’t know that distribution; we only have a training set, so all we can measure is the empirical risk: the average loss over the training examples.
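In LaTeX, with hypothesis h, loss \ell, true data distribution D, and a training set of n examples (x_i, y_i):

    R(h) = \mathbb{E}_{(x, y) \sim D}\big[\ell(h(x), y)\big], \qquad
    \hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)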
What’s the point of the VC Dimension?
The VC dimension lets you prove a finite generalization bound even for an infinite hypothesis set. The bound has the form risk ≤ empirical risk + error term, and a finite VC dimension guarantees the error term is finite (and shrinks as the training set grows).
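A schematic form of the bound, holding with probability at least 1 - \delta for a hypothesis class of VC dimension d and n training examples; the point is that the error term is finite whenever d is finite and shrinks as n grows (constants omitted):

    R(h) \;\le\; \hat{R}(h) + O\!\left( \sqrt{\frac{d \log n + \log(1/\delta)}{n}} \right)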
What is the objective function?
It’s the loss function that we want to minimize.
What is the vanishing gradient problem?
It’s when the gradient shrinks toward zero as it is backpropagated through many layers, which often happens because the chain rule multiplies many small factors (weights and activation derivatives) together, so the early layers barely learn.
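A toy numeric illustration, assuming each layer contributes a factor of roughly 0.25 to the chain-rule product (0.25 is the maximum derivative of a sigmoid):

    factor_per_layer = 0.25
    for depth in [2, 5, 10, 20, 50]:
        # rough size of the gradient reaching the first layer of a depth-layer chain
        print(depth, factor_per_layer ** depth)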
What is it called when the gradient goes to zero?
The vanishing gradient problem.
What is “over-saturation”?
It’s when a unit’s activation function sits in a flat region of its input (for example x < 0 for ReLU, or the tails of a sigmoid), so its local gradient is zero or near zero, which contributes to the vanishing gradient problem.
Give three solutions to the vanishing gradient problem
Any three of these four: better weight initialization, a better activation function (e.g. ReLU), regularization that rescales the gradient, and gradient clipping.
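As a sketch of the last item, here is global-norm gradient clipping in plain NumPy; the threshold and the example gradients are arbitrary:

    import numpy as np

    def clip_by_global_norm(grads, max_norm=1.0):
        # rescale all gradients together if their combined norm exceeds max_norm
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        scale = min(1.0, max_norm / (total_norm + 1e-12))
        return [g * scale for g in grads]

    grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]   # global norm = 13
    print(clip_by_global_norm(grads, max_norm=5.0))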
What does it mean if a model is “non-identifiable”?
It means the model can achieve the same minimum loss with multiple different settings of the weights (for example, by permuting hidden units along with their weights).
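A sketch of one source of non-identifiability: permuting the hidden units of a small network (together with their incoming and outgoing weights) leaves the output, and hence the loss, unchanged. The weights here are random, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # input -> 3 hidden units
    w2 = rng.normal(size=3)                                 # hidden -> output

    def forward(x, W1, b1, w2):
        h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
        return w2 @ h

    perm = [2, 0, 1]                       # relabel the hidden units
    x = np.array([0.7, -1.3])
    print(forward(x, W1, b1, w2), forward(x, W1[perm], b1[perm], w2[perm]))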