Week 1 Flashcards
What are the suggested methods for handling bias and variance?
Basic Recipe for Machine Learning 00:49
For high bias (when performance on the training data is bad):
- Bigger network
- (Sometimes) Train longer
- Advanced optimization algorithms
- (Maybe) Find a better NN architecture
For high variance (when performance is good on the training set but bad on the dev set):
- More data
- Regularization
- (Sometimes) NN architecture
Getting a bigger network almost always just reduces your bias, without necessarily hurting your variance, so long as you regularize appropriately. And getting more data pretty much always reduces your variance and doesn’t hurt your bias much. True/False
Basic Recipe for Machine Learning 04:31
True
If you use L1 regularization, then the matrix of weights (W) will be sparse. True/False
Regularization 02:59
True
The L2 norm for matrices is the sum of the squares of the elements of the matrix, and by convention this is called the ____ norm.
Regularization 05:53
Frobenius
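A minimal numpy sketch (not from the lecture) of the squared Frobenius norm and the corresponding L2 regularization term added to the cost; the variable names and values (W, W_layers, lambd, m) are illustrative assumptions.
```python
import numpy as np

# Squared Frobenius norm of a weight matrix: sum of the squares of all entries.
W = np.random.randn(4, 3)
frob_sq = np.sum(np.square(W))          # ||W||_F^2
# Equivalent built-in (np.linalg.norm returns the unsquared Frobenius norm):
assert np.isclose(frob_sq, np.linalg.norm(W, 'fro') ** 2)

# L2 regularization term added to the cost J, summed over all layers:
# (lambd / (2 * m)) * sum_l ||W[l]||_F^2
lambd, m = 0.7, 64                       # illustrative values
W_layers = [np.random.randn(5, 4), np.random.randn(3, 5)]
l2_cost = (lambd / (2 * m)) * sum(np.sum(np.square(Wl)) for Wl in W_layers)
```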
L2 regularization is sometimes also called weight decay. This statement is True/False.
Regularization 07:31
True
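A small sketch of why L2 regularization behaves like weight decay: the regularization term in the gradient multiplies W by a factor slightly less than 1 on every update. The variable names and values (alpha, lambd, m, dW_from_backprop) are illustrative assumptions.
```python
import numpy as np

# Gradient descent step with L2 regularization:
#   dW = dW_from_backprop + (lambd / m) * W
#   W  = W - alpha * dW
#      = (1 - alpha * lambd / m) * W - alpha * dW_from_backprop
# The (1 - alpha * lambd / m) factor shrinks W a little on every iteration,
# hence the name "weight decay".
alpha, lambd, m = 0.01, 0.7, 64           # illustrative values
W = np.random.randn(5, 4)
dW_from_backprop = np.random.randn(5, 4)  # stand-in for the backprop gradient

dW = dW_from_backprop + (lambd / m) * W
W = W - alpha * dW
```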
By far the most common implementation of dropout today is ____. (External) Why is it so common?
Dropout Regularization 06:18
Inverted dropout
Inverted dropout is more common because it makes the testing much easier.
Inverted dropout is a variant of the original dropout technique developed by Hinton et al.
Just like traditional dropout, inverted dropout randomly keeps some units’ activations and sets the others to zero, according to a “keep probability” p.
The one difference is that, during the training of a neural network, inverted dropout scales the kept activations by the inverse of the keep probability, i.e. it divides them by p.
This
1) prevents the network’s activations from becoming too large, and
2) does not require any changes to the network during evaluation (testing).
(Also, according to the video, it keeps the expected value of the activations fixed.)
In contrast, traditional dropout requires scaling to be implemented during the test phase.
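A minimal sketch of inverted dropout applied to one layer’s activations, along the lines of the lecture’s illustration; the layer shape and keep_prob value are illustrative.
```python
import numpy as np

keep_prob = 0.8                              # probability of keeping a unit
a3 = np.random.randn(5, 10)                  # activations of some layer (illustrative)

# Training: zero out units at random, then scale up by 1/keep_prob so that
# the expected value of the activations stays the same.
d3 = np.random.rand(*a3.shape) < keep_prob   # boolean keep/drop mask
a3 = a3 * d3                                 # drop ~20% of the units
a3 = a3 / keep_prob                          # the "inverted" scaling step

# Testing: no dropout mask and no extra scaling is needed.
```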
For layers where you’re more worried about overfitting (really, the layers with a lot of parameters), you could set the keep probability to be ____ (larger/smaller) to apply a more powerful form of dropout.
Understanding Dropout 03:43
smaller
Dropout is very frequently used in the field of ____
Understanding Dropout 05:12
Computer vision
What’s one big downside of dropout?
Understanding Dropout 05:53, 06:33
One big downside of dropout is that the cost function J is no longer well defined on every iteration.
What are 3 ways of reducing overfitting in Computer Vision (CV) problems?
Other Regularization Methods (whole video)
1- Flipping the pictures horizontally
2- Random crops of the images (rotating, zooming in, distortion for digits, etc.); the first two augmentations are sketched below
3- Early stopping
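A minimal numpy sketch of the first two augmentations (horizontal flip and random crop); the image shape and crop size are illustrative assumptions.
```python
import numpy as np

img = np.random.rand(64, 64, 3)              # illustrative H x W x C image

# 1- Horizontal flip: reverse the width axis.
flipped = img[:, ::-1, :]

# 2- Random crop: take a smaller window at a random position, e.g. 56 x 56.
crop_h, crop_w = 56, 56
top = np.random.randint(0, img.shape[0] - crop_h + 1)
left = np.random.randint(0, img.shape[1] - crop_w + 1)
cropped = img[top:top + crop_h, left:left + crop_w, :]
```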
What’s an alternative to early stopping? What’s its downside?
Other Regularization Methods 06:57
- Rather than using early stopping, one alternative is to just use L2 regularization; then you can train the neural network as long as possible. I find that this makes the search space of hyperparameters easier to decompose and easier to search over.
- The downside of this, though, is that you might have to try a lot of values of the regularization parameter lambda, which makes searching over many values of lambda more computationally expensive.
What happens to the weights of a NN as it keeps iterating and learning?
Other Regularization Methods 03:48
When you haven’t run many iterations for your neural network yet, your parameter W will be close to zero, because with random initialization you probably initialize W to small random values; so before you train for a long time, W is still quite small. As you iterate and train, W gets bigger and bigger, until at some point you may have a much larger value of the parameter W for your neural network.
What early stopping does is, by stopping halfway, leave you with only a mid-size value of W. So, similar to L2 regularization, by picking a neural network with a smaller norm for your parameter W, hopefully your neural network is overfitting less.
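A minimal, self-contained sketch of early stopping on a toy model (not from the lecture): keep training until the dev-set error stops improving, then keep the parameters from the best dev-set error. All data and hyperparameters are illustrative.
```python
import numpy as np

# Toy setup: linear regression trained with gradient descent, stopping when
# the dev-set error stops improving.
rng = np.random.default_rng(0)
X_train, X_dev = rng.normal(size=(100, 3)), rng.normal(size=(30, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_train = X_train @ true_w + 0.1 * rng.normal(size=100)
y_dev = X_dev @ true_w + 0.1 * rng.normal(size=30)

w = np.zeros(3)
alpha, patience, bad_steps = 0.01, 10, 0
best_w, best_dev_err = w.copy(), np.inf

for step in range(10_000):
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= alpha * grad                                  # one gradient step
    dev_err = np.mean((X_dev @ w - y_dev) ** 2)        # dev-set error
    if dev_err < best_dev_err:
        best_w, best_dev_err, bad_steps = w.copy(), dev_err, 0
    else:
        bad_steps += 1
        if bad_steps >= patience:                      # dev error stopped improving
            break                                      # stop early, keep best_w
```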
What’s the downside of early stopping?
Other Regularization Methods 04:45
Think of the machine learning process as comprising several different steps.
- One is that you want an algorithm to optimize the cost function J, and we have various tools to do that, such as gradient descent.
- After optimizing the cost function J, you also want it to not overfit. And we have some tools to do that, such as regularization, getting more data, and so on.
Machine learning is easier to think about when you have one set of tools for optimizing the cost function J and all you care about is finding w and b, so that J(w,b) is as small as possible.
And then it’s a completely separate task to not overfit (to reduce variance). So when you’re doing that, you have a separate set of tools for doing it.
This principle is sometimes called orthogonalization.
The main downside of early stopping is that it couples these two tasks, so you can no longer work on the two problems independently: by stopping gradient descent early, you’re sort of breaking whatever you’re doing to optimize the cost function J, while simultaneously also trying to not overfit.
So instead of using different tools to solve the two problems, you’re using one that kind of mixes the two. And this just makes the set of things you could try, be more complicated to think about.
What is the effect of normalizing the input of a NN?
Normalizing Inputs 00:00
When training a neural network, one of the techniques to speed up your training is to normalize your inputs.
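A minimal sketch of input normalization as described in the lecture: subtract the mean and divide by the standard deviation, using the training set’s statistics for both train and test data; variable names and shapes are illustrative.
```python
import numpy as np

X_train = np.random.randn(3, 1000) * 5 + 2        # features x examples (illustrative)
X_test = np.random.randn(3, 200) * 5 + 2

mu = np.mean(X_train, axis=1, keepdims=True)      # per-feature mean
sigma = np.std(X_train, axis=1, keepdims=True)    # per-feature standard deviation

# Use the same mu and sigma (computed on the training set) for every data
# split, so train and test go through the identical transformation.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma
```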
Very deep neural networks can have the problems of vanishing and exploding gradients. It turns out that a partial solution to this (it doesn’t solve it entirely, but it helps a lot) is ____
Weight Initialization for Deep Networks 00:00
better or more careful choice of the random initialization for your neural network.
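A minimal sketch of this careful initialization, using He initialization for ReLU layers (scaling the random weights by sqrt(2 / n of the previous layer)); the layer sizes are illustrative.
```python
import numpy as np

layer_dims = [784, 128, 64, 10]        # illustrative layer sizes
params = {}
for l in range(1, len(layer_dims)):
    n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
    # He initialization: variance of each weight is about 2 / n_prev, which
    # keeps the scale of the activations roughly constant from layer to layer
    # when using ReLU, helping with vanishing/exploding gradients.
    params['W' + str(l)] = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)
    params['b' + str(l)] = np.zeros((n_curr, 1))
```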