Week 3 Flashcards
How would you rank the importance of the different hyper-parameters to tune in a NN?
|Tuning process
01:10
1) Alpha (learning rate)
2) Momentum term (usually 0.9), number of hidden units, mini-batch size
3) Number of layers, learning rate decay
How do you select the set of values to explore for multiple hyper-parameters in a NN?
A) Random
B) Grid
|Tuning process
02:33
A) Random: with the same budget of trials, random sampling explores many more distinct values of each hyper-parameter than a grid does (see the sketch below).
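A minimal sketch of that intuition, with hypothetical ranges and a hypothetical budget of 25 trials:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: a 5 x 5 grid spends 25 trials but only tries
# 5 distinct values of each hyper-parameter.
grid_trials = [(lr, h)
               for lr in np.linspace(0.001, 0.1, 5)
               for h in (25, 50, 75, 100, 125)]

# Random search: the same 25 trials try 25 distinct values of EACH
# hyper-parameter, so the one that actually matters gets explored
# far more finely.
random_trials = [(rng.uniform(0.001, 0.1), int(rng.integers(25, 126)))
                 for _ in range(25)]
```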
What is the coarse-to-fine sampling scheme in hyper-parameter tuning?
|Tuning process
05:31
When you find a few hyper-parameter values that work well, you zoom in and sample more densely from the smaller region around them (see the sketch below).
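A minimal sketch of the coarse-to-fine idea, using one hypothetical hyper-parameter (number of hidden units) and made-up ranges:

```python
import numpy as np

rng = np.random.default_rng(1)

# Coarse pass: sample the number of hidden units over a wide range.
coarse_hidden_units = rng.integers(10, 500, size=20)

# Suppose the best coarse runs clustered around ~120 units;
# fine pass: re-sample more densely inside that smaller region.
fine_hidden_units = rng.integers(80, 160, size=20)
```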
Why doesn’t it make sense to sample uniformly for hyper-parameters such as alpha (the learning rate) that span several orders of magnitude? What’s the solution?
|Using an appropriate scale to pick hyper parameters
01:55
Say we are looking at values between 0.0001 and 1. If we sample uniformly, about 90% of the samples fall between 0.1 and 1, so it’s more reasonable to search on a log scale: sample the exponent uniformly between -4 and 0 and set alpha = 10^exponent.
In general, when a hyper-parameter ranges over several orders of magnitude, transform it to a log scale before sampling uniformly (see the sketch below).
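A minimal sketch of log-scale sampling (the seed and ranges are illustrative); the same trick applies to the momentum term beta, by sampling 1 - beta on a log scale:

```python
import numpy as np

rng = np.random.default_rng(2)

# Learning rate: draw the exponent uniformly in [-4, 0], so each
# decade (0.0001-0.001, ..., 0.1-1) gets equal weight.
r = rng.uniform(-4, 0)
alpha = 10 ** r

# Momentum beta lives close to 1 (e.g. 0.9 to 0.999), so sample
# 1 - beta on a log scale instead of beta itself.
r_beta = rng.uniform(-3, -1)
beta = 1 - 10 ** r_beta
```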
In batch normalization, should you normalize the value before the activation function (normalize z) or after the activation function (normalize a)?
|Normalizing activations in a network
02:33
In practice it’s more common to normalize z
What are the steps of implementing Gradient Descent with Batch Normalization, assuming we’re using mini-batches?
|Fitting batch norm into a Neural Network
09:16
For t=1 to #mini-batches:
1) Compute forward propagation on X{t}.
2) In each layer, use batch normalization to replace Z[L] with ~Z[L]: first normalize Z[L] using its mini-batch mean and SD (so it has mean 0 and SD 1), then compute ~Z[L] = Gamma[L]×Z_norm[L] + Beta[L], which lets ~Z[L] take on a different mean and SD. The normalization cancels the effect of the bias B[L]; Beta[L] effectively takes over its role.
3) Back-propagate to compute dW[L], dGamma[L], dBeta[L] (dB[L] is not needed, since B[L] is eliminated by the normalization).
4) Update the parameters using the Gradient Descent formula (or Momentum, RMSprop, or Adam). See the sketch of step 2 below.
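A minimal NumPy sketch of step 2 for one hidden layer on a mini-batch X{t} (the layer sizes, the ReLU activation, and epsilon are illustrative assumptions; back-prop and the parameter update are omitted):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit over the mini-batch, then rescale/shift."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # mean 0, SD 1
    Z_tilde = gamma * Z_norm + beta           # learnable mean/SD via Gamma, Beta
    return Z_tilde, mu, var

rng = np.random.default_rng(3)
n_x, n_h, m = 3, 4, 8                         # input size, hidden units, batch size
X_t = rng.normal(size=(n_x, m))               # one mini-batch
W = rng.normal(size=(n_h, n_x)) * 0.01
gamma = np.ones((n_h, 1))                     # Gamma[L]
beta = np.zeros((n_h, 1))                     # Beta[L] (replaces the bias B[L])

Z = W @ X_t                                   # note: no "+ b" term
Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
A = relu(Z_tilde)                             # activations passed to the next layer
```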
Is Batch Normalization used for regularization purposes in NNs?
|Why does batch norm work?
10:53
No. It has a slight regularization effect (the mini-batch mean and SD add some noise to each layer’s values), but it’s not used for regularization; it’s used to speed up learning.
How do we use Batch Normalization at the testing stage?
Note: the testing stage is NOT batch by batch; predictions may come one instance at a time. During training, batch normalization normalized each layer’s Zs over the whole mini-batch (mean 0, SD 1) and then transformed them using Gamma and Beta.
|Batch Norm at test time
04:41
We keep an exponentially weighted average of the mean and SD (variance) across the training mini-batches, and use those running statistics to normalize test instances (see the sketch below).
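A minimal sketch of the test-time rule, continuing the assumptions of the previous sketch (the EWA decay rate of 0.9 is an illustrative choice):

```python
import numpy as np

eps = 1e-8
ewa_decay = 0.9                               # assumed decay rate for the running averages

n_h = 4
running_mu = np.zeros((n_h, 1))
running_var = np.ones((n_h, 1))

# During training, after computing each mini-batch's mu and var,
# update the running statistics with an exponentially weighted average:
def update_running_stats(mu, var, running_mu, running_var, decay=ewa_decay):
    running_mu = decay * running_mu + (1 - decay) * mu
    running_var = decay * running_var + (1 - decay) * var
    return running_mu, running_var

# At test time, a single example's pre-activations z (shape (n_h, 1)) are
# normalized with the running statistics instead of batch statistics:
def batchnorm_test(z, gamma, beta, running_mu, running_var, eps=eps):
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```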