batch normalization Flashcards
why should we use BN?
when there are many layers, the inputs to later layers can shift a lot as the earlier weights change –> BN keeps those inputs fairly stable throughout training
–> speed up learning
BN also has a small regularization effect
BN process?
compute the mean mu and variance sigma^2 of z over the mini-batch
znorm = (z - mu) / sqrt(sigma^2 + epsilon)
z~ = gamma × znorm + beta
–> use z~ instead of z
gamma and beta are learnable parameters of the layer, shared across batches; only mu and sigma^2 are recomputed for each mini-batch (see the sketch below)
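a minimal NumPy sketch of that forward step (function and variable names are just illustrative):

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, epsilon=1e-5):
    # z: pre-activations for one mini-batch, shape (batch_size, n_units)
    mu = z.mean(axis=0)                           # per-unit mean over the batch
    var = z.var(axis=0)                           # per-unit variance over the batch
    z_norm = (z - mu) / np.sqrt(var + epsilon)    # ~zero mean, unit variance
    z_tilde = gamma * z_norm + beta               # rescale and shift with learnable params
    return z_tilde

# toy usage with made-up shapes
z = np.random.randn(64, 10) * 3 + 5
gamma, beta = np.ones(10), np.zeros(10)
z_tilde = batchnorm_forward(z, gamma, beta)
```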
why should gamma and beta be learnable?
we might not always want a fixed distribution of inputs (e.g., a std normal dist right before a sigmoid keeps it stuck in its near-linear region)
–> learnable gamma and beta let the network choose the mean and variance of the inputs, even undoing the normalization entirely if that helps (see the check below)
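a quick NumPy check of that last point: with gamma = sqrt(sigma^2 + epsilon) and beta = mu, BN collapses to the identity, so the original distribution is always recoverable.

```python
import numpy as np

epsilon = 1e-5
z = np.random.randn(64, 10) * 3 + 5
mu, var = z.mean(axis=0), z.var(axis=0)
z_norm = (z - mu) / np.sqrt(var + epsilon)

# choosing gamma = sqrt(var + epsilon) and beta = mu exactly undoes the normalization
gamma, beta = np.sqrt(var + epsilon), mu
z_tilde = gamma * z_norm + beta
print(np.allclose(z_tilde, z))   # True
```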
does it make sense to use b (bias) while using BN?
no. adding a constant to z gets subtracted back out with the batch mean when computing znorm (beta takes over that role) –> no point
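a tiny sanity check of that cancellation (the constant b is arbitrary):

```python
import numpy as np

epsilon = 1e-5
z = np.random.randn(64, 10)
b = 7.0   # any constant bias

def normalize(x):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + epsilon)

# the bias is subtracted back out along with the batch mean
print(np.allclose(normalize(z), normalize(z + b)))   # True
```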
why does BN have a regularization effect?
mu and sigma^2 are estimated on each mini-batch, so every z gets normalized with slightly noisy statistics (a bit like dropout) –> introduces some noise into training
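a small illustration of that noise (batch contents are made up): the same example is normalized differently depending on which mini-batch it lands in.

```python
import numpy as np

epsilon = 1e-5
x = np.ones(5)   # the same example, placed into two different mini-batches
batch_a = np.vstack([x, np.random.randn(63, 5)])
batch_b = np.vstack([x, np.random.randn(63, 5)])

def normalize(z):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + epsilon)

# x gets a slightly different znorm depending on the rest of the batch
print(normalize(batch_a)[0])
print(normalize(batch_b)[0])
```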
BN at test time?
gamma and beta are learned parameters and stay fixed; for mu and sigma^2, keep an exponentially weighted average across mini-batches during training and use those running values at test time
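a minimal sketch of tracking and reusing those running averages (class name and momentum value are assumptions):

```python
import numpy as np

class BatchNorm:
    def __init__(self, n_units, momentum=0.9, epsilon=1e-5):
        self.gamma = np.ones(n_units)          # learned during training, fixed at test time
        self.beta = np.zeros(n_units)
        self.running_mu = np.zeros(n_units)    # exponentially weighted averages of batch stats
        self.running_var = np.ones(n_units)
        self.momentum = momentum
        self.epsilon = epsilon

    def forward_train(self, z):
        mu, var = z.mean(axis=0), z.var(axis=0)
        # update the exponentially weighted averages of mu and sigma^2
        self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        z_norm = (z - mu) / np.sqrt(var + self.epsilon)
        return self.gamma * z_norm + self.beta

    def forward_test(self, z):
        # no fresh batch statistics at test time: use the running averages instead
        z_norm = (z - self.running_mu) / np.sqrt(self.running_var + self.epsilon)
        return self.gamma * z_norm + self.beta
```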