setting up optimization problem Flashcards
what is normalizing input
subtract the mean: x := x - mu
normalize the variance: x := x / sigma (divide by the standard deviation)
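A minimal NumPy sketch of both steps on a toy feature matrix (the data values and array names are made up for illustration):

```python
import numpy as np

# toy feature matrix: rows = samples, columns = features (values are made up)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation

X_norm = (X - mu) / sigma  # subtract mean, then divide by std dev
print(X_norm.mean(axis=0), X_norm.std(axis=0))  # ~[0, 0] and [1, 1]

# note: at test time, reuse the mu and sigma computed on the training set
```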
why normalize input
the ranges of the features may be very different (different scales) –> some features affect the cost function more than others, so gradient descent converges more slowly
what are exploding / vanishing gradients
when the gradients (slopes) become extremely large or extremely small during training
y = W[L] · W[L-1] · … · W[2] · W[1] · x  (deep network with linear activations)
as we backpropagate through the model, more and more derivatives are multiplied together –> if the derivatives are large the gradient becomes far too large, if they are small it becomes far too small
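A tiny sketch of the repeated-multiplication effect, using 50 identical scalar weights just above and just below 1 (the depth and the values are arbitrary choices for the demo):

```python
L = 50                       # depth of the network (arbitrary for the demo)
w_large, w_small = 1.5, 0.5  # per-layer "weights" slightly above / below 1

print(w_large ** L)  # ~6.4e8   -> values/gradients of this size explode
print(w_small ** L)  # ~8.9e-16 -> values/gradients of this size vanish
```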
what happens when gradients explode/vanish
exploding: the network becomes very unstable due to large weight updates; the weights may overflow (become NaN) and can no longer be updated
vanishing: the network cannot learn, since the weights and biases of the initial layers barely update; in the worst case the gradients become 0 and learning stops
how to detect exploding gradients
model weights become very large and may become NaN; learning is avalanche-like (the loss jumps around wildly)
how to detect vanishing gradients
parameters of the layers close to the output change significantly while the layers close to the input barely change
model weights may shrink toward 0 during training
training is very slow, the loss improves poorly, and learning can stall very early
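One practical way to check for either symptom is to log per-layer gradient norms during training; a sketch assuming a PyTorch model (model, loss and optimizer are placeholders):

```python
import torch

def log_gradient_norms(model: torch.nn.Module) -> None:
    """Print the L2 norm of each parameter's gradient (call after loss.backward()).

    Very large or NaN norms point to exploding gradients; norms stuck near 0
    for the layers close to the input point to vanishing gradients.
    """
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name}: {param.grad.norm().item():.3e}")

# typical use inside a training step:
#   loss.backward()
#   log_gradient_norms(model)
#   optimizer.step()
```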
what are the ways to prevent exploding/vanishing gradients?
- proper weight initialization
- using a non-saturating activation function
- batch normalization
- gradient clipping
- reducing the number of layers (at the cost of some model complexity)
how to initialize weights
conditions we try to satisfy:
1. variance of output of each layer should be equal to variance of its input
2. the gradients should have equal variance before and after flowing through a layer in reverse direction
impossible to satisfy both unless the number of inputs to the layer (fan_in) equals the number of neurons in the layer (fan_out)
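A quick NumPy check of condition 1 for a single linear layer: with weights drawn with variance 1/fan_in, the output variance roughly matches the input variance (the layer sizes 512 and 256 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256                 # arbitrary layer sizes

x = rng.normal(0.0, 1.0, size=fan_in)      # unit-variance input
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

y = W @ x                                  # linear layer, no activation
print(x.var(), y.var())                    # both close to 1
```

Condition 2 (the backward pass) would instead call for variance 1/fan_out, which is why Xavier initialization below compromises with fan_avg.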
what is Xavier initialization?
number of inputs to the layer: fan_in
number of neurons in the layer: fan_out
fan_avg = (fan_in + fan_out) / 2
normal distribution with mean = 0 and var = 1/fan_avg
activation functions: none, tanh, logistic, softmax
what is He initialization
normal distribution with mean = 0 and var = 2/fan_in
activation functions: ReLU and its variants
what is LeCun initialization
normal distribution with mean = 0 and var = 1/fan_in
activation function: SELU
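A minimal NumPy sketch that samples weights according to the three variance formulas above (the function name and the 256/128 layer shape are just for illustration):

```python
import numpy as np

def init_weights(fan_in, fan_out, scheme="xavier", rng=None):
    """Sample a (fan_out, fan_in) weight matrix from N(0, var) for one scheme."""
    rng = rng or np.random.default_rng()
    if scheme == "xavier":      # none / tanh / logistic / softmax
        var = 2.0 / (fan_in + fan_out)     # = 1 / fan_avg
    elif scheme == "he":        # ReLU and its variants
        var = 2.0 / fan_in
    elif scheme == "lecun":     # SELU
        var = 1.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.normal(0.0, np.sqrt(var), size=(fan_out, fan_in))

W = init_weights(256, 128, scheme="he")    # e.g. for a ReLU layer
```

In Keras these correspond to the GlorotNormal, HeNormal and LecunNormal initializers.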
what are saturating activation functions?
very large input -> output flattens to a constant
very negative input -> output flattens to a constant
e.g. sigmoid and tanh
–> a main reason behind vanishing gradients: in the saturated regions the derivative is close to 0
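A quick NumPy look at the sigmoid derivative sigmoid(x) * (1 - sigmoid(x)) in the saturated regions (the sample inputs are chosen arbitrarily):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
d_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of sigmoid
print(d_sigmoid)  # ~[4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]: ~0 once saturated
```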
what are non-saturating activation functions?
ReLU and its alternatives (leaky ReLU, SELU, …)
what is gradient clipping
limit every component of the gradient to between -1 and 1
all partial derivatives of the loss are kept between -1 and 1 (or another chosen value)
the threshold is a tunable hyperparameter (clip value)
clipping by value can change the orientation of the gradient vector –> often better to clip by norm (rescale so the L2 norm stays below the threshold)
e.g. grad = [0.9, 100], clipnorm = 1 –> [0.009, 0.99996]
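A NumPy sketch of both clipping styles, reproducing the example above; in Keras the same effect comes from the clipvalue / clipnorm optimizer arguments:

```python
import numpy as np

def clip_by_value(grad, clip_value=1.0):
    # clamp each component to [-clip_value, clip_value]; can change the direction
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm=1.0):
    # rescale the whole vector so its L2 norm is at most clip_norm; keeps direction
    norm = np.linalg.norm(grad)
    return grad if norm <= clip_norm else grad * (clip_norm / norm)

grad = np.array([0.9, 100.0])
print(clip_by_value(grad))  # [0.9, 1.0]          -> orientation changed
print(clip_by_norm(grad))   # [~0.009, ~0.99996]  -> same direction, norm 1
```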