setting up optimization problem Flashcards
what is normalizing input
subtract the mean: x := x - mu
normalize the variance: x := x / sigma (divide by the standard deviation)
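A minimal NumPy sketch of both steps on a toy feature matrix (the data values and array names are made up for illustration):

```python
import numpy as np

# toy feature matrix: rows = samples, columns = features (values are made up)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation

X_norm = (X - mu) / sigma  # subtract mean, then divide by std dev
print(X_norm.mean(axis=0), X_norm.std(axis=0))  # ~[0, 0] and [1, 1]

# note: at test time, reuse the mu and sigma computed on the training set
```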
why normalize input
the ranges of the features may be very different (different scales) –> some features affect the cost function more than others, so gradient descent converges more slowly
what are exploding / vanishing gradients
when the gradients (slopes) become extremely large or extremely small during training
y = W[L] · W[L-1] · … · W[2] · W[1] · x  (deep network with linear activations)
as we backpropagate through the model, more and more derivatives are multiplied together –> if the derivatives are large the gradient becomes far too large, if they are small it becomes far too small
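A tiny sketch of the repeated-multiplication effect, using 50 identical scalar weights just above and just below 1 (the depth and the values are arbitrary choices for the demo):

```python
L = 50                       # depth of the network (arbitrary for the demo)
w_large, w_small = 1.5, 0.5  # per-layer "weights" slightly above / below 1

print(w_large ** L)  # ~6.4e8   -> values/gradients of this size explode
print(w_small ** L)  # ~8.9e-16 -> values/gradients of this size vanish
```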
what happens when gradients explode/vanish
exploding: the network becomes very unstable due to large weight updates; the weights may overflow (become NaN) and can no longer be updated
vanishing: the network cannot learn, since the weights and biases of the initial layers barely update; in the worst case the gradients become 0 and learning stops
how to detect exploding gradients
model weights become very large and may become NaN; learning is avalanche-like (the loss jumps around wildly)
how to detect vanishing gradients
parameters of the layers close to the output change significantly while the layers close to the input barely change
model weights may shrink toward 0 during training
training is very slow, the loss improves poorly, and learning can stall very early
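One practical way to check for either symptom is to log per-layer gradient norms during training; a sketch assuming a PyTorch model (model, loss and optimizer are placeholders):

```python
import torch

def log_gradient_norms(model: torch.nn.Module) -> None:
    """Print the L2 norm of each parameter's gradient (call after loss.backward()).

    Very large or NaN norms point to exploding gradients; norms stuck near 0
    for the layers close to the input point to vanishing gradients.
    """
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name}: {param.grad.norm().item():.3e}")

# typical use inside a training step:
#   loss.backward()
#   log_gradient_norms(model)
#   optimizer.step()
```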
what are the ways to prevent exploding/vanishing gradients?
- proper weight initialization
- using a non-saturating activation function
- batch normalization
- gradient clipping
- reducing the number of layers (at the cost of some model complexity)
how to initialize weights
conditions we try to satisfy:
1. variance of output of each layer should be equal to variance of its input
2. the gradients should have equal variance before and after flowing through a layer in reverse direction
impossible to satisfy both unless the number of inputs to the layer (fan_in) equals the number of neurons in the layer (fan_out)
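A quick NumPy check of condition 1 for a single linear layer: with weights drawn with variance 1/fan_in, the output variance roughly matches the input variance (the layer sizes 512 and 256 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256                 # arbitrary layer sizes

x = rng.normal(0.0, 1.0, size=fan_in)      # unit-variance input
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

y = W @ x                                  # linear layer, no activation
print(x.var(), y.var())                    # both close to 1
```

Condition 2 (the backward pass) would instead call for variance 1/fan_out, which is why Xavier initialization below compromises with fan_avg.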
what is Xavier initialization?
number of inputs to the layer: fan_in
number of neurons in the layer: fan_out
fan_avg = (fan_in + fan_out) / 2
normal distribution with mean = 0 and var = 1/fan_avg
activation functions: none, tanh, logistic, softmax
what is He initialization
normal distribution with mean = 0 and var = 2/fan_in
activation functions: ReLU and its variants
what is LeCun initialization
normal distribution with mean = 0 and var = 1/fan_in
activation function: SELU
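A minimal NumPy sketch that samples weights according to the three variance formulas above (the function name and the 256/128 layer shape are just for illustration):

```python
import numpy as np

def init_weights(fan_in, fan_out, scheme="xavier", rng=None):
    """Sample a (fan_out, fan_in) weight matrix from N(0, var) for one scheme."""
    rng = rng or np.random.default_rng()
    if scheme == "xavier":      # none / tanh / logistic / softmax
        var = 2.0 / (fan_in + fan_out)     # = 1 / fan_avg
    elif scheme == "he":        # ReLU and its variants
        var = 2.0 / fan_in
    elif scheme == "lecun":     # SELU
        var = 1.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.normal(0.0, np.sqrt(var), size=(fan_out, fan_in))

W = init_weights(256, 128, scheme="he")    # e.g. for a ReLU layer
```

In Keras these correspond to the GlorotNormal, HeNormal and LecunNormal initializers.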
what are saturating activation functions?
very large input -> output flattens to a constant
very negative input -> output flattens to a constant
e.g. sigmoid and tanh
–> a main reason behind vanishing gradients: in the saturated regions the derivative is close to 0
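A quick NumPy look at the sigmoid derivative sigmoid(x) * (1 - sigmoid(x)) in the saturated regions (the sample inputs are chosen arbitrarily):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
d_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))   # derivative of sigmoid
print(d_sigmoid)  # ~[4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]: ~0 once saturated
```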
what are non-saturating activation functions?
ReLU and its alternatives (leaky ReLU, SELU, …)
what is gradient clipping
limit every component of the gradient to between -1 and 1
all partial derivatives of the loss are kept between -1 and 1 (or another chosen value)
the threshold is a tunable hyperparameter (clip value)
clipping by value can change the orientation of the gradient vector –> often better to clip by norm (rescale so the L2 norm stays below the threshold)
e.g. grad = [0.9, 100], clipnorm = 1 –> [0.009, 0.99996]
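A NumPy sketch of both clipping styles, reproducing the example above; in Keras the same effect comes from the clipvalue / clipnorm optimizer arguments:

```python
import numpy as np

def clip_by_value(grad, clip_value=1.0):
    # clamp each component to [-clip_value, clip_value]; can change the direction
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm=1.0):
    # rescale the whole vector so its L2 norm is at most clip_norm; keeps direction
    norm = np.linalg.norm(grad)
    return grad if norm <= clip_norm else grad * (clip_norm / norm)

grad = np.array([0.9, 100.0])
print(clip_by_value(grad))  # [0.9, 1.0]          -> orientation changed
print(clip_by_norm(grad))   # [~0.009, ~0.99996]  -> same direction, norm 1
```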