setting up optimization problem Flashcards

1
Q

what is normalizing input

A

subtract the mean: x - mu
normalize the variance: divide by the standard deviation, x / sigma
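
a minimal NumPy sketch of this (the toy data and variable names are just for illustration):

import numpy as np

X_train = np.random.rand(100, 3) * [1.0, 100.0, 1000.0]  # toy data: 3 features on very different scales

mu = X_train.mean(axis=0)          # per-feature mean
sigma = X_train.std(axis=0)        # per-feature standard deviation
X_norm = (X_train - mu) / sigma    # subtract mean, divide by std dev

# note: reuse the same mu and sigma on the test set instead of recomputing them there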

2
Q

why normalize input

A

the ranges of the features may be very different (different scales) → some features affect the cost function much more than others, which elongates its contours and slows down gradient descent

3
Q

what are exploding / vanishing gradients?

A

when the gradients (slopes) become extremely large or extremely small
ignoring activations and biases, the output is roughly y = W[L] W[L-1] … W[2] W[1] x
as we backpropagate through the model, the derivatives of successive layers are multiplied together → if they are larger than 1 the gradient grows exponentially (explodes), if they are smaller than 1 it shrinks exponentially (vanishes)
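
a tiny numeric sketch of why the repeated multiplication explodes or vanishes (the factors 1.5 and 0.5 are arbitrary stand-ins for per-layer weight scales):

L = 50                 # number of layers
print(1.5 ** L)        # ~6.4e8   -> grows exponentially (explodes)
print(0.5 ** L)        # ~8.9e-16 -> shrinks exponentially (vanishes)
# the same thing happens when L weight matrices are multiplied during backprop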

4
Q

what happens with exploding/vanishing gradients?

A

exploding: the network becomes very unstable due to large changes in the weights; the weights may overflow (become NaN) and can then no longer be updated
vanishing: the network cannot learn useful features because the weights and biases of the initial layers barely change; in the worst case their gradients are 0 and learning stops

5
Q

how to recognize exploding gradients

A

the model weights become very large and may turn into NaN; the loss jumps around wildly from update to update (avalanche learning)

6
Q

how to recognize vanishing gradients

A

the parameters of the layers close to the output change significantly while those of the earlier layers barely change
model weights may shrink toward 0 during training
the model learns very slowly, the loss stays poor, and training can stall very early

7
Q

what are the ways to prevent exploding/vanishing gradients?

A
1. proper weight initialization
2. using non-saturating activation functions
3. batch normalization
4. gradient clipping
5. reducing the number of layers (at the cost of some model complexity)

8
Q

how to initialize weights

A

conditions we would like to hold:
1. the variance of the outputs of each layer should equal the variance of its inputs
2. the gradients should have equal variance before and after flowing through a layer in the reverse direction

both conditions cannot hold exactly unless the number of inputs to the layer (fan_in) equals the number of neurons in the layer (fan_out)

9
Q

what is Xavier initialization?

A

number of inputs to the layer: fan_in
number of neurons in the layer: fan_out

fan_avg = (fan_in + fan_out) / 2
normal distribution with mean = 0 and variance = 1/fan_avg
activation functions: none, tanh, logistic, softmax
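
a minimal NumPy sketch of drawing Xavier-initialized weights (the layer sizes are made up for illustration):

import numpy as np

fan_in, fan_out = 256, 128        # e.g. a dense layer with 256 inputs and 128 neurons
fan_avg = (fan_in + fan_out) / 2
W = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_avg)   # mean 0, variance 1/fan_avg
print(W.var())                    # close to 1/fan_avg ≈ 0.0052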

10
Q

what is He initialization

A

normal distribution with mean 0 and variance = 2/fan_in
activation functions: ReLU and its variants

11
Q

what is LeCun initialization

A

normal distribution with mean 0 and variance = 1/fan_in
activation function: SELU
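
in Keras all three schemes are available as built-in kernel initializers; a minimal sketch pairing each with its usual activation (the layer sizes are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="tanh", kernel_initializer="glorot_normal"),  # Xavier/Glorot
    tf.keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),      # He
    tf.keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),   # LeCun
    tf.keras.layers.Dense(10, activation="softmax"),
])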

12
Q

what are saturating activation functions?

A

very large input → output approaches a constant
very negative input → output approaches a constant
e.g. sigmoid and tanh
in those flat regions the derivative is close to 0 → a main reason behind vanishing gradients
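
a small numeric check showing the sigmoid derivative collapsing toward 0 as the input grows (the sample inputs are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(x, s * (1 - s))   # derivative: 0.25, ~0.10, ~0.0066, ~0.000045 -> saturates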

13
Q

what are non-saturating activation functions?

A

ReLU and its alternatives (leaky ReLU, SELU, …)

14
Q

what is gradient clipping

A

limit every component of the gradient to lie between -1 and 1 (or another chosen threshold)
so every partial derivative of the loss stays between -1 and 1 (or the defined value)
the threshold (clip value) is tunable

clipping by value can change the direction (orientation) of the gradient → often better to clip by the L2 norm instead
e.g. grad = [0.9, 100] with clipnorm = 1 → [~0.009, ~0.99996] (same direction, norm 1)
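
a minimal sketch of clip-by-value vs. clip-by-norm on that example (in Keras the equivalent is the optimizers' clipvalue / clipnorm arguments):

import numpy as np

grad = np.array([0.9, 100.0])

clipped_by_value = np.clip(grad, -1.0, 1.0)        # [0.9, 1.0]  -> direction changes
norm = np.linalg.norm(grad)                        # ~100.004
clipped_by_norm = grad * 1.0 / max(norm, 1.0)      # [~0.009, ~0.99996] -> direction preserved
print(clipped_by_value, clipped_by_norm)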
