optimization algos Flashcards
what are batch and mini-batch gradient descent?
if the training set has too many samples –> split it into mini-batches
batch: process the entire training set in every step
mini-batch: 1 smaller batch at a time. in each training step we process only 1 mini-batch, update the weights, then move on to the next batch (see the sketch below)
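a minimal runnable sketch of that loop in Python/numpy (my own illustration, not from the card; compute_gradient is a hypothetical helper that returns the gradient for one batch):

import numpy as np

# one epoch = shuffle the data, then walk through it one mini-batch at a time
def minibatch_gd(X, y, theta, compute_gradient, lr=0.01, batch_size=64, epochs=10):
    m = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(m)            # shuffle sample indices each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]   # indices of one mini-batch
            grad = compute_gradient(X[idx], y[idx], theta)   # gradient on this mini-batch only
            theta = theta - lr * grad              # update weights, then move on to the next batch
    return theta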
is mini-batch faster?
when the training set is large, mini-batch runs much faster
difference in cost during training between batch and mini-batch?
batch: the cost only goes down
mini-batch: the cost fluctuates but with a downward trend
how do different batch sizes behave?
batch gradient descent: size = m –> smooth descent but each step takes long
stochastic GD: size = 1 –> each step is cheap but extremely noisy (a smaller alpha helps), and you lose the speedup from vectorization
mini-batch: some size in between –> fastest learning: you keep the vectorization speedup and still make progress after every mini-batch (usually works better than the other 2)
how to choose minibatch size?
small training set: just use batch GD
typical sizes: 64, 128, 256, 512 (most common); 1024 also shows up
make sure a mini-batch fits in CPU/GPU memory
goal of gradient descent
minimize the loss function of an ML problem
idea of basic GD algo
the opposite direction of the gradient points toward lower values of the loss –> iteratively take steps in the opposite direction of the gradient
vanilla GD algo
delta = -learning_rate * gradient
theta += delta
theta is the param we want to optimize
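a runnable toy example (my own, not from the card): vanilla GD minimizing L(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3):

theta = 0.0
learning_rate = 0.1
for step in range(100):
    gradient = 2 * (theta - 3)            # dL/dtheta at the current theta
    delta = -learning_rate * gradient     # step opposite to the gradient
    theta += delta
print(theta)                              # ends up very close to the minimum at 3.0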
momentum algo?
in each step, in addition to the current gradient, it also adds on the (decayed) movement from the previous step
sum_of_gradient = gradient + prev_sum_of_gradient×decay_rate
delta = -lr×sum_of_gradient
theta += delta
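same toy loss L(theta) = (theta - 3)^2 with momentum added (illustrative sketch, not from the card):

theta = 0.0
lr = 0.1
decay_rate = 0.9
sum_of_gradient = 0.0
for step in range(200):
    gradient = 2 * (theta - 3)
    sum_of_gradient = gradient + sum_of_gradient * decay_rate   # add the decayed movement from the previous step
    delta = -lr * sum_of_gradient
    theta += delta
print(theta)   # overshoots and oscillates a bit, but settles near 3.0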
what if decay rate = 0? = 1? what are reasonable decay rates?
decay rate = 0 –> vanilla GD (no momentum)
decay rate = 1 –> cost goes back and forth endlessly (given reasonably small lr)
typically 0.8 to 0.9 (like a surface with a little bit of friction, so it eventually slows down and stops)
how is momentum better than vanilla GD?
simply moves faster (all the momentum it accumulates)
has a shot at escaping local minima (momentum may propel it out of a local minimum)
AdaGrad (adaptive gradient) algo
adagrad keeps track of the sum of gradient squared
sum_of_gradient_squared = prev_sum_of_gradient_squared + gradient^2
delta = -lr*gradient/sqrt(sum_of_gradient_squared)
theta += delta
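toy adagrad on the same loss (illustrative; the small eps added to avoid division by zero is standard practice but not on the card):

theta = 0.0
lr = 1.0
eps = 1e-8
sum_of_gradient_squared = 0.0
for step in range(100):
    gradient = 2 * (theta - 3)
    sum_of_gradient_squared += gradient ** 2                           # only ever grows
    delta = -lr * gradient / (sum_of_gradient_squared ** 0.5 + eps)    # effective step shrinks as the sum grows
    theta += delta
print(theta)   # creeps toward 3.0, more and more slowly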
idea of AdaGrad
in ML optimization, some features are very sparse –> the average gradient for sparse features is usually small, so such features get trained at a much slower rate
adagrad: the more you have already updated a feature, the less you will update it in the future –> sparse features can catch up
–> escapes saddle points faster
what are some gradient-squared-based methods?
adagrad, rmsprop, adam
gradient squared based methods vs vanilla GD?
vanilla: slides down the steepest slope first, then maybe worries about the slower directions later
gradient-squared-based methods: slow down the fast-learning features so that sparse features can catch up –> all features get updated at a similar rate at each step
drawbacks of adagrad?
training becomes very slow because the sum of gradient squared only grows and never shrinks, so the effective step size keeps shrinking
RMSProp (Root Mean Square Propagation) algo?
sum_of_gradient_squared = prev_sum_of_gradient_squared×decay_rate + gradient^2×(1-decay_rate)
delta = -lr×gradient/sqrt(sum_of_gradient_squared)
theta += delta
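toy rmsprop on the same loss (illustrative; eps again avoids division by zero):

theta = 0.0
lr = 0.1
decay_rate = 0.9
eps = 1e-8
sum_of_gradient_squared = 0.0
for step in range(100):
    gradient = 2 * (theta - 3)
    sum_of_gradient_squared = sum_of_gradient_squared * decay_rate + gradient ** 2 * (1 - decay_rate)
    delta = -lr * gradient / (sum_of_gradient_squared ** 0.5 + eps)    # step size no longer decays to zero
    theta += delta
print(theta)   # moves toward 3.0 quickly, then hovers near it because the fixed lr keeps the last steps sizeable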
how does rmsprop tackle the adagrad issue? (the idea of the decay rate)
it introduces a decay rate –> the sum of gradient squared is actually a decayed sum of gradient squared
the decay rate says that only recent gradients matter; the ones from long ago are gradually forgotten
in addition to decaying the previous sum of gradient squared, the incoming gradient^2 is scaled down by (1 - decay_rate) –> the running sum stays smaller, so the step stays larger –> eventually faster than adagrad
adam (adaptive moment estimation) algo?
take the best of momentum and rmsprop
Momentum: sum_of_grad=prev_sum_of_grad×beta1+grad×(1-beta1)
RMSprop: sum_of_grad_squared = prev_sum_of_grad_squared×beta2+grad^2×(1-beta2)
delta = -lr×sum_of_grad/sqrt(sum_of_grad_squared)
theta += delta
beta1: decay rate for the first moment, sum of grad, commonly set at 0.9
beta2: decay rate for the 2nd moment, sum of grad squared, commonly set at 0.999
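toy adam on the same loss (illustrative; note the bias-correction of both moments, which standard Adam includes but the card above omits):

theta = 0.0
lr = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
sum_of_grad = 0.0           # 1st moment (momentum part)
sum_of_grad_squared = 0.0   # 2nd moment (rmsprop part)
for step in range(1, 201):
    gradient = 2 * (theta - 3)
    sum_of_grad = sum_of_grad * beta1 + gradient * (1 - beta1)
    sum_of_grad_squared = sum_of_grad_squared * beta2 + gradient ** 2 * (1 - beta2)
    m_hat = sum_of_grad / (1 - beta1 ** step)             # bias correction for the 1st moment
    v_hat = sum_of_grad_squared / (1 - beta2 ** step)     # bias correction for the 2nd moment
    delta = -lr * m_hat / (v_hat ** 0.5 + eps)
    theta += delta
print(theta)   # approaches 3.0 and hovers close to it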
adam advantages?
adam gets the speed from momentum and the ability to adapt gradients in different directions from rmsprop –> powerful
why is mini-batch maybe faster than batch?
although there are more steps in 1 epoch,
typically, especially early in training, the parameter gradients for different subsets of the data point in roughly the same direction –> a gradient evaluated on just a subset roughly matches the direction of the full-set gradient but requires much less computation
convergence on a highly nonlinear deep network typically requires thousands/millions of iterations no matter how good the gradients are –> it makes sense to do many updates based on cheap gradient estimates rather than a few updates based on exact ones
why does the gradient point in the direction of steepest increase?
the gradient is the vector of partial derivatives of the loss function with respect to the weights and biases; it gives the direction of steepest ascent at that point (quick derivation below)
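quick derivation (standard multivariable calculus): for a unit vector v, the directional derivative of the loss L is D_v L = ∇L · v = ||∇L|| cos(θ), where θ is the angle between v and ∇L –> this is largest when θ = 0, i.e. when v points along ∇L, so +∇L is the direction of steepest increase and -∇L is the direction of steepest decrease.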