optimization algos Flashcards
what are batch and mini-batch gradient descent?
if the training set has too many samples –> split it into mini-batches
batch: process the entire training set in every step
mini-batch: 1 smaller batch at a time. in each training step we process only 1 mini-batch, update the weights, then move on to the next batch (see the sketch below)
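a minimal runnable sketch of that loop in Python/numpy (my own illustration, not from the card; compute_gradient is a hypothetical helper that returns the gradient for one batch):

import numpy as np

# one epoch = shuffle the data, then walk through it one mini-batch at a time
def minibatch_gd(X, y, theta, compute_gradient, lr=0.01, batch_size=64, epochs=10):
    m = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(m)            # shuffle sample indices each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]   # indices of one mini-batch
            grad = compute_gradient(X[idx], y[idx], theta)   # gradient on this mini-batch only
            theta = theta - lr * grad              # update weights, then move on to the next batch
    return theta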
is mini-batch faster?
when the training set is large, mini-batch runs much faster
difference in cost during training between batch and mini-batch?
batch: the cost only goes down
mini-batch: the cost fluctuates but with a downward trend
how do different batch sizes behave?
batch gradient descent: size = m –> smooth descent but each step takes long
stochastic GD: size = 1 –> each step is cheap but extremely noisy (a smaller alpha helps), and you lose the speedup from vectorization
mini-batch: some size in between –> fastest learning: you keep the vectorization speedup and still make progress after every mini-batch (usually works better than the other 2)
how to choose minibatch size?
small training set: just use batch GD
typical sizes: 64, 128, 256, 512 (most common); 1024 also shows up
make sure a mini-batch fits in CPU/GPU memory
goal of gradient descent
minimize the loss function of an ML problem
idea of basic GD algo
the opposite direction of the gradient points toward lower values of the loss –> iteratively take steps in the opposite direction of the gradient
vanilla GD algo
delta = -learning_rate * gradient
theta += delta
theta is the param we want to optimize
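a runnable toy example (my own, not from the card): vanilla GD minimizing L(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3):

theta = 0.0
learning_rate = 0.1
for step in range(100):
    gradient = 2 * (theta - 3)            # dL/dtheta at the current theta
    delta = -learning_rate * gradient     # step opposite to the gradient
    theta += delta
print(theta)                              # ends up very close to the minimum at 3.0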
momentum algo?
in each step, in addition to the current gradient, it also adds on the (decayed) movement from the previous step
sum_of_gradient = gradient + prev_sum_of_gradient×decay_rate
delta = -lr×sum_of_gradient
theta += delta
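same toy loss L(theta) = (theta - 3)^2 with momentum added (illustrative sketch, not from the card):

theta = 0.0
lr = 0.1
decay_rate = 0.9
sum_of_gradient = 0.0
for step in range(200):
    gradient = 2 * (theta - 3)
    sum_of_gradient = gradient + sum_of_gradient * decay_rate   # add the decayed movement from the previous step
    delta = -lr * sum_of_gradient
    theta += delta
print(theta)   # overshoots and oscillates a bit, but settles near 3.0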
what if decay rate = 0? = 1? what are reasonable decay rates?
decay rate = 0 –> vanilla GD (no momentum)
decay rate = 1 –> cost goes back and forth endlessly (given reasonably small lr)
typically 0.8 to 0.9 (like a surface with a little bit of friction, so it eventually slows down and stops)
how is momentum better than vanilla GD?
simply moves faster (all the momentum it accumulates)
has a shot at escaping local minima (momentum may propel it out of a local minimum)
AdaGrad (adaptive gradient) algo
adagrad keeps track of the sum of gradient squared
sum_of_gradient_squared = prev_sum_of_gradient_squared + gradient^2
delta = -lr*gradient/sqrt(sum_of_gradient_squared)
theta += delta
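toy adagrad on the same loss (illustrative; the small eps added to avoid division by zero is standard practice but not on the card):

theta = 0.0
lr = 1.0
eps = 1e-8
sum_of_gradient_squared = 0.0
for step in range(100):
    gradient = 2 * (theta - 3)
    sum_of_gradient_squared += gradient ** 2                           # only ever grows
    delta = -lr * gradient / (sum_of_gradient_squared ** 0.5 + eps)    # effective step shrinks as the sum grows
    theta += delta
print(theta)   # creeps toward 3.0, more and more slowly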
idea of AdaGrad
in ML optimization, some features are very sparse –> the average gradient for sparse features is usually small, so such features get trained at a much slower rate
adagrad: the more you have already updated a feature, the less you will update it in the future –> sparse features can catch up
–> escapes saddle points faster
what are some gradient-squared-based methods?
adagrad, rmsprop, adam
gradient squared based methods vs vanilla GD?
vanilla: slides down the steepest slope first, then maybe worries about the slower directions later
gradient-squared-based methods: slow down the fast-learning features so that sparse features can catch up –> all features get updated at a similar rate at each step
drawbacks of adagrad?
training becomes very slow because the sum of gradient squared only grows and never shrinks, so the effective step size keeps shrinking
RMSProp (Root Mean Square Propagation) algo?
sum_of_gradient_squared = prev_sum_of_gradient_squared×decay_rate + gradient^2×(1-decay_rate)
delta = -lr×gradient/sqrt(sum_of_gradient_squared)
theta += delta
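toy rmsprop on the same loss (illustrative; eps again avoids division by zero):

theta = 0.0
lr = 0.1
decay_rate = 0.9
eps = 1e-8
sum_of_gradient_squared = 0.0
for step in range(100):
    gradient = 2 * (theta - 3)
    sum_of_gradient_squared = sum_of_gradient_squared * decay_rate + gradient ** 2 * (1 - decay_rate)
    delta = -lr * gradient / (sum_of_gradient_squared ** 0.5 + eps)    # step size no longer decays to zero
    theta += delta
print(theta)   # moves toward 3.0 quickly, then hovers near it because the fixed lr keeps the last steps sizeable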
how does rmsprop tackle the adagrad issue? (the idea of the decay rate)
it introduces a decay rate –> the sum of gradient squared is actually a decayed sum of gradient squared
the decay rate says that only recent gradients matter; the ones from long ago are gradually forgotten
in addition to decaying the previous sum of gradient squared, the incoming gradient^2 is scaled down by (1 - decay_rate) –> the running sum stays smaller, so the step stays larger –> eventually faster than adagrad
adam (adaptive moment estimation) algo?
take the best of momentum and rmsprop
Momentum: sum_of_grad=prev_sum_of_grad×beta1+grad×(1-beta1)
RMSprop: sum_of_grad_squared = prev_sum_of_grad_squared×beta2+grad^2×(1-beta2)
delta = -lr×sum_of_grad/sqrt(sum_of_grad_squared)
theta += delta
beta1: decay rate for the first moment, sum of grad, commonly set at 0.9
beta2: decay rate for the 2nd moment, sum of grad squared, commonly set at 0.999
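toy adam on the same loss (illustrative; note the bias-correction of both moments, which standard Adam includes but the card above omits):

theta = 0.0
lr = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
sum_of_grad = 0.0           # 1st moment (momentum part)
sum_of_grad_squared = 0.0   # 2nd moment (rmsprop part)
for step in range(1, 201):
    gradient = 2 * (theta - 3)
    sum_of_grad = sum_of_grad * beta1 + gradient * (1 - beta1)
    sum_of_grad_squared = sum_of_grad_squared * beta2 + gradient ** 2 * (1 - beta2)
    m_hat = sum_of_grad / (1 - beta1 ** step)             # bias correction for the 1st moment
    v_hat = sum_of_grad_squared / (1 - beta2 ** step)     # bias correction for the 2nd moment
    delta = -lr * m_hat / (v_hat ** 0.5 + eps)
    theta += delta
print(theta)   # approaches 3.0 and hovers close to it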
adam advantages?
adam gets the speed from momentum and the ability to adapt gradients in different directions from rmsprop –> powerful
why is mini-batch maybe faster than batch?
although there are more steps in 1 epoch,
typically, especially early in training, the parameter gradients for different subsets of the data point in roughly the same direction –> a gradient evaluated on just a subset roughly matches the direction of the full-set gradient but requires much less computation
convergence on a highly nonlinear deep network typically requires thousands/millions of iterations no matter how good the gradients are –> it makes sense to do many updates based on cheap gradient estimates rather than a few updates based on exact ones
why does the gradient point in the direction of steepest increase?
the gradient is the vector of partial derivatives of the loss function with respect to the weights and biases; it gives the direction of steepest ascent at that point (quick derivation below)
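quick derivation (standard multivariable calculus): for a unit vector v, the directional derivative of the loss L is D_v L = ∇L · v = ||∇L|| cos(θ), where θ is the angle between v and ∇L –> this is largest when θ = 0, i.e. when v points along ∇L, so +∇L is the direction of steepest increase and -∇L is the direction of steepest decrease.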