optimization algos Flashcards
what are batch and minibatch gradient descent?
if the training set has too many samples to process at once –> split it into minibatches
batch: process the entire training set in one step
minibatch: 1 smaller batch at a time. in each training step we process only 1 minibatch, update the weights, then move on to the next batch (see the sketch below)
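a minimal NumPy sketch of one epoch of minibatch GD (the data X, y, the linear model, and the squared-error gradient are illustrative assumptions, not from the cards):

import numpy as np

def minibatch_gd_epoch(X, y, theta, lr=0.01, batch_size=64):
    # one pass over the training set, one minibatch at a time
    m = X.shape[0]
    perm = np.random.permutation(m)           # shuffle samples once per epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]  # pick 1 minibatch
        X_b, y_b = X[idx], y[idx]
        # gradient of mean squared error for an (assumed) linear model X_b @ theta
        grad = 2 * X_b.T @ (X_b @ theta - y_b) / len(idx)
        theta -= lr * grad                    # update weights, then move to the next batch
    return theta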
is minibatch faster?
when the training set is large, minibatch runs much faster than batch GD
difference in cost during training between batch and minibatch?
batch: the cost only goes down each iteration
minibatch: the cost fluctuates, but with an overall downward trend
different batch sizes?
batch gradient descent: size = m –> smooth progress, but each step takes a long time
stochastic GD: size = 1 –> each step is cheap but can be extremely noisy (a smaller alpha helps), and you lose the speedup from vectorization
minibatch: some size in between –> fastest learning: you keep the vectorization speedup and still make progress on every step (usually works better than the other 2; see the sketch below)
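all three are the same loop with a different batch size (reusing the hypothetical minibatch_gd_epoch sketched above):

theta = minibatch_gd_epoch(X, y, theta, batch_size=len(X))  # batch GD: size = m
theta = minibatch_gd_epoch(X, y, theta, batch_size=1)       # stochastic GD: size = 1
theta = minibatch_gd_epoch(X, y, theta, batch_size=128)     # minibatch GD: somewhere in between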
how to choose minibatch size?
small training set: just use batch GD
typical sizes: 64, 128, 256, 512 (common), sometimes 1024
make sure the minibatch fits in GPU or CPU memory
goal of gradient descent
minimize the loss function of an ML problem
idea of basic GD algo
the gradient points uphill, so the opposite direction of the gradient points toward lower ground –> iteratively take steps in the opposite direction of the gradient
vanilla GD algo
delta = -learning_rate * gradient
theta += delta
theta is the parameter we want to optimize
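a concrete (hedged) example of the vanilla update, minimizing the toy quadratic f(theta) = ||theta||^2, whose gradient is 2*theta (the objective is made up for illustration):

import numpy as np

theta = np.array([3.0, -4.0])         # parameter we want to optimize
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * theta              # gradient of the toy loss at theta
    delta = -learning_rate * gradient
    theta += delta
print(theta)                          # ends up close to the minimizer [0, 0]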
momentum algo?
in each step, in addition to the regular gradient, it also adds on the movement accumulated from previous steps
sum_of_gradient = gradient + prev_sum_of_gradient * decay_rate
delta = -lr * sum_of_gradient
theta += delta
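the same toy quadratic with momentum added (the 0.9 decay rate and the objective are illustrative assumptions):

import numpy as np

theta = np.array([3.0, -4.0])
lr, decay_rate = 0.1, 0.9
sum_of_gradient = np.zeros_like(theta)
for _ in range(100):
    gradient = 2 * theta                                        # gradient of the toy loss
    sum_of_gradient = gradient + sum_of_gradient * decay_rate   # add the previous movement
    delta = -lr * sum_of_gradient
    theta += delta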
what if decay rate = 0? = 1? what are reasonable decay rates?
decay rate = 0 –> vanilla GD (no momentum)
decay rate = 1 –> the accumulated movement never decays, so the cost just goes back and forth endlessly (given a reasonably small lr)
typically 0.8–0.9 (like a surface with a little bit of friction, so it eventually slows down and stops)
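a quick check that decay_rate = 0 gives the vanilla GD step (the numbers are arbitrary illustrative values):

lr, decay_rate = 0.1, 0.0
gradient, prev_sum_of_gradient = 2.5, 7.0                        # arbitrary values
sum_of_gradient = gradient + prev_sum_of_gradient * decay_rate   # = gradient
assert -lr * sum_of_gradient == -lr * gradient                   # identical to the vanilla GD delta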
how is momentum better than vanilla GD?
simply moves faster (thanks to all the momentum it accumulates)
has a shot at escaping local minima (momentum may propel it out of a local minimum)
AdaGrad (adaptive gradient) algo
AdaGrad keeps a running sum of the squared gradients
sum_of_gradient_squared = prev_sum_of_gradient_squared + gradient^2
delta = -lr*gradient/sqrt(sum_of_gradient_squared)
theta += delta
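a hedged NumPy sketch of AdaGrad on the same toy quadratic; the tiny eps in the denominator is an assumed numerical-stability term to avoid dividing by zero:

import numpy as np

theta = np.array([3.0, -4.0])
lr, eps = 1.0, 1e-8
sum_of_gradient_squared = np.zeros_like(theta)
for _ in range(100):
    gradient = 2 * theta                        # gradient of the toy loss
    sum_of_gradient_squared += gradient ** 2    # accumulate squared gradients
    delta = -lr * gradient / (np.sqrt(sum_of_gradient_squared) + eps)
    theta += delta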
idea of AdaGrad
in ML optimization, some features are very sparse –> the average gradient for sparse features is usually small, so those features get trained at a much slower rate
AdaGrad: the more a feature has already been updated, the less it will be updated in the future –> sparse features can catch up
–> it also escapes saddle points faster
what are some gradient-squared-based methods?
AdaGrad, RMSProp, Adam
gradient-squared-based methods vs vanilla GD?
vanilla: slides down the steepest slope first, then maybe worries about the slower directions later
gradient-squared-based methods: slow down the fast-learning features so that sparse features can catch up –> all features get updated by a similar amount at each step (see the RMSProp sketch below)
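RMSProp (named above) is one of these methods; a hedged sketch where the sum of squared gradients is replaced by an exponential moving average (the 0.9 decay and the toy quadratic are assumed for illustration):

import numpy as np

theta = np.array([3.0, -4.0])
lr, decay, eps = 0.01, 0.9, 1e-8
avg_sq_grad = np.zeros_like(theta)
for _ in range(500):
    gradient = 2 * theta                                             # gradient of the toy loss
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * gradient ** 2  # moving average of squared gradients
    theta -= lr * gradient / (np.sqrt(avg_sq_grad) + eps)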