Optimization Flashcards
What are the hyperparameters in gradient descent?
Weight initialization method
Number of iterations before the algorithm stops
Learning rate (step size per iteration)
What is batch gradient descent?
Batch gradient descent computes the loss (and gradient) over every example in the training set before making a single update to the model parameters.
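A minimal NumPy sketch of full-batch gradient descent on a least-squares loss. The data, learning rate, and step count are illustrative assumptions, not values from the card; note how each hyperparameter from the first card appears.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                        # full training set
y = X @ np.array([1., -2., 0.5, 3., 0.]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                                       # weight initialization
learning_rate = 0.1                                   # learning rate
num_steps = 100                                       # number of steps before stopping

for _ in range(num_steps):
    # Gradient of the mean squared error over the ENTIRE training set:
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * grad                         # one update per full pass
```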
What is stochastic gradient descent?
It approximates the full-dataset loss (and its gradient) using a minibatch of examples; minibatch sizes of 32, 64, or 128 are common.
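Continuing the sketch above, a minibatch SGD variant of the same loop (reusing X, y, w, rng, learning_rate, and num_steps from the previous snippet); the batch size of 64 is one of the common choices the card mentions.

```python
batch_size = 64
for _ in range(num_steps):
    # Sample a random minibatch instead of using the full training set:
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size      # noisy gradient estimate
    w -= learning_rate * grad
```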
What new hyperparameters arise from SGD?
Batch size and data sampling strategy. Batch size does not matter too much; make it as large as your hardware can fit.
Data sampling: draw examples at random. This also does not matter much, especially for computer vision.
What is a high condition number?
A high condition number (the ratio of the largest to the smallest singular value) usually means the matrix is almost non-invertible; for a loss surface, it means the loss is much steeper in some directions than in others. For SGD this may lead to jitter along the steep directions and slow progress along the shallow ones.
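A small sketch of this effect (all values illustrative): gradient descent on a quadratic f(w) = 0.5 * wᵀAw whose Hessian A has condition number 100. The steep direction forces a small learning rate, so the iterate oscillates (jitters) along the steep axis while crawling along the shallow one.

```python
import numpy as np

A = np.diag([100.0, 1.0])                 # condition number = 100/1 = 100
print(np.linalg.cond(A))                  # -> 100.0

w = np.array([1.0, 1.0])
lr = 0.018                                # must satisfy lr < 2/100 for stability
for step in range(5):
    w = w - lr * (A @ w)                  # gradient of f is A @ w
    print(step, w)                        # oscillates along axis 0, crawls along axis 1
```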
What is a problem with SGD?
It can get stuck at a local minimum or a saddle point, where the gradient is zero. At a saddle point, the loss increases in some directions and decreases in others.
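A toy illustration (my own example, not from the card): f(x, y) = x² − y² has a saddle at the origin, where the gradient (2x, −2y) is zero; f increases along x and decreases along y. Gradient descent started exactly on the x-axis converges to the saddle and never escapes.

```python
def grad(x, y):
    return 2 * x, -2 * y                  # gradient of f(x, y) = x**2 - y**2

x, y = 0.5, 0.0                           # start on the x-axis (zero y-component)
lr = 0.1
for _ in range(50):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
print(x, y)                               # -> (~0.0, 0.0): stuck at the saddle point
```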
What is another problem with SGD?
Gradients are computed from minibatches, so they can be noisy estimates of the true gradient.
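A quick sketch of that noise (data and batch sizes are illustrative assumptions): comparing minibatch gradient estimates against the full-batch gradient shows the error shrinking as the batch grows, roughly like 1/sqrt(batch size).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
y = X @ np.ones(5)
w = np.zeros(5)

def mse_grad(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = mse_grad(X, y, w)                  # exact full-batch gradient
for batch_size in (32, 128, 1024):
    errs = []
    for _ in range(100):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errs.append(np.linalg.norm(mse_grad(X[idx], y[idx], w) - full))
    print(batch_size, np.mean(errs))      # average error drops as batch size grows
```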