C2W2 Optimization algorithms Flashcards
Batch
Gradient descent that processes the entire training set in a single step on each iteration
Mini-batch
Splitting the training set into smaller batches, typically with a size that is a power of 2 (64, 128, 256, 512, 1024)
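A minimal sketch of how mini-batches could be built with NumPy; the function name random_mini_batches and the (features × examples) shapes are illustrative assumptions, not part of the course material.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle (X, Y) and split them into mini-batches of `batch_size` columns."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of training examples (columns)
    permutation = rng.permutation(m)
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]
    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size        # the last batch may be smaller than batch_size
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches
```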
Exponentially weighted average
Parameter β (beta): β = 0.9 averages over roughly the last 10 values, β = 0.98 over roughly the last 50 values (≈ 1 / (1 − β))
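A minimal sketch of the exponentially weighted average v_t = β·v_{t−1} + (1 − β)·θ_t; the function name is an illustrative assumption.

```python
def exponentially_weighted_average(values, beta=0.9):
    """Return the running exponentially weighted average of `values`."""
    v = 0.0
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        averages.append(v)
    return averages

# beta = 0.9  -> averages over roughly 1 / (1 - 0.9)  = 10 values
# beta = 0.98 -> averages over roughly 1 / (1 - 0.98) = 50 values
```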
Bias correction
Is needed in the exponentially weighted average to correct the estimates at the beginning, which start out too low because v₀ = 0: divide vₜ by (1 − βᵗ)
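A sketch of the same average with bias correction, dividing v_t by (1 − β^t) so the early estimates are not biased toward zero; the function name is assumed for illustration.

```python
def bias_corrected_average(values, beta=0.9):
    """Exponentially weighted average with bias correction v_t / (1 - beta**t)."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # correction matters most for small t
    return corrected
```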
Gradient descent with momentum
An optimization algorithm (input parameter β) that slows down learning in the problematic, oscillating directions so that gradient descent does not bounce back and forth on its way to the minimum.
Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
You have to tune a momentum hyperparameter 𝛽 and a learning rate 𝛼
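A minimal sketch of one momentum update for a single parameter matrix W; the variable names (W, dW, v_dW) follow the course notation, but the function itself is an illustrative assumption.

```python
import numpy as np

def momentum_update(W, dW, v_dW, alpha=0.01, beta=0.9):
    """Apply one gradient-descent-with-momentum step to parameter W."""
    v_dW = beta * v_dW + (1 - beta) * dW    # exponentially weighted average of gradients
    W = W - alpha * v_dW                    # step in the smoothed gradient direction
    return W, v_dW
```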
RMSProp
Root Mean Square Propagation. An optimization algorithm similar in spirit to momentum (input parameter β₂): it keeps an exponentially weighted average of the squared gradients and divides the update by its square root
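A minimal sketch of one RMSProp update; the function name and the epsilon default are assumptions for illustration.

```python
import numpy as np

def rmsprop_update(W, dW, s_dW, alpha=0.001, beta2=0.999, epsilon=1e-8):
    """Apply one RMSProp step to parameter W."""
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)  # average of squared gradients
    W = W - alpha * dW / (np.sqrt(s_dW) + epsilon)     # damp the oscillating directions
    return W, s_dW
```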
ADAM
The Adam optimization algorithm basically takes momentum and RMSProp and puts them together.
Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
Usually works well even with little tuning of hyperparameters (except 𝛼)
Best values for Beta1, Beta2
Beta1 = 0.9, Beta2 = 0.999
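A minimal sketch of one Adam update combining the momentum and RMSProp moving averages with bias correction; the function signature is an illustrative assumption, but the default hyperparameters match the values above (plus the commonly used ε = 1e-8).

```python
import numpy as np

def adam_update(W, dW, v_dW, s_dW, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam step to parameter W at time step t (t starts at 1)."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW             # momentum (first moment)
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)  # RMSProp (second moment)
    v_corr = v_dW / (1 - beta1 ** t)                   # bias correction
    s_corr = s_dW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + epsilon)
    return W, v_dW, s_dW
```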
Learning rate decay
Helps to reduce noise as gradient descent approaches the optimum, by reducing the learning rate as the epoch number increases
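One common schedule is α = α₀ / (1 + decay_rate · epoch_num); a minimal sketch with an assumed function name:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Continuous learning rate decay: alpha shrinks as the epoch number grows."""
    return alpha0 / (1 + decay_rate * epoch_num)
```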
Local optimum
Rarely a problem in high-dimensional spaces (most zero-gradient points there are saddle points rather than local optima), but gradient descent can get stuck on plateaus; optimization algorithms such as momentum, RMSProp, and Adam help to move through them faster.
Fixed interval scheduling
Decaying the learning rate only once every fixed number of epochs (step decay), rather than on every epoch
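A sketch of the same decay applied only every fixed number of epochs; time_interval and the function name are illustrative assumptions.

```python
import math

def scheduled_learning_rate(alpha0, decay_rate, epoch_num, time_interval=1000):
    """Decay the learning rate only once every `time_interval` epochs."""
    return alpha0 / (1 + decay_rate * math.floor(epoch_num / time_interval))
```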
Three important optimization techniques
Apply three different optimization methods to your models
Build mini-batches for your training set
Use learning rate decay scheduling to speed up your training
On which phase do optimization algorithms (Adam, RMSProp, Momentum) work?
They work in the parameter-update step: after backward propagation has computed the gradients, these algorithms modify how the gradients are applied in the "update" routine.