2 Flashcards
What are 3 problematic scenarios for Gradient Descent optimization?
Local optimum
Plateau
Saddle point
What happens at a local optimum in Gradient Descent?
The optimizer gets stuck in a sub-optimal ("semi-good") solution and cannot escape it, so the objective (e.g., the likelihood) stops improving
What happens on a plateau in Gradient Descent?
The loss surface is nearly flat, so the gradient is close to zero and each update barely changes the parameters; the loss decreases very little for many steps
What happens at a saddle point in Gradient Descent?
A point on the surface of the loss function where the gradient is zero, but it is not a minimum or maximum. Instead, it is a point where the slope goes up in some directions and down in others, resembling a “saddle” shape. In other words, a saddle point is a critical point that is neither a local minimum nor a local maximum
How can we overcome these Gradient Descent optimization issues?
By adjusting momentum and the learning rate accordingly (e.g., by using momentum-based and adaptive learning-rate optimizers)
Define momentum
Technique used to accelerate gradient-based optimization by incorporating information from previous steps to smooth out updates and improve convergence
How does momentum work?
Momentum helps the optimizer maintain its direction and speed by using a moving average of past gradients. This means that rather than only using the current gradient to update the model parameters, momentum adds a fraction of the previous update (velocity) to the current step
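A minimal sketch of the momentum update in plain NumPy (the names momentum_step, lr, and beta are illustrative assumptions, not any specific library's API):

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    # Keep a fraction of the previous update (velocity) and add the current gradient.
    velocity = beta * velocity + grads
    # Move along the smoothed direction instead of the raw gradient.
    params = params - lr * velocity
    return params, velocity

# Toy usage on L(w) = w^2, whose gradient is 2w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    w, v = momentum_step(w, 2 * w, v)
print(w)  # close to 0
```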
What are the benefits of momentum?
- Faster convergence: Momentum helps accelerate the optimizer in directions with consistent gradients, allowing it to converge more quickly.
- Smoothing out updates: It reduces oscillations, especially in areas with noisy or rapidly changing gradients, leading to more stable updates.
- Escaping local minima and saddle points: Momentum helps the optimizer build speed and continue moving, which can help escape shallow local minima and saddle points where the gradient is close to zero.
What are the downsides of momentum?
- Overshooting: when the momentum term is high, the optimizer can overshoot the minimum
- Difficult to tune: choosing the right momentum coefficient takes experimentation
- Sensitive to the learning rate
Name optimizers (optimization algorithms)
Gradient Descent
RMSProp
AdaGrad
Adam
Stochastic Gradient Descent
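Plain Gradient Descent and SGD are listed above but not described elsewhere, so here is a minimal sketch of both update rules on a mean-squared-error loss (plain NumPy; the function names and the choice of loss are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def gd_step(w, X, y, lr=0.1):
    # Full-batch Gradient Descent: gradient of the mean squared error over all samples.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sgd_step(w, X, y, lr=0.1):
    # Stochastic Gradient Descent: gradient estimated from one randomly chosen sample.
    i = rng.integers(len(y))
    grad = 2 * (X[i] @ w - y[i]) * X[i]
    return w - lr * grad
```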
How does RMSProp (optimization algorithm) work?
RMSProp improves gradient descent by keeping an exponential moving average of the squared gradients and dividing each update by its square root. This adjusts the learning rate for each parameter based on how its gradient behaves over time, which keeps step sizes reasonable when gradients become very small or very large, making it particularly effective for handling noisy gradients and improving convergence in deep networks
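A minimal sketch of an RMSProp-style update in plain NumPy (the names rmsprop_step, rho, and eps are assumptions, not a library API):

```python
import numpy as np

def rmsprop_step(params, grads, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    # Exponential moving average of squared gradients.
    sq_avg = rho * sq_avg + (1 - rho) * grads**2
    # Scale each parameter's step by the root of that average.
    params = params - lr * grads / (np.sqrt(sq_avg) + eps)
    return params, sq_avg
```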
How does AdaGrad (optimization algorithm) work?
AdaGrad adjusts the learning rate for each parameter based on the sum of all past squared gradients. It gives larger updates to infrequently updated parameters and smaller updates to frequently updated parameters
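A minimal sketch of an AdaGrad-style update in plain NumPy (names are assumptions), showing the ever-growing sum of squared gradients:

```python
import numpy as np

def adagrad_step(params, grads, sq_sum, lr=0.01, eps=1e-8):
    # Accumulate ALL past squared gradients (the sum only grows).
    sq_sum = sq_sum + grads**2
    # Parameters that have seen large gradients get ever-smaller effective learning rates.
    params = params - lr * grads / (np.sqrt(sq_sum) + eps)
    return params, sq_sum
```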
What are the differences between RMSProp and AdaGrad?
Gradient accumulation: AdaGrad sums up all past squared gradients while RMSProp uses exponential moving average
Learning rate decay: AdaGrad - decreases rapidly, RMSProp - decays more smoothly
Adaptation: AdaGrad - works well for sparse data, RMSProp - better for non-stationary problems
How does Adam (optimization algorithm) work?
Adam combines RMSProp with momentum. It adapts the learning rate for each parameter by using both the first moment of the gradients (a momentum-style moving average) and the second moment (a moving average of squared gradients, used to normalize the step size)
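A minimal sketch of an Adam-style update in plain NumPy, combining the momentum and RMSProp sketches above (the hyperparameter names follow the common lr/beta1/beta2/eps convention but are assumptions for this sketch):

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count, used for bias correction of the early estimates.
    m = beta1 * m + (1 - beta1) * grads        # first moment: momentum-style average
    v = beta2 * v + (1 - beta2) * grads**2     # second moment: average of squared gradients
    m_hat = m / (1 - beta1**t)                 # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```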
What are the advantages of Adam (optimization algorithm)?
Fast Convergence: Due to its adaptive learning rates and momentum, Adam typically converges faster than other optimizers
Handles Sparse Gradients: Adam works well with sparse gradients because it adapts the learning rate based on past gradients