2 Flashcards
What are three problematic scenarios for Gradient Descent optimization?
Local optimum
Plateau
Saddle point
What happens in the Local Optimum of Gradient descent?
The optimizer gets stuck in a suboptimal solution and cannot escape it, so the loss stops improving even though a better (global) optimum exists
What happens in the Plateau of Gradient descent?
The loss surface is nearly flat, so the gradient is very small and the loss barely decreases; training slows down or appears to stall
What happens in the Saddle Point of Gradient descent?
A point on the loss surface where the gradient is zero but that is neither a local minimum nor a local maximum: the surface slopes up in some directions and down in others, resembling a “saddle” shape
How can we overcome Gradient Descent optimization’s scenarios issues?
By adding momentum and by adapting the learning rate, e.g. with optimizers such as RMSProp, AdaGrad, or Adam
Define momentum
A technique that accelerates gradient-based optimization by incorporating information from previous steps to smooth out updates and improve convergence
How does momentum work?
Momentum helps the optimizer maintain its direction and speed by using a moving average of past gradients. This means that rather than only using the current gradient to update the model parameters, momentum adds a fraction of the previous update (velocity) to the current step
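A minimal sketch of the momentum update described above; grad is assumed to be the already-computed gradient, and beta (e.g. 0.9) is the momentum coefficient:

```python
def sgd_momentum_step(params, grad, velocity, lr=0.01, beta=0.9):
    """One momentum update: blend the previous velocity with the current gradient."""
    velocity = beta * velocity - lr * grad  # carry over a fraction of the last update
    params = params + velocity              # move along the smoothed direction
    return params, velocity
```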
What are the benefits of momentum?
- Faster convergence: Momentum helps accelerate the optimizer in directions with consistent gradients, allowing it to converge more quickly.
- Smoothing out updates: It reduces oscillations, especially in areas with noisy or rapidly changing gradients, leading to more stable updates.
- Escaping local minima and saddle points: Momentum helps the optimizer build speed and continue moving, which can help escape shallow local minima and saddle points where the gradient is close to zero.
What are the downsides of momentum?
- Overshooting: with a high momentum coefficient, the optimizer can overshoot the minimum
- Difficult to tune: finding the right momentum coefficient takes experimentation
- Sensitive to the learning rate
Name optimizers (optimization algorithms)
Gradient Descent
RMSProp
AdaGrad
Adam
Stochastic Gradient Descent
How does RMSProp (optimization algorithm) work?
RMSProp improves on plain gradient descent by adapting the learning rate of each parameter: it keeps an exponential moving average of the squared gradients and divides each update by its square root. This damps the effect of very large or very small gradients, making it effective with noisy gradients and improving convergence in deep networks
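A minimal sketch of one RMSProp step under the description above; lr, beta, and eps are illustrative hyperparameters (common defaults shown):

```python
import numpy as np

def rmsprop_step(params, grad, sq_avg, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: divide by the root of a moving average of squared gradients."""
    sq_avg = beta * sq_avg + (1 - beta) * grad**2  # exponential moving average of grad^2
    params = params - lr * grad / (np.sqrt(sq_avg) + eps)
    return params, sq_avg
```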
How does AdaGrad (optimization algorithm) work?
AdaGrad adapts the learning rate for each parameter based on the sum of all its past squared gradients: infrequently updated parameters get larger effective learning rates, while frequently updated parameters get smaller ones
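A minimal sketch of one AdaGrad step; grad_sq_sum is the running sum of squared gradients that the card refers to:

```python
import numpy as np

def adagrad_step(params, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """One AdaGrad update: accumulate all past squared gradients (the sum never decays)."""
    grad_sq_sum = grad_sq_sum + grad**2  # running sum grows monotonically
    params = params - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum
```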
What are the differences between RMSProp and AdaGrad?
Gradient accumulation: AdaGrad sums all past squared gradients, while RMSProp uses an exponential moving average
Learning rate decay: AdaGrad's effective learning rate decreases rapidly (and can shrink towards zero), while RMSProp's decays more smoothly
Adaptation: AdaGrad works well for sparse data; RMSProp is better for non-stationary problems
How does Adam (optimization algorithm) work?
Adam combines RMSProp with momentum: it adapts the learning rate for each parameter using estimates of both the first moment of the gradients (momentum) and the second moment (used for normalization)
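A minimal sketch of one Adam step combining the two ideas above; beta1 and beta2 control the first- and second-moment averages, and t is the step counter (starting at 1) used for bias correction. The values shown are the usual defaults:

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment) plus RMSProp-style normalization (second moment)."""
    m = beta1 * m + (1 - beta1) * grad     # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2  # moving average of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```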
What are the advantages of Adam (optimization algorithm)?
Fast Convergence: Due to its adaptive learning rates and momentum, Adam typically converges faster than other optimizers
Handles Sparse Gradients: Adam works well with sparse gradients because it adapts the learning rate based on past gradients
How does Stochastic Gradient Descent (optimization algorithm) work?
Sample a minibatch from the dataset
Compute the gradient of the loss function on that minibatch
Take a small step in the direction that decreases the loss (the negative gradient), as sketched below
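A minimal sketch of the loop above; dataset, sample_minibatch, and loss_grad are hypothetical placeholders standing in for your data pipeline and model:

```python
def sgd(params, dataset, loss_grad, sample_minibatch, lr=0.01, steps=1000, batch_size=32):
    """Plain minibatch SGD: sample a batch, compute its gradient, step against it."""
    for _ in range(steps):
        batch = sample_minibatch(dataset, batch_size)  # 1. sample a minibatch
        grad = loss_grad(params, batch)                # 2. gradient of the loss on that batch
        params = params - lr * grad                    # 3. small step in the descent direction
    return params
```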
Name activation functions
Sigmoid
Hyperbolic tangent (tanh)
ReLU
Leaky ReLU
Softmax
What are advantages and disadvantages of Sigmoid (activation function)?
Advantages: outputs lie between 0 and 1, which makes it well suited to binary classification
Disadvantages: computationally more expensive (uses exp), not zero-centered, gradient saturation
What are advantages and disadvantages of the Hyperbolic tangent, tanh (activation function)?
Advantages: stronger gradients than Sigmoid, outputs between -1 and 1, zero-centered
Disadvantages: computationally more expensive (uses exp), gradient saturation
What are advantages and disadvantages of ReLU (activation function)?
Advantages: quick to compute; no gradient saturation for inputs > 0
Disadvantages: for inputs < 0 the gradient is zero (“dying ReLU”), not zero-centered
What are advantages and disadvantages of Leaky ReLU (activation function)?
Advantages: no dying gradients (a small slope for negative inputs keeps the gradient nonzero)
Disadvantages: not consistently better than other activation functions
What are advantages and disadvantages of Softmax (activation function)?
Advantages: great for multi-class classification, normalized outputs (they sum to 1), interpretable as probabilities
Disadvantages: computationally more expensive (uses exp), non-sparse outputs, can suffer from vanishing gradients
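A minimal NumPy sketch of the five activation functions named above; the max subtraction in softmax is only for numerical stability:

```python
import numpy as np

def sigmoid(x):                  # squashes to (0, 1); saturates for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # squashes to (-1, 1); zero-centered
    return np.tanh(x)

def relu(x):                     # zero for negative inputs, identity otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # small slope alpha keeps negative-input gradients alive
    return np.where(x > 0, x, alpha * x)

def softmax(x):                  # normalized outputs that sum to 1
    e = np.exp(x - np.max(x))    # subtract max for numerical stability
    return e / e.sum()
```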
How does Batch Normalization work?
It computes the mean and variance of each feature over the mini-batch, normalizes the inputs with them, and then applies learnable scale and shift parameters
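A minimal sketch of batch normalization at training time, assuming a 2D input of shape (batch, features); gamma and beta are the learnable scale and shift parameters:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply learnable scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # scale and shift
```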