2 Flashcards

1
Q

What are three problematic scenarios in Gradient Descent optimization?

A

Local optimum
Plateau
Saddle point

2
Q

What happens at a local optimum in Gradient Descent?

A

The optimizer gets stuck in a semi-good (sub-optimal) solution and cannot escape it, so the loss stops improving even though better solutions exist

3
Q

What happens on a plateau in Gradient Descent?

A

The loss surface is nearly flat, so the gradient is close to zero; updates oscillate around the same values and the loss doesn't decrease much

4
Q

What happens at a saddle point in Gradient Descent?

A

A saddle point is a point on the loss surface where the gradient is zero but which is neither a local minimum nor a local maximum: the surface slopes up in some directions and down in others, resembling a "saddle" shape. Gradient Descent slows to a crawl there because the gradient is (near) zero
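A classic illustrative example is f(x, y) = x^2 - y^2, which has a saddle point at the origin (a small sketch, purely for illustration):

```python
# f(x, y) = x**2 - y**2: the gradient at (0, 0) is zero,
# but f increases along x and decreases along y, so (0, 0) is neither a min nor a max.
def grad_f(x, y):
    return (2 * x, -2 * y)

print(grad_f(0.0, 0.0))   # (0.0, -0.0) -> a critical point
print(0.5**2 - 0.0**2)    # 0.25: f rises when moving along x
print(0.0**2 - 0.5**2)    # -0.25: f falls when moving along y
```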

5
Q

How can we overcome these Gradient Descent optimization issues?

A

By adjusting momentum and learning rates accordingly

6
Q

Define momentum

A

Technique used to accelerate gradient-based optimization by incorporating information from previous steps to smooth out updates and improve convergence

7
Q

How does momentum work?

A

Momentum helps the optimizer maintain its direction and speed by using a moving average of past gradients. This means that rather than only using the current gradient to update the model parameters, momentum adds a fraction of the previous update (velocity) to the current step
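A minimal sketch of a classical momentum update on a toy quadratic loss (the learning rate, momentum coefficient, and loss function here are illustrative assumptions, not prescribed values):

```python
# Classical momentum: keep a velocity that is a decayed sum of past gradients.
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    velocity = beta * velocity - lr * grad   # mix the previous update with the current gradient
    return w + velocity, velocity            # move by the accumulated velocity

# Toy example: minimize f(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v, grad=w)
print(w)  # close to the minimum at 0
```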

8
Q

What are the benefits of momentum?

A
  1. Faster convergence: Momentum helps accelerate the optimizer in directions with consistent gradients, allowing it to converge more quickly.
  2. Smoothing out updates: It reduces oscillations, especially in areas with noisy or rapidly changing gradients, leading to more stable updates.
  3. Escaping local minima and saddle points: Momentum helps the optimizer build speed and continue moving, which can help escape shallow local minima and saddle points where the gradient is close to zero.
9
Q

What are the downsides of momentum?

A
  1. Overshooting: with high momentum the optimizer can overshoot the minimum
  2. Hard to tune: choosing the right momentum coefficient is difficult
  3. Sensitive to the learning rate
10
Q

Name optimizers (optimization algorithms)

A

Gradient Descent
RMSProp
AdaGrad
Adam
Stochastic Gradient Descent

11
Q

How does RMSProp (optimization algorithm) work?

A

RMSProp improves gradient descent by addressing vanishing or exploding gradients during training. It adjusts the learning rate for each parameter based on how that parameter's gradient behaves over time (via an exponential moving average of squared gradients), making it particularly effective for noisy gradients and improving convergence in deep networks
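A minimal sketch of one RMSProp parameter update, assuming NumPy arrays and typical (illustrative) hyperparameter values:

```python
import numpy as np

def rmsprop_step(w, sq_avg, grad, lr=0.001, rho=0.9, eps=1e-8):
    # Exponential moving average of squared gradients (recent gradients count more).
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    # Per-parameter step: large recent gradients shrink the effective learning rate.
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg
```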

12
Q

How does AdaGrad (optimization algorithm) work?

A

AdaGrad adjusts the learning rate for each parameter based on the sum of all its past squared gradients: infrequently updated parameters keep a larger effective learning rate, while frequently updated parameters get smaller and smaller updates
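A minimal sketch of an AdaGrad update, assuming NumPy arrays; the learning rate value is illustrative:

```python
import numpy as np

def adagrad_step(w, grad_sq_sum, grad, lr=0.1, eps=1e-8):
    # Accumulate the sum of ALL past squared gradients (it never decays).
    grad_sq_sum = grad_sq_sum + grad ** 2
    # Rarely-updated parameters keep a larger effective learning rate;
    # frequently-updated parameters see it shrink over time.
    w = w - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum
```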

13
Q

What are the differences between RMSProp and AdaGrad?

A

Gradient accumulation: AdaGrad sums all past squared gradients, while RMSProp uses an exponential moving average
Learning rate decay: AdaGrad's effective learning rate decreases rapidly, RMSProp's decays more smoothly
Adaptation: AdaGrad works well for sparse data, RMSProp is better for non-stationary problems

14
Q

How does Adam (optimization algorithm) work?

A

Adam combines RMSProp with Momentum: it adapts the learning rate for each parameter using both the first moment of the gradients (momentum) and the second moment (squared gradients, used for normalization)
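A minimal sketch of one Adam update (the step counter t starts at 1; the hyperparameter values are the commonly quoted defaults, shown here for illustration):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: momentum-style average
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: RMSProp-style average
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```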

15
Q

What are the advantages of Adam (optimization algorithm)?

A

Fast Convergence: Due to its adaptive learning rates and momentum, Adam typically converges faster than other optimizers
Handles Sparse Gradients: Adam works well with sparse gradients because it adapts the learning rate based on past gradients

16
Q

How does Stochastic Gradient Descent (optimization algorithm) work?

A

Sample a minibatch from the dataset
Compute the gradient of the loss function on that minibatch
Take a small step in the direction that decreases the loss (opposite to the gradient)
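A minimal sketch of one epoch of minibatch SGD on a linear least-squares model (the model, loss, learning rate, and batch size are illustrative assumptions):

```python
import numpy as np

def sgd_epoch(w, X, y, lr=0.01, batch_size=32):
    idx = np.random.permutation(len(X))            # shuffle the dataset
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]      # sample a minibatch
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error of a linear model on this minibatch.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w = w - lr * grad                          # small step against the gradient
    return w
```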

17
Q

Name activation functions

A

Sigmoid
Hyperbolic tangent (tanh)
ReLU
Leaky ReLU
Softmax
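For reference, minimal NumPy definitions of these activation functions (a sketch; shapes and numerical-stability details are simplified):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1), not zero-centered

def tanh(x):
    return np.tanh(x)                      # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)              # zero for negative inputs (gradients die there)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope avoids dying gradients

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e / e.sum()                     # normalized class probabilities
```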

18
Q

What are advantages and disadvantages of Sigmoid (activation function)?

A

Advantages: well suited to binary classification, outputs between 0 and 1
Disadvantages: computationally expensive, not zero-centered, gradient saturation

19
Q

What are advantages and disadvantages of Hyperbolic tangent, tanh (activation function)?

A

Advantages: stronger gradients than Sigmoid, outputs between -1 and 1, zero-centered
Disadvantages: computationally expensive, gradient saturation

20
Q

What are advantages and disadvantages of ReLU (activation function)?

A

Advantages: quick to compute; no gradient saturation for inputs > 0
Disadvantages: gradients die for inputs < 0 (dying ReLU); not zero-centered

21
Q

What are advantages and disadvantages of Leaky ReLU (activation function)?

A

Advantages: no dying gradients (a small slope is kept for negative inputs)
Disadvantages: in practice not consistently better than other activation functions

22
Q

What are advantages and disadvantages of Softmax (activation function)?

A

Advantages: great for multi-class classification, normalized outputs (a probability distribution), interpretable
Disadvantages: computationally expensive, non-sparse outputs, vanishing gradients

23
Q

How does Batch Normalization work?

A

Compute the mean and variance of each feature over the minibatch, normalize the inputs with them, then apply learnable scale and shift parameters
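A minimal sketch of the training-time computation for one batch-normalized layer, assuming a 2-D NumPy input of shape (batch, features); running statistics for inference are omitted:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # 1) per-feature mean over the minibatch
    var = x.var(axis=0)                      #    and per-feature variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # 2) normalize to zero mean, unit variance
    return gamma * x_hat + beta              # 3) learnable scale (gamma) and shift (beta)
```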