Regularization & Optimizers Flashcards

1
Q

Why does L2 regularization shrink weights instead of setting them to zero?

A

L2 regularization penalizes the squared magnitude of weights, shrinking them proportionally without forcing them to zero.
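A small numerical sketch (illustrative values, not from the card) makes this concrete: the penalty λ‖w‖² contributes gradient 2λw, so each update removes a fixed *fraction* of every weight.

```python
import numpy as np

w = np.array([1.0, -0.5, 0.01])
lam, lr = 0.1, 0.1  # illustrative penalty strength and learning rate

# The L2 penalty lam * ||w||^2 adds 2 * lam * w to the gradient, so each
# update shrinks every weight by a fixed fraction -- toward zero, never to zero.
for _ in range(100):
    w = w - lr * (2 * lam * w)  # data-loss gradient omitted to isolate the penalty
```

After 100 steps every weight is 0.98^100 ≈ 0.13 of its original value: much smaller, but none exactly zero.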

2
Q

How does L1 regularization change the gradient of the loss function?

A

L1 regularization adds the term λ * sign(w) to the gradient, pushing weights toward zero.
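Contrast with L2: the λ·sign(w) term has constant magnitude, so small weights can be pushed exactly to zero. A minimal sketch with made-up values:

```python
import numpy as np

w = np.array([1.0, -0.5, 0.01])
lam, lr = 0.1, 0.1  # illustrative values

# The L1 gradient term lam * sign(w) is a constant-size push toward zero;
# weights smaller than one step are clamped to exactly zero (soft thresholding).
for _ in range(30):
    step = lr * lam * np.sign(w)
    w = np.where(np.abs(w) <= np.abs(step), 0.0, w - step)
```

The tiny weight (0.01) hits exactly zero on the first step, while the larger weights are reduced by a constant 0.01 per step.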

3
Q

What is Dropout?

A

Dropout is a regularization technique that randomly sets a fraction of neurons to zero during training, reducing reliance on specific neurons and improving generalization.
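A sketch of "inverted" dropout, the common formulation (drop probability and shapes are made up): survivors are scaled by 1/(1−p) so the expected activation is unchanged and nothing needs rescaling at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                              # drop probability (illustrative)
x = np.ones((4, 8))                  # toy activations

# Zero out a random fraction p of units; scale survivors by 1/(1-p)
# so E[mask * x] == x. At test time dropout is simply disabled.
mask = (rng.random(x.shape) >= p) / (1 - p)
x_train = x * mask                   # training-time activations
x_test = x                           # inference: no dropout
```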

4
Q

What is Data Augmentation?

A

Data Augmentation involves creating variations of training data (e.g., rotations, flips, noise) to improve model generalization and robustness.
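A toy sketch (random array standing in for an image) of two label-preserving variations named on the card:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))                           # toy grayscale image

# Label-preserving variations of one training example:
flipped = img[:, ::-1]                             # horizontal flip
noisy = img + rng.normal(0.0, 0.05, img.shape)     # additive Gaussian noise

augmented = [img, flipped, noisy]                  # 3 examples from 1
```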

5
Q

What is Early Stopping?

A

Early Stopping monitors validation performance during training and stops training when the performance no longer improves, preventing overfitting.
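A patience-based sketch over a made-up validation curve: training stops once the loss fails to improve for `patience` consecutive epochs, and the best epoch is remembered (in practice its weights would be restored).

```python
# Validation losses per epoch (made-up): improves, then starts overfitting.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59]
patience = 2

best, best_epoch, wait = float("inf"), 0, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, best_epoch, wait = loss, epoch, 0   # improved: reset counter
    else:
        wait += 1
        if wait >= patience:                      # stalled for `patience` epochs
            break
```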

6
Q

What is Batch Normalization?

A

Batch Normalization normalizes layer inputs within a mini-batch, stabilizing training and acting as a form of regularization.
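The training-time computation can be sketched as follows (batch shape and the affine parameters' initial values are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, (32, 4))   # mini-batch: 32 samples, 4 features
gamma, beta, eps = 1.0, 0.0, 1e-5   # learnable scale/shift at their init values

# Normalize each feature using the mini-batch mean and variance,
# then apply the learnable affine transform gamma * x_hat + beta.
mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)
y = gamma * x_hat + beta
```

At inference, running averages of `mu` and `var` collected during training are used instead of batch statistics.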

7
Q

What is Weight Decay?

A

Weight Decay shrinks weights toward zero at every update step, penalizing large weights to improve generalization. With plain SGD it is mathematically equivalent to L2 regularization; under adaptive optimizers such as Adam the two differ, which motivates decoupled weight decay (AdamW).
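A one-step sketch (made-up numbers) of why the two names overlap: for plain SGD, decaying the weights directly and adding an L2 term to the gradient produce the identical update.

```python
import numpy as np

w = np.array([1.0, -2.0])
grad = np.array([0.1, 0.1])   # data-loss gradient (made up)
lr, wd = 0.1, 0.01

# "Weight decay": shrink the weights directly at each step.
w_decay = w - lr * grad - lr * wd * w

# "L2 regularization": add wd * w to the gradient. Identical for plain SGD.
w_l2 = w - lr * (grad + wd * w)
```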

8
Q

What is Noise Injection?

A

Noise Injection adds noise to inputs, weights, or activations during training to make the model more robust and prevent overfitting.
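A sketch of the three injection points the card lists, on a toy linear layer (all shapes and noise scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.ones((16, 3))                           # toy inputs
w = np.full((3, 2), 0.5)                       # toy weights

# Noise can be injected at different points during training:
x_noisy = x + rng.normal(0, 0.1, x.shape)      # input noise (a form of augmentation)
w_noisy = w + rng.normal(0, 0.01, w.shape)     # weight noise
h = x_noisy @ w_noisy
h_noisy = h + rng.normal(0, 0.1, h.shape)      # activation noise
```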

9
Q

What is SGD (Stochastic Gradient Descent)?

A

SGD updates model weights based on a small random subset of data (a mini-batch), improving training efficiency and convergence.
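A minimal sketch on a made-up linear-regression problem: each update uses only a random mini-batch of the data, not the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                   # noiseless linear targets

w = np.zeros(3)
lr, batch_size = 0.1, 16

# Mini-batch SGD on squared error: gradient comes from a random subset.
for epoch in range(20):
    idx = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
```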

10
Q

What is Momentum in optimization?

A

Momentum is an extension of SGD that accelerates training by using a fraction of the previous weight update in the current update.
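The update rule can be sketched on a toy quadratic (learning rate and momentum coefficient are illustrative):

```python
import numpy as np

w = np.array([5.0])
v = np.zeros_like(w)
lr, beta = 0.1, 0.9          # beta = fraction of the previous update retained

def grad(w):
    return 2 * w             # gradient of f(w) = w^2

# The velocity v accumulates past gradients, smoothing and accelerating descent.
for _ in range(100):
    v = beta * v + grad(w)
    w = w - lr * v
```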

11
Q

What is RMSprop?

A

RMSprop adapts the learning rate for each parameter by dividing the gradient by the root of a moving average of recent squared gradients, damping updates for parameters whose gradients have recently been large and boosting those whose gradients have been small.
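A sketch on a toy quadratic with one large and one tiny parameter (hyperparameters are the commonly cited defaults, used here illustratively):

```python
import numpy as np

w = np.array([5.0, 0.05])     # one large, one tiny parameter
s = np.zeros_like(w)
lr, rho, eps = 0.01, 0.9, 1e-8

def grad(w):
    return 2 * w              # gradient of f(w) = ||w||^2

# Dividing by sqrt(s) roughly equalizes step sizes across parameters:
# consistently large gradients are damped, small ones are boosted.
for _ in range(200):
    g = grad(w)
    s = rho * s + (1 - rho) * g**2        # moving average of squared gradients
    w = w - lr * g / (np.sqrt(s) + eps)
```

Note how the tiny parameter converges almost immediately even though its raw gradient is 100× smaller.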

12
Q

What is Adam Optimizer?

A

Adam combines the advantages of Momentum and RMSprop, using adaptive learning rates and momentum for faster and more stable convergence.
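The combination can be sketched directly (standard default hyperparameters, toy quadratic objective):

```python
import numpy as np

w = np.array([5.0])
m = np.zeros_like(w)          # first moment (momentum term)
v = np.zeros_like(w)          # second moment (RMSprop-style scaling)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

def grad(w):
    return 2 * w              # gradient of f(w) = w^2

for t in range(1, 201):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)   # bias correction: early EMAs are biased toward 0
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
```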

13
Q

How do L1 and L2 regularization differ in their impact on sparsity?

A

L1 regularization creates sparsity by driving some weights to zero, while L2 shrinks weights without eliminating them.

14
Q

When would you prefer L1 regularization over L2 regularization?

A

L1 is preferred when a sparse solution is desired or when implicit feature selection is needed, since it drives the weights of uninformative features exactly to zero.

15
Q

Why is L0 regularization difficult to optimize?

A

L0 regularization is non-differentiable and discontinuous, making gradient-based optimization infeasible.

16
Q

What is the primary difference between SGD and Adam?

A

Adam uses momentum and per-parameter adaptive learning rates, while plain SGD applies a single global learning rate with no momentum (though momentum is a common SGD extension).

17
Q

What are the key benefits of momentum-based optimizers over SGD?

A

Momentum reduces oscillations and speeds up convergence by smoothing updates.

18
Q

When is a learning rate scheduler useful?

A

A learning rate scheduler is useful when training plateaus or oscillates: lowering the learning rate on a schedule lets the model settle into a minimum and stabilizes convergence.
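A step-decay schedule is one simple example (the decay factor and interval here are made-up values):

```python
# Step decay: halve the learning rate every 10 epochs (numbers are illustrative).
base_lr, gamma, step_size = 0.1, 0.5, 10

def lr_at(epoch):
    return base_lr * gamma ** (epoch // step_size)

lrs = [lr_at(e) for e in range(30)]   # 0.1 for epochs 0-9, then 0.05, then 0.025
```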

19
Q

How do Adam and RMSprop differ in adjusting learning rates, and why is Adam often preferred?

A

RMSprop scales each parameter's learning rate by a moving average of squared gradients (the second moment, v_t). Adam builds on this by adding momentum (a first moment, m_t) for gradient smoothing, plus bias correction for both moments, making it more robust across diverse tasks.