Regularization & Optimizers Flashcards

1
Q

Why does L2 regularization shrink weights instead of setting them to zero?

A

L2 regularization penalizes the squared magnitude of weights, shrinking them proportionally without forcing them to zero.
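A small numerical sketch (illustrative values, not from the card) makes this concrete: the penalty λ‖w‖² contributes gradient 2λw, so each update removes a fixed *fraction* of every weight.

```python
import numpy as np

w = np.array([1.0, -0.5, 0.01])
lam, lr = 0.1, 0.1  # illustrative penalty strength and learning rate

# The L2 penalty lam * ||w||^2 adds 2 * lam * w to the gradient, so each
# update shrinks every weight by a fixed fraction -- toward zero, never to zero.
for _ in range(100):
    w = w - lr * (2 * lam * w)  # data-loss gradient omitted to isolate the penalty
```

After 100 steps every weight is 0.98^100 ≈ 0.13 of its original value: much smaller, but none exactly zero.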

2
Q

How does L1 regularization change the gradient of the loss function?

A

L1 regularization adds the term λ * sign(w) to the gradient, pushing weights toward zero.
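Contrast with L2: the λ·sign(w) term has constant magnitude, so small weights can be pushed exactly to zero. A minimal sketch with made-up values:

```python
import numpy as np

w = np.array([1.0, -0.5, 0.01])
lam, lr = 0.1, 0.1  # illustrative values

# The L1 gradient term lam * sign(w) is a constant-size push toward zero;
# weights smaller than one step are clamped to exactly zero (soft thresholding).
for _ in range(30):
    step = lr * lam * np.sign(w)
    w = np.where(np.abs(w) <= np.abs(step), 0.0, w - step)
```

The tiny weight (0.01) hits exactly zero on the first step, while the larger weights are reduced by a constant 0.01 per step.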

3
Q

What is Dropout?

A

Dropout is a regularization technique that randomly sets a fraction of neurons to zero during training, reducing reliance on specific neurons and improving generalization.
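A sketch of "inverted" dropout, the common formulation (drop probability and shapes are made up): survivors are scaled by 1/(1−p) so the expected activation is unchanged and nothing needs rescaling at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                              # drop probability (illustrative)
x = np.ones((4, 8))                  # toy activations

# Zero out a random fraction p of units; scale survivors by 1/(1-p)
# so E[mask * x] == x. At test time dropout is simply disabled.
mask = (rng.random(x.shape) >= p) / (1 - p)
x_train = x * mask                   # training-time activations
x_test = x                           # inference: no dropout
```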

4
Q

What is Data Augmentation?

A

Data Augmentation involves creating variations of training data (e.g., rotations, flips, noise) to improve model generalization and robustness.
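A toy sketch (random array standing in for an image) of two label-preserving variations named on the card:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))                           # toy grayscale image

# Label-preserving variations of one training example:
flipped = img[:, ::-1]                             # horizontal flip
noisy = img + rng.normal(0.0, 0.05, img.shape)     # additive Gaussian noise

augmented = [img, flipped, noisy]                  # 3 examples from 1
```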

5
Q

What is Early Stopping?

A

Early Stopping monitors validation performance during training and stops training when the performance no longer improves, preventing overfitting.
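A patience-based sketch over a made-up validation curve: training stops once the loss fails to improve for `patience` consecutive epochs, and the best epoch is remembered (in practice its weights would be restored).

```python
# Validation losses per epoch (made-up): improves, then starts overfitting.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59]
patience = 2

best, best_epoch, wait = float("inf"), 0, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, best_epoch, wait = loss, epoch, 0   # improved: reset counter
    else:
        wait += 1
        if wait >= patience:                      # stalled for `patience` epochs
            break
```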

6
Q

What is Batch Normalization?

A

Batch Normalization normalizes layer inputs within a mini-batch, stabilizing training and acting as a form of regularization.
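The training-time computation can be sketched as follows (batch shape and the affine parameters' initial values are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, (32, 4))   # mini-batch: 32 samples, 4 features
gamma, beta, eps = 1.0, 0.0, 1e-5   # learnable scale/shift at their init values

# Normalize each feature using the mini-batch mean and variance,
# then apply the learnable affine transform gamma * x_hat + beta.
mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)
y = gamma * x_hat + beta
```

At inference, running averages of `mu` and `var` collected during training are used instead of batch statistics.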

7
Q

What is Weight Decay?

A

Weight Decay shrinks weights toward zero at every update step, penalizing large weights to improve generalization. With plain SGD it is mathematically equivalent to L2 regularization; under adaptive optimizers such as Adam the two differ, which motivates decoupled weight decay (AdamW).
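A one-step sketch (made-up numbers) of why the two names overlap: for plain SGD, decaying the weights directly and adding an L2 term to the gradient produce the identical update.

```python
import numpy as np

w = np.array([1.0, -2.0])
grad = np.array([0.1, 0.1])   # data-loss gradient (made up)
lr, wd = 0.1, 0.01

# "Weight decay": shrink the weights directly at each step.
w_decay = w - lr * grad - lr * wd * w

# "L2 regularization": add wd * w to the gradient. Identical for plain SGD.
w_l2 = w - lr * (grad + wd * w)
```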

8
Q

What is Noise Injection?

A

Noise Injection adds noise to inputs, weights, or activations during training to make the model more robust and prevent overfitting.
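A sketch of the three injection points the card lists, on a toy linear layer (all shapes and noise scales are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.ones((16, 3))                           # toy inputs
w = np.full((3, 2), 0.5)                       # toy weights

# Noise can be injected at different points during training:
x_noisy = x + rng.normal(0, 0.1, x.shape)      # input noise (a form of augmentation)
w_noisy = w + rng.normal(0, 0.01, w.shape)     # weight noise
h = x_noisy @ w_noisy
h_noisy = h + rng.normal(0, 0.1, h.shape)      # activation noise
```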

9
Q

What is SGD (Stochastic Gradient Descent)?

A

SGD updates model weights based on a small random subset of data (a mini-batch), improving training efficiency and convergence.
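A minimal sketch on a made-up linear-regression problem: each update uses only a random mini-batch of the data, not the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                   # noiseless linear targets

w = np.zeros(3)
lr, batch_size = 0.1, 16

# Mini-batch SGD on squared error: gradient comes from a random subset.
for epoch in range(20):
    idx = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
```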

10
Q

What is Momentum in optimization?

A

Momentum is an extension of SGD that accelerates training by using a fraction of the previous weight update in the current update.
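The update rule can be sketched on a toy quadratic (learning rate and momentum coefficient are illustrative):

```python
import numpy as np

w = np.array([5.0])
v = np.zeros_like(w)
lr, beta = 0.1, 0.9          # beta = fraction of the previous update retained

def grad(w):
    return 2 * w             # gradient of f(w) = w^2

# The velocity v accumulates past gradients, smoothing and accelerating descent.
for _ in range(100):
    v = beta * v + grad(w)
    w = w - lr * v
```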

11
Q

What is RMSprop?

A

RMSprop adapts the learning rate for each parameter by dividing the gradient by the root of a moving average of recent squared gradients, damping updates for parameters whose gradients have recently been large and boosting those whose gradients have been small.
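A sketch on a toy quadratic with one large and one tiny parameter (hyperparameters are the commonly cited defaults, used here illustratively):

```python
import numpy as np

w = np.array([5.0, 0.05])     # one large, one tiny parameter
s = np.zeros_like(w)
lr, rho, eps = 0.01, 0.9, 1e-8

def grad(w):
    return 2 * w              # gradient of f(w) = ||w||^2

# Dividing by sqrt(s) roughly equalizes step sizes across parameters:
# consistently large gradients are damped, small ones are boosted.
for _ in range(200):
    g = grad(w)
    s = rho * s + (1 - rho) * g**2        # moving average of squared gradients
    w = w - lr * g / (np.sqrt(s) + eps)
```

Note how the tiny parameter converges almost immediately even though its raw gradient is 100× smaller.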

12
Q

What is Adam Optimizer?

A

Adam combines the advantages of Momentum and RMSprop, using adaptive learning rates and momentum for faster and more stable convergence.
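The combination can be sketched directly (standard default hyperparameters, toy quadratic objective):

```python
import numpy as np

w = np.array([5.0])
m = np.zeros_like(w)          # first moment (momentum term)
v = np.zeros_like(w)          # second moment (RMSprop-style scaling)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

def grad(w):
    return 2 * w              # gradient of f(w) = w^2

for t in range(1, 201):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)   # bias correction: early EMAs are biased toward 0
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
```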

13
Q

How do L1 and L2 regularization differ in their impact on sparsity?

A

L1 regularization creates sparsity by driving some weights to zero, while L2 shrinks weights without eliminating them.

14
Q

When would you prefer L1 regularization over L2 regularization?

A

L1 is preferred when a sparse solution is desired or when implicit feature selection is needed, since it drives the weights of uninformative features exactly to zero.

15
Q

Why is L0 regularization difficult to optimize?

A

L0 regularization is non-differentiable and discontinuous, making gradient-based optimization infeasible.

16
Q

What is the primary difference between SGD and Adam?

A

Adam uses momentum and per-parameter adaptive learning rates, while plain SGD applies a single global learning rate with no momentum (though momentum is a common SGD extension).

17
Q

What are the key benefits of momentum-based optimizers over SGD?

A

Momentum reduces oscillations and speeds up convergence by smoothing updates.

18
Q

When is a learning rate scheduler useful?

A

A learning rate scheduler is useful when training plateaus or oscillates: lowering the learning rate on a schedule lets the model settle into a minimum and stabilizes convergence.
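A step-decay schedule is one simple example (the decay factor and interval here are made-up values):

```python
# Step decay: halve the learning rate every 10 epochs (numbers are illustrative).
base_lr, gamma, step_size = 0.1, 0.5, 10

def lr_at(epoch):
    return base_lr * gamma ** (epoch // step_size)

lrs = [lr_at(e) for e in range(30)]   # 0.1 for epochs 0-9, then 0.05, then 0.025
```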

19
Q

How do Adam and RMSprop differ in adjusting learning rates, and why is Adam often preferred?

A

RMSprop scales each parameter's learning rate by a moving average of squared gradients (the second moment, v_t). Adam builds on this by adding momentum (a first moment, m_t) for gradient smoothing, plus bias correction for both moments, making it more robust across diverse tasks.