Regularization & Optimizers Flashcards
Why does L2 regularization shrink weights instead of setting them to zero?
L2 regularization penalizes the squared magnitude of weights, so its gradient contribution is proportional to the weight itself (2λw). The pull toward zero weakens as a weight shrinks, so weights are scaled down proportionally but rarely reach exactly zero.
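A minimal NumPy sketch of one gradient step with an L2 penalty; the function and argument names (l2_step, grad_loss, lr, lam) are hypothetical:

```python
import numpy as np

def l2_step(w, grad_loss, lr=0.1, lam=0.01):
    # The L2 penalty lam * ||w||^2 contributes 2 * lam * w to the gradient,
    # so each step shrinks w in proportion to its current magnitude.
    return w - lr * (grad_loss + 2 * lam * w)

w = np.array([1.0, -0.5, 0.001])
print(l2_step(w, grad_loss=np.zeros_like(w)))  # weights shrink but stay nonzero
```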
How does L1 regularization change the gradient of the loss function?
L1 regularization adds the term λ * sign(w) to the gradient, a constant-magnitude push that drives weights toward zero regardless of their size.
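The corresponding L1 (sub)gradient step, using the same hypothetical names as the sketch above:

```python
import numpy as np

def l1_step(w, grad_loss, lr=0.1, lam=0.01):
    # The L1 penalty lam * ||w||_1 adds lam * sign(w) to the gradient:
    # a constant-size push toward zero, independent of |w|.
    return w - lr * (grad_loss + lam * np.sign(w))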
What is Dropout?
Dropout is a regularization technique that randomly sets a fraction of neurons to zero during training, reducing reliance on specific neurons and improving generalization.
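A minimal NumPy sketch of inverted dropout on an activation vector (training-time only; scaling by 1/(1-p) keeps the expected activation unchanged):

```python
import numpy as np

def dropout(a, p=0.5, training=True):
    if not training:
        return a  # dropout is disabled at inference time
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)  # inverted-dropout scaling
    return a * mask
```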
What is Data Augmentation?
Data Augmentation involves creating variations of training data (e.g., rotations, flips, noise) to improve model generalization and robustness.
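A minimal sketch of two simple augmentations (random horizontal flip, additive Gaussian noise) on a NumPy image array; real pipelines typically use a library such as torchvision:

```python
import numpy as np

def augment(img, rng=np.random.default_rng()):
    if rng.random() < 0.5:
        img = img[:, ::-1]                          # random horizontal flip
    img = img + rng.normal(0.0, 0.05, img.shape)    # small Gaussian noise
    return np.clip(img, 0.0, 1.0)
```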
What is Early Stopping?
Early Stopping monitors validation performance during training and stops training when the performance no longer improves, preventing overfitting.
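A minimal sketch of early stopping with a patience counter, using a toy hard-coded validation curve in place of real training:

```python
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]  # toy validation curve

best_val, patience, wait, stop_epoch = float("inf"), 3, 0, None
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val:
        best_val, wait = val_loss, 0        # improvement: reset patience counter
    else:
        wait += 1
        if wait >= patience:                # no improvement for `patience` epochs
            stop_epoch = epoch
            break
print(stop_epoch)  # training stops here; keep the best checkpoint seen so far
```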
What is Batch Normalization?
Batch Normalization normalizes layer inputs within a mini-batch, stabilizing training and acting as a form of regularization.
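A minimal NumPy sketch of the training-time batch-norm transform for a mini-batch of features (running statistics for inference are omitted for brevity):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each feature over the mini-batch dimension (axis 0)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learnable scale and shift
```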
What is Weight Decay?
Weight Decay shrinks weights by a small factor at each update, penalizing large weights to improve generalization; with plain SGD it is equivalent to L2 regularization.
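A minimal sketch of a decoupled weight-decay step for plain SGD (function and argument names are hypothetical):

```python
import numpy as np

def sgd_weight_decay_step(w, grad_loss, lr=0.1, wd=0.01):
    # Shrink w directly, independent of the loss gradient;
    # for plain SGD this matches an L2 penalty with lam = wd / 2.
    return w - lr * grad_loss - lr * wd * w
```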
What is Noise Injection?
Noise Injection adds noise to inputs, weights, or activations during training to make the model more robust and prevent overfitting.
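A minimal sketch of input-noise injection during training:

```python
import numpy as np

def noisy_inputs(x, sigma=0.1, training=True, rng=np.random.default_rng()):
    if not training:
        return x  # no noise at inference time
    return x + rng.normal(0.0, sigma, x.shape)  # Gaussian noise on the inputs
```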
What is SGD (Stochastic Gradient Descent)?
SGD updates model weights based on a small random subset of data (a mini-batch), improving training efficiency and convergence.
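A minimal NumPy sketch of mini-batch SGD on a toy linear least-squares problem (all data and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w, lr, batch_size = np.zeros(3), 0.1, 32
for step in range(200):
    idx = rng.choice(len(X), batch_size, replace=False)   # random mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size           # gradient on the batch only
    w -= lr * grad
print(w)  # approaches the true coefficients [1.5, -2.0, 0.5]
```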
What is Momentum in optimization?
Momentum is an extension of SGD that accelerates training by adding a fraction of the previous weight update to the current one, smoothing oscillations and speeding progress along consistent gradient directions.
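A minimal sketch of the momentum update, assuming a velocity buffer v initialized to zeros:

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # keep a fraction beta of the previous update and add the new gradient step
    v = beta * v - lr * grad
    return w + v, v
```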
What is RMSprop?
RMSprop adapts the learning rate for each parameter by dividing the gradient by a moving average of recent squared gradients, so parameters with consistently large gradients take smaller effective steps.
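A minimal sketch of the RMSprop update, keeping a running average s of squared gradients:

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.001, rho=0.9, eps=1e-8):
    # exponential moving average of squared gradients
    s = rho * s + (1 - rho) * grad**2
    # parameters with consistently large gradients get smaller effective steps
    return w - lr * grad / (np.sqrt(s) + eps), s
```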
What is Adam Optimizer?
Adam combines the advantages of Momentum and RMSprop, using adaptive learning rates and momentum for faster and more stable convergence.
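A minimal sketch of the Adam update, combining a momentum-style first moment m with an RMSprop-style second moment v, including bias correction:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2       # second moment (RMSprop-style)
    m_hat = m / (1 - b1**t)               # bias correction, with t starting at 1
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```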
How do L1 and L2 regularization differ in their impact on sparsity?
L1 regularization creates sparsity by driving some weights to zero, while L2 shrinks weights without eliminating them.
When would you prefer L1 regularization over L2 regularization?
L1 is preferred when a sparse solution is desired, e.g., for implicit feature selection or when many features are expected to be irrelevant.
Why is L0 regularization difficult to optimize?
L0 regularization is non-differentiable and discontinuous, making gradient-based optimization infeasible.