DL-02 - Improving DNN Flashcards
DL-02 - Improving DNN
What are 3 commonly used DL approaches to avoid overfitting? (3)
- regularization
- dropout
- early stopping
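A minimal sketch of all three ideas together, assuming PyTorch (the card does not name a framework): L2 regularization via weight_decay, a Dropout layer, and early stopping on a validation loss.

```python
import torch
import torch.nn as nn

# toy data split into train / validation
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
Xtr, ytr, Xva, yva = X[:400], y[:400], X[400:], y[400:]

model = nn.Sequential(
    nn.Linear(20, 32), nn.ReLU(),
    nn.Dropout(0.5),                     # dropout: randomly zeroes activations while training
    nn.Linear(32, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # weight_decay = L2 penalty
loss_fn = nn.CrossEntropyLoss()

best, patience, bad = float("inf"), 5, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(Xtr), ytr).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val = loss_fn(model(Xva), yva).item()
    if val < best - 1e-4:
        best, bad = val, 0
    else:
        bad += 1
        if bad >= patience:              # early stopping: halt once validation stops improving
            break
```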
DL-02 - Improving DNN
What are two common regularization techniques?
- L1
- L2
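A tiny NumPy sketch (illustrative values, not from the card) of the two penalty terms that get added to the training loss for a weight vector w:

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # example weight vector
lam = 0.01                            # regularization strength (hyperparameter)

l1_penalty = lam * np.sum(np.abs(w))  # L1 / Lasso: pushes weights toward exact zeros (sparsity)
l2_penalty = lam * np.sum(w ** 2)     # L2 / Ridge: shrinks all weights toward zero
```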
DL-02 - Improving DNN
What is another name for L1 regularization?
Lasso
DL-02 - Improving DNN
What is another name for Lasso regularization?
L1
DL-02 - Improving DNN
What is another name for L2 regularization?
Ridge
DL-02 - Improving DNN
What is another name for Ridge regularization?
L2
DL-02 - Improving DNN
What are the basic types of learning rate decay? (5)
- Common method
- Exponential
- Epoch number based
- Mini-batch number based
- Discrete staircase
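The formulas on the cards below are only available as images; the sketch here uses the commonly taught forms (an assumption, not taken from the deck), with alpha0 the initial learning rate and k, decay_rate tunable constants:

```python
import numpy as np

alpha0, decay_rate, k = 0.2, 1.0, 0.3   # illustrative values

def common(epoch):          # "common method": alpha0 / (1 + decay_rate * epoch)
    return alpha0 / (1 + decay_rate * epoch)

def exponential(epoch):     # exponential: 0.95 ** epoch * alpha0
    return 0.95 ** epoch * alpha0

def epoch_based(epoch):     # epoch number based: k / sqrt(epoch) * alpha0 (epoch >= 1)
    return k / np.sqrt(epoch) * alpha0

def minibatch_based(t):     # mini-batch number based: k / sqrt(t) * alpha0, t = mini-batch index
    return k / np.sqrt(t) * alpha0

def staircase(epoch):       # discrete staircase: e.g. halve the rate every 10 epochs
    return alpha0 * 0.5 ** (epoch // 10)
```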
DL-02 - Improving DNN
What is the formula for “common method” learning rate decay?
(See image)
DL-02 - Improving DNN
What is the name for this method of learning rate decay? (See image)
“Common method” learning rate decay
DL-02 - Improving DNN
What is the formula for exponential learning rate decay?
(See image)
DL-02 - Improving DNN
What is the name for this method of learning rate decay? (See image)
Exponential learning rate decay
DL-02 - Improving DNN
What is the formula for “epoch number based” learning rate decay?
(See image)
DL-02 - Improving DNN
What is the name for this method of learning rate decay? (See image)
“Epoch number based” learning rate decay
DL-02 - Improving DNN
What is the formula for “mini-batch number based” learning rate decay?
(See image; the formula should refer to the mini-batch number)
DL-02 - Improving DNN
What is the name for this method of learning rate decay? (See image; the formula should refer to the mini-batch number)
mini-batch number based
DL-02 - Improving DNN
What is the name of this method of learning rate decay? (See image)
Discrete staircase.
DL-02 - Improving DNN
What does the learning rate graph look like with a “discrete staircase” approach?
(See image)
DL-02 - Improving DNN
What is the role of momentum in neural network training?
Momentum accumulates past gradients into a velocity term, which damps oscillations and speeds up convergence for a smoother optimization trajectory.
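A hedged NumPy sketch of one classical momentum update (one common formulation; the deck does not give the exact formula):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    v = beta * v + grad   # accumulate gradient history (velocity)
    w = w - lr * v        # step along the smoothed direction
    return w, v
```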
DL-02 - Improving DNN
How does AdaGrad help improve the performance of a neural network?
By adaptively scaling learning rates based on accumulated past gradients for each parameter, leading to faster convergence.
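A hedged NumPy sketch of one AdaGrad update: each parameter's step is scaled by the inverse square root of its accumulated squared gradients.

```python
import numpy as np

def adagrad_step(w, g_acc, grad, lr=0.01, eps=1e-8):
    g_acc = g_acc + grad ** 2                    # accumulated squared gradients (never decays)
    w = w - lr * grad / (np.sqrt(g_acc) + eps)   # per-parameter adaptive step size
    return w, g_acc
```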
DL-02 - Improving DNN
Which optimization algorithm is AdaDelta built on?
AdaGrad
DL-02 - Improving DNN
What’s the difference between AdaDelta and AdaGrad?
AdaDelta addresses AdaGrad’s diminishing learning rate problem by using a moving average of squared gradients instead of accumulating all historical squared gradients.
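A hedged NumPy sketch of one AdaDelta update: both squared gradients and squared updates are tracked as moving averages, so the step size does not shrink toward zero as in AdaGrad.

```python
import numpy as np

def adadelta_step(w, eg2, ed2, grad, rho=0.95, eps=1e-6):
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                   # moving average of squared gradients
    delta = -np.sqrt(ed2 + eps) / np.sqrt(eg2 + eps) * grad   # update scaled by past update magnitudes
    ed2 = rho * ed2 + (1 - rho) * delta ** 2                  # moving average of squared updates
    return w + delta, eg2, ed2
```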
DL-02 - Improving DNN
How does RMSprop work?
RMSprop works by adapting the learning rate for each weight parameter using a running average of the magnitude of recent gradients.
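A hedged NumPy sketch of one RMSprop update: like AdaGrad, but with an exponential moving average of squared gradients instead of a full sum.

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2      # running average of squared gradient magnitudes
    w = w - lr * grad / (np.sqrt(s) + eps)     # per-parameter adaptive step
    return w, s
```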
DL-02 - Improving DNN
How does the Adam optimizer work?
The Adam optimizer adaptively adjusts the learning rate for each parameter using exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment).
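A hedged NumPy sketch of one Adam step: the two moment estimates are kept as exponential moving averages, bias-corrected, and combined into a per-parameter update.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first moment: mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2       # second moment: mean of squared gradients
    m_hat = m / (1 - b1 ** t)               # bias correction (t counts steps from 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```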
DL-02 - Improving DNN
What are some benefits of using ADAM over other optimizers? (AESN)
- Adaptive learning rates
- efficient computation
- suitable for sparse data
- reduced noise in parameter updates
DL-02 - Improving DNN
Which optimizer performs the best on average?
Adam.
DL-02 - Improving DNN
What is a requirement for using batch normalization?
Reasonably large mini-batches; with small batches the per-batch statistics are noisy and training becomes unstable.
DL-02 - Improving DNN
How does batch normalization act as a regularizer?
The per-mini-batch statistics add noise to the activations during training, which has a mild regularizing effect.
DL-02 - Improving DNN
How does batch and layer normalization differ?
- BN normalizes each feature/channel over the mini-batch, e.g. the same statistics for a whole image.
- LN normalizes over the features of each individual sample, independently of the other samples in the batch.
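A hedged NumPy sketch of the difference for activations of shape (batch, features): BN computes statistics per feature over the batch, LN computes them per sample over its own features.

```python
import numpy as np

x = np.random.randn(32, 64)   # (batch, features)
eps = 1e-5

# Batch norm: one mean/variance per feature, shared by every sample in the batch
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer norm: one mean/variance per sample, independent of the rest of the batch
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)
```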
DL-02 - Improving DNN
Is batch normalization stable for small mini-batch sizes?
No, need large batch sizes.
DL-02 - Improving DNN
Is layer normalization stable for small mini-batch sizes?
Yes, it’s not dependent on batch size.
DL-02 - Improving DNN
What are the 3 types of hyperparameter tuning approaches? (3)
- Manual
- Brute force (Grid search, random search etc.)
- Meta model (Machine learning, e.g. Optuna)
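A minimal Optuna sketch of the meta-model approach (the toy objective stands in for “train the network and return its validation loss”; the hyperparameter names are illustrative):

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)   # sampled hyperparameters
    dropout = trial.suggest_float("dropout", 0.0, 0.7)
    # ... train the model with (lr, dropout) here and return its validation loss ...
    return (lr - 0.01) ** 2 + (dropout - 0.3) ** 2          # toy stand-in objective

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```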
DL-02 - Improving DNN
What is HPO short for?
Hyperparameter optimization
DL-02 - Improving DNN
What is a surrogate model in terms of HPO?
A model trained on hyperparameter configurations as inputs, with the resulting model quality (e.g. validation score) as its output.
DL-02 - Improving DNN
What is a requirement for choosing what model to use for a surrogate model?
There is no way to compute a gradient of model quality with respect to the hyperparameters, so the surrogate must work with gradient-free optimization.
DL-02 - Improving DNN
When should you retune hyperparameters?
Occasionally/regularly, especially with a change in the data or problem to solve.
DL-02 - Improving DNN
What is NAG short for?
Nesterov accelerated gradient
DL-02 - Improving DNN
What is the purpose of Nesterov momentum in neural network training?
Nesterov momentum accelerates training by evaluating the gradient at the parameters’ approximate future (look-ahead) position, reducing oscillations and improving convergence.
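A hedged NumPy sketch of one NAG update (one common formulation; grad_fn is an assumed callable that returns the gradient at a given point):

```python
import numpy as np

def nag_step(w, v, grad_fn, lr=0.01, beta=0.9):
    lookahead = w - lr * beta * v        # where the momentum step alone would land
    v = beta * v + grad_fn(lookahead)    # gradient taken at the look-ahead point
    w = w - lr * v
    return w, v
```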
DL-02 - Improving DNN
What is RMSProp short for?
Root Mean Square Propagation