Class Thirteen Flashcards
What are optimizers in the context of deep neural networks?
Optimizers are algorithms used to update the parameters of deep neural networks during the training process. They aim to minimize the loss function and find the optimal set of weights that result in the best model performance.
What is the role of an optimizer in deep learning?
The role of an optimizer is to iteratively adjust the model parameters based on the gradients of the loss function. This helps the model converge towards the global or local minimum of the loss function, improving its performance.
What are some challenges in training deep neural networks with traditional optimizers?
Challenges in training deep neural networks with traditional optimizers include slow convergence, vanishing/exploding gradients, and difficulty in escaping local minima.
How can you improve the speed of training?
- Using good initialization strategy for connection weights
- Using a good activation function
- Using Batch Normalization
- Reusing parts of a pretrained network
- Another way: use a faster optimizer than the regular Gradient Descent optimizer (several of these techniques are combined in the sketch after this list).
- Popular faster optimization algorithms:
  - Momentum optimization
  - Nesterov Accelerated Gradient
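As a concrete illustration, here is a minimal sketch assuming TensorFlow/Keras; the layer sizes, input shape, and hyperparameters are illustrative placeholders, not recommendations.

```python
# A minimal sketch, assuming TensorFlow/Keras, combining several training speed-ups.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    # Good initialization strategy: He initialization pairs well with ReLU-family activations
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    # Batch Normalization stabilizes and speeds up training
    keras.layers.BatchNormalization(),
    # A good (non-saturating) activation function
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax"),
])

# A faster optimizer than plain Gradient Descent: SGD with Momentum optimization
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              metrics=["accuracy"])
```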
What is Stochastic Gradient Descent (SGD)?
Stochastic Gradient Descent (SGD) is a basic optimization algorithm widely used in deep learning. It updates the model parameters based on the gradients of a randomly sampled subset of training examples.
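To make the update rule concrete, here is a minimal NumPy sketch of SGD fitting a toy one-parameter linear model; the dataset, learning rate, and batch size are made up for illustration.

```python
# Minimal NumPy sketch of the SGD update rule (illustrative, not a full training loop).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=200)
y = 3.0 * X + rng.normal(scale=0.1, size=200)   # true weight is 3.0

w = 0.0            # parameter to learn
eta = 0.1          # learning rate
batch_size = 16

for step in range(200):
    # Randomly sampled mini-batch of training examples
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error with respect to w on this mini-batch
    grad = 2 * np.mean((w * xb - yb) * xb)
    # SGD update: step against the gradient
    w -= eta * grad

print(w)   # should end up close to 3.0
```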
What are the limitations of Stochastic Gradient Descent (SGD)?
Limitations of SGD include its sensitivity to the learning rate and the need for careful tuning, as well as slow convergence and difficulty in handling noisy or ill-conditioned data.
What is Momentum optimization?
Momentum optimization is an extension of SGD that accumulates a moving average of past gradients to accelerate convergence. It helps the optimizer navigate flat areas of the loss function and escape local minima.
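A minimal sketch of one Momentum step, using one common sign convention (texts differ); theta, m, and grad are placeholder arrays of the same shape.

```python
# A minimal sketch of one Momentum optimization step.
def momentum_step(theta, m, grad, eta=0.01, beta=0.9):
    # m accumulates an exponentially decaying moving average of past gradients,
    # so consistent gradient directions build up speed while noisy ones cancel out.
    m = beta * m - eta * grad   # update the momentum vector
    theta = theta + m           # move the parameters along the momentum
    return theta, m
```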
What are the advantages of using Momentum optimization?
Advantages of Momentum optimization include faster convergence, improved resistance to noise, and enhanced ability to escape shallow local minima.
What is the Adam optimizer?
Adam (Adaptive Moment Estimation) is an advanced optimization algorithm that combines features of both Momentum optimization and RMSProp. It adapts the learning rate for each parameter based on its gradient history.
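A minimal sketch of one Adam step with bias-corrected moment estimates; the function name and arguments are illustrative, and the hyperparameter defaults shown are the commonly cited ones.

```python
# Minimal sketch of one Adam update step.
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients (1st moment)
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of squared gradients (2nd moment)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the 1st moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the 2nd moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return theta, m, v
```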
What are the benefits of using the Adam optimizer?
Benefits of using the Adam optimizer include faster convergence, robustness to different learning rates, and automatic adjustment of the learning rate based on the estimated gradient moments.
What is the Nesterov Accelerated Gradient?
Nesterov Accelerated Gradient (NAG) is an optimization algorithm that builds upon traditional momentum-based optimization by introducing a “look-ahead” mechanism. Instead of relying solely on the accumulated past gradients to update the parameters, NAG considers a future position of the parameters based on the current momentum. It first computes a partial update using the current momentum and then evaluates the gradient at this partially updated position. The momentum is adjusted based on this evaluated gradient and is then used to compute the final parameter update. By incorporating this “look-ahead” effect, NAG can make more informed and accurate updates, leading to faster convergence and improved optimization performance. NAG is particularly effective when the optimization landscape is complex and finding optimal solutions is challenging, such as when training deep neural networks.
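The “look-ahead” mechanism described above can be sketched in a few lines; grad_fn is a hypothetical function returning the gradient of the loss at a given point, and the sign convention matches the Momentum sketch earlier.

```python
# Minimal sketch of one Nesterov Accelerated Gradient step.
def nag_step(theta, m, grad_fn, eta=0.01, beta=0.9):
    lookahead = theta + beta * m    # partial update using the current momentum
    grad = grad_fn(lookahead)       # evaluate the gradient slightly ahead of theta
    m = beta * m - eta * grad       # adjust the momentum with the look-ahead gradient
    theta = theta + m               # final parameter update
    return theta, m
```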
What is the learning rate schedule?
The learning rate schedule refers to the strategy used to adjust the learning rate during the training process. It can be fixed, decayed over time, or dynamically adjusted based on the model’s performance.
What is the purpose of learning rate decay?
Learning rate decay gradually reduces the learning rate over time to fine-tune the model’s parameters and facilitate convergence. It helps prevent overshooting the optimal solution.
What are the different learning schedules?
Strategies used to reduce the learning rate during training are called learning schedules (minimal code sketches of these schedules follow the list below).
1. Power scheduling: Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c. The initial learning rate η0, the power c (set to 1), and the steps s are hyperparameters.
2. Exponential scheduling: Set the learning rate to η(t) = η0 · 0.1^(t/s). The learning rate will gradually drop by a factor of 10 every s steps.
3. Piecewise constant scheduling: Use a constant learning rate for a number of epochs (e.g., η0 = 0.1 for 5 epochs), then a smaller learning rate for another number of epochs (e.g., η1 = 0.001 for 50 epochs), and so on.
4. Performance scheduling: Measure validation error every N steps (like early stopping)
and reduce learning rate by a factor of λ when error stops dropping.
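The first three schedules can be written as plain Python functions of the step (or epoch) number; the hyperparameter values below are illustrative. Performance scheduling is typically implemented as a training callback that watches the validation metric.

```python
# Minimal sketches of the learning schedules above (hyperparameter values are illustrative).

def power_schedule(t, eta0=0.01, s=1000, c=1):
    # Power scheduling: eta(t) = eta0 / (1 + t/s)**c
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=1000):
    # Exponential scheduling: eta(t) = eta0 * 0.1**(t/s) -> drops by 10x every s steps
    return eta0 * 0.1 ** (t / s)

def piecewise_constant_schedule(epoch):
    # Piecewise constant scheduling: one rate for the first epochs, then a smaller one
    return 0.1 if epoch < 5 else 0.001

# Performance scheduling is usually handled by a callback that monitors validation error,
# e.g. keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5) in TensorFlow/Keras.
```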
What are adaptive learning rate methods?
Adaptive learning rate methods dynamically adjust the learning rate during training based on the gradients and other factors. Examples include AdaGrad, RMSProp, and Adam.
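Assuming TensorFlow/Keras, these adaptive optimizers are available out of the box; the values shown are just common defaults, not tuned recommendations.

```python
# Minimal usage sketch, assuming TensorFlow/Keras.
from tensorflow import keras

adagrad = keras.optimizers.Adagrad(learning_rate=0.001)
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
```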