Class Thirteen Flashcards
What are optimizers in the context of deep neural networks?
Optimizers are algorithms used to update the parameters of deep neural networks during the training process. They aim to minimize the loss function and find the optimal set of weights that result in the best model performance.
What is the role of an optimizer in deep learning?
The role of an optimizer is to iteratively adjust the model parameters based on the gradients of the loss function. This helps the model converge towards the global or local minimum of the loss function, improving its performance.
What are some challenges in training deep neural networks with traditional optimizers?
Challenges in training deep neural networks with traditional optimizers include slow convergence, vanishing/exploding gradients, and difficulty in escaping local minima.
How can you improve the speed of training?
- Using good initialization strategy for connection weights
- Using a good activation function
- Using Batch Normalization
- Reusing parts of a pretrained network
- Another way: use a faster optimizer than the regular Gradient Descent optimizer (several of these techniques are combined in the sketch after this list).
- Popular faster optimization algorithms:
  - Momentum optimization
  - Nesterov Accelerated Gradient
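As a concrete illustration, here is a minimal sketch assuming TensorFlow/Keras; the layer sizes, input shape, and hyperparameters are illustrative placeholders, not recommendations.

```python
# A minimal sketch, assuming TensorFlow/Keras, combining several training speed-ups.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    # Good initialization strategy: He initialization pairs well with ReLU-family activations
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    # Batch Normalization stabilizes and speeds up training
    keras.layers.BatchNormalization(),
    # A good (non-saturating) activation function
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax"),
])

# A faster optimizer than plain Gradient Descent: SGD with Momentum optimization
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              metrics=["accuracy"])
```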
What is Stochastic Gradient Descent (SGD)?
Stochastic Gradient Descent (SGD) is a basic optimization algorithm widely used in deep learning. It updates the model parameters based on the gradients of a randomly sampled subset of training examples.
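To make the update rule concrete, here is a minimal NumPy sketch of SGD fitting a toy one-parameter linear model; the dataset, learning rate, and batch size are made up for illustration.

```python
# Minimal NumPy sketch of the SGD update rule (illustrative, not a full training loop).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=200)
y = 3.0 * X + rng.normal(scale=0.1, size=200)   # true weight is 3.0

w = 0.0            # parameter to learn
eta = 0.1          # learning rate
batch_size = 16

for step in range(200):
    # Randomly sampled mini-batch of training examples
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error with respect to w on this mini-batch
    grad = 2 * np.mean((w * xb - yb) * xb)
    # SGD update: step against the gradient
    w -= eta * grad

print(w)   # should end up close to 3.0
```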
What are the limitations of Stochastic Gradient Descent (SGD)?
Limitations of SGD include its sensitivity to the learning rate and the need for careful tuning, as well as slow convergence and difficulty in handling noisy or ill-conditioned data.
What is Momentum optimization?
Momentum optimization is an extension of SGD that accumulates a moving average of past gradients to accelerate convergence. It helps the optimizer navigate flat areas of the loss function and escape local minima.
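A minimal sketch of one Momentum step, using one common sign convention (texts differ); theta, m, and grad are placeholder arrays of the same shape.

```python
# A minimal sketch of one Momentum optimization step.
def momentum_step(theta, m, grad, eta=0.01, beta=0.9):
    # m accumulates an exponentially decaying moving average of past gradients,
    # so consistent gradient directions build up speed while noisy ones cancel out.
    m = beta * m - eta * grad   # update the momentum vector
    theta = theta + m           # move the parameters along the momentum
    return theta, m
```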
What are the advantages of using Momentum optimization?
Advantages of Momentum optimization include faster convergence, improved resistance to noise, and enhanced ability to escape shallow local minima.
What is the Adam optimizer?
Adam (Adaptive Moment Estimation) is an advanced optimization algorithm that combines features of both Momentum optimization and RMSProp. It adapts the learning rate for each parameter based on its gradient history.
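A minimal sketch of one Adam step with bias-corrected moment estimates; the function name and arguments are illustrative, and the hyperparameter defaults shown are the commonly cited ones.

```python
# Minimal sketch of one Adam update step.
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients (1st moment)
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of squared gradients (2nd moment)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the 1st moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the 2nd moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return theta, m, v
```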
What are the benefits of using the Adam optimizer?
Benefits of using the Adam optimizer include faster convergence, robustness to different learning rates, and automatic adjustment of the learning rate based on the estimated gradient moments.
What is the Nesterov Accelerated Gradient?
Nesterov Accelerated Gradient (NAG) is an optimization algorithm that builds upon traditional momentum-based optimization by introducing a “look-ahead” mechanism. Instead of relying solely on the accumulated past gradients to update the parameters, NAG considers a future position of the parameters based on the current momentum. It first computes a partial update using the current momentum and then evaluates the gradient at this partially updated position. The momentum is adjusted based on this evaluated gradient and is then used to compute the final parameter update. By incorporating this “look-ahead” effect, NAG can make more informed and accurate updates, leading to faster convergence and improved optimization performance. NAG is particularly effective when the optimization landscape is complex and finding optimal solutions is challenging, such as when training deep neural networks.
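The “look-ahead” mechanism described above can be sketched in a few lines; grad_fn is a hypothetical function returning the gradient of the loss at a given point, and the sign convention matches the Momentum sketch earlier.

```python
# Minimal sketch of one Nesterov Accelerated Gradient step.
def nag_step(theta, m, grad_fn, eta=0.01, beta=0.9):
    lookahead = theta + beta * m    # partial update using the current momentum
    grad = grad_fn(lookahead)       # evaluate the gradient slightly ahead of theta
    m = beta * m - eta * grad       # adjust the momentum with the look-ahead gradient
    theta = theta + m               # final parameter update
    return theta, m
```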
What is the learning rate schedule?
The learning rate schedule refers to the strategy used to adjust the learning rate during the training process. It can be fixed, decayed over time, or dynamically adjusted based on the model’s performance.
What is the purpose of learning rate decay?
Learning rate decay gradually reduces the learning rate over time to fine-tune the model’s parameters and facilitate convergence. It helps prevent overshooting the optimal solution.
What are the different learning schedules?
Strategies used to reduce the learning rate during training are called learning schedules (minimal code sketches of these schedules follow the list below).
1. Power scheduling: Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c. The initial learning rate η0, the power c (set to 1), and the steps s are hyperparameters.
2. Exponential scheduling: Set the learning rate to η(t) = η0 · 0.1^(t/s). The learning rate will gradually drop by a factor of 10 every s steps.
3. Piecewise constant scheduling: Use a constant learning rate for a number of epochs (e.g., η0 = 0.1 for 5 epochs), then a smaller learning rate for another number of epochs (e.g., η1 = 0.001 for 50 epochs), and so on.
4. Performance scheduling: Measure validation error every N steps (like early stopping)
and reduce learning rate by a factor of λ when error stops dropping.
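The first three schedules can be written as plain Python functions of the step (or epoch) number; the hyperparameter values below are illustrative. Performance scheduling is typically implemented as a training callback that watches the validation metric.

```python
# Minimal sketches of the learning schedules above (hyperparameter values are illustrative).

def power_schedule(t, eta0=0.01, s=1000, c=1):
    # Power scheduling: eta(t) = eta0 / (1 + t/s)**c
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=1000):
    # Exponential scheduling: eta(t) = eta0 * 0.1**(t/s) -> drops by 10x every s steps
    return eta0 * 0.1 ** (t / s)

def piecewise_constant_schedule(epoch):
    # Piecewise constant scheduling: one rate for the first epochs, then a smaller one
    return 0.1 if epoch < 5 else 0.001

# Performance scheduling is usually handled by a callback that monitors validation error,
# e.g. keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5) in TensorFlow/Keras.
```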
What are adaptive learning rate methods?
Adaptive learning rate methods dynamically adjust the learning rate during training based on the gradients and other factors. Examples include AdaGrad, RMSProp, and Adam.
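Assuming TensorFlow/Keras, these adaptive optimizers are available out of the box; the values shown are just common defaults, not tuned recommendations.

```python
# Minimal usage sketch, assuming TensorFlow/Keras.
from tensorflow import keras

adagrad = keras.optimizers.Adagrad(learning_rate=0.001)
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
```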