Optimizers Flashcards
What is an optimizer?
Optimizers tie the loss function and the model parameters together: they minimize the loss function by adjusting the weights and biases of the neural network based on the output of the loss function
The loss function guides the optimizer - imagine you are standing at the top of a mountain: you are the neural network, descending the mountain represents minimizing the error, and your feet are the loss function, guiding each step down the mountain
What is gradient descent?
An iterative algorithm that starts at a random point on the loss function and travels down its slope in steps until it reaches the lowest point (minimum) of the function
Closely tied to backpropagation - backpropagation computes the gradients of the loss with respect to the network's weights, and gradient descent uses those gradients to update the weights
How does gradient descent work?
- Calculates what a small change in each individual weight would do to the loss function
- Adjusts each parameter based on its gradient, i.e. takes a small step in the determined direction
- Repeats steps 1 and 2 until the loss function is as low as possible (see the sketch below)
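A minimal sketch of that loop on a toy one-variable loss (the function, its gradient, and the starting point are all made up for illustration):

```python
# Toy loss with a single weight: loss(w) = (w - 3)^2, minimum at w = 3.
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)   # hand-derived derivative of the toy loss

w = 10.0                 # start at an arbitrary point
learning_rate = 0.1
for step in range(100):
    grad = gradient(w)            # step 1: how a small change in w affects the loss
    w -= learning_rate * grad     # step 2: take a small step against the gradient
print(w, loss(w))                 # w ends up very close to the minimum at 3
```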
What is the actual gradient?
The gradient of a function is the vector of partial derivatives with respect to all the independent variables and always points in the direction of the steepest increase in the function
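Written out for a function f of n independent variables (standard notation, added here for reference):

```latex
\nabla f(x_1, \dots, x_n) =
\left( \frac{\partial f}{\partial x_1},
       \frac{\partial f}{\partial x_2},
       \dots,
       \frac{\partial f}{\partial x_n} \right)
```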
What is an approach to avoid getting stuck in a local minimum while using gradient descent?
Use a proper learning rate - usually a small number like 0.001 that is multiplied with the gradients to scale them. This keeps any change to the weights quite small, because if you take jumps that are too big you run the risk of skipping over the optimal value for a given weight
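A toy illustration of why the scale matters (the loss and the numbers are invented): with a small learning rate the weight creeps toward the minimum, while a learning rate that is too large makes ever bigger jumps past it.

```python
# loss(w) = w^2 has its minimum at w = 0; its gradient is 2w.
def grad(w):
    return 2 * w

for lr in (0.001, 1.5):
    w = 5.0
    for _ in range(50):
        w -= lr * grad(w)   # scale the gradient by the learning rate
    print(f"lr={lr}: w after 50 steps = {w}")
# lr=0.001 slowly approaches 0; lr=1.5 overshoots and blows up.
```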
What is stochastic gradient descent?
Uses a subset (mini-batch) of training examples on each pass rather than the entire training set
SGD is an implementation of gradient descent that uses batches on each pass
Often combined with momentum to accumulate past gradients
Less computationally intensive per update - see the sketch below
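A sketch of mini-batch SGD with momentum on a toy linear-regression problem (the data, batch size, and hyperparameters are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)   # synthetic targets

w = np.zeros(3)
velocity = np.zeros(3)
lr, momentum, batch_size = 0.1, 0.9, 32

for epoch in range(20):
    idx = rng.permutation(len(X))               # shuffle each pass
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]   # a small subset of the training set
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # mean-squared-error gradient on the batch
        velocity = momentum * velocity - lr * grad     # momentum accumulates past gradients
        w += velocity
print(w)   # close to true_w
```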
What is Adagrad?
Adapts the learning rate to individual features
Some weights will have different learning rates than others
Ideal for sparse datasets where many input features are rarely seen
The learning rate tends to get smaller over time because the accumulated squared gradients only ever grow (see the sketch below)
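A minimal sketch of the Adagrad update rule (the gradients below are made up; in practice they come from backpropagation on each batch):

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])
accum = np.zeros_like(w)          # running sum of squared gradients, one entry per parameter
lr, eps = 0.1, 1e-8

toy_grads = [np.array([0.2, 0.0, -0.5]),
             np.array([0.1, 0.0, -0.4])]   # second entry stays zero, mimicking a sparse feature

for grad in toy_grads:
    accum += grad ** 2                        # accumulated squared gradients only grow
    w -= lr * grad / (np.sqrt(accum) + eps)   # each weight gets its own effective learning rate
print(w)
```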
What is RMSprop?
A specialized version of Adagrad
Accumulates gradients in a fixed window - in practice an exponentially decaying average of the squared gradients
Similar to Adadelta (see the sketch below)
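A sketch of the RMSprop update: like Adagrad, but the squared gradients are averaged with exponential decay (roughly a moving window) instead of being summed forever. The gradients here are toy values:

```python
import numpy as np

w = np.array([0.5, -1.0])
avg_sq = np.zeros_like(w)
lr, decay, eps = 0.01, 0.9, 1e-8

for grad in [np.array([0.3, -0.1]), np.array([0.2, -0.2])]:
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2   # decaying average of squared gradients
    w -= lr * grad / (np.sqrt(avg_sq) + eps)
print(w)
```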
What is Adam?
Stands for adaptive moment estimation and is another way of using past gradients to compute the current update
Uses the concept of momentum, which is a way of telling the neural network whether we want past changes to affect the new update, by adding fractions of the previous gradients to the current one (see the sketch below)
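A sketch of the Adam update, combining a decaying average of past gradients (momentum, the first moment) with an RMSprop-style average of squared gradients (the second moment). The gradients are toy values and the hyperparameters are common defaults:

```python
import numpy as np

w = np.array([0.5, -1.0])
m = np.zeros_like(w)   # first moment: decaying average of gradients (momentum)
v = np.zeros_like(w)   # second moment: decaying average of squared gradients
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

toy_grads = [np.array([0.3, -0.1]), np.array([0.2, -0.2])]
for t, grad in enumerate(toy_grads, start=1):
    m = beta1 * m + (1 - beta1) * grad          # add a fraction of past gradients to the current one
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)
```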
What is the overall goal of optimizers in neural networks?
Minimizing the loss function