Optimizers Flashcards

1
Q

What is an optimizer?

A

Optimizers minimize the loss function by tying together the loss function and the model parameters: based on the output of the loss function, they update the neural network's weights and biases.

The loss function guides the optimizer - imagine you are standing at the top of a mountain: you are the neural network, descending the mountain represents minimizing the error, and the loss function is like your feet feeling the slope and guiding you down.

2
Q

What is gradient descent?

A

Iterative algo that starts off at a random point on the loss function and travels down its slope in steps until it reaches the lowest point (minimum) of the function

Backpropagation is essentially gradient descent implemented on a neural network: backprop computes the gradients, and gradient descent uses them to update the weights.

3
Q

How does gradient descent work?

A
  1. Calculates what a small change in each individual weight would do to the loss function
  2. Adjusts each parameter based on its gradient, i.e. take a small step in the determined direction
  3. Repeats steps 1 and 2 until the loss function is as low as possible
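
A minimal sketch of those three steps in Python (the quadratic toy loss, starting point, and learning rate here are invented for illustration):

```python
import numpy as np

# Toy loss: L(w) = (w1 - 3)^2 + (w2 + 1)^2, minimum at w = [3, -1]
def loss(w):
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def gradient(w):
    # Step 1: partial derivative of the loss with respect to each weight
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])          # arbitrary starting point
learning_rate = 0.1

for step in range(100):           # step 3: repeat until the loss is low
    grad = gradient(w)
    w = w - learning_rate * grad  # step 2: small step against the gradient

print(w, loss(w))                 # w is close to [3, -1], loss is close to 0
```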
4
Q

What is the actual gradient?

A

The gradient of a function is the vector of its partial derivatives with respect to all the independent variables, and it always points in the direction of the steepest increase of the function.
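
A quick illustration (the function f(x, y) = x^2 + 3y is an arbitrary example, not from the card): its gradient is the vector of partials (2x, 3), which a finite-difference check confirms:

```python
import numpy as np

# f(x, y) = x^2 + 3y  ->  gradient = (df/dx, df/dy) = (2x, 3)
def f(v):
    x, y = v
    return x ** 2 + 3 * y

def grad_f(v):
    x, _y = v
    return np.array([2 * x, 3.0])

point = np.array([1.0, 2.0])
print(grad_f(point))              # analytic gradient: [2. 3.]

# Finite-difference check: nudge each variable slightly and watch f change
eps = 1e-6
numeric = np.array([
    (f(point + np.array([eps, 0.0])) - f(point)) / eps,
    (f(point + np.array([0.0, eps])) - f(point)) / eps,
])
print(numeric)                    # approximately [2. 3.]
```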

5
Q

What is an approach to avoid getting stuck in a local minimum while using gradient descent?

A

Use a proper learning rate - usually a small number like 0.001 that the gradients are multiplied by to scale them. This ensures that any changes made to the weights are quite small, because if you make jumps that are too big you run the risk of skipping over the optimal value for a given weight.
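
A tiny 1D sketch of why the step size matters (the loss w^2 and the two learning rates are chosen arbitrarily): a rate that is too large overshoots the minimum and diverges, while a small one settles toward it:

```python
# Toy 1D loss: L(w) = w^2, gradient = 2w, minimum at w = 0
def grad(w):
    return 2 * w

for lr in (1.1, 0.001):          # a learning rate that is too big vs. a small one
    w = 5.0
    for _ in range(1000):
        w = w - lr * grad(w)     # the gradient is scaled by the learning rate
    print(lr, w)                 # lr=1.1: w has exploded; lr=0.001: w has shrunk toward 0
```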

6
Q

What is stochastic gradient descent (SGD)?

A

Uses a subset (mini-batch) of training examples on each pass rather than the entire dataset

SGD is an implementation of gradient descent that uses batches on each pass

Often combined with momentum, which accumulates past gradients to smooth the updates

Less computationally intensive per update than full-batch gradient descent
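
A rough mini-batch SGD sketch with momentum (the toy regression data, batch size of 32, and momentum of 0.9 are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise; we want SGD to recover the slope 2
X = rng.normal(size=1000)
y = 2 * X + 0.1 * rng.normal(size=1000)

w, velocity = 0.0, 0.0
learning_rate, momentum, batch_size = 0.1, 0.9, 32

for epoch in range(5):
    order = rng.permutation(len(X))                  # shuffle each pass
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * np.mean((w * xb - yb) * xb)       # gradient on the mini-batch only
        velocity = momentum * velocity - learning_rate * grad
        w += velocity                                # momentum accumulates past gradients
print(w)                                             # close to 2
```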

7
Q

What is Adagrad?

A

Adapts the learning rate to individual parameters/features

Different weights effectively get different learning rates

Ideal for sparse datasets where many input features are zero or missing

The effective learning rate tends to shrink over time, because the squared gradients keep accumulating
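
A sketch of the Adagrad update rule (the toy gradient, starting weights, and hyperparameters are assumptions for illustration, not a full implementation):

```python
import numpy as np

w = np.array([0.5, -0.3])       # toy starting weights
accum = np.zeros_like(w)        # running sum of squared gradients, one entry per weight
learning_rate, eps = 0.1, 1e-8

def grad_fn(w):
    return w                    # toy gradient: pretend the loss is 0.5 * ||w||^2

for _ in range(100):
    g = grad_fn(w)
    accum += g ** 2                                  # history only ever grows
    w = w - learning_rate * g / (np.sqrt(accum) + eps)
    # Each weight is divided by its own accumulated history, so each weight
    # effectively has its own learning rate, and that rate shrinks over time.

print(w)                        # both weights have moved toward 0, in ever smaller steps
```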

8
Q

RMSprop?

A

Specialized version of Adagrad

Accumulates squared gradients over a moving window (an exponentially decaying average) instead of over the entire history

Similar to Adadelta
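
A sketch of the RMSprop update (same toy setup as the Adagrad card; the decay rate of 0.9 is a typical but assumed value): instead of summing every past squared gradient, it keeps a decaying average, so the effective learning rate does not shrink toward zero:

```python
import numpy as np

w = np.array([0.5, -0.3])
avg_sq = np.zeros_like(w)       # decaying average of squared gradients (the "window")
learning_rate, decay, eps = 0.01, 0.9, 1e-8

def grad_fn(w):
    return w                    # same toy gradient: loss 0.5 * ||w||^2

for _ in range(200):
    g = grad_fn(w)
    avg_sq = decay * avg_sq + (1 - decay) * g ** 2   # older gradients fade out
    w = w - learning_rate * g / (np.sqrt(avg_sq) + eps)

print(w)                        # weights end up near 0, with the step size never decaying to nothing
```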

9
Q

Adam?

A

Stands for Adaptive Moment Estimation and is another way of using past gradients to inform the current gradient step

Uses the concept of momentum, which is a way of telling the neural network how much we want past changes to affect the new change, by adding a fraction of the previous gradients to the current one
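
A sketch of the Adam update (toy gradient and the commonly cited default hyperparameters assumed): it keeps a momentum-style average of past gradients and an RMSprop-style average of squared gradients, plus a bias correction for the early steps:

```python
import numpy as np

w = np.array([0.5, -0.3])
m = np.zeros_like(w)            # first moment: decaying average of past gradients (momentum)
v = np.zeros_like(w)            # second moment: decaying average of squared gradients
learning_rate, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

def grad_fn(w):
    return w                    # same toy gradient: loss 0.5 * ||w||^2

for t in range(1, 1001):
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g          # fraction of previous gradients carried forward
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)             # bias correction for the first few steps
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(w)                        # both weights have moved toward the minimum at 0
```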

10
Q

What is the overall goal of optimizers in neural networks?

A

Minimizing the loss function
