Optimizers Flashcards
What is an optimizer?
Optimizers tie the loss function and the model parameters together: they minimize the loss function by adjusting the weights and biases of the neural network based on the output of the loss function
The loss function guides the optimizer - imagine you are standing at the top of a mountain: you are the neural network, descending the mountain represents minimizing the error, and your feet are the loss function, guiding each step down the mountain
What is gradient descent?
An iterative algorithm that starts at a random point on the loss function and travels down its slope in steps until it reaches the lowest point (minimum) of the function
Closely tied to backpropagation - backpropagation computes the gradients of the loss with respect to the network's weights, and gradient descent uses those gradients to update the weights
How does gradient descent work?
- Calculates what a small change in each individual weight would do to the loss function
- Adjusts each parameter based on its gradient, i.e. takes a small step in the determined direction
- Repeats steps 1 and 2 until the loss function is as low as possible (see the sketch below)
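A minimal sketch of that loop on a toy one-variable loss (the function, its gradient, and the starting point are all made up for illustration):

```python
# Toy loss with a single weight: loss(w) = (w - 3)^2, minimum at w = 3.
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)   # hand-derived derivative of the toy loss

w = 10.0                 # start at an arbitrary point
learning_rate = 0.1
for step in range(100):
    grad = gradient(w)            # step 1: how a small change in w affects the loss
    w -= learning_rate * grad     # step 2: take a small step against the gradient
print(w, loss(w))                 # w ends up very close to the minimum at 3
```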
What is the actual gradient?
The gradient of a function is the vector of partial derivatives with respect to all the independent variables and always points in the direction of the steepest increase in the function
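Written out for a function f of n independent variables (standard notation, added here for reference):

```latex
\nabla f(x_1, \dots, x_n) =
\left( \frac{\partial f}{\partial x_1},
       \frac{\partial f}{\partial x_2},
       \dots,
       \frac{\partial f}{\partial x_n} \right)
```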
What is an approach to avoid getting stuck in a local minimum while using gradient descent?
Use a proper learning rate - usually a small number like 0.001 that is multiplied with the gradients to scale them. This keeps any change to the weights quite small, because if you take jumps that are too big you run the risk of skipping over the optimal value for a given weight
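A toy illustration of why the scale matters (the loss and the numbers are invented): with a small learning rate the weight creeps toward the minimum, while a learning rate that is too large makes ever bigger jumps past it.

```python
# loss(w) = w^2 has its minimum at w = 0; its gradient is 2w.
def grad(w):
    return 2 * w

for lr in (0.001, 1.5):
    w = 5.0
    for _ in range(50):
        w -= lr * grad(w)   # scale the gradient by the learning rate
    print(f"lr={lr}: w after 50 steps = {w}")
# lr=0.001 slowly approaches 0; lr=1.5 overshoots and blows up.
```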
What is stochastic gradient descent?
Uses a subset (mini-batch) of training examples on each pass rather than the entire training set
SGD is an implementation of gradient descent that uses batches on each pass
Often combined with momentum to accumulate past gradients
Less computationally intensive per update - see the sketch below
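A sketch of mini-batch SGD with momentum on a toy linear-regression problem (the data, batch size, and hyperparameters are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)   # synthetic targets

w = np.zeros(3)
velocity = np.zeros(3)
lr, momentum, batch_size = 0.1, 0.9, 32

for epoch in range(20):
    idx = rng.permutation(len(X))               # shuffle each pass
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]   # a small subset of the training set
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # mean-squared-error gradient on the batch
        velocity = momentum * velocity - lr * grad     # momentum accumulates past gradients
        w += velocity
print(w)   # close to true_w
```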
What is Adagrad?
Adapts the learning rate to individual features
Some weights will have different learning rates than others
Ideal for sparse datasets where many input features are rarely seen
The learning rate tends to get smaller over time because the accumulated squared gradients only ever grow (see the sketch below)
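A minimal sketch of the Adagrad update rule (the gradients below are made up; in practice they come from backpropagation on each batch):

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])
accum = np.zeros_like(w)          # running sum of squared gradients, one entry per parameter
lr, eps = 0.1, 1e-8

toy_grads = [np.array([0.2, 0.0, -0.5]),
             np.array([0.1, 0.0, -0.4])]   # second entry stays zero, mimicking a sparse feature

for grad in toy_grads:
    accum += grad ** 2                        # accumulated squared gradients only grow
    w -= lr * grad / (np.sqrt(accum) + eps)   # each weight gets its own effective learning rate
print(w)
```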
What is RMSprop?
A specialized version of Adagrad
Accumulates gradients in a fixed window - in practice an exponentially decaying average of the squared gradients
Similar to Adadelta (see the sketch below)
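A sketch of the RMSprop update: like Adagrad, but the squared gradients are averaged with exponential decay (roughly a moving window) instead of being summed forever. The gradients here are toy values:

```python
import numpy as np

w = np.array([0.5, -1.0])
avg_sq = np.zeros_like(w)
lr, decay, eps = 0.01, 0.9, 1e-8

for grad in [np.array([0.3, -0.1]), np.array([0.2, -0.2])]:
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2   # decaying average of squared gradients
    w -= lr * grad / (np.sqrt(avg_sq) + eps)
print(w)
```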
What is Adam?
Stands for adaptive moment estimation and is another way of using past gradients to compute the current update
Uses the concept of momentum, which is a way of telling the neural network whether we want past changes to affect the new update, by adding fractions of the previous gradients to the current one (see the sketch below)
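A sketch of the Adam update, combining a decaying average of past gradients (momentum, the first moment) with an RMSprop-style average of squared gradients (the second moment). The gradients are toy values and the hyperparameters are common defaults:

```python
import numpy as np

w = np.array([0.5, -1.0])
m = np.zeros_like(w)   # first moment: decaying average of gradients (momentum)
v = np.zeros_like(w)   # second moment: decaying average of squared gradients
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

toy_grads = [np.array([0.3, -0.1]), np.array([0.2, -0.2])]
for t, grad in enumerate(toy_grads, start=1):
    m = beta1 * m + (1 - beta1) * grad          # add a fraction of past gradients to the current one
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)
```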
What is the overall goal of optimizers in neural networks?
Minimizing the loss function