ANN Lecture 3 - Backpropagation and Gradient Descent Flashcards
Error Surface
The error surface visualizes the loss as a function of the network's parameters. The aim is to find
the global minimum of the error surface!
Finding the global minimum of the error surface
Computing the whole error surface is usually infeasible, since there are far too many parameter combinations.
-> Gradient Descent:
Evaluate the error surface only at the current combination of weights and find out which way is downhill.
Derivative
The derivative of a function is itself a function that describes the slope of the original function at every point.
Partial Derivative
We call a derivative a partial derivative if the function we differentiate depends on more than one variable, but we differentiate with respect to only one of them.
Gradient
The gradient is the vector of all partial derivatives of the function.
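Not from the lecture, but a minimal sketch of the idea: each partial derivative can be approximated numerically by nudging one variable while holding the others fixed (the helper name, the step size and the example function are my own choices).

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate the gradient of f at point x via central differences.

    Each entry is a partial derivative: one variable is nudged at a time
    while all others stay fixed.
    """
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Example: f(x, y) = x^2 * y  ->  gradient (2xy, x^2)
f = lambda v: v[0] ** 2 * v[1]
print(numerical_gradient(f, np.array([3.0, 2.0])))  # approx. [12., 9.]
```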
Jacobian Matrix
The Jacobian matrix is the generalization of the gradient to a function that maps multiple variables onto multiple output dimensions: it collects the partial derivative of every output with respect to every input.
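A small worked example (my own, using the common convention that row i holds the partial derivatives of output i with respect to every input):

```latex
f(x, y) = \begin{pmatrix} x y \\ x + y \end{pmatrix}
\quad\Rightarrow\quad
J_f(x, y) = \begin{pmatrix}
\partial f_1 / \partial x & \partial f_1 / \partial y \\
\partial f_2 / \partial x & \partial f_2 / \partial y
\end{pmatrix}
= \begin{pmatrix} y & x \\ 1 & 1 \end{pmatrix}
```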
Gradient Descent
If we can calculate the gradient, we can just walk a bit in the opposite direction (because the gradient points uphill). Step by step, this leads us to a minimum.
Gradient Descent Rule
Last layer (weights between layer k and output layer k+1):
Gradient =
-(Target_(k+1) - Output_(k+1)) * Sigma'(Drive_(k+1)) * Activation_k
Every other layer (weights between layers k-1 and k):
Gradient =
Error_(k+1) * Weights_(k+1) * Sigma'(Drive_k) * Activation_(k-1)
where Error_(k+1) is the error signal backpropagated from layer k+1, i.e. its gradient term before the multiplication with the activation.
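A minimal NumPy sketch of these two rules for a toy two-layer sigmoid network with squared-error loss (layer sizes, variable names and the 1/2-MSE convention are my own assumptions, not taken from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Tiny two-layer network with made-up sizes and random values.
rng = np.random.default_rng(0)
a0 = rng.normal(size=3)            # input activation
W1 = rng.normal(size=(4, 3))       # hidden-layer weights
W2 = rng.normal(size=(2, 4))       # output-layer weights
target = np.array([0.0, 1.0])

# Forward pass: drive = weighted sum, activation = sigma(drive)
drive1 = W1 @ a0
a1 = sigmoid(drive1)
drive2 = W2 @ a1
a2 = sigmoid(drive2)               # network output

# Last layer: -(target - output) * sigma'(drive), times previous activation
delta2 = -(target - a2) * sigmoid_prime(drive2)
grad_W2 = np.outer(delta2, a1)

# Hidden layer: propagate the error back through the next layer's weights
delta1 = (W2.T @ delta2) * sigmoid_prime(drive1)
grad_W1 = np.outer(delta1, a0)
```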
Gradient Descent Parameter Update
New Parameters =
Old Parameters - Learning Rate * Gradients
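As a short illustration of this update rule (all numbers are made up):

```python
import numpy as np

learning_rate = 0.01                  # assumed value, a typical choice
old_params = np.array([0.5, -1.2])    # current parameters
gradients = np.array([0.3, -0.8])     # gradient of the loss w.r.t. the parameters

new_params = old_params - learning_rate * gradients
```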
Full Batch Gradient Descent
New Parameters =
Old Parameters - Learning Rate * 1/N * Sum of all Gradients
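A minimal sketch of the full-batch version, assuming the per-example gradients are already stacked row-wise (the numbers are illustrative):

```python
import numpy as np

learning_rate = 0.1
params = np.array([0.5, -1.2])
per_sample_grads = np.array([[0.3, -0.8],    # gradient for example 1
                             [0.1,  0.4],    # gradient for example 2
                             [0.2, -0.2]])   # gradient for example 3 (N = 3)

# Average the gradients over all N examples, then take a single step.
params = params - learning_rate * per_sample_grads.mean(axis=0)
```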
Non Convex Error Surface
- A non-convex function has multiple so-called critical points
- Optimization can get stuck at a local minimum or a saddle point, because the slope there is zero
Solution:
-> Mini Batch Gradient Descent
-> Stochastic Gradient Descent
Full Batch Gradient Descent (Pros & Cons)
Always minimizing the same error surface
+ Gradients show a clear direction
+ Guaranteed to converge to a solution
- Gets stuck once it reaches a local minimum
- Slow or even infeasible for huge data sets, since every update needs the gradients of all examples
Stochastic/Mini Batch Gradient Descent (Pros & Cons)
The error surface you minimize changes for each batch
- Gradient can differ heavily for each update
- Not guaranteed to converge to a solution
+ Has a chance of escaping local minima of the full error surface
+ Faster
Gradient Descent Algorithm
1. Initialize the parameters
2. Chunk your data into batches of the chosen batch size
3. For all batches:
   a. Feed the data of one batch through the network
   b. Calculate the gradient for the resulting loss function
   c. Update the parameters
After you're done with all batches, go back to step 2.
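A minimal runnable sketch of this loop, using a plain linear model with a mean-squared-error loss so the gradient in step b can be written by hand (data set, batch size, learning rate and number of epochs are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy data set
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

params = np.zeros(3)                           # 1. initialize parameters
learning_rate = 0.05
batch_size = 32

for epoch in range(10):                        # one epoch = one pass over all batches
    indices = rng.permutation(len(X))          # 2. chunk shuffled data into batches
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]

        # a. feed the batch through the (here: linear) model
        predictions = X_b @ params
        # b. gradient of the mean squared error w.r.t. the parameters
        gradient = 2.0 * X_b.T @ (predictions - y_b) / len(batch)
        # c. update the parameters
        params = params - learning_rate * gradient
    # after all batches: go back to step 2 for the next epoch

print(params)   # approaches [2.0, -1.0, 0.5]
```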
Training Step & Epoch
Training Step:
One update of the parameters using a single batch
Epoch:
One update of the parameters for every batch, i.e. one pass over the whole data set (e.g. with 1,000 examples and a batch size of 100, one epoch consists of 10 training steps)