SGD Flashcards
Define the gradient flow of F
The gradient flow tends to the minimiser as t -> inf
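A sketch of the standard definition, assuming F is differentiable (the notation x(t), x0 is mine, not necessarily what the card's image shows):

```latex
% Gradient flow of F: the curve x(t) that follows the negative gradient of F.
\[
  \dot{x}(t) = -\nabla F\big(x(t)\big), \qquad x(0) = x_0 .
\]
% For suitable F (e.g. convex with a unique minimiser x^*), x(t) -> x^* as t -> infinity.
```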
Discretise gradient flow
Euler approximation
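A sketch of the forward (explicit) Euler discretisation with step size η > 0, in my notation:

```latex
% Forward Euler discretisation of the gradient flow, step size \eta > 0:
\[
  x_{k+1} = x_k - \eta \, \nabla F(x_k), \qquad k = 0, 1, 2, \dots
\]
% This is exactly the (ordinary) gradient descent iteration.
```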
(Ordinary) gradient descent
An iterative algorithm that progressively seeks a minimiser via gradient updates:
Given some initial condition x0
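A minimal Python sketch of this iteration; the quadratic objective is my own illustrative example, not from the cards:

```python
import numpy as np

def gradient_descent(grad_F, x0, eta=0.1, n_steps=100):
    """Ordinary gradient descent: x_{k+1} = x_k - eta * grad_F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_F(x)
    return x

# Illustrative example: F(x) = ||x||^2 / 2 has gradient x and minimiser 0.
x_min = gradient_descent(grad_F=lambda x: x, x0=[3.0, -2.0])
```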
Motivate SGD
It seems one should minimise the empirical risk
However, computing the gradient of (image) can be computationally expensive for large N, since the empirical risk requires gradients over the entire training set at once.
Gradient descent applied to (image) may lead to an overfitted network
Therefore we use SGD
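The "(image)" here is presumably the empirical risk; a standard form, in my notation, is:

```latex
% Empirical risk over training pairs (x_i, y_i), i = 1, ..., N,
% for a network f_\theta and per-sample loss \ell:
\[
  R_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big),
  \qquad
  \nabla_\theta R_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta\, \ell\big(f_\theta(x_i), y_i\big).
\]
% The full gradient needs a pass over all N samples, hence the cost for large N.
```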
Explain the difference between GD and SGD
In SGD we RANDOMLY split the training data (indexed by 1, …, N) into mini-batches of size m ≪ N,
where N = km,
for k mini-batches.
We then uniformly sample the mini-batches without replacement.
Starting from some initial condition θ0, the parameter vector is updated by (image)
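The "(image)" update is presumably the standard mini-batch SGD step; a sketch in my notation, with B_t the mini-batch sampled at step t:

```latex
% SGD update over the mini-batch B_t sampled at step t, with |B_t| = m:
\[
  \theta_{t+1} = \theta_t - \eta \, \nabla_\theta R_{B_t}(\theta_t),
  \qquad
  R_{B_t}(\theta) = \frac{1}{m} \sum_{i \in B_t} \ell\big(f_\theta(x_i), y_i\big).
\]
```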
Mini batch empirical risk
Mini batch gradient?
(Top right)
If the data consists of IID samples, the mini-batch gradient can be considered a NOISY but UNBIASED estimator of the full gradient (bottom left)
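A sketch of the unbiasedness claim, reusing the notation from the sketches above:

```latex
% Under uniform sampling of the mini-batch indices, the mini-batch gradient
% has the full gradient as its expectation:
\[
  \mathbb{E}_{B_t}\!\big[\nabla_\theta R_{B_t}(\theta)\big] = \nabla_\theta R_N(\theta),
\]
% i.e. it is a noisy but unbiased estimator of the full gradient.
```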
Define one EPOCH
In the context of SGD, one epoch is one pass through the entire training data set, i.e. one SGD update per mini-batch (k updates in total)
How do epochs differ
At each epoch you reassign new mini-batches, changing what is in each batch,
but keeping the new batches disjoint
SGD pseudo code
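A minimal Python sketch of such pseudocode, under my assumptions about the interface (X, Y are numpy arrays and grad_loss(theta, X_batch, Y_batch) returns the mini-batch gradient); it is not the card's exact pseudocode:

```python
import numpy as np

def sgd(grad_loss, theta0, X, Y, eta=0.01, batch_size=32, n_epochs=10, seed=0):
    """Mini-batch SGD: each epoch reshuffles the data into disjoint mini-batches."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    N = len(X)
    for _ in range(n_epochs):
        perm = rng.permutation(N)                      # new random split every epoch
        for start in range(0, N, batch_size):
            batch = perm[start:start + batch_size]     # disjoint mini-batches
            g = grad_loss(theta, X[batch], Y[batch])   # mini-batch gradient
            theta = theta - eta * g                    # SGD update
    return theta
```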
Usual practice for the initial condition of SGD
It can be specified manually, or
it can be derived from unsupervised pre-training.
The standard approach is to use a random initialiser
Glorot/Xavier initialiser
Weights are drawn from a zero-mean normal distribution with variance depending on the number of units in the next layer (1/#weights)
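A numpy sketch of such an initialiser. Note the variance convention varies across references: the card states 1/#units, while the commonly cited Glorot/Xavier normal variant uses 2/(fan_in + fan_out); the code below uses the latter, with the card's rule noted in a comment.

```python
import numpy as np

def glorot_normal_init(fan_in, fan_out, rng=None):
    """Zero-mean normal initialiser with variance tied to the layer widths."""
    rng = rng or np.random.default_rng()
    # Glorot & Bengio (2010) normal variant; swap in 1.0 / fan_out for the card's 1/#units rule.
    var = 2.0 / (fan_in + fan_out)
    return rng.normal(loc=0.0, scale=np.sqrt(var), size=(fan_in, fan_out))

W = glorot_normal_init(fan_in=784, fan_out=128)   # e.g. weights of one MLP layer
```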
When can we prove SGD converges to the minimiser?
If the objective function is convex with a unique minimiser,
we can then prove convergence using the theory of Robbins-Monro stochastic approximation
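A sketch of the classical Robbins-Monro step-size conditions that such proofs typically assume, for a decaying learning-rate schedule η_t (my notation):

```latex
% Robbins--Monro conditions on the step sizes \eta_t under which the
% stochastic iteration converges to the unique minimiser (convex case):
\[
  \sum_{t=1}^{\infty} \eta_t = \infty,
  \qquad
  \sum_{t=1}^{\infty} \eta_t^{2} < \infty,
  \qquad \text{e.g. } \eta_t = c/t .
\]
```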
Robbins-Monro stochastic approximation
Limitation of Robbins-Monro stochastic approximation
fθ (and hence the objective) is rarely, if ever, convex
We can’t really give guarantees of convergence to anything other than a local minimiser
Global minimiser
A common school of thought is that we don't want the global minimiser of (image), if it exists, as it might overfit
Justify why SGD may be sub-optimal
Taking a step in the direction (image) in (S)GD need not be optimal.
For any η > 0: if η is too high, SGD may overshoot or zig-zag;
if η is too low, convergence may be too slow.
Therefore alternatives have been proposed
Alternatives to SGD
Nesterov Momentum Method (Ilya Sutskever)
RMSProp (Geoffrey Hinton)
ADAM
Explain ADAM
The parameter update direction is determined by an EXPONENTIALLY WEIGHTED MOVING AVERAGE (EWMA) of previous gradients,
while η is adaptively tuned by an EWMA of the squares of previous gradients
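A minimal Python sketch of one Adam update with the standard defaults from Kingma & Ba (2015); this is illustrative, not the cards' image:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t counts steps starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # EWMA of gradients -> update direction
    v = beta2 * v + (1 - beta2) * grad ** 2      # EWMA of squared gradients -> per-parameter scale
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero-initialised EWMAs
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```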