SGD Flashcards
Def gradient flow of F
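A standard form of the definition (assuming F is differentiable and x(0) = x0 is the initial condition):
dx/dt = -∇F(x(t)), t ≥ 0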
Gradient flow tends to the minimiser as t -> inf
Discretise gradient flow
Euler approximation
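With a step size (learning rate) η > 0, the forward-Euler discretisation of the gradient flow gives:
x_{k+1} = x_k - η ∇F(x_k)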
(Ordinary) gradient descent
Iterative algorithm that progressively seeks a minimiser via gradient updates:
Given some initial condition x0
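A minimal Python sketch of the iteration (illustrative only; grad_F, x0, eta and num_steps are assumed names/values):

import numpy as np

def gradient_descent(grad_F, x0, eta=0.1, num_steps=100):
    # Plain gradient descent: x_{k+1} = x_k - eta * grad F(x_k)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_F(x)
    return x

# Example: F(x) = ||x||^2 has gradient 2x and minimiser 0.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])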
Motivate SGD
It seems one should simply minimise the empirical risk directly.
However, computing the gradient of the empirical risk can be computationally expensive for large N, since it requires evaluating the gradient over the entire training set at once.
Gradient descent applied directly to the empirical risk may also lead to an overfitted network.
Therefore we use SGD
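In its standard form (assuming training pairs (x_i, y_i), a network f_θ and a per-example loss ℓ), the empirical risk referred to above is:
R_N(θ) = (1/N) Σ_{i=1}^{N} ℓ(f_θ(x_i), y_i)
Its gradient ∇R_N(θ) needs a per-example gradient for every one of the N training points, which is the cost that motivates SGD.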
Explain the difference between GD and SGD
In SGD we RANDOMLY split the training data (indexed by 1, …, N) into mini-batches of size m ≪ N
Where N = km
For k mini batches
Then we uniformly sample mini batches without replacement
Starting from some initial condition θ0, the parameter vector is updated with the mini-batch gradient (see the update rule below)
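A standard form of the update (assuming learning rate η > 0 and mini-batch B_k used at step k):
θ_{k+1} = θ_k - η ∇R_{B_k}(θ_k)
where R_{B_k} is the mini-batch empirical risk defined on the next card.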
Mini batch empirical risk
Mini batch gradient?
If the data consists of IID samples, we can consider the mini-batch gradient to be a NOISY but UNBIASED estimator of the full gradient of the empirical risk (see below)
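Standard forms of the quantities these cards refer to (assuming per-example loss ℓ and a mini-batch B of size m):
Mini-batch empirical risk:  R_B(θ) = (1/m) Σ_{i ∈ B} ℓ(f_θ(x_i), y_i)
Mini-batch gradient:        ∇R_B(θ) = (1/m) Σ_{i ∈ B} ∇_θ ℓ(f_θ(x_i), y_i)
Full gradient:              ∇R_N(θ) = (1/N) Σ_{i=1}^{N} ∇_θ ℓ(f_θ(x_i), y_i)
Unbiasedness means E[∇R_B(θ)] = ∇R_N(θ) when the mini-batch indices are drawn uniformly at random.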
Define one EPOCH
In the context of SGD, one epoch is one complete pass through the entire training data set, i.e. through all k mini-batches.
How do epochs differ from one another?
Between epochs you reassign new mini-batches, changing which samples are in each batch,
but keeping the new batches disjoint
SGD pseudo code
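A minimal Python sketch of the loop described above (illustrative only; grad_loss, theta0, numpy arrays X and y, and the values of m, eta, num_epochs are assumptions):

import numpy as np

def sgd(grad_loss, theta0, X, y, m=32, eta=0.01, num_epochs=10, seed=0):
    # grad_loss(theta, X_batch, y_batch) is assumed to return the
    # averaged mini-batch gradient of the loss w.r.t. theta.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    N = len(X)
    for _ in range(num_epochs):              # one epoch = one pass through the data
        perm = rng.permutation(N)            # reassign disjoint mini-batches each epoch
        for start in range(0, N, m):
            batch = perm[start:start + m]    # indices of the current mini-batch
            theta = theta - eta * grad_loss(theta, X[batch], y[batch])
    return theta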
Usual practice for the initial condition of SGD
Can be specified manually, or
can be derived from unsupervised pre-training
The standard approach is to use a random initialiser
Glorot/Xavier initialiser
Weights are drawn from a zero-mean normal distribution whose variance is scaled by the number of units in the next layer (on the order of 1/#weights)
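A minimal numpy sketch of one common convention, the Glorot/Xavier normal initialiser with variance 2/(fan_in + fan_out) (the exact variance convention in the course notes may differ; the layer sizes below are hypothetical):

import numpy as np

def glorot_normal(fan_in, fan_out, rng=None):
    # Zero-mean normal weights with variance 2 / (fan_in + fan_out)
    if rng is None:
        rng = np.random.default_rng()
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = glorot_normal(fan_in=784, fan_out=128)   # hypothetical layer sizes, for illustration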
When can we prove SGD converges to the minimiser?
If the objective function is convex with a unique minimiser.
We can then prove convergence using the theory of Robbins–Monro stochastic approximation.
Robbins–Monro stochastic approximation
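The standard Robbins–Monro step-size conditions (as usually stated; the course's exact assumptions may differ): use a decreasing learning rate η_k > 0 with
Σ_k η_k = ∞   and   Σ_k η_k² < ∞
(e.g. η_k = c/k). Under these conditions, together with unbiased gradient estimates of bounded variance and a convex objective with a unique minimiser, the SGD iterates can be shown to converge to the minimiser.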