SGD Flashcards

1
Q

Def gradient flow of F

A

The gradient flow tends to the minimiser as t -> inf
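
A minimal sketch of the usual formulation, assuming F is differentiable and writing x(t) for the flow started at x0 (this notation is assumed, not from the card):

\dot{x}(t) = -\nabla F(x(t)), \qquad x(0) = x_0,

so F(x(t)) is non-increasing along the flow.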

2
Q

Discretise gradient flow

A
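
A sketch of the standard discretisation, assuming a step size η > 0 and the forward-Euler scheme of the next card:

x_{k+1} = x_k - \eta \nabla F(x_k),

i.e. discretising the gradient flow in time recovers (ordinary) gradient descent.
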
3
Q

Euler approximation

A
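
A sketch of the (explicit) Euler approximation for a general ODE \dot{x}(t) = f(x(t)), assuming a step size η > 0:

x_{k+1} = x_k + \eta f(x_k), \qquad x_k \approx x(k\eta);

taking f = -\nabla F gives the discretised gradient flow of the previous card.
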
4
Q

(Ordinary) gradient descent

A

Iterative algorithm that progressively seeks a minimiser with gradient updates:

Given some initial condition x0
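
A sketch of the standard update from that initial condition, assuming a fixed step size (learning rate) η > 0:

x_{k+1} = x_k - \eta \nabla F(x_k), \qquad k = 0, 1, 2, \dots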

5
Q

Motivate SGD

A

It seems one should minimise the empirical risk.

However, computing the gradient of the empirical risk (image) can be computationally expensive for large N, since we are calculating the gradients for the entire training set at once.

Gradient descent applied to (image) may lead to an overfitted network.

Therefore we use SGD.
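
A sketch of the cost argument, assuming training data (x_i, y_i), i = 1, …, N, a network fθ and a per-sample loss ℓ (notation assumed):

\hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i),
\qquad
\nabla_\theta \hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(f_\theta(x_i), y_i),

so every full-gradient step requires a backward pass through all N samples.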

6
Q

Explain the difference between SGD and GD

A

In SGD we RANDOMLY split the training data (indexed by 1, …, N) into mini-batches of size m ≪ N,
where N = km
for k mini-batches.

Then we uniformly sample the mini-batches without replacement.

Starting from some initial condition θ0, the parameter vector is updated by (image)
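
A sketch of the mini-batch update, assuming mini-batches B_1, …, B_k of size m, a learning rate η > 0 and the per-sample loss ℓ from before:

\theta_{j+1} = \theta_j - \eta \, \frac{1}{m} \sum_{i \in B_j} \nabla_\theta \ell(f_{\theta_j}(x_i), y_i),

i.e. each step uses only the m samples of the current mini-batch instead of all N.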

7
Q
A

Mini-batch empirical risk
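
A sketch of the usual definition, assuming a mini-batch B of size m and the notation of the previous cards:

\hat{R}_B(\theta) = \frac{1}{m} \sum_{i \in B} \ell(f_\theta(x_i), y_i).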

8
Q

Mini-batch gradient?

A

(Image, top right.)

If the data consists of IID samples, we can consider the mini-batch gradient to be a NOISY but UNBIASED estimator of the full gradient (image, bottom left).
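
A sketch of the unbiasedness claim, assuming the mini-batch B is drawn uniformly from IID data:

\mathbb{E}\big[\nabla_\theta \hat{R}_B(\theta)\big] = \nabla_\theta \hat{R}(\theta),

so on average a mini-batch step points along the full gradient, with added sampling noise.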

9
Q

Define one EPOCH

A

In the context of SGD: one epoch is one pass through the entire training data set, i.e. one SGD update for each of the k mini-batches.

10
Q

How do epochs differ

A

You randomly reassign the mini-batches at the start of each epoch, changing what is in each batch,

but keeping the new batches disjoint.

11
Q

SGD pseudo code

A
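
A minimal Python sketch under the assumptions of the previous cards (names such as grad_loss, data and n_epochs are illustrative, not from the course):

import random

def sgd(theta0, data, grad_loss, lr=0.01, batch_size=32, n_epochs=10):
    # Plain mini-batch SGD: theta <- theta - lr * (mean gradient over the batch).
    theta = theta0
    n = len(data)
    indices = list(range(n))
    for epoch in range(n_epochs):
        random.shuffle(indices)  # reassign disjoint mini-batches each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in indices[start:start + batch_size]]
            # average the per-sample gradients over the mini-batch
            g = sum(grad_loss(theta, x, y) for x, y in batch) / len(batch)
            theta = theta - lr * g  # gradient step
    return theta

Here theta and the gradients are assumed to be numpy arrays (or scalars) so the arithmetic broadcasts.
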
12
Q

Usual practice for the initial condition of SGD

A

It can be specified manually, or
it can be derived from unsupervised pre-training.

The standard approach is to use a random initialiser.

13
Q

Glorot/Xavier initialiser

A

Weights are drawn from a zero-mean normal distribution with variance depending on the number of units in the next layer (≈ 1/#w).
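
A minimal Python sketch, assuming the ≈ 1/#units scaling stated on the card (exact Glorot/Xavier conventions vary, e.g. variance 2/(fan_in + fan_out) in the original paper):

import numpy as np

def glorot_normal(fan_in, fan_out, rng=None):
    # Zero-mean normal weights whose variance shrinks with the number of units.
    rng = rng or np.random.default_rng()
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))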

14
Q

When can we prove SGD converges to the minimiser

A

If the objective function is convex with a unique minimiser.

We can then prove convergence using the theory of Robbins-Monro stochastic approximation.

15
Q

Robbins-Monro stochastic approximation

A
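
A sketch of the classical setting, assuming decreasing step sizes η_j (regularity conditions omitted): iterate

\theta_{j+1} = \theta_j - \eta_j G_j,

where G_j is an unbiased estimate of the gradient, and require

\sum_j \eta_j = \infty \qquad \text{and} \qquad \sum_j \eta_j^2 < \infty

(e.g. η_j = 1/j); under convexity this guarantees convergence to the unique minimiser.
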
16
Q

Limitation of Robbins-Monro stochastic approximation

A

fθ is rarely, if ever, convex

We can’t really give guarantees of convergence to anything other than a local minimiser

17
Q

Global minimiser

A

A common school of thought is that we don't want the global minimiser of (image), if it exists, as it might overfit

18
Q

Justify SGD being suboptimal

A

Taking a step in the direction (image) in (S)GD need not be optimal.

For any fixed η > 0: if η is too high, SGD may overshoot or zig-zag;

if it is too low, convergence may be too slow.

Therefore alternatives have been proposed.

19
Q

Alternatives to SGD

A

Nesterov momentum method (Ilya Sutskever)
RMSProp (Geoffrey Hinton)
ADAM

20
Q

Explain ADAM

A

The parameter update direction is determined by an EXPONENTIALLY WEIGHTED MOVING AVERAGE (EWMA) of previous gradients,

while η is adaptively tuned by an EWMA of the squares of previous gradients.
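
A minimal Python sketch of one Adam-style step matching that description (beta1, beta2, eps and the bias correction follow the usual Adam defaults and are assumptions here, not from the card):

import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: the EWMA of gradients (m) sets the direction,
    # the EWMA of squared gradients (v) adapts the per-coordinate step size.
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

The state starts as (0.0, 0.0, 0), or as zero arrays shaped like theta.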