SGD Flashcards

1
Q

Def gradient flow of F

A

The gradient flow tends to the minimiser as t -> inf
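
A minimal sketch of the usual formulation, assuming F is differentiable and writing x(t) for the flow started at x0 (this notation is assumed, not from the card):

\dot{x}(t) = -\nabla F(x(t)), \qquad x(0) = x_0,

so F(x(t)) is non-increasing along the flow.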

2
Q

Discretise gradient flow

A
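
A sketch of the standard discretisation, assuming a step size η > 0 and the forward-Euler scheme of the next card:

x_{k+1} = x_k - \eta \nabla F(x_k),

i.e. discretising the gradient flow in time recovers (ordinary) gradient descent.
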
3
Q

Euler approximation

A
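
A sketch of the (explicit) Euler approximation for a general ODE \dot{x}(t) = f(x(t)), assuming a step size η > 0:

x_{k+1} = x_k + \eta f(x_k), \qquad x_k \approx x(k\eta);

taking f = -\nabla F gives the discretised gradient flow of the previous card.
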
4
Q

(Ordinary) gradient descent

A

Iterative algorithm that progressively seeks a minimiser with gradient updates:

Given some initial condition x0
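
A sketch of the standard update from that initial condition, assuming a fixed step size (learning rate) η > 0:

x_{k+1} = x_k - \eta \nabla F(x_k), \qquad k = 0, 1, 2, \dots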

5
Q

Motivate SGD

A

It seems one should minimise the empirical risk.

However, computing the gradient of the empirical risk (image) can be computationally expensive for large N, since we are calculating the gradients for the entire training set at once.

Gradient descent applied to (image) may lead to an overfitted network.

Therefore we use SGD.
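
A sketch of the cost argument, assuming training data (x_i, y_i), i = 1, …, N, a network fθ and a per-sample loss ℓ (notation assumed):

\hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i),
\qquad
\nabla_\theta \hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(f_\theta(x_i), y_i),

so every full-gradient step requires a backward pass through all N samples.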

6
Q

Explain the difference between SGD and GD

A

In SGD we RANDOMLY split the training data (indexed by 1, …, N) into mini-batches of size m ≪ N,
where N = km
for k mini-batches.

Then we uniformly sample the mini-batches without replacement.

Starting from some initial condition θ0, the parameter vector is updated by (image)
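
A sketch of the mini-batch update, assuming mini-batches B_1, …, B_k of size m, a learning rate η > 0 and the per-sample loss ℓ from before:

\theta_{j+1} = \theta_j - \eta \, \frac{1}{m} \sum_{i \in B_j} \nabla_\theta \ell(f_{\theta_j}(x_i), y_i),

i.e. each step uses only the m samples of the current mini-batch instead of all N.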

7
Q
A

Mini-batch empirical risk
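
A sketch of the usual definition, assuming a mini-batch B of size m and the notation of the previous cards:

\hat{R}_B(\theta) = \frac{1}{m} \sum_{i \in B} \ell(f_\theta(x_i), y_i).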

8
Q

Mini-batch gradient?

A

(Image, top right.)

If the data consists of IID samples, we can consider the mini-batch gradient to be a NOISY but UNBIASED estimator of the full gradient (image, bottom left).
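
A sketch of the unbiasedness claim, assuming the mini-batch B is drawn uniformly from IID data:

\mathbb{E}\big[\nabla_\theta \hat{R}_B(\theta)\big] = \nabla_\theta \hat{R}(\theta),

so on average a mini-batch step points along the full gradient, with added sampling noise.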

9
Q

Define one EPOCH

A

In the context of SGD: one epoch is one pass through the entire training data set, i.e. one SGD update for each of the k mini-batches.

10
Q

How do epochs differ

A

You randomly reassign the mini-batches at the start of each epoch, changing what is in each batch,

but keeping the new batches disjoint.

11
Q

SGD pseudo code

A
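
A minimal Python sketch under the assumptions of the previous cards (names such as grad_loss, data and n_epochs are illustrative, not from the course):

import random

def sgd(theta0, data, grad_loss, lr=0.01, batch_size=32, n_epochs=10):
    # Plain mini-batch SGD: theta <- theta - lr * (mean gradient over the batch).
    theta = theta0
    n = len(data)
    indices = list(range(n))
    for epoch in range(n_epochs):
        random.shuffle(indices)  # reassign disjoint mini-batches each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in indices[start:start + batch_size]]
            # average the per-sample gradients over the mini-batch
            g = sum(grad_loss(theta, x, y) for x, y in batch) / len(batch)
            theta = theta - lr * g  # gradient step
    return theta

Here theta and the gradients are assumed to be numpy arrays (or scalars) so the arithmetic broadcasts.
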
12
Q

Usual practice for the initial condition of SGD

A

It can be specified manually, or
it can be derived from unsupervised pre-training.

The standard approach is to use a random initialiser.

13
Q

Glorot/Xavier initialiser

A

Weights are drawn from a zero-mean normal distribution with variance depending on the number of units in the next layer (≈ 1/#w).
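
A minimal Python sketch, assuming the ≈ 1/#units scaling stated on the card (exact Glorot/Xavier conventions vary, e.g. variance 2/(fan_in + fan_out) in the original paper):

import numpy as np

def glorot_normal(fan_in, fan_out, rng=None):
    # Zero-mean normal weights whose variance shrinks with the number of units.
    rng = rng or np.random.default_rng()
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))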

14
Q

When can we prove SGD converges to the minimiser

A

If the objective function is convex with a unique minimiser.

We can then prove convergence using the theory of Robbins-Monro stochastic approximation.

15
Q

Robbins-Monro stochastic approximation

A
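
A sketch of the classical setting, assuming decreasing step sizes η_j (regularity conditions omitted): iterate

\theta_{j+1} = \theta_j - \eta_j G_j,

where G_j is an unbiased estimate of the gradient, and require

\sum_j \eta_j = \infty \qquad \text{and} \qquad \sum_j \eta_j^2 < \infty

(e.g. η_j = 1/j); under convexity this guarantees convergence to the unique minimiser.
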
16
Q

Limitation of Robbins-Monro stochastic approximation

A

fθ is rarely, if ever, convex

We can’t really give guarantees of convergence to anything other than a local minimiser

17
Q

Global minimiser

A

A common school of thought is that we don't want the global minimiser of (image), if it exists, as it might overfit

18
Q

Justify SGD being suboptimal

A

Taking a step in the direction (image) in (S)GD need not be optimal.

For any fixed η > 0: if η is too high, SGD may overshoot or zig-zag;

if it is too low, convergence may be too slow.

Therefore alternatives have been proposed.

19
Q

Alternatives to SGD

A

Nesterov momentum method (Ilya Sutskever)
RMSProp (Geoffrey Hinton)
ADAM

20
Q

Explain ADAM

A

The parameter update direction is determined by an EXPONENTIALLY WEIGHTED MOVING AVERAGE (EWMA) of previous gradients,

while η is adaptively tuned by an EWMA of the squares of previous gradients.
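
A minimal Python sketch of one Adam-style step matching that description (beta1, beta2, eps and the bias correction follow the usual Adam defaults and are assumptions here, not from the card):

import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: the EWMA of gradients (m) sets the direction,
    # the EWMA of squared gradients (v) adapts the per-coordinate step size.
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

The state starts as (0.0, 0.0, 0), or as zero arrays shaped like theta.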