Week 2 Flashcards
When using batch gradient descent, the loss function must decrease on every iteration; if it ever goes up, even on one iteration, something is wrong (maybe the learning rate is too big). True/False
|Understanding mini-batch gradient descent 00:30
True
With mini-batch gradient descent, if you plot the progress of the loss function, it may not decrease on every iteration. It should trend downward, but it's also a little noisy. True/False. Why?
|Understanding mini-batch gradient descent 00:43
True, because each iteration effectively uses a different training set (a different mini-batch), so the cost function will oscillate a bit.
If the mini-batch size=1, the algorithm is called ____
|Understanding mini-batch gradient descent 02:49
Stochastic gradient descent
Stochastic gradient descent converges to the minimum. True/False
|Understanding mini-batch gradient descent 04:47
False. Stochastic gradient descent won't converge; it'll oscillate around the region of the minimum.
Stochastic gradient descent is noisy, unlike batch gradient descent. True/False
|Understanding mini-batch gradient descent 04:15
True
Does mini-batch gradient descent always converge or oscillate in a small region? If not, what can be done about it?
|Understanding mini-batch gradient descent 08:04
No, we can use learning rate decay to handle this problem.
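As an illustration, here is a minimal sketch of one common decay schedule in Python (the schedule and the hyperparameter values alpha0 and decay_rate are assumptions for the example; other schedules exist):

```python
# Hypothetical example of learning rate decay: shrink alpha each epoch so that
# mini-batch gradient descent oscillates in a smaller and smaller region.
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch in range(5):
    print(epoch, decayed_learning_rate(alpha0=0.2, decay_rate=1.0, epoch_num=epoch))
```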
What batch size do we use for smaller training sets (less than 2000)?
A) Batch gradient descent
B) Mini-batch gradient descent
C) Stochastic gradient descent
|Understanding mini-batch gradient descent 08:30
A; there's no point in using mini-batch gradient descent because you can process the whole training set quickly.
What are typical mini-batch sizes?
|Understanding mini-batch gradient descent 09:03
Anything from 64 to 512, typically a power of 2 (64, 128, 256, or 512)
Why does code sometimes run faster if the mini-batch size is a power of 2?
|Understanding mini-batch gradient descent 09:10
Because of the way computer memory is laid out and accessed
What's the Exponentially Weighted Moving Average formula?
|Exponentially Weighted Average 00:00
V_0 = 0
V_t = Beta × V_(t-1) + (1 - Beta) × Theta_t
V_t ≈ the average over the last 1/(1-Beta) values of Theta
V: the Exponentially Weighted Average of Theta
For example, if the variable Theta is the temperature on a certain day and Beta = 0.9, then V_t is approximately the average of the last 10 days' temperatures.
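A minimal sketch of the update rule above in Python (the temperature list and variable names are illustrative assumptions):

```python
def ewma(thetas, beta=0.9):
    """Exponentially weighted moving average: V_t = beta * V_(t-1) + (1 - beta) * Theta_t."""
    v = 0.0  # V_0 = 0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

# With beta = 0.9, each V_t is roughly the average of the last 1/(1-0.9) = 10 readings.
daily_temps = [40, 49, 45, 44, 38, 43, 46, 50, 48, 47, 45]
print(ewma(daily_temps))
```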
What is the effect of higher Betas in the Exponentially Weighted Moving Average?
|Exponentially Weighted Average 03:36
Higher Beta:
- a smoother average plot (because it averages over more days)
- slower adaptation when the variable Theta changes, because the formula gives more weight to the previous values
Note: a lower Beta has the inverse effect (the plot is less smooth, but it adapts more quickly).
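A small demo of the adaptation-speed effect (the step-change input is an assumed toy example): feed ten readings of 1.0 into the average, starting from V_0 = 0, and see how far each Beta has moved.

```python
# Toy example: Theta jumps from 0 to 1; a higher beta adapts more slowly.
def ewma_last(thetas, beta):
    v = 0.0
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
    return v

readings_after_jump = [1.0] * 10
for beta in (0.5, 0.9, 0.98):
    print(f"beta={beta}: V after 10 steps = {ewma_last(readings_after_jump, beta):.3f}")
# beta=0.5  -> ~0.999 (adapts quickly, noisy in general)
# beta=0.9  -> ~0.651
# beta=0.98 -> ~0.183 (adapts slowly, smooth in general)
```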
The Exponentially Weighted Average is a key component of several optimization algorithms for training a NN. True/False
|Understanding Exponentially Weighted Average 00:00
True
In the Exponentially Weighted Average temperature example, why does 1/(1-Beta) give us the number of days used for computing the weighted average?
|Understanding Exponentially Weighted Average 04:35
Because the formula weights each past day's Theta by an exponentially decaying factor, we can work out how far back the decay stays reasonably mild. The weight on a day that is 1/(1-Beta) days old has decayed to roughly 1/e (about 0.37) of the weight on the most recent day, since Beta^(1/(1-Beta)) ≈ 1/e. For example, if Beta = 0.9 then 1/(1-Beta) = 10, so for a given day most of the average is built from the previous 10 days; the contribution of days older than that threshold has decayed so much that it is effectively negligible.
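A quick numeric check of this rule of thumb (the specific Beta values are just examples): the weight on a reading that is 1/(1-Beta) steps old has decayed by a factor of roughly 1/e.

```python
import math

# beta ** (1 / (1 - beta)) is approximately 1/e, so readings older than about
# 1/(1-beta) steps contribute little to the average.
for beta in (0.9, 0.98):
    window = round(1 / (1 - beta))
    print(f"beta={beta}: window={window}, beta**window={beta ** window:.3f}, 1/e={1 / math.e:.3f}")
# beta=0.9  -> 0.9**10 ≈ 0.349
# beta=0.98 -> 0.98**50 ≈ 0.364, both close to 1/e ≈ 0.368
```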
How does bias correction help with Exponentially Weighted Average calculations?
|Bias Correction in Exponentially Weighted Average (whole video)
To calculate V_1 we use V_0, which is set to 0. This biases the start of the average: the first values of the Exponentially Weighted Average come out much too small. With bias correction we use V_t / (1 - Beta^t) instead of V_t as the Exponentially Weighted Average. As t grows, Beta^t shrinks toward 0, so once t is big enough the correction has essentially no effect, which is exactly what we want.
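A minimal sketch of the corrected estimate V_t / (1 - Beta^t) in Python (variable names and the sample readings are assumptions for the example):

```python
def ewma_bias_corrected(thetas, beta=0.9):
    """Exponentially weighted average with bias correction applied to each estimate."""
    v = 0.0  # V_0 = 0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))  # bias correction: V_t / (1 - beta^t)
    return corrected

temps = [40, 49, 45, 44, 38]
print(ewma_bias_corrected(temps))  # early estimates are no longer biased toward 0
```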
Do we use Bias Correction in machine learning implementations of the Exponentially Weighted Average?
|Bias Correction in Exponentially Weighted Average 03:36
No, because the initial bias is not that important. [After 10 iterations, the bias fades away. |Gradient Descent with Momentum 07:00]