Week 2 Flashcards
When using batch gradient descent, the loss function must decrease on every iteration; if it ever goes up, even on one iteration, something is wrong (maybe the learning rate is too big). True/False
|Understanding mini-batch gradient descent 00:30
True
With mini-batch gradient descent, if you plot the progress of the loss function, it may not decrease on every iteration. It should trend downward, but it's also a little noisy. True/False. Why?
|Understanding mini-batch gradient descent 00:43
True, because each iteration effectively uses a different training set (a different mini-batch), so the cost function will oscillate a bit.
If the mini-batch size=1, the algorithm is called ____
|Understanding mini-batch gradient descent 02:49
Stochastic gradient descent
Stochastic gradient descent converges to the minimum. True/False
|Understanding mini-batch gradient descent 04:47
False. Stochastic gradient descent won't converge; it'll oscillate around the region of the minimum.
Stochastic gradient descent is noisy, unlike batch gradient descent. True/False
|Understanding mini-batch gradient descent 04:15
True
Does mini-batch gradient descent always converge or oscillate in a small region? If not, what can be done about it?
|Understanding mini-batch gradient descent 08:04
No, we can use learning rate decay to handle this problem.
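As an illustration, here is a minimal sketch of one common decay schedule in Python (the schedule and the hyperparameter values alpha0 and decay_rate are assumptions for the example; other schedules exist):

```python
# Hypothetical example of learning rate decay: shrink alpha each epoch so that
# mini-batch gradient descent oscillates in a smaller and smaller region.
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch in range(5):
    print(epoch, decayed_learning_rate(alpha0=0.2, decay_rate=1.0, epoch_num=epoch))
```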
What batch size do we use for smaller training sets (less than 2000)?
A) Batch gradient descent
B) Mini-batch gradient descent
C) Stochastic gradient descent
|Understanding mini-batch gradient descent 08:30
A; there's no point in using mini-batch gradient descent because you can process the whole training set quickly.
What are typical mini-batch sizes?
|Understanding mini-batch gradient descent 09:03
Anything from 64 to 512, typically a power of 2 (64, 128, 256, or 512)
Why does code sometimes run faster if the mini-batch size is a power of 2?
|Understanding mini-batch gradient descent 09:10
Because of the way computer memory is laid out and accessed
What's the Exponentially Weighted Moving Average formula?
|Exponentially Weighted Average 00:00
V_0 = 0
V_t = Beta × V_(t-1) + (1 - Beta) × Theta_t
V_t ≈ the average over the last 1/(1-Beta) values of Theta
V: the Exponentially Weighted Average of Theta
For example, if the variable Theta is the temperature on a certain day and Beta = 0.9, then V_t is approximately the average of the last 10 days' temperatures.
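A minimal sketch of the update rule above in Python (the temperature list and variable names are illustrative assumptions):

```python
def ewma(thetas, beta=0.9):
    """Exponentially weighted moving average: V_t = beta * V_(t-1) + (1 - beta) * Theta_t."""
    v = 0.0  # V_0 = 0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

# With beta = 0.9, each V_t is roughly the average of the last 1/(1-0.9) = 10 readings.
daily_temps = [40, 49, 45, 44, 38, 43, 46, 50, 48, 47, 45]
print(ewma(daily_temps))
```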
What is the effect of higher Betas in the Exponentially Weighted Moving Average?
|Exponentially Weighted Average 03:36
Higher Beta:
- a smoother average plot (because it averages over more days)
- slower adaptation when the variable Theta changes, because the formula gives more weight to the previous values
Note: a lower Beta has the inverse effect (the plot is less smooth, but it adapts more quickly).
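A small demo of the adaptation-speed effect (the step-change input is an assumed toy example): feed ten readings of 1.0 into the average, starting from V_0 = 0, and see how far each Beta has moved.

```python
# Toy example: Theta jumps from 0 to 1; a higher beta adapts more slowly.
def ewma_last(thetas, beta):
    v = 0.0
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
    return v

readings_after_jump = [1.0] * 10
for beta in (0.5, 0.9, 0.98):
    print(f"beta={beta}: V after 10 steps = {ewma_last(readings_after_jump, beta):.3f}")
# beta=0.5  -> ~0.999 (adapts quickly, noisy in general)
# beta=0.9  -> ~0.651
# beta=0.98 -> ~0.183 (adapts slowly, smooth in general)
```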
The Exponentially Weighted Average is a key component of several optimization algorithms for training a NN. True/False
|Understanding Exponentially Weighted Average 00:00
True
In the Exponentially Weighted Average temperature example, why does 1/(1-Beta) give us the number of days used for computing the weighted average?
|Understanding Exponentially Weighted Average 04:35
Because the formula weights each past day's Theta by an exponentially decaying factor, we can work out how far back the decay stays reasonably mild. The weight on a day that is 1/(1-Beta) days old has decayed to roughly 1/e (about 0.37) of the weight on the most recent day, since Beta^(1/(1-Beta)) ≈ 1/e. For example, if Beta = 0.9 then 1/(1-Beta) = 10, so for a given day most of the average is built from the previous 10 days; the contribution of days older than that threshold has decayed so much that it is effectively negligible.
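A quick numeric check of this rule of thumb (the specific Beta values are just examples): the weight on a reading that is 1/(1-Beta) steps old has decayed by a factor of roughly 1/e.

```python
import math

# beta ** (1 / (1 - beta)) is approximately 1/e, so readings older than about
# 1/(1-beta) steps contribute little to the average.
for beta in (0.9, 0.98):
    window = round(1 / (1 - beta))
    print(f"beta={beta}: window={window}, beta**window={beta ** window:.3f}, 1/e={1 / math.e:.3f}")
# beta=0.9  -> 0.9**10 ≈ 0.349
# beta=0.98 -> 0.98**50 ≈ 0.364, both close to 1/e ≈ 0.368
```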
How does bias correction help with Exponentially Weighted Average calculations?
|Bias Correction in Exponentially Weighted Average (whole video)
To calculate V_1 we use V_0, which is set to 0. This biases the start of the average: the first values of the Exponentially Weighted Average come out much too small. With bias correction we use V_t / (1 - Beta^t) instead of V_t as the Exponentially Weighted Average. As t grows, Beta^t shrinks toward 0, so once t is big enough the correction has essentially no effect, which is exactly what we want.
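A minimal sketch of the corrected estimate V_t / (1 - Beta^t) in Python (variable names and the sample readings are assumptions for the example):

```python
def ewma_bias_corrected(thetas, beta=0.9):
    """Exponentially weighted average with bias correction applied to each estimate."""
    v = 0.0  # V_0 = 0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))  # bias correction: V_t / (1 - beta^t)
    return corrected

temps = [40, 49, 45, 44, 38]
print(ewma_bias_corrected(temps))  # early estimates are no longer biased toward 0
```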
Do we use Bias Correction in machine learning implementations of the Exponentially Weighted Average?
|Bias Correction in Exponentially Weighted Average 03:36
No, because the initial bias is not that important. [After 10 iterations, the bias fades away. |Gradient Descent with Momentum 07:00]