Chapter 17 Lift Performance With Learning Rate Schedules Flashcards

1
Q

What benefits does adapting the learning rate have? P 118

A

Adapting the learning rate for your stochastic gradient descent optimization procedure can increase performance and reduce training time.

2
Q

Two popular and easy-to-use learning rate schedules are ____ P 118

A
  1. Decrease the learning rate gradually based on the epoch (time-based learning rate schedule).
  2. Decrease the learning rate using punctuated large drops at specific epochs (drop-based learning rate schedule).
3
Q

Keras has a time-based learning rate schedule built in. The stochastic gradient descent optimization algorithm implementation in the SGD class has an argument called____. This argument is used in the time-based learning rate decay schedule equation which is:____
P 119

A

decay
LearningRate = LearningRate * 1 / (1 + decay * epoch)
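A minimal sketch (plain Python, made-up values) of how this equation evolves the learning rate over a few epochs; note that Keras actually applies the decay per parameter update using its internal iteration counter, so a per-epoch loop is only an approximation:

# Illustrative only: iterate the card's equation with made-up starting values.
initial_learning_rate = 0.1
decay = 0.01

learning_rate = initial_learning_rate
for epoch in range(1, 6):
    learning_rate = learning_rate * 1.0 / (1.0 + decay * epoch)
    print(f"epoch {epoch}: learning rate = {learning_rate:.5f}")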

4
Q

What is the “nesterov” parameter of the SGD class in Keras? External

A

nesterov: boolean. Whether to apply Nesterov momentum. Defaults to False.
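A small usage sketch; the argument values below are illustrative, not from the book:

from tensorflow.keras.optimizers import SGD

# Classical momentum plus Nesterov momentum (nesterov defaults to False).
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)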

5
Q

In Keras, metric values are displayed during ____ and logged to the ____ object returned by it. They are also returned by ____. External

A

fit(), History, model.evaluate()
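A small sketch of where each of those shows up; here model, X, and Y stand for an already-compiled model and its data, and the "accuracy" key assumes accuracy was among the compiled metrics:

# Metric values are printed per epoch during fit() and collected in the
# returned History object.
history = model.fit(X, Y, validation_split=0.33, epochs=10, verbose=1)
print(history.history["accuracy"])   # one entry per epoch ("acc" in older Keras versions)

# evaluate() returns the loss followed by the compiled metric values.
loss, accuracy = model.evaluate(X, Y, verbose=0)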

6
Q

It can be a good idea to use momentum when using an adaptive learning rate. True/False P 120

A

True

For example, we may have an initial learning rate of 0.1 and drop it by a factor of 0.5 every 10 epochs. The first 10 epochs of training would use a value of 0.1; in the next 10 epochs a learning rate of 0.05 would be used, and so on.
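A quick check of that example using the drop-based formula from later in the chapter (plain Python, illustrative values):

import math

initial_learning_rate = 0.1
drop = 0.5
epochs_drop = 10.0

for epoch in (0, 5, 10, 15, 20):
    lrate = initial_learning_rate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    print(f"epoch {epoch}: {lrate}")
# epochs 0 and 5 -> 0.1, epochs 10 and 15 -> 0.05, epoch 20 -> 0.025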

7
Q

A popular learning rate schedule used with deep learning models is to systematically drop the learning rate at specific times during training. Often this method is implemented by dropping the learning rate by ____ every fixed number of epochs. P 122

A

half

8
Q

We can implement a drop-based learning rate schedule in Keras using the ____ callback when fitting the model. P 122

A

LearningRateScheduler
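A minimal wiring sketch; my_schedule is a hypothetical function name, and model, X, and Y stand for the usual compiled model and data:

from tensorflow.keras.callbacks import LearningRateScheduler

def my_schedule(epoch):
    # Hypothetical rule: halve an initial rate of 0.1 every 10 epochs.
    return 0.1 * (0.5 ** (epoch // 10))

model.fit(X, Y, epochs=50, callbacks=[LearningRateScheduler(my_schedule)])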

9
Q

The time-based learning rate decay function is used when setting up ____, as the ____ parameter. The drop-based learning rate decay function is used when ____, as the ____ parameter. External

A

SGD, “decay”, fitting the model, “callbacks”

10
Q

The LearningRateScheduler callback allows us to define a function to call that takes the ____ as an argument and returns the ____ to use in stochastic gradient descent. When used, the learning rate specified by stochastic gradient descent is ____. P 122

A

epoch number, learning rate, ignored

11
Q

What is the formula used for step-based (drop-based) learning rate decay? What is the function conventionally named? P 122

A

LearningRate = InitialLearningRate * DropRate ^ floor((1 + Epoch) / EpochDrop)
step_decay()
where InitialLearningRate is the learning rate at the beginning of the run, EpochDrop is how often (in epochs) the learning rate is dropped, and DropRate is how much the learning rate is reduced each time it is dropped.

12
Q

What does the below code do? P 123

# learning rate schedule
def step_decay(epoch):
  initial_lrate = 0.1
  drop = 0.5
  epochs_drop = 10.0
  lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
  return lrate
# create model
model = Sequential()
model.add(Dense(34, input_dim=34, kernel_initializer="normal", activation="relu"))
model.add(Dense(1, kernel_initializer="normal", activation="sigmoid"))
# Compile model
epochs = 50
learning_rate = 0.1
momentum = 0.9
sgd = SGD(learning_rate=learning_rate, momentum=momentum, decay=0, nesterov=False)
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
# learning schedule callback
lrate = LearningRateScheduler(step_decay)
callbacks_list = [lrate]
# Fit the model
hist = model.fit(X, Y, validation_split=0.33, callbacks=callbacks_list, epochs=epochs, batch_size=28, verbose=1)
A

It uses drop-based learning rate decay (the step_decay function, applied through the LearningRateScheduler callback) while training the model.

Note that the learning rate set in the SGD optimizer is ignored; the value returned by step_decay is used instead.

13
Q

What does the below code do? P 122

# create model
model = Sequential()
model.add(Dense(34, input_dim=34, kernel_initializer="normal", activation="relu"))
model.add(Dense(1, kernel_initializer="normal", activation="sigmoid"))
# Compile model
epochs = 50
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
sgd = SGD(learning_rate=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
# Fit the model
hist = model.fit(X, Y, validation_split=0.33, epochs=epochs, batch_size=28, verbose=1)
A

It uses time-based learning rate decay during training, via the decay argument of SGD (here decay_rate = learning_rate / epochs).

14
Q

Why is it a good idea to increase the initial learning rate when using learning rate schedules? P 124

A

Because the learning rate will decrease, start with a larger value to decrease from. A larger learning rate will result in much larger changes to the weights, at least at the beginning, allowing you to benefit from fine-tuning later.

15
Q

Why is it a good idea to use a large momentum when using learning rate schedules? P 124

A

Using a larger momentum value will help the optimization algorithm continue to make updates in the right direction when your learning rate shrinks to small values.

0.9 is the usual momentum; I used 0.99 in the example and it adversely affected performance.

16
Q

Why is it a good idea to experiment with different schedules when using learning rate schedules? P 124

A

It will not be clear which learning rate schedule to use, so try a few with different configuration options and see what works best on your problem. Also try schedules that change exponentially, and even schedules that respond to the accuracy of your model on the training or test datasets.

17
Q

The “decay” parameter of the SGD optimizer in Keras is now deprecated and replaced by the learning rate schedule APIs. What is the difference between the old and the new versions? External

A

In the book, we used the “decay” argument of SGD for time-based learning rate decay, with the “learning_rate” argument supplying the initial rate that is then decayed.

Now, a learning rate schedule object from the Keras learning rate schedule APIs can be passed directly as the “learning_rate” argument, and the “decay” argument is deprecated (it may still work without raising an error, but is not recommended).
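A sketch of the newer style, assuming TensorFlow 2.x and its built-in InverseTimeDecay schedule class; the numbers are illustrative:

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers.schedules import InverseTimeDecay

# Time-based decay expressed as a schedule object passed to learning_rate,
# instead of the old decay= argument.
lr_schedule = InverseTimeDecay(
    initial_learning_rate=0.1,
    decay_steps=1,       # apply the decay term every update step
    decay_rate=0.002,    # illustrative value
)
sgd = SGD(learning_rate=lr_schedule, momentum=0.9)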

Sources: Keras Learning Rate Schedule APIs; GitHub doc