Chapter 17 Lift Performance With Learning Rate Schedules Flashcards
What benefits does adapting the learning rate have? P 118
Adapting the learning rate for your stochastic gradient descent optimization procedure can increase performance and reduce training time.
Two popular and easy to use learning rate schedules are ____ P 118
- Decrease the learning rate gradually based on the epoch. (Time-based learning rate schedule)
- Decrease the learning rate using punctuated large drops at specific epochs. (Drop-Based Learning Rate Schedule)
Keras has a time-based learning rate schedule built in. The stochastic gradient descent optimization algorithm implementation in the SGD class has an argument called ____. This argument is used in the time-based learning rate decay schedule equation, which is: ____ P 119
decay; LearningRate = InitialLearningRate * 1 / (1 + decay * epoch)
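A minimal sketch of the equation in code (the learning_rate and decay values are illustrative assumptions, not taken from the text):

# time-based decay: the learning rate shrinks as the epoch number grows
learning_rate = 0.1   # assumed initial learning rate
decay = 0.01          # assumed value of the SGD decay argument
for epoch in range(3):
    decayed = learning_rate * 1 / (1 + decay * epoch)
    print(epoch, round(decayed, 5))   # 0.1, 0.09901, 0.09804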
What is the “nesterov” parameter in the SGD class in Keras? External
nesterov: boolean. Whether to apply Nesterov momentum. Defaults to False.
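A minimal sketch of enabling it (the optimizer values are illustrative assumptions):

from keras.optimizers import SGD

# Nesterov momentum is off by default; pass nesterov=True to enable it
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)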
In Keras, metric values are displayed during ____ and logged to the ____ object returned by it. They are also returned by ____. External
fit() , History, model.evaluate()
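A minimal sketch of where these values appear, assuming a compiled model and data X, Y as in the code examples further down:

hist = model.fit(X, Y, validation_split=0.33, epochs=10, verbose=1)  # metrics printed during fit()
print(hist.history["loss"])     # per-epoch values logged in the returned History object
scores = model.evaluate(X, Y)   # metric values are also returned by evaluate()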
It can be a good idea to use momentum when using an adaptive learning rate. True/False P 120
True
For example, we may have an initial learning rate of 0.1 and drop it by a factor of 0.5 every 10 epochs. The first 10 epochs of training would use a value of 0.1; in the next 10 epochs a learning rate of 0.05 would be used, and so on.
A popular learning rate schedule used with deep learning models is to systematically drop the learning rate at specific times during training. Often this method is implemented by dropping the learning rate by ____ every fixed number of epochs. P 122
half
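A minimal sketch of that schedule, using the illustrative values from the example above (0.1 initial rate, halved every 10 epochs):

# drop the learning rate by half every 10 epochs
initial_lrate, drop, epochs_drop = 0.1, 0.5, 10
for epoch in (0, 9, 10, 19, 20):
    lrate = initial_lrate * drop ** (epoch // epochs_drop)
    print(epoch, lrate)   # epochs 0-9 use 0.1, 10-19 use 0.05, 20-29 use 0.025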
We can implement Drop-Based Learning Rate Schedule in Keras using the ____ callback, when fitting the model. P 122
LearningRateScheduler
Time-based learning rate decay is configured when setting up ____, via the ____ parameter. Drop-based learning rate decay is configured when ____, via the ____ parameter. External
SGD, “decay”, fitting the model, “callbacks”
The LearningRateScheduler callback allows us to define a function to call that takes the ____ as an argument and returns the ____ to use in stochastic gradient descent. When used, the learning rate specified by stochastic gradient descent is ____. P 122
epoch number, learning rate, ignored
What is the formula used for drop-based (step) learning rate decay, and what is the decay function conventionally named? P 122
LearningRate = InitialLearningRate * DropRate ^ floor((1 + Epoch) / EpochDrop)
step_decay()
where InitialLearningRate is the learning rate at the beginning of the run, EpochDrop is how often (in epochs) the learning rate is dropped, and DropRate is how much to drop the learning rate each time it is dropped.
What does the below code do? P 123
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

# learning rate schedule: drop by half every 10 epochs
def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate

# create model
model = Sequential()
model.add(Dense(34, input_dim=34, kernel_initializer="normal", activation="relu"))
model.add(Dense(1, kernel_initializer="normal", activation="sigmoid"))

# Compile model
epochs = 50
learning_rate = 0.1
momentum = 0.9
sgd = SGD(learning_rate=learning_rate, momentum=momentum, decay=0, nesterov=False)
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])

# learning schedule callback
lrate = LearningRateScheduler(step_decay)
callbacks_list = [lrate]

# Fit the model
hist = model.fit(X, Y, validation_split=0.33, callbacks=callbacks_list,
                 epochs=epochs, batch_size=28, verbose=1)
It uses drop-based learning rate decay while training the model.
Note that the learning rate set in SGD is ignored; the LearningRateScheduler callback overrides it each epoch.
What does the below code do? P 122
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# create model
model = Sequential()
model.add(Dense(34, input_dim=34, kernel_initializer="normal", activation="relu"))
model.add(Dense(1, kernel_initializer="normal", activation="sigmoid"))

# Compile model with time-based learning rate decay
epochs = 50
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.8
sgd = SGD(learning_rate=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])

# Fit the model
hist = model.fit(X, Y, validation_split=0.33, epochs=epochs, batch_size=28, verbose=1)
It uses time-based learning rate decay during training, via the decay argument of the SGD optimizer.
Why is it a good idea to increase the initial learning rate when using learning rate schedules? P 124
Because the learning rate will decrease over the run, start with a larger value to decrease from. A larger learning rate makes much larger changes to the weights, at least at the beginning, and you still benefit from the fine-tuning of smaller rates later.
Why is it a good idea to use large momentum when using learning rate schedules? P 124
Using a larger momentum value helps the optimization algorithm continue to make updates in the right direction as the learning rate shrinks to small values.
0.9 is the usual momentum; I used 0.99 in the example and it adversely affected performance.