Early Stopping and Batch Normalization - Kaggle Flashcards
A model’s capacity refers to ____ and ____ of the patterns it is able to learn. For neural networks, this will largely be determined by ____ and ____.
the size, the complexity; how many neurons it has, how they are connected together
____ networks have an easier time learning more linear relationships, while ____ networks prefer more nonlinear ones. Which is better just depends on the dataset.
Wider, deeper
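As an illustration, here is a minimal sketch of the two shapes (the layer sizes are arbitrary, chosen only for contrast, and are not part of the flashcards):

from tensorflow import keras
from tensorflow.keras import layers

# Wider: more units in a single hidden layer
wide_model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=[2]),
    layers.Dense(1),
])

# Deeper: more hidden layers stacked up
deep_model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=[2]),
    layers.Dense(16, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1),
])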
The early stopping callback will run after every batch. True/False
False. The early stopping callback runs after every epoch.
What does the code below do?
from tensorflow import keras
from tensorflow.keras import layers, callbacks

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001,  # minimum amount of change to count as an improvement
    patience=20,  # how many epochs to wait before stopping
    restore_best_weights=True,
)

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[8]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])

model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    X, Y,
    validation_split=0.2,
    batch_size=10,
    epochs=500,  # a large number; early stopping will end training sooner
    callbacks=[early_stopping],  # put your callbacks in a list
    verbose=0,  # turn off training log
)
We defined an early stopping callback; its parameters say: “If there hasn’t been an improvement of at least 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found.”
It can sometimes be hard to tell if the validation loss is rising due to overfitting or just due to random batch variation. The min_delta and patience parameters let us set some allowance for when to stop.
After defining the callback, add it as an argument in ____ (you can have several, so put it in a list). Choose a large number of epochs when using early stopping, more than you’ll need.
fit
“Batch normalization” (or “batchnorm”) can help correct training that is ____ or ____.
slow, unstable
With neural networks, it’s generally a good idea to put all of your data on a common scale, perhaps with something like scikit-learn’s StandardScaler or MinMaxScaler. The reason is that SGD will shift the network weights in proportion to how large an activation the data produces. Features that tend to produce activations of very different sizes can make for unstable training behavior.
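A minimal sketch of putting features on a common scale, assuming scikit-learn is available and X is a NumPy feature matrix (the variable names here are illustrative):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Rescales each feature to zero mean and unit variance,
# so no feature produces disproportionately large activations.
X_scaled = scaler.fit_transform(X)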
Most often, batchnorm is added as an aid to the ____ (though it can sometimes also help ____). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get “stuck”. Consider adding batch normalization to your models, especially if you’re having trouble during training.
optimization process; prediction performance
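A minimal sketch of where batchnorm layers might go, following the same Sequential style as the earlier code (placing one after each Dense layer is a common choice, though not the only one):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[8]),
    layers.BatchNormalization(),  # normalizes the previous layer's outputs
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(1),
])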