Chapter 16 Reduce Overfitting With Dropout Regularization Flashcards
What does the dropout technique do? P 112
Dropout is a technique where randomly selected neurons are ignored during training. They are dropped out randomly with a given probability (e.g. 20%) each weight update cycle. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.
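As an external illustration (not from the book's code), here is a minimal NumPy sketch of a forward pass with dropout, assuming a 20% rate and the inverted-dropout scaling that Keras uses, so nothing needs to change at evaluation time:

import numpy as np

def dropout_forward(activations, rate=0.2, training=True):
    # During training, zero out each activation with probability `rate`
    # and scale the survivors by 1/(1 - rate) ("inverted dropout"),
    # so no rescaling is needed when evaluating the model.
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)

# Example: one mini-batch of hidden-layer activations
h = np.random.rand(4, 5)
print(dropout_forward(h, rate=0.2, training=True))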
What are complex co-adaptations? P 112
Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptations.
External definition:
In a neural network, co-adaptation means that some neurons are highly dependent on others. If the neurons they depend on receive “bad” inputs, then the dependent neurons can be affected as well, and ultimately this can significantly alter the model's performance, which is what can happen with overfitting. (Me: imagine a neuron capturing one part of the data while neighboring neurons capture more details of that part, relying on the input given to the main neuron.)
How does dropping out neurons help reduce overfitting? P 112
You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons (refer to the complex co-adaptations definition).
This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.
Dropout happens at each weight update cycle. True/False? P 113
True. Dropout is easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g. 20%) each weight update cycle.
Dropout is only used during the training of a model and is not used when evaluating the skill of the model. True/False? P 113
True
What does the below code do? P 114
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm

def baseline_model():
    # create model with dropout on the input (visible) layer
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(30,)))
    model.add(Dense(30, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dense(1, activation="sigmoid"))
    return model
Dropout can be applied to input neurons, called the visible layer. In the example above we add a new Dropout layer between the input (or visible) layer and the first hidden layer. The dropout rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.
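A hypothetical usage sketch (not from the book): it assumes a feature matrix X with 30 columns and a binary label vector y, and the epoch and batch-size values are placeholders:

from keras.optimizers import SGD

model = baseline_model()
sgd = SGD(learning_rate=0.01, momentum=0.9, decay=0.0, nesterov=False)
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
# X: feature matrix with 30 columns, y: binary labels (placeholders, not from the book)
model.fit(X, y, epochs=300, batch_size=16, verbose=0)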
Why is kernel_constraint set to 3, the dropout rate to 0.2, the learning rate to 0.01, and momentum to 0.9? P 114
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm
from keras.optimizers import SGD

def baseline_model():
    # create model with dropout on the input (visible) layer
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(30,)))
    model.add(Dense(30, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dense(1, activation="sigmoid"))
    return model

sgd = SGD(learning_rate=0.01, momentum=0.9, decay=0.0, nesterov=False)
As recommended in the original paper on dropout, a constraint is imposed on the weights for each hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 3. The learning rate was lifted by one order of magnitude and the momentum was increased to 0.9. The increased learning rate and momentum were also recommended in the original dropout paper.
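As an external illustration (not from the book), a small NumPy sketch of what a max-norm constraint of 3 does: each neuron's incoming weight vector is rescaled only if its L2 norm exceeds 3, otherwise it is left unchanged.

import numpy as np

def apply_maxnorm(weights, max_value=3.0):
    # For a Dense kernel of shape (inputs, units), each column is the
    # incoming weight vector of one neuron; rescale columns whose L2 norm
    # exceeds max_value back down to max_value.
    norms = np.linalg.norm(weights, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_value / np.maximum(norms, 1e-7))
    return weights * scale

w = np.random.randn(30, 15) * 2.0          # hypothetical hidden-layer weights
print(np.linalg.norm(apply_maxnorm(w), axis=0).max())  # <= 3.0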
What’s momentum in SGD? External
A method that helps accelerate gradient vectors in the right direction, leading to faster convergence.
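As an external sketch (not from the book), the classic momentum update keeps a velocity term that accumulates past gradients; the default values here mirror the ones used in this chapter:

def sgd_momentum_step(w, grad, velocity, learning_rate=0.01, momentum=0.9):
    # velocity is an exponentially decaying sum of past gradients;
    # it keeps the update moving in directions that gradients agree on.
    velocity = momentum * velocity - learning_rate * grad
    w = w + velocity
    return w, velocity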
What does the below code do? P 115
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(30, input_dim=30, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation="sigmoid"))
    return model

sgd = SGD(learning_rate=0.01, momentum=0.9, decay=0.0, nesterov=False)
Dropout can be applied to hidden neurons in the body of your network model. In the example above, dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again, a dropout rate of 20% is used, as is a weight constraint on those layers.
Generally, use a small dropout value of … of neurons, with … providing a good starting point. A probability that is too low has … effect, and a value that is too high results in … by the network. P 117
20%-50%, 20%, minimal, under-learning
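A hypothetical sketch (not from the book) of how these starting points might be compared, reusing the model structure from the earlier cards; build_model, X_train, y_train, X_val and y_val are placeholders:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm

def build_model(rate):
    # same structure as above, with a tunable dropout rate
    model = Sequential()
    model.add(Dense(30, input_dim=30, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dropout(rate))
    model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
    model.add(Dropout(rate))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

for rate in [0.2, 0.3, 0.4, 0.5]:
    model = build_model(rate)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, verbose=0)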
Why are you likely to get better performance when dropout is used on a larger network? P 117
Because it gives the model more of an opportunity to learn independent representations.
Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of … to … and use a high momentum value of … or …. P 117
10 to 100, 0.9, 0.99
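A hedged sketch (not from the book) of what this tip looks like in Keras, assuming the default learning rate of 0.01 is raised by a factor of 10; the decay value is an arbitrary placeholder:

from keras.optimizers import SGD

# 10x the default learning rate, high momentum, small (placeholder) decay
sgd = SGD(learning_rate=0.1, momentum=0.9, decay=1e-6, nesterov=False)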
Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights, such as max-norm regularization with a size of … or …, has been shown to improve results. P 117
4, 5
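A minimal sketch (not from the book) of applying this tip to a single hidden layer:

from keras.layers import Dense
from keras.constraints import maxnorm

# a hidden layer whose incoming weight vectors are capped at an L2 norm of 4
layer = Dense(30, activation="relu", kernel_constraint=maxnorm(4))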