Chapter 16 Reduce Overfitting With Dropout Regularization Flashcards

1
Q

What does the dropout technique do? P 112

A

Dropout is a technique where randomly selected neurons are ignored during training: they are dropped out at random with a given probability (e.g. 20%) on each weight update cycle. This means their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass.
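A minimal sketch of the mechanics (illustrative only, not from the book; plain NumPy): each neuron's output is zeroed with the given probability, and the survivors are rescaled by 1/(1 - rate), the "inverted dropout" convention most libraries use so the expected activation stays the same.

import numpy as np

rng = np.random.default_rng(seed=1)
rate = 0.2                                   # dropout probability
activations = rng.random(10)                 # pretend outputs of one layer
mask = rng.random(10) >= rate                # keep each neuron with probability 0.8
dropped = activations * mask / (1.0 - rate)  # zero the dropped neurons, rescale the rest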

2
Q

What are complex co-adaptations? P 112

A

Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.

External definition:

In a neural network, co-adaptation means that some neurons are highly dependent on others. If the neurons they depend on receive “bad” inputs, the dependent neurons can be affected as well, and this can significantly alter the model's performance, which is what can happen with overfitting. (Me: imagine a neuron capturing a part of the data while neighboring neurons capture more details of that part and rely on the input given to the main neuron.)

3
Q

How does dropping out neurons help reduce overfitting? P 112

A

You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons (refer to the complex co-adaptations definition).
This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of individual neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.

4
Q

Dropout happens at each weight update cycle. True/False? P 113

A

True. Dropout is easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g. 20%) on each weight update cycle.

5
Q

Dropout is only used during the training of a model and is not used when evaluating the skill of the model. True/False? P 113

A

True
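In Keras the Dropout layer is only active when the training flag is set (e.g. inside model.fit()); during evaluation and prediction it passes values through unchanged. A quick sketch to see this, assuming TensorFlow/Keras is installed:

import numpy as np
from tensorflow.keras.layers import Dropout

layer = Dropout(0.5)
x = np.ones((1, 8), dtype="float32")
print(layer(x, training=True))   # roughly half the values zeroed, the rest scaled up
print(layer(x, training=False))  # identical to the input: dropout is turned off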

6
Q

What does the code below do? P 114

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm as maxnorm

def baseline_model():
  # create model with dropout applied to the input (visible) layer
  model = Sequential()
  model.add(Dropout(0.2, input_shape=(30,)))
  model.add(Dense(30, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(1, activation="sigmoid"))
  return model
A

Dropout can be applied to input neurons, called the visible layer. In this example a new Dropout layer is added between the input (visible) layer and the first hidden layer. The dropout rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.

7
Q

Why is kernel_constraint set to 3, the dropout rate to 0.2, the learning rate to 0.1 and the momentum to 0.9? P 114

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm as maxnorm
from tensorflow.keras.optimizers import SGD

def baseline_model():
  # create model with dropout on the visible layer and max-norm constrained hidden layers
  model = Sequential()
  model.add(Dropout(0.2, input_shape=(30,)))
  model.add(Dense(30, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(1, activation="sigmoid"))
  return model

sgd = SGD(learning_rate=0.1, momentum=0.9, decay=0.0, nesterov=False)
A

As recommended in the original paper on dropout, a constraint is imposed on the weights of each hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 3. The learning rate was lifted by one order of magnitude (from the default of 0.01 to 0.1) and the momentum was increased to 0.9. These increases in learning rate and momentum were also recommended in the original dropout paper.
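For context, a sketch of how this optimizer might be attached to the model above (the loss and metric here are assumptions based on the single sigmoid output, not quoted from the book):

model = baseline_model()
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])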

8
Q

What is momentum in SGD? External

A

A method that helps accelerate gradient vectors in the right direction, leading to faster convergence.
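A sketch of the classical momentum update used by SGD (plain Python, made-up numbers): the optimizer keeps a per-weight "velocity" that accumulates past gradients, so steps grow in directions the gradients consistently agree on and oscillations are damped.

lr, momentum = 0.1, 0.9
weight, velocity = 0.0, 0.0
gradients = [0.5, 0.4, 0.45, 0.5]             # made-up gradients pointing the same way
for grad in gradients:
  velocity = momentum * velocity - lr * grad  # velocity accumulates past gradients
  weight = weight + velocity                  # the step keeps growing while gradients agree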

9
Q

What does the code below do? P 115

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm as maxnorm
from tensorflow.keras.optimizers import SGD

# define baseline model
def baseline_model():
  # create model with dropout between the hidden layers and before the output layer
  model = Sequential()
  model.add(Dense(30, input_dim=30, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dropout(0.2))
  model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dropout(0.2))
  model.add(Dense(1, activation="sigmoid"))
  return model

sgd = SGD(learning_rate=0.1, momentum=0.9, decay=0.0, nesterov=False)
A

Dropout can be applied to hidden neurons in the body of your network model. In this example dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again a dropout rate of 20% is used, along with a weight constraint on those layers.

10
Q

Generally use a small dropout value of … of neurons with … providing a good starting point. A probability too low has … effect and a value too high results in … by the network. P 117

A

20%-50%, 20%, minimal, under-learning

11
Q

Why are you likely to get better performance when dropout is used on a larger network? P 117

A

Because it gives the model more of an opportunity to learn independent representations.

12
Q

Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of … to … and use a high momentum value of … or …. P 117

A

10 to 100, 0.9, 0.99
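Expressed as a Keras optimizer (a sketch; 0.1 assumes the default learning rate of 0.01 raised by 10x, the decay value is an illustrative choice, and the decay argument follows the older SGD signature used elsewhere in these cards):

from tensorflow.keras.optimizers import SGD

sgd = SGD(learning_rate=0.1, momentum=0.9, decay=1e-6, nesterov=False)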

13
Q

Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights, such as max-norm regularization with a size of … or …, has been shown to improve results. P 117

A

4, 5
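In Keras this is the kernel_constraint argument used in the earlier listings; with a max-norm size of 4 it would look like this (a sketch):

from tensorflow.keras.layers import Dense
from tensorflow.keras.constraints import MaxNorm

layer = Dense(30, activation="relu", kernel_constraint=MaxNorm(4))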
