Chapter 16 Reduce Overfitting With Dropout Regularization Flashcards

1
Q

What does the dropout technique do? P 112

A

Dropout is a technique where randomly selected neurons are ignored during training: they are dropped out at random with a given probability (e.g. 20%) on each weight update cycle. This means their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass.
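A minimal sketch of the mechanics (illustrative only, not from the book; plain NumPy): each neuron's output is zeroed with the given probability, and the survivors are rescaled by 1/(1 - rate), the "inverted dropout" convention most libraries use so the expected activation stays the same.

import numpy as np

rng = np.random.default_rng(seed=1)
rate = 0.2                                   # dropout probability
activations = rng.random(10)                 # pretend outputs of one layer
mask = rng.random(10) >= rate                # keep each neuron with probability 0.8
dropped = activations * mask / (1.0 - rate)  # zero the dropped neurons, rescale the rest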

2
Q

What are complex co-adaptations? P 112

A

Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.

External definition:

In a neural network, co-adaptation means that some neurons are highly dependent on others. If the neurons they depend on receive “bad” inputs, the dependent neurons can be affected as well, and this can significantly alter the model's performance, which is what can happen with overfitting. (Me: imagine a neuron capturing a part of the data while neighboring neurons capture more details of that part and rely on the input given to the main neuron.)

3
Q

How does dropping out neurons help reduce overfitting? P 112

A

You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons (refer to the complex co-adaptations definition).
This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of individual neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.

4
Q

Dropout happens at each weight update cycle. True/False? P 113

A

True. Dropout is easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g. 20%) on each weight update cycle.

5
Q

Dropout is only used during the training of a model and is not used when evaluating the skill of the model. True/False? P 113

A

True
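In Keras the Dropout layer is only active when the training flag is set (e.g. inside model.fit()); during evaluation and prediction it passes values through unchanged. A quick sketch to see this, assuming TensorFlow/Keras is installed:

import numpy as np
from tensorflow.keras.layers import Dropout

layer = Dropout(0.5)
x = np.ones((1, 8), dtype="float32")
print(layer(x, training=True))   # roughly half the values zeroed, the rest scaled up
print(layer(x, training=False))  # identical to the input: dropout is turned off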

6
Q

What does the code below do? P 114

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm as maxnorm

def baseline_model():
  # create model with dropout applied to the input (visible) layer
  model = Sequential()
  model.add(Dropout(0.2, input_shape=(30,)))
  model.add(Dense(30, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(1, activation="sigmoid"))
  return model
A

Dropout can be applied to input neurons, called the visible layer. In this example a new Dropout layer is added between the input (visible) layer and the first hidden layer. The dropout rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.

7
Q

Why is kernel_constraint set to 3, the dropout rate to 0.2, the learning rate to 0.1 and the momentum to 0.9? P 114

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm as maxnorm
from tensorflow.keras.optimizers import SGD

def baseline_model():
  # create model with dropout on the visible layer and max-norm constrained hidden layers
  model = Sequential()
  model.add(Dropout(0.2, input_shape=(30,)))
  model.add(Dense(30, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dense(1, activation="sigmoid"))
  return model

sgd = SGD(learning_rate=0.1, momentum=0.9, decay=0.0, nesterov=False)
A

As recommended in the original paper on dropout, a constraint is imposed on the weights of each hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 3. The learning rate was lifted by one order of magnitude (from the default of 0.01 to 0.1) and the momentum was increased to 0.9. These increases in learning rate and momentum were also recommended in the original dropout paper.
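For context, a sketch of how this optimizer might be attached to the model above (the loss and metric here are assumptions based on the single sigmoid output, not quoted from the book):

model = baseline_model()
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])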

8
Q

What is momentum in SGD? External

A

A method that helps accelerate gradient vectors in the right direction, leading to faster convergence.
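A sketch of the classical momentum update used by SGD (plain Python, made-up numbers): the optimizer keeps a per-weight "velocity" that accumulates past gradients, so steps grow in directions the gradients consistently agree on and oscillations are damped.

lr, momentum = 0.1, 0.9
weight, velocity = 0.0, 0.0
gradients = [0.5, 0.4, 0.45, 0.5]             # made-up gradients pointing the same way
for grad in gradients:
  velocity = momentum * velocity - lr * grad  # velocity accumulates past gradients
  weight = weight + velocity                  # the step keeps growing while gradients agree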

9
Q

What does the code below do? P 115

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm as maxnorm
from tensorflow.keras.optimizers import SGD

# define baseline model
def baseline_model():
  # create model with dropout between the hidden layers and before the output layer
  model = Sequential()
  model.add(Dense(30, input_dim=30, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dropout(0.2))
  model.add(Dense(15, activation="relu", kernel_constraint=maxnorm(3)))
  model.add(Dropout(0.2))
  model.add(Dense(1, activation="sigmoid"))
  return model

sgd = SGD(learning_rate=0.1, momentum=0.9, decay=0.0, nesterov=False)
A

Dropout can be applied to hidden neurons in the body of your network model. In this example dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again a dropout rate of 20% is used, along with a weight constraint on those layers.

10
Q

Generally use a small dropout value of … of neurons with … providing a good starting point. A probability too low has … effect and a value too high results in … by the network. P 117

A

20%-50%, 20%, minimal, under-learning

11
Q

Why are you likely to get better performance when dropout is used on a larger network? P 117

A

Because it gives the model more of an opportunity to learn independent representations.

12
Q

Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of … to … and use a high momentum value of … or …. P 117

A

10 to 100, 0.9, 0.99
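Expressed as a Keras optimizer (a sketch; 0.1 assumes the default learning rate of 0.01 raised by 10x, the decay value is an illustrative choice, and the decay argument follows the older SGD signature used elsewhere in these cards):

from tensorflow.keras.optimizers import SGD

sgd = SGD(learning_rate=0.1, momentum=0.9, decay=1e-6, nesterov=False)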

13
Q

Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights, such as max-norm regularization with a size of … or …, has been shown to improve results. P 117

A

4, 5
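In Keras this is the kernel_constraint argument used in the earlier listings; with a max-norm size of 4 it would look like this (a sketch):

from tensorflow.keras.layers import Dense
from tensorflow.keras.constraints import MaxNorm

layer = Dense(30, activation="relu", kernel_constraint=MaxNorm(4))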
