Dropout Flashcards
Overview
What is dropout in machine learning?
Dropout is a regularization technique used to prevent overfitting by reducing the network's reliance on any individual neuron (it deliberately makes training harder for the network)
Overview
What does dropout do during training?
During training, dropout randomly sets a fraction of the neurons’ outputs to zero.
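A minimal NumPy sketch of the training-time behavior (the function name and the 0.5 drop rate are illustrative assumptions, not from the card):

```python
import numpy as np

def dropout_train(activations, drop_rate=0.5, rng=np.random.default_rng(0)):
    """Standard dropout during training: zero a random fraction of the outputs."""
    keep_prob = 1.0 - drop_rate
    # Bernoulli mask: 1 with probability keep_prob, 0 with probability drop_rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask   # dropped neurons output exactly 0

a = np.array([0.8, 1.2, 0.5, 2.0])
print(dropout_train(a))         # some entries are zeroed, the rest pass through unchanged
```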
Overview
What benefit does dropout provide in terms of features?
Dropout encourages the learning of more robust and generalizable features
Overview
What happens to the dropout layer during inference?
During inference, the dropout layer is deactivated.
Overview
How are neuron outputs affected during inference in dropout?
During inference, all neurons are used, but their outputs are scaled down by the keep probability (the fraction of neurons that were kept active during training).
e.g. if on average 2 out of 3 neurons were active during training, the next layer learned its weights expecting 2 inputs, not 3. Scaling each of the 3 outputs by 2/3 at inference keeps their total magnitude the same as it was during training.
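A sketch of that inference-time scaling for standard (non-inverted) dropout, using the 2-out-of-3 numbers from the card:

```python
import numpy as np

keep_prob = 2.0 / 3.0            # on average 2 of the 3 neurons were active in training

def dropout_inference(activations, keep_prob):
    """Standard dropout at inference: every neuron fires, outputs are scaled by keep_prob."""
    return activations * keep_prob

a = np.array([1.0, 1.0, 1.0])    # all 3 neurons fire at inference
scaled = dropout_inference(a, keep_prob)
print(scaled, scaled.sum())      # total is about 2, the magnitude the next layer saw in training
```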
Forward phase of dropout
What is the effect of output masking in dropout during forward propagation?
The outputs of the selected neurons are set to zero, effectively dropping them out from the network for that particular forward pass.
Forward phase of dropout
What happens to the modified outputs from non-dropped out neurons during forward propagation?
The modified outputs from the non-dropped out neurons are propagated forward to the next layer in the network.
Forward phase of dropout
What happens to the dropped neurons on the next iteration?
On the next iteration a new set of neurons is dropped, chosen at random across the network according to the dropout probability, so a different subnetwork is trained each time.
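A tiny sketch (hypothetical values) showing that a fresh mask is drawn on every iteration:

```python
import numpy as np

rng = np.random.default_rng(42)
keep_prob = 0.5

for step in range(3):
    # a new Bernoulli mask is sampled on each forward pass, so a different subnetwork trains
    mask = rng.random(4) < keep_prob
    print(f"step {step}: active neuron indices = {np.flatnonzero(mask)}")
```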
Backward phase of dropout
What happens to the gradients/weights of the dropped neurons during backpropagation?
The backward pass starts with the modified outputs from the previous layer. Since some neurons were dropped out during the forward pass, their gradients are also set to zero.
(Gradient Masking)
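A minimal sketch of gradient masking, assuming the mask from the forward pass is cached and reused in the backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5

# forward pass: apply and remember the mask
x = np.array([0.8, 1.2, 0.5, 2.0])
mask = rng.random(x.shape) < keep_prob
out = x * mask

# backward pass: the same mask zeroes the gradients of the dropped neurons
grad_out = np.ones_like(out)   # gradient arriving from the next layer
grad_x = grad_out * mask       # dropped neurons receive zero gradient (and no weight update)
print(grad_x)
```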
Why is extra scaling necessary in the forward pass during dropout?
To ensure that the expected contribution of each neuron remains consistent between training and inference phases.
e.g. if dropout caused 1 out of 3 neurons not to fire during training (and each neuron has a maximum output of 1), the activation function learned to expect a maximum total input of 2, since at any given point only two neurons feed into it. When all 3 neurons are used during inference, the total input could reach 3. To keep the magnitude the same as during training, the outputs are scaled by 2/3 (called the keep probability).
What is the concept of inverted dropout during training?
All nodes that were not dropped out are scaled up by the inverse of the keep probability.
This scaling means that if 1 out of 3 neurons is dropped out, we multiply the outputs of the remaining 2 by 3/2. So if each neuron had a maximum output of 1 before, it now outputs up to 1.5, and the next layer's activation function learns on a total magnitude of 3.
At inference time, when all three neurons are in use (no dropout), the input does not need scaling, since the activation function already learned its values at a magnitude of 3 (the non-dropped-out magnitude).
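A sketch of inverted dropout with the 2/3 keep probability from the example above (the function name is an illustrative assumption):

```python
import numpy as np

def inverted_dropout(activations, keep_prob, training, rng=np.random.default_rng(42)):
    """Inverted dropout: scale the survivors up during training, do nothing at inference."""
    if not training:
        return activations                       # inference: no mask, no scaling
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob        # survivors scaled by 1/keep_prob (3/2 here)

a = np.array([1.0, 1.0, 1.0])
print(inverted_dropout(a, keep_prob=2/3, training=True))   # survivors become 1.5, dropped ones 0
print(inverted_dropout(a, keep_prob=2/3, training=False))  # unchanged at inference
```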
Why is scaling applied during training in inverted dropout?
To compensate for the dropped neurons, keeping the total contribution of the remaining neurons similar to what it would be without dropout.
How does the network behave at inference time in inverted dropout?
It works the same as if dropout wasn’t present, requiring no scaling.
What problem arises when applying standard dropout to each activation of a convolutional feature map before a 1 × 1 convolution layer?
It leads to increased training time without effectively preventing overfitting, mainly due to spatial correlation among feature map activations in fully convolutional networks.
In a CNN, what happens to the gradient contribution of certain neurons when dropout is applied?
Some neurons contribute nothing because they were dropped, but in an image the neighboring activations still exist, and despite the holes dropout creates, the remaining “pixels” carry enough information to overcome the missing data (because pixels are strongly correlated).
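A rough toy sketch (made-up 8x8 feature map, not from the cards) of why the holes matter little when neighboring activations are strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5

# a smooth, strongly correlated "feature map": neighboring values are nearly identical
feature_map = np.add.outer(np.linspace(0, 1, 8), np.linspace(0, 1, 8))

# per-activation dropout punches random holes in it
mask = rng.random(feature_map.shape) < keep_prob
dropped = feature_map * mask

# the surviving neighbors still summarize the map well, so little information is lost
print(f"mean of surviving activations: {dropped[mask].mean():.3f}")
print(f"mean of the full feature map:  {feature_map.mean():.3f}")
```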