06 - Regularization Flashcards

1
Q

What is regularization, what is it used for, examples?

A

Regularization describes modifications to the learning algorithm that aim to reduce the generalization error but not the training error.
When working with high-capacity models (many layers & parameters) there is a high chance of overfitting. By reducing the model capacity we risk underfitting. The problem is to find the sweet spot.
Regularization helps with finding a good trade-off between capacity and robustness.
→ We introduce regularization to get a realistic training curve that is good for both the training and the validation set (no overfitting, but good generalization).

Examples:
1. Parameter norm penalties: unstable parameter solutions are a classical problem in optimization (general form sketched below the list)
2. Data augmentation: artificially expanding the training dataset to allow the model to see a larger variation of training examples (adding noise, rotation etc.)
3. Noise injection: addition of noise to parts of the model
  1. input noise (like data augmentation)
  2. noise on the units: dropout
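
A common way to write the norm-penalty idea (a sketch in standard notation; J is the unregularized loss, Ω a penalty on the parameters θ, and λ ≥ 0 the regularization strength):

```latex
\tilde{J}(\theta) = J(\theta) + \lambda \, \Omega(\theta)
```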

2
Q

What is MAP Inference?

A

Technique to find the most likely configuration of unobserved variables (e.g. the model parameters) given some observed data. The goal of MAP inference is to find the variable values that maximize the posterior probability, i.e. the likelihood of the observed data weighted by the prior over the parameters.

So same goal as MLE, but MLE does not incorporate prior knowledge about parameters.

MAP takes into account prior knowledge about the parameters in the form of a prior probability distribution. This can be helpful if data is limited or noisy. It can help regularize the estimation process and prevent overfitting. It can also provide more robust and stable estimates by downweighting the influence of outliers.

MLE is a good choice when the data is abundant and there is no prior information. On the other hand, MAP can be useful when there is prior information or limited data.
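
As a formula sketch (standard definitions; θ denotes the parameters and D the observed data):

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; p(D \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; p(\theta \mid D)
                            = \arg\max_{\theta} \; p(D \mid \theta)\, p(\theta)
```

With a Gaussian prior on the weights, the MAP estimate corresponds to L2 weight decay (next card).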

3
Q

Explain Weight Decay

A

Weight decay (L2 regularization) is used to prevent overfitting by adding a penalty term to the loss function that discourages large weights in the model. The idea is that by encouraging the weights to be small, the model becomes less sensitive to the specific training examples and generalizes better to new, unseen examples.

The penalty term is proportional to the square of the weights and the proportionality constant is called the weight decay coefficient (lambda (λ)). A larger value of lambda will result in stronger regularization and smaller weight values.
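
A minimal sketch, assuming plain linear regression trained by full-batch gradient descent on toy data invented here for illustration; it shows how the λ‖w‖² penalty appears as an extra gradient term that shrinks the weights at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy inputs
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)    # toy targets

lam = 0.01   # weight decay coefficient (lambda)
lr = 0.1     # learning rate
w = np.zeros(5)

for _ in range(500):
    pred = X @ w
    grad_data = 2 * X.T @ (pred - y) / len(y)  # gradient of the MSE data term
    grad_penalty = 2 * lam * w                 # gradient of lambda * ||w||^2
    w -= lr * (grad_data + grad_penalty)       # the penalty "decays" w toward zero
```

Many frameworks expose the same idea directly as a weight_decay argument on the optimizer.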

4
Q

What is augmentation used for and what are some augmentation strategies?

A

Especially useful if not a lot of training data is available.

Typical strategies are (a short code sketch follows the list):

  • Noise (additive like Gaussian or multiplicative like Bernoulli (dropout))
  • Affine transformations (translation, rotation, scaling, shearing)
  • Truncation (cropping)
  • Non-linear operations (brightness)
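
A minimal sketch of a few of these strategies, assuming images are float arrays in [0, 1] with shape (H, W, C); the function name and all parameter values are illustrative choices, not from the original card:

```python
import numpy as np

def augment(img, rng):
    # additive Gaussian noise
    img = img + rng.normal(0.0, 0.05, size=img.shape)
    # multiplicative Bernoulli noise (dropout-style masking of pixels)
    img = img * rng.binomial(1, 0.95, size=img.shape)
    # affine transformation: horizontal translation (wrap-around for simplicity)
    img = np.roll(img, rng.integers(-4, 5), axis=1)
    # truncation: random crop from 32x32 down to 28x28
    top, left = rng.integers(0, 5, size=2)
    img = img[top:top + 28, left:left + 28, :]
    # non-linear operation: random brightness change via gamma correction
    img = np.clip(img, 0.0, 1.0) ** rng.uniform(0.8, 1.2)
    return img

rng = np.random.default_rng(0)
augmented = augment(rng.uniform(size=(32, 32, 3)), rng)   # toy 32x32 RGB "image"
```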
5
Q

Explain Early Stopping

A

Stop training once the validation loss no longer improves. A patience value acts as a threshold for how many epochs the validation loss is allowed to keep not improving before training is stopped. This is used to reduce overfitting; typically the weights from the epoch with the best validation loss are kept.
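
A minimal sketch of the patience logic; train_one_epoch() and validate() are hypothetical placeholders for the real training and validation steps, not part of any specific framework:

```python
import random

def train_one_epoch():
    pass                      # placeholder for one pass over the training data

def validate():
    return random.random()    # placeholder validation loss

best_val_loss = float("inf")
patience = 5                  # epochs the validation loss may fail to improve
bad_epochs = 0

for epoch in range(100):
    train_one_epoch()
    val_loss = validate()
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0        # reset the counter on improvement
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break             # stop: no improvement for `patience` epochs
```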

6
Q

Explain dropout

A

The idea is to randomly drop out (i.e., set to zero) a subset of the neurons in the network during each training step.
By randomly dropping out neurons, dropout forces the network to learn multiple independent representations of the data, which helps to prevent overfitting by making the network less reliant on any one feature or set of features.

The units to drop are chosen randomly, typically via a Bernoulli variable per unit with drop probability p. Rescale the surviving activations with 1/(1-p) to ensure the expected net input to any neuron stays the same (sketched at the end of this card).
Common values for p: 0-0.2 for input units and 0.5 for hidden units. Do not use dropout on the output units, since they have to produce the actual predictions.

More dropout tends to improve the test error up to a point, but too much is also not good; a drop probability of 0.5 is often the sweet spot for hidden layers.

Dropout subnets:
- Each time we drop a set of units (each epoch/minibatch uses a different dropout vector), we create a new subnetwork. All of these subnetworks share the same weights, so for each epoch/minibatch dropout effectively samples a new model. In the end, all of these subnets are combined (an ensemble effect).
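
A minimal sketch of the inverted-dropout rescaling described above, in NumPy (function and variable names are illustrative):

```python
import numpy as np

# p is the drop probability; survivors are rescaled by 1/(1-p) so the
# expected net input to the next layer stays the same.
def dropout(h, p, rng, training=True):
    if not training or p == 0.0:
        return h                                   # test time: keep all units, no rescaling
    mask = rng.binomial(1, 1.0 - p, size=h.shape)  # Bernoulli keep-mask per unit
    return h * mask / (1.0 - p)                    # drop and rescale

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))                   # toy batch of hidden activations
dropped = dropout(hidden, p=0.5, rng=rng)
```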

7
Q

Norm Penalty:

A

Norm penalties (weight regularization) are a technique used in deep neural networks (DNNs) to prevent overfitting by adding a penalty term to the loss function that discourages large weights in the model. The idea is that by encouraging the weights to be small, the model becomes less sensitive to the specific training examples and generalizes better to new, unseen examples.

L1 regularization: Also known as Lasso regularization, it adds a penalty term to the loss function that is proportional to the absolute value of the weights. L1 regularization tends to produce sparse models, where many of the weights are exactly zero.

L2 regularization: Also known as Ridge regularization, it adds a penalty term to the loss function that is proportional to the square of the weights. L2 regularization tends to produce models where the weights are small but non-zero.
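
As a formula sketch, using λ for the regularization coefficient as in the weight decay card:

```latex
\tilde{J}_{L1}(\theta) = J(\theta) + \lambda \sum_i |w_i|
\qquad
\tilde{J}_{L2}(\theta) = J(\theta) + \lambda \sum_i w_i^2
```

The L1 gradient has constant magnitude λ·sign(w_i), which pushes small weights all the way to zero (sparsity), while the L2 gradient 2λw_i only shrinks weights proportionally, leaving them small but non-zero.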
