regularization Flashcards
why regularization
prevent overfitting
how do l1 and l2 regularization prevent overfitting?
shrinks the coefficients (w) towards 0 -> discourages a more complex model
what is ridge regression (L2 regularization)
loss + lambda × sum(||w[l]||^2)   (sum of squared weights over layers l)
what is lasso regression (L1 regularization)
loss + lambda × sum(|w|)   (sum of absolute weight values)
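A minimal numpy sketch of the two penalty terms (the data-loss value 0.8 and lambda = 0.01 are just placeholder numbers):

```python
import numpy as np

def l2_penalty(w, lam):
    # ridge: lambda * sum of squared weights
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    # lasso: lambda * sum of absolute weight values
    return lam * np.sum(np.abs(w))

w = np.array([0.5, -1.2, 3.0])
ridge_loss = 0.8 + l2_penalty(w, lam=0.01)  # 0.8 stands in for the unregularized data loss
lasso_loss = 0.8 + l1_penalty(w, lam=0.01)
```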
what is the difference between L1 and L2 (lasso vs ridge)?
why might l1 be better? why might l2 be better?
ridge regression coefficient estimates stay non-zero (it can shrink some coefficients to almost, but never exactly, 0), while lasso coefficients can be exactly 0 (many coefficients can be 0 simultaneously) -> lasso also does feature selection and yields a sparse model (see the sklearn sketch after this card)
l2 puts an extra penalty on large weights (because it squares them), so the penalty strength grows much faster as weights get larger
l1 tends to be more time efficient at prediction time because the sparse model has fewer non-zero weights to compute with
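A quick sklearn sketch (the dataset and alpha values are arbitrary, just for illustration) showing that lasso zeroes out coefficients while ridge only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# toy data: only 5 of the 20 features actually matter
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically many -> sparse model
```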
what does regularization achieve?
reduces variance without significantly increasing bias
how to select lambda
as lambda increases, variance is reduced, but past some point the model starts losing important structure in the data (bias grows) -> find the optimal lambda: high enough to cut variance, but not so high that it underfits (commonly chosen by cross-validation, as in the sketch below)
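A sklearn sketch of picking lambda by cross-validating over a grid of candidates (the alpha grid and toy data are arbitrary examples):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 13)                 # candidate lambdas from 0.001 to 1000
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)  # 5-fold cross-validation over the grid
print("selected lambda:", model.alpha_)
```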
what is dropout regularization
during training, some layer outputs are dropped at random -> the network effectively has a different number of nodes and connections on each pass -> nodes within a layer must take on more or less responsibility for the input (some have to learn more)
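A minimal numpy sketch of (inverted) dropout applied to a layer's activations (p_drop = 0.5 is just an example value):

```python
import numpy as np

def dropout_forward(a, p_drop=0.5, training=True):
    """Apply inverted dropout to activations a; p_drop is the probability of dropping a unit."""
    if not training:
        return a                               # no dropout at test time
    mask = np.random.rand(*a.shape) > p_drop   # keep each unit with probability 1 - p_drop
    return a * mask / (1.0 - p_drop)           # rescale so the expected activation is unchanged

a = np.random.randn(4, 8)                      # toy activations from some layer
print(dropout_forward(a, p_drop=0.5))
```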
why does dropout help with overfitting?
a neuron can't rely on any single input because that input might be randomly dropped out
neurons will not learn redundant details of inputs
how to choose other hyperparams when using dropout?
higher learning rate: together with dropout noise, it may help explore more of the loss surface and find a better minimum
drawbacks of dropout
takes 2-3 times longer to train the nn
where to put dropout layer?
after fully connected layers, not convolutional layers; conv layers already have fewer parameters -> they need less regularisation (see the sketch below)
other regularization techniques, such as batch normalization, have largely overtaken dropout in CNNs
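A PyTorch sketch of a typical placement (the architecture and sizes are illustrative and assume 28×28 single-channel inputs, e.g. MNIST): dropout sits after the fully connected layer, not after the conv layers.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),  # fully connected layer
    nn.Dropout(p=0.5),                      # dropout goes here, after the FC layer
    nn.Linear(128, 10),
)
```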
what is data augmentation
a regularization technique: modify the training data to create new data (randomly crop, rotate, …)
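An illustrative torchvision augmentation pipeline (the specific transforms and values are arbitrary examples):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random crop with padding
    transforms.RandomHorizontalFlip(),      # random horizontal flip
    transforms.RandomRotation(15),          # random rotation up to ±15 degrees
    transforms.ToTensor(),
])
```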
what is early stopping
a regularization technique: stop training early, before the model overfits
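A self-contained sketch of the stopping rule with patience (val_losses is a made-up sequence of per-epoch validation losses; patience = 3 is arbitrary):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training would stop."""
    best, bad_epochs = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0      # improvement: reset the patience counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch                # no improvement for `patience` epochs -> stop
    return len(val_losses) - 1

print(early_stop_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]))  # -> 5
```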
what is the drawback of early stopping
it couples two tasks, optimizing the cost function and avoiding overfitting, so you always have to consider both at once; with L2 regularization you can train as long as possible and just focus on driving the cost down