Class 9 10 11 Flashcards
What is regularization?
Regularization is a modification intended to reduce the generalization error rather than the training error. If after some epochs the training error keeps decreasing while the validation error increases or remains unchanged, the model is overfitting. The main thing we want to accomplish with regularization is to get rid of overfitting; regularization manages the tradeoff between bias and variance.
What is parameter norm penalty?
Parameter norm penalty is a norm penalty Omega(w) added to the objective function: J~(w) = J(w) + alpha * Omega(w), where Omega is typically a norm of the weights.
It limits the capacity of the network; the bigger alpha is, the more regularization is applied.
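A minimal numpy sketch of such a penalized objective (the loss value, weights, and alpha below are made up for illustration):

```python
import numpy as np

def penalized_objective(loss, w, alpha):
    """Objective with an L2 parameter norm penalty: J~ = J + alpha * 0.5 * ||w||^2."""
    return loss + alpha * 0.5 * np.sum(w ** 2)

# hypothetical values: unregularized loss and current weight vector
loss = 0.42
w = np.array([1.5, -0.3, 2.0])
print(penalized_objective(loss, w, alpha=0.1))  # larger alpha -> stronger penalty
```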
What is Tikhonov/Ridge regression?
It is L2 regularization; it regularizes the weights.
The penalty is alpha/2 * ||w||^2 and it drives the weights toward the origin. Without regularization, the MSE 1/n * ||y - Xw||^2 has a closed-form solution only if X^T X is invertible, but with the regularization term X^T X + alpha*I is always invertible.
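A minimal numpy sketch of the ridge closed-form solution, using made-up data whose columns are linearly dependent, so plain least squares would fail:

```python
import numpy as np

# made-up design matrix with linearly dependent columns, so X^T X is singular
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
alpha = 0.1

# ordinary least squares (X^T X)^-1 X^T y fails here because X^T X is not invertible;
# ridge uses (X^T X + alpha*I)^-1 X^T y, which is always invertible for alpha > 0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)
```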
What is L1 regularization and how does it enforce the weights to be sparse?
L1 regularization has the formula alpha * ||w||_1, where ||w||_1 = |w1| + |w2| + |w3| + ...
Its derivative gives alpha * sign(w).
It enforces the weights to be sparse, which means that some weights can stay large while many other weights become exactly zero.
Example: the single equation [1 10] [w1; w2] = 5, i.e. w1 + 10*w2 = 5, describes a line of solutions, so there are infinitely many solutions.
The L2 norm level sets are circles, so the minimum-L2 solution touches the line with both weights nonzero; the L1 norm level sets are diamonds with corners on the axes, so the minimum-L1 solution lands on a corner and one weight becomes exactly zero (sparsity).
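A small numpy sketch on this exact system, comparing an L1 update (soft thresholding, ISTA-style) with plain gradient descent on the L2-penalized objective; the step size, alpha, and iteration count are made up:

```python
import numpy as np

# underdetermined system from the example: w1 + 10*w2 = 5
X = np.array([[1.0, 10.0]])
y = np.array([5.0])
alpha, lr = 0.1, 0.005

def soft_threshold(w, t):
    """Proximal operator of the L1 norm (soft thresholding)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w_l1 = np.zeros(2)
w_l2 = np.zeros(2)
for _ in range(5000):
    grad_l1 = X.T @ (X @ w_l1 - y)                          # gradient of the squared error
    w_l1 = soft_threshold(w_l1 - lr * grad_l1, lr * alpha)  # L1 step (soft thresholding)
    grad_l2 = X.T @ (X @ w_l2 - y) + alpha * w_l2           # L2 adds alpha*w to the gradient
    w_l2 = w_l2 - lr * grad_l2

print("L1 solution:", w_l1)  # w1 is driven to exactly zero (sparse)
print("L2 solution:", w_l2)  # both weights shrink but stay nonzero
```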
What is parameter norm regularization?
Parameter norm regularization helps to get a simpler model and to avoid overfitting; deep networks are complex models, which is the main reason why we use regularization.
Semi-supervised learning, multi-task learning, and early stopping?
Semi-supervised learning combines labeled and unlabeled data, i.e. examples from P(x) and from P(x, y), to estimate P(y | x).
Multi-task learning uses examples taken from different but very similar tasks. Part of the model is shared among all the similar tasks, which can be interpreted as parameter sharing.
Early stopping returns the model which has the lowest validation error: if after N epochs there is no improvement, training stops. It limits the parameter space to a neighbourhood of the initial parameters.
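A minimal Python sketch of early stopping with patience; train_one_epoch and validation_error are placeholder callables supplied by the caller, and the demo values are made up:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=5, max_epochs=100):
    """Return the model snapshot with the lowest validation error.
    Training stops once `patience` epochs pass without improvement."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error, best_model = error, copy.deepcopy(model)  # remember best model
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                             # stop training
    return best_model

# toy demo: the "model" is a dict and the validation error follows a made-up curve
# that improves for a while and then starts to overfit
errors = iter([1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8])
best = train_with_early_stopping({"epoch": 0},
                                 train_one_epoch=lambda m: m.update(epoch=m["epoch"] + 1),
                                 validation_error=lambda m: next(errors),
                                 patience=3)
print(best)  # snapshot taken at the epoch with validation error 0.5
```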
Parameter tying and sharing?
Two models A and B solve similar tasks.
Parameter tying adds a penalty Omega = ||w(A) - w(B)||^2 to the objective; keeping this value small forces the parameters of the two models to stay similar.
Parameter sharing (as in multi-task learning) goes further and forces subsets of parameters to be equal; it helps to achieve better generalization.
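A tiny numpy sketch of the tying penalty (the weight vectors and alpha are made up):

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha=0.1):
    """Penalty alpha * ||w_A - w_B||^2 that keeps the two models' weights close."""
    return alpha * np.sum((w_a - w_b) ** 2)

# made-up weight vectors for models A and B solving similar tasks
w_a = np.array([0.5, -1.2, 0.3])
w_b = np.array([0.6, -1.0, 0.2])
print(tying_penalty(w_a, w_b))  # added to the objective during training
```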
What are ensemble methods? Give examples
Ensemble methods train several different models.
The hope is that the different models do not all make the same mistakes as any single model.
Each model votes for the output.
Bagging and boosting are two different ensemble methods.
Bagging tries to reduce variance (possibly at the cost of a slight increase in bias).
It builds k different datasets by sampling with replacement from the training set, where k is the number of models (classifiers); a model is trained on each dataset. Differences in the training sets result in differences in the resulting models.
Boosting works the other way around: it combines weak models into an ensemble with higher capacity, reducing bias.
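A minimal numpy sketch of bagging with a made-up 1D dataset and a trivial threshold "model" (both are hypothetical, just to show the bootstrap-and-vote mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(x, y):
    """Tiny 'model': threshold at the midpoint between the two class means."""
    return (x[y == 0].mean() + x[y == 1].mean()) / 2.0

def predict_stump(threshold, x):
    return (x > threshold).astype(int)

# made-up 1D training data for two overlapping classes
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

# bagging: k bootstrap datasets (sampled with replacement), one model per dataset
k = 10
models = []
for _ in range(k):
    idx = rng.integers(0, len(x), size=len(x))   # bootstrap sample
    models.append(train_stump(x[idx], y[idx]))

# each model votes; the majority vote is the ensemble prediction
x_test = np.array([-1.0, 0.2, 3.0])
votes = np.stack([predict_stump(t, x_test) for t in models])
ensemble_prediction = (votes.mean(axis=0) > 0.5).astype(int)
print(ensemble_prediction)  # expected: [0 0 1]
```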
What is dropout? Why is it a regularization? Explain dropout briefly.
Dropout can be thought of as bagging applied to neural networks. Sub-networks are obtained by removing hidden/input units, i.e. multiplying those units by zero.
It uses minibatch-based training; a new random mask is sampled for each minibatch.
The different classifiers are obtained by removing different neurons from the network.
A binary mask (1 = kept neuron, 0 = discarded neuron) is applied to all input and hidden units.
For each binary mask, the predictions are calculated, and the final prediction is the mean over all masks.
By using the geometric mean, we can approximate the ensemble prediction with just one forward pass.
Weight scaling inference rule: multiply the weights going out of a unit by the probability of including that unit.
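A minimal numpy sketch of dropout applied to a layer's activations; scaling the activations by the keep probability at inference is equivalent to scaling the outgoing weights (the activations and keep probability are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, keep_prob, training):
    """Dropout on a layer's activations h.
    Training: multiply units by a random binary mask (0 = dropped, 1 = kept).
    Inference (weight scaling rule): multiply by keep_prob instead of sampling masks."""
    if training:
        mask = rng.random(h.shape) < keep_prob   # new mask for every forward pass
        return h * mask
    return h * keep_prob                         # one forward pass approximates the ensemble

# made-up hidden activations
h = np.array([0.5, 1.2, -0.7, 2.0])
print(dropout_forward(h, keep_prob=0.8, training=True))   # some units zeroed out
print(dropout_forward(h, keep_prob=0.8, training=False))  # all units scaled by 0.8
```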
What is adversarial training?
It can be thought of as a data augmentation technique where the added noise is derived from the gradient of the network with respect to the input. This perturbation is applied to training examples; even though the human eye generally cannot see the difference, the network is fooled into misclassifying the perturbed examples. Training on these adversarial examples (with the correct labels) makes the network more robust and acts as regularization.
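A minimal numpy sketch of a fast-gradient-sign-style perturbation on a hypothetical trained logistic-regression model (the weights, input, and epsilon are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical trained linear classifier: p(y=1|x) = sigmoid(w.x + b)
w = np.array([2.0, -3.0, 1.5])
b = 0.1

def adversarial_example(x, y, epsilon=0.1):
    """Perturb x in the direction of the sign of the loss gradient w.r.t. the input."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w               # d(cross-entropy)/dx for this linear model
    return x + epsilon * np.sign(grad_x)

x = np.array([0.5, 0.2, -0.3])
y = 1.0
x_adv = adversarial_example(x, y)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))  # confidence drops on the perturbed input
# adversarial training: add (x_adv, y) back into the training set with the correct label
```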