Week 5: Regularization: Lasso and Ridge regression Flashcards
What does explicit regularization amount to?
Modifying the cost function: a penalty term is added that discourages large parameter values.
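As a sketch (the symbols J, R and lambda are the usual generic notation, not fixed by the card), the modified cost function has the form

$$\tilde{J}(\boldsymbol{\theta}) = J(\boldsymbol{\theta}) + \lambda\, R(\boldsymbol{\theta}), \qquad \lambda \ge 0,$$

where $R(\boldsymbol{\theta}) = \|\boldsymbol{\theta}\|_2^2$ gives Ridge (L2) regularization and $R(\boldsymbol{\theta}) = \|\boldsymbol{\theta}\|_1$ gives LASSO (L1).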
What is the idea behind regularization?
(Parametric model) To keep the parameters theta.hat small, unless the data really convinces us otherwise.
I.e., if a model with small parameter values theta.hat fits the data almost as well as a model with larger parameter values, the one with small parameter values should be preferred.
Is Ridge regression L1 or L2 regularization?
L2 (the penalty is the squared L2 norm of the parameters).
Which values can lambda take on in Ridge regression?
lambda is greater than or equal to zero (non-negative).
Describe the trade-off that lambda (the regularization parameter) in Ridge regression expresses.
Choosing lambda is a trade-off between the original cost function (fitting the training data as well as possible) and the regularization term (keeping the parameters theta.hat close to zero); see the cost function below.
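As a concrete instance (written with the 1/n scaling that matches the normal equations quoted further down; the exact notation is an assumption consistent with that card), the Ridge cost function for linear regression is

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \; \frac{1}{n}\,\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|_2^2 + \lambda\, \|\boldsymbol{\theta}\|_2^2 .$$

The first term pulls theta.hat towards a good fit of the training data, the second pulls it towards zero, and lambda sets the exchange rate between the two.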
What does it imply for the (linear regression) model if we set lambda equal to zero in both Ridge and LASSO?
That we recover the original least squares problem, as the penalty term disappears.
What does it imply for the model (parameters) if we let lambda go to infinity?
That we force all (penalized) parameters to zero.
How do we choose the value of lambda in Ridge?
By cross-validation: select the lambda that gives the lowest estimated new-data error (e.g., the cross-validated MSE), as in the sketch below.
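A minimal sketch of this selection, assuming scikit-learn and synthetic data (both are illustrative choices, not prescribed by the card; scikit-learn names the regularization parameter alpha):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Try a log-spaced grid of candidate lambdas; RidgeCV's default
# (efficient leave-one-out cross-validation) picks the one with the
# lowest cross-validated squared error.
model = RidgeCV(alphas=np.logspace(-4, 2, 50))
model.fit(X, y)
print("chosen lambda (alpha):", model.alpha_)
```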
It is possible to derive a version of the normal equations for the minimization problem in Ridge (the cost function).
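A sketch of what these look like (with the n*lambda scaling implied by the invertibility condition quoted in the next card): setting the gradient of the Ridge cost to zero gives

$$\left(\mathbf{X}^\mathsf{T}\mathbf{X} + n\lambda\, \mathbf{I}_{p+1}\right)\hat{\boldsymbol{\theta}} = \mathbf{X}^\mathsf{T}\mathbf{y},$$

which can be solved directly for theta.hat whenever the matrix on the left is invertible.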
What is the difference between the regularized cost functions of Ridge and LASSO?
That Ridge always has a closed-form solution via its normal equations when lambda > 0 (since X^T X + n lambda I_{p+1} is then guaranteed invertible).
For LASSO, there is no closed-form solution; the cost function has to be minimized by numerical optimization. The two cost functions are contrasted below.
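For reference (same notation and scaling as the Ridge cost above), the LASSO cost function swaps the squared L2 penalty for an L1 penalty:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \; \frac{1}{n}\,\|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|_2^2 + \lambda\, \|\boldsymbol{\theta}\|_1 .$$

The L1 norm is not differentiable at zero, which is why no closed-form solution exists and a numerical solver is needed.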
What does it mean that LASSO favours sparse solutions?
It means that, for a given lambda, the LASSO solution tends to have only a few non-zero parameters, with the rest exactly zero. It “switches off” some inputs by setting the corresponding parameter theta_k to zero.
What does it imply for LASSO that it favours sparse solutions?
That, by “switching off” some inputs (by setting the corresponding parameter theta_k to zero), it can be used as an input (or feature) selection method.
In practice, looking at a plot of the fitted coefficients, what will be the difference between LASSO and Ridge?
That Ridge (L2) will fit a model with small (but non-zero) coefficient values, whereas LASSO (L1) will fit a model where some coefficients are exactly zero and the rest are small; the sketch below illustrates this.
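A minimal sketch that makes this visible numerically, assuming scikit-learn; the data and the lambda values are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data with a sparse ground truth (several true coefficients are zero).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
true_theta = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ true_theta + rng.normal(scale=0.5, size=200)

# Note: scikit-learn does not penalize the intercept, in line with the last card.
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: typically zeroes some exactly

print("ridge:", np.round(ridge.coef_, 3))  # small but non-zero values
print("lasso:", np.round(lasso.coef_, 3))  # several exact zeros
```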
What is the goal of regularization in terms of E_new and E_train?
The ultimate goal of explicit regularization is to accept a slightly higher E_train, hoping that this will give a lower E_new.
Why do we, generally, more often use regularization methods instead of dimensionality-reduction techniques for categorical variables?
Because categorical variables are not well suited to dimensionality-reduction techniques (such as PCA, which assumes continuous numerical inputs).
Why is the intercept theta_zero usually not included in the penalty term?
Because we want to discipline the model’s variation around the average y.bar, not the average level y.bar itself.