Week 5: Regularization: Lasso and Ridge regression Flashcards

1
Q

What does explicit regularisation amount to?

A

Modifying the cost function, by adding a penalty term that punishes large parameter values.

2
Q

What is the idea behind regularisation?

A

(For a parametric model.) To keep the parameters theta.hat small, unless the data really convinces us otherwise.

I.e., if a model with small parameter values theta.hat fits the data almost as well as a model with larger parameter values, the one with small parameter values should be preferred.

3
Q

Is Ridge regression the L1 or the L2 regularisation?

A

The L2 (Ridge penalizes the squared L2 norm of the parameters).

4
Q

Which values can lambda take on in Ridge regression?

A

lambda is greater than or equal to zero (non-negative).

5
Q

Describe the trade-off that lambda (the regularisation parameter) in Ridge regression expresses.

A

Choosing lambda is a trade-off between the original cost function (fitting the training data as well as possible) and the regularisation term (keeping the parameters theta.hat close to zero).
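
As an illustration (not from the card), here is a minimal sketch of this trade-off, assuming scikit-learn's Ridge (which calls the regularisation parameter alpha): larger values shrink the coefficients toward zero at the price of a slightly worse fit to the training data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Larger alpha (the card's lambda) -> smaller coefficients, worse training fit
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 2), round(model.score(X, y), 4))
```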

6
Q

What does it imply for the (linear regression) model if we set lambda equal to zero in both Ridge and LASSO?

A

That we will have the original least squares problem, as the penalty term disappears.
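
A quick numeric check of this, sketched in numpy under a book-style cost (1/n)||X theta - y||^2_2 + lambda ||theta||^2_2: with lambda = 0 the ridge normal equations reduce to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

n, lam = X.shape[0], 0.0  # lambda = 0: the penalty term disappears
theta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(3), X.T @ y)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(theta_ridge, theta_ls))  # True: same as least squares
```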

7
Q

What does it imply for the model (parameters) if we let lambda go to infinity?

A

That we will force all parameters to zero.

8
Q

How do we choose the value of lambda in Ridge?

A

By cross-validation, selecting the lambda that gives the lowest cross-validation error (e.g., MSE).

It is possible to derive a version of the normal equations for the minimization problem in Ridge (the cost function).
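
A minimal sketch of this selection, assuming scikit-learn's RidgeCV (its alpha grid plays the role of the card's lambda):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Try a grid of candidate lambdas; keep the one with the best CV error
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print(model.alpha_)  # the selected regularisation strength
```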

9
Q

What is the difference between the regularized cost functions of Ridge and LASSO?

A

Ridge penalizes the squared L2 norm ||theta||_2^2 while LASSO penalizes the L1 norm ||theta||_1. As a consequence, we always have a closed-form solution to the Ridge regression normal equations if lambda > 0 (since X^T X + n lambda I_(p+1) is then invertible).

For LASSO there is no closed-form solution; we have to minimize the cost function by numerical optimization.
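
A sketch of the contrast: the ridge solution is one linear solve, while the LASSO line below relies on scikit-learn's Lasso (whose alpha is scaled slightly differently from the lambda above), which minimizes its cost iteratively by coordinate descent.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(size=50)
n, p, lam = 50, 4, 0.1

# Ridge: closed form via the regularized normal equations
theta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# LASSO: no closed form; solved numerically (coordinate descent)
theta_lasso = Lasso(alpha=lam).fit(X, y).coef_
print(theta_ridge, theta_lasso)
```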

10
Q

What does it mean that LASSO favours sparse solutions?

A

It means that, for a given lambda, the LASSO solution tends to have only a few non-zero parameters, with the rest exactly zero. It does so by “switching off” some inputs (by setting the corresponding parameter theta_k exactly to zero).
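
A minimal sketch of this sparsity, assuming scikit-learn: on data where only a few inputs matter, Lasso produces exact zeros while Ridge only shrinks.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 20 inputs actually influence y
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print((lasso.coef_ == 0).sum(), "exact zeros (lasso)")
print((ridge.coef_ == 0).sum(), "exact zeros (ridge)")
```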

11
Q

What does it imply for LASSO that it favours sparse solutions?

A

That, by “switching off” some inputs (by setting the corresponding parameter theta_k to zero), it can be used as an input (or feature) selection method.
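
One way to exploit this in practice, sketched with scikit-learn's SelectFromModel wrapper (an assumption; the card names no tool): keep only the inputs whose lasso coefficient is non-zero.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

# Drop the inputs that lasso "switched off"
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print(X.shape, "->", selector.transform(X).shape)
```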

12
Q

In practice, looking at a graph, what will be the difference between LASSO and Ridge?

A

That the Ridge (L2) will fit a model with small (but non-zero) coefficient values, whereas LASSO (L1) will fit a model where some coefficients are exactly zero and the rest are small.

13
Q

What is the goal of regularization in terms of E_new and E_train?

A

The ultimate goal of explicit regularization is to accept a slightly higher E_train, hoping that it will result in a lower E_new (better generalisation).

14
Q

Why do we generally use regularization methods rather than dimension reduction techniques for categorical variables?

A

Because categorical variables aren’t well suited for dimension reduction techniques.

15
Q

Why is the intercept theta_zero usually not included in the penalty term?

A

Because we want to penalize the model's variation around the average y.bar, not the level y.bar itself.

16
Q

Do we have to demean and rescale the inputs before performing regularization?

A

Yes, always. (The penalty treats all parameters alike, so the inputs need to be on comparable scales.)

17
Q

Which numerical optimization algorithm do we use when minimizing the LASSO cost function?

A

The Least angle regression (LARS) algorithm.
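
A sketch using scikit-learn's lars_path, which traces the whole LASSO regularisation path with the LARS algorithm (using its method='lasso' variant is an assumption here):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# Coefficients at every breakpoint of the regularisation path
alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)  # (n_features, n_steps)
```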

18
Q

What is the shrinkage factor in Ridge?

A

(1 + lambda)^(-1).

We multiply LS.theta.hat_j by this factor (equivalently, divide it by 1 + lambda) to get Ridge.theta.hat_j.
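
A numeric sketch of this shrinkage in numpy, assuming an orthogonal design scaled so that X^T X = n I (the setting in which the factor is exact):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 3, 0.5

# Orthogonal design: X^T X = n * I
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
theta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

print(theta_ridge / theta_ls)  # each ratio ~ 1 / (1 + lam) = 2/3
```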

19
Q

What will be the value of the LASSO coefficients if the LS coefficients are smaller (in absolute value) than lambda (to begin with)?

A

Exactly zero.

20
Q

What do we try to minimize with an elastic net?

A

min_theta (1/n) ||X theta - y||_2^2 + lambda [ (1 - alpha) ||theta||_2^2 + alpha ||theta||_1 ]
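
A minimal sketch with scikit-learn's ElasticNet. Its parametrization (alpha, l1_ratio, plus different constant factors) does not match the formula above one-to-one, so treat the mapping as approximate:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# l1_ratio plays the role of alpha above: 0 -> pure ridge, 1 -> pure lasso
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```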

21
Q

What will lasso do if we have a model with only two, correlated input variables?

A

Arbitrarily set one coefficient to zero and hence exclude one of the variables.

22
Q

What is the elastic net?

A

It is a mixture of lasso and ridge regularization: its penalty combines the two individual penalties. The elastic net has an additional mixing parameter (alpha).

23
Q

Explain the procedure of finding the ridge or lasso regression coefficients in linear regression using the coefficient budget perspective.

A

1) Find the LS coefficients.
2) Draw the contour lines of the cost function around the LS solution.
3) Set up the coefficient budget as a region shaped like a circle (ridge) or a diamond (lasso).
4) Expand the contour lines until one barely touches the boundary of the budget constraint. The point where it first touches gives the coefficients of your regularized model.

24
Q

What is the shape of the coefficient budget for lasso and ridge respectively? For p = 2? For p > 2?

A

For lasso it's a diamond (for p > 2 a rhomboid), and for ridge it's a circle (for p > 2 a sphere).

25
Q

What does the regularization path tell us that the orthogonal design doesn’t (for lasso)?

A

That lasso quickly results in a sparse model, and also which variables are particularly important for predicting the output in the lasso model.

26
Q

What is the idea of the one S.E. (std error) rule?

A

To use it together with cross-validation (CV): select the most parsimonious model whose prediction error is not much worse than the minimum CV error, i.e., a model whose error is still within one standard error of the minimum.
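
A sketch of the one-SE rule on a LassoCV fit (assuming scikit-learn's LassoCV and its mse_path_ attribute; a larger alpha means a sparser, more parsimonious model):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)
cv_model = LassoCV(cv=5, random_state=0).fit(X, y)

mean_err = cv_model.mse_path_.mean(axis=1)  # mean CV error per alpha
se_err = cv_model.mse_path_.std(axis=1) / np.sqrt(cv_model.mse_path_.shape[1])
best = mean_err.argmin()

# Most parsimonious model within one SE of the minimum CV error
alpha_1se = cv_model.alphas_[mean_err <= mean_err[best] + se_err[best]].max()
print(cv_model.alpha_, alpha_1se)
```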

27
Q

Which of ridge and lasso is a little bit better at reducing variance for a model where most variables are useful?

A

Ridge.

28
Q

Which of ridge and lasso is a little bit better at reducing variance for a model with many useless variables?

A

Lasso, as it can exclude variables fully by letting their coefficients equal zero.

29
Q

Explain how regularization affects bias and variance of a common linear regression model.

A

Without regularization: low bias on the training data → high variance.

With regularization: a slightly larger bias to start with → lower variance.

30
Q

Which values can lambda take on?

A

Zero to infinity.

31
Q

Regularization for SIMPLE linear regression (using sq. error loss) boils down to what for i) ridge and ii) lasso?

A

Minimizing i) the sum of squared residuals plus lambda times the squared slope coefficient (ridge), and ii) the sum of squared residuals plus lambda times the absolute value of the slope coefficient (lasso).
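
A numeric sketch of these two one-dimensional problems, assuming scipy's minimize_scalar and ignoring the intercept (the lasso cost is non-smooth, so this is purely illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)
lam = 5.0

ridge_cost = lambda t: np.sum((y - t * x) ** 2) + lam * t ** 2     # i) ridge
lasso_cost = lambda t: np.sum((y - t * x) ** 2) + lam * np.abs(t)  # ii) lasso

print(minimize_scalar(ridge_cost).x, minimize_scalar(lasso_cost).x)
```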

32
Q

What is lambda?

A

A tuning parameter: a weight expressing the trade-off between fitting the data and keeping the parameters small.

33
Q

By how much do the ridge (L2) coefficients (linear reg) decrease?

A

They are multiplied by the factor (1 + lambda)^(-1).

34
Q

Why is it called L1 and L2 respectively?

A

Because we’re using the L1 and L2 norms respectively when setting up the penalty function.

35
Q

Name the two methods for graphically showing regularization.

A

1) Orthogonal design and 2) coefficient budget perspective

36
Q

Regularization for SIMPLE linear regression (using sq. error loss) boils down to what for i) ridge and ii) lasso? Explain as simply as possible.

A

Minimizing i) the sum of squared residuals plus lambda times the squared slope coefficient (ridge), and ii) the sum of squared residuals plus lambda times the absolute value of the slope coefficient (lasso).

37
Q

Draw the orthogonal design with some examples.

A

Set up a 45-degree dotted line: it corresponds to the LS coefficients (regularized estimate equal to LS estimate on the two axes). The ridge line is this line rotated (slope 1/(1 + lambda)); the lasso line runs parallel to it, shifted and clipped to zero near the origin.

38
Q

For adaptive lasso, what will be the weight (strength of penalty) of an input with an LS coefficient of 0.9 and of 0.1, respectively?

A

1/0.9 ≈ 1.11 and 1/0.1 = 10. (The weight is the inverse of the LS coefficient, so inputs with large LS coefficients are penalized less.)
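
A tiny numeric check of these weights (w_j = 1 / |LS coefficient_j|): large LS coefficients get a weak penalty, small ones a strong penalty.

```python
import numpy as np

theta_ls = np.array([0.9, 0.1])
weights = 1.0 / np.abs(theta_ls)  # adaptive-lasso penalty weights
print(weights)                    # [ 1.11...  10.  ]
```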
