Chapter 6 - Shrinkage Methods Flashcards

1
Q

Overview (idea, why do we want to do this?)

A

The idea is to perform a linear regression while regularizing, or shrinking, the coefficients towards 0 (i.e., adding a penalty for large beta).

Why are shrunk coefficients better? Shrinking introduces bias, but it can substantially reduce the variance of the estimates. There is also a Bayesian motivation: the prior tends to shrink the parameters towards 0.

2
Q

Ridge Regression (definition, 2 notes)

A

Minimize RSS + lambda * sum_j beta_j^2 (an l2 penalty on the coefficients). lambda modulates the trade-off between keeping the coefficient values small and minimizing the RSS. Solve this problem for every lambda (in practice, a grid of values), then use the CV error to choose the best value of lambda.

NOTE: fortunately, there are efficient algorithms that solve for all values of lambda essentially simultaneously. Also, in least squares regression we can rescale the variables without affecting the fit (the coefficients simply rescale), which is not true for ridge.
Solution: scale each variable to variance 1 before running the regression (see the sketch below).
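
A minimal sketch of this recipe, assuming scikit-learn and synthetic data (the variable names and the lambda grid are illustrative, not from the text): standardize, solve over a grid of lambda values, and pick lambda by CV error.

```python
# Sketch: ridge regression with standardization and a CV-chosen lambda.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

# Placeholder data; in practice X, y are your predictors and response.
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

lambdas = np.logspace(-3, 3, 100)      # grid of penalty values
model = make_pipeline(
    StandardScaler(),                  # scale each variable to variance 1 first
    RidgeCV(alphas=lambdas, cv=5),     # fit for every lambda, choose by CV error
)
model.fit(X, y)
print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
```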

3
Q

The Lasso

A

Same form as ridge regression, except that lambda multiplies the l1 norm (the sum of the absolute values of the beta_j) instead of the sum of squares.

1) The implicit assumption is that many of the true beta_j = 0 (sparsity).
2) All coefficients go to zero as lambda increases, and they hit exactly zero one by one (illustrated in the sketch below).

The lasso coefficient path is a piecewise(-linear) function of lambda and can be viewed in phases according to how many predictors are active (# of predictors = 1, …).
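
A small sketch of points 1-2 above, assuming scikit-learn and synthetic data: as lambda grows, fewer lasso coefficients remain non-zero.

```python
# Sketch: count non-zero lasso coefficients as lambda increases.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # standardize before penalizing

for lam in [0.01, 1.0, 10.0, 100.0]:
    coefs = Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_
    print(f"lambda={lam:>6}: {np.sum(coefs != 0)} non-zero coefficients")
```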

4
Q

Comparison of coefficients in Ridge and Lasso

A

1) In ridge, all coefficients will be non-zero (shrunk, but essentially never exactly zero)

2) In lasso, some coefficients will be exactly zero (you can interpret this as those predictors not being important); a quick sketch follows
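
A quick sketch of the contrast, assuming scikit-learn, synthetic sparse data, and made-up penalty values: ridge leaves essentially every coefficient non-zero, while the lasso zeroes some out.

```python
# Sketch: compare the number of exactly-zero coefficients under ridge and lasso.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=15, n_informative=4,
                       noise=10.0, random_state=1)

ridge_coefs = Ridge(alpha=10.0).fit(X, y).coef_
lasso_coefs = Lasso(alpha=10.0, max_iter=10000).fit(X, y).coef_

print("ridge non-zero coefficients:", np.sum(ridge_coefs != 0), "of", len(ridge_coefs))
print("lasso non-zero coefficients:", np.sum(lasso_coefs != 0), "of", len(lasso_coefs))
```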

5
Q

Alternative formulation of Ridge, Lasso (best subset) (2 notes)

A

Ridge: for every value of lambda, there is a budget (shrinkage factor) s such that beta_R_lambda solves the constrained optimization: minimize RSS subject to the sum of the beta_j^2 being <= s.

Lasso: for every lambda, there is a budget s such that beta_L_lambda solves the constrained optimization: minimize RSS subject to the sum of the |beta_j| being <= s.

Best subset selection: find the least squares solution in which at most s coefficients are non-zero (i.e., an l0 constraint). The three constrained problems are written out below.

Notes: the lasso can be viewed as a (convex) modification of best subset selection, and ridge as a further generalization of it; both replace the computationally intractable l0 constraint with one that is feasible to optimize.
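
The three constrained (budget) problems written out, a standard restatement with s as the budget:

```latex
\text{Ridge:}\quad       \min_{\beta}\ \mathrm{RSS}(\beta) \ \ \text{subject to}\ \ \sum_{j=1}^{p} \beta_j^{2} \le s
\text{Lasso:}\quad       \min_{\beta}\ \mathrm{RSS}(\beta) \ \ \text{subject to}\ \ \sum_{j=1}^{p} |\beta_j| \le s
\text{Best subset:}\quad \min_{\beta}\ \mathrm{RSS}(\beta) \ \ \text{subject to}\ \ \sum_{j=1}^{p} \mathbf{1}(\beta_j \neq 0) \le s
```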

6
Q

Visualizing Ridge and Lasso w/ two predictors (3)

A

1) Plot beta_1 vs beta_2; the contours are RSS contours, centered on the least squares estimate.
2) The constraint regions are a diamond for the lasso (|beta_1| + |beta_2| <= s) and a circle for ridge (beta_1^2 + beta_2^2 <= s).
3) The point at which the constraint region (lasso diamond, ridge circle) first meets an RSS contour gives the constrained coefficient estimates; because the diamond has corners on the axes, the lasso solution can have a coefficient exactly equal to 0, whereas you can see that the ridge solution (a circle with no corners) will essentially never have coefficients = 0. A plotting sketch follows.
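
A rough plotting sketch of this picture, assuming numpy/matplotlib and a made-up quadratic standing in for the RSS surface:

```python
# Sketch: RSS-like contours with the lasso diamond and ridge circle constraint regions.
import numpy as np
import matplotlib.pyplot as plt

beta_hat = np.array([1.2, 0.6])          # made-up unconstrained least squares estimate
b1, b2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
# A generic positive-definite quadratic centered at beta_hat, standing in for RSS.
rss = (b1 - beta_hat[0])**2 + 0.5*(b1 - beta_hat[0])*(b2 - beta_hat[1]) + (b2 - beta_hat[1])**2

fig, ax = plt.subplots()
ax.contour(b1, b2, rss, levels=10, colors="grey")
s = 0.8                                  # constraint budget
ax.add_patch(plt.Polygon([(s, 0), (0, s), (-s, 0), (0, -s)], fill=False,
                         color="blue", label="lasso: |b1| + |b2| <= s"))
ax.add_patch(plt.Circle((0, 0), s, fill=False,
                        color="red", label="ridge: b1^2 + b2^2 <= s^2"))
ax.plot(*beta_hat, "ko")                 # least squares estimate
ax.set_xlabel("beta_1"); ax.set_ylabel("beta_2")
ax.set_aspect("equal"); ax.legend()
plt.show()
```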

7
Q

When is lasso better than ridge? 2 examples

A

example 1: most of the true coefficients are non-zero (a dense setting)
- bias is about the same for both methods, but the variance of ridge regression is smaller, and so is its MSE; ridge wins here

example 2: only 2 of the true coefficients are non-zero (a sparse setting)
- the bias of ridge regression is much higher than that of the lasso, the variances are similar, bias dominates, so the lasso is better (a rough simulation sketch follows)
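
A rough simulation sketch along these lines, assuming numpy/scikit-learn (this is not the book's exact experiment, and n = 100, p = 45, and the noise level are made up; results vary by seed, the point is the qualitative pattern):

```python
# Sketch: held-out MSE of ridge vs lasso under a dense truth and a sparse (2 non-zero) truth.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 100, 45

def test_mse(beta_true):
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(scale=3.0, size=n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X_tr, y_tr)
    lasso = LassoCV(cv=5, max_iter=10000).fit(X_tr, y_tr)
    return (np.mean((ridge.predict(X_te) - y_te) ** 2),
            np.mean((lasso.predict(X_te) - y_te) ** 2))

dense = rng.normal(size=p)                       # example 1: all coefficients non-zero
sparse = np.zeros(p); sparse[:2] = [5.0, -5.0]   # example 2: only 2 non-zero
print("dense truth  (ridge MSE, lasso MSE):", test_mse(dense))
print("sparse truth (ridge MSE, lasso MSE):", test_mse(sparse))
```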

8
Q

How do you choose lambda by CV?

A

Plot the CV error against the ratio of the l1 norm of the lasso fit to the l1 norm of the full least squares fit (a constant), and choose the value that minimizes the CV error. In the book's simulated example, at that value of lambda only 2 predictors have non-zero coefficients (the “true variables”). A sketch of the CV curve follows.
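
A minimal sketch, assuming scikit-learn and synthetic data: the book plots CV error against the standardized l1-norm ratio, while this sketch simply plots it against lambda and then reports the sparsity at the chosen value.

```python
# Sketch: cross-validation error curve for the lasso, then inspect the chosen fit.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=100, n_features=10, n_informative=2,
                       noise=5.0, random_state=0)

cv_fit = LassoCV(cv=10, max_iter=10000).fit(X, y)
mean_cv_error = cv_fit.mse_path_.mean(axis=1)   # mean CV MSE for each candidate lambda

plt.plot(cv_fit.alphas_, mean_cv_error)
plt.axvline(cv_fit.alpha_, linestyle="--")      # lambda minimizing the CV error
plt.xscale("log"); plt.xlabel("lambda"); plt.ylabel("CV error")
plt.show()

print("non-zero coefficients at the chosen lambda:", np.sum(cv_fit.coef_ != 0))
```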

9
Q

Special case

A

n = p and the matrix of predictors X is the identity matrix (no intercept). In this case the ridge and lasso minimization problems decouple coordinate by coordinate and have simple closed-form solutions (the lasso's is a piecewise, soft-thresholding function), which make for illustrative plots; see below.
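
Writing the simplification out (a standard result; with X equal to the identity, the objective separates coordinate by coordinate, so each beta_j depends only on y_j):

```latex
\hat\beta_j^{\,\text{ridge}} = \frac{y_j}{1+\lambda}
\qquad\qquad
\hat\beta_j^{\,\text{lasso}} =
\begin{cases}
y_j - \lambda/2, & \text{if } y_j > \lambda/2,\\
y_j + \lambda/2, & \text{if } y_j < -\lambda/2,\\
0, & \text{if } |y_j| \le \lambda/2.
\end{cases}
```

So ridge shrinks every least squares estimate by the same proportion, while the lasso soft-thresholds: it shrinks each estimate by lambda/2 and truncates small ones to exactly zero.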

10
Q

Bayesian Interpretation (2 plots)

A

Ridge: beta_R is the posterior mean (and mode), with a normal prior on beta (a Gaussian bell curve centered on 0 [i.e., many coefficients will be slightly non-zero, few will be exactly zero]).

Lasso: beta_L is the posterior mode, with a Laplace prior on beta (a sharp peak at zero that slopes down quickly [i.e., most of the coefficients are zero, with very few being non-zero]). A short derivation of the correspondence follows.
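
One way to see the correspondence, assuming i.i.d. Gaussian errors with variance sigma^2 and independent priors on the beta_j: up to constants, the negative log-posterior is exactly a penalized RSS, with lambda set by the prior scale.

```latex
\text{Normal prior } \beta_j \sim N(0,\tau^2):\quad
-2\sigma^2 \log p(\beta \mid \text{data}) = \mathrm{RSS}(\beta) + \frac{\sigma^2}{\tau^2}\sum_j \beta_j^2 + \text{const}
\quad\Rightarrow\quad \text{ridge with } \lambda = \sigma^2/\tau^2.

\text{Laplace prior with scale } b:\quad
-2\sigma^2 \log p(\beta \mid \text{data}) = \mathrm{RSS}(\beta) + \frac{2\sigma^2}{b}\sum_j |\beta_j| + \text{const}
\quad\Rightarrow\quad \text{lasso with } \lambda = 2\sigma^2/b.
```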
