Chapter 6 - Shrinkage Methods Flashcards
Overview (idea, why do we want to do this?)
The idea is to perform linear regression while regularizing, i.e. shrinking, the coefficient estimates towards 0 (add a penalty for large beta_j).
Why are shrunk coefficients better? Shrinking introduces bias, but it can substantially reduce the variance of the estimates, which often lowers the test MSE. There is also a Bayesian motivation: the prior tends to shrink the parameters towards zero.
Ridge Regression (definition, 2 notes)
Minimize RSS + lambda * sum_j beta_j^2 (the l2 penalty). lambda controls the trade-off between fitting the data (small RSS) and keeping the coefficients small: lambda = 0 gives ordinary least squares, and larger lambda shrinks the coefficients more. Solve this problem over a grid of lambda values, then use the CV error to choose the value of lambda.
NOTE: fortunately, there are efficient algorithms that solve for all values of lambda essentially simultaneously. Also, in least squares regression we can rescale variables without affecting the fit, which is not true for ridge, because the penalty depends on the scale of each beta_j.
Solution: scale each variable to variance 1 before running the regression (see the sketch below).
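A minimal sketch of ridge with standardized predictors and a CV-chosen penalty, using scikit-learn (which calls the penalty parameter alpha rather than lambda); the synthetic dataset and penalty grid are illustrative assumptions, not from the chapter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for a real design matrix.
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

alphas = np.logspace(-3, 3, 100)          # grid of candidate penalties (lambda values)
model = make_pipeline(
    StandardScaler(),                     # scale each variable to variance 1 first
    RidgeCV(alphas=alphas, cv=10),        # 10-fold CV error picks the penalty
)
model.fit(X, y)
print("chosen penalty:", model.named_steps["ridgecv"].alpha_)
```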
The Lasso
Same form as ridge regression, except lambda multiplies the l1 norm (the sum of the absolute values of beta_j) instead of the l2 norm.
1) the l1 penalty can set some beta_j exactly equal to 0, so the lasso implicitly assumes a sparse model in which many of the true beta_j = 0.
2) As lambda increases, coefficients are set to zero one by one; for large enough lambda all coefficients are zero.
The lasso coefficient paths are piecewise linear in lambda and can be viewed in phases as predictors enter the model one at a time (# of non-zero predictors = 1, 2, ...); see the sketch below.
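A minimal sketch of the lasso path using scikit-learn's lasso_path on synthetic data (the dataset and sizes are illustrative assumptions), showing predictors entering one at a time as the penalty decreases:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

# Toy data with only 3 informative ("true") predictors.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

alphas, coefs, _ = lasso_path(X, y)       # coefs has shape (n_features, n_alphas)
for i in range(0, len(alphas), 10):       # print every 10th penalty value
    n_nonzero = int(np.sum(coefs[:, i] != 0))
    print(f"penalty={alphas[i]:10.3f}  non-zero coefficients={n_nonzero}")
```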
Comparison of coefficients in Ridge and Lasso
1) In ridge, all coefficients are shrunk but remain non-zero.
2) In the lasso, some coefficients are exactly zero (interpret those predictors as unimportant), so the lasso performs variable selection.
Alternative formulation of Ridge, Lasso (best subset) (2 notes)
Ridge: for every value of lambda, there is a budget s such that beta_R_lambda solves the constrained problem: minimize RSS subject to sum_j beta_j^2 <= s (an l2-norm constraint).
Lasso: for every lambda, there is an s such that beta_L_lambda solves: minimize RSS subject to sum_j |beta_j| <= s (an l1-norm constraint).
Best subset selection: find the least squares solution with at most s non-zero coefficients, i.e., minimize RSS subject to sum_j I(beta_j != 0) <= s (the l0 "norm").
Notes: the lasso can be viewed as a convex modification of best subset (the l0 constraint is replaced by the l1 constraint), and ridge relaxes it further to the l2 constraint; both are computationally feasible, unlike best subset. The three problems are written out below.
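Written out in the standard constrained form (same notation as above):

```latex
\min_{\beta}\; \mathrm{RSS}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s
  \qquad \text{(ridge)}

\min_{\beta}\; \mathrm{RSS}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
  \qquad \text{(lasso)}

\min_{\beta}\; \mathrm{RSS}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} I(\beta_j \neq 0) \le s
  \qquad \text{(best subset)}
```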
Visualizing Ridge and Lasso w/ two predictors (3)
Plot beta_1 vs beta_2; the ellipses are contours of constant RSS, and the constraint region is a diamond for the lasso (|beta_1| + |beta_2| <= s) or a circle for ridge (beta_1^2 + beta_2^2 <= s). The solution is the point where the smallest RSS contour first touches the constraint region. Because the diamond has corners on the axes, the lasso solution can land exactly on an axis (a coefficient equal to 0); the circle has no corners, so the ridge solution will essentially never have coefficients exactly = 0.
When is lasso better than ridge? 2 examples
Example 1: most of the true coefficients are non-zero (ridge wins).
- bias is about the same for both methods, but the variance of ridge regression is smaller, so its MSE is lower.
Example 2: only 2 of the coefficients are non-zero (lasso wins).
- bias of ridge regression is much higher than that of the lasso, the variances are similar, and the bias dominates, so the lasso has lower MSE.
How do you choose lambda by CV?
Plot the CV error against the ratio of the l1 norm of the lasso fit to the l1 norm of the full least squares fit (||beta_L_lambda||_1 / ||beta_LS||_1, where the denominator is a constant). At the value of lambda that minimizes the CV error, only 2 predictors have non-zero coefficients (the "true" variables). See the sketch below.
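A minimal sketch of choosing the lasso penalty by 10-fold CV, using scikit-learn's LassoCV on synthetic data with 2 informative predictors (the dataset and sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Toy data in which only 2 of the 10 predictors are "true" variables.
X, y = make_regression(n_samples=100, n_features=10, n_informative=2,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

model = LassoCV(cv=10).fit(X, y)          # CV error chooses the penalty
print("chosen penalty:", model.alpha_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))  # ideally the 2 true ones
```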
Special case
n = p and the matrix of predictors X is the identity matrix. The ridge and lasso minimizations then decouple coordinate by coordinate and have simple closed forms: ridge shrinks every y_j by the same factor 1/(1 + lambda), while the lasso soft-thresholds each y_j (a piecewise function that sets small values exactly to zero), which makes for illustrative plots. See the sketch below.
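A minimal sketch of those closed forms (the function names and the test vector are just illustrative):

```python
import numpy as np

def ridge_identity(y, lam):
    # Minimizes sum_j (y_j - b_j)^2 + lam * sum_j b_j^2 when X = I:
    # every coordinate is shrunk by the same factor, none becomes exactly zero.
    return y / (1.0 + lam)

def lasso_identity(y, lam):
    # Minimizes sum_j (y_j - b_j)^2 + lam * sum_j |b_j| when X = I:
    # soft-thresholding by lam / 2 sets small coordinates exactly to zero.
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)

y = np.array([3.0, -1.5, 0.4, -0.2])
print(ridge_identity(y, lam=1.0))   # every coordinate shrunk, none exactly zero
print(lasso_identity(y, lam=1.0))   # the two small coordinates are set exactly to zero
```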
Bayesian Interpretation (2 plots)
Ridge: beta_R is the posterior mean (and also the posterior mode), under a normal prior on beta (a Gaussian bell curve centered at 0, i.e., the prior expects many coefficients to be small but slightly non-zero, and few to be exactly zero).
Lasso: beta_L is the posterior mode, under a Laplace (double-exponential) prior on beta (sharply peaked at zero and falling off quickly, i.e., the prior expects most of the coefficients to be exactly zero, with a few clearly non-zero). The prior densities are written out below.
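The two prior densities, up to scale constants that depend on lambda and the noise level (stated here as a sketch with a generic scale c):

```latex
g(\beta_j) \propto \exp\!\left(-\frac{\beta_j^2}{2c}\right)
  \qquad \text{(normal prior $\Rightarrow$ ridge estimate = posterior mean and mode)}

g(\beta_j) \propto \exp\!\left(-\frac{|\beta_j|}{c}\right)
  \qquad \text{(Laplace prior $\Rightarrow$ lasso estimate = posterior mode)}
```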