Shrinkage and regularization Flashcards
Shrinkage estimation
“Roughly speaking, maximum likelihood provides nearly unbiased estimates of nearly minimum variance, and does so in an automatic way.
Again speaking roughly, unbiasedness can be an unaffordable luxury when there are hundreds or thousands of parameters to estimate at the same time. The James–Stein estimator made this point dramatically in 1961. It begins the story of shrinkage estimation, in which deliberate biases are introduced to improve overall performance, at a possible danger to individual estimates”
The James–Stein Estimator
“The James–Stein estimator is a statistical method for improving the estimation of a vector of means in high-dimensional settings. It is particularly notable because it demonstrates that the traditional maximum likelihood estimator (MLE) can be outperformed in terms of mean squared error (MSE) under certain conditions.
The key insight is that in high-dimensional settings (p > 2), the MLE has high variance. By shrinking the estimates toward a central point, the James–Stein estimator reduces the variance more than it increases the bias, leading to an overall improvement in MSE.
The JS estimator has smaller expected squared risk than the MLE when N >= 4.
Baseball players example”
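A minimal sketch (mine, not from the text) of the James–Stein estimate that shrinks each observed mean toward the grand mean, as in the baseball example; the function name and the value of sigma2 are illustrative assumptions, and the (N - 3) factor is the version that requires N >= 4.

```python
import numpy as np

def james_stein(y, sigma2):
    """James-Stein estimate of a vector of means, shrinking toward the grand mean.

    y      : observed averages (one per player/parameter), length N >= 4
    sigma2 : sampling variance of each y_i, assumed known here
    """
    N = len(y)
    ybar = y.mean()
    S = np.sum((y - ybar) ** 2)             # total squared deviation from the grand mean
    shrink = 1.0 - (N - 3) * sigma2 / S     # shrinkage factor applied to each deviation
    return ybar + shrink * (y - ybar)

# Illustrative batting averages: every estimate is pulled toward the overall mean,
# with the extreme players pulled the most.
y = np.array([0.345, 0.333, 0.311, 0.222, 0.205, 0.186])
print(james_stein(y, sigma2=0.004))
```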
Ridge regression
“Also known as the L2-norm penalty! (Shrinks coefficients toward 0, but not exactly to 0.)
Linear regression, perhaps the most widely used estimation technique, is based on a version of the MLE.
Ridge regression is a shrinkage method designed to improve the estimation of β in linear models. By transformations we can standardize the linear model (7.28) so that the columns of X each have mean 0 and sum of squares 1.
This puts the regression coefficients on comparable scales.
Ridge regression amounts to an increased prior belief that β lies near 0.
We have (deliberately) introduced bias, and the squared bias term counteracts some of the advantage of reduced variability.
In such situations the scientist is often looking for a few interesting predictor variables hidden in a sea of uninteresting ones: the prior belief is that most of the β values lie near zero. Biasing the maximum likelihood estimates β̂ toward zero then becomes a necessity.
Pitfall of shrinkage estimation:
Shrinkage estimators work against cases that are genuinely outstanding.
The L2 norm of a vector x is the square root of the sum of the squares of its components. It is also known as the Euclidean norm, as it represents the straight-line (Euclidean) distance from the origin to the point x in space.”
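A minimal sketch (my own, not the book's code) of ridge regression via its closed-form solution, after standardizing the columns of X to mean 0 and sum of squares 1 as described above; the function name and the penalty parameter lam are assumptions for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients: minimize ||y - X b||^2 + lam * ||b||^2."""
    # Standardize: each column of X gets mean 0 and sum of squares 1.
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    yc = y - y.mean()
    p = Xs.shape[1]
    # Closed form: (X'X + lam I)^{-1} X'y ; lam > 0 shrinks every coefficient toward 0.
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
```

Larger lam means stronger shrinkage toward zero; lam = 0 recovers ordinary least squares, and for any finite lam the coefficients shrink but never become exactly zero.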
Best-subset regression
“The idea is to build a model using a subset of the variables; in fact the smallest subset that adequately explains the variation in the response is what we are after, both for inference and for prediction purposes.
Step 3 is easy to state, but requires a lot of computation. For p much larger than about 40 it becomes prohibitively expensive to perform exactly, a so-called “NP-complete” problem because of its combinatorial complexity (there are 2^p subsets).
As a result, more manageable stepwise procedures were invented.”
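To make the combinatorial cost concrete, here is a hypothetical brute-force sketch (not from the source) that scores every one of the 2^p subsets by residual sum of squares; it is only feasible for small p.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    """Exhaustive best-subset search: try all 2^p subsets, keep the best RSS per size."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, list(cols)]])   # intercept + chosen columns
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if k not in best or rss < best[k][0]:
                best[k] = (rss, cols)
    return best   # best subset of each size; a separate criterion picks among sizes
```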
Forward stepwise regression
“is a simple modification of best-subset, with the modification occurring in step 3.
It starts with the null model, here an intercept, and adds variables one at a time. Even with large p, identifying the best variable to add at each step is manageable, and can be distributed if clusters of machines are available.”
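A greedy sketch under the same assumptions as the previous block: start from the intercept-only model and at each step add the single variable that most reduces the residual sum of squares.

```python
import numpy as np

def forward_stepwise(X, y, max_vars):
    """Greedy forward selection: add one variable at a time, lowest RSS first."""
    n, p = X.shape
    chosen = []
    for _ in range(min(max_vars, p)):
        best_rss, best_j = None, None
        for j in range(p):
            if j in chosen:
                continue
            Xj = np.column_stack([np.ones(n)] + [X[:, c] for c in chosen + [j]])
            beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
            rss = np.sum((y - Xj @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        chosen.append(best_j)
    return chosen   # variables in the order they were added
```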
Lasso regression
“L1 norm: selection AND shrinkage.
A big difference, however, is that for the lasso, the solution typically has many of the β̂_j equal to zero, while for ridge they are all nonzero.
Hence the lasso does variable selection and shrinkage, while ridge only shrinks.
Since the constraint in the lasso treats all the coefficients equally, it usually makes sense for all the elements of x to be in the same units. If not, we typically standardize the predictors beforehand so that each has variance one.
The L1 norm of a vector x is the sum of the absolute values of its components. It is also known as the Manhattan norm or taxicab norm, as it represents the distance you would travel in a grid-like city (e.g., Manhattan).”
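A small illustration (mine, using scikit-learn rather than anything from the text) contrasting the two penalties: on the same standardized data the lasso sets many coefficients exactly to zero, while ridge only shrinks them. The simulated data and the alpha values are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))            # predictors already on the same scale
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]           # only a few truly nonzero coefficients
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty: selection and shrinkage
ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty: shrinkage only

print("lasso coefficients set to zero:", np.sum(lasso.coef_ == 0))
print("ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))   # typically 0
```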