Regularized Linear Models Flashcards
DS
Name and briefly explain three regularized models.
Ridge Regression: linear regression that adds an L2-norm penalty (sum of squared coefficients) as a regularization term to the cost function.
Lasso Regression: linear regression that adds an L1-norm penalty (sum of absolute coefficient values) as a regularization term to the cost function; it can also perform feature selection.
Elastic Net: linear regression that adds a mix of both the L1 and L2 penalties to the cost function.
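A minimal sketch (assuming scikit-learn and a made-up toy dataset; the alpha values are illustrative only) of how the three models are typically instantiated:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# toy data: 5 features, only 3 of which actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty: sum of squared coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty: sum of absolute coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 penalties

print("ridge:", ridge.coef_)
print("lasso:", lasso.coef_)   # the useless features tend to be driven exactly to zero
print("enet: ", enet.coef_)
```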
What hyperparameters can be tuned in linear models? Explain how they affect model building.
You can tune the weight of the regularization term for regularized models (typically denoted alpha), which controls how strongly the model shrinks the coefficients.
alpha=0 –> regularized model is identical to linear regression
alpha –> inf –> regularized model shrinks all coefficients toward zero and approaches a constant value.
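A small sketch (assuming scikit-learn, with a synthetic dataset; the alpha grid is illustrative) of how alpha controls the shrinkage in Ridge:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=50)

print("OLS:            ", LinearRegression().fit(X, y).coef_)
for alpha in [0.0, 1.0, 100.0, 1e6]:
    print(f"Ridge alpha={alpha:g}:", Ridge(alpha=alpha).fit(X, y).coef_)
# alpha=0 reproduces ordinary least squares; as alpha grows, all coefficients
# shrink toward zero and the model approaches a constant (intercept-only) prediction.
```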
When should you use no regularization vs. Ridge vs. Lasso vs. ElasticNet?
Regularized models tend to outperform non-regularized models, so it is suggested that you at least try ridge regression.
Lasso can be effective when you want automatic feature selection in order to create a simpler model, but it can be risky since it may behave erratically and remove features that contain useful signal.
ElasticNet is a blend of ridge and lasso, and it can be used to the same effect as lasso with less erratic behavior.
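A rough comparison sketch (assuming scikit-learn; the synthetic dataset and alpha values are arbitrary, not recommendations) showing how you might cross-validate all four options before choosing:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

# many features, only a few informative: the kind of data where lasso/elastic net shine
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

models = {
    "linreg": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
    "enet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:7s} mean CV R^2 = {scores.mean():.3f}")
```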
Explain how the penalty term works in regularized regression.
Say we have two points in the training set. We can fit a line perfectly through these two points. However, this fit is very likely overfit (high variance, low bias). When lambda=0, we get the same overfit line, but as lambda approaches infinity, the slope becomes flatter (lower variance, higher bias).
THE WHOLE POINT OF REGULARIZED REGRESSION IS TO IMPROVE ML RESULTS WHEN USING SMALL SAMPLE SIZES THAT RESULT IN TRAINING SET OVERFITS, BY DESENSITIZING/REDUCING THE FLEXIBILITY OF THE MODEL.
Simple linear regression loss function: SumSquaredResiduals (SSR).
Ridge regression loss function: SSR + lambda*(slope^2).
KEY 1: By minimizing SSR + lambda*(slope^2) we get a BIASED FIT to the TRAINING data, which is an IMPROVED fit to the TEST data.
HOW? As lambda increases, the training error increases and the SLOPE BECOMES FLATTER (the slope asymptotically approaches 0, a horizontal line, as lambda –> inf), which effectively adds bias to the model, i.e. reduces the flexibility of the model w.r.t. the training data.
Example with two data points, where the SS residuals is zero:
linreg: y = b0 + b1*x = 0.4 + 1.3x; minimizing the linreg loss gives SSR = 0 because the line overlaps the two points
(lambda = 1) ridge loss at this fit:
SSR + lambda*(b1^2)
= 0 + 1*(1.3^2)
= 1.69
(lambda = 2) ridge loss at this fit:
SSR + lambda*(b1^2)
= 0 + 2*(1.3^2)
= 3.38
… as lambda approaches inf, keeping the steep slope costs more and more penalty, so the best ridge fit becomes flatter (introduces bias) even though that raises the SSR part of the loss.
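A quick NumPy check of the arithmetic above, using two hypothetical points chosen so that y = 0.4 + 1.3x fits them exactly:

```python
import numpy as np

x = np.array([0.0, 1.0])
y = np.array([0.4, 1.7])                # both points lie exactly on y = 0.4 + 1.3x
b0, b1 = 0.4, 1.3

ssr = np.sum((y - (b0 + b1 * x)) ** 2)  # 0.0: the line overlaps the two points
for lam in [0.0, 1.0, 2.0]:
    print(f"lambda={lam}: ridge loss = {ssr + lam * b1 ** 2:.2f}")
# lambda=1 -> 1.69 and lambda=2 -> 3.38, matching the arithmetic above
```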
KEY 2: we apply cross-validation to find the optimal lambda, i.e. the one that introduces the optimal amount of bias.
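A sketch of KEY 2 (assuming scikit-learn; the alpha grid bounds and dataset are arbitrary), searching candidate lambdas (called alpha in scikit-learn) by cross-validation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

alphas = np.logspace(-3, 3, 25)               # candidate penalty strengths
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("alpha chosen by cross-validation:", model.alpha_)
```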
The Ridge penalty shrinks the slope (the coefficients) asymptotically toward zero, while the Lasso penalty can shrink some coefficients exactly to zero.
Ridge is better when most variables are useful, while Lasso is better when there are many useless noise variables.
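A sketch (assuming scikit-learn and synthetic data with only a few informative features; alpha values are illustrative) contrasting the two shrinkage behaviors:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 20 features, only 3 carry signal
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

print("ridge coefficients set exactly to zero:", np.sum(ridge.coef_ == 0.0))  # usually 0
print("lasso coefficients set exactly to zero:", np.sum(lasso.coef_ == 0.0))  # usually most of the noise features
```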