Linear Model Selection & Regularization Flashcards
Feature Selection for Linear Regression
- Best Subset Selection - fit all possible combinations of features (usually not computationally feasible; a theoretical optimum unless p is small or you have a very long time).
- Forward Stepwise Selection - start with no variables and keep adding, one at a time, the variable that most improves the fit. Then select the best of these models using CV error (see the regsubsets sketch after this list).
- Backward Stepwise Selection - start with all variables and remove the least useful one at a time; again pick among the resulting models by CV error.
- Regularization Techniques (Ridge, Lasso)
- Feature Selection - Lasso, even though it is a regularization technique, also performs feature selection.
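A minimal sketch of forward and backward stepwise selection with the regsubsets() function from the leaps library (the data frame df and response y are illustrative placeholders):

```r
library(leaps)

# df is a hypothetical data frame with a response column y and candidate predictors
fwd <- regsubsets(y ~ ., data = df, nvmax = 15, method = "forward")
bwd <- regsubsets(y ~ ., data = df, nvmax = 15, method = "backward")

# summary() shows which variables enter at each model size, plus RSS,
# adjusted R^2, Cp, and BIC for comparing the candidate models
summary(fwd)$which
summary(fwd)$adjr2
```

Note that regsubsets() only reports in-sample criteria; choosing among the candidate model sizes by CV error means refitting the candidates on the training folds yourself.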
Ways of selecting a model when you have tried a bunch of them
- Use CV (Preferred)
- Look at metrics such as AIC, BIC, or Adjusted R-squared, which are goodness-of-fit metrics that pay a price for unnecessary variables. *These were used when computing power was an issue; nowadays use CV.
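For example, these criteria can be read straight off the regsubsets() summary from the stepwise sketch above and compared across model sizes:

```r
fit_summary <- summary(fwd)  # fwd from the hypothetical stepwise sketch above

# each criterion charges a price for extra variables; pick the model size
# that maximizes adjusted R^2 or minimizes Cp / BIC
which.max(fit_summary$adjr2)
which.min(fit_summary$cp)
which.min(fit_summary$bic)
```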
Regularization Techniques for Linear Regression
- Ridge - adds a penalty on large coefficients controlled by a tuning parameter, so the tuning parameter must be selected carefully (typically via CV). Variables must also be centered/scaled prior to regularization.
- Lasso - shrinks some coefficients to exactly zero, so it also performs feature selection (often preferred). The tuning parameter must likewise be chosen carefully and variables scaled/centered.
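A minimal sketch of both fits with the glmnet package (df, x, and the lambda value 0.1 are illustrative); glmnet standardizes predictors by default, which covers the centering/scaling concern:

```r
library(glmnet)

# glmnet expects a numeric predictor matrix and a response vector;
# predictors are centered/scaled internally (standardize = TRUE by default)
x <- model.matrix(y ~ ., data = df)[, -1]  # drop the intercept column
y <- df$y

ridge_fit <- glmnet(x, y, alpha = 0)  # alpha = 0 -> ridge penalty
lasso_fit <- glmnet(x, y, alpha = 1)  # alpha = 1 -> lasso penalty

# coefficients at one arbitrary value of the tuning parameter lambda;
# for the lasso, some of these are exactly zero
coef(lasso_fit, s = 0.1)
```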
Disadvantage of Ridge Regression
Ridge keeps all p variables in the model, although it shrinks their coefficients. This creates problems for model interpretation. Lasso, by contrast, sets some coefficients exactly to zero.
Ridge vs. Lasso
Lasso tends to do better when many of the variables truly have no relationship (zero coefficients) with the outcome variable.
Ridge will outperform when many of the variables are in fact related to the outcome variable.
How to select the tuning parameter
Plot the k-fold CV error against different values of the tuning parameter and pick the value that minimizes it (see the sketch below).
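cv.glmnet() automates exactly this: it computes k-fold CV error over a grid of tuning-parameter values and plot() draws the CV error curve (a sketch reusing the hypothetical x and y from the glmnet example above):

```r
set.seed(1)
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

plot(cv_lasso)                       # CV error vs. log(lambda)
best_lambda <- cv_lasso$lambda.min   # lambda with the lowest CV error
coef(cv_lasso, s = "lambda.min")
```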
Dimension Reduction vs. Regularization
Regularization involves controlling the coefficients (with lasso, even setting some to zero, which has the effect of reducing dimensions). Dimension reduction transforms the predictors into a smaller set of variables.
Dimension Reduction
Transform the predictors into a smaller set of linear combinations that substantially explain the variability of all the original variables.
Uses of PCA
Can be used for dimension reduction or unsupervised learning. Creates linear combinations of predictors that are uncorrelated with each other.
Things to consider when doing PCA
- Center/scale prior to using PCA.
- Select the number of principal components by plotting CV error vs. the number of components.
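A sketch of principal components regression with the pls package (df and the formula are illustrative); scale = TRUE handles centering/scaling and validation = "CV" produces the CV error curve for choosing the number of components:

```r
library(pls)

set.seed(1)
pcr_fit <- pcr(y ~ ., data = df, scale = TRUE, validation = "CV")

summary(pcr_fit)                            # CV error by number of components
validationplot(pcr_fit, val.type = "MSEP")  # plot CV error vs. number of components
```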
PCA vs Partial Least Squares
PCA creates new variables that substantially explain the variability in the original variables.
PLS creates new variables that explain variability but are ALSO related to the response, like a supervised version of PCA.
Like PCA, with PLS you must choose the number of components and center/scale. PLS is not clearly superior to PCA; in practice they are often a wash.
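The PLS fit looks nearly identical (same hypothetical data as the PCR sketch); only the fitting function changes, and the number of components is again read off the CV error plot:

```r
set.seed(1)
pls_fit <- plsr(y ~ ., data = df, scale = TRUE, validation = "CV")

validationplot(pls_fit, val.type = "MSEP")  # choose the number of components from the CV curve
predict(pls_fit, newdata = df, ncomp = 5)   # e.g. predict using 5 components
```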
What is high dimensionality
When p >= n. Linear and logistic regression do not work well in this scenario.
Interpreting models in high dimensionality
In this scenario, you cannot be confident that you have selected the "best" model, only that you have a "good" model. Also, traditional measures of model fit do not apply (R-squared, p-values, etc.).
What gives you the optimal subset of variables for a linear model
Best subset selection, available via the regsubsets() function in the leaps library.
When do you want to use best subset selection
Best subset selection is computationally expensive; however, when p < 10 it is usually fine to run best subset.
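A minimal best subset sketch with regsubsets() (hypothetical df with fewer than 10 predictors); the default method is an exhaustive search over all subsets:

```r
library(leaps)

# exhaustive search over all subsets of up to 9 predictors (feasible when p is small)
best <- regsubsets(y ~ ., data = df, nvmax = 9)
best_summary <- summary(best)

best_summary$which           # variables included in the best model of each size
which.min(best_summary$bic)  # one way to pick the final model size (or use CV)
```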