Stepwise Selection Flashcards
Definition
A statistical method to simplify a model by retaining only those predictors that add predictive power
Best subset selection is performed by fitting, for each size k = 1, 2, ..., p (where p is the total number of predictors being considered), all p choose k models containing exactly k predictors and picking the one with the smallest deviance. A single best model is then selected from these finalists using a metric such as AIC. In total, 2^p models are fit, which becomes a very large search space as p increases.
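For concreteness, a minimal R sketch of a best subset search, assuming the built-in mtcars data and AIC as the comparison metric (with 4 candidate predictors it fits all 2^4 = 16 models):

```r
# Best subset (illustrative): fit every subset of the candidate predictors
# and keep the model with the lowest AIC.
predictors <- c("wt", "hp", "disp", "drat")        # p = 4, so 2^4 = 16 models

subsets <- c(list(character(0)),                   # start with the null model
             unlist(lapply(seq_along(predictors),
                           function(k) combn(predictors, k, simplify = FALSE)),
                    recursive = FALSE))

fits <- lapply(subsets, function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  lm(as.formula(paste("mpg ~", rhs)), data = mtcars)
})

aics <- sapply(fits, AIC)
formula(fits[[which.min(aics)]])                   # lowest-AIC predictor subset
```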
Notes - Information Criteria
These criteria are U-shaped as a function of model flexibility; lower values are preferred. They weigh the fit improvement contributed by the predictors against the number of predictors used.
p = # predictors in the model; n = # training observations
Akaike Information Criterion or AIC: SSE* + 2p
-For every additional predictor added to the model, AIC experiences a net decrease when its first term decreases by more than 2
-Equivalently, since the deviance is -2 times the loglikelihood (up to a constant), adding a variable requires the loglikelihood to increase by more than 1 per parameter added.
Bayesian Information Criterion BIC: SSE* + ln(n)p
-For reasonably sized training sets (any n >= 8), ln(n) exceeds 2, so BIC charges more per added parameter than AIC.
-For every additional predictor added to the model, BIC experiences a net decrease when its first term decreases by more than ln(n)
-Equivalently, the required per-parameter increase in the loglikelihood is ln(n)/2.
-BIC is expected to reach its minimum at a smaller p than AIC does; models with fewer predictors are favored when using BIC
SSE* (also referred to as the deviance) plays the same role as the SSE; as p increases, SSE* decreases (it never increases on the training data), so the penalty term is what keeps the criterion from always favoring the largest model. (See the sketch after these notes.)
Want lowest AIC/BIC!!
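As a quick check on the formulas above, a small R sketch (the mtcars data and the two-predictor model are assumptions for illustration) that recomputes AIC and BIC from the loglikelihood; note that R's parameter count includes the error variance, so p here equals the number of coefficients plus one:

```r
# AIC/BIC by hand vs. R's built-ins: the only difference between the two
# criteria is the per-parameter penalty, 2 versus ln(n).
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
p  <- attr(ll, "df")    # parameters counted by R (coefficients + error variance)
n  <- nobs(fit)

c(manual = -2 * as.numeric(ll) + 2 * p,      builtin = AIC(fit))
c(manual = -2 * as.numeric(ll) + log(n) * p, builtin = BIC(fit))
log(n) > 2              # TRUE once n >= 8: BIC penalizes extra parameters more
```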
Model Performance
Want lowest AIC/BIC
Compare to other models using test RMSE or another performance metric.
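A minimal sketch of such a comparison, assuming an illustrative train/test split of mtcars and RMSE as the metric:

```r
# Compare two candidate models on held-out data using test RMSE.
set.seed(1)
idx   <- sample(nrow(mtcars), size = 22)   # roughly a 2/3 training split
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit_small <- lm(mpg ~ wt,      data = train)
fit_big   <- lm(mpg ~ wt + hp, data = train)

rmse <- function(fit) sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
c(small = rmse(fit_small), big = rmse(fit_big))   # lower test RMSE is better
```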
Notes
Obeys the hierarchical principle: an interaction term may only appear in the model if its component predictors are also included
Factors:
–By default, the dummy variables of a factor are treated as one unit: the factor is added or dropped as a whole.
–Use binarization so that stepwise selection views the dummy variables as unrelated, standalone predictors that may be added/dropped individually.
drop1() -> reproduces the first round of backward selection with AIC (see the sketch below)
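A minimal sketch of both points, assuming the built-in iris data (Species is a three-level factor) and base R's drop1() and model.matrix():

```r
# drop1() reproduces the first round of backward selection with AIC: it reports
# the AIC of the current model and of each single-term deletion.
fit_full <- lm(Sepal.Length ~ ., data = iris)
drop1(fit_full)            # the factor Species is treated as one term

# Binarization: expand the factor into standalone dummy columns so a stepwise
# routine can add/drop each dummy individually.
iris_bin <- as.data.frame(model.matrix(Sepal.Length ~ ., data = iris)[, -1])
iris_bin$Sepal.Length <- iris$Sepal.Length
drop1(lm(Sepal.Length ~ ., data = iris_bin))   # each dummy is now its own term
```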
Forward Selection
Starting with the null model (no predictors), forward selection adds predictors one at a time, at each step adding the 'best' available predictor, and stops when no addition yields further improvement; 'best' is judged by the chosen information criterion.
Each iteration compares the current model with each +1 predictor option available and picks the best one (the one that produces the lowest AIC/BIC).
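A minimal sketch with base R's step(), assuming the mtcars data and an illustrative upper scope; k = 2 gives the AIC penalty, while k = log(n) would give BIC:

```r
# Forward selection: start from the null model and add one predictor at a time,
# choosing the addition that lowers AIC the most, until no addition helps.
null_fit <- lm(mpg ~ 1, data = mtcars)

fwd <- step(null_fit,
            scope = mpg ~ wt + hp + disp + drat + qsec,   # upper scope
            direction = "forward",
            k = 2,            # AIC penalty; use k = log(nrow(mtcars)) for BIC
            trace = 1)        # print each step's comparison table
formula(fwd)                  # predictors retained by forward selection
```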
Backward Selection
Starting with the full model (all predictors), backward selection drops predictors one at a time, at each step dropping the 'worst' remaining predictor, and stops when no deletion yields further improvement; 'worst' is judged by the chosen information criterion.
Each iteration compares the current model with each -1 predictor option available and picks the best one (the one that produces the lowest AIC/BIC). An interaction term must be dropped before its component predictors can be evaluated for dropping (hierarchical principle).
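A corresponding backward sketch with step(), again assuming mtcars; because the starting model includes the interaction wt:hp, step() will not offer wt or hp for deletion until the interaction itself has been dropped:

```r
# Backward selection: start from a large model and drop one term at a time,
# choosing the deletion that lowers AIC the most, until no deletion helps.
full_fit <- lm(mpg ~ wt * hp + disp + qsec, data = mtcars)

bwd <- step(full_fit,
            direction = "backward",
            k = 2,            # AIC penalty; use k = log(nrow(mtcars)) for BIC
            trace = 1)
formula(bwd)                  # predictors retained by backward selection
```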
Procedure Comparisons
Both perform variable selection, which helps avoid overfitting to the training data, especially when the number of observations is small relative to the number of predictors.
Both are greedy, unlike best subset selection which checks all possible predictor combinations to find the best one.
An advantage of forward selection is that it can be used in high-dimensional settings where p > n, in which the full model cannot even be fit.
An advantage of backward selection is that, because every step evaluates each predictor in the presence of all the others, it maximizes the potential of finding predictors that complement one another.
Forward selection with BIC tends to result in fewer predictors; backward selection with AIC tends to result in more predictors.
vs Best Subset
-Stepwise selection is a computationally more efficient alternative to best subset selection, since it considers a much smaller set of models. However, it is not guaranteed to find the best possible model out of all 2^p candidates.
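For a sense of scale (illustrative arithmetic): forward selection fits the null model and then compares p, then p - 1, then p - 2, ... candidate additions, roughly 1 + p(p + 1)/2 fits in total, versus 2^p for best subset.

```r
# Search-space size for p = 20 candidate predictors.
p <- 20
2^p                    # 1,048,576 models examined by best subset selection
1 + p * (p + 1) / 2    # 211 model fits for a full run of forward selection
```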