Chapter 6 - Model Selection Flashcards
Define model selection
In linear regression, training error (RSS) always decreases as p increases, and when n < p there is no unique least squares solution, so we need a way to select a smaller set of predictors
Best Subset Selection
overview: compare all models with k predictors (there are p choose k of these) and choose the one with the smallest RSS. Do this for every possible k, then select the optimal k (# of predictors) by minimizing CV error
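The procedure above can be sketched in plain NumPy. This is a hypothetical illustration (the function name `best_subset` and the no-intercept least squares fit are my choices, not from the source): for each size k it scans every subset, keeps the one with lowest RSS, and returns the per-size winners, leaving the choice of k to CV or a criterion.

```python
from itertools import combinations
import numpy as np

def best_subset(X, y):
    """Return {k: (best_subset_of_column_indices, RSS)} for k = 1..p."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in combinations(range(p), k):
            Xs = X[:, subset]
            # least squares fit restricted to this subset of predictors
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        best[k] = (best_vars, best_rss)
    return best
```

Note the cost: the inner loop runs 2^p - 1 fits in total, which is exactly why the section below looks for cheaper alternatives.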
Alternatives to minimizing CV error (3)
minimize: 1) Akaike Information Criterion (AIC) 2) Bayesian Information Criterion (BIC) — both penalize models with extra predictors; maximize: 3) Adjusted R^2 = 1 - (RSS/(n-k-1))/(TSS/(n-1))
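The adjusted R^2 formula above translates directly to code. A minimal sketch (the function name `adjusted_r2` is my own; k counts predictors excluding the intercept, matching the n-k-1 denominator):

```python
import numpy as np

def adjusted_r2(y, y_hat, k):
    """Adjusted R^2 = 1 - (RSS/(n-k-1)) / (TSS/(n-1)),
    where k is the number of predictors (intercept excluded)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))
```

Unlike plain R^2, this can decrease when a useless predictor is added, because the RSS drop is outweighed by the smaller n-k-1 denominator.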
How do the alternatives to computing CV error compare? (3)
1) much less computationally expensive than CV 2) motivated by asymptotic arguments and reliant on model assumptions (e.g., normality of errors) 3) have equivalent versions for other models (e.g., logistic regression)
2 problems with best subset selection, and how to mitigate them
1) very computationally expensive
2) for a fixed k there are very many candidate models, so the chance of overfitting increases (i.e., the selected model has high variance — it changes a lot between training sets)
Solution: restrict our search space for the best model (reduces model variance at the expense of higher bias)
Forward Stepwise Selection
1) start with the null model (no predictors) 2) at each step, augment the current predictor set with the single additional predictor that is best (minimizes RSS / highest R^2) 3) among the resulting models of each size, select a single best model using CV error, Cp, AIC, BIC, or adjusted R^2
note: results not the same as best subset!
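The three steps above can be sketched as a greedy loop. A hypothetical NumPy implementation (names and the no-intercept fit are my assumptions): it fits only p + (p-1) + … + 1 models rather than 2^p, which is the computational win over best subset.

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedy path: at each step, add the predictor that most reduces RSS.
    Returns [(selected_columns, RSS), ...], one entry per model size."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((tuple(selected), best_rss))
    # a size along this path would then be chosen by CV, Cp, AIC, BIC,
    # or adjusted R^2 (step 3 of the card above)
    return path
```

Because each step only extends the previous set, the path is nested, which is why its results can differ from best subset selection.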
Backward Stepwise Selection
1) start with full model {x1, …., xP}
2) at each step remove the one predictor whose removal causes the smallest increase in RSS (equivalently, remove the predictor that is least significant in a t-test)
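The backward pass mirrors the forward one. A minimal sketch under the same assumptions (NumPy, no intercept, names mine); note it needs the full model to be fittable, which is the p < n restriction discussed below:

```python
import numpy as np

def backward_stepwise(X, y):
    """Greedy path: at each step, drop the predictor whose removal
    increases RSS the least. Requires n > p for the initial full fit."""
    n, p = X.shape
    selected, path = list(range(p)), []
    while len(selected) > 1:
        best_rss, drop_j = np.inf, None
        for j in selected:
            cols = [c for c in selected if c != j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:  # smallest RSS after removal = least harmful drop
                best_rss, drop_j = rss, j
        selected.remove(drop_j)
        path.append((tuple(selected), best_rss))
    return path
```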
Comparison of Forward and Backward Stepwise Selection
backward selection cannot be applied when p > n (the full model cannot be fit by least squares), and the two procedures do not have to give the same result
Alternative Selection Methods
mixed - do forward selection, but at every step, remove variables that are no longer necessary
forward stagewise selection - modify/transform predictors after each step (the span of X matters). decreases the variance of the procedure but increases bias