L6: Model Selection and Regularisation Flashcards
Learning outcomes:
- Understand the algorithms, such as stepwise selection and ridge/lasso regression.
- Apply those algorithms to further improve modelling accuracy.

Unit learning outcomes:
- Evaluate the limitations, appropriateness and benefits of data analytics methods for given tasks;
- Design solutions to real-world problems with data analytics techniques.
If the number of variables p in X is large relative to the number of observations n (that is, n < p), what does this do to our resulting model?
The variance of the coefficient estimates is infinite and there is no unique OLS coefficient estimate (illustrated numerically below).
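A quick numerical illustration of the n < p case (a minimal numpy sketch with purely synthetic data; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 10          # fewer observations than predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X'X is singular when p > n, so the normal equations (X'X) beta = X'y
# do not have a unique solution.
print(np.linalg.matrix_rank(X.T @ X))   # at most n (= 5), never p (= 10)

# lstsq returns the minimum-norm solution: one of infinitely many
# coefficient vectors that all fit the training data perfectly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.abs(X @ beta - y).max())       # residuals are (numerically) zero
```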
We have many variables in X; what does removing the variables that are unrelated to Y do for the model?
Setting the coefficients of these variables to 0 reduces the unnecessary complexity of the model and increases its interpretability.
What kind of method is used to remove the unrelated variables?
Feature Selection
What are the three classes of selection methods?
- Subset Selection
- Shrinkage (regularisation)
- Dimension Reduction
Which subset selection algorithm fits a separate model for every possible combination of the p predictors (all 2^p models) and then picks the best one?
Best Subset Selection
Why is Best Subset Selection impractical in some cases?
Because the algorithm fits every possible model: with p predictors there are 2^p of them, so the search quickly becomes computationally infeasible as p grows. In addition, searching such a huge model space makes it easy to find a model with a very low training error that is simply overfitting and will not generalise to the test dataset.
TL;DR: large p = costly computation, and the low training error found may just be overfitting (see the sketch below).
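To make the cost concrete, here is a minimal Python sketch of best subset selection (the function name and return structure are my own illustration); note the loop over all 2^p - 1 non-empty subsets:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each size k, fit all C(p, k) models and keep the one with lowest RSS."""
    n, p = X.shape
    best = {}                                 # k -> (best subset of size k, RSS)
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if k not in best or rss < best[k][1]:
                best[k] = (cols, rss)
    return best  # the final choice among sizes should use CV, Cp, AIC or BIC
```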
What is the pseudo-algorithm for Forward Stepwise Selection?
- Let M0 denote the null model, which contains no predictors.
- For k = 0, 1, …, p-1:
  a. Consider all p-k models that augment the predictors in Mk with one additional predictor.
  b. Choose the best among these p-k models (smallest RSS or highest R2) and call it Mk+1.
- Select the single best model from among M0 … Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2. A Python sketch of this procedure follows below.
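A minimal Python sketch of forward stepwise selection (the function name and return structure are my own illustration; models are compared by training RSS, as in step b):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedily grow M_1 ... M_p, adding the predictor that most reduces RSS."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:                       # try each unused predictor
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)                   # M_{k+1} = M_k plus best predictor
        remaining.remove(best_j)
        path.append((list(selected), best_rss))
    return path  # the final pick among M_0..M_p should use CV, Cp, AIC or BIC
```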
Forward Stepwise Selection is not guaranteed to find the best possible model out of all 2^p models. Why?
Because the best k-variable model need not contain the variables of the best (k-1)-variable model, and forward stepwise can never discard a variable once it has been added.
E.g. in a model with X1, X2, X3, suppose the best 1-variable model contains X1 only, while the best 2-variable model contains X2 and X3. Forward stepwise selection would then necessarily include X1 in its 2-variable model (X1 plus either X2 or X3), which is not optimal.
Why can Forward Stepwise Selection still be used in high-dimensional settings (p > n)?
Because we can restrict the search to the submodels M0 to Mn-1 rather than going all the way to Mp.
If p ≥ n there is no unique least squares solution, so limiting the number of variables in the submodels ensures that each one can still be fit uniquely.
What is Backward Stepwise Selection?
It works like Forward Stepwise Selection but in reverse: it starts with the full model containing all p predictors and removes the least useful predictor one at a time, down to the null model, before selecting the best model along that path. A usage sketch follows below.
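Both directions are available off the shelf in scikit-learn; a short usage sketch (the data and n_features_to_select value are purely illustrative):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=100)  # two informative predictors

# direction="backward" starts from the full model and greedily drops the
# predictor whose removal hurts cross-validated performance the least.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward"
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the retained predictors
```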
Why is R squared not suitable for selecting the best model out of models with different numbers of predictors?
The R squared increases monotonically as the number of features included in the model increases, so it always favours the largest model.
Further, R squared (like training RSS) measures training error, so a high value may simply reflect over-fitting.
Therefore we should evaluate models on a validation set instead, or use a criterion that adjusts for model size.
What are suitable measurements for model performance?
Mallow’s Cp
Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Adjusted R Squared
These criteria add a penalty for each extra variable, so the selected model balances goodness of fit against the number of variables (complexity). See the formulas below.
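For reference, these criteria can be written as follows (the forms used in ISLR; RSS is the training residual sum of squares, TSS the total sum of squares, d the number of predictors, and σ̂² an estimate of the error variance). Lower is better for Cp, AIC and BIC; higher is better for adjusted R squared:

```latex
C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)
\qquad
\mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

\mathrm{BIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + \log(n)\,d\hat{\sigma}^2\right)
\qquad
\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}
```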
What are the two Shrinkage methods that we cover?
Lasso and Ridge Regression
How does shrinkage differ from subset selection?
Rather than removing variables completely, shrinkage fits a model containing all p predictors but 'shrinks' the coefficients of the less important variables towards zero. The lasso can shrink some coefficients to exactly zero (so it also performs variable selection), whereas ridge regression only shrinks them close to zero. The penalised objectives and a short demo are sketched below.
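The two methods differ only in the penalty added to the least squares objective (λ ≥ 0 is a tuning parameter):

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
```

A minimal scikit-learn sketch of the difference (the data and alpha values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only 2 real signals

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks every coefficient towards zero but keeps all of them;
# the lasso drives the coefficients of the noise predictors exactly to zero.
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
```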