L6: Model Selection and Regularisation Flashcards

- Understand the algorithms, such as stepwise selection and ridge/lasso regression
- Apply those algorithms to further improve the modelling accuracy.

Unit learning outcomes:

- Evaluate the limitations, appropriateness and benefits of data analytics methods for given tasks;
- Design solutions to real world problems with data analytics techniques;

1
Q

If the number of variables p in X is large relative to the number of observations, i.e. n < p, what does this do to our resulting model?

A

The variance of the coefficient estimates is infinite and there is no unique OLS coefficient estimate.
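
A tiny NumPy illustration (not from the lecture, just a sketch) of why this happens: when n < p the Gram matrix X'X is rank-deficient, so the normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20                            # fewer observations than predictors (n < p)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))    # at most n (10) < p (20): X'X is singular

# np.linalg.lstsq still returns a fit, but it is just one of infinitely
# many coefficient vectors that reproduce the training data exactly.
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)                              # rank < p, so no unique OLS estimate
```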

2
Q

We have many variables in X. What does removing the variables that are unrelated to Y do for the model?

A

Setting the coefficients of these variables to 0 reduces the unnecessary complexity of the model and increases its interpretability.

3
Q

What do we call the general approach of removing the unrelated variables?

A

Feature Selection

4
Q

What are the three classes of selection methods?

A
  1. Subset Selection
  2. Shrinkage (regularisation)
  3. Dimension Reduction
5
Q

Which subset selection algorithm fits every possible combination of the p predictors and picks the best model among them?

A

Best Subset Selection

6
Q

Why is Best Subset Selection impractical in some cases?

A

Because the algorithm considers every possible model, i.e. all 2^p subsets of the p predictors. Hence, if p, the number of variables in X, is large, this is computationally expensive. Also, searching such a huge model space makes it easy to find a model with low training error that overfits and does not generalise well to the test dataset.

TL;DR: Large p = costly computation, and the enormous search space invites overfitting (low training error, poor test error).
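
As a rough sketch (my own illustration; best_subset and its internals are made up for clarity), the exhaustive enumeration looks like this, fitting 2^p − 1 models with scikit-learn's LinearRegression:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Fit every non-empty subset of columns; keep the lowest-RSS model per size."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):          # C(p, k) models of size k
            idx = list(cols)
            fit = LinearRegression().fit(X[:, idx], y)
            rss = np.sum((y - fit.predict(X[:, idx])) ** 2)
            if k not in best or rss < best[k][1]:
                best[k] = (idx, rss)
    return best    # 2^p - 1 fits in total, which quickly becomes infeasible
```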

7
Q

What is the pseudo-algorithm for Forward Stepwise Selection?

A

In outline (a code sketch follows below):

  1. Start with the null model M0, which contains no predictors
  2. For k = 0, 1, …, p-1:
    a. Consider all p-k models that augment the predictors in Mk with one additional predictor
    b. Choose the best amongst these p-k models (smallest RSS or highest R2) and call it Mk+1
  3. Select the single best model from among M0, …, Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2
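
A minimal Python sketch of the loop above (my own illustration, not from the lecture; forward_stepwise and its variable names are made up): at each step it adds the single predictor that most reduces RSS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedy forward selection: returns one best model (column set, RSS) per size."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:                              # all p - k candidate additions
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_rss))          # this is M(k+1)
    return path    # choose the final size by CV error, Cp, AIC, BIC or adjusted R2
```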
8
Q

Unlike Best Subset Selection, Forward Stepwise Selection is not guaranteed to find the best possible model out of all 2^p candidate models. Why?

A

Because forward stepwise is greedy: once a variable is added it is never removed, so the best model of a given size may be a combination of later variables rather than one that includes the first variable chosen.

E.g. with X1, X2, X3, suppose the best 1-variable model contains X1 only, but the best 2-variable model contains X2 and X3. Forward stepwise selection would then necessarily include X1 plus either X2 or X3 in its 2-variable model, which would not be optimal.

9
Q

In a high-dimensional setting (p > n), why can Forward Stepwise Selection still be used?

A

Because we can restrict the submodel search to M0 up to Mn-1, rather than going all the way to Mp.

If p ≥ n there is no unique least squares solution, so limiting the number of variables in each submodel means every submodel can still be fit uniquely.

10
Q

What is Backward Stepwise Selection?

A

It is the same idea as Forward Stepwise Selection but in reverse: it starts with the full model containing all p variables and removes one variable at a time down to the null model, then selects the best model from that sequence.

11
Q

Why is R squared not suitable for selecting the best model out of models with different numbers of predictors?

A

R squared increases monotonically as the number of features included in the model increases.

Further, R squared is computed from the training error, so a high value may simply reflect over-fitting.

Therefore we should use a validation set (or a criterion that penalises model complexity).
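
A quick illustration of the monotone-R-squared point (mine, not from the card): adding pure-noise predictors never lowers the training R squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(size=n)

for extra in (0, 5, 10, 20):
    noise = rng.normal(size=(n, extra))              # predictors unrelated to y
    Xk = np.hstack([X, noise])
    r2 = LinearRegression().fit(Xk, y).score(Xk, y)  # training R squared
    print(f"{Xk.shape[1]:2d} predictors: R^2 = {r2:.4f}")
# Training R^2 never decreases as useless predictors are added.
```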

12
Q

What are suitable measurements for model performance?

A

Mallow’s Cp

Akaike Information Criterion (AIC)

Bayesian Information Criterion (BIC)

Adjusted R Squared

These methods add a penalty for the use of extra variables. Hence we will get the best model that also has an optimally reduced no. of variables (complexity).
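
A small sketch of how these criteria are computed for a least-squares fit with d predictors, following the standard ISLR-style formulas (my own code; selection_criteria and sigma2_hat are illustrative names, with sigma2_hat an estimate of the error variance):

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """ISLR-style criteria for a least-squares fit with d predictors.
    Smaller is better for Cp/AIC/BIC; larger is better for adjusted R squared."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2
```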

13
Q

What are the two Shrinkage methods that we cover?

A

Lasso and Ridge Regression

14
Q

How does shrinkage differ from subset selection?

A

Rather than removing variables completely, shrinkage fits a model containing all the variables but ‘shrinks’ the coefficient estimates towards zero, with the less important variables shrunk the most.

15
Q

What does ridge regression do?

A

Ridge regression adds a shrinkage penalty, scaled by a regularisation parameter lambda, to the least squares objective.

Together these shrink the estimates of B (the parameters) towards zero and reduce the model complexity.

When lambda = 0, ridge regression produces the least squares estimates of B; as lambda → infinity, the estimates of B approach 0.
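
A brief sketch (mine) using scikit-learn's Ridge, whose alpha argument plays the role of lambda: the coefficients shrink towards zero as alpha grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=10, noise=10.0, random_state=0)

print("OLS", np.abs(LinearRegression().fit(X, y).coef_).mean())      # lambda = 0
for alpha in (1, 100, 10_000):                                       # alpha ~ lambda
    print(alpha, np.abs(Ridge(alpha=alpha).fit(X, y).coef_).mean())  # shrinks towards 0
```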

16
Q

How is ridge regression an improvement on Ordinary Least Squares?

A

For p ≈ n and p > n

  1. OLS estimates are extremely variable
  2. Ridge regression performs well by trading off a small increase in bias for a large decrease in variance
17
Q

Why is ridge regression more computationally effective than Best Subset Selection?

A

Ridge regression keeps a single model and controls its complexity with one added parameter, lambda. Hence, only one model needs to be fit (for each value of lambda), instead of searching over all 2^p subsets.

18
Q

How does LASSO differ from ridge regression?

A

In ridge regression, the coefficients only approach zero (every variable stays in the model).

In LASSO, coefficients are set exactly to zero once lambda is sufficiently large, so LASSO also performs variable selection.
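
A quick comparison sketch (mine) on the same simulated data: scikit-learn's Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of 20 predictors actually influence y.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=1)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print((ridge.coef_ == 0).sum())   # 0: ridge keeps every variable in the model
print((lasso.coef_ == 0).sum())   # typically > 0: lasso zeroes some coefficients
```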

19
Q

Only a small number of predictors have substantial coefficients, and the remaining coefficients are very small or zero. Which of LASSO and ridge regression will perform better?

A

LASSO will perform better in this instance

20
Q

The response is a function of many predictors, each with a coefficient of roughly equal size. Which of LASSO and ridge regression will perform better?

A

Ridge regression will perform better in this instance

21
Q

Elastic Net improves on LASSO by addressing which two of LASSO’s limitations?

A

  1. If p > n, the lasso selects at most n variables
  2. If we have a group of correlated variables, the lasso fails to make a grouped selection (it tends to pick only one variable from the group)

22
Q

How does Elastic Net regression work?

A

The Elastic Net method combines the strengths of LASSO and Ridge Regression by including both penalties in the objective: the ridge shrinkage penalty and the lasso variable-selection penalty.

These are applied simultaneously. In doing so, Elastic Net regression groups and shrinks the parameters associated with correlated variables, either leaving them all in the equation or removing them all at once.
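
A short sketch (mine) with scikit-learn's ElasticNet, where alpha sets the overall penalty strength and l1_ratio mixes the lasso (L1) and ridge (L2) penalties:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=80, n_features=30, n_informative=5,
                       noise=5.0, random_state=2)

# l1_ratio = 1.0 is pure lasso, 0.0 is pure ridge; 0.5 blends the two penalties.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print((enet.coef_ == 0).sum(), "coefficients set exactly to zero")
```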