L6: Model Selection and Regularisation Flashcards

- Understand the algorithms, such as stepwise selection and ridge/lasso regression
- Apply those algorithms to further improve the modelling accuracy.

Unit learning outcomes:

- Evaluate the limitations, appropriateness and benefits of data analytics methods for given tasks;
- Design solutions to real world problems with data analytics techniques;

1
Q

If the number of variables p in X is large relative to the number of observations, i.e. n < p, what does this do to our resulting model?

A

The variance of the coefficient estimates is infinite and there is no unique OLS coefficient estimate.
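
A tiny NumPy illustration (not from the lecture, just a sketch) of why this happens: when n < p the Gram matrix X'X is rank-deficient, so the normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20                            # fewer observations than predictors (n < p)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))    # at most n (10) < p (20): X'X is singular

# np.linalg.lstsq still returns a fit, but it is just one of infinitely
# many coefficient vectors that reproduce the training data exactly.
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)                              # rank < p, so no unique OLS estimate
```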

2
Q

We have many variables in X. What does removing the variables that are unrelated to Y do for the model?

A

Setting the coefficients of these variables to 0 reduces the unnecessary complexity of the model and increases its interpretability.

3
Q

What do we call the general approach of removing the unrelated variables?

A

Feature Selection

4
Q

What are the three classes of selection methods?

A
  1. Subset Selection
  2. Shrinkage (regularisation)
  3. Dimension Reduction
5
Q

Which subset selection algorithm fits every possible combination of the p predictors and picks the best model among them?

A

Best Subset Selection

6
Q

Why is Best Subset Selection impractical in some cases?

A

Because the algorithm considers every possible model, i.e. all 2^p subsets of the p predictors. Hence, if p, the number of variables in X, is large, this is computationally expensive. Also, searching such a huge model space makes it easy to find a model with low training error that overfits and does not generalise well to the test dataset.

TL;DR: Large p = costly computation, and the enormous search space invites overfitting (low training error, poor test error).
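
As a rough sketch (my own illustration; best_subset and its internals are made up for clarity), the exhaustive enumeration looks like this, fitting 2^p − 1 models with scikit-learn's LinearRegression:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Fit every non-empty subset of columns; keep the lowest-RSS model per size."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):          # C(p, k) models of size k
            idx = list(cols)
            fit = LinearRegression().fit(X[:, idx], y)
            rss = np.sum((y - fit.predict(X[:, idx])) ** 2)
            if k not in best or rss < best[k][1]:
                best[k] = (idx, rss)
    return best    # 2^p - 1 fits in total, which quickly becomes infeasible
```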

7
Q

What is the pseudo-algorithm for Forward Stepwise Selection?

A

In outline (a code sketch follows below):

  1. Start with the null model M0, which contains no predictors
  2. For k = 0, 1, …, p-1:
    a. Consider all p-k models that augment the predictors in Mk with one additional predictor
    b. Choose the best amongst these p-k models (smallest RSS or highest R2) and call it Mk+1
  3. Select the single best model from among M0, …, Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R2
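
A minimal Python sketch of the loop above (my own illustration, not from the lecture; forward_stepwise and its variable names are made up): at each step it adds the single predictor that most reduces RSS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedy forward selection: returns one best model (column set, RSS) per size."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:                              # all p - k candidate additions
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_rss))          # this is M(k+1)
    return path    # choose the final size by CV error, Cp, AIC, BIC or adjusted R2
```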
8
Q

Unlike Best Subset Selection, Forward Stepwise Selection is not guaranteed to find the best possible model out of all 2^p candidate models. Why?

A

Because forward stepwise is greedy: once a variable is added it is never removed, so the best model of a given size may be a combination of later variables rather than one that includes the first variable chosen.

E.g. with X1, X2, X3, suppose the best 1-variable model contains X1 only, but the best 2-variable model contains X2 and X3. Forward stepwise selection would then necessarily include X1 plus either X2 or X3 in its 2-variable model, which would not be optimal.

9
Q

In a high-dimensional setting (p > n), why can Forward Stepwise Selection still be used?

A

Because we can restrict the submodel search to M0 up to Mn-1, rather than going all the way to Mp.

If p ≥ n there is no unique least squares solution, so limiting the number of variables in each submodel means every submodel can still be fit uniquely.

10
Q

What is Backward Stepwise Selection?

A

It is the same idea as Forward Stepwise Selection but in reverse: it starts with the full model containing all p variables and removes one variable at a time down to the null model, then selects the best model from that sequence.

11
Q

Why is R squared not suitable for selecting the best model out of models with different numbers of predictors?

A

R squared increases monotonically as the number of features included in the model increases.

Further, R squared is computed from the training error, so a high value may simply reflect over-fitting.

Therefore we should use a validation set (or a criterion that penalises model complexity).
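
A quick illustration of the monotone-R-squared point (mine, not from the card): adding pure-noise predictors never lowers the training R squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(size=n)

for extra in (0, 5, 10, 20):
    noise = rng.normal(size=(n, extra))              # predictors unrelated to y
    Xk = np.hstack([X, noise])
    r2 = LinearRegression().fit(Xk, y).score(Xk, y)  # training R squared
    print(f"{Xk.shape[1]:2d} predictors: R^2 = {r2:.4f}")
# Training R^2 never decreases as useless predictors are added.
```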

12
Q

What are suitable measurements for model performance?

A

Mallow’s Cp

Akaike Information Criterion (AIC)

Bayesian Information Criterion (BIC)

Adjusted R Squared

These methods add a penalty for the use of extra variables. Hence we will get the best model that also has an optimally reduced no. of variables (complexity).
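
A small sketch of how these criteria are computed for a least-squares fit with d predictors, following the standard ISLR-style formulas (my own code; selection_criteria and sigma2_hat are illustrative names, with sigma2_hat an estimate of the error variance):

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """ISLR-style criteria for a least-squares fit with d predictors.
    Smaller is better for Cp/AIC/BIC; larger is better for adjusted R squared."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2
```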

13
Q

What are the two Shrinkage methods that we cover?

A

Lasso and Ridge Regression

14
Q

How does shrinkage differ from subset selection?

A

Rather than removing variables completely, shrinkage fits a model containing all the variables but ‘shrinks’ the coefficient estimates towards zero, with the less important variables shrunk the most.

15
Q

What does ridge regression do?

A

Ridge regression adds a shrinkage penalty, scaled by a regularisation parameter lambda, to the least squares objective.

Together these shrink the estimates of B (the parameters) towards zero and reduce the model complexity.

When lambda = 0, ridge regression produces the least squares estimates of B; as lambda → infinity, the estimates of B approach 0.
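
A brief sketch (mine) using scikit-learn's Ridge, whose alpha argument plays the role of lambda: the coefficients shrink towards zero as alpha grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=10, noise=10.0, random_state=0)

print("OLS", np.abs(LinearRegression().fit(X, y).coef_).mean())      # lambda = 0
for alpha in (1, 100, 10_000):                                       # alpha ~ lambda
    print(alpha, np.abs(Ridge(alpha=alpha).fit(X, y).coef_).mean())  # shrinks towards 0
```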

16
Q

How is ridge regression an improvement on Ordinary Least Squares?

A

For p ≈ n and p > n

  1. OLS estimates are extremely variable
  2. Ridge regression performs well by trading off a small increase in bias for a large decrease in variance
17
Q

Why is ridge regression more computationally effective than Best Subset Selection?

A

Ridge regression keeps a single model and controls its complexity with one added parameter, lambda. Hence, only one model needs to be fit (for each value of lambda), instead of searching over all 2^p subsets.

18
Q

How does LASSO differ from ridge regression?

A

In ridge regression, the coefficients only approach zero (every variable stays in the model).

In LASSO, coefficients are set exactly to zero once lambda is sufficiently large, so LASSO also performs variable selection.
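
A quick comparison sketch (mine) on the same simulated data: scikit-learn's Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of 20 predictors actually influence y.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=1)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print((ridge.coef_ == 0).sum())   # 0: ridge keeps every variable in the model
print((lasso.coef_ == 0).sum())   # typically > 0: lasso zeroes some coefficients
```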

19
Q

Only a small number of predictors have substantial coefficients, and the remaining coefficients are very small or zero. Which of LASSO and ridge regression will perform better?

A

LASSO will perform better in this instance

20
Q

The response is a function of many predictors, each with a coefficient of roughly equal size. Which of LASSO and ridge regression will perform better?

A

Ridge regression will perform better in this instance

21
Q

Elastic Net improves on LASSO by addressing which two of LASSO’s limitations?

A

  1. If p > n, the lasso selects at most n variables
  2. If we have a group of correlated variables, the lasso fails to make a grouped selection (it tends to pick only one variable from the group)

22
Q

How does Elastic Net regression work?

A

The Elastic Net method combines the strengths of LASSO and Ridge Regression by including both penalties in the objective: the ridge shrinkage penalty and the lasso variable-selection penalty.

These are applied simultaneously. In doing so, Elastic Net regression groups and shrinks the parameters associated with correlated variables, either leaving them all in the equation or removing them all at once.
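
A short sketch (mine) with scikit-learn's ElasticNet, where alpha sets the overall penalty strength and l1_ratio mixes the lasso (L1) and ridge (L2) penalties:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=80, n_features=30, n_informative=5,
                       noise=5.0, random_state=2)

# l1_ratio = 1.0 is pure lasso, 0.0 is pure ridge; 0.5 blends the two penalties.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print((enet.coef_ == 0).sum(), "coefficients set exactly to zero")
```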