Session 7: Regularised Regression Flashcards
What does the standard statistical approach assume?
A parametric model such as the linear model
E(Y) = β0 + β1x1 + … + βpxp
If we hypothesize a linear model for the response y
y = β0 + β1x1 + … + βpxp + ε
where ε ~ N(0, σ²)
How do we estimate the unknown parameters β0, β1, …, βp?
Ordinary least squares (OLS):
We choose β0, β1, …, βp to minimise the residual sum of squares between the observed and predicted responses in the (same) data set.
Statistical inference, assuming a normal distribution of the error ε, allows us to construct confidence intervals around the parameters and to perform statistical tests.
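As a minimal sketch (simulated data and made-up variable names, not from the course material), an OLS fit and the associated inference can be obtained in R with lm():

```r
# Simulated example: OLS minimises the residual sum of squares
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # linear model plus normal error
fit_ols <- lm(y ~ x1 + x2)
summary(fit_ols)   # t-tests and standard errors assume normally distributed errors
confint(fit_ols)   # confidence intervals around the parameters
```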
What are problems with the OLS estimator?
There are situations when the OLS estimator does not work well:
When the independent variables are highly correlated: we get unstable, highly variable estimates
The number of parameters relative to the sample size is large: danger of overfitting
We optimise for an unbiased estimator of the mean response E(y), but not for predicting new unseen individual cases yi: the optimal model for explanatory research is very often not the optimal model for prediction (see the sketch below)
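A small simulated illustration of the first problem, with two nearly identical predictors (hypothetical data, for intuition only):

```r
# Simulated example: highly correlated predictors give unstable OLS estimates
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # x2 is almost a copy of x1
y  <- 1 + x1 + rnorm(n)         # only x1 truly matters
coef(lm(y ~ x1 + x2))           # estimates become large and of opposite sign
```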
Describe a model that under-fits
Performs poorly on the training data.
It did not capture the relationship between the predictors X and outcome Y. Performance on test data will be even worse. The model is biased!
For example, if the true relationship is curvilinear and we fit a straight line, the model explains the least amount of variance in our training data.
Describe a model that over-fits
A model has over-fitted our training data when it performs well on the training data but poorly on new test data. This is because the model has fitted noise in the training data.
The model memorizes the exact pattern of data it has seen and is unable to generalize to unseen examples, because the pattern will not reappear.
The model variance is high.
We want a model to work well on the future data, not on the training data!
An over-fitted model explains the training variance best: there is a very small difference between observed and predicted values
Describe a balanced model
A model is balanced ("just right") when it captures the true pattern and therefore predicts new unseen cases well.
A balanced model that captures the curvilinear relationship should explain more variance than the under-fitted model but less than the over-fitted model (see the sketch below)
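A minimal simulated sketch of the three cases, comparing training and test MSE for polynomial fits of increasing degree (variable names are made up for illustration):

```r
# Simulated curvilinear data: degree 1 under-fits, degree 2 is balanced, degree 10 over-fits
set.seed(3)
x <- runif(80, -2, 2)
y <- x^2 + rnorm(80, sd = 0.5)
train <- 1:50; test <- 51:80
for (d in c(1, 2, 10)) {
  fit <- lm(y ~ poly(x, d), subset = train)
  mse_train <- mean(residuals(fit)^2)
  mse_test  <- mean((y[test] - predict(fit, newdata = data.frame(x = x[test])))^2)
  cat("degree", d, "- train MSE:", round(mse_train, 2),
      "test MSE:", round(mse_test, 2), "\n")
}
```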
Under-fitting is a bigger problem than over-fitting
True or false
True
If we assume that there is a relationship between outcome Y (depression score) and at least one of the p independent variables X (clinical and demographic characteristics)
How can we model this?
Y = f(X) + ε
where f( ) is an unknown function
and ε is a random error with mean 0 and variance σ²
Then the expected mean squared prediction error is:
E(MSE) = (model bias)² + model variance + σ² (noise that we cannot explain)
Want model with smallest prediction error
What does the expected mean squared prediction error mean?
E(MSE) = (model bias)² + model variance + σ²
Bias is the result of misspecifying the model f
It reflects how close the functional form of the model is to the true model
High bias results in underfitting
Model estimation variance is the result of using a sample to estimate f(x)
It quantifies how much the model depends on the particular data points used to build it.
High variance = small changes in the data change the model parameter estimates substantially
High variance results in overfitting
σ² is the irreducible error, which remains even if the model f is correctly specified and estimated
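Written out in standard textbook notation (not copied verbatim from the slides), the decomposition of the expected MSE is:

```latex
\operatorname{E}\!\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\operatorname{E}[\hat{f}(x)] - f(x)\big)^2}_{(\text{model bias})^2}
  + \underbrace{\operatorname{E}\big[(\hat{f}(x) - \operatorname{E}[\hat{f}(x)])^2\big]}_{\text{model variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```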
How do explanatory and prediction modelling optimize MSE differently?
In explanatory modelling we try to minimise the MSE in our training data set; the MSE for predicting new cases is usually larger than the MSE in our training sample. In prediction modelling we instead try to minimise the expected MSE for new, unseen cases.
How can we improve prediction accuracy?
By reducing the variability of the regression estimates (model variance) at the cost of increased bias (model bias). Because OLS is unbiased, the model bias is 0 and the MSE is caused by model variance plus the irreducible error.
Thus, we can find a model that is biased but has smaller variance, and this would improve prediction accuracy.
The best explanatory model is often different from the best prediction model, because we optimise in a different way!
If we fit 90 parameters to a data set of 100 persons, the model explains almost 100% of the variance in the training data, the absolute values of the coefficients are too large, and we get type I errors (too many significant parameters). What does this mean?
Our model overfits the training data and is unlikely to predict well a new random sample
Our OLS model is unbiased (with increasing sample size parameters would move towards the true value)
The absolute values of the parameters are too large (many well above 0).
How could we improve the model?
We need to shrink the regression coefficients somehow
- Shrinkage (or regularisation) helps prevent linear models from overfitting the training data by shrinking the coefficients towards 0.
What do shrinkage or regularisation methods do?
They perform linear regression while regularising or shrinking the estimated coefficients towards 0.
Why does shrinkage help with overfitting?
It introduces bias but may decrease the variance of the estimates. If the latter effect is larger, we would decrease the test error!
By sacrificing unbiasedness, we can reduceβ¦
the variance to make the overall MSE lower
With regularised methods what do we introduce?
Some bias: on average we are a bit away from the true parameter, but there is very little model variance. The estimates all lie close to the true value, so the model will predict well. Thus, on average, the estimates are closer to the true values than with the ordinary least squares method.
What did van Houwelingen and le Cessie (1990) develop?
Heuristic shrinkage estimate:
γ̂ = (model χ² − p) / (model χ²)
where
p is the total degrees of freedom of the predictors (number of parameters − 1) and
χ² is the likelihood ratio statistic for testing the overall effect of all predictors.
For a linear model with an intercept β̂0 and coefficients β̂j (j = 1, 2, …, p), the shrunken estimates are easily computed:
shrunken β̂j = γ̂ · β̂j (the shrinkage factor times each estimated regression coefficient)
shrunken β̂0 = (1 − γ̂) · ȳ + γ̂ · β̂0 (one minus the shrinkage factor times the mean of all observed outcomes, plus the shrinkage factor times the intercept)
The model with shrunken regression coefficients predicts new unseen cases better on average (see the sketch below).
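A minimal sketch of the heuristic shrinkage estimate in R, assuming a hypothetical data frame df with an outcome column y and the candidate predictors as the remaining columns:

```r
# Heuristic shrinkage after an OLS fit (hypothetical data frame 'df' with outcome column y)
fit_full <- lm(y ~ ., data = df)                               # model with all candidate predictors
fit_null <- lm(y ~ 1, data = df)                               # intercept-only model
chi2  <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null))) # likelihood ratio statistic
p     <- length(coef(fit_full)) - 1                            # degrees of freedom of the predictors
gamma <- (chi2 - p) / chi2                                     # heuristic shrinkage factor
b     <- coef(fit_full)
shrunk_slopes    <- gamma * b[-1]                              # gamma-hat times each coefficient
shrunk_intercept <- (1 - gamma) * mean(df$y) + gamma * b[1]    # re-calibrated intercept
```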
Evaluate Heuristic shrinkage estimate
Works reasonably well in generalized linear models (i.e. Steyerberg et al. 2001)
p is the number of candidate predictors if variable selection is used! If we removed some variables because they were not correlated with the outcome, we must use the number of variables we started with, not the final number of variables in the model.
It would be better to integrate the shrinkage in the model building process.
Find a shrinkage on the parameters that optimizes the prediction of unseen cases
- A general shrinkage procedure is provided by penalised (or regularised) regression methods
What is a modern approach to prediction modelling?
Regularized or penalized methods that can be applied to both large data sets (bioinformatics, neuroimaging, wearable data) and small data sets with a large number of variables (RCTs, experimental studies, cohort studies).
Not really new:
Ridge regression: Arthur Hoerl and Robert Kennard (1970)
limited computer power restricted their use
What is the basic principle of penalised methods?
To improve prediction accuracy by reducing the variability of the regression estimates at the cost of increased bias (shrinkage)
What are advantages to using penalised methods?
- Also allows automatic variable selection by shrinkage:
As the coefficients of the weaker predictors are shrunk towards zero, they are effectively removed from the regression model
This is very useful for high-dimensional data (p >> n)
- Can also effectively deal with ill-conditioned regression problems:
Multi-collinearity (and redundancy: too many variables in the data set)
The number of variables (p) is close to the sample size (n)
- What happens when a model overfits the data?
- How can this be remedied?
- Standard estimates of regression coefficients become inflated or unstable
- Estimates can be stabilised (regularised) by adding a penalty to the estimating equations
For linear regressions, the penalty is added to the residual sum of squared errors (RSS)
RSS(λ) = Σ (yi − ŷi)² + λ J(β)
In OLS we try to minimise the RSS to find the optimal unbiased estimates of our regression coefficients
In penalised regression we add a penalty term: lambda times a function of the regression coefficients
The larger the lambda, the larger the penalty added to our residual sum of squared errors
In principle with the right choice of lambda what can we get?
An estimator with a better MSE
Estimate is not unbiased but what we pay for in bias we make up for in variance
By sacrificing unbiasedness, we can reduce the variance to make the overall MSE lower
We try to find a lambda (penalty) that does what?
Minimises error of unseen cases
If lambda is 0 we have the OLS method and the bias is 0. If we increase lambda, the bias becomes larger (and for a very large lambda the MSE becomes larger again). On the other hand, the variance, which may be large for the ordinary least squares method, becomes smaller as lambda increases.
What are 3 commonly used penalty functions?
- Ridge penalty
J(β) = Σ βj²
The sum of the squared coefficients (Σ β²) forms the penalty
Also called the L2 norm, as the coefficients are squared
- LASSO (Least Absolute Shrinkage and Selection Operator):
J(β) = Σ |βj|
The sum of the absolute coefficients (Σ |β|) forms the penalty
Also called the L1 norm, as it uses beta to the power of one
- Elastic net
A combination of L1 and L2 norm regularisation (see the sketch below)
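In the glmnet R package (referenced later in these cards), the three penalties correspond to the alpha argument; a brief sketch, assuming a numeric predictor matrix x and an outcome vector y:

```r
library(glmnet)
# alpha selects the penalty: 0 = ridge (L2), 1 = lasso (L1), values in between = elastic net
fit_ridge   <- glmnet(x, y, alpha = 0)
fit_lasso   <- glmnet(x, y, alpha = 1)
fit_elastic <- glmnet(x, y, alpha = 0.5)
```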
What is commonly used to deal with ill-conditioned regression problems such as multi-collinearity (high correlation between predictor variables) and
the number of variables (p) being close to the sample size (n)?
Ridge regression
How is ridge regression of Ξ² obtained?
By minimising the residual sum of squares (RSS) plus the penalty, which is lambda times the sum of the squared regression coefficients; this penalty term is added to the RSS to give the penalised RSS(λ)
The parameter λ scales the norm and controls the amount of penalty
What is one of the important problems in applying ridge regression?
To choose the right value of λ
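A minimal ridge sketch with glmnet, again assuming a predictor matrix x and outcome y; cv.glmnet handles the search over λ values:

```r
library(glmnet)
# Ridge regression: alpha = 0; cv.glmnet evaluates a grid of lambda values by cross-validation
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_ridge$lambda.min               # lambda with the smallest cross-validated MSE
coef(cv_ridge, s = "lambda.min")  # shrunken (but generally non-zero) coefficients
```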
What is LASSO (Least Absolute Shrinkage and Selection Operator) a promising technique for?
Variable selection
Finding a small subset of most predictive variables in a high dimensional dataset is an interesting and important problem
How does LASSO tend to deal with overfitting?
Tends to assign zero coefficients to most irrelevant or redundant variables - This is also called a sparse solution
How are LASSO estimates obtained?
By minimising the RSS plus the lasso penalty term, which is lambda times the sum of the absolute values of the regression parameters
This is called the L1 penalty/norm
The lasso penalty involves the absolute values of the regression parameters, not the squared values as in ridge regression
We need to find the best lambda, i.e. the one that minimises the prediction error for unseen cases
Similar to ridge regression, the penalty parameter (λ) controls the amount of penalty (user customisable)
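A corresponding lasso sketch (same assumed x and y), showing how the sparse solution can be inspected:

```r
library(glmnet)
# LASSO: alpha = 1; many coefficients are shrunk exactly to zero (a sparse solution)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
b <- as.vector(coef(cv_lasso, s = "lambda.min"))
sum(b[-1] != 0)                   # number of predictors kept in the model (intercept excluded)
```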
If we compute lasso or ridge, the data must be…
Standardised, so that variables with a large range do not dominate model selection:
Different units (m versus km) would result in different solutions
This is automatically done in most software packages
R packages such as 'glmnet' back-transform the final regression coefficients to the original scale!
What is the z-transformation formula?
A linear transformation of the values to a common mean of zero and a standard deviation of 1:
zi = (xi − x̄) / s
with
zi = z-transformed value of person i in the sample
x̄ = sample mean
xi = original value of person i
s = standard deviation of the sample
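A brief sketch of the z-transformation in R using scale(), assuming a numeric predictor matrix x:

```r
# z-transformation with scale(): centre to mean 0, divide by the sample standard deviation
x_std <- scale(x)          # x = numeric matrix of predictors
round(colMeans(x_std), 10) # approximately 0 for every column
apply(x_std, 2, sd)        # 1 for every column
# Note: glmnet standardises internally by default (standardize = TRUE) and returns the
# coefficients back-transformed to the original scale
```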
The z-transformation changes the form of the distribution, it only adjusts the mean and the standard deviation!
True or false
FALSE
z-transformation does not change the form of the distribution, it only adjusts the mean and the standard deviation!
How do we select lambda?
The goal is to evaluate the model in terms of its ability to predict future observations:
The model needs to be evaluated on data that were not used to build the model (test sets)
We assess different lambdas and, using cross-validation, choose the one which best predicts unseen cases
This best lambda is then used to fit the model on the complete data set
For example, calculate the cross-validated MSE for 100 lambdas of different strength and pick the lambda with the smallest average MSE
We pick the lambda which best predicts unseen cases (= smallest mean squared error, MSE) (see the sketch below)
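A minimal sketch of this cross-validation procedure with cv.glmnet (assumed predictor matrix x and outcome y):

```r
library(glmnet)
# 10-fold cross-validation over a grid of lambdas (glmnet uses 100 values by default)
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
plot(cv_fit)              # cross-validated MSE (with standard-error bars) against log(lambda)
cv_fit$lambda.min         # lambda with the smallest mean cross-validated error
fit_final <- glmnet(x, y, alpha = 1, lambda = cv_fit$lambda.min)  # refit on the complete data set
```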
What is the performance of our model function measured by?
A loss function for penalizing error in prediction.
What does a loss function measure?
How well a prediction model does in terms of being able to predict the expected outcome.
What is a popular loss function?
The MSE loss function
We decide to choose the function f(x) which minimizes the expected loss or here the expected mean squared prediction error (MSE).
The expected MSE can be estimated by cross-validation or bootstrapping methods - We use the same methodology as for internal validation!
To build optimal predictive models, any sensible subset selection algorithm can be combined with what?
Cross-validation to build a good prediction model
The idea is to build a large number of alternative models (of varying complexities) and evaluate the predictive performance using cross-validation to select the best model
In regularized regression we compare models with different lambdas!
Using hold-out data for prediction accuracy estimation involves what?
Using CV to select the optimal λ selects the best set of predictors of unseen cases.
However: the prediction accuracy measures are over-optimistic estimates of the accuracy in a future sample, because the CV test data were used to select our model!
What is ridge not useful for?
Parsimonious model selection
The ridge penalty function is very flat near zero values of β; what does this mean?
It does not encourage the β coefficients to be exactly zero
Not good for variable selection
Not good for sparse problems
Alternative penalised methods (e.g., LASSO, see next) are a better option for variable selection
What is lambda.1se?
This is a slightly stronger penalty than the minimum lambda and lies within one standard error of the optimal value of lambda.
The purpose of regularization is often to balance accuracy and simplicity: We want a model with the smallest number of predictors that also gives a good accuracy.
Setting lambda = lambda.1se results in a simpler model compared to lambda.min (fewer variables are selected), but the model might be a little less accurate than the one obtained with lambda.min.
Research suggests that this lambda sometimes predicts better in external data sets and selects fewer false-positive predictors (see the sketch below).
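A brief sketch comparing the two choices, re-using the cv_fit object from the cross-validation sketch above:

```r
# Comparing the two built-in lambda choices from cv.glmnet
b_min <- as.vector(coef(cv_fit, s = "lambda.min"))
b_1se <- as.vector(coef(cv_fit, s = "lambda.1se"))
sum(b_min[-1] != 0)  # predictors selected with lambda.min
sum(b_1se[-1] != 0)  # usually fewer predictors under the stronger lambda.1se penalty
```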
When should you compute OLS?
If you have a large sample size with a relatively small number of likely predictor variables (theory-driven)
When should you compute Ridge?
If you expect many small effect sizes and predictors are likely true ones (you want to keep all variables in the model).
When should you compute Lasso?
If you have a few stronger predictors among a large number of likely weak predictors or noise variables.
What is not very meaningful in penalised regressions and why?
Statistical Inference of regression coefficients
This is because the penalised estimates are biased towards zero
The standard error (SE) of penalised coefficients gives only partial information about their precision
SE ignores the inaccuracy caused by bias
Software packages do not supply standard errors (SE), confidence intervals (CI), or p-values for penalised regression. Internal validation is our 'test'
Major aim in penalised regression is to build a prediction model/variable selection rather than performing statistical inference
Regularized or penalized regressions are extensions of the linear model.
True or false
True
Regularized or penalized regressions seek to do what?
Minimise the sum of squared errors (or MSE) of the model on the training data, but also try to avoid over-fitting by reducing the complexity of the model at the cost of some bias
This is done by shrinking the regression coefficients
What are two popular examples of Regularized or penalized regressions?
Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
Lasso Regression: where OLS is modified to also minimize the sum of the absolute values of the coefficients (L1 regularization).
Unlike Ridge, Lasso regression performs variable selection by shrinking some coefficients to 0