Session 7: Regularised Regression Flashcards
What does the standard statistical approach assume?
A parametric model such as the linear model
E(Y) = β0 + β1x1 + … + βpxp
If we hypothesize a linear model for the response y
y = β0 + β1x1 + … + βpxp + ε
where ε ~ N(0, σ²)
How do we estimate the unknown parameters β0, β1, …, βp?
Ordinary least squares (OLS):
We choose β0, β1, …, βp to minimise the residual sum of squares between the observed and predicted responses in the (same) data set.
Statistical inference, assuming a normal distribution of the error ε, allows us to construct confidence intervals around the parameters and to perform statistical tests.
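As a minimal sketch (simulated data and made-up variable names, not from the course material), an OLS fit and the associated inference can be obtained in R with lm():

```r
# Simulated example: OLS minimises the residual sum of squares
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)  # linear model plus normal error
fit_ols <- lm(y ~ x1 + x2)
summary(fit_ols)   # t-tests and standard errors assume normally distributed errors
confint(fit_ols)   # confidence intervals around the parameters
```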
What are problems with the OLS estimator?
There are situations when the OLS estimator does not work well:
When the independent variables are highly correlated: we get unstable, highly variable estimates
The number of parameters relative to the sample size is large: danger of overfitting
We optimise for an unbiased estimator of the mean response E(y), but not for predicting new unseen individual cases yi: the optimal model for explanatory research is very often not the optimal model for prediction (see the sketch below)
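A small simulated illustration of the first problem, with two nearly identical predictors (hypothetical data, for intuition only):

```r
# Simulated example: highly correlated predictors give unstable OLS estimates
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # x2 is almost a copy of x1
y  <- 1 + x1 + rnorm(n)         # only x1 truly matters
coef(lm(y ~ x1 + x2))           # estimates become large and of opposite sign
```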
Describe a model that under-fits
Performs poorly on the training data.
It did not capture the relationship between the predictors X and outcome Y. Performance on test data will be even worse. The model is biased!
For example, if the true relationship is curvilinear and we fit a straight line, the model explains the least amount of variance in our training data.
Describe a model that over-fits
A model has over-fitted our training data when it performs well on the training data but poorly on new test data. This is because the model has fitted noise in the training data.
The model memorizes the exact pattern of data it has seen and is unable to generalize to unseen examples, because the pattern will not reappear.
The model variance is high.
We want a model to work well on the future data, not on the training data!
An over-fitted model explains the training variance best: there is a very small difference between observed and predicted values
Describe a balanced model
A model is balanced ("just right") when it captures the true pattern and therefore predicts new unseen cases well.
A balanced model that captures the curvilinear relationship should explain more variance than the under-fitted model but less than the over-fitted model (see the sketch below)
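A minimal simulated sketch of the three cases, comparing training and test MSE for polynomial fits of increasing degree (variable names are made up for illustration):

```r
# Simulated curvilinear data: degree 1 under-fits, degree 2 is balanced, degree 10 over-fits
set.seed(3)
x <- runif(80, -2, 2)
y <- x^2 + rnorm(80, sd = 0.5)
train <- 1:50; test <- 51:80
for (d in c(1, 2, 10)) {
  fit <- lm(y ~ poly(x, d), subset = train)
  mse_train <- mean(residuals(fit)^2)
  mse_test  <- mean((y[test] - predict(fit, newdata = data.frame(x = x[test])))^2)
  cat("degree", d, "- train MSE:", round(mse_train, 2),
      "test MSE:", round(mse_test, 2), "\n")
}
```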
Under-fitting is a bigger problem than over-fitting
True or false
True
If we assume that there is a relationship between outcome Y (depression score) and at least one of the p independent variables X (clinical and demographic characteristics)
How can we model this?
Y = f(X) + ε
where f( ) is an unknown function
and ε is a random error with mean 0 and variance σ²
Then the expected mean squared prediction error is:
E(MSE) = (model bias)² + model variance + σ² (noise that we cannot explain)
Want model with smallest prediction error
What does the expected mean squared prediction error mean?
E(MSE) = (model bias)² + model variance + σ²
Bias is the result of misspecifying the model f
It reflects how close the functional form of the model is to the true model
High bias results in underfitting
Model estimation variance is the result of using a sample to estimate f(x)
It quantifies how much the model depends on the particular data points used to build it.
High variance = small changes in the data change the model parameter estimates substantially
High variance results in overfitting
σ² is the irreducible error, which remains even if the model f is correctly specified and estimated
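Written out in standard textbook notation (not copied verbatim from the slides), the decomposition of the expected MSE is:

```latex
\operatorname{E}\!\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\operatorname{E}[\hat{f}(x)] - f(x)\big)^2}_{(\text{model bias})^2}
  + \underbrace{\operatorname{E}\big[(\hat{f}(x) - \operatorname{E}[\hat{f}(x)])^2\big]}_{\text{model variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```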
How do explanatory and prediction modelling optimize MSE differently?
In explanatory modelling we try to minimise the MSE in our training data set; the MSE for predicting new cases is usually larger than the MSE in our training sample. In prediction modelling we instead try to minimise the expected MSE for new, unseen cases.
How can we improve prediction accuracy?
By reducing the variability of the regression estimates (model variance) at the cost of increased bias (model bias). Because OLS is unbiased, the model bias is 0 and the MSE is caused by model variance plus the irreducible error.
Thus, we can find a model that is biased but has smaller variance, and this would improve prediction accuracy.
The best explanatory model is often different from the best prediction model, because we optimise in a different way!
If we fit 90 parameters to a data set of 100 persons, the model explains almost 100% of the variance in the training data, the absolute values of the coefficients are too large, and we get type I errors (too many significant parameters). What does this mean?
Our model overfits the training data and is unlikely to predict well a new random sample
Our OLS model is unbiased (with increasing sample size parameters would move towards the true value)
The absolute values of the parameters are too large (many well above 0).
How could we improve the model?
We need to shrink the regression coefficients somehow
- Shrinkage (or regularisation) helps prevent linear models from overfitting the training data by shrinking the coefficients towards 0.
What do shrinkage or regularisation methods do?
They perform linear regression while regularising or shrinking the estimated coefficients towards 0.
Why does shrinkage help with overfitting?
It introduces bias but may decrease the variance of the estimates. If the latter effect is larger, we would decrease the test error!
By sacrificing unbiasedness, we can reduceβ¦
the variance to make the overall MSE lower
With regularised methods what do we introduce?
Some bias: on average we are a bit away from the true parameter, but there is very little model variance. The estimates all lie close to the true value, so the model will predict well. Thus, on average, the estimates are closer to the true values than with the ordinary least squares method.
What did van Houwelingen and le Cessie (1990) develop?
Heuristic shrinkage estimate:
γ̂ = (model χ² − p) / (model χ²)
where
p is the total degrees of freedom of the predictors (number of parameters − 1) and
χ² is the likelihood ratio statistic for testing the overall effect of all predictors.
For a linear model with an intercept β̂0 and coefficients β̂j (j = 1, 2, …, p), the shrunken estimates are easily computed:
shrunken β̂j = γ̂ · β̂j (the shrinkage factor times each estimated regression coefficient)
shrunken β̂0 = (1 − γ̂) · ȳ + γ̂ · β̂0 (one minus the shrinkage factor times the mean of all observed outcomes, plus the shrinkage factor times the intercept)
The model with shrunken regression coefficients predicts new unseen cases better on average (see the sketch below).
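A minimal sketch of the heuristic shrinkage estimate in R, assuming a hypothetical data frame df with an outcome column y and the candidate predictors as the remaining columns:

```r
# Heuristic shrinkage after an OLS fit (hypothetical data frame 'df' with outcome column y)
fit_full <- lm(y ~ ., data = df)                               # model with all candidate predictors
fit_null <- lm(y ~ 1, data = df)                               # intercept-only model
chi2  <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null))) # likelihood ratio statistic
p     <- length(coef(fit_full)) - 1                            # degrees of freedom of the predictors
gamma <- (chi2 - p) / chi2                                     # heuristic shrinkage factor
b     <- coef(fit_full)
shrunk_slopes    <- gamma * b[-1]                              # gamma-hat times each coefficient
shrunk_intercept <- (1 - gamma) * mean(df$y) + gamma * b[1]    # re-calibrated intercept
```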
Evaluate Heuristic shrinkage estimate
Works reasonably well in generalized linear models (i.e. Steyerberg et al. 2001)
p is the number of candidate predictors if variable selection is used! If we removed some variables because they were not correlated with the outcome, we must use the number of variables we started with, not the final number of variables in the model.
It would be better to integrate the shrinkage in the model building process.
Find a shrinkage on the parameters that optimizes the prediction of unseen cases
- A general shrinkage procedure is provided by penalised (or regularised) regression methods
What is a modern approach to prediction modelling?
Regularized or penalized methods that can be applied to both large data sets (bioinformatics, neuroimaging, wearable data) and small data sets with a large number of variables (RCTs, experimental studies, cohort studies).
Not really new:
Ridge regression: Arthur Hoerl and Robert Kennard (1970)
limited computer power restricted their use
What is the basic principle of penalised methods?
To improve prediction accuracy by reducing the variability of the regression estimates at the cost of increased bias (shrinkage)
What are advantages to using penalised methods?
- Also allows automatic variable selection by shrinkage:
As the coefficients of the weaker predictors are shrunk towards zero, they are effectively removed from the regression model
This is very useful for high-dimensional data (p >> n)
- Can also effectively deal with ill-conditioned regression problems:
Multi-collinearity (and redundancy: too many variables in the data set)
The number of variables (p) is close to the sample size (n)
- What happens when a model overfits the data?
- How can this be remedied?
- Standard estimates of regression coefficients become inflated or unstable
- Estimates can be stabilised (regularised) by adding a penalty to the estimating equations
For linear regressions, the penalty is added to the residual sum of squared errors (RSS)
RSS(λ) = Σ (yi − ŷi)² + λ J(β)
In OLS we try to minimise the RSS to find the optimal unbiased estimates of our regression coefficients
In penalised regression we add a penalty term: lambda times a function of the regression coefficients
The larger the lambda, the larger the penalty added to our residual sum of squared errors
In principle with the right choice of lambda what can we get?
An estimator with a better MSE
Estimate is not unbiased but what we pay for in bias we make up for in variance
By sacrificing unbiasedness, we can reduce the variance to make the overall MSE lower
We try to find a lambda (penalty) that does what?
Minimises error of unseen cases
If lambda is 0 we have the OLS method and the bias is 0. If we increase lambda, the bias becomes larger (and for a very large lambda the MSE becomes larger again). On the other hand, the variance, which may be large for the ordinary least squares method, becomes smaller as lambda increases.
What are 3 commonly used penalty functions?
- Ridge penalty
J(β) = Σ βj²
The sum of the squared coefficients (Σ β²) forms the penalty
Also called the L2 norm, as the coefficients are squared
- LASSO (Least Absolute Shrinkage and Selection Operator):
J(β) = Σ |βj|
The sum of the absolute coefficients (Σ |β|) forms the penalty
Also called the L1 norm, as it uses beta to the power of one
- Elastic net
A combination of L1 and L2 norm regularisation (see the sketch below)
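In the glmnet R package (referenced later in these cards), the three penalties correspond to the alpha argument; a brief sketch, assuming a numeric predictor matrix x and an outcome vector y:

```r
library(glmnet)
# alpha selects the penalty: 0 = ridge (L2), 1 = lasso (L1), values in between = elastic net
fit_ridge   <- glmnet(x, y, alpha = 0)
fit_lasso   <- glmnet(x, y, alpha = 1)
fit_elastic <- glmnet(x, y, alpha = 0.5)
```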
What is commonly used to deal with ill-conditioned regression problems such as multi-collinearity (high correlation between predictor variables) and
the number of variables (p) being close to the sample size (n)?
Ridge regression
How is ridge regression of Ξ² obtained?
By minimising the residual sum of squares (RSS) plus the penalty, which is lambda times the sum of the squared regression coefficients; this penalty term is added to the RSS to give the penalised RSS(λ)
The parameter λ scales the norm and controls the amount of penalty
What is one of the important problems in applying ridge regression?
To choose the right value of λ
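A minimal ridge sketch with glmnet, again assuming a predictor matrix x and outcome y; cv.glmnet handles the search over λ values:

```r
library(glmnet)
# Ridge regression: alpha = 0; cv.glmnet evaluates a grid of lambda values by cross-validation
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_ridge$lambda.min               # lambda with the smallest cross-validated MSE
coef(cv_ridge, s = "lambda.min")  # shrunken (but generally non-zero) coefficients
```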
What is LASSO (Least Absolute Shrinkage and Selection Operator) a promising technique for?
Variable selection
Finding a small subset of most predictive variables in a high dimensional dataset is an interesting and important problem
How does LASSO tend to deal with overfitting?
Tends to assign zero coefficients to most irrelevant or redundant variables - This is also called a sparse solution
How are LASSO estimates obtained?
By minimising the RSS plus the lasso penalty term, which is lambda times the sum of the absolute values of the regression parameters
This is called the L1 penalty/norm
The lasso penalty involves the absolute values of the regression parameters, not the squared values as in ridge regression
We need to find the best lambda, i.e. the one that minimises the prediction error for unseen cases
Similar to ridge regression, the penalty parameter (λ) controls the amount of penalty (user customisable)
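A corresponding lasso sketch (same assumed x and y), showing how the sparse solution can be inspected:

```r
library(glmnet)
# LASSO: alpha = 1; many coefficients are shrunk exactly to zero (a sparse solution)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
b <- as.vector(coef(cv_lasso, s = "lambda.min"))
sum(b[-1] != 0)                   # number of predictors kept in the model (intercept excluded)
```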
If we compute lasso or ridge, the data must be…
Standardised, so that variables with a large range do not dominate model selection:
Different units (m versus km) would result in different solutions
This is automatically done in most software packages
R packages such as 'glmnet' back-transform the final regression coefficients to the original scale!
What is the z-transformation formula?
A linear transformation of the values to a common mean of zero and a standard deviation of 1:
zi = (xi − x̄) / s
with
zi = z-transformed value of person i in the sample
x̄ = sample mean
xi = original value of person i
s = standard deviation of the sample
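A brief sketch of the z-transformation in R using scale(), assuming a numeric predictor matrix x:

```r
# z-transformation with scale(): centre to mean 0, divide by the sample standard deviation
x_std <- scale(x)          # x = numeric matrix of predictors
round(colMeans(x_std), 10) # approximately 0 for every column
apply(x_std, 2, sd)        # 1 for every column
# Note: glmnet standardises internally by default (standardize = TRUE) and returns the
# coefficients back-transformed to the original scale
```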
The z-transformation changes the form of the distribution, it only adjusts the mean and the standard deviation!
True or false
FALSE
z-transformation does not change the form of the distribution, it only adjusts the mean and the standard deviation!
How do we select lambda?
The goal is to evaluate the model in terms of its ability to predict future observations:
The model needs to be evaluated on data that were not used to build the model (test sets)
We assess different lambdas and, using cross-validation, choose the one which best predicts unseen cases
This best lambda is then used to fit the model on the complete data set
For example, calculate the cross-validated MSE for 100 lambdas of different strength and pick the lambda with the smallest average MSE
We pick the lambda which best predicts unseen cases (= smallest mean squared error, MSE) (see the sketch below)
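A minimal sketch of this cross-validation procedure with cv.glmnet (assumed predictor matrix x and outcome y):

```r
library(glmnet)
# 10-fold cross-validation over a grid of lambdas (glmnet uses 100 values by default)
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
plot(cv_fit)              # cross-validated MSE (with standard-error bars) against log(lambda)
cv_fit$lambda.min         # lambda with the smallest mean cross-validated error
fit_final <- glmnet(x, y, alpha = 1, lambda = cv_fit$lambda.min)  # refit on the complete data set
```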
What is the performance of our model function measured by?
A loss function for penalizing error in prediction.
What does a loss function measure?
How well a prediction model does in terms of being able to predict the expected outcome.
What is a popular loss function?
The MSE loss function
We decide to choose the function f(x) which minimizes the expected loss or here the expected mean squared prediction error (MSE).
The expected MSE can be estimated by cross-validation or bootstrapping methods - We use the same methodology as for internal validation!
To build optimal predictive models, any sensible subset selection algorithm can be combined with what?
Cross-validation to build a good prediction model
The idea is to build a large number of alternative models (of varying complexities) and evaluate the predictive performance using cross-validation to select the best model
In regularized regression we compare models with different lambdas!
Using hold-out data for prediction accuracy estimation involves what?
Using CV to select the optimal λ selects the best set of predictors of unseen cases.
However: the prediction accuracy measures are over-optimistic estimates of the accuracy in a future sample, because the CV test data were used to select our model!
What is ridge not useful for?
Parsimonious model selection
The ridge penalty function is very flat near zero values of β; what does this mean?
It does not encourage the β coefficients to be exactly zero
Not good for variable selection
Not good for sparse problems
Alternative penalised methods (e.g., LASSO, see next) are a better option for variable selection
What is lambda.1se?
This is a slightly stronger penalty than the minimum lambda and lies within one standard error of the optimal value of lambda.
The purpose of regularization is often to balance accuracy and simplicity: We want a model with the smallest number of predictors that also gives a good accuracy.
Setting lambda = lambda.1se results in a simpler model compared to lambda.min (fewer variables are selected), but the model might be a little less accurate than the one obtained with lambda.min.
Research suggests that this lambda sometimes predicts better in external data sets and selects fewer false-positive predictors (see the sketch below).
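A brief sketch comparing the two choices, re-using the cv_fit object from the cross-validation sketch above:

```r
# Comparing the two built-in lambda choices from cv.glmnet
b_min <- as.vector(coef(cv_fit, s = "lambda.min"))
b_1se <- as.vector(coef(cv_fit, s = "lambda.1se"))
sum(b_min[-1] != 0)  # predictors selected with lambda.min
sum(b_1se[-1] != 0)  # usually fewer predictors under the stronger lambda.1se penalty
```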
When should you compute OLS?
If you have a large sample size with a relatively small number of likely predictor variables (theory-driven)
When should you compute Ridge?
If you expect many small effect sizes and predictors are likely true ones (you want to keep all variables in the model).
When should you compute Lasso?
If you have a few stronger predictors among a large number of likely weak predictors or noise variables.
What is not very meaningful in penalised regressions and why?
Statistical Inference of regression coefficients
This is because the penalised estimates are biased towards zero
The standard error (SE) of penalised coefficients gives only partial information about their precision
SE ignores the inaccuracy caused by bias
Software packages do not supply standard errors (SE), confidence intervals (CI), or p-values for penalised regression. Internal validation is our 'test'
Major aim in penalised regression is to build a prediction model/variable selection rather than performing statistical inference
Regularized or penalized regressions are extensions of the linear model.
True or false
True
Regularized or penalized regressions seek to do what?
Minimise the sum of squared errors (or MSE) of the model on the training data, but also try to avoid over-fitting by reducing the complexity of the model at the cost of some bias
This is done by shrinking the regression coefficients
What are two popular examples of Regularized or penalized regressions?
Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
Lasso Regression: where OLS is modified to also minimize the sum of the absolute values of the coefficients (L1 regularization).
Unlike Ridge, Lasso regression performs variable selection by shrinking some coefficients to 0