Session 7: Regularised Regression Flashcards
What does the standard statistical approach assume?
A parametric model such as the linear model
E(Y | X) = β0 + β1X1 + … + βpXp
If we hypothesize a linear model for the response y
y = β0 + β1x1 + … + βpxp + ε
where ε ~ N(0, σ²)
How do we estimate the unknown parameters β0, β1, …, βp?
Ordinary least squares (OLS):
We choose β0, β1, …, βp to minimize the residual sum of squares between the observed and predicted responses in the (same) data set.
Statistical inference assuming a normal distribution of the error ε allows constructing confidence intervals around the parameters and performing statistical tests.
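A minimal sketch of the OLS fit, using simulated data and illustrative variable names (the data, "true" coefficients, and sample size are all assumptions for the example, not from the session): the coefficients are chosen by NumPy's least-squares solver so that the residual sum of squares is minimized.

```python
# Minimal OLS sketch on simulated data (all names and values are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                  # sample size and number of predictors
X = rng.normal(size=(n, p))                    # predictors x1..xp
beta_true = np.array([2.0, -1.0, 0.5])         # assumed "true" coefficients
y = 1.0 + X @ beta_true + rng.normal(size=n)   # y = b0 + X*beta + epsilon

# Add an intercept column and minimize RSS = sum((y - X1 @ b)**2) over b.
X1 = np.column_stack([np.ones(n), X])
beta_hat, rss, _, _ = np.linalg.lstsq(X1, y, rcond=None)

print("OLS estimates (b0..b3):", np.round(beta_hat, 3))
print("residual sum of squares:", float(rss[0]))
```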
What are problems with the OLS estimator?
There are situations when the OLS estimator does not work well:
When the independent variables are highly correlated, we get unstable, highly variable estimates (see the sketch after this list)
When the number of parameters is large relative to the sample size, there is a danger of overfitting
We optimise for an unbiased estimator of the mean response E(y), not for predicting new unseen individual cases yi. The model that is optimal for explanatory research is therefore very often not optimal for prediction.
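To illustrate the first point, a small sketch under assumed conditions (two nearly identical simulated predictors, illustrative noise levels): the OLS slope estimates are roughly right on average but vary wildly from sample to sample.

```python
# Sketch: collinearity makes OLS slope estimates unstable (simulated, illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 50
slopes = []
for _ in range(200):                              # repeated samples from one population
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 is almost identical to x1
    y = x1 + x2 + rng.normal(size=n)              # true slopes are 1 and 1
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    slopes.append(b[1:])                          # keep the two slope estimates

slopes = np.array(slopes)
print("mean of slope estimates:", slopes.mean(axis=0).round(2))   # close to (1, 1)
print("sd of slope estimates:  ", slopes.std(axis=0).round(2))    # huge: unstable
```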
Describe a model that under-fits
Performs poorly on the training data.
It did not capture the relationship between the predictors X and outcome Y. Performance on test data will be even worse. The model is biased!
It misses the curvilinear relationship and therefore explains the least variance in the training data.
Describe a model that over-fits
A model has over-fitted the training data when it performs well on the training data but poorly on new test data. This is because the model has fitted noise in the training data.
The model memorizes the exact pattern of data it has seen and is unable to generalize to unseen examples, because the pattern will not reappear.
The model variance is high.
We want a model to work well on the future data, not on the training data!
An over-fitted model explains the training variance best: very small differences between observed and predicted values
Describe a balanced model
A model is balanced ("just right") when it captures the true pattern and therefore predicts new unseen cases well.
A balanced model that captures the curvilinear relationship should explain the training variance better than an under-fitted model but not as well as an over-fitted model (see the sketch below).
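To make the three cases concrete, a sketch under assumed conditions (a sinusoidal "true" relationship, simulated noise, and polynomial models of increasing flexibility, none of which come from the session): the degree-1 fit under-fits, the degree-3 fit is roughly balanced, and the degree-10 fit over-fits.

```python
# Sketch: under-fit vs balanced vs over-fit on simulated curvilinear data.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * x)                       # assumed "true" curvilinear relationship
x_tr = rng.uniform(-2, 2, 30); y_tr = f(x_tr) + rng.normal(0, 0.3, 30)
x_te = rng.uniform(-2, 2, 200); y_te = f(x_te) + rng.normal(0, 0.3, 200)

for degree in (1, 3, 10):                         # under-fit, balanced, over-fit
    coefs = np.polyfit(x_tr, y_tr, degree)        # least-squares polynomial fit
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr:.3f}   test MSE {mse_te:.3f}")
```

The over-fitted polynomial has the smallest training MSE but the largest test MSE, which matches the cards above.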
Under-fitting is a bigger problem than over-fitting
True or false
True
If we assume that there is a relationship between outcome Y (depression score) and at least one of the p independent variables X (clinical and demographic characteristics)
How can we model this?
Y = f(X) + ε
where f( ) is an unknown function
and ε is a random error with mean 0 and variance σ²
Then the expected mean squared prediction error is:
E(MSE) = (model bias)² + model variance + σ² (noise that we cannot explain)
Want model with smallest prediction error
What does the expected mean squared prediction error mean?
E(MSE) = (model bias)² + model variance + σ²
Bias is the result of misspecifying the model f
Reflects how close the functional form of the model is to the true model
High bias results in underfitting
Model estimation variance is the result of using a sample to estimate f(X)
It quantifies how much the model depends on the particular data points used to build it.
High variance = small changes in the data change the model parameter estimates substantially
High variance results in overfitting
σ² is irreducible error that remains even if the model f is correctly specified and estimated
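A small simulation can make this decomposition tangible. The sketch below is assumption-laden (sinusoidal true f, a deliberately under-fitting degree-1 model, a single evaluation point x0, all chosen for illustration): it estimates (model bias)² and model variance by refitting the model on many training samples.

```python
# Sketch: estimating bias^2 and variance of a model's prediction at one point x0.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x)          # assumed true function
sigma = 0.3                          # assumed sd of the irreducible noise
x0, degree, n = 1.0, 1, 30           # evaluate an under-fitting degree-1 model at x0

preds = []
for _ in range(2000):                # many training samples from the same population
    x = rng.uniform(-2, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    preds.append(np.polyval(np.polyfit(x, y, degree), x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2  # (model bias)^2
var = preds.var()                    # model estimation variance
print(f"bias^2 = {bias2:.3f}  variance = {var:.3f}  noise = {sigma**2:.3f}")
print(f"expected MSE at x0 ≈ {bias2 + var + sigma**2:.3f}")
```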
How do explanatory and prediction modelling optimize MSE differently?
In explanatory modelling we try to minimize the MSE in our training dataset, and usually the MSE for predicting new cases is larger than the MSE in our training sample. In prediction modelling we instead try to minimize the MSE for new, unseen cases.
How can we improve prediction accuracy?
By reducing the variability of the regression estimates (model variance) at the cost of increased bias (model bias). Because OLS is unbiased, its model bias is 0, so its MSE consists of model variance plus irreducible error.
Thus we can find a model that is biased but has smaller variance, and this would improve prediction accuracy.
The best explanatory model is often different from the best prediction model, because we optimise in a different way!
If we fit 90 parameters to a data set of 100 persons, the model explains almost 100% of the variance in the training data, the absolute values of the coefficients are too large, and we get Type I errors (too many significant parameters). What does this mean?
Our model overfits the training data and is unlikely to predict well a new random sample
Our OLS model is unbiased (with increasing sample size the parameters would move towards their true values).
The absolute values of the parameters are too large (many well above 0).
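The sketch below reproduces this situation under assumed conditions (90 pure-noise predictors and n = 100 simulated persons, chosen to mirror the card): training R² is close to 1, yet on a new random sample the model predicts worse than simply using the mean, and some coefficients are strongly inflated.

```python
# Sketch: fitting 90 noise predictors to n = 100 gives a near-perfect training fit only.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 90
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                            # outcome unrelated to the predictors

X1 = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(X1, y, rcond=None)[0]
r2_train = 1 - np.sum((y - X1 @ b) ** 2) / np.sum((y - y.mean()) ** 2)

X_new = rng.normal(size=(n, p)); y_new = rng.normal(size=n)   # a new random sample
X_new1 = np.column_stack([np.ones(n), X_new])
r2_test = 1 - np.sum((y_new - X_new1 @ b) ** 2) / np.sum((y_new - y_new.mean()) ** 2)

print(f"training R^2 = {r2_train:.2f}   test R^2 = {r2_test:.2f}")   # ~0.9 vs negative
print("largest |coefficient|:", np.abs(b[1:]).max().round(2))        # inflated values
```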
How could we improve the model?
We need to shrink the regression coefficients somehow
- Shrinkage (or regularization) helps prevent linear models from overfitting the training data by shrinking the coefficients towards 0.
What do shrinkage or regularization methods do?
They perform linear regression while regularizing or shrinking the estimated coefficients towards 0.
Why does shrinkage help with overfitting?
It introduces bias but may decrease the variance of the estimates. If the latter effect is larger, we would decrease the test error!
By sacrificing unbiasedness, we can reduceβ¦
the variance to make the overall MSE lower
With regularised methods what do we introduce?
Some bias: on average we are a bit away from the true parameter, but there is very little model variance. The estimates cluster close to the true value and the model will predict well. Thus, on average, they are closer to the true values than ordinary least squares estimates.
What did Van Houwelingen and le Cessie (1990) develop?
Heuristic shrinkage estimate:
γ̂ = (model χ² − p) / model χ²
where
p is the total degrees of freedom of the predictors (number of parameters − 1) and
model χ² is the likelihood-ratio statistic for testing the overall effect of all predictors.
For a linear model with intercept β̂0 and coefficients β̂j (j = 1, 2, …, p), the shrunken estimates are easily computed:
shrunken β̂j = γ̂ · β̂j (the shrinkage estimate γ̂ times the estimated regression coefficient)
shrunken β̂0 = (1 − γ̂) · ȳ + γ̂ · β̂0 (the intercept is (1 − γ̂) times the mean of all observed outcomes plus γ̂ times the estimated intercept)
The model with shrunken regression coefficients predicts new unseen cases better on average (see the sketch below).
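A minimal sketch of this heuristic for a linear model, under stated assumptions: the data are simulated, and the model χ² is computed as n·ln(RSS_null/RSS_full), which is one common form of the likelihood-ratio statistic for normal-error linear models (my assumption, not a formula given in the session).

```python
# Sketch: heuristic shrinkage (gamma hat) for a simulated linear model.
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.concatenate([[0.5, -0.5], np.zeros(p - 2)])   # mostly weak predictors
y = 1.0 + X @ beta_true + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(X1, y, rcond=None)[0]           # OLS fit: intercept b[0], slopes b[1:]
rss_full = np.sum((y - X1 @ b) ** 2)
rss_null = np.sum((y - y.mean()) ** 2)              # intercept-only model
chi2 = n * np.log(rss_null / rss_full)              # assumed form of the model chi-square

gamma = (chi2 - p) / chi2                           # heuristic shrinkage factor
b_shrunk = gamma * b[1:]                            # shrink the slopes towards 0
b0_shrunk = (1 - gamma) * y.mean() + gamma * b[0]   # adjusted intercept

print(f"gamma hat = {gamma:.3f}")
print("OLS slopes:     ", b[1:4].round(3), "...")
print("shrunken slopes:", b_shrunk[:3].round(3), "...")
```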
Evaluate Heuristic shrinkage estimate
Works reasonably well in generalized linear models (e.g. Steyerberg et al. 2001)
p is the number of candidate predictors if variable selection is used! If we remove some variables because they are not correlated with the outcome, we must still count the variables we started with, not the final number of variables in the model.
It would be better to integrate the shrinkage in the model building process.
Find a shrinkage on the parameters that optimizes the prediction of unseen cases
- A general shrinkage procedure is provided by penalized (or regularized) regression methods.
What is a modern approach to prediction modelling?
Regularized or penalized methods that can be applied both to large data sets (bioinformatics, neuroimaging, wearable data) and to small data sets with a large number of variables (RCTs, experimental studies, cohort studies).
Not really new:
Ridge regression: Arthur Hoerl and Robert Kennard (1970)
but limited computer power long restricted their use
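As an illustration of how little code these methods need today, here is a hedged sketch with simulated data (many weak predictors, illustrative names and values): ridge regression adds an L2 penalty on the coefficients to the residual sum of squares, fitted here with scikit-learn and the penalty chosen by cross-validation.

```python
# Sketch: ridge regression vs OLS on simulated data with many weak predictors.
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta_true = np.concatenate([[1.0, -1.0, 0.5], np.zeros(p - 3)])
y = X @ beta_true + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)          # penalised methods need standardised X
ols = LinearRegression().fit(Xs, y)
ridge = RidgeCV(alphas=np.logspace(-2, 3, 50)).fit(Xs, y)   # pick penalty strength by CV

print("largest |OLS coefficient|:  ", np.abs(ols.coef_).max().round(2))
print("largest |ridge coefficient|:", np.abs(ridge.coef_).max().round(2))
print("chosen penalty (alpha):", round(ridge.alpha_, 3))
```

The ridge coefficients are visibly shrunk towards 0 compared with OLS; this is the bias-variance trade-off from the earlier cards, built into the estimation itself.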