Week 10 - Multiple regression Flashcards
Multiple regression
Linear regression model with multiple predictors
Interpretation of regression coefficients
The expected change in y for a one-unit increase in a given predictor, holding the other predictors fixed
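For example, a minimal sketch in R (the data frame df and the variables y, x1, x2 are hypothetical):

```r
# Fit a linear model of y on two predictors (names are hypothetical)
model <- lm(y ~ x1 + x2, data = df)

# Each coefficient is the expected change in y for a one-unit
# increase in that predictor, holding the other predictors fixed
summary(model)
```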
Influential data points
Data points that have a disproportionately large impact on the model fit
Typically undesirable (we want the model to “use” all of the data)
Two ways to identify such points: Leverage and Cook’s distance
Leverage
How far an observation is from the main cluster of values, based ONLY on the predictor (x) values
The further away it is, the more it can “tilt” the regression line
Changing the y-value of a high-leverage point can noticeably shift the entire regression line
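A toy sketch of that effect on simulated data: point 21 sits far out on the x-axis, so changing only its y-value swings the fitted slope.

```r
set.seed(1)
x <- c(rnorm(20), 10)   # point 21 is far from the cluster on the x-axis
y <- 2 * x + rnorm(21)  # all points lie near the line y = 2x

y2 <- y
y2[21] <- -10           # change only the high-leverage point's y-value

coef(lm(y ~ x))         # slope close to 2
coef(lm(y2 ~ x))        # slope dragged well away from 2
```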
Cook’s distance
Measures the influence of a given data point on the fitted model
Uses the response variable (Y) and the predictors (X)
Combines information about both:
- leverage (how far an observation’s x-value is from the mean of all x-values)
- residuals (how far the actual y-value is from the predicted y-value)
If we were to remove a point, how would the line of best fit change?
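For reference, one standard closed form: for observation i with leverage h_i and standardised residual r_i, in a model with p parameters,
D_i = (r_i^2 / p) × h_i / (1 − h_i)
so a large residual and high leverage both inflate D_i.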
Finding highly influential points
Points whose leverage score or Cook’s D exceeds the threshold 2p/n, where:
- p is the number of predictor variables (includes intercept/constant)
- n is the number of observations
Compute the score for each point; any point whose score exceeds the threshold can be flagged as highly influential (see the sketch below). In the per-observation output of R’s broom::augment():
- “.hat” is the leverage score
- “.cooksd” is Cook’s D
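A sketch of that workflow (the model and data frame are hypothetical):

```r
library(broom)

model <- lm(y ~ x1 + x2, data = df)   # hypothetical model
aug   <- augment(model)               # per-observation .hat, .cooksd, etc.

p <- length(coef(model))              # number of parameters, incl. intercept
n <- nrow(df)
threshold <- 2 * p / n

# Flag points whose leverage or Cook's D exceeds the threshold
aug[aug$.hat > threshold | aug$.cooksd > threshold, ]
```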
Collinearity
Where some predictors are highly correlated with each other
Impacts of Collinearity
- Can’t disentangle their influence
- Uncertain estimates of regression coefficients
- Model has to “choose” whether to favour one or the other
- Numerical instability
Possible solutions to Collinearity
- Remove some predictors (consider them redundant)
- Ignore the problem (if the main interest is prediction rather than interpreting individual coefficients)
Variance inflation factor (VIF)
Used to assess the degree of collinearity among predictors: VIF_j = 1 / (1 − R_j^2), where R_j^2 is the R^2 from regressing predictor j on all the other predictors
VIF values greater than 10 are considered high
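A minimal sketch using car::vif() (the model and variables are hypothetical):

```r
library(car)

model <- lm(y ~ x1 + x2 + x3, data = df)
vif(model)   # one VIF per predictor; values above 10 suggest strong collinearity
```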
Model performance measures
Determine which predictors will give us a good model
Log-likelihood (maximised) - Bad -> adding predictors never decreases the likelihood
R^2 - Bad -> adding predictors never decreases R^2
Adjusted R^2 - Good -> adds a “penalty” for extra predictors
AIC (the smaller the better) - Good -> a penalised likelihood measure
BIC (the smaller the better) - Good -> a penalised likelihood measure
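All of these are available in base R; a sketch comparing two nested (hypothetical) models:

```r
m1 <- lm(y ~ x1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)

logLik(m1); logLik(m2)                        # never decreases as predictors are added
summary(m1)$r.squared; summary(m2)$r.squared  # same problem
summary(m2)$adj.r.squared                     # penalises extra predictors
AIC(m1, m2)                                   # smaller is better
BIC(m1, m2)                                   # smaller is better
```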
AIC vs BIC
The AIC is usually better for choosing a good predictive model (we prefer AIC)
The BIC is usually better for choosing a “true” model
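The difference comes from the penalty terms: for a model with p estimated parameters and n observations, AIC = −2 log L + 2p while BIC = −2 log L + p log(n). Since log(n) > 2 once n ≥ 8, BIC penalises extra parameters more heavily and so tends to select smaller models.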