Linear Model Evaluation/Diagnostics Flashcards
What is the R^2 statistic?
The proportion of variance explained by the model
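In symbols (the usual definition): R^2 = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares around the mean of the response.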
Why is R^2 not always the best indicator of predictive power?
- An overfitted model can have a high R^2 but poor predictive power
- If the data are very noisy, R^2 can be low even when the model is correct
What are the assumptions we make about the errors in a model?
- are well described by a normal distribution
- have constant variance
- are independent of each other
What do we expect to see when we remove the signal from the model?
Residuals that are normally distributed
What are the two qualitative ways to assess normality?
- look at a histogram of the residuals
- look at a QQ Norm plot of the residuals
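A minimal sketch of both checks in Python, assuming a fitted statsmodels OLS result called fit (a hypothetical name):

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

resid = fit.resid  # residuals from the assumed fitted model

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.hist(resid, bins=20)                      # histogram of the residuals
ax1.set_title("Histogram of residuals")
stats.probplot(resid, dist="norm", plot=ax2)  # QQ Norm plot
ax2.set_title("QQ Norm plot")
plt.show()
```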
What are the two quantitative ways to assess normality?
- Shapiro-Wilk test for Normality
- Kolmogorov-Smirnov test for Normality
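A sketch of both tests using scipy, again assuming residuals resid from a fitted model:

```python
import scipy.stats as stats

w_stat, w_p = stats.shapiro(resid)       # Shapiro-Wilk test

# The KS test needs a fully specified reference distribution, so
# standardise the residuals and compare them to a standard normal.
z = (resid - resid.mean()) / resid.std()
ks_stat, ks_p = stats.kstest(z, "norm")  # Kolmogorov-Smirnov test

# Small p-values are evidence against H0: the data are normally distributed.
print(f"Shapiro-Wilk p = {w_p:.3f}, KS p = {ks_p:.3f}")
```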
Describe QQ Norm plots
- Plot the quantiles of two sets of data against each other
- If the two distributions have similar shapes, the points tend to fall on a straight line
- A QQ Norm plot plots the residuals, sorted in order, against the standardised quantiles of the distribution of interest
What is the ith plotting position on a QQ Norm plot typically given by?
i/(n+1), so the ith theoretical quantile is the standard normal quantile at probability i/(n+1)
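A sketch of building the QQ Norm plot by hand with these plotting positions (resid as before):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

r = np.sort(np.asarray(resid))        # residuals sorted in order
n = len(r)
p = np.arange(1, n + 1) / (n + 1)     # plotting positions i/(n+1)
theoretical = norm.ppf(p)             # standardised normal quantiles

plt.scatter(theoretical, r)
plt.xlabel("Theoretical normal quantiles")
plt.ylabel("Sorted residuals")
plt.show()
```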
Describe the Shapiro-Wilk test
Produces a statistic that relates to the straightness of the QQ Norm plot
Null hypothesis, H0: data are normally distributed
What will happen if the assumption that the errors are independent is violated?
Standard errors and p-values are systematically too small, so we risk drawing the wrong conclusions about model covariates
How can the null hypothesis of uncorrelated errors be formally tested?
With the Durbin-Watson test
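A sketch using statsmodels, assuming a fitted OLS result fit; the statistic is close to 2 when there is no first-order autocorrelation, and well below 2 when there is positive autocorrelation:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(fit.resid)  # Durbin-Watson statistic on the residuals
print(f"Durbin-Watson statistic: {dw:.2f}")
```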
What else can violate independence?
Subtler, design-level issues such as pseudoreplication, where repeated measurements on the same unit are treated as independent replicates
What is the practical consequence of falsely assuming independence?
We may wrongly conclude that one or more unrelated variables are genuinely related to the response
What can we do if we have correlation in the residuals?
- Ignore the correlation in residuals
- Try to remove the correlation in model residuals by sub-setting the data
- Account for the correlation using a generalised least squares model (see the sketch below)
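A sketch of the third option with statsmodels' GLSAR, which fits generalised least squares with AR(1) errors; y and X are assumed to be the response and predictor matrix from the original fit:

```python
import statsmodels.api as sm

# Generalised least squares with an AR(1) error structure.
gls_model = sm.GLSAR(y, sm.add_constant(X), rho=1)
gls_fit = gls_model.iterative_fit(maxiter=5)  # alternately estimate rho and refit
print(gls_fit.summary())
```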
What do we use partial residual plots for?
To check whether apparent non-linearity is caused by the predictor in question or by another predictor not yet accounted for
What do partial residual plots show?
The relationship between y and an individual x, adjusted for the other x's, by plotting the partial residuals against that x
What are the useful diagnostic properties of partial residuals?
- The slope of the line is the regression coefficient
- The extent of the scatter tells us about the support for the function
- We can identify large residuals
- Curved plots signal non-linear relationships
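A sketch of a partial residual (component-plus-residual) plot via statsmodels, assuming a fitted OLS result fit whose model includes a predictor named "x1" (a hypothetical name):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Partial residuals for x1, adjusted for the other predictors.
fig = sm.graphics.plot_ccpr(fit, "x1")
plt.show()
```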
What do we do if we have error distribution shape problems?
- Try transforming the response or predictors to address the distributional shape problems
- Move to other models like generalised linear models (see the sketch below)
- Bootstrap your way to glory
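A sketch of the first two options, assuming a pandas DataFrame df with response "y" and predictor "x" (hypothetical names):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Option 1: transform the response (here a log transform) and refit.
df["log_y"] = np.log(df["y"])
log_fit = smf.ols("log_y ~ x", data=df).fit()

# Option 2: move to a generalised linear model with a non-normal family.
glm_fit = smf.glm("y ~ x", data=df, family=sm.families.Gamma()).fit()
```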
What do we do if we have independence problems?
- Move to other models/methods like mixed models (LMM, GLMM) or use generalised estimating equations (see the sketch below)
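A sketch of both routes in statsmodels, assuming a DataFrame df with response "y", predictor "x", and a grouping column "subject" that induces the correlation (hypothetical names):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Linear mixed model with a random intercept for each subject.
lmm_fit = smf.mixedlm("y ~ x", data=df, groups=df["subject"]).fit()

# Generalised estimating equations with an exchangeable working correlation.
gee_fit = smf.gee("y ~ x", groups="subject", data=df,
                  cov_struct=sm.cov_struct.Exchangeable()).fit()
```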
What do we do if we have signal problems?
- Use more flexible linear models (e.g. polynomial terms) or generalised additive models (GAMs)
How do we bootstrap?
- We make a new dataset of the same dimensions by sampling the rows of the data with replacement
- We repeat this many times, fitting the model at each stage
- The spread of the resulting estimates shows roughly how things might change if we had another sample of data (see the sketch below)
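A sketch of a case-resampling bootstrap for a regression slope, assuming a DataFrame df with response "y" and predictor "x" (hypothetical names):

```python
import numpy as np
import statsmodels.formula.api as smf

n_boot = 1000
slopes = []
for _ in range(n_boot):
    # New dataset of the same dimensions: sample rows with replacement.
    boot_df = df.sample(n=len(df), replace=True)
    boot_fit = smf.ols("y ~ x", data=boot_df).fit()
    slopes.append(boot_fit.params["x"])

# The spread of the bootstrap slopes shows roughly how the estimate
# might change with another sample of data.
print(np.percentile(slopes, [2.5, 97.5]))
```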