Linear Model Evaluation/Diagnostics Flashcards
What is the R^2 statistic?
The proportion of variance explained by the model
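A minimal sketch of how this proportion can be computed, using the usual definition R^2 = 1 - SS_res / SS_tot. The function name `r_squared` and the small data set are hypothetical and only for illustration.

```python
def r_squared(y, y_hat):
    """Return R^2 = 1 - SS_res / SS_tot for observed y and fitted y_hat."""
    mean_y = sum(y) / len(y)
    # Residual sum of squares: variation left unexplained by the model
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    # Total sum of squares: variation of the response around its mean
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and fitted values, for illustration only
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(y, y_hat), 4))
```

A perfect fit gives 1, and a model that always predicts the mean of the response gives 0.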
Why is R^2 not always the best indicator of predictive power?
An overfitted model will have a high R^2 statistic but poor predictive power
If the noise variance is high, R^2 will be low even when the model is correct
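The second point can be illustrated with a small simulation: the same true straight line is fitted twice, once with low noise and once with high noise. This is only a sketch under stated assumptions; the intercept, slope, noise levels, seed and helper names are all hypothetical choices.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared_of_fit(x, y):
    """R^2 of the least squares line fitted to (x, y)."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def simulate(sigma, n=200, seed=0):
    """Data from the (assumed) true model y = 1 + 2x + Normal(0, sigma) noise."""
    rng = random.Random(seed)
    x = [10 * i / (n - 1) for i in range(n)]
    y = [1.0 + 2.0 * xi + rng.gauss(0, sigma) for xi in x]
    return x, y


# Same correct model form in both cases; only the noise level differs
r2_low_noise = r_squared_of_fit(*simulate(sigma=1.0))
r2_high_noise = r_squared_of_fit(*simulate(sigma=20.0))
print(r2_low_noise, r2_high_noise)
```

The fitted model has the right form in both runs, yet the high-noise run reports a much lower R^2.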
What are the assumptions we make about the errors in a model?
- well described by a normal distribution
- have constant variance
- are independent of each other
What do we expect to see when we remove the signal from the model?
Residuals that are normally distributed
What are the two qualitative ways to assess normality?
- a histogram of the residuals
- a QQ Norm plot of the residuals
What are the two quantitative ways to assess normality?
- Shapiro-Wilk test for Normality
- Kolmogorov-Smirnov test for Normality
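As a rough illustration of the second test, the Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution of the residuals and a reference normal distribution. The sketch below computes only the statistic, using the standard library, with the normal's mean and standard deviation estimated from the data (an assumption made here for convenience; the usual p-value tables do not apply unchanged in that case). The function name and both data sets are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(residuals):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    n = len(residuals)
    fitted = NormalDist(mean(residuals), stdev(residuals))
    d = 0.0
    for i, x in enumerate(sorted(residuals), start=1):
        f = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical data: one symmetric bell-shaped sample, one strongly skewed sample
n = 99
normal_like = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
skewed = [math.exp(z) for z in normal_like]

d_normal = ks_statistic_normal(normal_like)
d_skewed = ks_statistic_normal(skewed)
print(d_normal, d_skewed)
```

A small statistic is consistent with normality; the skewed sample gives a clearly larger value.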
Describe QQ Norm plots
- Plot the quantiles of two sets of data against each other
- If their shapes are similar (here, if the residuals are roughly normally distributed), the points tend to fall on a straight line
- Plot the residuals, sorted in order, against the standardised quantiles of the distribution of interest
What is the quantile (probability level) of the ith point of a QQ Norm plot typically given by?
i/(n+1), where n is the number of residuals
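A minimal sketch of how the points of a QQ Norm plot could be built from this rule: the ith smallest residual is paired with the standard normal quantile at probability i/(n+1). The function name `qq_points` and the residual values are hypothetical.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with the standard normal quantile at i/(n+1)."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals, for illustration only
residuals = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6, 0.2]
points = qq_points(residuals)
for theoretical_q, residual in points:
    print(f"{theoretical_q:+.3f}  {residual:+.3f}")
```

Plotting the first column against the second gives the QQ Norm plot; dividing by n+1 rather than n keeps the largest probability below 1, so its normal quantile is finite.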
Describe the Shapiro-Wilk test
Produces a statistic which relates to the straightness of the QQ plot
Null hypothesis, H0: data are normally distributed
What will happen if the assumption that the errors are independent is violated?
Standard errors and p-values are systematically too small, so we risk drawing the wrong conclusions about model covariates
What can the null hypothesis of uncorrelated errors be formally tested by?
Durbin-Watson test
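The test statistic itself is simple to compute: the sum of squared differences between successive residuals, divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation. The sketch below computes only the statistic (not a p-value); the function name and the two residual series are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in observation order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, for illustration only
drifting = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]      # neighbours are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbours flip sign

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)
print(dw_drifting, dw_alternating)
```

The residuals must be in their natural (for example, time) order, since the statistic compares each residual with the one before it.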
What can independence also be violated by?
The study design itself, for example pseudoreplication (treating non-independent observations as independent replicates)
What is the practical consequence of falsely assuming independence?
We may wrongly conclude that one or more unrelated variables are genuinely related to the response
What can we do if we have correlation in the residuals?
- Ignore the correlation in residuals
- Try to remove the correlation in model residuals by sub-setting the data
- Account for the correlation using a generalized least squares model
What do we use partial residual plots for?
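To assess whether an apparent non-linearity in a predictor's relationship with the response is due to that predictor itself or to another predictor, by examining each predictor after adjusting for the others in the model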