QA9 - Regression Diagnostics Flashcards
Explain how to test whether a regression is affected by heteroskedasticity
Heteroskedasticity is where the variance of the error term varies systematically with one or more explanatory variables. White's test detects it:
- Estimate the model and compute the residuals ei
- Regress ei^2 on a constant, all explanatory variables, their squares, and their cross-products
(ei^2 = v0 + v1 * Xi1 + v2 * Xi2 + v3 * Xi1^2 + v4 * Xi1 * Xi2 + v5 * Xi2^2 + ni)
- Under the null of homoskedasticity, H0: v1 = v2 = ... = v5 = 0; the LM statistic n * R^2 from the auxiliary regression is chi-squared distributed under H0
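The test above can be sketched in numpy; the simulated data, sample size, and function names here are my own illustration, not from the source:

```python
import numpy as np

# Simulated data whose error variance grows with x1 (illustrative only)
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n) * (0.5 + 2.0 * x1**2)

def ols_residuals(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Step 1: estimate the model and compute residuals e_i
X = np.column_stack([np.ones(n), x1, x2])
e = ols_residuals(X, y)

# Step 2: auxiliary regression of e_i^2 on constant, levels, squares, cross-products
Z = np.column_stack([np.ones(n), x1, x2, x1**2, x1 * x2, x2**2])
u = ols_residuals(Z, e**2)
r2 = 1.0 - (u @ u) / np.sum((e**2 - np.mean(e**2))**2)

# Step 3: LM statistic n * R^2, chi-squared with 5 df under H0 (homoskedasticity)
lm = n * r2
print(lm)  # large values reject homoskedasticity
```

With heteroskedasticity this strong, the LM statistic lands far above the 5% chi-squared(5) critical value of about 11.1.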
Describe approaches to using heteroskedastic data
- ignore it and use heteroskedasticity-robust standard errors for hypothesis testing
- transform the data to remove it (e.g. take logs)
- use weighted least squares (WLS) for parameter estimation
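The first approach can be sketched with a White (HC0) "sandwich" covariance in numpy; the function name and simulated data are my own:

```python
import numpy as np

def ols_robust(X, y):
    """OLS coefficients with conventional and White (HC0) robust standard errors.
    Minimal sketch, not a library implementation."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    # Conventional SEs assume a constant error variance
    se_conv = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - k))
    # Sandwich estimator: the "meat" weights each observation by its squared residual
    cov_hc0 = XtX_inv @ (X.T @ (X * (e**2)[:, None])) @ XtX_inv
    se_robust = np.sqrt(np.diag(cov_hc0))
    return beta, se_conv, se_robust

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))  # heteroskedastic errors
beta, se_conv, se_robust = ols_robust(np.column_stack([np.ones(n), x]), y)
```

Because the high-variance observations here sit at high leverage, the robust standard error on the slope comes out larger than the conventional one, which understates the uncertainty.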
Characterise multicollinearity and its consequences; distinguish between multicollinearity and perfect collinearity
Multicollinearity is where one or more explanatory variables can be substantially explained by the others. It inflates the standard errors of the affected coefficients, so variables that are jointly significant may have very small individual t-statistics.
Perfect collinearity is where one of the variables is perfectly described by another
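Multicollinearity is commonly measured with variance inflation factors; a minimal numpy sketch (function name and simulated data are my own):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (constant excluded).
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        resid = target - Z @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean())**2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
```

The two nearly collinear columns produce very large VIFs, while the independent column stays near 1; under perfect collinearity the auxiliary R^2 would be exactly 1 and the VIF infinite.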
Describe the consequences of excluding a relevant explanatory variable from a model and contrast those with the consequences of including an irrelevant regressor
Omitting a relevant variable biases the remaining coefficient estimates (omitted variable bias) whenever the omitted variable is correlated with the included ones
Including an irrelevant variable leaves the estimates unbiased but less precise, and reduces adjusted R^2 through the penalty for extra parameters
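A short simulation makes the omitted variable bias concrete; the model coefficients (2 and 3) and the correlation (0.8) are illustrative numbers of my own:

```python
import numpy as np

# True model: y = 1 + 2*x1 + 3*x2, with x2 correlated with x1
rng = np.random.default_rng(3)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_full = ols(np.column_stack([np.ones(n), x1, x2]), y)  # recovers (1, 2, 3)
b_omit = ols(np.column_stack([np.ones(n), x1]), y)      # x2 omitted
# Omitting x2 shifts the x1 coefficient by roughly 3 * 0.8 = 2.4,
# so b_omit[1] lands near 4.4 instead of 2
```

The bias equals the omitted coefficient times the regression coefficient of the omitted variable on the included one, so it vanishes only when the two regressors are uncorrelated.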
Explain two model selection procedures and how these relate to the bias variance tradeoff
The bias-variance tradeoff balances large models, which have low bias but imprecise (high-variance) parameter estimates, against small models, which have less estimation error but more bias
- General-to-specific:
- start with a large model containing all candidate variables
- remove the insignificant variable with the smallest t-statistic
- re-estimate and repeat until all remaining variables are significant
- m-fold cross-validation:
- split the data into m blocks; fit each candidate model on m - 1 blocks
- compute the residuals on the held-out block
- rotate through all m blocks and choose the model with the smallest total out-of-sample sum of squared residuals
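The cross-validation procedure can be sketched in numpy; the function name and simulated candidate models are my own illustration:

```python
import numpy as np

def mfold_cv_sse(X, y, m=5, seed=0):
    """m-fold cross-validation: fit on m-1 blocks, accumulate the squared
    residuals on each held-out block. Minimal sketch."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, m)
    sse = 0.0
    for k in range(m):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(m) if j != k])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[test] - X[test] @ beta
        sse += resid @ resid
    return sse

rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

sse_small = mfold_cv_sse(np.column_stack([np.ones(n), x1]), y)      # omits x2
sse_full = mfold_cv_sse(np.column_stack([np.ones(n), x1, x2]), y)   # true model
# The candidate that includes the relevant regressor wins on out-of-sample SSE
```

Because the error is scored on data the model was not fitted to, cross-validation penalises both underfitting (missing x2 here) and overfitting, which is how it navigates the bias-variance tradeoff.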
Describe methods for identifying outliers and their impact
Outliers can be identified with Cook's distance Dj, which measures how much the fitted values change when observation j is dropped and the model re-estimated; as a rule of thumb, Dj > 1 flags observation j as an influential outlier
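Cook's distance has a closed form in terms of residuals and leverages, which avoids refitting the model n times; a numpy sketch with my own simulated data:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_j via the leverage formula, equivalent to comparing
    fitted values with and without observation j (no explicit refitting)."""
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                         # leverages
    e = y - H @ y                          # residuals
    s2 = (e @ e) / (n - k)                 # residual variance estimate
    return (e**2 / (k * s2)) * h / (1.0 - h)**2

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
x[0], y[0] = 5.0, -20.0                    # plant a high-leverage outlier
d = cooks_distance(np.column_stack([np.ones(n), x]), y)
# d[0] exceeds the rule-of-thumb threshold of 1
```

Note that D_j is large only when an observation has both a big residual and high leverage; a large residual at a typical x value may still leave the fitted line nearly unchanged.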