Tutorial 3 - Model Misspecification, Model Choice, Model Diagnostics, Multicollinearity Flashcards
What are possible ways to model nonlinear effects?
- Log-transformation of y and/or x
- Higher-order polynomials in x (quadratic, cubic,…)
- Semi- or nonparametric regression (not covered here)
- Nonlinear regression models (not covered here)
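For example, a quadratic effect can be captured by simply adding x² as a regressor (log effects work analogously via `np.log`). A minimal sketch with simulated data; all numbers are made up for illustration:

```python
import numpy as np

# Simulated data: y depends quadratically on x (true coefficients 1.0, 2.0, -0.3).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x - 0.3 * x**2 + rng.normal(0, 0.5, 200)

# Quadratic specification: regress y on [1, x, x^2] by OLS.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates should be close to the true (1.0, 2.0, -0.3)
```

The model stays linear in the parameters, so ordinary OLS applies unchanged.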
How do you calculate the returns to experience in this regression?
How to calculate the gender wage gap in this case?
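The regression behind these two cards is not shown. Assuming the common Mincer-type specification log(wage) = β₀ + β₁·exper + β₂·exper² + β₃·female (simulated data; all coefficients and sample sizes are made up), both quantities can be computed like this:

```python
import numpy as np

# Simulated wage data under an assumed Mincer-type specification.
rng = np.random.default_rng(1)
n = 500
exper = rng.uniform(0, 40, n)
female = rng.integers(0, 2, n)
logwage = 1.5 + 0.05 * exper - 0.001 * exper**2 - 0.15 * female + rng.normal(0, 0.2, n)

X = np.column_stack([np.ones(n), exper, exper**2, female])
b, *_ = np.linalg.lstsq(X, logwage, rcond=None)

# Returns to experience at e.g. 10 years: d log(wage)/d exper = b1 + 2*b2*exper
ret_10 = b[1] + 2 * b[2] * 10
# Exact gender wage gap (in percent) from the log specification: exp(b3) - 1
gap = np.exp(b[3]) - 1
print(ret_10, gap)
```

With the quadratic term, the returns to experience depend on the level of experience; with log(wage) as outcome, exp(β₃) − 1 gives the exact percentage gap (β₃ itself is only an approximation for small effects).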
What is multicollinearity?
Perfect multicollinearity: one regressor can be expressed as a perfect linear combination of one or several other regressors.
What does multicollinearity mean mathematically?
This means that the N × K regressor matrix X does not have full column rank K
-> X'X is singular (not invertible), so the OLS estimator β^ = (X'X)⁻¹X'y does not exist (β is not identified).
What is the problem with this regression and how to solve it?
Perfect multicollinearity. One variable will drop out!
-> Easy to detect and solve: leave one category out as the reference category
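The dummy-variable trap behind this card can be reproduced in a few lines; the category coding below is made up for illustration:

```python
import numpy as np

# Illustration: three education categories, each observation gets one dummy.
cat = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
D = np.eye(3)[cat]                            # one dummy per category
X_trap = np.column_stack([np.ones(len(cat)), D])  # intercept + ALL three dummies

# The three dummies sum to the intercept column -> perfect multicollinearity:
print(np.linalg.matrix_rank(X_trap))          # 3, not 4: X lacks full column rank

# Fix: drop one category (the reference category).
X_ok = np.column_stack([np.ones(len(cat)), D[:, 1:]])
print(np.linalg.matrix_rank(X_ok))            # 3 = full column rank
```

The dropped category becomes the baseline: the remaining dummy coefficients measure differences relative to it.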
What are the consequences of “imperfect multicollinearity”?
- OLS coefficients still unbiased, but more unstable (i.e. coefficients might be very different if another sample was used).
- Standard errors are still valid, but they can be large. Thus, regressors may be individually insignificant, even if they are jointly significant.
- Multicollinearity is not a problem if the aim is to predict y (rather than to estimate the effect of a single regressor on y).
What is the Variance inflation factor?
a measure for multicollinearity
How do you calculate the Variance inflation factor?
For variable xⱼ, the VIF is defined as VIFⱼ = 1 / (1 − Rⱼ²), where Rⱼ² denotes the R² from a regression of variable xⱼ on all other covariates xₖ (k ≠ j).
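A minimal sketch on made-up data, computing VIFⱼ = 1/(1 − Rⱼ²) directly from the auxiliary regression:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress x_j on a constant and all other columns."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1 - resid.var() / y.var()            # R^2 of the auxiliary regression
    return 1 / (1 - r2)

# Made-up data: x2 is nearly a linear function of x1 -> high VIF for both.
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)     # almost collinear with x1
x3 = rng.normal(size=200)                     # independent of the others
X = np.column_stack([x1, x2, x3])
print(vif(X, 0), vif(X, 2))  # VIF of x1 is large; VIF of x3 is near 1
```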
What is the idea and interpretation of the variance inflation factor (VIF)?
- Idea: strong linear dependence between xⱼ and all other covariates results in a high Rⱼ² -> high VIFⱼ.
- This is called the variance inflation factor because the higher the dependence between xⱼ and the other covariates, the higher is Var( β^ⱼ ).
- Rule of thumb:
- VIFⱼ > 10 implies serious multicollinearity,
- VIFⱼ = 1 would mean that xⱼ has zero correlation with the other regressors
What other way (except VIF) is there to check for multicollinearity?
test whether regressors are jointly significant -> F-Test for joint significance
How to detect outliers/influential observations?
- Residual analysis: plot standardized/studentized residuals against fitted values
- Cook’s distance as a measure of influence for observation i
How do you calculate Cook's distance?
Cᵢ = Σⱼ (ŷⱼ − ŷⱼ₍₋ᵢ₎)² / (K·s²) = (ε̂ᵢ² / (K·s²)) · hᵢᵢ / (1 − hᵢᵢ)², where ŷⱼ₍₋ᵢ₎ is the fitted value when observation i is excluded, hᵢᵢ is the leverage of observation i, and s² is the estimated error variance.
How do you interpret Cook's distance?
Rule of thumb: Cᵢ > 4/N -> large influence!
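A sketch with simulated data and one planted influential observation, using the leverage form of Cook's distance, Cᵢ = ε̂ᵢ²/(K·s²) · hᵢᵢ/(1 − hᵢᵢ)²; all numbers are made up:

```python
import numpy as np

# Simulated regression data with one planted influential observation.
rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)
x[0], y[0] = 6.0, -10.0                       # far from the rest in x AND y

X = np.column_stack([np.ones(n), x])
K = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
h = np.diag(H)                                # leverages h_ii
e = y - H @ y                                 # OLS residuals
s2 = e @ e / (n - K)                          # estimated error variance
cooks = (e**2 / (K * s2)) * h / (1 - h) ** 2  # Cook's distance per observation

print(np.sum(cooks > 4 / n))  # the planted observation exceeds the 4/N cutoff
```

Cook's distance combines a large residual with high leverage, so points that are extreme in both directions, like the planted one, stand out.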
What is the total sum of squares (“Total variance”)?