Tutorial 3 - Model Misspecification, Model Choice, Model Diagnostics, Multicollinearity Flashcards
What are possible ways to model nonlinear effects?
- Log-transformation of y and/or x
- Higher-order polynomials in x (quadratic, cubic,…)
- Semi- or nonparametric regression (not covered here)
- Nonlinear regression models (not covered here)
How do you calculate the returns to experience in this regression?
How do you calculate the gender wage gap in this case?
What is multicollinearity?
Perfect multicollinearity: one regressor can be expressed as a perfect linear combination of one or several other regressors.
What does multicollinearity mean mathematically?
This means that the N × K regressor matrix X does not have full column rank K.
-> X’X is singular (not invertible), so the OLS estimator β^ = (X’X)⁻¹X’y cannot be computed (the coefficients are not identified).
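A minimal numpy sketch with made-up data: a regressor that is an exact linear function of the intercept and another regressor leaves X without full column rank, so X’X is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1 - 3                              # exact linear function of x1 (and the constant)
X = np.column_stack([np.ones(n), x1, x2])    # N x K regressor matrix, K = 3

print(np.linalg.matrix_rank(X))              # 2 < K: no full column rank
print(np.linalg.matrix_rank(X.T @ X))        # also 2: X'X singular, (X'X)^-1 does not exist
```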
What is the problem with this regression and how to solve it?
Perfect multicollinearity. One variable will drop out!
-> Easy to detect and solve: leave one category out as the reference category
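This is the classic dummy variable trap. A small numpy sketch (invented categorical data): with an intercept, dummies for all categories sum to the constant column; dropping one category as the reference restores full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)
group = rng.integers(0, 3, size=60)                      # 3 categories
dummies = np.eye(3)[group]                               # one dummy per category
X_all = np.column_stack([np.ones(60), dummies])          # intercept + all 3 dummies
X_ref = np.column_stack([np.ones(60), dummies[:, 1:]])   # category 0 = reference

print(np.linalg.matrix_rank(X_all))  # 3 < 4: dummies sum to the intercept column
print(np.linalg.matrix_rank(X_ref))  # 3 = full column rank: OLS identified
```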
What are the consequences of “imperfect multicollinearity”?
- OLS coefficients are still unbiased, but unstable (i.e. the coefficients might be very different if another sample were used).
- Standard errors remain valid, but they can be large. Thus, regressors may be individually insignificant, even if they are jointly significant.
- Multicollinearity is not a problem if the aim is to predict y (rather than to estimate the effect of a single regressor on y).
What is the Variance inflation factor?
A measure of the severity of (imperfect) multicollinearity for an individual regressor xⱼ.
How do you calculate the Variance inflation factor?
For variable xⱼ, the VIF is defined as VIFⱼ = 1/(1 − Rⱼ²), where Rⱼ² denotes the R² from a regression of variable xⱼ on all other covariates xₖ (k ≠ j).
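A minimal numpy sketch of the calculation (made-up data): regress each regressor on the others, take that R², and invert 1 − R².

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress x_j on an intercept plus all other columns."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated to the others
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 far above 10; x3 close to 1
```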
What is the idea and interpretation of the variance inflation factor (VIF)?
- Idea: strong linear dependence between xⱼ and all other covariates results in a high Rⱼ² -> high VIFⱼ.
- This is called the variance inflation factor because the higher the dependence between xⱼ and the other covariates, the higher is Var( β^ⱼ ).
- Rule of thumb:
- VIFⱼ > 10 implies serious multicollinearity,
- VIFⱼ = 1 would mean that xⱼ has zero correlation with the other regressors
What other way (except VIF) is there to check for multicollinearity?
Test whether regressors are jointly significant -> F-test for joint significance.
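A numpy sketch of the idea with simulated data: two nearly collinear regressors can have large standard errors individually, yet the F-statistic comparing the restricted model (both coefficients zero) to the full model is clearly significant.

```python
import numpy as np

def ols_rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return e @ e

rng = np.random.default_rng(3)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 0.05 * rng.normal(size=N)     # x1 and x2 nearly collinear
y = x1 + x2 + rng.normal(size=N)

X_full = np.column_stack([np.ones(N), x1, x2])
X_rest = np.ones((N, 1))                # restriction: beta1 = beta2 = 0
rss_f, rss_r = ols_rss(X_full, y), ols_rss(X_rest, y)

q, df = 2, N - 3                        # q restrictions, N - K residual df
F = ((rss_r - rss_f) / q) / (rss_f / df)
print(round(F, 1))                      # large F -> jointly significant
```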
How to detect outliers/influential observations?
- Residual analysis: plot standardized/studentized residuals against fitted values
- Cook’s distance as a measure of influence for observation i
How do you calculate Cook’s distance?
Cᵢ = (êᵢ² / (K σ̂²)) · hᵢᵢ/(1 − hᵢᵢ)², where hᵢᵢ is the leverage of observation i (i-th diagonal element of the hat matrix H = X(X’X)⁻¹X’) and K the number of parameters. Equivalently, Cᵢ measures the scaled change in all fitted values when observation i is left out.
How do you interpret the Cook’s distance?
Rule of thumb: Cᵢ > 4/N -> large influence!
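A numpy sketch with simulated data and one planted influential point, computing Cᵢ from the residuals and leverages of the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x = rng.normal(size=N)
y = 1 + 2 * x + rng.normal(size=N)
x[0], y[0] = 6.0, -10.0                  # plant one influential outlier

X = np.column_stack([np.ones(N), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverages h_ii
e = y - H @ y                            # residuals
K = X.shape[1]
s2 = e @ e / (N - K)
cook = e**2 / (K * s2) * h / (1 - h) ** 2

print(int(np.argmax(cook)))              # index 0: the planted outlier
print(np.sum(cook > 4 / N))              # observations flagged by the rule of thumb
```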
What is the total sum of squares (“Total variance”)?
TSS = Σᵢ (yᵢ − ȳ)²
What is the residual sum of squares (“Unexplained variance”)?
RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ êᵢ²
What is the explained sum of squares (“Explained variance”)?
ESS = Σᵢ (ŷᵢ − ȳ)²
How do you interpret a high R²?
A high R² means that much of the variance of y is explained by the regressors, i.e. the model fits the data well.
How do you calculate R²?
R² = ESS/TSS = 1 − RSS/TSS
What is a problem with R²?
R² always increases when further regressors are added to the model, even if these regressors have no additional explanatory power.
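A quick numpy demonstration with simulated data: adding pure noise regressors still raises R².

```python
import numpy as np

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return 1 - e @ e / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=n)
y = x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(size=(n, 3))              # regressors with no true effect

r2_small = r2(X, y)
r2_big = r2(np.column_stack([X, noise]), y)
print(round(r2_small, 3), round(r2_big, 3))  # R² rises although nothing was gained
```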
How do you calculate the adjusted R²?
adj. R² = 1 − (1 − R²) · (N − 1)/(N − K)
How do you calculate the mean squared error (MSE)?
MSE = (1/N) Σᵢ êᵢ² = RSS/N (in-sample; some texts divide by N − K instead)
How can you prove the equation below?
What is the definition of the Mean Squared Error of Prediction (MSEP) for model m?
Expected squared deviation of the true value from the predicted value for a new observation:
MSEP(m) = E[(y₀ − ŷ₀⁽ᵐ⁾)²]
Which three parts does the MSEP consist of?
The irreducible error variance, the squared bias of the prediction, and the variance of the prediction:
MSEP = σ² + Bias(ŷ₀)² + Var(ŷ₀)
What is the Bias-Variance-Tradeoff, especially w.r.t. the MSEP?
- More flexible models (with more covariates, polynomials of higher order, interaction terms…) reflect the variation in the data better -> lower bias.
- But more flexible models also have a higher variance; and they also are more sensitive to variation in the data that is only random (especially if the data set is small). -> higher variance, danger of “overfitting”.
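A numpy sketch of the tradeoff with simulated data: fitting polynomials of increasing degree to a noisy sine, in-sample error always falls with flexibility, while error on fresh data reflects the variance/overfitting side.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 40
x = rng.uniform(-2, 2, size=N)
y = np.sin(x) + 0.3 * rng.normal(size=N)
x_new = rng.uniform(-2, 2, size=200)          # fresh draws to mimic prediction
y_new = np.sin(x_new) + 0.3 * rng.normal(size=200)

train_err, test_err = {}, {}
for deg in (1, 3, 9):
    coefs = np.polyfit(x, y, deg)             # more flexible = higher degree
    train_err[deg] = np.mean((y - np.polyval(coefs, x)) ** 2)
    test_err[deg] = np.mean((y_new - np.polyval(coefs, x_new)) ** 2)
    print(deg, round(train_err[deg], 3), round(test_err[deg], 3))
```

The in-sample error is guaranteed to shrink as the degree grows (nested models); the out-of-sample error is what the MSEP targets.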
How can the MSEP be estimated?
A common estimator (a Mallows’ Cp-type penalty on the in-sample MSE) is
MSEP^(m) = (1/N) Σᵢ êᵢ² + 2 (|m|/N) σ̂², where
- |m| is the number of parameters in the model,
- σ̂² is a consistent (and, if possible, model-independent) estimator of the error variance Var(ϵ) = σ².
For R², MSEP, AIC, BIC and GCV, which model do you choose?
- R²: highest value
- MSEP, AIC, BIC, GCV: lowest value
How do you calculate the AIC?
One common form for regression models (conventions differ by constants across textbooks and software): AIC(m) = log(σ̂²ₘ) + 2 |m|/N
How do you calculate the BIC?
Analogously: BIC(m) = log(σ̂²ₘ) + |m| log(N)/N, i.e. the AIC penalty factor 2 replaced by log(N), which penalizes large models more heavily.
How do you calculate the GCV?
GCV(m) = (1/N) Σᵢ (êᵢ/(1 − |m|/N))² = N · RSS/(N − |m|)²
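A numpy sketch of criterion-based model choice on simulated data, using one common parameterization of AIC, BIC and GCV (the exact constants vary across texts): the true small model beats a model padded with noise regressors.

```python
import numpy as np

def criteria(X, y):
    """AIC, BIC, GCV in one common parameterization (constants vary by text)."""
    N, m = X.shape                            # m = number of parameters |m|
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    sigma2 = e @ e / N                        # in-sample error variance estimate
    aic = np.log(sigma2) + 2 * m / N
    bic = np.log(sigma2) + m * np.log(N) / N
    gcv = (e @ e) * N / (N - m) ** 2
    return aic, bic, gcv

rng = np.random.default_rng(7)
N = 100
x = rng.normal(size=N)
y = 1 + x + rng.normal(size=N)
X_small = np.column_stack([np.ones(N), x])                    # true model
X_big = np.column_stack([X_small, rng.normal(size=(N, 5))])   # + 5 noise regressors

res = {"small": criteria(X_small, y), "big": criteria(X_big, y)}
for name, (aic, bic, gcv) in res.items():
    print(name, round(aic, 3), round(bic, 3), round(gcv, 3))
```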
How can you use a Validation Sample for model selection?
“Split sampling”, i.e. randomly split the sample into two halves. Estimate the model on the first subsample, then check the model’s predictive performance on the other subsample.
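A minimal numpy sketch with simulated data: fit on one random half, evaluate the prediction error on the held-out half.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 200
x = rng.normal(size=N)
y = 1 + 2 * x + rng.normal(size=N)

idx = rng.permutation(N)                   # random split into two halves
train, val = idx[: N // 2], idx[N // 2 :]

X = np.column_stack([np.ones(N), x])
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
mse_val = np.mean((y[val] - X[val] @ beta) ** 2)
print(round(mse_val, 2))                   # estimates out-of-sample prediction error
```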
What are potential Variable Selection Procedures?
- Backward Elimination
- Forward Selection
- Stepwise Procedures
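A greedy backward-elimination sketch in numpy, using BIC as the drop criterion (an assumption for illustration; courses often use p-values instead). Starting from all regressors, it repeatedly drops a variable whenever doing so improves the criterion.

```python
import numpy as np

def bic(X, y):
    N, m = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return np.log(e @ e / N) + m * np.log(N) / N

rng = np.random.default_rng(9)
N = 150
X = rng.normal(size=(N, 5))
y = 1 + 2 * X[:, 0] + X[:, 1] + rng.normal(size=N)   # only columns 0 and 1 matter

active = list(range(5))
current = bic(np.column_stack([np.ones(N), X[:, active]]), y)
improved = True
while improved and len(active) > 1:
    improved = False
    for j in list(active):
        rest = [k for k in active if k != j]
        score = bic(np.column_stack([np.ones(N), X[:, rest]]), y)
        if score < current:                  # dropping j improves the criterion
            active, current, improved = rest, score, True
            break                            # greedy: restart after each drop
print(sorted(active))                        # the relevant regressors survive
```

Forward selection and stepwise procedures work analogously, adding (or both adding and dropping) variables instead.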
How do you interpret the coefficient for “female”?
It is the expected wage difference between females and males with 0 experience, keeping all other variables constant.