Tutorial 3 - Model Misspecification, Model Choice, Model Diagnostics, Multicollinearity Flashcards

1
Q

What are possible ways to model nonlinear effects?

A
  • Log-transformation of y and/or x
  • Higher-order polynomials in x (quadratic, cubic,…)
  • Semi- or nonparametric regression (not covered here)
  • Nonlinear regression models (not covered here)
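
A minimal sketch of the first two options in Python (statsmodels), using simulated wage/experience data; all variable names are hypothetical, not from the tutorial:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated data: wage depends nonlinearly on experience
    rng = np.random.default_rng(0)
    exper = rng.uniform(0, 40, 500)
    wage = np.exp(1.5 + 0.08 * exper - 0.001 * exper**2
                  + rng.normal(0, 0.3, 500))
    df = pd.DataFrame({"wage": wage, "exper": exper})

    # Log-transformation of y: slope is roughly the % effect of one more year
    log_fit = smf.ols("np.log(wage) ~ exper", data=df).fit()

    # Quadratic polynomial in x: allows diminishing returns to experience
    poly_fit = smf.ols("wage ~ exper + I(exper**2)", data=df).fit()
    print(poly_fit.params)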
2
Q

What is the calculation to get the returns to experience in this regression?

A
3
Q

How to calculate the gender wage gap in this case?

A
4
Q

What is multicollinearity?

A

Perfect multicollinearity: one regressor can be expressed as a perfect linear combination of one or several other regressors.

5
Q

What does multicollinearity mean mathematically?

A

This means that the N × K regressor matrix X does not have full column rank K.
-> X′X is singular (not invertible), thus the OLS estimator β̂ = (X′X)⁻¹X′y is not identifiable.

6
Q

What is the problem with this regression and how to solve it?

A

Perfect multicollinearity. One variable will drop out!
-> Easy to detect and solve: leave one category out as the reference category

7
Q

What are the consequences of “imperfect multicollinearity”?

A
  • OLS coefficients still unbiased, but more unstable, i.e. the coefficients might be very different if another sample were used (see the sketch below).
  • Standard errors also unbiased, but they can be large. Thus, regressors may be individually insignificant, even if they are jointly significant.
  • Multicollinearity is not a problem if the aim is to predict y (rather than to estimate the effect of a single regressor on y).
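
A small simulation sketch of the first point (illustrative only): with two nearly collinear regressors, the individual coefficients jump around from sample to sample while their sum stays stable:

    import numpy as np

    rng = np.random.default_rng(0)
    for _ in range(3):  # three independent samples
        n = 100
        x1 = rng.normal(size=n)
        x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
        y = 1 + x1 + x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        print(f"b1={b[1]:6.2f}  b2={b[2]:6.2f}  b1+b2={b[1] + b[2]:5.2f}")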
8
Q

What is the Variance inflation factor?

A

A measure of the degree of multicollinearity between one regressor and the remaining regressors.

9
Q

How do you calculate the Variance inflation factor?

A

For variable xⱼ, the VIF is defined as VIFⱼ = 1 / (1 − Rⱼ²), where Rⱼ² denotes the R² from a regression of variable xⱼ on all other covariates xₖ (k ≠ j).
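
A sketch of this auxiliary-regression idea using statsmodels' built-in helper, on simulated, nearly collinear covariates (names hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.1, size=200)  # strongly correlated with x1
    X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

    # VIF_j = 1 / (1 - R_j^2), computed per column of the design matrix
    for j, name in enumerate(X.columns):
        print(name, variance_inflation_factor(X.values, j))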

10
Q

What is the idea and interpretation of the variance inflation factor (VIF)?

A
  • Idea: strong linear dependence between xⱼ and all other covariates results in a high Rⱼ² -> high VIFⱼ.
  • This is called the variance inflation factor because the higher the dependence between xⱼ and the other covariates, the higher is Var(β̂ⱼ).
  • Rule of thumb:
    • VIFⱼ > 10 implies serious multicollinearity,
    • VIFⱼ = 1 would mean that xⱼ has zero correlation with the other regressors
11
Q

What other way (except VIF) is there to check for multicollinearity?

A

Test whether regressors that are individually insignificant are nevertheless jointly significant -> F-test for joint significance.
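
A sketch on simulated data: two nearly collinear regressors that are individually insignificant but jointly highly significant (variable names hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
    y = 1 + x1 + x2 + rng.normal(size=n)
    df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

    fit = smf.ols("y ~ x1 + x2", data=df).fit()
    print(fit.pvalues[["x1", "x2"]])     # often individually insignificant
    print(fit.f_test("x1 = 0, x2 = 0"))  # but jointly highly significant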

12
Q

How to detect outliers/influential observations?

A
  • Residual analysis: plot standardized/studentized residuals against fitted values
  • Cook’s distance as a measure of influence for observation i
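
A sketch of both diagnostics with statsmodels, on simulated data with one planted high-leverage outlier (illustrative only):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    x[0] = 4.0                       # high-leverage point ...
    y = 1 + 2 * x + rng.normal(size=100)
    y[0] += 8                        # ... that also lies far off the line
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    infl = fit.get_influence()
    stud = infl.resid_studentized_external  # for a residuals-vs-fitted plot
    cooks = infl.cooks_distance[0]          # Cook's distance per observation
    print(np.where(cooks > 4 / len(y))[0])  # rule-of-thumb cutoff (next cards)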
13
Q

How do you calculate Cook’s distance?

A

Cᵢ = Σⱼ₌₁ᴺ (ŷⱼ − ŷⱼ₍ᵢ₎)² / (K·σ̂²), where ŷⱼ₍ᵢ₎ is the fitted value for observation j when the model is re-estimated without observation i, and K is the number of estimated parameters.
14
Q

How do you interpret the Cook’s distance?

A

Rule of thumb: Cᵢ > 4/N -> large influence!

15
Q

What is the total sum of squares (“Total variance”)?

A

TSS = Σᵢ₌₁ᴺ (yᵢ − ȳ)², the total variation of y around its mean.
16
Q

What is the residual sum of squares (“Unexplained variance”)?

A

RSS = Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)² = Σᵢ₌₁ᴺ ε̂ᵢ², the variation left unexplained by the fitted values.
17
Q

What is the explained sum of squares (“Explained variance”)?

A

ESS = Σᵢ₌₁ᴺ (ŷᵢ − ȳ)², the variation of the fitted values around the mean, with TSS = ESS + RSS.
18
Q

How do you interpret a high R²?

A

A high R² means that much of the variance of y is explained by the regressors, i.e. the model fits the data well.

19
Q

How do you calculate R²?

A

R² = ESS/TSS = 1 − RSS/TSS
20
Q

What is a problem with R²?

A

R² always increases if further regressors are added to the model, even if these regressors have no additional explanatory power.
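
A quick simulation of this point: pure-noise regressors are appended one by one, and R² never falls (while the adjusted R² of the next cards can):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 50
    x = rng.normal(size=n)
    y = 1 + x + rng.normal(size=n)
    X = sm.add_constant(x)
    for _ in range(3):
        fit = sm.OLS(y, X).fit()
        print(X.shape[1] - 1, "regressors:",
              round(fit.rsquared, 4), round(fit.rsquared_adj, 4))
        X = np.column_stack([X, rng.normal(size=n)])  # append pure noise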

21
Q

How do you calculate the adjusted R²?

A

Adjusted R² = 1 − (1 − R²)·(N − 1)/(N − K), where K is the number of estimated parameters (including the intercept); it penalizes the addition of regressors.
22
Q

How do you calculate the mean squared error (MSE)?

A

MSE(θ̂) = E[(θ̂ − θ)²], the expected squared deviation of an estimator from the true parameter.
23
Q

How can you prove the equation below?

A
24
Q

What is the definition of the Mean Squared Error of Prediction (MSEP) for model m?

A

expected squared deviation of true values from predicted values:

MSEP(m) = E[(y − ŷ(m))²]

25
Q

Which three parts does the MSEP consist of?

A

The irreducible error variance σ², the squared bias of the prediction, and the variance of the prediction (the bias-variance decomposition).
26
Q

What is the Bias-Variance-Tradeoff, especially w.r.t. the MSEP?

A
  • More flexible models (with more covariates, polynomials of higher order, interaction terms…) reflect the variation in the data better -> lower bias.
  • But more flexible models are also more sensitive to variation in the data that is purely random (especially if the data set is small) -> higher variance, danger of “overfitting”.
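
An illustrative sketch on simulated data: polynomials of rising degree drive the in-sample MSE down monotonically, while the out-of-sample MSE is U-shaped:

    import numpy as np

    rng = np.random.default_rng(0)
    def draw(n):
        x = rng.uniform(-2, 2, n)
        return x, np.sin(x) + rng.normal(scale=0.3, size=n)

    x_tr, y_tr = draw(50)     # small estimation sample
    x_te, y_te = draw(1000)   # large evaluation sample
    for degree in (1, 3, 10):
        c = np.polyfit(x_tr, y_tr, degree)
        mse_in = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
        mse_out = np.mean((np.polyval(c, x_te) - y_te) ** 2)
        print(degree, round(mse_in, 3), round(mse_out, 3))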
27
Q

How can the MSEP be estimated?

A
  • |m| is the number of parameters in the model,
  • σ̂² is a consistent (and, if possible, model-independent) estimator of the error variance Var(ϵ) = σ².
28
Q

For R², MSEP, AIC, BIC and GCV, which model do you choose?

A
  • R²: highest value
  • MSEP, AIC, BIC, GCV: lowest value
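
With statsmodels, R² and the information criteria can be read directly off a fitted result; a minimal sketch on simulated data (illustrative, not from the tutorial):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 1 + 2 * x + rng.normal(size=100)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    # choose the model with the highest R2 / lowest AIC and BIC
    print(fit.rsquared, fit.aic, fit.bic)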
29
Q

How do you calculate the AIC?

A

AIC = −2·log L̂ + 2·|m|, where log L̂ is the maximized log-likelihood and |m| the number of parameters.
30
Q

How do you calculate the BIC?

A

BIC = −2·log L̂ + log(N)·|m|; since the penalty per parameter grows with N, the BIC favors smaller models than the AIC.
31
Q

How do you calculate the GCV?

A

GCV(m) = (1/N)·Σᵢ ε̂ᵢ² / (1 − |m|/N)²
32
Q

How can you use a Validation Sample for model selection?

A

“Split sampling”: randomly split the sample into two halves, estimate the model on the first subsample, then check the model’s performance on the other subsample.
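
A minimal sketch of split sampling in plain numpy, on simulated data (the 50/50 split follows the card):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)

    idx = rng.permutation(n)                     # random split into two halves
    tr, te = idx[: n // 2], idx[n // 2 :]
    X_tr = np.column_stack([np.ones(tr.size), x[tr]])
    b = np.linalg.lstsq(X_tr, y[tr], rcond=None)[0]  # estimate on first half

    X_te = np.column_stack([np.ones(te.size), x[te]])
    print(np.mean((y[te] - X_te @ b) ** 2))      # validation MSE on second half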

33
Q

What are potential Variable Selection Procedures?

A
  • Backward Elimination
  • Forward Selection
  • Stepwise Procedures
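
A sketch of the first procedure, backward elimination based on p-values, on simulated data (the 0.05 threshold is an assumption, not from the card):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 4)),
                      columns=["x1", "x2", "x3", "x4"])
    df["y"] = 1 + 2 * df["x1"] + df["x2"] + rng.normal(size=200)

    kept = ["x1", "x2", "x3", "x4"]
    while kept:
        fit = smf.ols("y ~ " + " + ".join(kept), data=df).fit()
        pvals = fit.pvalues.drop("Intercept")
        if pvals.max() < 0.05:       # all remaining regressors significant
            break
        kept.remove(pvals.idxmax())  # drop the least significant one
    print(kept)                      # expected: ["x1", "x2"]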
34
Q

How do you interpret the coefficient for “female”?

A

It is the expected wage difference between females and males with zero experience, holding all other variables constant.