Tutorial 3 - Model Misspecification, Model Choice, Model Diagnostics, Multicollinearity Flashcards
What are possible ways to model nonlinear effects?
- Log-transformation of y and/or x
- Higher-order polynomials in x (quadratic, cubic,…)
- Semi- or nonparametric regression (not covered here)
- Nonlinear regression models (not covered here)
How do you calculate the returns to experience in this regression?
How do you calculate the gender wage gap in this case?
What is multicollinearity?
Perfect multicollinearity: one regressor can be expressed as a perfect linear combination of one or several other regressors.
What does multicollinearity mean mathematically?
This means that the N × K regressor matrix X does not have full column rank K.
-> X’X is singular (not invertible), so the OLS estimator β^ = (X’X)⁻¹X’y cannot be computed (the coefficients are not identified).
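A minimal numpy sketch with made-up data: a regressor that is an exact linear function of the intercept and another regressor leaves X without full column rank, so X’X is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1 - 3                              # exact linear function of x1 (and the constant)
X = np.column_stack([np.ones(n), x1, x2])    # N x K regressor matrix, K = 3

print(np.linalg.matrix_rank(X))              # 2 < K: no full column rank
print(np.linalg.matrix_rank(X.T @ X))        # also 2: X'X singular, (X'X)^-1 does not exist
```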
What is the problem with this regression and how to solve it?
Perfect multicollinearity. One variable will drop out!
-> Easy to detect and solve: leave one category out as the reference category
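This is the classic dummy variable trap. A small numpy sketch (invented categorical data): with an intercept, dummies for all categories sum to the constant column; dropping one category as the reference restores full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)
group = rng.integers(0, 3, size=60)                      # 3 categories
dummies = np.eye(3)[group]                               # one dummy per category
X_all = np.column_stack([np.ones(60), dummies])          # intercept + all 3 dummies
X_ref = np.column_stack([np.ones(60), dummies[:, 1:]])   # category 0 = reference

print(np.linalg.matrix_rank(X_all))  # 3 < 4: dummies sum to the intercept column
print(np.linalg.matrix_rank(X_ref))  # 3 = full column rank: OLS identified
```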
What are the consequences of “imperfect multicollinearity”?
- OLS coefficients are still unbiased, but unstable (i.e. the coefficients might be very different if another sample were used).
- Standard errors remain valid, but they can be large. Thus, regressors may be individually insignificant, even if they are jointly significant.
- Multicollinearity is not a problem if the aim is to predict y (rather than to estimate the effect of a single regressor on y).
What is the Variance inflation factor?
A measure of the severity of (imperfect) multicollinearity for an individual regressor xⱼ.
How do you calculate the Variance inflation factor?
For variable xⱼ, the VIF is defined as VIFⱼ = 1/(1 − Rⱼ²), where Rⱼ² denotes the R² from a regression of variable xⱼ on all other covariates xₖ (k ≠ j).
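A minimal numpy sketch of the calculation (made-up data): regress each regressor on the others, take that R², and invert 1 − R².

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress x_j on an intercept plus all other columns."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated to the others
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 far above 10; x3 close to 1
```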
What is the idea and interpretation of the variance inflation factor (VIF)?
- Idea: strong linear dependence between xⱼ and all other covariates results in a high Rⱼ² -> high VIFⱼ.
- This is called the variance inflation factor because the higher the dependence between xⱼ and the other covariates, the higher is Var( β^ⱼ ).
- Rule of thumb:
- VIFⱼ > 10 implies serious multicollinearity,
- VIFⱼ = 1 would mean that xⱼ has zero correlation with the other regressors
What other way (except VIF) is there to check for multicollinearity?
Test whether regressors are jointly significant -> F-test for joint significance.
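A numpy sketch of the idea with simulated data: two nearly collinear regressors can have large standard errors individually, yet the F-statistic comparing the restricted model (both coefficients zero) to the full model is clearly significant.

```python
import numpy as np

def ols_rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return e @ e

rng = np.random.default_rng(3)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 0.05 * rng.normal(size=N)     # x1 and x2 nearly collinear
y = x1 + x2 + rng.normal(size=N)

X_full = np.column_stack([np.ones(N), x1, x2])
X_rest = np.ones((N, 1))                # restriction: beta1 = beta2 = 0
rss_f, rss_r = ols_rss(X_full, y), ols_rss(X_rest, y)

q, df = 2, N - 3                        # q restrictions, N - K residual df
F = ((rss_r - rss_f) / q) / (rss_f / df)
print(round(F, 1))                      # large F -> jointly significant
```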
How to detect outliers/influential observations?
- Residual analysis: plot standardized/studentized residuals against fitted values
- Cook’s distance as a measure of influence for observation i
How do you calculate Cook’s distance?
Cᵢ = (êᵢ² / (K σ̂²)) · hᵢᵢ/(1 − hᵢᵢ)², where hᵢᵢ is the leverage of observation i (i-th diagonal element of the hat matrix H = X(X’X)⁻¹X’) and K the number of parameters. Equivalently, Cᵢ measures the scaled change in all fitted values when observation i is left out.
How do you interpret the Cook’s distance?
Rule of thumb: Cᵢ > 4/N -> large influence!
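A numpy sketch with simulated data and one planted influential point, computing Cᵢ from the residuals and leverages of the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x = rng.normal(size=N)
y = 1 + 2 * x + rng.normal(size=N)
x[0], y[0] = 6.0, -10.0                  # plant one influential outlier

X = np.column_stack([np.ones(N), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverages h_ii
e = y - H @ y                            # residuals
K = X.shape[1]
s2 = e @ e / (N - K)
cook = e**2 / (K * s2) * h / (1 - h) ** 2

print(int(np.argmax(cook)))              # index 0: the planted outlier
print(np.sum(cook > 4 / N))              # observations flagged by the rule of thumb
```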
What is the total sum of squares (“Total variance”)?
TSS = Σᵢ (yᵢ − ȳ)²
What is the residual sum of squares (“Unexplained variance”)?
RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ êᵢ²
What is the explained sum of squares (“Explained variance”)?
ESS = Σᵢ (ŷᵢ − ȳ)²
How do you interpret a high R²?
A high R² means that much of the variance of y is explained by the regressors, i.e. the model fits the data well.
How do you calculate R²?
R² = ESS/TSS = 1 − RSS/TSS
What is a problem with R²?
R² always increases when further regressors are added to the model, even if these regressors have no additional explanatory power.
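A quick numpy demonstration with simulated data: adding pure noise regressors still raises R².

```python
import numpy as np

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return 1 - e @ e / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=n)
y = x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(size=(n, 3))              # regressors with no true effect

r2_small = r2(X, y)
r2_big = r2(np.column_stack([X, noise]), y)
print(round(r2_small, 3), round(r2_big, 3))  # R² rises although nothing was gained
```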
How do you calculate the adjusted R²?
adj. R² = 1 − (1 − R²) · (N − 1)/(N − K)
How do you calculate the mean squared error (MSE)?
MSE = (1/N) Σᵢ êᵢ² = RSS/N (in-sample; some texts divide by N − K instead)
How can you prove the equation below?
What is the definition of the Mean Squared Error of Prediction (MSEP) for model m?
Expected squared deviation of the true value from the predicted value for a new observation:
MSEP(m) = E[(y₀ − ŷ₀⁽ᵐ⁾)²]
Which three parts does the MSEP consist of?
The irreducible error variance, the squared bias of the prediction, and the variance of the prediction:
MSEP = σ² + Bias(ŷ₀)² + Var(ŷ₀)
What is the Bias-Variance-Tradeoff, especially w.r.t. the MSEP?
- More flexible models (with more covariates, polynomials of higher order, interaction terms…) reflect the variation in the data better -> lower bias.
- But more flexible models also have a higher variance; and they also are more sensitive to variation in the data that is only random (especially if the data set is small). -> higher variance, danger of “overfitting”.
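A numpy sketch of the tradeoff with simulated data: fitting polynomials of increasing degree to a noisy sine, in-sample error always falls with flexibility, while error on fresh data reflects the variance/overfitting side.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 40
x = rng.uniform(-2, 2, size=N)
y = np.sin(x) + 0.3 * rng.normal(size=N)
x_new = rng.uniform(-2, 2, size=200)          # fresh draws to mimic prediction
y_new = np.sin(x_new) + 0.3 * rng.normal(size=200)

train_err, test_err = {}, {}
for deg in (1, 3, 9):
    coefs = np.polyfit(x, y, deg)             # more flexible = higher degree
    train_err[deg] = np.mean((y - np.polyval(coefs, x)) ** 2)
    test_err[deg] = np.mean((y_new - np.polyval(coefs, x_new)) ** 2)
    print(deg, round(train_err[deg], 3), round(test_err[deg], 3))
```

The in-sample error is guaranteed to shrink as the degree grows (nested models); the out-of-sample error is what the MSEP targets.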
How can the MSEP be estimated?
A common estimator (a Mallows’ Cp-type penalty on the in-sample MSE) is
MSEP^(m) = (1/N) Σᵢ êᵢ² + 2 (|m|/N) σ̂², where
- |m| is the number of parameters in the model,
- σ̂² is a consistent (and, if possible, model-independent) estimator of the error variance Var(ϵ) = σ².
For R², MSEP, AIC, BIC and GCV, which model do you choose?
- R²: highest value
- MSEP, AIC, BIC, GCV: lowest value
How do you calculate the AIC?
One common form for regression models (conventions differ by constants across textbooks and software): AIC(m) = log(σ̂²ₘ) + 2 |m|/N
How do you calculate the BIC?
Analogously: BIC(m) = log(σ̂²ₘ) + |m| log(N)/N, i.e. the AIC penalty factor 2 replaced by log(N), which penalizes large models more heavily.
How do you calculate the GCV?
GCV(m) = (1/N) Σᵢ (êᵢ/(1 − |m|/N))² = N · RSS/(N − |m|)²
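A numpy sketch of criterion-based model choice on simulated data, using one common parameterization of AIC, BIC and GCV (the exact constants vary across texts): the true small model beats a model padded with noise regressors.

```python
import numpy as np

def criteria(X, y):
    """AIC, BIC, GCV in one common parameterization (constants vary by text)."""
    N, m = X.shape                            # m = number of parameters |m|
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    sigma2 = e @ e / N                        # in-sample error variance estimate
    aic = np.log(sigma2) + 2 * m / N
    bic = np.log(sigma2) + m * np.log(N) / N
    gcv = (e @ e) * N / (N - m) ** 2
    return aic, bic, gcv

rng = np.random.default_rng(7)
N = 100
x = rng.normal(size=N)
y = 1 + x + rng.normal(size=N)
X_small = np.column_stack([np.ones(N), x])                    # true model
X_big = np.column_stack([X_small, rng.normal(size=(N, 5))])   # + 5 noise regressors

res = {"small": criteria(X_small, y), "big": criteria(X_big, y)}
for name, (aic, bic, gcv) in res.items():
    print(name, round(aic, 3), round(bic, 3), round(gcv, 3))
```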
How can you use a Validation Sample for model selection?
“Split sampling”, i.e. randomly split the sample into two halves. Estimate the model on the first subsample, then check the model’s predictive performance on the other subsample.
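A minimal numpy sketch with simulated data: fit on one random half, evaluate the prediction error on the held-out half.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 200
x = rng.normal(size=N)
y = 1 + 2 * x + rng.normal(size=N)

idx = rng.permutation(N)                   # random split into two halves
train, val = idx[: N // 2], idx[N // 2 :]

X = np.column_stack([np.ones(N), x])
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
mse_val = np.mean((y[val] - X[val] @ beta) ** 2)
print(round(mse_val, 2))                   # estimates out-of-sample prediction error
```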
What are potential Variable Selection Procedures?
- Backward Elimination
- Forward Selection
- Stepwise Procedures
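A greedy backward-elimination sketch in numpy, using BIC as the drop criterion (an assumption for illustration; courses often use p-values instead). Starting from all regressors, it repeatedly drops a variable whenever doing so improves the criterion.

```python
import numpy as np

def bic(X, y):
    N, m = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return np.log(e @ e / N) + m * np.log(N) / N

rng = np.random.default_rng(9)
N = 150
X = rng.normal(size=(N, 5))
y = 1 + 2 * X[:, 0] + X[:, 1] + rng.normal(size=N)   # only columns 0 and 1 matter

active = list(range(5))
current = bic(np.column_stack([np.ones(N), X[:, active]]), y)
improved = True
while improved and len(active) > 1:
    improved = False
    for j in list(active):
        rest = [k for k in active if k != j]
        score = bic(np.column_stack([np.ones(N), X[:, rest]]), y)
        if score < current:                  # dropping j improves the criterion
            active, current, improved = rest, score, True
            break                            # greedy: restart after each drop
print(sorted(active))                        # the relevant regressors survive
```

Forward selection and stepwise procedures work analogously, adding (or both adding and dropping) variables instead.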
How do you interpret the coefficient for “female”?
It is the expected wage difference between females and males with 0 experience, keeping all other variables constant.