Misspecification and Model Selection Flashcards
3 types of misspecification
Omitted relevant variables (underfitting)
Inclusion of irrelevant variables (overfitting)
Incorrect functional form
Omitting relevant variables:
In a misspecified model, what happens to our estimate of β₁?
We get β~₁, which equals β₁ plus a bias term
What is the final expectation of β~₁?
E(β~₁) = β₁+β₂ x Cov(Xi,Zi)/Var(Xi) ≠ β₁
Zi is the relevant variable that has been omitted.
E(β~₁) ≠ β₁ because Zi was omitted: the model is misspecified and OLS is biased!
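A sketch of where this expectation comes from, assuming the true model is Yi = β₀ + β₁ Xi + β₂ Zi + εi with E(εi | X, Z) = 0, and that β~₁ is the OLS slope from regressing Yi on Xi alone. The hatted Cov and Var are sample moments, which is an assumption about how the formula on the card should be read:

```latex
\begin{aligned}
\tilde\beta_1
  &= \frac{\sum_i (X_i-\bar X)\,Y_i}{\sum_i (X_i-\bar X)^2} \\
  &= \beta_1
     + \beta_2\,\frac{\sum_i (X_i-\bar X)(Z_i-\bar Z)}{\sum_i (X_i-\bar X)^2}
     + \frac{\sum_i (X_i-\bar X)\,\varepsilon_i}{\sum_i (X_i-\bar X)^2} \\
E(\tilde\beta_1 \mid X, Z)
  &= \beta_1 + \beta_2\,\frac{\widehat{\mathrm{Cov}}(X_i,Z_i)}{\widehat{\mathrm{Var}}(X_i)}
\end{aligned}
```

The last term in the second line has expectation zero, so only the β₂ term survives as bias.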
So OLS is biased in this underspecified model:
Unless... (2)
Cov(Xi,Zi) = 0 (Zi is unrelated to Xi)
β₂ = 0, i.e. Zi is not actually relevant!
How do we know the sign of the bias, i.e. whether we are overestimating or underestimating β₁?
If Cov(Xi,Zi) and β₂ have the same sign (both > 0 or both < 0), the bias is positive: we overestimate β₁.
If they have opposite signs (Cov > 0 but β₂ < 0, or Cov < 0 but β₂ > 0), the bias is negative: we underestimate β₁.
(And of course if either equals 0 there is no bias, as mentioned in the previous FC.)
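The sign rule can be checked with a small simulation. This is a minimal sketch, not part of the original deck: the data-generating process (standard normal X, Z = rho*X + noise, β₀ = 1, β₁ = 2) and the function names are made up for illustration.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n; the n's cancel in the ratios below)."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def simulate(beta1, beta2, rho, n=20000, seed=0):
    """Hypothetical DGP: Y = 1 + beta1*X + beta2*Z + eps, with Z = rho*X + noise.

    Returns the slope from the short regression of Y on X alone, and the
    value predicted by beta1 + beta2 * Cov(X,Z) / Var(X).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [rho * xi + rng.gauss(0, 1) for xi in x]
    y = [1 + beta1 * xi + beta2 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    short_slope = cov(x, y) / cov(x, x)          # Z omitted
    predicted = beta1 + beta2 * cov(x, z) / cov(x, x)
    return short_slope, predicted

# Same signs, opposite signs, beta2 = 0, and Cov = 0 (true beta1 is 2.0).
for beta2, rho in [(1.5, 0.8), (1.5, -0.8), (0.0, 0.8), (1.5, 0.0)]:
    slope, predicted = simulate(beta1=2.0, beta2=beta2, rho=rho)
    print(f"beta2={beta2:+.1f} rho={rho:+.1f} short slope={slope:.3f} formula={predicted:.3f}")
```

In the first two cases the short-regression slope should land above and below 2.0 respectively; in the last two it should sit close to 2.0.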
Suppose we estimate the effect of an extra year of education on wages.
True model is
ln(wagei) = β₀+ β₁ Educi + β₂ Abilityi + εi
But of course ability is hard to measure. So we omit it.
What is E(β~₁)? (use the formula)
E(β~₁) = β₁ + β₂ x Cov(Educ,Ability)/Var(Educ)
Using this, is our estimate of β₁ positively biased (overestimated) or negatively biased (underestimated)?
Since we expect Educ and Ability to be positively correlated (Cov(Educ,Ability) > 0), and β₂ > 0...
E(β~₁) = β₁ + β₂ x Cov(Educ,Ability)/Var(Educ) > β₁
Positively biased! Overestimated
So β~₁ is likely upward biased (the estimated return to education is bigger than the true value).
How can we proceed from here? (3)
Measure ability (HARD!)
Experiment: give a random amount of education to people
Quasi-experiments, i.e. replicate the experiment in natural settings
Detecting an omitted relevant variable
Consider the true model again
Yi = β₀ + β₁ Xi + β₂ Zi + εi
But we estimate the misspecified model
Yi = β₀ + β₁ Xi + vi
How to test? Natural suggestion would be to specify
vi = γ₀ + γ₁ Zi + εi (Z is contained in the error term v)
And test whether γ₁ = 0 (if not, Z is relevant!)
Problems with this suggestion (2)
vi is not observed.
Eval: take the residuals from the misspecified model as an estimate of vi (v^i)
We don't know what Z is (otherwise we would have included it in our model, i.e. estimated the true model)
So there is no good test for omitted relevant variables. Instead, use economic theory and intuition to think about which variables may be missing and what bias they would bring to the parameters.
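The residual idea can be sketched in code for the artificial case where Z happens to be observed. Everything here is an illustrative assumption (the data-generating process, the coefficient values, and the function names); in practice the second problem above still applies because Z is unknown.

```python
import random

def ols(x, y):
    """Simple OLS of y on x with an intercept; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def residual_slope_on_z(beta2, n=20000, seed=1):
    """Estimate the misspecified model, then regress its residuals on Z.

    Hypothetical DGP: Y = 1 + 2*X + beta2*Z + eps, with Z = 0.8*X + noise.
    Returns the estimated gamma_1 from v_hat = gamma_0 + gamma_1 * Z + error.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    y = [1 + 2 * xi + beta2 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    b0, b1 = ols(x, y)                                   # misspecified model (Z omitted)
    v_hat = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]  # residuals estimate v_i
    _, gamma1 = ols(z, v_hat)
    return gamma1

print("Z relevant   (beta2 = 1.5):", round(residual_slope_on_z(1.5), 3))
print("Z irrelevant (beta2 = 0.0):", round(residual_slope_on_z(0.0), 3))
```

When X and Z are correlated, the estimated γ₁ is not an estimate of β₂ itself, because the short regression has already let X absorb part of Z's effect. It only signals whether something is left over.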
2nd misspecification: Including irrelevant variables
Consider model:
Yi = β₀ + β₁ Xi + β₂ Ii + εi
Where I is irrelevant. What does this mean for the coefficient β₂?
The true population coefficient β₂ is 0
So I is irrelevant, and we should get β^₂ close to 0 (E(β^₂) = 0)
What happens for our estimate of β₁ , and why?
Nothing - still unbiased.
The true model is just a restricted version of the estimated model (with β₂ = 0), so including I does not bias the other coefficients.
So under classical assumptions: what does this mean for E(β^j)?
E(β^j) = βj for all j
e.g.
E(β^₀) = β₀, E(β^₁) = β₁, E(β^₂) = β₂ = 0
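A Monte Carlo sketch of this unbiasedness result, under assumed values (β₀ = 1, β₁ = 2, β₂ = 0) and a made-up irrelevant regressor that is correlated with X. The helper names are hypothetical.

```python
import random

def csum(a, b):
    """Sum of cross-products of deviations from the mean."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b))

def two_regressor_slopes(x, w, y):
    """Closed-form OLS slopes for y on x and w (with an intercept)."""
    sxx, sww, sxw = csum(x, x), csum(w, w), csum(x, w)
    sxy, swy = csum(x, y), csum(w, y)
    det = sxx * sww - sxw ** 2
    b1 = (sww * sxy - sxw * swy) / det
    b2 = (sxx * swy - sxw * sxy) / det
    return b1, b2

def monte_carlo(reps=500, n=200, seed=2):
    """Average the slope estimates over many samples where I is irrelevant."""
    rng = random.Random(seed)
    b1s, b2s = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        irr = [0.6 * xi + rng.gauss(0, 1) for xi in x]  # correlated with X, but beta2 = 0
        y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]  # true model does not contain I
        b1, b2 = two_regressor_slopes(x, irr, y)
        b1s.append(b1)
        b2s.append(b2)
    return sum(b1s) / reps, sum(b2s) / reps

avg_b1, avg_b2 = monte_carlo()
print(f"average beta1_hat = {avg_b1:.3f} (true value 2)")
print(f"average beta2_hat = {avg_b2:.3f} (true value 0)")
```

Individual estimates of β₂ will not be exactly 0 in any one sample; it is their average across samples that should be close to 0.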
So bias is not an issue for inclusion of irrelevant variables: but what is?
The bias-variance tradeoff
Bias-variance tradeoff, first part: bias
True model
Yi = β₀ + β₁ Xi + β₂ Zi + εi
where β₂ may equal 0
What are 2 estimates of β₁?
β^₁ from: Y^i = β^₀ + β^₁ Xi + β^₂ Zi
β~₁ from: Y~i = β~₀ + β~₁ Xi
Recall omitted relevant variable bias: if β₂ ≠ 0 and Cov(Xi,Zi) ≠ 0, β~₁ is biased, while β^₁ isn't.
2nd part of Bias-Variance tradeoff:
Variance of the 2 estimators
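Which estimator has the lower variance (unless...)?
Var(β~₁) = σ² / Σ(Xi - Xbar)²
Var(β^₁) = σ² / [(1 - R²zx) Σ(Xi - Xbar)²]
where R²zx is the R² from regressing Xi on Zi. β~₁ has the lower variance, unless R²zx = 0 (Xi & Zi uncorrelated), in which case the two variances are equal.
So β^₁ is better for unbiasedness (unless β₂ = 0 or Cov(Xi,Zi) = 0, in which case β~₁ is also unbiased); β~₁ is better for variance (unless R²zx = 0, in which case the variances are equal).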
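The tradeoff can be illustrated by holding X and Z fixed and redrawing only the errors. This is a sketch under assumed values (β₀ = 1, β₁ = 2, β₂ = 0.5, σ² = 1, Z = rho*X + noise); the function and key names are hypothetical.

```python
import random

def csum(a, b):
    """Sum of cross-products of deviations from the mean."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b))

def variance(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def compare(beta2=0.5, rho=0.8, n=100, reps=2000, seed=3):
    """Compare the short (Z omitted) and long (Z included) estimators of beta1."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]          # regressors stay fixed
    z = [rho * xi + rng.gauss(0, 1) for xi in x]
    sxx, szz, sxz = csum(x, x), csum(z, z), csum(x, z)
    r2_zx = sxz ** 2 / (sxx * szz)
    det = sxx * szz - sxz ** 2
    short_est, long_est = [], []
    for _ in range(reps):                            # only the errors are redrawn
        y = [1 + 2 * xi + beta2 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
        sxy, szy = csum(x, y), csum(z, y)
        short_est.append(sxy / sxx)
        long_est.append((szz * sxy - sxz * szy) / det)
    return {
        "mean_short": sum(short_est) / reps,
        "mean_long": sum(long_est) / reps,
        "expected_short": 2 + beta2 * sxz / sxx,     # beta1 plus the bias term
        "var_short": variance(short_est),
        "var_long": variance(long_est),
        "formula_short": 1 / sxx,                    # sigma^2 = 1
        "formula_long": 1 / ((1 - r2_zx) * sxx),
    }

for key, value in compare().items():
    print(f"{key:>15}: {value:.4f}")
```

With these settings the short estimator should show the smaller variance but a mean away from 2, while the long estimator should be centred on 2 with the larger variance.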