Lecture 12 Flashcards
(28 cards)
When does endogeneity occur?
When the x term is correlated with the error term
When cov(x,u) does not = 0
How could cov(x,u) ≠ 0 arise, and what would it mean?
- omitted variables
- simultaneity: the dependent and independent variables are determined simultaneously, so there is a feedback loop - like price and quantity in supply and demand diagrams
- measurement error: observed x deviates from true x, which can induce correlation with u
It would mean the standard OLS estimators b0^ and b1^ are not consistent
Example of simultaneity
Say we want to see whether government spending reduces unemployment
- but governments often spend more in areas with higher unemployment
- so unemployment itself influences government spending
- thus, if we ignore this reverse causality, we could misinterpret the correlation: seeing higher spending alongside higher unemployment suggests a positive relationship - misleading
Unbiasedness does not mean consistency
Unbiasedness is a small-sample property: an estimator is unbiased if, on average across repeated samples, it hits the true parameter value. Requires E[u|x] = 0
Consistency is a large-sample property: an estimator is consistent if, as the sample size grows infinitely large, the estimates converge to the true value. Requires cov(x,u) = 0
SLR.4 implies cov(x,u) = 0, but not vice versa
Whats the basic idea of instrumental variables
Introduce a third variable z, which affects x but not u; this isolates the variation in x that is exogenous to u
2 key IV assumptions
- consider yi = B0 + B1xi + ui, where cov(xi,ui) ≠ 0
- cov(zi,ui) = 0 (exogeneity): this condition is theoretical and cannot be tested, since it depends on ui, which is unobservable
- cov(zi,xi) ≠ 0 (relevance)
Basically, z is unrelated to u, and z affects yi only through xi
How to test for instrument relevance, i.e. that Zi affects xi
Xi = pi0 + pi1zi + vi
Since pi1 = cov(zi,xi)/var(zi), relevance (cov(zi,xi) ≠ 0) is equivalent to pi1 ≠ 0, so we can and MUST test it
Perform a t test:
- H0: pi1 = 0
- H1: pi1 ≠ 0
IV estimator, B1
First: take the covariance with zi on both sides: cov(zi,yi) = B1·cov(zi,xi) + cov(zi,ui); the second term is 0 by the exogeneity assumption
- rearrange:
B1 = Cov(zi,yi)/cov(zi,xi), then divide top and bottom by var(zi)
- gives you slope coefficient estimator from the reduced form divided by the slope coefficient estimator from the first stage
- same as (Slope from regressing y on z) / (slope from regressing x on z)
B1^ has the same form as the OLS slope estimator, but with zi replacing xi in the cross-product terms
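A minimal numpy sketch of the card above (all numbers and names are illustrative, not from the lecture): the IV slope cov(z,y)/cov(z,x) recovers the true B1 even when x is endogenous, while OLS is biased.

```python
# Illustrative simulation: IV is consistent under endogeneity, OLS is not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1 = 1.0, 2.0                        # true parameters (assumed)

z = rng.normal(size=n)                         # instrument: relevant, exogenous
u = rng.normal(size=n)                         # structural error
x = 0.8 * z + 0.5 * u + rng.normal(size=n)     # endogenous: cov(x,u) != 0
y = beta0 + beta1 * x + u

# IV slope: cov(z,y)/cov(z,x) -- divide top and bottom by var(z) to see it is
# the reduced-form slope over the first-stage slope
beta1_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# OLS slope for comparison: cov(x,y)/var(x) -- biased upward here
beta1_ols = np.cov(x, y)[0, 1] / np.var(x)
```

With this design, beta1_iv lands near 2 while beta1_ols is pulled above it by cov(x,u) > 0.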
Special case of IV: Wald Estimator
When the instrument z is binary:
- E[yi|zi=1] = B0 + B1E[xi|zi=1]
- E[yi|zi=0] = B0 + B1E[xi|zi=0]
E[yi|zi=1] - E[yi|zi=0] = B1(E[xi|zi=1] - E[xi|zi=0])
Rearrange for B1, then in sample you replace the expectations with the sample averages to have the Wald Estimator
Basically: the difference in mean y between instrument groups divided by the difference in mean x between instrument groups
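The Wald formula can be sketched directly (an illustrative simulation; the data-generating numbers are assumptions):

```python
# Wald estimator with a binary instrument: ratio of group-mean differences.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1 = 1.5                                     # true effect (assumed)

z = rng.integers(0, 2, size=n)                  # binary instrument
u = rng.normal(size=n)
x = 1.0 * z + 0.7 * u + rng.normal(size=n)      # endogenous regressor
y = 0.5 + beta1 * x + u

# Difference in mean y across instrument groups over difference in mean x
num = y[z == 1].mean() - y[z == 0].mean()
den = x[z == 1].mean() - x[z == 0].mean()
beta1_wald = num / den
```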
Variance of IV estimator:
Var^(B1,iv^), assuming homoskedasticity:
σ^2^/(SSTx · R²x,z)
- SSTx = SUM(xi - x̄)²
- R²x,z = the R² from a regression of xi on zi and an intercept; tells you how well z predicts x
- σ^2^ = (1/(n-2)) SUM(ui^²), where ui^ are the residuals
IV vs OLS
Advantage of IV: consistent even if u and x are correlated, in which case, the OLS estimator is biased and inconsistent
Disadvantage of IV estimator: less efficient if u and x are uncorrelated
The variance of the IV estimator is always larger than the variance of the OLS estimator and depends crucially on the correlation between z and x; compare the formulae to see this
Tradeoff: gain consistency at the cost of precision
Weak instruments and Bias:
- a weak instrument means z and x are only weakly correlated, which leads to imprecise IV estimates but can also give large bias
- mathematically, plim B1,iv^ = B1 + cov(z,u)/cov(z,x); if the denominator is small (a weak instrument), the second term becomes very large - representing a lot of bias
The denominator measures how strongly z predicts x; if it tends to 0, the estimator becomes unreliable and sensitive to small changes in the data
What is the rule of thumb with weak instruments
- if instruments are weak, sampling distribution is not well approximated by normal, even in large samples
RoT: first-stage F statistic above 10 (with a single instrument, the same as a |t| statistic above √10 ≈ 3.2) means the instrument is roughly strong enough
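A sketch of the rule of thumb (illustrative setup; with one instrument the first-stage F is just the square of the t statistic on pi1):

```python
# First-stage strength check: t test on pi1, F = t^2 with one instrument.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
z = rng.normal(size=n)
x = 0.3 * z + rng.normal(size=n)      # pi1 = 0.3: a reasonably strong instrument

# OLS of x on (1, z)
Z = np.column_stack([np.ones(n), z])
pi_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)
resid = x - Z @ pi_hat
sigma2 = resid @ resid / (n - 2)                      # homoskedastic variance
var_pi1 = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]
t_stat = pi_hat[1] / np.sqrt(var_pi1)
F_stat = t_stat**2                                    # rule of thumb: want F > 10
```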
IV in the MLR model
To consistently estimate all of the Bs, we use the sample analogs of the moment conditions:
- E[ui] = 0
- cov(ui,zi) = 0
- cov (ui,xi2) = 0
Where xi2 is the exogenous explanatory variable, unlike the endogenous xi1
Solve 3 equations in 3 unknowns (B0, B1, B2)
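The three moment conditions can be solved in one line of linear algebra: with instrument matrix W = [1, z, x2] and regressor matrix X = [1, x1, x2], the sample analogs give W'(y - Xb) = 0, i.e. b = (W'X)⁻¹W'y. A sketch with illustrative numbers:

```python
# IV in the MLR model via the sample moment conditions.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
u = rng.normal(size=n)
z = rng.normal(size=n)                                     # instrument for x1
x2 = rng.normal(size=n)                                    # exogenous regressor
x1 = 0.9 * z + 0.4 * x2 + 0.6 * u + rng.normal(size=n)     # endogenous regressor
y = 1.0 + 2.0 * x1 - 1.0 * x2 + u                          # true b = (1, 2, -1)

X = np.column_stack([np.ones(n), x1, x2])
W = np.column_stack([np.ones(n), z, x2])
b_iv = np.linalg.solve(W.T @ X, W.T @ y)   # solves the 3 moment equations
```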
IV in the MLR, what happens between the exogenous and the endogenous explanatory variables?
Z must be correlated with x1, and the correlation must hold even after controlling for x2
- to verify the relevance of z as an instrument, regress xi1 on zi and xi2, and perform a t test of whether pi (the coefficient on zi) is 0 or not
Exogeneity condition is now: cov(zi,ui|xi2) = 0, meaning after controlling for xi2, zi should have no correlation with ui
What about the 2SLS model?
- how is it different to what we have so far?
- why use it?
- what is the test for instrument relevance?
- multiple instruments z1, …, zM, so the first-stage regression of the endogenous variable x1 on all of them is longer
- multiple instruments improve the precision of estimates and allow for overidentification tests
- relevance test: H0: pi1 = pi2 = … = piM = 0, an F test across the multiple instruments; aim for F > 10
2SLS, step-by-step model
- Estimate the first stage regression, regressing the endogenous explanatory variable on the instruments and all the other exogenous explanatory variables, do relevance F test too
- Compute the predicted value of x1, xi1^
- Yi = B0 + B1xi1^ + B2xi2 + ei, regressing the outcome variable on xi1^, and all the other exogenous explanatory variables
Coefficient on xi1^ is the 2SLS estimate of B1
Get the SE and 1st stage F stat
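The steps above can be sketched as follows (an illustrative two-instrument setup; note that in practice software corrects the second-stage SEs, since plain OLS standard errors computed on xi1^ are not valid):

```python
# Step-by-step 2SLS with two instruments and one exogenous control.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
u = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)            # instruments
x2 = rng.normal(size=n)                                    # exogenous regressor
x1 = 0.7 * z1 + 0.5 * z2 + 0.3 * x2 + 0.6 * u + rng.normal(size=n)
y = 0.5 + 2.0 * x1 + 1.0 * x2 + u                          # true B1 = 2

# First stage: regress x1 on the instruments and all exogenous regressors
W = np.column_stack([np.ones(n), z1, z2, x2])
pi_hat, *_ = np.linalg.lstsq(W, x1, rcond=None)
x1_hat = W @ pi_hat                                        # predicted x1

# Second stage: regress y on x1_hat and the exogenous regressors
X2 = np.column_stack([np.ones(n), x1_hat, x2])
b_2sls, *_ = np.linalg.lstsq(X2, y, rcond=None)            # b_2sls[1] estimates B1
```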
Potential issues with adding instruments
- adding instruments with low predictive power in the 1st stage lowers the F statistic and exacerbates the bias in the 2SLS estimator, makes estimators tend to OLS estimates
Testing for endogeneity: Hausman test
H0: cov(xi1,ui) = 0; H1: cov(xi1,ui) ≠ 0
- under the null, both OLS and IV are consistent; under the alternative, only IV is consistent
- run the 2SLS first stage; vi is the error, capturing the part of xi1 not explained by zi
- calculate the first-stage residual, vi^ = xi1 - xi1^
- add vi^ to the regression model, with coefficient theta, and estimate by OLS
- if xi1 is exogenous, vi^ should not be correlated with ui, so theta should be 0
- Test this using a t test
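A sketch of the regression-based test above (the data-generating numbers are illustrative; x1 is endogenous by construction, so the t statistic on theta should be large):

```python
# Hausman test: add the first-stage residual v_hat to the structural
# equation and t-test its coefficient theta.
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
u = rng.normal(size=n)
z = rng.normal(size=n)
x1 = 0.8 * z + 0.6 * u + rng.normal(size=n)     # endogenous by construction
y = 1.0 + 2.0 * x1 + u

# First-stage residual: part of x1 not explained by z
Z = np.column_stack([np.ones(n), z])
v_hat = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# Augmented regression: y on (1, x1, v_hat); theta is the last coefficient
X = np.column_stack([np.ones(n), x1, v_hat])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sigma2 = resid @ resid / (n - X.shape[1])
se_theta = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
t_theta = b[2] / se_theta           # large |t| -> reject exogeneity of x1
```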
Difference between over-identified and just identified
- why does this matter?
- if we have exactly as many instruments as endogenous variables, model is just identified - exogeneity not testable
- if we have more instruments than endogenous variables, the model is over-identified.
- overidentification allows for validity testing - we can check whether instruments satisfy the exogeneity condition
Multiple endogenous variables, what to do,
E.g., 3 regressors, but 2 endogenous
X1 and x2 potentially correlated with u
- need at least 2 instruments that:
1. Don’t appear in the main equation for y
2. Satisfy the relevance condition
3. Satisfy the exogeneity condition
Rank condition = instruments must be correlated ENOUGH with the endogenous variables so you can actually estimate the coefficients
Testing overidentification, Sargan Test
- Estimate the 2SLS regression and obtain residuals ui^
- Regress residuals on all excluded instruments, and any other exogenous variables in the model - record the R^2 from this regression
- Compute nR² and compare it to a chi-squared distribution with M-1 degrees of freedom (M instruments, one endogenous regressor); the null is that all IVs are exogenous
If IVs are valid, 2SLS residuals should be uncorrelated with instruments
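The Sargan steps can be sketched as follows (illustrative setup with two valid instruments and one endogenous regressor, so 1 degree of freedom; under the null, nR² should usually fall well below the chi-squared critical value):

```python
# Sargan overidentification test: regress 2SLS residuals on instruments.
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
u = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)   # both valid instruments
x1 = 0.7 * z1 + 0.5 * z2 + 0.6 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + u

# 2SLS: first stage, then second stage on the fitted values
W = np.column_stack([np.ones(n), z1, z2])
x1_hat = W @ np.linalg.lstsq(W, x1, rcond=None)[0]
Xh = np.column_stack([np.ones(n), x1_hat])
b = np.linalg.lstsq(Xh, y, rcond=None)[0]
u_hat = y - (b[0] + b[1] * x1)                    # residuals use the actual x1

# Regress u_hat on all instruments, record R^2, compute n*R^2 ~ chi2(M-1)
e = u_hat - W @ np.linalg.lstsq(W, u_hat, rcond=None)[0]
R2 = 1 - (e @ e) / ((u_hat - u_hat.mean()) @ (u_hat - u_hat.mean()))
sargan = n * R2                     # compare to the chi2(1) critical value
```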
Difference between LATE and ATE:
- LATE is the effect of treatment on outcome for subgroup of individuals whose treatment status is affected by the instrument
- ATE is the effect of treatment on outcome averaged across entire population
When can LATE = ATE
Biv = E[B1i·pi1i]/E[pi1i], which equals E[B1i] when:
- causal effects are homogeneous: everyone's TE is the same (B1i = B1), so the weighting doesn't matter
- the first stage is homogeneous: the instrument affects all individuals equally (pi1i = pi1), so LATE equals ATE as there is no subgroup variation in how z influences x
- the heterogeneity in the TE and in the effect of the instrument are uncorrelated: E[B1i·pi1i] = E[B1i]·E[pi1i]
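A quick illustrative check of the second case (all numbers assumed): with heterogeneous effects B1i but a homogeneous first stage, IV recovers the simple average E[B1i], i.e. LATE = ATE.

```python
# Heterogeneous treatment effects, homogeneous first stage: IV -> ATE.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
beta_i = rng.normal(loc=2.0, scale=0.5, size=n)   # individual effects B1i
pi1 = 0.8                                         # same first stage for everyone

z = rng.normal(size=n)
x = pi1 * z + rng.normal(size=n)
y = beta_i * x + rng.normal(size=n)

beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
ate = beta_i.mean()                               # IV estimate matches this
```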