Lecture 15 Flashcards
What is truncation?
Occurs when some observations are completely missing from your data because of how the sample was selected
- not just a missing y value - the entire data point (x and y) is missing
How to write a model that may include truncation using a selection variable:
si = 1 if unit i is observed and si = 0 otherwise
- si.yi = Bt.(si.xi) + si.ui
- when si = 1 we recover the original model; when si = 0 the equation is just 0 = 0
Running OLS on this selected equation is equivalent to running OLS only on those observations selected out of the n initial draws (prove this).
- the big concern is selection bias: if the selection process is related to the unobserved error term ui, then E(ui | si = 1, xi) ≠ 0 and OLS is no longer unbiased
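A quick numpy sketch can make the "prove this" claim concrete: with si in {0,1}, the OLS slope from regressing si.yi on si.xi over all n draws matches the slope from the selected subsample exactly. The selection rule and coefficient values below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
u = rng.normal(size=n)
beta = 2.0
y = beta * x + u

s = (x > -0.5).astype(float)   # an arbitrary selection rule, for illustration only

# OLS slope from regressing s*y on s*x over all n draws
b_all = np.sum((s * x) * (s * y)) / np.sum((s * x) ** 2)

# OLS slope using only the selected observations
keep = s == 1
b_sel = np.sum(x[keep] * y[keep]) / np.sum(x[keep] ** 2)

print(b_all, b_sel)   # identical, because s_i^2 = s_i and unselected rows contribute 0
```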
What does selection bias in the truncated model set us up for?
- Model the selection mechanism
- Correct for the selection bias, using more advanced methods
Conditions to maintain consistency in OLS
- E[su] = 0, on average, the selection mechanism s must not be correlated with the error u
- also need E[(s.xj).(s.u)] = 0 for each regressor xj, which equals E[s.xj.u] since s^2 = s - a stronger sufficient condition is E[su|sx] = 0: conditional on the selected regressors, the selection-adjusted error has mean 0
- note Cov(si.ui, si.xi) = E[si.xi.ui] - E[si.ui].E[si.xi] (again using si^2 = si)
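One way to see why E[s.xj.u] = 0 is the key condition: write out the OLS estimator on the selected sample. A sketch for the single-regressor, no-intercept case:

```latex
\hat{\beta}
  = \frac{\sum_i s_i x_i y_i}{\sum_i s_i x_i^2}
  = \beta + \frac{\tfrac{1}{n}\sum_i s_i x_i u_i}{\tfrac{1}{n}\sum_i s_i x_i^2}
  \;\overset{p}{\longrightarrow}\;
  \beta + \frac{E[s\,x\,u]}{E[s\,x^{2}]}
```

so OLS on the selected sample is consistent exactly when E[s.x.u] = 0 (assuming E[s.x^2] is finite and nonzero).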
So when does truncation not hurt OLS?
Only when the selection mechanism s is independent of the error u, even after controlling for x. That's a strong condition and usually it doesn't hold, so we need corrections like Heckman's
If selection is completely independent of both x and u
Then E(s.xj.u) = E(s).E(xj.u) = 0
- so OLS is consistent
If selection only depended on the covariates x, not on unobservables
s = s(x)
- then all variation in s is explained by x, so:
E(u|s,x) = E(u|x) = 0 (can prove this), and thus E(s.u|s,x) = s.E(u|s,x) = 0
- this makes OLS valid and consistent on the selected sample, even though we don't observe the whole population
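The missing step, as a sketch: when s is a deterministic function of x, conditioning on (x, s) is the same as conditioning on x alone, and iterated expectations does the rest.

```latex
s = s(x)
\;\Rightarrow\;
E[u \mid x, s] = E[u \mid x] = 0
\;\Rightarrow\;
E[s\,x_j\,u] = E\big[\,s\,x_j\,E[u \mid x, s]\,\big] = 0
```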
Identity for independence:
- W is independent of Z iff?
P(W|Z) = P(W), or P(W,Z) = P(W).P(Z)
- then, E(WZ) = E(W).E(Z)
Simpson’s paradox:
- y = Bt.x + u, E(u|x) = 0
- assuming a linear conditional expectation of y given x, which is fundamental to OLS being valid
- each group can have a positive relationship between x and y, but when you pool the data across all groups, the overall regression line can be negative
- happens when a latent variable is correlated with both x and y and is not accounted for in the model
- assumption E(y|x) = xt.B can fail, which breaks the mean independence assumption and the nice properties of OLS.
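A minimal simulated illustration (groups, slopes, and shifts are made up): each group has a within-group slope of about +1, but the group means line up so the pooled slope comes out negative.

```python
import numpy as np

rng = np.random.default_rng(1)

def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    x_c = x - x.mean()
    return np.sum(x_c * (y - y.mean())) / np.sum(x_c ** 2)

# Three groups: within-group slope is +1, but groups with higher x have much lower y
group_x_means = [0.0, 2.0, 4.0]
group_y_shifts = [6.0, 3.0, 0.0]     # the omitted "latent variable"

xs, ys, slopes = [], [], []
for mx, shift in zip(group_x_means, group_y_shifts):
    x = mx + rng.normal(scale=0.3, size=200)
    y = shift + 1.0 * (x - mx) + rng.normal(scale=0.3, size=200)
    slopes.append(ols_slope(x, y))
    xs.append(x); ys.append(y)

pooled_slope = ols_slope(np.concatenate(xs), np.concatenate(ys))
print("within-group slopes:", np.round(slopes, 2))   # all near +1
print("pooled slope:", round(pooled_slope, 2))        # negative
```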
The source of truncation matters, two cases:
- If truncation depends only on x, e.g. s = 1 if x1 > 2, then E(u|x,s) = E(u|x) = 0, OLS still consistent
- If truncation depends on y, so e.g s = 1 if y < c, that’s truncation based on outcome, but selection rule now depends on the unobservable u, so E(s.xj.u) is no longer 0, sample is biased with respect to u
Truncation based on x vs y for OLS
Truncation based on x doesn’t break OLS, truncation based on the dependent variable makes OLS inconsistent as it correlates with the error term.
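A small Monte Carlo contrast of the two cases, with an arbitrary threshold in each (a sketch, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 50_000, 1.0
x = rng.normal(size=n)
u = rng.normal(size=n)
y = beta * x + u

def slope(x, y):
    """OLS slope of y on x with an intercept."""
    x_c = x - x.mean()
    return np.sum(x_c * (y - y.mean())) / np.sum(x_c ** 2)

keep_x = x > 0.5        # truncation based on x only
keep_y = y < 1.0        # truncation based on the outcome y

print("full sample:   ", round(slope(x, y), 3))                    # ~1.0
print("truncated on x:", round(slope(x[keep_x], y[keep_x]), 3))    # still ~1.0
print("truncated on y:", round(slope(x[keep_y], y[keep_y]), 3))    # biased toward 0
```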
So how do we tackle the model below when truncation is based on the dependent variable:
- y = Bt.x + u, u|x ~ N(0, o^2)
- we observe (xi, yi) only if yi < ci
This is truncation from above (right truncation): only values of y below the cutoff ci are observed
- we want the density of y conditional on being observed, i.e. on yi < ci
- f(yi|xi,B,o) / F(ci|xi,B,o)
- where f is the normal PDF of y with mean Bt.xi and variance o^2
- F is the normal CDF evaluated at the cutoff ci
This corrects for the selection bias introduced by the truncation and lets us construct a likelihood function from the observed data alone, estimating B and o via MLE
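A minimal sketch of that truncated-normal MLE using scipy; the simulated data, true parameter values, and the common cutoff c are invented for illustration (a real analysis would plug in the observed xi, yi, ci).

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Simulated example: true B = [1.0, 2.0], o = 1.0, cutoff c = 2.0
n, c = 5_000, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
keep = y < c                       # we only observe (x, y) with y below the cutoff
X_obs, y_obs = X[keep], y[keep]

def neg_loglik(params):
    """Negative log-likelihood of the truncated regression:
    sum over observed units of log f(y|x) - log F(c|x)."""
    b, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)      # keep sigma positive
    mu = X_obs @ b
    ll = stats.norm.logpdf(y_obs, loc=mu, scale=sigma) \
         - stats.norm.logcdf(c, loc=mu, scale=sigma)
    return -np.sum(ll)

start = np.array([0.0, 0.0, 0.0])  # [b0, b1, log sigma]
res = minimize(neg_loglik, start, method="BFGS")
b_hat, sigma_hat = res.x[:-1], np.exp(res.x[-1])
print(b_hat, sigma_hat)            # recovers roughly (1.0, 2.0) and 1.0
```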
Incidental truncation - what is it
We only observe y for some of the population, and whether we observe it depends on some other decision process, which may correlate with y
- e.g. if y is wages, we only observe data on wages for people who work, and that participation decision can depend on multiple factors, so our sample is no longer random
Setup for a sample selection problem
Outcome equation:
- y = Bt.x + u, E(u|x,z) = 0
Selection equation:
- s = 1[γt.z + v >= 0], a latent index model: y is observed only if a linear function of z plus the noise v is non-negative.
E(y|z,v) = Bt.x + E[u|z,v]
- we assume u,v are jointly normal and independent of z
E(y|z,v) = Bt.x + pv
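Where the p.v term comes from, as a sketch: joint normality makes the conditional mean of u given v linear in v, and with Var(v) = 1 the coefficient is just Cov(u, v), which we call p (rho).

```latex
(u, v)\ \text{jointly normal and independent of } z,\ \operatorname{Var}(v) = 1
\;\Rightarrow\;
E[u \mid z, v] = E[u \mid v]
  = \frac{\operatorname{Cov}(u, v)}{\operatorname{Var}(v)}\, v
  = \rho\, v,
\qquad
E[y \mid z, v] = \beta' x + \rho\, v
```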
For the sample selection model, what happens when we only observe y if s = 1?
s = 1 means v >= -γt.z
- since v is standard normally distributed,
E(v|s=1) = k(γt.z), where k is the inverse Mills ratio
E(y|z,s=1) = Bt.x + p.k(γt.z); we recover consistent estimates of B only if we control for the selection bias term k(γt.z)
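The inverse Mills ratio is just the mean of a truncated standard normal; a sketch of the step from s = 1 to k(γt.z), writing φ and Φ for the standard normal PDF and CDF:

```latex
E[v \mid s = 1]
  = E[v \mid v \ge -\gamma' z]
  = \frac{\phi(-\gamma' z)}{1 - \Phi(-\gamma' z)}
  = \frac{\phi(\gamma' z)}{\Phi(\gamma' z)}
  \equiv k(\gamma' z)
```

using the symmetry of the standard normal (φ(-a) = φ(a) and 1 - Φ(-a) = Φ(a)).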
Heckman correction process
- Estimate the selection equation with a probit to get γ^
- Compute k^ = k(γ^t.z), the estimated inverse Mills ratio
- Include this as a regressor in the outcome equation only for the observed sample:
yi = Bt.xi + p.k(γ^t.zi) + error - run OLS on this new equation using only the selected observations
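A minimal two-step sketch with statsmodels and scipy. The simulated data, the correlation value, and the extra selection variable z2 are all invented for illustration, and a real application would also correct the second-step standard errors (see the caveats below).

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 20_000

# Simulated data: x appears in both equations, z2 only in the selection equation
x = rng.normal(size=n)
z2 = rng.normal(size=n)
u, v = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n).T  # rho = 0.6
y_star = 1.0 + 2.0 * x + u                       # outcome equation
s = (0.5 + 1.0 * x + 1.0 * z2 + v >= 0)          # selection equation

# Step 1: probit of s on z = (1, x, z2), giving gamma-hat
Z = sm.add_constant(np.column_stack([x, z2]))
probit_res = sm.Probit(s.astype(int), Z).fit(disp=0)
index = Z @ probit_res.params                    # gamma-hat' z
mills = norm.pdf(index) / norm.cdf(index)        # k(gamma-hat' z), inverse Mills ratio

# Step 2: OLS of y on x and the Mills ratio, using the selected observations only
X2 = sm.add_constant(np.column_stack([x, mills]))[s]
ols_res = sm.OLS(y_star[s], X2).fit()
print(ols_res.params)   # roughly [1.0, 2.0, rho*sigma_u]; naive OLS without mills is biased
```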
What’s going on conceptually in Heckman correction
Correcting for the non-random selection into the sample
- by including k, you’re adjusting for the fact that the sample you’re estimating on is not representative of the full population, due to the selection mechanism.
So when is OLS consistent in the presence of selection
- we only observe y when s = 1, so whether OLS will be consistent on this selected sample depends on relation between u and v
Case 1: p = cov(u,v) = 0 - selection is unrelated to the outcome error, so OLS on the selected sample is consistent and the inverse Mills ratio term is irrelevant
Case 2: p ≠ 0 - the selection mechanism is informative about the outcome error, so use the probit model to estimate γ and apply the correction
Why does the Heckman correction work
- when selection is endogenous, the conditional mean of u given selection is non-zero and depends on z
- the inverse Mills ratio captures this dependence
- by including k(.), we control for that selection bias, turning a biased regression into a consistent one.
Caveats in the two-step correction: standard errors
- SEs need adjustment
- after the first-step estimation of the selection equation (estimating γ^ from the probit), we use k(γ^t.zi) in the second step
BUT this two-step procedure introduces generated regressor bias: k(.) is estimated, not observed, so OLS SEs understate uncertainty unless you correct them (e.g. adjusted SEs or bootstrapping both steps)
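One common fix is to bootstrap the entire two-step procedure, redoing both the probit and the corrected OLS on every resample. A rough sketch, where heckman_two_step is a hypothetical helper wrapping the two steps above (not a library routine):

```python
import numpy as np

def bootstrap_se(arrays, heckman_two_step, n_boot=500, seed=0):
    """Bootstrap SEs for the two-step estimator: resample observations with
    replacement and redo BOTH the probit and the corrected OLS each time.
    `arrays` is a tuple of equal-length numpy arrays (e.g. y, x, z, s) and
    `heckman_two_step` is a user-supplied helper returning the coefficient vector."""
    rng = np.random.default_rng(seed)
    n = len(arrays[0])
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                     # resample rows
        draws.append(heckman_two_step(*(a[idx] for a in arrays)))
    return np.array(draws).std(axis=0)
```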
Caveats in the two-step correction: identification
- overlap between x and z, identification concerns
Ideally, the set of variables in the selection equation z includes at least one variable not in x, called an exclusion restriction
- multicollinearity can occur if x = z, so include some variables in z that are excluded from x, improving identification.