Selection on (Un)Observables: Selection Correction Flashcards
Cluster-robust standard errors
Now, classic OLS makes two assumptions concerning this matrix (the variance-covariance matrix of the error terms): 1. that Var[εi|X,D] = σ² (equal variances, or homoscedasticity), 2. and that Cov[εi,εj|X,D] = 0, for all i ≠ j.
This latter assumption means that the error terms (the deviations from the expected values of Y) of any two observations i and j, i, j = 1…6, are uncorrelated.
The problem arises because the standard formulas (e.g. those used by Stata) for the standard errors of the coefficients βˆ, δˆ assume that all η = 0. A violation of this assumption does not affect βˆ, δˆ themselves (no bias!), only their standard errors. Fortunately, one can usually specify a ‘robust’ or ‘cluster robust’ option, which corrects the standard errors.
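A minimal sketch of the difference between the classic and the cluster-robust formula, using only the Python standard library. The data are simulated and every number in it (50 clusters of 20 observations, a true slope of 0.5, a common shock per cluster) is an assumption made for illustration. The sandwich formula is applied without a small-sample correction, so the result will not match a package such as Stata exactly.

```python
import random

def slope_with_standard_errors(x, y, cluster):
    """OLS of y on a constant and x: slope, classic SE, cluster-robust SE."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]  # (X'X)^-1
    intercept = inv[0][0] * sy + inv[0][1] * sxy
    slope = inv[1][0] * sy + inv[1][1] * sxy
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classic formula: equal variances and zero covariances between errors.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classic = (sigma2 * inv[1][1]) ** 0.5

    # Cluster-robust sandwich: add up x_i * e_i within each cluster first,
    # which allows errors in the same cluster to be correlated.
    score = {}
    for xi, e, g in zip(x, resid, cluster):
        s = score.setdefault(g, [0.0, 0.0])
        s[0] += e
        s[1] += xi * e
    meat = [[sum(s[r] * s[c] for s in score.values()) for c in range(2)]
            for r in range(2)]
    var_slope = sum(inv[1][r] * meat[r][c] * inv[c][1]
                    for r in range(2) for c in range(2))
    return slope, se_classic, var_slope ** 0.5

random.seed(1)
x, y, cluster = [], [], []
for g in range(50):
    x_g = random.gauss(0, 1)      # regressor varies only between clusters
    shock_g = random.gauss(0, 1)  # error component shared within the cluster
    for _ in range(20):
        x.append(x_g)
        y.append(1.0 + 0.5 * x_g + shock_g + random.gauss(0, 1))
        cluster.append(g)

slope, se_classic, se_cluster = slope_with_standard_errors(x, y, cluster)
print(f"slope={slope:.3f}  classic SE={se_classic:.3f}  cluster SE={se_cluster:.3f}")
```

The slope is the same in both cases; only the standard error changes, and in this setup the classic one is too small.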
Binary outcome models
When the outcome Y is binary (either 0 or 1), we fit models ‘predicting’ the probability that individual i has Y = 1.
We can assume the curve on which these probabilities lie to take different functional forms:
- linear probability (OLS) assumes a straight line
- probit assumes the CDF of a normal distribution
- logit assumes the CDF of a logistic (very similar to normal)
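The three functional forms can be compared in a few lines. This is only a sketch: the intercepts and slopes below are hypothetical values chosen so that the curves are roughly comparable, not estimates from any data.

```python
import math
from statistics import NormalDist

def linear_probability(x, a=0.5, b=0.2):
    # Straight line: nothing keeps the prediction inside [0, 1].
    return a + b * x

def probit_probability(x, a=0.0, b=1.0):
    # CDF of the standard normal distribution, always between 0 and 1.
    return NormalDist().cdf(a + b * x)

def logit_probability(x, a=0.0, b=1.6):
    # CDF of the logistic distribution, always between 0 and 1.
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  linear={linear_probability(x):.3f}  "
          f"probit={probit_probability(x):.3f}  logit={logit_probability(x):.3f}")
```

At extreme values of x the straight line leaves the unit interval, while the probit and logit curves flatten out.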
Tobit Model
Suitable for censored data. Essentially assumes that X affects two things:
- likelihood of Y > 0
- value of Y provided that (or ‘conditional on’) Y > 0.
The expected value of Y predicted by a tobit is therefore
E[y|x] = Pr(y > 0|x) E[y|y > 0, x],
> not straightforward to interpret.
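The two parts of this product can be written out explicitly. The sketch below assumes the standard tobit with normally distributed errors and censoring at zero (the cards do not state the variant), where xb stands for the linear index and sigma for the error standard deviation; both names are chosen here for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def tobit_expectation(xb, sigma):
    """Return Pr(y > 0 | x), E[y | y > 0, x] and their product E[y | x]."""
    z = xb / sigma
    prob_positive = STD_NORMAL.cdf(z)
    # Mean of the uncensored part: the index plus a correction term.
    mean_if_positive = xb + sigma * STD_NORMAL.pdf(z) / prob_positive
    return prob_positive, mean_if_positive, prob_positive * mean_if_positive

for xb in (-1.0, 0.0, 1.0):
    p, m, e = tobit_expectation(xb, sigma=1.0)
    print(f"xb={xb:+.1f}  Pr(y>0)={p:.3f}  E[y|y>0]={m:.3f}  E[y]={e:.3f}")
```

A one-unit change in the index moves both factors at once, which is why a tobit coefficient cannot be read as the change in the expected value of y.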
Problems with Sample selection
Estimating effects based on a sample that is not randomly drawn from the population can produce bias. Systematic selection into the sample on which data is available is such a case (‘sample selection’).
Examples: we only observe wages of people who actually work.
- migration on earnings (decision to migrate likely to be driven by unobserved factors that also determine pay)
- family holiday expenditure (number of kids affects decision to go on holiday, and how much is spent once on holiday)
- institutions (decision to adopt certain institutions depends on factors that also matter for their effect once they are adopted)
Important: The difference from selection into treatment is that here, selection determines whether we observe Y for certain subjects at all.
Sample selection bias
The bias arises through the error term (i.e. unobserved factors).
Take the education-wage example:
- less-educated people are most likely to have a job if they have good other skills
- such skills are usually unobserved, i.e. part of the error of our model
- and they affect the wage, which is the outcome Y
→ sample contains systematically more people with high unobserved skills
→ OLS (or any other uncorrected model) ends up producing biased coefficients
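The mechanism can be reproduced in a small simulation. Everything in it is assumed for illustration: standardized education, a true return of 0.5, and a rule under which a person works when education plus unobserved skill plus luck is positive.

```python
import random

def ols_slope(x, y):
    """Slope of a simple regression of y on x (with a constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

random.seed(42)
TRUE_RETURN = 0.5
education, wage, works = [], [], []
for _ in range(20000):
    edu = random.gauss(0, 1)
    skill = random.gauss(0, 1)  # unobserved: part of the wage error
    luck = random.gauss(0, 1)
    education.append(edu)
    wage.append(1.0 + TRUE_RETURN * edu + skill)
    works.append(edu + skill + luck > 0)  # wage is observed only if True

# Benchmark: wages of everybody (not available in real data).
slope_all = ols_slope(education, wage)

# What we can actually estimate: people with an observed wage.
edu_obs = [e for e, w in zip(education, works) if w]
wage_obs = [y for y, w in zip(wage, works) if w]
slope_selected = ols_slope(edu_obs, wage_obs)

print(f"true return={TRUE_RETURN}  all={slope_all:.3f}  workers only={slope_selected:.3f}")
```

Among workers, low education goes together with high unobserved skill, so the estimated return to education is pulled below the true value.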
Sample Selection in causal graph
- sample selection problem different from ‘causal situations’ looked at so far
- here we actually want to know the effect of some X (education) on Y (wage), not the causal effect of D
- problem is Y being unavailable if D = 0
Selection correction (basic idea)
- explicitly model the selection process (‘selection stage’ or ‘selection equation’)
- yields an estimate of the likelihood of every observation to be in the sample
- this information is used to calculate the so called inverse Mills ratio (IMR)
- in a separate equation we model the outcome of interest (‘outcome equation’)
- including the IMR as a variable corrects for selection bias
- think of the IMR as the correlation between error in the selection equation, and the error of the outcome equation without selection correction
Selection correction (mathematical) 1
consider first the selection equation:
di* = Zi β + εi, with di = 1 if di* > 0 and di = 0 otherwise, so that Pr(di = 1) = Φ(Zi β),
which determines whether we observe a wage for individual i or not (di = 0, 1).
- Z is a set of independent variables, here including education level
- β is a coefficient vector
- ε is the error term of the selection equation
Consider now the wage equation:
wi = α + Xi′γ + ui,
α is a constant, X contains all or a subset of variables in Z , γ is a coefficient vector, u is the error term of the wage equation.
Selection correction (mathematical) 2
- if we estimate the wage equation only on the observations with a wage (all of whom have di = 1), then our γˆ is biased.
I.e., Cov(εi, ui) ≠ 0, implying that also Cov(εi, Yi) ≠ 0, which violates an assumption essential for unbiased estimation. However, if we now
1. estimate the selection equation to obtain a βˆ
2. calculate the so-called inverse Mills ratio: IMRi = λi = φ(Zi βˆ) / Φ(Zi βˆ)
3. and estimate the wage equation with this as a variable, that is wi = α + Xi′γ + ρλi + ui,
the resulting γˆ is consistent (i.e. unbiased with large samples).
⇒ Intuition: in a non-randomly selected sample the IMR is an omitted variable, and inclusion of it takes out the omitted variable bias.
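The omitted-variable intuition can be checked numerically: among selected observations, the mean of the wage error is proportional to the IMR. The sketch below is not a full two-step estimation. It uses a known selection index instead of an estimated one, and the index value (0.3) and the error covariance (0.8) are assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def inverse_mills_ratio(index):
    """phi(index) / Phi(index), evaluated at the selection index."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)

random.seed(7)
index = 0.3       # hypothetical value of the selection index
cov_eps_u = 0.8   # assumed covariance of selection error and wage error

wage_errors_of_selected = []
for _ in range(200000):
    eps = random.gauss(0, 1)                    # selection error
    u = cov_eps_u * eps + random.gauss(0, 0.6)  # wage error, correlated with eps
    if index + eps > 0:                         # observed only if selected
        wage_errors_of_selected.append(u)

mean_error = fmean(wage_errors_of_selected)
predicted = cov_eps_u * inverse_mills_ratio(index)
print(f"mean wage error among selected={mean_error:.3f}  covariance x IMR={predicted:.3f}")
```

Because this mean is not zero and varies with the index, leaving the IMR out of the wage equation acts like leaving out a relevant regressor.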
Confounders of causal sector effect estimation
- wage determination processes may differ between the sectors
> differences in returns to skill; differences in regulation; trade-off between high pay and job security; symbolic rewards and intrinsic motivation may be substitutes for pecuniary rewards
- most importantly, employees self-select into sectors according to
> preferences over high pay versus job security, symbolic versus monetary rewards, etc.
> their anticipated net utility in either sector (i.e. expected returns minus expected effort)
⇒ Many of these factors are generally unobserved and affect wages.
Roy model (aka ‘endogenous switching regression model’)
Consider a sector selection equation (public sector D = 1, business sector D = 0),
D = 1 if (log w1 − log w0) + Z′βS + εS > 0
D = 0 if (log w1 − log w0) + Z′βS + εS ≤ 0
Rewrite this as a binary outcome model:
Pr(D = 1) = F[Z,βS,(log w1 − log w0),εS].
Consider further sector wages to be determined as follows:
log w1 = X′γ1 + u1,
log w0 = X′γ0 + u0.
Features of Roy setup
- the wage and selection equations are mutually dependent, i.e.
- Di indicates which wage equation determines the wage of i
- at the same time, the sector choice of i, Di , depends on log wi1 − log wi0
Sector wages: why not run 2 OLS?
- problem: if we were to OLS-estimate a separate wage equation for all public sector and for all private sector employees
- we’d have bias for the same reasons as in the Heckman model
- suppose public sector jobs are sought-after, and generally highly educated people work there
- some less-educated people may also manage to get a public sector job; the ‘causes’ for this are most likely unobserved and thus in the error εS
- since the same ‘causes’ usually affect wages, Cov(u0, εS) ≠ 0 and Cov(u1, εS) ≠ 0
- however, separate OLS-estimation of the wage equations implicitly assumes that
Cov(u0,εS) = Cov(u1,εS) = 0
How do you correct for sample selection in sector wages?
Include the inverse Mills ratios, λ1 and λ0, obtained from the selection equation, as an additional variable with coefficients ρ1 and ρ0 in the wage equations:
log w1 = X′γ1 + ρ1λ1 + u1,
log w0 = X′γ0 + ρ0λ0 + u0.
Interpret Roy results
Having unbiased coefficient estimates γˆ and ρˆ …
- … we can predict the public and private sector wages for particular value combinations of X
- for example, if in X we have age, schooling, and gender, we can predict wages log wˆ1, log wˆ0 in both sectors for a 30-year-old woman with Abitur
⇒ the difference log(wˆ1|X) − log(wˆ0|X) is the effect (in percent, because of the logs) on the wage of a switch from the private to the public sector for a person with characteristics X
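A worked example of this prediction step. The coefficient values are made up for illustration and do not come from any estimation; the sketch uses only the X′γˆ part of each wage equation, which is an assumption about how the prediction is formed.

```python
import math

# Hypothetical coefficient estimates (not from real data).
gamma_public = {"const": 2.0, "age": 0.010, "abitur": 0.20, "female": -0.05}
gamma_private = {"const": 1.9, "age": 0.012, "abitur": 0.25, "female": -0.10}

# Characteristics X: a 30-year-old woman with Abitur.
person = {"const": 1, "age": 30, "abitur": 1, "female": 1}

def predict_log_wage(gamma, x):
    """Predicted log wage X'gamma for one person."""
    return sum(gamma[k] * x[k] for k in gamma)

log_wage_public = predict_log_wage(gamma_public, person)
log_wage_private = predict_log_wage(gamma_private, person)
log_wage_gap = log_wage_public - log_wage_private

# A log difference is approximately a percentage difference;
# the exact figure is exp(gap) - 1.
percent_gap = (math.exp(log_wage_gap) - 1.0) * 100.0
print(f"log wage gap={log_wage_gap:.3f}  exact difference={percent_gap:.2f}%")
```

With these numbers the log gap is 0.04, i.e. a wage roughly 4 percent higher in the public sector for this person.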