Selection on Observables: Regression and Matching Flashcards
Regression in terms of the POM
Write the observed outcome Y in terms of the treatment state D = 0, 1 and the potential outcomes YT and YU:
Y = D·YT + (1 - D)·YU
and then re-arrange and substitute:
Y = YU + (YT - YU)·D
  = μU + δ·D + vU,
where μU = E(YU), δ is the effect of D, which is assumed not to differ between the T and C groups, and vU = YU - E(YU). Interpret this latter term as a deviation from the expected outcome.
However, more realistically δ may differ across groups, which is taken into account in
Y = μU + (μT - μU)·D + [vU + (vT - vU)·D],
with analogous definitions of μT and vT.
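As a sanity check on this decomposition, here is a minimal numerical sketch (hypothetical distributions and parameter values): it draws the potential outcomes, forms the observed Y, and confirms that the rearranged expression reproduces Y exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes: YU ~ N(1, 1), YT = YU + a heterogeneous effect
yU = rng.normal(loc=1.0, scale=1.0, size=n)
yT = yU + rng.normal(loc=2.0, scale=0.5, size=n)   # effect averages 2, varies by unit

# Random treatment assignment, for illustration only
d = rng.integers(0, 2, size=n)

# Observed outcome: Y = D*YT + (1 - D)*YU
y = d * yT + (1 - d) * yU

# Decomposition: Y = muU + (muT - muU)*D + [vU + (vT - vU)*D]
muU, muT = yU.mean(), yT.mean()
vU, vT = yU - muU, yT - muT
y_decomposed = muU + (muT - muU) * d + (vU + (vT - vU) * d)

print(np.allclose(y, y_decomposed))   # True: the identity holds by construction
print(muT - muU)                      # the true ATE in this simulated population
```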
Regression in terms of POM II
Y = μU + (μT - μU)·D + [vU + (vT - vU)·D]. We can view this as similar to a regression equation with
- μU = E(YU) as an intercept,
- μT - μU = E(YT) - E(YU) = the true ATE as the regression coefficient, and
- vU + (vT - vU)·D as an error term.
However, when we estimate this by OLS to get δ̂OLS, we get δ̂naive (see first lecture).
OLS bias
- vU is correlated with D
> even if (vT - vU)·D is uncorrelated with D, the error term is correlated with D
> i.e. a baseline level difference in Y depending on D (selection bias)
- or (vT - vU)·D is correlated with D
> even if vU is uncorrelated with D, the error is correlated with D
> i.e. a difference in the size of the treatment effect depending on D (treatment effect bias)
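A minimal simulation of the selection-bias case (hypothetical values, constant treatment effect) shows how a correlation between vU and D makes the naive estimate overstate the true effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Baseline potential outcome and a constant treatment effect of 2 (hypothetical values)
yU = rng.normal(0.0, 1.0, size=n)
yT = yU + 2.0

# Selection bias: units with a higher baseline yU are more likely to be treated,
# so vU = yU - E(yU) is correlated with D
p = 1 / (1 + np.exp(-yU))
d = rng.random(n) < p

y = np.where(d, yT, yU)

naive = y[d].mean() - y[~d].mean()   # difference in means = delta-hat naive
print(f"true ATE = 2.0, naive estimate = {naive:.2f}")   # overstates the effect
```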
Controls in OLS
If we control for X, we may arrive at D being uncorrelated with the error term conditional on X, because the regression adjusts the estimated effect of D on Y for the influence of X on Y
> OLS ‘cleans’ the naive regression of yi on di from their common dependence on X
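To illustrate this ‘cleaning’ under an assumed data-generating process with a single confounder X, the sketch below compares the naive OLS coefficient on D with the coefficient once X is added as a control:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical confounder X that drives both treatment take-up and the outcome
x = rng.normal(0.0, 1.0, size=n)
d = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)
y = 1.0 + 2.0 * d + 1.5 * x + rng.normal(0.0, 1.0, size=n)   # true effect of D is 2

# Naive OLS of y on d alone vs. OLS that 'cleans' the estimate by controlling for x
X_naive = np.column_stack([np.ones(n), d])
X_ctrl = np.column_stack([np.ones(n), d, x])
b_naive = np.linalg.lstsq(X_naive, y, rcond=None)[0]
b_ctrl = np.linalg.lstsq(X_ctrl, y, rcond=None)[0]

print(f"naive coefficient on D:    {b_naive[1]:.2f}")   # biased upward
print(f"coefficient on D with X:   {b_ctrl[1]:.2f}")    # close to 2
```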
Reasons for non-causality of OLS with controls
- ‘cleaning’ depends on the variables in X and whether their specification is ‘correct’
- the most flexible specification gives the best cleaning > ideally, specify a saturated model
- if successful, this leads to vU being uncorrelated with D
- however, (vT - vU)·D may still be correlated with D because the treatment effect differs in each stratum (effect heterogeneity)
- in that case, even if we control for X in a saturated specification, this can create bias due to a property of the OLS procedure, which overemphasizes outliers (note: ‘ordinary least squares’)
Bias of OLS with effect heterogeneity
- OLS undertakes an implicit weighting of the marginal treatment effects, which depends on the within-stratum variance of treatment, Pr(D = 1|X)·(1 - Pr(D = 1|X)), not only on the marginal distribution of individuals across the strata of X (see the sketch after this list)
- a particular form of re-weighting can solve this problem (using Weighted Least Squares, WLS) if the weights offset the implicit variance weights of OLS, but why then use OLS in the first place?
-> bottom line: OLS is best used as a descriptive tool and usually just a first cut at approximating some ‘effect’
-> matching is conceptually clearer and usually more precise (if treatment selection on unobservables is ignorable!)
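The following sketch (two strata with hypothetical shares, treatment probabilities, and effects) illustrates the implicit weighting: the OLS coefficient on D, with the stratum dummy controlled for, recovers the variance-weighted average of the stratum effects rather than the ATE.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000

# Two strata of X with different treatment effects and different treatment probabilities
x = rng.integers(0, 2, size=n)               # stratum indicator, ~50/50
p_treat = np.where(x == 1, 0.9, 0.5)         # Pr(D=1|X): 0.5 in X=0, 0.9 in X=1
d = (rng.random(n) < p_treat).astype(float)
effect = np.where(x == 1, 4.0, 1.0)          # stratum effects: 1 in X=0, 4 in X=1
y = 0.5 * x + effect * d + rng.normal(0.0, 1.0, size=n)

# ATE weights strata by their population shares: 0.5*1 + 0.5*4 = 2.5
ate = effect.mean()

# OLS with a saturated control for X (intercept + dummy for x == 1), D entered additively
X = np.column_stack([np.ones(n), d, x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# OLS implicitly weights stratum effects by share * Var(D|X) = share * p * (1 - p)
w = np.array([0.5 * 0.5 * 0.5, 0.5 * 0.9 * 0.1])
ols_implied = (w * np.array([1.0, 4.0])).sum() / w.sum()

print(f"ATE:                    {ate:.2f}")          # ~2.50
print(f"OLS coefficient on D:   {b[1]:.2f}")         # ~1.79, not the ATE
print(f"variance-weighted avg:  {ols_implied:.2f}")  # matches the OLS coefficient
```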
Matching basics
- suppose you have a sample of treated and untreated subjects, who differ along S (and Y, obviously)
- matching means finding an untreated ‘match’ for each treated case (and vice versa) that is identical in terms of S
- this kind of exact matching is equivalent to stratifying on all observables S (perfect stratification)
- in theory, this means we achieve unit homogeneity
- for either of these techniques to give us the true ATE, two assumptions must hold:
E(YT|D = T, S) = E(YT|D = U, S), and
E(YU|D = T, S) = E(YU|D = U, S)
-> conditional ignorability of treatment assignment
> note: it remains true here that if only the first (second) holds, we can estimate the true ATU (ATT)
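Under an assumed data-generating process where conditional ignorability holds given a discrete S, a perfect stratification estimate can be sketched as follows (stratum effects aggregated with population weights for the ATE and with treated-group weights for the ATT):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Discrete observable S drives both treatment and outcomes (hypothetical DGP)
s = rng.integers(0, 3, size=n)
d = (rng.random(n) < 0.2 + 0.2 * s).astype(int)        # Pr(D=1|S) = 0.2, 0.4, 0.6
y = s + (1.0 + s) * d + rng.normal(0.0, 1.0, size=n)   # stratum effect = 1 + s

# Perfect stratification: compare treated and untreated means within each stratum
effects, n_all, n_treat = [], [], []
for s_val in np.unique(s):
    mask = s == s_val
    eff = y[mask & (d == 1)].mean() - y[mask & (d == 0)].mean()
    effects.append(eff)
    n_all.append(mask.sum())
    n_treat.append((mask & (d == 1)).sum())

# ATE weights strata by population shares, ATT by the distribution of the treated
ate = np.average(effects, weights=n_all)
att = np.average(effects, weights=n_treat)
print(f"stratified ATE = {ate:.2f}, stratified ATT = {att:.2f}")
```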
Practical matching problems
- in practice, perfect matching is rarely feasible because S has too many dimensions (i.e. variables) for perfect matches to exist
- variables in S take on so many values that forming (discrete) strata is not feasible
- even with ‘simple’ S, Pr(D = 1|S) may be 0 for some combinations of s1, s2, s3, …, so that the ATU is not meaningful
- therefore, we rely on each subject’s ‘propensity’ to have D = T
Propensity score
A propensity score summarizes the information from all variables in S and collapses it into one dimension or ‘score’
- caveat: we don’t know true propensity scores
- we usually estimate them (e.g. with a binary outcome model such as logit or probit), Pr(D = T|S) = F(S, β),
and then predict d̂(s) = p̂ = F(S, β̂) (sketched below)
- this procedure imposes a functional form
- the propensity model may be misspecified
- strictly speaking, this usually introduces bias (not necessarily the same as in OLS)
- thus, good research practice is to report both OLS and matching results
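A minimal sketch of the estimation step, assuming a logit specification and hypothetical covariates (using statsmodels):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 10_000

# Hypothetical covariates S and a treatment whose take-up depends on them
s1 = rng.normal(size=n)
s2 = rng.normal(size=n)
d = (rng.random(n) < 1 / (1 + np.exp(-(0.5 * s1 - 1.0 * s2)))).astype(int)

# Estimate Pr(D = T | S) with a logit, then predict each subject's propensity score
S = sm.add_constant(np.column_stack([s1, s2]))
logit = sm.Logit(d, S).fit(disp=0)
p_hat = logit.predict(S)      # p-hat = F(S, beta-hat), one score per subject

print(p_hat[:5].round(3))
# Note: this imposes the logit functional form; a misspecified model biases the scores.
```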
Metrics of distance in propensity scores
- difference in the propensity scores (p̂i - p̂j)
- odds of the propensity score (r̂ = p̂i / (1 - p̂i))
- the Mahalanobis metric
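For two hypothetical subjects i and j, these distances might be computed as follows (the covariance matrix for the Mahalanobis metric is assumed for illustration):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Two hypothetical subjects: estimated propensity scores and raw covariate vectors
p_i, p_j = 0.62, 0.55
s_i = np.array([1.2, 0.3])
s_j = np.array([0.8, 0.7])

# Difference in the propensity scores
d_pscore = abs(p_i - p_j)

# Difference in the odds of the propensity score
d_odds = abs(p_i / (1 - p_i) - p_j / (1 - p_j))

# Mahalanobis distance on the covariates (requires the inverse covariance matrix of S)
cov_S = np.array([[1.0, 0.3], [0.3, 1.0]])
d_mahal = mahalanobis(s_i, s_j, np.linalg.inv(cov_S))

print(d_pscore, d_odds, d_mahal)
```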
Types of matching procedures
- exact matching
> matches k counterfactual cases (k = 0…K) that have exactly the same values in S;
the weight is 1/k
- nearest neighbour matching
> matches the k counterfactual cases that are closest to the treated case along the chosen metric of distance; the weight is 1/k
> nn-caliper matching allows specifying a maximum distance, to avoid bad matches (e.g. if data are sparse)
- interval matching
> data are divided into segments along a metric of distance; the effect is calculated within each interval and then integrated; the weight is 1/k(int)
- kernel matching
> constructs the counterfactual for every treated case from all control cases, with weights assigned according to similarity on some measure of distance (usually the propensity score)
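As an illustration, here is a minimal sketch of 1-nearest-neighbour matching on the propensity score with an optional caliper, under an assumed data-generating process with a homogeneous treatment effect; it is not a production implementation.

```python
import numpy as np

def nn_match_att(y, d, pscore, caliper=None):
    """1-nearest-neighbour matching on the propensity score (with replacement).

    Estimates the ATT by matching each treated case to the closest control;
    a caliper (maximum allowed distance) drops badly matched treated cases.
    """
    treated = np.where(d == 1)[0]
    controls = np.where(d == 0)[0]
    effects = []
    for i in treated:
        dist = np.abs(pscore[controls] - pscore[i])
        if caliper is not None and dist.min() > caliper:
            continue                      # no acceptable match within the caliper
        j = controls[np.argmin(dist)]
        effects.append(y[i] - y[j])
    return float(np.mean(effects))

# Hypothetical data: selection into treatment works only through the propensity score
rng = np.random.default_rng(6)
n = 5_000
s = rng.normal(size=n)
p = 1 / (1 + np.exp(-s))
d = (rng.random(n) < p).astype(int)
y = s + 2.0 * d + rng.normal(size=n)      # true treatment effect is 2

print(f"NN-matching ATT estimate: {nn_match_att(y, d, p, caliper=0.05):.2f}")  # ~2
```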