Lecture 2 (Matching and related methods) Flashcards
When do we say that something is identified?
“$\beta$ is identified in the sense that we can write $\beta$ as a function of only observable variables (population moments of them).”
E.g., under random assignment:
$$
ATE = E[Y|D=1]-E[Y|D=0]
$$
We say that $ATE$ is identified because it can be expressed in terms of the observed variables, $Y, D$.
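As a minimal illustration (simulated data; the data-generating process is mine), the ATE under random assignment can be estimated by the difference in sample means:

```python
# Difference-in-means estimate of the ATE under random assignment.
# The data-generating process below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
d = rng.binomial(1, 0.5, size=n)         # randomly assigned treatment
y = 1.0 + 2.0 * d + rng.normal(size=n)   # true ATE = 2.0

ate_hat = y[d == 1].mean() - y[d == 0].mean()
print(f"Estimated ATE: {ate_hat:.3f}")   # close to 2.0
```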
What do we mean by partial identification?
Partial identification recognizes that identification is not an all-or-nothing concept: models that do not point identify a parameter of interest can, and typically do, still contain valuable information about it. Inference should therefore not be reduced to a binary answer of whether the parameter is point identified.
Set identification (or partial identification) extends the concept of identifiability (“point identification”) to situations where the distribution of the observable variables does not pin down the exact value of a parameter but instead constrains it to lie in a strict subset of the parameter space.
Manski developed a method of worst-case bounds to account for selection bias. Unlike methods that impose additional statistical assumptions, such as the Heckman correction, the worst-case bounds rely only on the data to generate a range of supported parameter values.
Hence, we can bound the ATE: $ATE_{min}$ and $ATE_{max}$ give the lower and upper bound on the ATE. These bounds are wide, but they hold without any assumptions about the treatment assignment.
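A minimal sketch of these worst-case bounds, assuming the outcome is known to lie in $[y_{lo}, y_{hi}]$ (e.g., $[0,1]$ for a binary outcome); the function name and interface are mine:

```python
# Manski worst-case bounds on the ATE for an outcome bounded in [y_lo, y_hi].
# The unobserved potential outcomes are replaced by the best/worst case.
import numpy as np

def manski_bounds(y, d, y_lo, y_hi):
    p = d.mean()                               # Pr(D = 1)
    ey1_obs = y[d == 1].mean()                 # E[Y | D = 1]
    ey0_obs = y[d == 0].mean()                 # E[Y | D = 0]
    # Bounds on E[Y(1)]: Y(1) of the controls set to y_lo / y_hi
    ey1_lo = p * ey1_obs + (1 - p) * y_lo
    ey1_hi = p * ey1_obs + (1 - p) * y_hi
    # Bounds on E[Y(0)]: Y(0) of the treated set to y_lo / y_hi
    ey0_lo = (1 - p) * ey0_obs + p * y_lo
    ey0_hi = (1 - p) * ey0_obs + p * y_hi
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo    # (ATE_min, ATE_max)
```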
What is unconfoundedness?
Rosenbaum and Rubin’s (1983) definition of unconfoundedness:
$$
D \perp Y(0),Y(1)|X
$$
Unconfoundedness requires that, conditional on observed covariates, there are no unobserved factors that are associated with both the assignment and the potential outcomes.
What is meant by overlap (common support)?
Formally, for identification, we also need a second assumption: common support (overlap).
$$
Pr(D=1|X=x) \in(0,1), \ \forall x
$$
That is, for all possible values of the covariates, there are both treated and control units.
What is a propensity score?
Given unconfoundedness, just as we can condition on $X$, we can condition on $e(X)$: an individual’s propensity (score) to be treated given $X$. That is,
$$
e(x) = Pr(D=1|X=x)
$$
The propensity score is a “balancing score”: conditional on $e(X)$, the distribution of the covariates is the same for treated and control units, i.e., $D \perp X \mid e(X)$. We can use the propensity score in different ways.
We typically never observe the true propensity score (i.e., know the assignment mechanism); thus, we usually have to estimate it, $\hat e(X)$.
The propensity score is estimated with e.g., a probit or logit model.
We can use the propensity score for many different things, e.g., weighting or matching.
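A minimal sketch of the estimation step, using a logit model via statsmodels (the helper name is mine):

```python
# Estimate the propensity score e(X) = Pr(D = 1 | X) with a logit model.
import statsmodels.api as sm

def estimate_pscore(X, d):
    Xc = sm.add_constant(X)              # add an intercept
    logit = sm.Logit(d, Xc).fit(disp=0)
    return logit.predict(Xc)             # e_hat(X_i) for each unit
```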
What is the inverse probability weighting (IPW) estimator?
$$
\hat \tau^{IPW}_{ATE} = \frac{1}{N} \sum_{i=1}^N \Big[ \frac{D_i Y_i}{\hat e(X_i)} - \frac{(1-D_i)Y_i}{1-\hat e(X_i)} \Big]
$$
As with propensity score matching, the propensity score is estimated with, e.g., a probit or logit model.
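A direct sketch of this estimator, taking the estimated propensity scores as given (e.g., from the logit sketch above):

```python
# IPW estimate of the ATE. In practice, scores very close to 0 or 1 are
# often trimmed to avoid extreme weights (not done here).
import numpy as np

def ipw_ate(y, d, e_hat):
    return np.mean(d * y / e_hat - (1 - d) * y / (1 - e_hat))
```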
What do we need when we do propensity score weighting (beyond data on the variables themselves)?
In total, this approach requires:
- a correctly specified regression model
- a correctly specified propensity score model
Misspecification in either model may lead to misleading estimates!
The solution is to instead use a “doubly robust” estimator.
What does the procedure look like with propensity score matching?
The process is the following (a code sketch follows the list):
- Use the treatment variable and covariates to estimate the propensity score using a logit or probit model. This gives $\hat e(X_i)$.
- For each treated and untreated unit, find the closest match in terms of $\hat e(X_i)$, that is, the unit with the most similar propensity score.
- Estimate ATE or ATT.
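An illustrative sketch of one-to-one matching with replacement for the ATT (the function name is mine; a full implementation would also handle ties, calipers, and standard errors):

```python
# Propensity score matching for the ATT: logit scores, then for each
# treated unit the control with the closest score.
import numpy as np
import statsmodels.api as sm

def psm_att(y, d, X):
    Xc = sm.add_constant(X)
    e_hat = sm.Logit(d, Xc).fit(disp=0).predict(Xc)
    treated = np.flatnonzero(d == 1)
    controls = np.flatnonzero(d == 0)
    effects = []
    for i in treated:
        j = controls[np.argmin(np.abs(e_hat[controls] - e_hat[i]))]
        effects.append(y[i] - y[j])   # matched control imputes Y_i(0)
    return np.mean(effects)
```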
What can we never do with propensity score matching?
With propensity score matching, we cannot use the bootstrap!
The reason we can’t use bootstrapping is that one individual can be the closest neighbor to many other individuals; not sampling this individual would then make a huge difference to our estimate. We should instead use Abadie and Imbens’s (2006) method or kernel matching with the bootstrap.
Is it bad or good to use propensity score matching?
While Dehejia and Wahba (2002) revisited the LaLonde paper and showed that propensity score matching could fix the bias problem, Smith and Todd (2005) showed that these results were very sensitive to the choice of control variables, etc.
What is the intuition of weighting?
The outcomes of the treated are weighted towards the full population. For instance, if there are fewer females among the treated than in the full population, the outcomes of the treated females are given larger weight to capture the gender distribution in the full population.
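A stylized numerical illustration (the numbers are made up): suppose 50% of the full population is female but only 25% of the treated are. Reweighting the treated towards the population gives

$$
w_{female} = \frac{0.50}{0.25} = 2, \qquad w_{male} = \frac{0.50}{0.75} \approx 0.67,
$$

so the outcomes of treated females count twice as much, restoring the population gender distribution.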
What is the intuition with the Doubly robust estimator?
Remember that for propensity weighting we needed
- a correctly specified regression model and
- a correctly specified propensity score model
The doubly robust approach combines the regression approach and the propensity weighting approach. This only requires that either the regression model or the propensity model is correctly specified. Therefore it is favored by theoretical work and simulations.
That is, it protects against both misspecification of the propensity score and the regression model so that we obtain correct estimates if only one of the two is correct.
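A minimal sketch of one common doubly robust estimator, the AIPW (augmented IPW) estimator, with sklearn used for convenience (the model choices are mine):

```python
# AIPW ("doubly robust") ATE estimate: combines outcome regressions
# m1(X), m0(X) with propensity weighting; consistent if either the
# regression model or the propensity model is correctly specified.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(y, d, X):
    e_hat = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]
    m1 = LinearRegression().fit(X[d == 1], y[d == 1]).predict(X)
    m0 = LinearRegression().fit(X[d == 0], y[d == 0]).predict(X)
    return np.mean(m1 - m0
                   + d * (y - m1) / e_hat
                   - (1 - d) * (y - m0) / (1 - e_hat))
```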
Explain the intuition behind the nearest neighbor matching.
There are different kinds of matching procedures. One is nearest-neighbor matching. This estimator imputes the missing potential outcomes using only the outcomes of a few nearest neighbors from the opposite treatment group. They are “nearest neighbors” in the sense that their $X$ are similar.
Using only a single match leads to the most credible inference with the least bias, at the cost of sacrificing some precision.
For each treated and non-treated unit, we use the outcome of the nearest neighbor to impute the missing potential outcome.
Show the matching estimator
$$
\hat \tau^{Match}_{ATE} = \frac{1}{N} \sum_{i=1}^N \big[ \hat Y_i(1) - \hat Y_i(0) \big]
$$
where the observed potential outcome is used as is and the missing one, e.g., $\hat Y_i(0)$ for a treated unit, is imputed from the nearest neighbor. The match could be in terms of propensity scores.
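A sketch of this estimator with a single nearest neighbor matched on $X$ (Euclidean distance; one could equally match on estimated propensity scores):

```python
# Nearest-neighbor matching estimator of the ATE: for every unit, the
# missing potential outcome is imputed from the closest unit (in X) of
# the opposite treatment group.
import numpy as np

def nn_match_ate(y, d, X):
    y1_hat = np.where(d == 1, y, np.nan)   # Y_i(1): observed for treated
    y0_hat = np.where(d == 0, y, np.nan)   # Y_i(0): observed for controls
    treated = np.flatnonzero(d == 1)
    controls = np.flatnonzero(d == 0)
    for i in range(len(y)):
        pool = controls if d[i] == 1 else treated
        j = pool[np.argmin(np.linalg.norm(X[pool] - X[i], axis=1))]
        if d[i] == 1:
            y0_hat[i] = y[j]               # impute Y_i(0) for treated
        else:
            y1_hat[i] = y[j]               # impute Y_i(1) for controls
    return np.mean(y1_hat - y0_hat)
```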
What are ATT unconfoundedness and conditional mean independence? How are they different?
Original unconfoundedness:
$$
D \perp Y(0),Y(1)|X
$$
ATT unconfoundedness:
$$
D \perp Y(0)|X
$$
Conditional mean independence:
$$
E[Y(0)|X,D] = E[Y(0)|X]
$$
Original unconfoundedness and ATT unconfoundedness are in theory different, but not in practice. The same goes for ATT unconfoundedness and conditional mean independence.
Choosing between the notation of ATT unconfoundedness and conditional mean independence depends on the school of thought: statistics vs. economics.