Clustering and bootstrap Flashcards
What is meant by iid and why is it important?
The most important assumption we make in cross-sectional inference is that observations are iid (independent and identically distributed): each observation $i$ is treated as a random draw from the same population, and is therefore independent of the other observations. However, in most cases the iid assumption is too restrictive, as there is usually some form of dependence in the data.
What happens if we assume iid when we actually have correlations within clusters?
If we assume that observations are iid but in reality they are not, we get wrong standard errors (usually too low). This inflates the t-statistics.
Thus, we will tend to reject the null too often, i.e., make type I errors.
Dependence often arises when we have grouped data: our sampled individuals belong to some group that makes them similar, so their unobservables are likely correlated and the observations are not iid. Examples include:
- Wages for workers in the same firm
- Observations for the same individual over time (panel data)
This creates a clustering problem.
How do cluster-robust standard errors change our SEs and point estimates?
Using heteroskedasticity-robust or cluster-robust SEs does not change the point estimates, only the SEs and hence the significance.
When does the cluster-robust variance (SE) become higher than the heteroskedasticity-robust variance (SE)?
When we have correlations within clusters. The gap between the two grows (see the simulation sketch after this list):
- the higher the correlation we have in each cluster
- the more observations we have in the same cluster
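A minimal simulation sketch (the data-generating process, cluster count, and variable names are illustrative assumptions, not from the card) comparing heteroskedasticity-robust and cluster-robust SEs when both the regressor and the error are correlated within clusters:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
G, n_g, rho = 50, 30, 0.5                          # clusters, cluster size, intra-cluster error share

groups = np.repeat(np.arange(G), n_g)
x = np.repeat(rng.normal(size=G), n_g)             # regressor constant within cluster
u = np.sqrt(rho) * np.repeat(rng.normal(size=G), n_g) \
    + np.sqrt(1 - rho) * rng.normal(size=G * n_g)  # cluster effect + idiosyncratic noise
y = 1.0 + 0.0 * x + u                              # true slope is zero

X = sm.add_constant(x)
fit_hc = sm.OLS(y, X).fit(cov_type="HC1")
fit_clu = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})
print("HC1 SE:      ", fit_hc.bse[1])
print("clustered SE:", fit_clu.bse[1])             # several times larger here
```

With these settings the clustered SE comes out several times larger than the HC1 SE; shrinking `rho` or `n_g` shrinks the gap.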
Show the cluster robust variance formula
$$
\widehat{Var}_{clu}[\hat\beta] = (\mathbf{X}'\mathbf{X})^{-1}\,\hat{\mathbf{B}}_{clu}\,(\mathbf{X}'\mathbf{X})^{-1}
$$
where
$$
\hat{\mathbf{B}}_{clu} = \sum_{g=1}^{G}\mathbf{X}_g'\,\hat{\mathbf{u}}_g\hat{\mathbf{u}}_g'\,\mathbf{X}_g
$$
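A hand-rolled sketch of this sandwich formula (no small-sample correction; `X`, `y`, and `groups` are assumed inputs with one group label per observation):

```python
import numpy as np

def cluster_robust_cov(X, y, groups):
    """(X'X)^{-1} [ sum_g X_g' u_g u_g' X_g ] (X'X)^{-1} using OLS residuals u."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                 # OLS point estimates
    u = y - X @ beta                         # residuals
    B = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        Xg, ug = X[groups == g], u[groups == g]
        s = Xg.T @ ug                        # k-vector X_g' u_g
        B += np.outer(s, s)                  # X_g' u_g u_g' X_g
    return XtX_inv @ B @ XtX_inv             # clustered SEs: sqrt of the diagonal
```

In practice one would normally rely on a packaged implementation (e.g., statsmodels' `cov_type="cluster"`), which also applies finite-sample corrections.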
When does asymptotic theory apply with cluster-robust errors?
For the asymptotic theory to apply, we need the number of clusters G to be large.
Explain the Moulton factor?
The Moulton factor tells us how much we over-estimate precision by ignoring intra-class correlation. This is given by
$$
\frac{\widehat{Var}_{clu}[\hat\beta]}{\widehat {Var}[\hat \beta]}
$$
If we get, e.g., 7 as a result, we should take $\widehat{SE} \times \sqrt 7$ to get the correct standard error (the factor is a ratio of variances, so the SE is scaled by its square root).
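A small sketch of this correction, reusing `y`, `X`, and `groups` from the simulation sketch above: the ratio of the clustered to the default (iid) variance is the Moulton-type factor, and scaling the default SE by its square root reproduces the clustered SE.

```python
import numpy as np
import statsmodels.api as sm

fit_iid = sm.OLS(y, X).fit()                            # default (iid) SEs
fit_clu = sm.OLS(y, X).fit(cov_type="cluster",
                           cov_kwds={"groups": groups})
ratio = fit_clu.bse[1] ** 2 / fit_iid.bse[1] ** 2       # Moulton-type factor
print(fit_iid.bse[1] * np.sqrt(ratio), fit_clu.bse[1])  # the two numbers coincide
```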
What to cluster over? When should and shouldn’t we use it?
One should cluster at the level where one thinks both regressors and errors might be correlated within cluster. However, we also need to keep in mind that we only approach the true variance as $G \to \infty$.
Hence, if we define very large clusters, there will be few clusters to average over, and the resulting estimated clustered variance will be a poor estimate of the true variance.
As a general rule, cluster at progressively higher (broader) levels and see if the s.e. change significantly. Be conservative and show the largest s.e. However, we need a clear argument for why we choose to cluster at the level we do.
If clustering at a higher level does not yield significantly larger errors, we should use the errors from the lower level.
→ The bottom line is: use the cluster estimator if you can.
How should one think about clustering and fixed effects?
If you include cluster-specific fixed effects $\alpha_g$ and you believe they capture all the cluster correlation, you may not need to cluster. But often some within-cluster correlation remains, and the errors may still be heteroskedastic.
→ The bottom line is: use the cluster estimator if you can.
What is multiway clustering and what do we need to think about?
Imagine you have a state-year panel of $N$ individuals and you worry about clustering across states and over time. A solution is to use a two-way clustering estimator.
In this example, we need both the number of states $S$ and the number of time periods $T$ to be large.
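A minimal sketch of one standard two-way estimator (cluster by state, cluster by year, then subtract the covariance clustered on the state-year intersection), reusing `cluster_robust_cov` from the sketch above; `state` and `year` are assumed per-observation label arrays:

```python
import numpy as np

def two_way_cluster_cov(X, y, state, year):
    # intersection groups: one label per state-year cell
    pair = np.array([f"{s}_{t}" for s, t in zip(state, year)])
    return (cluster_robust_cov(X, y, state)
            + cluster_robust_cov(X, y, year)
            - cluster_robust_cov(X, y, pair))   # two-way SEs: sqrt of the diagonal
```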
What is the issue if we have too few clusters?
When $G$ is small, the CRVE underestimates the true variance. Then the t-statistic is too large and the t-test over-rejects the null. So clustering at a higher level could in fact decrease our standard errors compared to clustering at a lower level.
The solution is to cluster at the higher, correct level, but use the bootstrap.
What is bootstrap and how does it work?
The idea behind the bootstrap is to consider the sample we have as if it were the population of interest. Instead of drawing more samples of size $N$ from the population distribution, the bootstrap draws with replacement from the original sample itself, using the empirical c.d.f. of the data as if it were the population distribution. For each bootstrap "sample" $s$ we will obtain a different estimate of the parameter of interest, $\hat \theta_s$. After several resamples, we get a distribution of $\hat \theta$.
How does bootstrap work?
In bootstrap sampling, we draw samples with replacement. In each bootstrap sample, some original data points appear more than once while others do not appear at all. For each sample we obtain an estimate of our parameter of interest. We then estimate the variance of our parameter as the empirical variance of the sample estimates.
What is the bootstrapped variance formula?
$$
\widehat{Var}_{boot}[\hat\theta] = \frac{1}{S-1}\sum_{s=1}^{S}\left(\hat\theta_s - \bar{\hat\theta}\right)^2, \qquad \bar{\hat\theta} = \frac{1}{S}\sum_{s=1}^{S}\hat\theta_s
$$
i.e., the empirical variance of the $S$ bootstrap estimates.
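A minimal nonparametric bootstrap sketch (the statistic, sample size, and $S = 999$ replications are illustrative choices): resample with replacement, re-estimate each time, and take the empirical variance of the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(size=200)       # stand-in for the original sample
S = 999

theta_s = np.empty(S)
for s in range(S):
    resample = rng.choice(data, size=data.size, replace=True)  # draw with replacement
    theta_s[s] = resample.mean()       # re-estimate the parameter on each resample

se_boot = np.sqrt(theta_s.var(ddof=1))                 # bootstrapped standard error
print(se_boot, data.std(ddof=1) / np.sqrt(data.size))  # close to the analytic SE of the mean
```

For a regression coefficient the loop would refit the model on each resample, and with clustered data one resamples whole clusters, since what we resample from must be iid (see the last card).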
What do we need for bootstrap to work?
What we sample from needs to be iid!