Lecture 20 Flashcards
Internal vs external validity
Internal - does the design identify a causal effect?
External - can we generalise this effect to other settings?
Internal validity
Is the causal effect you’re estimating actually valid for this sample? The main threats are:
1. OVB
2. Incorrect model functional form
3. Measurement error
4. Sample selection issues
5. Simultaneity
Each of these violates exogeneity (E[ui|xi] = 0 fails), so OLS is biased and inconsistent
What is OVB?
OVB arises when you leave out a variable that is:
1. A determinant of the outcome variable y
2. Correlated with the regressor of interest x, so exogeneity fails
How do controls help OVB?
If you include controls wi that soak up the influence of the OV, then the key condition becomes:
- E[ui|xi,wi] = E[ui|wi]
- so B1 is still estimated consistently as long as the OV is uncorrelated with xi conditional on wi
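A compact restatement of this condition in regression form (the remark that the coefficient on the control need not be causal is the standard caveat, added here for completeness):

```latex
\[
  y_i = \beta_0 + \beta_1 x_i + \beta_2 w_i + u_i,
  \qquad
  \mathbb{E}[u_i \mid x_i, w_i] = \mathbb{E}[u_i \mid w_i].
\]
% Under conditional mean independence, \hat{\beta}_1 is consistent for the
% causal effect of x_i, while \hat{\beta}_2 need not have a causal interpretation.
```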
OVB in different models
OLS - classical OVB setup, fails if OV affects y and is correlated with x
IV - the OV must not be correlated with the instrument; if it is, the instrument is invalid
Panel data - fixed effects control for unobserved heterogeneity, but if the OV varies within the fixed-effects unit (e.g. over time), OVB remains
Sign of the OVB
The direction of the bias depends on:
- the effect of the OV on y
- the correlation between the OV and x
Bias = (effect of OV on y) x (correlation between OV and x), so its sign is the product of the two signs, as the simulation below illustrates
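A minimal simulation of this sign rule (the variable names, coefficients, and correlation are invented purely for illustration; numpy assumed available): the omitted variable z raises y and is positively correlated with x, so the short regression overstates the true slope of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Omitted variable z: a positive determinant of y, positively correlated with x
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)                    # corr(x, z) > 0
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)    # true slope on x is 2

def ols(X, y):
    """OLS coefficients for a design matrix X that already includes an intercept column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_short = ols(np.column_stack([ones, x]), y)        # omits z -> positive bias
b_long = ols(np.column_stack([ones, x, z]), y)      # includes z -> consistent

print(f"short regression slope (biased upward): {b_short[1]:.2f}")
print(f"long regression slope (close to 2):     {b_long[1]:.2f}")
```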
Solutions to OVB
- Include the OV if observable
- Use controls if they proxy well for the OV
- Use panel data with fixed effects to eliminate unobserved time-invariant factors
- Use IVs if the OV can’t be measured, but you still have a valid instrument
- Run an experiment - randomisation breaks correlation between regressors and OVs
Functional form misspecification
When the regression model does not correctly capture the true relationship between the variables
- the estimated coefficients are biased
Solutions to functional form misspecification
- Use appropriate nonlinear transformations (logs, polynomials, interactions)
- Let the data guide you (plot the data and residuals, compare specifications)
- Use shrinkage methods like ridge/LASSO when there are many candidate regressors
- Model binary/censored outcomes correctly: logit/probit for binary outcomes, Tobit for censored ones
- Accept some uncertainty
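A small sketch of the first two bullets (the quadratic data-generating process is invented for the example; statsmodels assumed available): a straight-line fit misses the curvature, while adding an x² term recovers the true coefficients.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000

x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(size=n)   # true relationship is quadratic

# Misspecified model: linear in x only
linear = sm.OLS(y, sm.add_constant(x)).fit()

# Correct functional form: include the quadratic term
X_quad = sm.add_constant(np.column_stack([x, x**2]))
quadratic = sm.OLS(y, X_quad).fit()

print("linear fit R^2:   ", round(linear.rsquared, 3))
print("quadratic fit R^2:", round(quadratic.rsquared, 3))
print("quadratic coefficients:", quadratic.params.round(2))   # roughly [1.0, 0.5, 0.3]
```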
Errors-in-variables bias
- measurement error in a regressor makes it correlated with the error term, so OLS is biased
- classical measurement error (noise uncorrelated with the true value and with ui) causes attenuation bias: the estimated slope is biased towards zero
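The standard classical errors-in-variables result behind “biased towards zero”, writing the observed regressor as the true value plus noise:

```latex
\[
  \tilde{x}_i = x_i + w_i, \qquad
  \operatorname{plim}\hat{\beta}_1
  = \beta_1 \, \frac{\sigma_x^2}{\sigma_x^2 + \sigma_w^2},
\]
% where w_i is classical measurement error (uncorrelated with x_i and u_i),
% so the probability limit is smaller than \beta_1 in absolute value.
```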
Best-guess measurement error
A special case of measurement error where bias does not occur
- imagine a person doesn’t remember their income xi, but they make a best guess based on their education wi
- the reported value is xi^ = E[xi|wi]; the regression error then becomes vi = ui + B1(xi - xi^), and under two key assumptions:
Cov(xi^, xi - xi^) = 0
Cov(xi^, ui) = 0
we get Cov(xi^, vi) = 0, so the OLS coefficient on xi^ is consistent
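A short derivation of why the coefficient survives in the best-guess case, substituting the reported value into the regression:

```latex
\[
  y_i = \beta_0 + \beta_1 \hat{x}_i + v_i,
  \qquad v_i = u_i + \beta_1 (x_i - \hat{x}_i),
\]
\[
  \operatorname{Cov}(\hat{x}_i, v_i)
  = \operatorname{Cov}(\hat{x}_i, u_i)
  + \beta_1 \operatorname{Cov}(\hat{x}_i, x_i - \hat{x}_i)
  = 0,
\]
% so OLS of y on the best-guess regressor \hat{x}_i is consistent for \beta_1.
```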
Solutions to measurement error
- Get better data
- Model the measurement error process, if you know something about the source of error
- Use IVs, good if instruments are valid
- Errors in y? If uncorrelated with x, they do not bias B1^, only increase variance
Sample selection bias
3 types of missing data:
- missing at random - no bias
- missing based on x - no bias in a linear model, variation in x is reduced, which increases SEs and may reduce external validity
- missing based on y or u - does cause bias; this is true sample selection bias
Benign missing data
- Data missing at random: if you took a random sample of 100 students but lost 20 of the records at random (say, in the wind), that is equivalent to taking a random sample of 80 students, so there is no bias
- Data missing based on the value of one of the x’s: suppose you study the effect of the student-teacher ratio (STR) on test scores but restrict attention to districts with STR < 20; in a linear model, focusing on a subset of the x range does not cause bias, it only reduces the variation in x, which increases standard errors and may reduce external validity
Data missing on y or u
When selection into the sample depends on the outcome or on unobserved factors, OLS estimates become biased (see the simulation below)
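A simulation contrasting the three missing-data cases (cutoffs and coefficients invented for illustration; numpy assumed available): dropping observations at random or based on x leaves the slope near its true value of 2, while dropping observations based on y pulls it away.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u                    # true slope = 2

def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

keep_random = rng.random(n) < 0.5        # missing completely at random
keep_on_x = x < 1.0                      # missing based on x
keep_on_y = y < 1.0                      # missing based on y: sample selection

print("full sample:       ", round(slope(x, y), 2))
print("missing at random: ", round(slope(x[keep_random], y[keep_random]), 2))
print("missing based on x:", round(slope(x[keep_on_x], y[keep_on_x]), 2))
print("missing based on y:", round(slope(x[keep_on_y], y[keep_on_y]), 2))   # biased
```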
Truncation and incidental truncation
Truncation: observations are included only if the outcome meets a cutoff (e.g. yi < ci); handle it by writing down the likelihood conditional on selection and estimating by MLE
Incidental truncation: y is observed only for a non-random subset selected on another variable; handle it with the Heckman two-step selection model (sketched below)
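A sketch of the Heckman two-step procedure on simulated incidentally truncated data (the wage/education setup, the instrument, and all coefficients are invented for the example; statsmodels and scipy assumed available):

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50_000

# Simulated incidental truncation: wages are observed only for people who work,
# and the selection error is correlated with the wage error.
educ = rng.normal(size=n)
z = rng.normal(size=n)                                # instrument: affects selection only
errors = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)
e_sel, e_wage = errors[:, 0], errors[:, 1]

works = (0.5 * educ + 1.0 * z + e_sel > 0).astype(int)    # selection equation
wage = 1.0 + 2.0 * educ + e_wage                          # outcome equation, true slope = 2

# Step 1: probit of the selection indicator on educ and the excluded instrument z
Z = sm.add_constant(np.column_stack([educ, z]))
probit = sm.Probit(works, Z).fit(disp=False)
index = Z @ probit.params
imr = norm.pdf(index) / norm.cdf(index)               # inverse Mills ratio

# Step 2: OLS of wage on educ plus the inverse Mills ratio, selected sample only
# (the point estimate is consistent; the second-stage standard errors need correction)
sel = works == 1
X = sm.add_constant(np.column_stack([educ[sel], imr[sel]]))
heckman = sm.OLS(wage[sel], X).fit()

naive = sm.OLS(wage[sel], sm.add_constant(educ[sel])).fit()
print("naive OLS slope on the selected sample (biased):", round(naive.params[1], 2))
print("Heckman two-step slope (close to 2):            ", round(heckman.params[1], 2))
```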
Solutions to sample selection bias
- Design better sampling
- Run randomised experiments to avoid selection altogether
- Model the selection process explicitly
Simultaneous causality bias
We usually assume that x causes y, but sometimes y also causes x
- reverse causality makes xi correlated with ui, so the exogeneity assumption fails and B1^ is biased
Solutions to SCB
- Run a randomised controlled experiment
- Model both directions of causality
- Use IVs
A note on using IV regression
IV solves 3 major problems:
1. OVB
2. Measurement Error
3. Simultaneous Causality
BUT:
- adds its own challenges: the instrument must be exogenous and relevant
- if exogeneity fails, the IV estimate is still biased; if relevance is weak, weak-instrument bias can make IV worse than OLS (see the 2SLS sketch below)
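A minimal two-stage least squares sketch matching the three problems above (instrument and coefficients invented; the manual second stage reproduces the 2SLS point estimate, but its reported standard errors are not the correct 2SLS ones; statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100_000

z = rng.normal(size=n)                       # instrument: relevant and exogenous by construction
u = rng.normal(size=n)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)   # x is endogenous: correlated with u
y = 1.0 + 2.0 * x + u                        # true slope = 2

# OLS is biased because Cov(x, u) != 0
ols = sm.OLS(y, sm.add_constant(x)).fit()

# Stage 1: regress x on the instrument; the first-stage F statistic checks relevance
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues

# Stage 2: regress y on the fitted values from stage 1
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()

print("OLS slope (biased upward):", round(ols.params[1], 2))
print("2SLS slope (close to 2):  ", round(stage2.params[1], 2))
print("first-stage F (rule of thumb: want > 10):", round(stage1.fvalue, 1))
```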
Threats to internal validity of experiments
- Failure to randomise
- Failure to follow treatment protocol
- Attrition, people drop out in a way which is related to their potential outcomes
- Experimental effects: experimenter bias (researchers treat groups differently), Hawthorne effects (subjects change behaviour just from being studied)
Threats to internal validity for quasi experiments
- Failure to randomise
- Failure to follow treatment protocol
- Attrition
- Experimental effects - generally not a concern, since subjects in a quasi-experiment are usually unaware they are being studied
- Instrument invalidity: the quasi-random (“as-if random”) variation used as an instrument must be exogenous and relevant
Two main dimensions of external validity
- Different populations
- Different settings
External validity is more contextual than internal validity: it is not about statistical assumptions but about how similar other populations and settings are to the one studied.
Threats to external validity in experiments
- Non-representative sample, participants in your study might not reflect the broader population
- Non-representative treatment, way a treatment is implemented in an experiment may be too artificial or costly for real-world application
- General equilibrium effects, effect of a program can change when it’s scaled up
Quasi-experiments often improve external validity because they study real-world programs