exam 3 Flashcards
Quadratic population regression model
TestScore = β0 + β1·Income + β2·Income^2 + u
Note that we can test if the linear specification is true against the
alternative that the quadratic specification is true by testing
H0: β2 = 0 vs. H1: β2 ≠ 0
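The test of the linear against the quadratic specification can be sketched in Python as below. This is only an illustration on simulated data: the sample size, the income range, and the "true" coefficients are made-up assumptions, not values from the course. The t-statistic uses heteroskedasticity-robust (HC0) standard errors computed by hand with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (all numbers here are illustrative assumptions)
income = rng.uniform(5, 55, size=n)
u = rng.normal(0, 5, size=n)
test_score = 600 + 4.0 * income - 0.04 * income**2 + u

# Regressors: constant, Income, Income^2
X = np.column_stack([np.ones(n), income, income**2])
beta_hat, *_ = np.linalg.lstsq(X, test_score, rcond=None)
resid = test_score - X @ beta_hat

# Heteroskedasticity-robust (HC0) covariance: (X'X)^-1 X'diag(e^2)X (X'X)^-1
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * resid[:, None] ** 2)
cov = XtX_inv @ meat @ XtX_inv
se = np.sqrt(np.diag(cov))

# t-statistic for H0: beta2 = 0; reject at the 5% level if |t| > 1.96
t_beta2 = beta_hat[2] / se[2]
reject_linear = abs(t_beta2) > 1.96
print(f"beta2_hat = {beta_hat[2]:.4f}, t = {t_beta2:.2f}, reject H0: {reject_linear}")
```

If H0 is rejected, the data favor the quadratic specification over the linear one.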
Relationship b/w Y&X is nonlinear
- Effect on Y of a change in X depends on X (marginal effect of X isn’t constant)
- linear regression is mis-specified: the functional form is wrong
- the estimator of the effect on Y of X is biased
- the solution is to estimate a regression function that is nonlinear in X
Internal validity
the statistical inferences about causal effects are valid for the population being studied
External validity
statistical inferences can be generalized from the population and setting studied to other populations and settings
setting= legal, policy, physical environment
Threats to external validity
Assessing threats to external validity requires detailed substantive knowledge and judgment on a case-by-case basis
How far can we generalize class size results from California?
– Differences in populations
* California in 2011?
* Massachusetts in 2011?
* Mexico in 2011?
– Differences in settings
* different legal requirements (e.g. special education)
* different treatment of bilingual education
– differences in teacher characteristics
Internal validity Threats
SOWES
Sample selection bias
Omitted variable bias
Wrong functional form
Errors-in-variables bias
Simultaneous causality bias
All of these imply that E(ui|X1i,…,Xki) ≠ 0 (or that conditional mean independence fails)
meaning OLS is biased and inconsistent.
Omitted Variable Bias arises if
1. the omitted variable is a determinant of Y, and
2. it is correlated with at least one included regressor
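A small simulation can make the two conditions concrete. The sketch below is an assumption-laden illustration, not course material: `z` is a hypothetical omitted variable that both affects `y` and is correlated with `x`, and all coefficients are invented. Leaving `z` out pushes the estimated slope on `x` away from its true value of 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

z = rng.normal(size=n)                       # omitted variable (hypothetical)
x = 0.8 * z + rng.normal(size=n)             # condition 2: x is correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # condition 1: z is a determinant of y

def ols(y, *cols):
    """OLS coefficients with an intercept; cols are the regressors."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(y, x)[1]      # z omitted: slope on x absorbs part of z's effect
b_long = ols(y, x, z)[1]    # z included: slope on x is close to the true 1.0

print(f"slope with z omitted:  {b_short:.3f}")
print(f"slope with z included: {b_long:.3f}")
```

In large samples the short-regression slope converges to 1 + 2·cov(x, z)/var(x) = 1 + 1.6/1.64 ≈ 1.98 under these assumed parameters, so the bias does not shrink as n grows.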
A control variable W is a variable that is correlated with, and
controls for, an omitted causal factor in the regression of Y
on X, but which itself does not necessarily have a causal effect on Y.
- If the multiple regression includes control variables, we need to ask
whether there are omitted factors that are not adequately controlled for,
- that is, whether the error term is
correlated with the variable of interest even after we have
included the control variables.
What are solutions to omitted variable bias?
- Include omitted causal variable as another regressor
- if you have data on one or more control variables and they are adequate, include the control variables
- use panel data, each entity is observed more than once
- if omitted variable can’t be measured, use instrumental variable regression
- replace the variable that is correlated with the error with one that is not correlated with the error (the idea behind instrumental variables)
- run a randomized controlled experiment
if X is randomly assigned, then X necessarily will be distributed independently of u; thus E(u|X = x) = 0.
Wrong functional form
- if the functional form is incorrect (for example, an interaction term is incorrectly omitted),
then inferences on causal effects will be biased.
Solution
1. Continuous dependent variable: use the “appropriate”
nonlinear specifications in X (logarithms, interactions,
etc.)
2. Discrete (example: binary) dependent variable: need an
extension of multiple regression methods (“probit” or
“logit” analysis for binary dependent variables).
Errors-in-variables bias
So far we have assumed that X is measured without error.
In reality, economic data often have measurement error.
Lessons in classical measurement error
- The amount of bias in beta hat depends on the nature of the measurement error
- If there is pure noise added to Xi (classical measurement error), then beta hat is biased towards 0 (attenuation bias)
- The potential importance of measurement error bias depends
on how the data are collected.
– administrative data (e.g. # teachers in a school) are often quite accurate.
– Survey data on sensitive questions (how much do you earn?)
often have considerable measurement error
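The "biased towards 0" lesson can be checked with a quick simulation. The sketch below assumes classical measurement error, meaning pure noise `w` that is independent of the true regressor and of the regression error; the true slope of 2 and the noise variance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

x_true = rng.normal(size=n)                  # correctly measured regressor
y = 2.0 * x_true + rng.normal(size=n)        # true slope is 2 (assumed)
w = rng.normal(size=n)                       # pure noise, independent of x_true
x_obs = x_true + w                           # what the researcher observes

def slope(y, x):
    """OLS slope from a regression of y on x with an intercept."""
    X = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_clean = slope(y, x_true)   # close to 2
b_noisy = slope(y, x_obs)    # pulled towards 0

print(f"slope using true X:     {b_clean:.3f}")
print(f"slope using measured X: {b_noisy:.3f}")
```

With equal variances for the true X and the noise, the large-sample slope is 2 · var(X)/(var(X) + var(w)) = 1, half the true effect.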
Solutions to errors-in-variables bias
- Obtain better data
- Develop a specific model of measurement error process
- instrumental variables regression
Missing data + sample selection bias
1. Data are missing at random.
2. Data are missing based on the value of one or more X’s
3. Data are missing based in part on the value of Y or u
Cases 1 and 2 don’t introduce bias: the standard errors are larger than they would be if the data weren’t missing, but β̂ is
unbiased.
Case 3 introduces “sample selection” bias.
Case 1: data are missing at random
Suppose you took a simple random sample of 100 workers, and your dog ate 20 of the response sheets (selected
at random) before you could enter them into the computer.
- This is equivalent to your having taken
a simple random sample of 80 workers, so your dog didn’t introduce any bias.
Case 2: data are missing based on the value of one of the X’s
Suppose you restrict your analysis to the subset of school districts with STR < 20.
By only considering districts with small class sizes you won’t be able to say anything about districts with large class sizes, but focusing on just the small-class districts doesn’t
introduce bias.
This is equivalent to having missing data,
where the data are missing if STR ≥ 20. More generally, if data are missing based only on values of X’s, the fact that
data are missing doesn’t bias the OLS estimator.
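The three missing-data cases can be compared side by side in a simulation. This is a sketch under invented assumptions: the model y = 1 + 2x + u, the 80% retention rate, and the cutoffs used to drop observations are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true slope is 2 (assumed)

def slope(y, x):
    """OLS slope from a regression of y on x with an intercept."""
    X = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

keep_random = rng.random(n) < 0.8   # Case 1: missing at random
keep_on_x = x < 0.5                 # Case 2: missing based on the value of X
keep_on_y = y > 1.0                 # Case 3: missing based on the value of Y

b_case1 = slope(y[keep_random], x[keep_random])
b_case2 = slope(y[keep_on_x], x[keep_on_x])
b_case3 = slope(y[keep_on_y], x[keep_on_y])

print(f"Case 1 (random):       {b_case1:.3f}")
print(f"Case 2 (based on X):   {b_case2:.3f}")
print(f"Case 3 (based on Y):   {b_case3:.3f}")
```

Cases 1 and 2 should recover a slope near 2, while selecting on Y (Case 3) makes the retained errors correlated with X and pulls the slope well below 2, which is the sample selection bias described above.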