QE 3/4 - regression Flashcards
Causes of residuals (gaps between observed values and what the model predicts):
1. Measurement error
2. Specification error
Properties of the CEF residual
- E[e] = 0
- E[e*g(X)] = 0
- E[e given X] = 0
Properties of the LRM residual
- E[u] = 0
- E[uX] = 0
- E[u given X] not necessarily 0
Why might the LRM residual (u) not be mean-independent of X (unlike CEF residual)?
1. LRM is a linear model, but the CEF may be non-linear/wiggly
2. Therefore, the expected value of the error at different values of X is not necessarily the same for the LRM
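A minimal numpy sketch of this point (the quadratic CEF and all numbers are invented for illustration): the CEF residual averages to roughly zero within every slice of X, while the linear-model residual does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-2, 2, n)
y = x**2 + rng.normal(0, 1, n)      # true CEF: E[Y|X] = X^2 (non-linear)

e = y - x**2                        # CEF residual: e = Y - E[Y|X]

# LRM residual: u = Y - (b0 + b1*X) from an OLS fit of the linear model
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta

# E[e|X] is ~0 everywhere; E[u|X] varies with X because the CEF is curved
for lo, hi in [(-2.0, -1.0), (-1.0, 1.0), (1.0, 2.0)]:
    m = (x >= lo) & (x < hi)
    print(f"X in [{lo}, {hi}): mean e = {e[m].mean():+.3f}, mean u = {u[m].mean():+.3f}")
```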
What is the interpretation of the p-value?
Probability of obtaining an estimate at least as extreme as the one computed from the data, assuming the null hypothesis is true
What does high R-squared tell you? What does it not tell you?
1a. R-squared close to 1 means the explanatory variables are good at fitting the data (Y)
1b. Provides estimate of strength of relationship between model and data
2a. Does not mean the model will be good at extrapolating out of sample
2b. Does not say anything about causality
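A quick sketch of point 2a, using a made-up exponential relationship: the linear fit earns a high R-squared in sample but extrapolates badly out of sample.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
y = np.exp(2 * x) + rng.normal(0, 0.1, 500)   # true relationship is non-linear

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()
print(f"in-sample R^2: {r2:.3f}")             # high: the fit looks excellent on [0, 1]

# Out of sample the linear fit collapses
x_new = 3.0
print("linear prediction at x=3:", beta[0] + beta[1] * x_new)
print("true E[Y|X=3]:           ", np.exp(2 * x_new))   # ~403, far above the line
```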
Threats to internal validity
- Contamination – people in control group access treatment anyway
- Non-compliance – individuals offered treatment refuse to take it
- Hawthorne effect – participants alter behaviour due to participating in experiment/study
- Placebo effect – the belief that one is being treated affects outcomes, independent of the treatment itself
What is the stable unit treatment value assumption (SUTVA)?
- Experimental ideal works only if there are no interaction effects between subjects
- i.e. each subject’s outcome depends only on their own treatment, not those of others
What is the conditional independence assumption?
Treatment assignment is independent of potential outcomes, conditional on covariates
Explain how the conditional independence assumption plausibly allows identification of causal effects
- Run regressions that include causal variable of interest + co-variates
- Co-variates assumed to ‘control for’ non-random variation in treatment assignment
- Variation left over (Frisch-Waugh-Lovell theorem) plausibly independent of potential outcomes
- If credible, treatment assignment conditionally independent of potential outcomes and can therefore measure causal effects
Explain how the Frisch-Waugh-Lovell theorem works (verbally)
- Find independent variation in X, not explained by other regressors
- Find independent variation in Y, not explained by other regressors
- Find independent variation in Y (not explained by other regressors) that is explained by independent variation in X
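A minimal numpy sketch of FWL (all data simulated for illustration): the coefficient on x from the full multiple regression equals the coefficient from regressing the residualised y on the residualised x.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
w = rng.normal(size=(n, 2))                          # the "other regressors"
x = w @ np.array([1.0, -0.5]) + rng.normal(size=n)   # regressor of interest
y = 2.0 * x + w @ np.array([0.3, 0.7]) + rng.normal(size=n)

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Coefficient on x from the full multiple regression
beta_full = ols(np.column_stack([np.ones(n), x, w]), y)[1]

# FWL: residualise x and y on the other regressors (plus constant),
# then regress the y-residuals on the x-residuals
Z = np.column_stack([np.ones(n), w])
x_tilde = x - Z @ ols(Z, x)
y_tilde = y - Z @ ols(Z, y)
beta_fwl = (x_tilde @ y_tilde) / (x_tilde @ x_tilde)

print(beta_full, beta_fwl)   # identical up to floating-point error
```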
1. Explain the least squares assumption that E[u given X] = 0
2. What is this assumption equivalent to?
3. What does this assumption imply?
1a. ‘Other factors’ within residual (u) not systematically related to X (i.e. given value of X, mean of distribution of u = 0)
1b. Sometimes these other factors within residual lead Y to be higher/lower than predicted, but on average 0
2. Equivalent to assuming that the population regression line = conditional mean of Y given X
3. Implies that X and u are uncorrelated
- Why does higher variance in X lead to lower variance in the slope coefficients of the regression model?
- Intuition?
- Greater variation in X means we can obtain a more precise estimate of the slope coefficient
- Intuition - if all the data are bunched around the mean, it is hard to fit a line; easier if there’s more variation in X
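In symbols, the standard homoskedasticity-only large-sample approximation makes the same point: a larger spread in X (bigger sigma_X^2) shrinks the sampling variance of the slope.

```latex
% Homoskedasticity-only large-sample variance of the OLS slope:
% more spread in X (larger \sigma_X^2) means a smaller sampling variance.
\operatorname{var}\!\left(\hat{\beta}_1\right) \;\approx\; \frac{\sigma_u^2}{n\,\sigma_X^2}
```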
What is perfect multi-collinearity?
When 1 regressor = perfect linear combination of the other regressors
Mathematically, why does perfect multi-collinearity make it impossible to calculate the OLS estimator?
Division by 0 in OLS formulas
Intuition behind why perfect multi-collinearity is a problem
- Asks illogical question
- In multiple regression, coefficients = effect of change in that regressor, holding other regressors constant
- But if 1 regressor is a perfect linear combination of others, then asking the effect of change in that regressor, holding itself constant…
Example of perfect multi-collinearity
- Fraction of English learners and % of English learners (the % variable is exactly 100 × the fraction, so one is a perfect linear function of the other)
- If one regressor is same for all observations, then it is a perfect linear combination of the intercept term (if there is an intercept)
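A small numpy illustration of the fraction-vs-% example: when one regressor is an exact linear function of another, X'X loses rank, so the inverse in the OLS formula does not exist.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)      # e.g. fraction of English learners
x2 = 100 * x1                # % of English learners: perfect linear function of x1
X = np.column_stack([np.ones(n), x1, x2])

# X'X is not full rank, so (X'X)^{-1} in the OLS formula does not exist
print("rank of X'X:", np.linalg.matrix_rank(X.T @ X))   # 2, not 3
print("condition number:", np.linalg.cond(X.T @ X))     # effectively infinite
```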
1. How to avoid the dummy variable trap?
2. What is the interpretation of the binary variables included?
1. Exclude one of the binary variables from the regression
2. They represent the incremental effect, relative to the base case of the omitted category
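A numpy sketch of the trap and the fix (group labels and means are made up): with an intercept plus all three dummies the design matrix is rank-deficient; dropping one dummy restores full rank, and the remaining coefficients are increments over the omitted base case.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
group = rng.integers(0, 3, n)               # three categories
D = np.eye(3)[group]                        # full set of dummies (one-hot)
y = np.array([1.0, 2.5, 4.0])[group] + rng.normal(0, 0.5, n)

# Trap: intercept + all three dummies (the dummies sum to the intercept column)
X_trap = np.column_stack([np.ones(n), D])
print("rank:", np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1])   # 3 of 4

# Fix: drop the first dummy; its category becomes the base case
X_ok = np.column_stack([np.ones(n), D[:, 1:]])
beta, *_ = np.linalg.lstsq(X_ok, y, rcond=None)
print(beta)   # ~[1.0, 1.5, 3.0]: base-case mean, then increments relative to it
```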
What is imperfect multi-collinearity?
Two or more regressors are highly correlated (in the sense that some linear combination of the regressors is highly correlated with another regressor)
What is the implication of imperfect multi-collinearity?
- Coefficients on at least one regressor imprecisely estimated
- Difficult to estimate precisely one or more of the partial effects
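A Monte Carlo sketch of this precision loss (the correlation of 0.98 and all other numbers are illustrative): the sampling spread of the slope estimate grows roughly five-fold when the two regressors are highly correlated.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 2_000

def sd_of_slope(rho):
    """Monte Carlo sd of the OLS coefficient on x1 when corr(x1, x2) = rho."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1 + x1 + x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(estimates)

print("sd of slope, rho = 0.00:", sd_of_slope(0.0))
print("sd of slope, rho = 0.98:", sd_of_slope(0.98))   # roughly 5x larger
```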
What is the F-statistic used for?
To test joint hypotheses about regression coefficients
What question is addressed by the use of the F-statistic?
- Does relaxing the q restrictions that constitute the null hypothesis improve the fit of the regression by enough that the improvement is unlikely to be the result of mere random sampling variation (if H0 is true)?
What should large F-statistic be associated with?
Significant increase in R-squared
When is the null rejected in an F-test?
If the SSR of the unrestricted regression is sufficiently smaller than that of the restricted regression, the test rejects the null hypothesis
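A numpy sketch of the homoskedasticity-only version of this comparison, F = [(SSR_r − SSR_u)/q] / [SSR_u/(n − k − 1)], on simulated data where the null is false:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x1, x2 = rng.normal(size=(2, n))
y = 1 + 0.5 * x1 + rng.normal(size=n)   # H0 below restricts both slopes to 0

def ssr(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X_u = np.column_stack([np.ones(n), x1, x2])   # unrestricted
X_r = np.ones((n, 1))                         # restricted under H0: b1 = b2 = 0
q, k = 2, 2                                   # restrictions; regressors in unrestricted model
F = ((ssr(X_r, y) - ssr(X_u, y)) / q) / (ssr(X_u, y) / (n - k - 1))
print(f"F = {F:.1f}")   # large: relaxing the restrictions cuts the SSR a lot
```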
What are control variables used for?
- Control for causal effect of a variable (if any)
- Control for omitted factors that affect Y and are correlated with X
- Increase precision of estimates (if control variable not correlated with regressor of interest but correlated with outcome, then standard errors of estimators reduced)
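A Monte Carlo sketch of the third point (all numbers illustrative): adding a control w that is uncorrelated with x but moves y shrinks the sampling spread of the coefficient on x.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 2_000

def sd_of_slope(include_control):
    estimates = []
    for _ in range(reps):
        x = rng.normal(size=n)          # regressor of interest
        w = rng.normal(size=n)          # uncorrelated with x, but affects y
        y = 1 + 0.5 * x + 2.0 * w + rng.normal(size=n)
        cols = [np.ones(n), x, w] if include_control else [np.ones(n), x]
        estimates.append(np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0][1])
    return np.std(estimates)

print("sd without control:", sd_of_slope(False))   # larger: w's effect sits in the residual
print("sd with control:   ", sd_of_slope(True))
```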
Steps to decide whether/not to include a variable in regression
(1) identify coefficient of interest
(2) assess a priori (before reviewing data) most important likely sources of omitted variable bias
(3) test whether additional ‘control’ variables (identified in step 2) statistically significant/if estimated coefficient of interest changes measurably when controls added (if so, keep them; if not, remove them)
(4) fully disclose all regressions to allow others to judge for themselves
Problem with including a variable where it doesn’t belong (i.e. population regression coefficient = 0)?
Reduces the precision of the estimators of the other regression coefficients
Consequences of simultaneous causality?
- OLS estimator biased and inconsistent (because it picks up both effects)
- Correlation between regressor and error term
1. When does compositional bias arise?
2. Explain what it means using an example
1. If you add a control variable that is an outcome of the regressor of interest, you get compositional bias
2a. Example - control for occupation when regressing earnings on schooling
2b. Effect of schooling on earnings, controlling for type of job you do, will seem small
2c. Problem is that schooling affects your occupation and influences earnings partly via your occupation
2d. Can’t assess impact of schooling on earnings by looking within occupations
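A simulation sketch of the schooling example (the mechanism and all coefficients are invented for illustration): controlling for occupation shrinks the schooling coefficient toward the direct effect only, understating the total effect of schooling.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
school = rng.normal(size=n)
# Hypothetical mechanism: schooling raises the chance of a high-paying occupation
occupation = (school + rng.normal(size=n) > 0).astype(float)
earnings = 1.0 * school + 2.0 * occupation + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_total = np.column_stack([np.ones(n), school])
X_bad = np.column_stack([np.ones(n), school, occupation])
print("schooling coef, no occupation control:  ", ols(X_total, earnings)[1])  # total effect
print("schooling coef, controlling occupation: ", ols(X_bad, earnings)[1])    # direct effect only
```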