QE 3/4 - regression Flashcards

1
Q

Causes of residuals (errors between observed and predicted values):

A
  1. Measurement error
  2. Specification error

2
Q

Properties of the CEF residual

A
  1. E[e] = 0
  2. E[e·g(X)] = 0 for any function g(X)
  3. E[e | X] = 0
3
Q

Properties of the LRM residual

A
  1. E[u] = 0
  2. E[uX] = 0
  3. E[u | X] = 0 not guaranteed
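Properties 1 and 2 have exact in-sample counterparts for OLS residuals whenever the regression includes an intercept. A minimal numpy sketch with simulated (made-up) data:

```python
# Simulated data; any X, y would do for the orthogonality properties.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Fit y = b0 + b1*x by least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta  # in-sample residuals

# Mechanical analogues of E[u] = 0 and E[uX] = 0 hold exactly in-sample
# whenever the regression includes an intercept.
print(np.mean(u))      # ~0 (up to floating point)
print(np.mean(u * x))  # ~0 (up to floating point)
```

Note these sample moments are zero by construction of OLS; mean-independence (property 3 for the CEF residual) is a population assumption, not a mechanical fact.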
4
Q

Why might the LRM residual (u) not be mean-independent of X (unlike CEF residual)?

A
  1. LRM is a linear model, but the CEF may be non-linear/wiggly
  2. Therefore, the expected value of the error at different values of X is not necessarily the same for the LRM

5
Q

What is the interpretation of the p-value?

A

Probability of obtaining an estimate at least as extreme as the one actually observed, assuming the null hypothesis is true

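A minimal sketch of this computation, using a hypothetical estimate and standard error and the normal approximation to the t distribution (all numbers illustrative):

```python
# Illustrative only: two-sided p-value for a coefficient estimate,
# using the normal approximation.
import math

beta_hat = 2.1   # hypothetical estimate
se = 0.8         # hypothetical standard error
t = (beta_hat - 0.0) / se  # test H0: beta = 0

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability, under H0, of an estimate at least this extreme (two-sided).
p_value = 2.0 * (1.0 - norm_cdf(abs(t)))
print(p_value)
```

The key phrase is "under H0": the p-value conditions on the null being true, so it is not the probability that the null is true.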
6
Q

What does high R-squared tell you? What does it not tell you?

A

1a. R-squared close to 1 means the explanatory variables are good at fitting the data (Y)
1b. Provides estimate of strength of relationship between model and data

2a. Does not mean the model will be good at extrapolating out of sample
2b. Does not say anything about causality

7
Q

Threats to internal validity

A
  1. Contamination – people in control group access treatment anyway
  2. Non-compliance – individuals offered treatment refuse to take it
  3. Hawthorne effect – participants alter behaviour due to participating in experiment/study
  4. Placebo effect – the belief that one is being treated itself affects final outcomes
8
Q

What is the stable unit treatment value assumption (SUTVA)?

A
  1. Experimental ideal works only if there are no interaction effects between subjects
  2. i.e. each subject’s outcome depends only on their own treatment, not on the treatments of others
9
Q

What is the conditional independence assumption?

A

Treatment assignment is independent of potential outcomes, conditional on covariates

10
Q

Explain how the conditional independence assumption plausibly allows identification of causal effects

A
  1. Run regressions that include causal variable of interest + co-variates
  2. Co-variates assumed to ‘control for’ non-random variation in treatment assignment
  3. Variation left over (Frisch-Waugh-Lovell theorem) plausibly independent of potential outcomes
  4. If credible, treatment assignment conditionally independent of potential outcomes and can therefore measure causal effects
11
Q

Explain how the Frisch-Waugh-Lovell theorem works (verbally)

A
  1. Find independent variation in X, not explained by other regressors
  2. Find independent variation in Y, not explained by other regressors
  3. Find independent variation in Y (not explained by other regressors) that is explained by independent variation in X
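The three steps can be sketched numerically. This simulation (made-up coefficients, numpy only) checks that the partialled-out slope matches the full multiple-regression coefficient:

```python
# FWL sketch with simulated data: the multiple-regression coefficient on x1
# equals the slope from regressing residualised y on residualised x1.
import numpy as np

rng = np.random.default_rng(1)
n = 300
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)           # x1 correlated with x2
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
other = np.column_stack([ones, x2])          # "other regressors"

# Full regression: y on intercept, x1, x2.
b_full = ols(np.column_stack([ones, x1, x2]), y)

# Step 1: residualise x1 on the other regressors.
x1_res = x1 - other @ ols(other, x1)
# Step 2: residualise y on the other regressors.
y_res = y - other @ ols(other, y)
# Step 3: regress residualised y on residualised x1.
b_fwl = ols(x1_res.reshape(-1, 1), y_res)

print(b_full[1], b_fwl[0])  # identical up to floating point
```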
12
Q
  1. Explain the least squares assumption that E[u given X] = 0
  2. What is this assumption equivalent to?
  3. What does this assumption imply?
A

1a. ‘Other factors’ within residual (u) not systematically related to X (i.e. given value of X, mean of distribution of u = 0)
1b. Sometimes these other factors within residual lead Y to be higher/lower than predicted, but on average 0
2. Equivalent to assuming that the population regression line = conditional mean of Y given X
3. Implies that X and u uncorrelated

13
Q
  1. Why does higher variance in X lead to lower variance in the slope coefficients of the regression model?
  2. Intuition?
A
  1. Greater variation in X means we can obtain more precise estimate of slope coefficient
  2. Intuition - if all data bunched around the mean, hard to draw linear line; easier if there’s more variation
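A small simulation illustrating the intuition (sample sizes, scales, and coefficients are all made up): the slope estimate is noisier when X is bunched near its mean:

```python
# Simulation sketch: larger Var(X) -> smaller sampling variance of the
# OLS slope, consistent with Var(b1) ~ sigma_u^2 / (n * Var(X)).
import numpy as np

rng = np.random.default_rng(2)

def slope_sd(x_scale, reps=500, n=100):
    # Empirical standard deviation of the slope estimate across replications.
    slopes = []
    for _ in range(reps):
        x = rng.normal(scale=x_scale, size=n)
        y = 1.0 + 2.0 * x + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(slopes)

sd_narrow = slope_sd(x_scale=0.5)  # X bunched near its mean
sd_wide = slope_sd(x_scale=2.0)    # X spread out
print(sd_narrow, sd_wide)  # the narrow design gives the noisier slope
```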
14
Q

What is perfect multi-collinearity?

A

When 1 regressor = perfect linear combination of the other regressors

15
Q

Mathematically, why does perfect multi-collinearity make it impossible to calculate the OLS estimator?

A

Division by 0 in OLS formulas

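A numpy sketch of the mechanics (simulated data): with a perfectly collinear regressor, X'X loses rank, so the inverse in the OLS formula b = (X'X)⁻¹X'y does not exist:

```python
# Sketch: with a perfectly collinear regressor, X'X is singular, so the
# usual OLS formula (X'X)^(-1) X'y breaks down.
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + 1.0  # exact linear combination of x1 and the intercept
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2, not 3: rank-deficient
try:
    np.linalg.inv(XtX)  # may raise, or return garbage, depending on rounding
except np.linalg.LinAlgError:
    print("X'X is singular")
```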
16
Q

Intuition behind why perfect multi-collinearity is a problem

A
  1. Asks illogical question
  2. In multiple regression, coefficients = effect of change in that regressor, holding other regressors constant
  3. But if 1 regressor is a perfect linear combination of others, then asking the effect of change in that regressor, holding itself constant…
17
Q

Example of perfect multi-collinearity

A
  1. Fraction of English learners and % of English learners – same variable measured in different units, so one is an exact linear function of the other
  2. If one regressor is same for all observations, then it is a perfect linear combination of the intercept term (if there is an intercept)
18
Q
  1. How to avoid the dummy variable trap?
  2. What is the interpretation of the binary variables included?

A
  1. Exclude one of the binary variables from the regression
  2. Represent incremental effect, relative to base case of omitted category
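A sketch of the trap with a hypothetical 3-category variable: including all three dummies plus the intercept makes the design matrix rank-deficient, while dropping one category (the base case) restores full column rank:

```python
# Dummy variable trap sketch: the one-hot dummies sum to the intercept
# column, so including all of them with an intercept is perfect collinearity.
import numpy as np

rng = np.random.default_rng(4)
cats = rng.integers(0, 3, size=90)        # categories 0, 1, 2
d = np.eye(3)[cats]                       # one-hot dummy columns
ones = np.ones(len(cats))

X_trap = np.column_stack([ones, d])       # intercept + all 3 dummies (4 cols)
X_ok = np.column_stack([ones, d[:, 1:]])  # intercept + 2 dummies (base = cat 0)

print(np.linalg.matrix_rank(X_trap))  # 3: four columns, but rank-deficient
print(np.linalg.matrix_rank(X_ok))    # 3: full column rank
```

With the base category omitted, each remaining dummy coefficient is the shift in the outcome relative to category 0.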

19
Q

What is imperfect multi-collinearity?

A

Means 2 or more regressors are highly correlated (in the sense that a linear combination of some regressors is highly correlated with another regressor)

20
Q

What is the implication of imperfect multi-collinearity?

A
  1. Coefficients on at least 1 regressor imprecisely estimated
  2. Difficult to estimate precisely one or more of the partial effects
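A small simulation of this implication (correlation level and coefficients are made up): the same DGP estimated with highly correlated regressors yields much noisier coefficient estimates:

```python
# Simulation sketch: highly correlated regressors inflate the sampling
# variance of individual coefficients, even with the same DGP otherwise.
import numpy as np

rng = np.random.default_rng(7)

def slope_sd(rho, reps=400, n=100):
    # Empirical sd of the coefficient on x1 across replications,
    # with corr(x1, x2) = rho by construction.
    slopes = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
        y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(slopes)

sd_lo = slope_sd(rho=0.0)
sd_hi = slope_sd(rho=0.98)
print(sd_lo, sd_hi)  # the correlated design gives much noisier coefficients
```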
21
Q

What is the F-statistic used for?

A

To test joint hypotheses about regression coefficients

22
Q

What question is addressed by the use of the F-statistic?

A
  1. Does relaxing the q restrictions that constitute the null hypothesis improve the fit of the regression sufficiently that the improvement is unlikely to be merely the result of random sampling variation (if H0 is true)?
23
Q

What should large F-statistic be associated with?

A

Significant increase in R-squared

24
Q

When is the null rejected in an F-test?

A

If the SSR of the unrestricted regression is sufficiently small compared to the SSR of the restricted regression, then the test rejects the null hypothesis
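A worked sketch of the F-test mechanics on simulated data, using the homoskedasticity-only formula F = ((SSR_r − SSR_u)/q) / (SSR_u/(n − k − 1)) (all numbers illustrative):

```python
# F-test sketch. H0: coefficients on x1 and x2 are both zero (q = 2).
import numpy as np

rng = np.random.default_rng(5)
n = 120
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 1.5 * x1 + 0.0 * x2 + rng.normal(size=n)  # x1 matters, x2 does not

def ssr(X, y):
    # Sum of squared residuals from an OLS fit.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

ones = np.ones(n)
ssr_u = ssr(np.column_stack([ones, x1, x2]), y)  # unrestricted: k = 2 regressors
ssr_r = ssr(ones.reshape(-1, 1), y)              # restricted: intercept only

q, k = 2, 2
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k - 1))
print(F)  # large: relaxing the restrictions improves fit a lot
```

Because x1 truly belongs in the model, SSR drops sharply when the restrictions are relaxed and F is far above conventional critical values.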

25
Q

What are control variables used for?

A
  1. Control for causal effect of a variable (if any)
  2. Control for omitted factors that affect Y and are correlated with X
  3. Increase precision of estimates (if control variable not correlated with regressor of interest but correlated with outcome, then standard errors of estimators reduced)
26
Q

Steps to decide whether/not to include a variable in regression

A

(1) identify coefficient of interest
(2) assess a priori (before reviewing data) most important likely sources of omitted variable bias
(3) test whether additional ‘control’ variables (identified in step 2) statistically significant/if estimated coefficient of interest changes measurably when controls added (if so, keep them; if not, remove them)
(4) fully disclose all regressions to allow others to judge for themselves

27
Q

Problem with including a variable where it doesn’t belong (i.e. population regression coefficient = 0)?

A

Reduce precision of estimators of other regression coefficients

28
Q

Consequences of simultaneous causality?

A
  1. OLS estimator biased and inconsistent (because it picks up both effects)
  2. Correlation between regressor and error term
29
Q
  1. When does compositional bias arise?

2. Explain what it means using an example

A
  1. If add control variable that is an outcome of the regressor of interest, then get compositional bias

2a. Example - control for occupation when regressing earnings on schooling
2b. Effect of schooling on earnings, controlling for type of job you do, will seem small
2c. Problem is that schooling affects your occupation and influences earnings partly via your occupation
2d. Can’t assess impact of schooling on earnings by looking within occupations
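The schooling/occupation example can be simulated directly (all coefficients are invented for illustration): controlling for occupation, an outcome of schooling, shrinks the estimated schooling effect from the total effect down to only the direct channel:

```python
# "Bad control" sketch: occupation is itself an outcome of schooling,
# so conditioning on it soaks up part of schooling's effect on earnings.
import numpy as np

rng = np.random.default_rng(6)
n = 5000
schooling = rng.normal(size=n)
occupation = 1.0 * schooling + rng.normal(size=n)  # schooling -> occupation
earnings = 2.0 * schooling + 1.0 * occupation + rng.normal(size=n)
# Total causal effect of schooling on earnings: 2.0 + 1.0 * 1.0 = 3.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_total = ols(np.column_stack([ones, schooling]), earnings)[1]
b_bad = ols(np.column_stack([ones, schooling, occupation]), earnings)[1]

print(b_total)  # ~3.0: total effect, indirect channel included
print(b_bad)    # ~2.0: controlling for occupation strips out the channel
```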