Cross-Section Data Concepts Flashcards
Ordinary Least Squares (OLS)
Regression method used to analyze how variables are related to each other. OLS fits the line that minimizes the sum of squared residuals between the fitted values and the observations.
Control variables and interaction terms can be added to improve the fit of the OLS model.
Taking the logarithm of the dependent variable can give a better fit. The model stays linear in the parameters but becomes non-linear in the original variables. Interpretation of the coefficient β (a short sketch follows this card):
1. Log-Level: a 1 unit change in x is associated with a 100*β percent change in y.
2. Log-Log: a 1% change in x is associated with a β% change in y.
3. Level-Log: a 1% change in x is associated with a β/100 unit change in y.
Common Problems with OLS:
1. Omitted Variables
2. Reverse Causality
3. Measurement Error
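A minimal OLS sketch, assuming Python with statsmodels; the data and the variable names (log_wage, educ, exper) are simulated purely for illustration, with educ read as a log-level effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: log wage depends on education and experience.
rng = np.random.default_rng(0)
n = 500
educ = rng.normal(12, 2, n)
exper = rng.normal(10, 5, n)
log_wage = 1.5 + 0.08 * educ + 0.02 * exper + rng.normal(0, 0.3, n)
df = pd.DataFrame({"log_wage": log_wage, "educ": educ, "exper": exper})

# Log-level model: a 1-unit increase in educ is associated with roughly
# 100 * beta percent higher wages (here about 8%).
res = smf.ols("log_wage ~ educ + exper", data=df).fit()
print(res.params)
```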
Identification Problem (OLS)
Correlation does not imply causation. Causation can be inferred from:
- Difference-in-Differences (DiD)
- Instrumental Variables (IV)
- Regression Discontinuity (RD)
- Structural VAR
BLUE
Best Linear Unbiased Estimator; under the Gauss-Markov assumptions, OLS is BLUE.
Key identifying assumptions for OLS
- Linearity in the parameters
- Random sampling
- No perfect collinearity among the regressors
- Zero conditional mean of the error (exogeneity): E[u|x] = 0
- For OLS to be BLUE, additionally homoskedastic errors
Endogeneity
If an explanatory variable is correlated with the error term, that variable is endogenous. Endogeneity can arise from omitted variables, but also from reverse causality or measurement error.
Linear Probability Model
A linear probability model is OLS with a dummy (binary) dependent variable; each coefficient is interpreted as the change in the probability that y = 1.
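A minimal LPM sketch, again assuming statsmodels; the variables (employed, educ) and the data are made up. Robust standard errors are used because LPM errors are heteroskedastic by construction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
educ = rng.normal(12, 2, n)
employed = (0.3 + 0.04 * educ + rng.normal(0, 0.3, n) > 0.8).astype(int)
df = pd.DataFrame({"employed": employed, "educ": educ})

# Coefficient on educ = change in P(employed = 1) per extra year of education.
lpm = smf.ols("employed ~ educ", data=df).fit(cov_type="HC1")
print(lpm.params, lpm.bse)
```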
Multiple Regressions
One way to mitigate the omitted-variables problem of OLS: include additional explanatory factors in the regression.
Can increase the precision (efficiency) of the estimates.
Does not necessarily remove omitted-variable bias, because relevant factors may still be unobserved (see the sketch below).
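A small simulated sketch of omitted-variable bias: the coefficient on the variable of interest changes once a correlated control is added. All names (wage, educ, ability) are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
ability = rng.normal(size=n)                 # omitted factor
educ = 12 + 2 * ability + rng.normal(size=n)
wage = 5 + 0.5 * educ + 3 * ability + rng.normal(size=n)
df = pd.DataFrame({"wage": wage, "educ": educ, "ability": ability})

short = smf.ols("wage ~ educ", data=df).fit()           # omits ability -> biased
long = smf.ols("wage ~ educ + ability", data=df).fit()  # includes the control
print(short.params["educ"], long.params["educ"])        # ~1.7 vs ~0.5 (true value 0.5)
```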
Panel Data
Using panel data (with fixed effects) allows one to control for all unobserved variables that do not change over time (see the sketch after this card).
Disadvantage:
- One has to be sure that those variables indeed do not change over time.
- Common factors that change only over time can be included only via time dummies; with too many dummies, the model may become over-specified.
- Panel data can lack variation: variables with little within-unit variation are hard to estimate.
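A fixed-effects sketch, assuming statsmodels: one dummy per firm absorbs everything that is constant within a firm over time. The data and names (firm, year, y, x) are simulated for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
firms, years = 50, 10
firm = np.repeat(np.arange(firms), years)
year = np.tile(np.arange(years), firms)
firm_effect = rng.normal(size=firms)[firm]          # time-invariant heterogeneity
x = rng.normal(size=firms * years) + firm_effect
y = 1 + 2 * x + 5 * firm_effect + rng.normal(size=firms * years)
df = pd.DataFrame({"y": y, "x": x, "firm": firm, "year": year})

# C(firm) adds one dummy per firm (the "within" estimator gives the same slope).
fe = smf.ols("y ~ x + C(firm)", data=df).fit()
print(fe.params["x"])   # close to the true slope of 2
```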
Difference-in-Differences (DiD)
Using DiD, the sample is divided into two groups, a control group and a treatment group, which are compared before and after the treatment occurred (a regression sketch follows this card).
Key assumptions:
- Parallel-trend assumption -> had the treatment not happened, both groups would have followed the same trend (there is no formal statistical test for this assumption)
- Random assignment to groups is not an assumption but rather a necessary condition to draw conclusions about causality
Advantage:
- solves the causality (identification) problem of OLS
Disadvantage:
- Randomness of assignment is a pre-condition.
- Problematic if the two groups already follow different trends before the treatment (violation of the parallel-trend assumption).
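A minimal DiD regression sketch, assuming statsmodels: the coefficient on the treat:post interaction is the difference-in-differences estimate. The data are simulated with a true effect of 2.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 4000
treat = rng.integers(0, 2, n)          # 1 = treatment group
post = rng.integers(0, 2, n)           # 1 = after treatment
y = 10 + 1.0 * treat + 0.5 * post + 2.0 * treat * post + rng.normal(size=n)
df = pd.DataFrame({"y": y, "treat": treat, "post": post})

did = smf.ols("y ~ treat * post", data=df).fit()
print(did.params["treat:post"])        # difference-in-differences estimate, ~2
```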
Instrumental Variable (IV)
Two conditions for an instrument to be valid (a two-stage least squares sketch follows this card):
- Relevance: instrument z should be strongly correlated with the endogenous variable x
  - rule of thumb: first-stage F-statistic > 10 (or t-statistic > 3.33 with a single instrument)
  - the F-statistic has to be used when there is more than one instrument
- Exogeneity: instrument z should not be correlated with the error u
  - cannot be tested directly, as u is unobservable
  - a very large standard error of the IV estimate is a warning sign (it often signals a weak instrument rather than proving validity)
Advantages:
- IV addresses the three common problems of OLS (omitted variables, reverse causality, measurement error)
- possible to include more than one instrument (compared to DiD)
- possible to make causal inferences
Disadvantage:
- SEs are relatively large -> loss of efficiency
- possible overidentification
Test to choose between IV and OLS:
Hausman test
H0 = no endogeneity (OLS is consistent)
HA = endogeneity (use IV)
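A 2SLS sketch written out as two OLS stages so the logic is visible, plus a regression-based Hausman (Durbin-Wu-Hausman) check. Everything is simulated; the true effect of x on y is 1.5. Note that the manual second-stage standard errors are not the correct 2SLS standard errors; a packaged IV estimator would fix that.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 5000
u = rng.normal(size=n)                       # structural error
z = rng.normal(size=n)                       # instrument: relevant, excluded from u
x = 1.0 * z + 0.8 * u + rng.normal(size=n)   # x is endogenous (correlated with u)
y = 2 + 1.5 * x + u
df = pd.DataFrame({"y": y, "x": x, "z": z})

# First stage: check instrument strength (rule of thumb: F > 10).
first = smf.ols("x ~ z", data=df).fit()
print("first-stage F:", first.fvalue)

# Second stage on the fitted values.
df["x_hat"] = first.fittedvalues
second = smf.ols("y ~ x_hat", data=df).fit()
print("IV estimate:", second.params["x_hat"])   # ~1.5; plain OLS would be biased upwards

# Hausman-style check: add the first-stage residuals to the structural equation;
# a significant coefficient rejects H0 of no endogeneity.
df["v_hat"] = first.resid
aux = smf.ols("y ~ x + v_hat", data=df).fit()
print("DWH p-value:", aux.pvalues["v_hat"])
```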
Overidentification
When more instruments are included than there are endogenous variables, the model is overidentified: more instruments are used than are needed to estimate the parameters consistently. Invalid or weak extra instruments can worsen the performance of the model.
Tests for overidentification:
1. Sargan test (a manual sketch follows this card)
- assumes that at least one instrument is valid
- an instrument that is invalid is correlated with the residual
H0 = all instruments are valid (the overidentifying restrictions hold)
HA = at least one instrument is invalid
2. Hansen’s J-statistic
- used when there is heteroskedasticity
- interpretation same as for Sargan test
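A manual Sargan-style sketch under simulated data: with two instruments and one endogenous regressor, regress the 2SLS residuals on the instruments; n·R² is approximately chi-squared with (instruments − endogenous variables) = 1 degree of freedom. Both instruments are valid here, so H0 should not be rejected.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(6)
n = 5000
u = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
x = z1 + 0.5 * z2 + 0.8 * u + rng.normal(size=n)
y = 2 + 1.5 * x + u
df = pd.DataFrame({"y": y, "x": x, "z1": z1, "z2": z2})

# Manual 2SLS with two instruments, then structural residuals at the actual x.
first = smf.ols("x ~ z1 + z2", data=df).fit()
df["x_hat"] = first.fittedvalues
beta_iv = smf.ols("y ~ x_hat", data=df).fit().params
df["resid_iv"] = df["y"] - (beta_iv["Intercept"] + beta_iv["x_hat"] * df["x"])

# Regress the residuals on all instruments; n * R^2 ~ chi2(1).
aux = smf.ols("resid_iv ~ z1 + z2", data=df).fit()
sargan = n * aux.rsquared
print("Sargan stat:", sargan, "p-value:", 1 - stats.chi2.cdf(sargan, df=1))
```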
LATE Theorem
The Local Average Treatment Effect (LATE) refers to the limitation that IV estimates capture the treatment effect only for the subpopulation whose behaviour is changed by the instrument (the compliers); no conclusions can be drawn about others. Depending on the nature of the instrument, it may be impossible to identify any meaningful subpopulation whose behaviour is being measured.
Regression Discontinuity
- Similar to IV and can also be used to establish causal relation.
- In RD you take a subsample of observations that lie close to the threshold (just below and just above the cutoff).
- The further you move away from the threshold (the larger the bandwidth gets), the more dissimilar the control and treatment groups become (the bias increases).
- For a small bandwidth, conclusions about causality can be drawn (see the sketch after this card).
- Disadvantage of a small bandwidth: small sample size, so standard errors increase.
The donut method:
- observations very close to the threshold are removed because they might be manipulated
- reduces the bias from manipulation (sorting around the cutoff)
- if manipulation is impossible, there is no need for the donut method
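A sharp RD sketch under simulated data: local linear regression inside a bandwidth around the cutoff, optionally dropping a small "donut" right at the threshold. The running variable, cutoff, bandwidth and the jump of 2 are all made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 10000
running = rng.uniform(-1, 1, n)               # running variable, cutoff at 0
treated = (running >= 0).astype(int)
y = 1 + 0.5 * running + 2.0 * treated + rng.normal(0, 0.5, n)
df = pd.DataFrame({"y": y, "running": running, "treated": treated})

# Keep observations within the bandwidth but outside the donut hole.
bandwidth, donut = 0.2, 0.01
local = df[(df.running.abs() <= bandwidth) & (df.running.abs() >= donut)]

# Separate slopes on each side; the coefficient on treated is the jump at the cutoff.
rd = smf.ols("y ~ treated + running + treated:running", data=local).fit()
print(rd.params["treated"])                   # ~2, the treatment effect at the cutoff
```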
Natural Experiments
Conducted by selecting a random sample and dividing it into a treatment group and a control group. The treatment group is offered the treatment; the control group is not.
Key assumptions:
- characteristics of the two groups are similar
- ensured by random assignment if the sample is large enough
Advantage:
- no reverse causality thanks to random assignment
- possibility to establish a causal relationship
Disadvantage:
- not always feasible
- some characteristics cannot be changed or controlled
- possibility of treatment dilution
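A tiny sketch of the experimental estimator under simulated data: with random assignment, the regression of the outcome on a treatment dummy recovers the difference in group means (here the true effect is 1.0).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 2000
treat = rng.integers(0, 2, n)                  # random assignment
y = 5 + 1.0 * treat + rng.normal(size=n)

# OLS on a constant and the treatment dummy = difference in means.
X = sm.add_constant(treat.astype(float))
res = sm.OLS(y, X).fit()
print(res.params[1], res.pvalues[1])           # treatment effect ~1.0
```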
Misspecification
Specification of the model:
- normality
- homoskedasticity
  -> the variance of the error needs to be constant
- functional form
  -> a multiple regression suffers from functional form misspecification when it does not properly account for the relationship between the dependent and the observed explanatory variables
Test for misspecification:
Ramsey RESET test
H0 = the regression is well specified
HA = the model is misspecified (e.g. omitted non-linear terms of the regressors)
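A manual RESET sketch under simulated data: add powers of the fitted values to the regression and F-test their joint significance. The data contain a quadratic term that the linear model misses, so the test should reject.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 1000
x = rng.normal(size=n)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(size=n)   # true relation is non-linear
df = pd.DataFrame({"y": y, "x": x})

restricted = smf.ols("y ~ x", data=df).fit()
df["fit2"] = restricted.fittedvalues ** 2
df["fit3"] = restricted.fittedvalues ** 3
unrestricted = smf.ols("y ~ x + fit2 + fit3", data=df).fit()

# Joint F-test of the added powers of the fitted values.
f_stat, p_value, _ = unrestricted.compare_f_test(restricted)
print(f_stat, p_value)        # small p-value -> reject H0 of correct specification
```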
Heteroskedasticity
The variance of the error term is not constant across observations.
Test for heteroskedasticity:
1. Breusch-Pagan test
H0 = homoskedasticity (no heteroskedasticity)
HA = heteroskedasticity
2. White Test
H0 = homoskedasticity (no heteroskedasticity)
HA = heteroskedasticity
If there is heteroskedasticity in the data, it is necessary to use heteroskedasticity-robust standard errors (see the sketch below).
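A sketch of both tests plus robust standard errors, assuming statsmodels; the data are simulated so that the error variance grows with x.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(10)
n = 1000
x = rng.uniform(1, 5, n)
y = 1 + 2 * x + rng.normal(0, x, n)            # error variance depends on x
X = sm.add_constant(x)

res = sm.OLS(y, X).fit()
bp_lm, bp_pval, _, _ = het_breuschpagan(res.resid, X)
w_lm, w_pval, _, _ = het_white(res.resid, X)
print("Breusch-Pagan p:", bp_pval, "White p:", w_pval)   # small -> heteroskedasticity

# Refit with heteroskedasticity-robust (HC1) standard errors.
robust = sm.OLS(y, X).fit(cov_type="HC1")
print(robust.bse)
```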