Research Skills Part 3 Flashcards
Important note on correlation
Zero correlation means that there is no linear relation between x and y. But it does not imply independence!!! (e.g., y = x^2 with x symmetric around 0 gives CORR(x,y) = 0, yet y depends entirely on x)
Correlation is not causation!!!
Name 3 structures of correlation
- x causes y
- y causes x
- x causes y and y causes x > self-reinforcement
In a univariate regression…
Correlation determines the sign of the regression coefficient, and CORR(x,y)^2 = R2
What is RSS?
Residual Sum of Squares = sum (y – y-hat)^2 = the sum of all squared residuals
Give the formula for the beta coefficient
= cov(x,y) / var(x)
= (SD(y) / SD(x)) * CORR(x,y)
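A quick numerical check of both forms (a sketch on simulated data; numpy only, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 2.0 + 0.5 * x + rng.normal(size=1_000)

# Slope via cov(x, y) / var(x); use ddof=1 in both so the factors cancel.
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Equivalent form: (SD(y) / SD(x)) * CORR(x, y).
beta_alt = (np.std(y, ddof=1) / np.std(x, ddof=1)) * np.corrcoef(x, y)[0, 1]

print(beta, beta_alt)  # both ~0.5 and identical up to rounding
```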
What is TSS?
Total Sum of Squares = sum (y – y-bar)^2
What is ESS?
Explained Sum of Squares = sum (y-hat – y-bar)^2
Give the formula for R2
TSS = ESS + RSS
R2 = 1 – RSS/TSS = ESS/TSS
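A minimal sketch verifying the decomposition, and that R2 = CORR^2 in the univariate case (simulated data, numpy only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 1.0 + 0.8 * x + rng.normal(size=500)

# Univariate OLS fit by hand.
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
y_hat = alpha + beta * x

tss = np.sum((y - y.mean()) ** 2)       # total
ess = np.sum((y_hat - y.mean()) ** 2)   # explained
rss = np.sum((y - y_hat) ** 2)          # residual

print(np.isclose(tss, ess + rss))                   # True: TSS = ESS + RSS
print(1 - rss / tss, np.corrcoef(x, y)[0, 1] ** 2)  # equal: R2 = CORR^2
```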
What are the drawbacks of R2?
- It depends on how dep var is defined (changes versus levels, wages versus log wages, etc.). It is only comparable if the dep var is the same.
- It never decreases if you add more vars, even if they’re useless > compute Adj-R2
Note on (adj) R2
(adj) R2 is useful for comparing the relative performance of 2 models with same dep var. However, it is not useful for evaluating absolute performance.
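A short illustration of the Adj-R2 point above (a sketch using statsmodels; `junk` is a deliberately useless regressor):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
junk = rng.normal(size=n)            # pure noise, unrelated to y
y = 1.0 + 0.5 * x + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, junk]))).fit()

print(m1.rsquared, m2.rsquared)          # R2 never falls when adding junk
print(m1.rsquared_adj, m2.rsquared_adj)  # Adj-R2 penalizes the extra var
```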
Name 3 factors reducing the accuracy of OLS estimate
- Large error variance (s^2) > large influence of other variables that are not in the model > OMITTED VARIABLE BIAS!!!!!
- Small number of observations
- Little spread in indep var > without variation in x one cannot explain variation in y, but too much variation is also bad
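These three factors all show up in the textbook variance of the univariate OLS slope: Var(beta-hat) = s^2 / sum (x – x-bar)^2 = s^2 / (n * var(x)) > a larger s^2, a smaller n, or less spread in x each inflate the S.E.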
What is the F-test?
The F-test of overall significance indicates whether your linear regression model provides a better fit to the data than a model that contains no independent variables.
- Multiple regression: the p-value of the F-test equals the p-value of the null hypothesis that all slope coefficients are jointly equal to zero
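A minimal example of reading off the overall F-test (a sketch using statsmodels; data simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 0.3 * X[:, 0] + rng.normal(size=200)

res = sm.OLS(y, sm.add_constant(X)).fit()
# F-test of the null that all slope coefficients are jointly zero.
print(res.fvalue, res.f_pvalue)
```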
When is the Omitted Variable Bias more severe and pose a solution for this problem?
The problem is more severe when the x variable in the regression has a high correlation with the omitted variable z
Solution: multivariate regression
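A small simulation of the bias and the multivariate fix (a sketch; the true slope on x is 1.0, and z is the omitted variable):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 5_000
z = rng.normal(size=n)                  # the omitted variable
x = 0.8 * z + rng.normal(size=n)        # x is correlated with z
y = 1.0 * x + 1.0 * z + rng.normal(size=n)

short = sm.OLS(y, sm.add_constant(x)).fit()                       # omits z
full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()  # includes z

print(short.params[1])  # biased upward: well above the true slope of 1.0
print(full.params[1])   # ~1.0 once z is included
```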
What are the assumptions of the linear regression model?
- residuals have a mean of 0 and are independent of the independent variables
- residuals have a constant variance = homoskedasticity
- residuals are uncorrelated = no autocorrelation
- there’s no exact linear relation between the independent variables
Under these assumptions, the OLS estimators (betas) are BLUE = best linear unbiased estimator for the true beta.
Only then are the routinely computed S.E.s and t-stats correct.
For exact small-sample inference, one additional assumption is needed:
- residuals follow a normal distribution
If the errors are correlated with any of the independent vars, OLS is biased and inconsistent > wrong coefficient estimates!!
How do you test for non-linearity and how do you fix non-linearity issues?
Test: Ramsey’s RESET test to examine linearity of regression
Solution: use data transformation > take logs or add a squared term
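A sketch of the RESET test and the squared-term fix (assumes a recent statsmodels that ships `linear_reset`; data simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(5)
x = rng.normal(size=300)
y = 1.0 + x + 0.5 * x**2 + rng.normal(size=300)  # true relation is quadratic

res = sm.OLS(y, sm.add_constant(x)).fit()        # misspecified linear fit
print(linear_reset(res, power=2, use_f=True).pvalue)  # small p: linearity rejected

# Fix: add a squared term, after which RESET no longer rejects.
res2 = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
print(linear_reset(res2, power=2, use_f=True).pvalue)
```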
Heteroskedasticity causes / consequences / testing / solutions
Causes:
- changing variance over time (time series)
- changing variance across firms (cross sectional)
Consequences:
- usual S.E. and t-stats not valid
- BUT, no impact on coefficients!!
Testing:
- visual inspection of residual plots
- statistical tests (e.g., Breusch-Pagan, White)
Solutions:
- use corrected S.E.s
- use log transform or scale variables by size
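A sketch of one common test (Breusch-Pagan) plus White-corrected S.E.s, using statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, size=400)
y = 2.0 + 0.5 * x + rng.normal(scale=x)   # error variance grows with x

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_pvalue)                          # small p-value: heteroskedasticity

# Fix the S.E.s (coefficients are unchanged): White/robust covariance.
robust = sm.OLS(y, X).fit(cov_type="HC1")
print(res.bse, robust.bse)                # same params, different S.E.s
```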
Autocorrelation causes / consequences / testing / solutions
Causes:
- seasonality effects
- lead/lag effects > over/underreaction to news
- model misspecification
Consequences:
- usual S.E.s and t-stats not valid
- positive autocorr: S.E. understated and t-stats too big
- negative autocorr: S.E. overstated and t-stats too small
Testing:
- visual inspection of residual plots
- statistical tests (e.g., Durbin-Watson, Ljung-Box)
Solutions:
- add lagged dep/indep vars
- include dummy variable
- use corrected S.E.s
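A sketch of a Durbin-Watson check plus Newey-West (HAC) corrected S.E.s, using statsmodels; the AR(1) errors are simulated:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
n = 300
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):                     # AR(1) errors => positive autocorr
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + e

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
print(durbin_watson(res.resid))           # well below 2 => positive autocorr

# Corrected S.E.s: Newey-West / HAC covariance.
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(res.bse, hac.bse)
```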
Multicollinearity causes / consequences / checking / solutions
Causes:
- 2 or more indep vars are highly correlated
Consequences:
- low t-stats and high S.E.s for individual coefficients
- weird signs or magnitudes of coefficient estimates
Checking:
- compute CORR matrix
- compute Variance Inflation Factor. VIF > 10 is a problem
Solutions:
- drop one variable > can lead to omitted variable bias
- collect more data to increase accuracy
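A minimal VIF check (a sketch using statsmodels; x2 is built to be nearly collinear with x1):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)     # nearly a copy of x1
X = sm.add_constant(np.column_stack([x1, x2]))

# VIF per regressor (skip the constant in column 0).
for i in (1, 2):
    print(variance_inflation_factor(X, i))  # far above 10: multicollinearity
```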
Non-normality causes / consequences / testing / solutions
Causes:
- extreme observations
- bounded dep var
- binary dep var
- discrete dep var
Consequences:
- large sample > no problem (asymptotic theory applies)
- small sample > inference about coefficients wrong and t-stats invalid
- BUT, no impact on coefficients!!
Testing:
- JB (Jarque-Bera) statistic to test for a normal distribution
Solutions:
- winsorize / truncate
- use log transformation
- use other regression model > tobit, probit/logit
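A sketch of the JB test plus winsorizing (assumes statsmodels and scipy; the fat tails come from simulated t-distributed errors):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(9)
x = rng.normal(size=500)
y = 1.0 + 0.5 * x + rng.standard_t(df=3, size=500)  # fat-tailed errors

res = sm.OLS(y, sm.add_constant(x)).fit()
jb, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
print(jb_pvalue)  # small p-value: normality rejected

# One fix: winsorize the dep var at the 1st/99th percentiles, then refit.
y_w = winsorize(y, limits=(0.01, 0.01))
```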
If the error term in a linear regression model is not normally distributed…
A. … the OLS estimator is biased
B. … routinely calculated S.E.s are incorrect
C. … we need to rely on asymptotic theory to perform valid tests
D. … we need to take the log of the dependent variable
C
In a linear regression model, if the slope coefficient of X has a t-stat of 3.0…
A. we accept the hypothesis that X has an impact
B. we accept that X is significant
C. we reject the null hypothesis that X is insignificant
D. we reject the null hypothesis that X has no impact
D
What do endogeneity and simultaneity mean?
Endogeneity broadly refers to situations in which an explanatory variable is correlated with the error term.
Simultaneity is a common cause of endogeneity. It arises when one or more of the predictors (e.g., a treatment variable) is determined by the response variable (Y). In simple terms, X causes Y and Y causes X.
Which problem makes the OLS estimator biased?
A. simultaneity between x and y
B. heteroskedasticity
C. a small sample
D. all of these
A
Which statement(s) is/are correct?
A. R2 is the most important statistic of a regression
B. R2 tells us how well the model fits the data
C. a larger R2 is always better
D. if R2=0, we have a useless model
B & D