Midterm Flashcards
Causal relationship
a change in one variable (action) CAUSES change in another variable (result)
Correlation
X and Y move together, but the association does not by itself show that one causes the other; it can partially be explained by other factors
Error Term
- Deviation of the observed Y from the true line
- Represented by εi in structural equation
- A theoretical representation of the unobserved variables that account for the variation in Y not explained by the model (an omitted variable is absorbed by the error term)
Residual
- Deviation of the observed Y from the estimated line
- Calculated by e_i = Y_i – Ŷ_i
- Observed – Estimated
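As a minimal sketch of the residual formula, the snippet below fits a least-squares line to a small made-up data set and computes e_i = Y_i – Ŷ_i. The data values and the helper name `ols_simple` are assumptions for illustration only.

```python
# Toy data (hypothetical, for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def ols_simple(x, y):
    """Return (b0, b1) for the least-squares line Y = b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = ols_simple(x, y)

# Estimated line and residuals (observed minus estimated)
y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
print(residuals)
```

With an intercept in the model, OLS residuals sum to zero (up to rounding), which is a quick sanity check on the calculation.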
R^2
Goodness of Fit
- Ranges from 0 – 1
- Closer to 1 = better fit
- Adjusted R2: Includes “penalty” for adding additional regressors
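The two fit measures can be sketched as follows, assuming an OLS fit with an intercept. The data and fitted values are made up for illustration, and `k` denotes the number of regressors (excluding the intercept).

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS (equal to ESS/TSS for OLS with an intercept)."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return 1 - rss / tss

def adjusted_r_squared(r2, n, k):
    """Penalize extra regressors: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed values and fitted values from a one-regressor OLS line
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(y), k=1)
print(round(r2, 3), round(adj_r2, 3))
```

Here R^2 is about 0.6 and adjusted R^2 about 0.467; the adjusted value is lower because of the degrees-of-freedom penalty.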
null hypothesis
The null hypothesis states “no difference” or “no effect”
alternative hypothesis
The alternative hypothesis states there is a difference/effect
T-test
- If the absolute value of the t-stat is bigger than the critical value (e.g. 1.96), we can reject the null hypothesis in favor of the alternative that the true coefficient is not zero
- Our variable is statistically significant at the 5% level of significance
- This also means the p-value is smaller than 5% (0.05).
T-test formula
Divide the coefficient by the standard error to get the t-value
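A minimal sketch of the t-value calculation and the decision rule, using a hypothetical coefficient and standard error. The 1.96 default is the two-sided 5% critical value for large samples; smaller samples need the t-table value for their degrees of freedom.

```python
def t_stat(coef, std_err, null_value=0.0):
    """t = (estimated coefficient - value under H0) / standard error."""
    return (coef - null_value) / std_err

def is_significant(t, critical_value=1.96):
    """Reject H0 when |t| exceeds the critical value."""
    return abs(t) > critical_value

# Hypothetical regression output: coefficient 0.50, standard error 0.20
t = t_stat(coef=0.50, std_err=0.20)
print(t, is_significant(t))
```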
F-test
Test a set of regression coefficients for joint significance
- H0: β1 = β2 = β3= 0 (ALL coefficients = 0)
- HA: β1 ≠ 0 OR β2 ≠ 0 OR β3 ≠ 0 (at least 1 coefficient NOT equal to 0)
F-stat > Critical Value = Reject the Null
(p-value of F lower than the level of significance)
F-test formula
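F = (ESS / K) / (RSS / (N – K – 1)), where K is the number of slope coefficients tested in the overall-significance test and N is the sample size
You want the F-stat high & its p-value low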
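For the test that all slope coefficients are zero, the F-stat can be written in terms of R^2. The sketch below uses that form with hypothetical values for R^2, N, and K; the critical value still has to be looked up in an F table with (K, N – K – 1) degrees of freedom.

```python
def overall_f_stat(r2, n, k):
    """F = (R^2 / K) / ((1 - R^2) / (N - K - 1)) for H0: all slopes = 0."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Hypothetical fit: R^2 = 0.6 from N = 5 observations and K = 1 regressor
f = overall_f_stat(r2=0.6, n=5, k=1)
print(round(f, 3))
```

With a single regressor, this F-stat equals the square of that regressor's t-stat.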
Interpreting Coefficients:
Level-Level
Y = β1 X1
on average a one-unit increase in X is associated with a β1-unit increase in Y, holding all else constant
Interpreting Coefficients:
Log-Level
lnY= β1 X1
on average a one-unit increase in X is associated with a (100 × β1)% increase in Y, holding all else constant
Interpreting Coefficients:
Level-Log
Y= β1 lnX1
on average a 1% increase in X is associated with a (β1 / 100)-unit increase in Y, holding all else constant
Interpreting Coefficients:
Log-Log
lnY= β1 lnX1
on average, a 1% increase in X is associated with a β1% increase in Y, holding all else constant
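The usual readings of logged models (about 100 × β1 percent for log-level, about β1 / 100 units for level-log, about β1 percent for log-log) are approximations that work well for small changes. A minimal sketch of the exact changes, with hypothetical coefficients and helper names:

```python
import math

def log_level_pct_change(beta, dx=1.0):
    """Exact % change in Y when X rises by dx in ln(Y) = b0 + beta*X."""
    return 100 * (math.exp(beta * dx) - 1)

def level_log_unit_change(beta, pct=1.0):
    """Exact change in Y when X rises by pct% in Y = b0 + beta*ln(X)."""
    return beta * math.log(1 + pct / 100)

def log_log_pct_change(beta, pct=1.0):
    """Exact % change in Y when X rises by pct% in ln(Y) = b0 + beta*ln(X)."""
    return 100 * ((1 + pct / 100) ** beta - 1)

# Hypothetical coefficients
print(log_level_pct_change(0.05))   # close to 100 * 0.05 = 5 percent
print(level_log_unit_change(200))   # close to 200 / 100 = 2 units
print(log_log_pct_change(0.8))      # close to 0.8 percent
```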
Dummy/binary variable
Only has two possible values – e.g. X = 1 if female; X = 0 if male
Y = B0 + B1female
Ex: On average, being female is associated with a B1 difference in Y compared to male, holding all else constant
Categorical Variable
A variable like “region” has multiple values (south, west, northeast, midwest) that should be transformed into individual dummy (0 or 1) variables
Y = B0 + B1south + B2west + B3 northeast
Ex: On average, living in the South is associated with a B1 change in Y compared to the Midwest, holding all else constant.
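A minimal sketch of turning a categorical variable into dummies while leaving out one base category (here the Midwest, matching the equation above). The helper name `make_dummies` and the sample values are assumptions for illustration.

```python
def make_dummies(values, base):
    """Build one 0/1 column per category, omitting the base category."""
    categories = sorted(set(values) - {base})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical observations of "region"
regions = ["south", "west", "midwest", "northeast", "south"]
dummies = make_dummies(regions, base="midwest")

for name, column in dummies.items():
    print(name, column)
```

The base category is the row where every dummy is 0, so each coefficient is read relative to it. Including a dummy for every category plus an intercept would create perfect multicollinearity.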
Interaction term
An independent variable in a regression equation that is the product of two or more other independent variables. Each interaction term has its own regression coefficient
Does the effect of work experience on salary differ between males and females?
Y = B0 +B1Experience + B2Female + B3(Experience*Female) + e
Ex: On average, the change in Y associated with a one-unit increase in experience differs by B3 for females compared to males, holding all else constant
This allows the effect of experience on income to vary by gender
B3 now measures the effect of an additional year of experience for females relative to males
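To see why B3 is the difference in slopes, the sketch below evaluates the interaction equation at hypothetical coefficient values (none of them come from real data). The slope on experience is B1 for males and B1 + B3 for females.

```python
def predicted_salary(experience, female, b0=30.0, b1=2.0, b2=-5.0, b3=-0.5):
    """Y = B0 + B1*Experience + B2*Female + B3*(Experience*Female)."""
    return b0 + b1 * experience + b2 * female + b3 * experience * female

# Effect of one more year of experience, by group
slope_male = predicted_salary(11, 0) - predicted_salary(10, 0)
slope_female = predicted_salary(11, 1) - predicted_salary(10, 1)

print(slope_male, slope_female, slope_female - slope_male)
```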
7 Classical Assumptions
- Regression model is linear (in B’s), correctly specified, and has an additive error term
- The error term has a population mean of zero
- The explanatory variables are not correlated with the error term
- Observations of the error term are not correlated
- The error term has a constant variance
- No explanatory variable is a perfect linear function of any other explanatory variable(s) (no perfect multicollinearity)
- Error term is normally distributed
Omitted Variable Bias
Y = β0 + β1X1 +e
where error term absorbs an omitted variable X2
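A minimal numeric sketch of the bias, assuming a made-up true model Y = 1 + 2·X1 + 3·X2 with no noise. When X2 is left out, the slope on X1 picks up β2 times the slope from regressing X2 on X1.

```python
def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: X2 is positively correlated with X1
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
beta1, beta2 = 2.0, 3.0
y = [1.0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

short_slope = slope(x1, y)   # regression that omits X2
delta = slope(x1, x2)        # how X2 moves with X1
bias = short_slope - beta1   # equals beta2 * delta

print(short_slope, delta, bias)
```

Here β2 is positive and X1 and X2 are positively correlated, so the estimated slope on X1 is biased upward.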
Variable Inclusion Criteria
Theory: is there sound justification for including the variable?
Bias: do the coefficients for other variables change noticeably when the variable is included?
T-Test: is the variable’s estimated coefficient statistically significant?
R-square: has the R-square (adjusted R-square) improved?
First-order serial correlation
occurs when the value of the error term in one period is a function of its value in the previous period; the current error term is correlated with the previous error term.
DW Test
compare DW(d) to the critical values (dL, dU)
Newey-West Standard Errors
- Designed to correct for the consequences of first-order serial correlation; they are technically still biased, but are more accurate than OLS standard errors so they can be used for t-tests and other hypothesis tests
Newey-West SE > OLS SE
- Larger standard errors produce lower t-scores, so coefficients won’t be as statistically significant
Heteroskedasticity
occurs when the variance of the error term is not constant across observations. On visual inspection of a residual plot, the tell-tale sign is that the residuals tend to fan out (for example, over time or as X increases).
Pure Heteroskedasticity
occurs in correctly specified equations
Impure Heteroskedasticity
arises due to model misspecification
Multicollinearity
A state of very high intercorrelation among the independent variables. When it is present, statistical inferences about the individual coefficients may not be reliable.
Perfect Multicollinearity
virtually always the result of a definitional relationship between the independent variables, and is solved by dropping variables from the regression.
Imperfect Multicollinearity
describes the existence of a strong (but not exact) linear relationship between two or more independent variables that can significantly affect the estimates of coefficients.
Multicollinearity
Multicollinearity exists in every equation & the severity can change from sample to sample.
There are no generally accepted true statistical tests for multicollinearity.
VIF > 5 indicates severe multicollinearity (rule of thumb)
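A minimal sketch of the VIF calculation, VIF = 1 / (1 – R^2_i), where R^2_i comes from regressing one regressor on the others. It is restricted to the two-regressor case, where R^2_i is just the squared simple correlation; the data and the helper name are made up.

```python
def vif_two_regressors(x1, x2):
    """VIF = 1 / (1 - r^2), with r the correlation between x1 and x2."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r2 = s12 ** 2 / (s11 * s22)
    return 1 / (1 - r2)

# Hypothetical regressors
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]

vif = vif_two_regressors(x1, x2)
print(round(vif, 2), vif > 5)
```

With more than two regressors, R^2_i has to come from an auxiliary regression of X_i on all the other X's.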
Outliers
A distinctly unusual observation or extreme value
Unbiased
Parameter estimates are, on average, equal to the parameter’s true value in the population model
Unbiased Equation
E(Bhat)=B
Distribution of Bhat is centered around B
Efficient
Has the lowest variance among unbiased estimators
Multicollinearity
strong (but not exact) linear relationship between two or more regressors
Best Linear Unbiased Estimator
OLS is BLUE if the first 6 classical assumptions are met (Gauss-Markov theorem)
OLS stands for
Ordinary least squares
Most Common remedies for multicollinearity
- Do Nothing
- Drop a redundant variable
- Increase the sample size
Impure Serial Correlation
Serial correlation that is caused by a specification error such as an omitted variable or an incorrect functional form
Pure Serial Correlation
This type of serial correlation occurs when the error in one period is correlated with the errors in other periods. The model is assumed to be correctly specified.
Best remedy for impure serial correlation
Attempt to find the omitted variable or the correct functional form for the equation
Stochastic Error Term
A term that is added to a regression equation to introduce all the variation in the dependent variable that cannot be explained by the independent variables that have been included
Equation: Y = B0+B1X+e
Residual Error Term
The difference between the actual value of the dependent variable and the estimated value of the dependent variable (observed – estimated)
Equation: e_i = Y_i – Ŷ_i
Durbin-Watson d statistic test
Used to determine if there is first-order serial correlation in the error term of an equation by examining residuals. Includes dL(lower bound) and dU (upper bound).
Durbin-Watson assumptions
- regression model includes an intercept term
- serial correlation is first-order in nature
- regression model does not include a lagged dependent variable as an independent variable
What if the Durbin-Watson d-statistic is above the upper limit (d > dU)?
We do not reject the null hypothesis of no autocorrelation since there is no statistical evidence of first order positive serial correlation
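A minimal sketch of the d statistic, d = Σ(e_t – e_{t-1})² / Σ e_t², and the one-sided decision rule for positive serial correlation. The residuals are made up, and the dL and dU values passed in are placeholders, not entries from a Durbin-Watson table.

```python
def durbin_watson(residuals):
    """d = sum of squared changes in residuals / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def dw_decision(d, d_lower, d_upper):
    """One-sided test of H0: no positive first-order serial correlation."""
    if d < d_lower:
        return "reject H0: positive serial correlation"
    if d > d_upper:
        return "do not reject H0"
    return "inconclusive"

# Hypothetical residuals that drift slowly (positive serial correlation)
d_pos = durbin_watson([0.5, 0.4, 0.3, -0.2, -0.4, -0.3])
# Hypothetical residuals that flip sign every period
d_neg = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Placeholder critical values; look up dL and dU for your N and K
print(d_pos, dw_decision(d_pos, d_lower=1.0, d_upper=1.5))
print(d_neg, dw_decision(d_neg, d_lower=1.0, d_upper=1.5))
```

Values of d near 2 suggest no first-order serial correlation, values near 0 suggest positive serial correlation, and values near 4 suggest negative serial correlation.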
White Test
Used to test for heteroskedasticity
t-test formula
coefficient divided by standard error
R^2 formula
ESS (explained sum of squares) divided by TSS (total sum of squares)
Sign of bias = sign of the omitted variable's coefficient x sign of the correlation between the omitted and included variables
sign of bias
5% significance level (two-sided critical t-value, large sample)
1.96
1% significance level (two-sided critical t-value)
2.787 with 25 degrees of freedom (2.576 in large samples)
Omitted Variable (issue)
Bias in the coefficient estimates of the included X's #3 OLS classical assumption
Omitted Variable (correction)
Include the omitted variable or a proxy #3 OLS classical assumption
Irrelevant variable
Inclusion of a variable that does not belong in the equation
Incorrect Functional Form (issue)
The functional form is inappropriate #1 OLS classical assumption
Incorrect Functional Form (correction)
Transform the variable or the equation to a different functional form #1 OLS classical assumption
Multicollinearity (issue)
Some of the ind. variables are imperfectly correlated #6 OLS classical assumption
Multicollinearity (correction)
Drop the redundant variables, but often doing nothing is best #6 OLS classical assumption
Serial Correlation (issue)
Observations of the error term are correlated #4 OLS classical assumption
Serial Correlation (correction)
If impure, fix the specification. Otherwise, consider Generalized Least Squares or Newey-West standard errors #4 OLS classical assumption
Heteroskedasticity (issue)
The variance of the error term is not constant for all observations #5 OLS classical assumption
Heteroskedasticity (correction)
If impure, fix the specification. Otherwise, use heteroskedasticity-corrected (HC) standard errors or reformulate the variables