07 Multiple Regression Model Flashcards
Omitted variable bias
The bias in the OLS estimator that arises from an omitted factor, or variable, is called omitted variable bias. For omitted variable bias to occur, the omitted variable Z must satisfy two conditions:
• The omitted variable is correlated with the included regressor (i.e. corr(Z, X) ≠ 0)
• The omitted variable is a determinant of the dependent variable (i.e. Z is part of u)
$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}$, where $\rho_{Xu} = \operatorname{corr}(X_i, u_i)$
The formula indicates that:
• Omitted variable bias exists even when n is large; it does not disappear as the sample size grows.
• The larger the correlation between X and the error term, the larger the bias.
• The direction of the bias depends on whether X and u are positively or negatively correlated (see the simulation sketch below).
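As a quick numerical illustration (a minimal simulation sketch; the data-generating process, variable names, and coefficient values below are hypothetical, not taken from the flashcards), omitting a regressor Z that is correlated with X and affects Y leaves the slope estimate biased even with a very large sample, while including Z removes the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                   # large n: the bias does not vanish

# Hypothetical DGP: Y = 1 + 2*X + 1.5*Z + e, with Z positively correlated with X,
# so omitting Z makes X pick up part of Z's effect (upward bias here).
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)              # corr(X, Z) > 0
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)

# "Short" regression with Z omitted: slope = cov(X, Y) / var(X)
beta1_short = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# "Long" regression with Z included: OLS on [1, X, Z]
X_long = np.column_stack([np.ones(n), x, z])
beta1_long = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

print(f"beta_1 with Z omitted:  {beta1_short:.3f}")   # noticeably above 2
print(f"beta_1 with Z included: {beta1_long:.3f}")    # close to the true value 2
```

Flipping the sign of the correlation between X and Z in this sketch (e.g. x = -0.8 * z + ...) flips the direction of the bias, matching the last bullet above.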
How to overcome omitted variable bias
- Ideal controlled experiment
- Include the variable in the regression
- Use cross tabulation
Advantages of the MLRM over the SLRM:
• By adding more independent variables (control variables) we can explicitly control for other factors affecting y.
• More likely that the zero conditional mean assumption holds and thus more likely that we are able to infer causality.
• By controlling for more factors, we can explain more of the variation in y and thus obtain better predictions.
• Can incorporate more general functional forms.
Assumptions of the MLRM
- Random sampling
- Large outliers are unlikely
- Zero conditional mean
- (There is sampling variation in X) and there are no exact linear relationships among the independent variables (No perfect collinearity).
- (The model is linear in parameters)
Under these assumptions, the OLS estimators are unbiased estimators of the population parameters. In addition, there is the homoskedasticity assumption, which is necessary for OLS to be BLUE.
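The no-perfect-collinearity assumption can be illustrated with a small sketch (hypothetical data): if one regressor is an exact linear function of another, the design matrix loses rank and the OLS coefficients are not uniquely determined.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + 2.0                        # exact linear function of x1 (and the constant)
X = np.column_stack([np.ones(n), x1, x2])  # perfectly collinear columns

print("columns:", X.shape[1], " rank:", np.linalg.matrix_rank(X))         # 3 vs 2
print("smallest singular value:", np.linalg.svd(X, compute_uv=False)[-1]) # ~0

# Dropping the redundant regressor restores a unique OLS solution.
y = 1.0 + 0.5 * x1 + rng.normal(size=n)
beta = np.linalg.lstsq(X[:, :2], y, rcond=None)[0]
print("OLS after dropping the redundant column:", beta)
```

Statistical packages typically respond to perfect collinearity by dropping one of the offending regressors or by refusing to estimate the model.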
Important properties of the OLS fitted values and residuals
The OLS fitted values and residuals have the same important properties as in the simple linear regression:
• The sample average of the residuals is zero, and so the sample average of the fitted values equals the sample average of Y (avg(Ŷ) = avg(Y)).
• The sample covariance between each independent variable and the OLS residuals is zero. Consequently, the sample covariance between the OLS fitted values and the OLS residuals is zero.
• The point (X̄₁, X̄₂, …, X̄ₖ, Ȳ) is always on the OLS regression line.
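These algebraic properties are easy to verify numerically; the sketch below uses simulated, hypothetical data and an OLS fit that includes an intercept (the properties rely on the intercept being included):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 1.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])          # intercept included
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
resid = y - y_hat

print("mean residual:          ", resid.mean())               # ~0
print("mean(Y) - mean(Y_hat):  ", y.mean() - y_hat.mean())    # ~0
print("cov(x1, residuals):     ", np.cov(x1, resid)[0, 1])    # ~0
print("cov(fitted, residuals): ", np.cov(y_hat, resid)[0, 1]) # ~0
# The point of sample means lies on the fitted regression surface:
print("Y_bar - fit at means:   ", y.mean() - np.array([1.0, x1.mean(), x2.mean()]) @ beta)
```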
Consequences of heteroskedasticity on the quality of the OLS estimators
- Under the OLS assumptions, including homoskedasticity, the OLS estimators β̂ⱼ are the best linear unbiased estimators (BLUE) of the population parameters βⱼ.
- Under heteroskedasticity the OLS estimators remain unbiased and consistent, but they are no longer necessarily the ones with the smallest variance, and the usual homoskedasticity-only standard errors are no longer valid.
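In practice, a common response (sketched below with statsmodels on simulated, hypothetical data) is to keep the OLS point estimates but report heteroskedasticity-robust standard errors; when the error variance depends on X, the conventional and robust standard errors can differ noticeably:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2_000
x = rng.uniform(0, 10, size=n)
u = rng.normal(scale=0.5 + 0.5 * x, size=n)    # error variance grows with x
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
fit_conv = sm.OLS(y, X).fit()                  # conventional (homoskedasticity-only) SEs
fit_rob = sm.OLS(y, X).fit(cov_type="HC1")     # heteroskedasticity-robust SEs

print("conventional SE of the slope:", fit_conv.bse[1])
print("robust (HC1) SE of the slope:", fit_rob.bse[1])
```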
When two or more of the regressors are highly correlated (but not perfectly correlated), ___
When two or more of the regressors are highly correlated (but not perfectly correlated), it is hard to estimate the effect of the one variable holding the other constant.
The higher the correlation between X₁ and X₂, the higher the variance of β̂₁. Thus, when multiple regressors are imperfectly collinear, the coefficients on one or more of these regressors will be imprecisely estimated.
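A small Monte Carlo sketch of this point (hypothetical data-generating process with true coefficients set to 1): as corr(X₁, X₂) rises, the sampling standard deviation of β̂₁ rises, even though the estimator remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(4)

def sd_of_beta1(rho, n=200, reps=2_000):
    """Monte Carlo standard deviation of beta_1_hat when corr(X1, X2) = rho."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)  # corr(x1, x2) = rho
        y = 1.0 + x1 + x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(estimates)

for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"corr(X1, X2) = {rho:4.2f}  ->  sd(beta_1_hat) ≈ {sd_of_beta1(rho):.3f}")
```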
overspecification
A model that includes irrelevant variables is called an overspecified model.
The OLS estimators are inconsistent if ___
The OLS estimators are inconsistent if the error is correlated with any of the independent variables.
Why adjusted R-squared?
The adjusted R-squared is introduced in the MLRM because the ordinary R-squared never decreases (and typically increases) when another regressor is added, even if that regressor is irrelevant; the adjusted R-squared imposes a penalty for adding regressors.
Adjusted R-squared =
1 − [(n − 1) / (n − k − 1)] × (SSR / TSS), where n is the sample size and k is the number of regressors
SER in MLRM
SER = √( SSR / (n − k − 1) )
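To make the two formulas concrete, the sketch below (simulated, hypothetical data) computes R², adjusted R², and the SER from SSR and TSS, and shows what happens when an irrelevant regressor is added: R² can only rise, while adjusted R² and SER penalize the lost degree of freedom.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
x_irrelevant = rng.normal(size=n)                  # has no effect on y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def fit_stats(X, y):
    """Return (R^2, adjusted R^2, SER); X must include the intercept column."""
    k = X.shape[1] - 1                             # number of slope coefficients
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ssr / tss
    adj_r2 = 1.0 - (len(y) - 1) / (len(y) - k - 1) * ssr / tss
    ser = np.sqrt(ssr / (len(y) - k - 1))
    return r2, adj_r2, ser

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x_irrelevant])

print("x1 only:           R2 = %.4f  adj R2 = %.4f  SER = %.4f" % fit_stats(X_small, y))
print("x1 + irrelevant x: R2 = %.4f  adj R2 = %.4f  SER = %.4f" % fit_stats(X_big, y))
```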
Pure vs impure heteroskedasticity
- Pure heteroskedasticity is heteroskedasticity in the error term of a correctly specified equation.
- Heteroskedasticity is likely to occur in data sets in which there is a wide disparity between the largest and smallest observed values.
- Impure heteroskedasticity is heteroskedasticity caused by an error in specification, such as an omitted variable.
Dummy variable
A binary (0/1) indicator variable that equals 1 when a qualitative characteristic is present and 0 otherwise, allowing qualitative factors to enter the regression.
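A minimal sketch (hypothetical data and variable names) of constructing a dummy variable and using it as a regressor; with a single dummy and an intercept, the coefficient on the dummy equals the difference in group means of Y:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000
female = (rng.uniform(size=n) < 0.5).astype(int)     # dummy: 1 if "female", 0 otherwise
wage = 20.0 - 2.0 * female + rng.normal(size=n)      # hypothetical wage equation

X = np.column_stack([np.ones(n), female])
beta = np.linalg.lstsq(X, wage, rcond=None)[0]

print("coefficient on the dummy: ", beta[1])                                        # ~ -2
print("difference in group means:", wage[female == 1].mean() - wage[female == 0].mean())
```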