5. Multiple Regression Flashcards
Heteroskedasticity
The property of having a nonconstant variance; refers to an error term with the property that its variance differs across observations.
3 Violations of Regression Assumptions
Heteroskedasticity
Serial Correlation
Multicollinearity
6 Assumptions of Classical Normal Multiple Linear Regression
- The relationship between the dependent variable, Y, and the independent variables, X1, X2, …, Xk, is linear.
- The independent variables (X1, X2, …, Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.
- The expected value of the error term, conditioned on the independent variables, is 0: E(ε | X1, X2, …, Xk) = 0.
- The variance of the error term is the same for all observations.
- The error term is uncorrelated across observations: E(εiεj) = 0, j ≠ i.
- The error term is normally distributed.
F-statistic
To test the null hypothesis that all of the slope coefficients in the multiple regression model are jointly equal to 0 (H0: b1 = b2 = … = bk = 0) against the alternative hypothesis that at least one slope coefficient is not equal to 0, we must use an F-test. The F-test is viewed as a test of the regression’s overall significance. The F-statistic is calculated from four values: the number of observations, n; the number of slope coefficients, k; the regression sum of squares; and the sum of squared errors. It measures how well the regression equation explains the variation in the dependent variable; it is the ratio of the mean regression sum of squares (the regression sum of squares divided by k) to the mean squared error (the sum of squared errors divided by n - k - 1).
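A minimal Python sketch of this calculation, assuming the fitted values from the regression are already available; the function name and inputs are illustrative:

```python
import numpy as np

def regression_f_statistic(y, y_hat, k):
    """F-statistic for H0: all k slope coefficients equal 0.

    y     : observed values of the dependent variable
    y_hat : fitted values from the estimated regression
    k     : number of independent variables (slope coefficients)
    """
    n = len(y)
    rss = np.sum((y_hat - np.mean(y)) ** 2)  # regression (explained) sum of squares
    sse = np.sum((y - y_hat) ** 2)           # sum of squared errors
    msr = rss / k                            # mean regression sum of squares
    mse = sse / (n - k - 1)                  # mean squared error
    return msr / mse                         # compare with F(k, n - k - 1)
```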
Adjusted R^2
A measure of goodness-of-fit of a regression that is adjusted for degrees of freedom and hence does not automatically increase when another independent variable is added to a regression.
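A short Python illustration of the adjustment, using hypothetical sample values:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2),
    where n is the number of observations and k is the number of
    independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r_squared)

# Adding a variable that raises R^2 only slightly can lower adjusted R^2:
print(adjusted_r_squared(0.500, 60, 3))  # about 0.4732
print(adjusted_r_squared(0.505, 60, 4))  # about 0.4690
```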
Dummy Variables
A type of qualitative variable that takes on a value of 1 if a particular condition is true and 0 if that condition is false.
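A brief Python sketch using a hypothetical January-effect dummy:

```python
import numpy as np

# Hypothetical month codes for a series of observations.
months = np.array([1, 2, 3, 1, 12, 1, 6])

# Dummy variable: 1 if the observation falls in January, 0 otherwise.
january_dummy = (months == 1).astype(int)  # -> [1, 0, 0, 1, 0, 1, 0]
```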
Serial Correlation
With reference to regression errors, errors that are correlated across observations.
Multicollinearity
A regression assumption violation that occurs when two or more independent variables (or combinations of independent variables) are highly but not perfectly correlated with each other.
Serial Correlation Remedies
We have two alternative remedial steps when a regression has significant serial correlation. First, we can adjust the coefficient standard errors for the linear regression parameter estimates to account for the serial correlation. Second, we can modify the regression equation itself to eliminate the serial correlation. We recommend using the first method for dealing with serial correlation; the second method may result in inconsistent parameter estimates unless implemented with extreme care.
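A sketch of the first remedy using statsmodels; the data below are simulated with AR(1) errors purely for illustration, and the lag length of 4 is an arbitrary choice:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data standing in for any regression with serially correlated errors.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(120, 2)))
e = np.zeros(120)
for t in range(1, 120):                      # AR(1) errors induce serial correlation
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = X @ np.array([1.0, 0.5, -0.3]) + e

# Remedy 1: keep the OLS coefficient estimates but adjust their standard
# errors for serial correlation (Newey-West / HAC covariance).
fit_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(fit_hac.bse)                           # serial-correlation-consistent standard errors
```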
Multicollinearity Effects
OLS estimates of the regression coefficients remain consistent, but the estimates become extremely imprecise and unreliable.
It becomes practically impossible to distinguish the individual impacts of the independent variables on the dependent variable.
Inflated OLS standard errors for the regression coefficients. With inflated standard errors, t-tests on the coefficients have little power (ability to reject the null hypothesis).
The analyst should be aware that using the magnitude of pairwise correlations among the independent variables to assess multicollinearity, as has occasionally been suggested, is generally not adequate. Very high pairwise correlations among independent variables can indicate multicollinearity, but high pairwise correlations are not a necessary condition for it, and low pairwise correlations do not mean that multicollinearity is not a problem.
The only case in which correlation between independent variables may be a reasonable indicator of multicollinearity occurs in a regression with exactly two independent variables.
The classic symptom of multicollinearity is a high R^2 (and significant F-statistic) even though the t-statistics on the estimated slope coefficients are not significant. The insignificant t-statistics reflect inflated standard errors. Although the coefficients might be estimated with great imprecision, as reflected in low t-statistics, the independent variables as a group may do a good job of explaining the dependent variable, and a high R^2 would reflect this effectiveness.
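A small simulation sketch of this symptom; all data and parameter values below are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # x2 is almost a copy of x1
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.rsquared, fit.fvalue)              # high R^2 and a significant F-statistic
print(fit.tvalues[1:])                       # yet the slope t-statistics are small (inflated SEs)
```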
Serial Correlation Effects
The principal effect is an incorrect estimate of the regression coefficient standard errors computed by statistical software packages. As long as none of the independent variables is a lagged value of the dependent variable (a value of the dependent variable from a previous period), the estimated parameters themselves will be consistent and need not be adjusted for the effects of serial correlation. If, however, one of the independent variables is a lagged value of the dependent variable (for example, if the T-bill return from the previous month were an independent variable in the Fisher effect regression), then serial correlation in the error term will cause all the parameter estimates from linear regression to be inconsistent; they will not be valid estimates of the true parameters.

Although positive serial correlation does not affect the consistency of the estimated regression coefficients, it does affect our ability to conduct valid statistical tests. First, the F-statistic for the overall significance of the regression may be inflated because the mean squared error (MSE) will tend to underestimate the population error variance. Second, positive serial correlation typically causes the ordinary least squares (OLS) standard errors for the regression coefficients to underestimate the true standard errors. As a consequence, standard linear regression analysis will typically compute artificially small standard errors for the regression coefficients, which inflates the estimated t-statistics and suggests significance where perhaps there is none. The inflated t-statistics may, in turn, lead us to incorrectly reject null hypotheses about population values of the parameters more often than we would if the standard errors were correctly estimated. This Type I error could lead to improper investment recommendations.
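A Monte Carlo sketch of the standard-error effect described above; all parameter values below are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

# With positively serially correlated errors (and a persistent regressor),
# the average OLS-reported standard error of the slope understates the true
# sampling spread of the slope estimate.
rng = np.random.default_rng(2)
n, reps, rho = 120, 2000, 0.8
x = np.cumsum(rng.normal(size=n))            # persistent regressor makes the effect visible
X = sm.add_constant(x)

slopes, reported_se = [], []
for _ in range(reps):
    e = np.zeros(n)
    for t in range(1, n):                    # AR(1) errors with rho = 0.8
        e[t] = rho * e[t - 1] + rng.normal()
    fit = sm.OLS(1.0 + 0.5 * x + e, X).fit()
    slopes.append(fit.params[1])
    reported_se.append(fit.bse[1])

print(np.std(slopes))                        # true sampling spread of the slope estimate
print(np.mean(reported_se))                  # average OLS standard error: noticeably smaller
```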
Heteroskedasticity Consequences
Although heteroskedasticity does not affect the consistency of the regression parameter estimators, it can lead to mistakes in inference. When errors are heteroskedastic, the F-test for the overall significance of the regression is unreliable. Furthermore, t-tests for the significance of individual regression coefficients are unreliable because heteroskedasticity introduces bias into estimators of the standard error of regression coefficients. If a regression shows significant heteroskedasticity, the standard errors and test statistics computed by regression programs will be incorrect unless they are adjusted for heteroskedasticity.
Heteroskedasticity Remedies
We can use two different methods to correct the effects of conditional heteroskedasticity in linear regression models. The first method, computing robust standard errors, corrects the standard errors of the linear regression model’s estimated coefficients to account for the conditional heteroskedasticity. The second method, generalized least squares, modifies the original equation in an attempt to eliminate the heteroskedasticity. The new, modified regression equation is then estimated under the assumption that heteroskedasticity is no longer a problem.
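A sketch of both remedies using statsmodels, on simulated data whose error variance grows with the independent variable; the weights 1/x^2 used for the second remedy are one possible generalized least squares specification, chosen here only because the simulated error standard deviation is proportional to x:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with conditional heteroskedasticity: error spread grows with x.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5 * x)
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                      # conventional (unadjusted) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")     # remedy 1: robust (White) standard errors
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # remedy 2: reweight observations to equalize error variance
print(ols.bse, robust.bse, wls.bse)
```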
Unconditional Heteroskedasticity
Heteroskedasticity of the error term that is not correlated with the values of the independent variable(s) in the regression.
Conditional Heteroskedasticity
Heteroskedasticity in the error variance that is correlated with the values of the independent variable(s) in the regression. Most problematic.