Multiple Regression Flashcards
We can use multiple regression models to:
1 - Identify relationships between variables
2 - Forecast variables
3 - Test existing theories
The general multiple linear regression model is:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi
The residual, εi, is
the difference between the observed value, Yi, and the value predicted by the regression, Ŷi
The p-value is
the smallest level of significance for which the null hypothesis can be rejected
If the p-value is less than the significance level
the null hypothesis can be rejected
If the p-value is greater than the significance level
the null hypothesis cannot be rejected.
intercept term
is the value of the dependent variable when the independent variables are all equal to zero.
Assumptions underlying a multiple regression model include
- a linear relationship exists between X and Y.
- the residuals are normally distributed.
- the variance of the error terms is constant across observations.
- the residuals are not correlated with each other.
- the independent variables are not random, and there is no exact linear relation between any of the independent variables.
R2
evaluates the overall effectiveness of the entire set of independent variables in explaining the dependent variable
R2 = (total variation − unexplained variation) / total variation
R2 = (SST − SSE) / SST
R2 = explained variation / total variation
R2 = RSS / SST
Adjusted R2
R2a = 1 − [((n − 1) / (n − k − 1)) × (1 − R2)]
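A minimal numpy sketch of the two formulas above, on synthetic data (all variable names here are illustrative, not from the source):

```python
# Compute R^2 and adjusted R^2 from sums of squares (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2                                   # observations, independent variables
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])          # add intercept column
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)     # OLS coefficients
y_hat = Xd @ b

sst = np.sum((y - y.mean()) ** 2)              # total variation
sse = np.sum((y - y_hat) ** 2)                 # unexplained variation
rss = np.sum((y_hat - y.mean()) ** 2)          # explained variation

r2 = rss / sst                                 # = (SST - SSE) / SST
adj_r2 = 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
```

Because SST = SSE + RSS for OLS with an intercept, the two R2 expressions agree; adjusted R2 is always below R2 when k ≥ 1 and R2 < 1.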
AIC
preferred criterion when the model is used for forecasting; lower values indicate a better model
BIC
preferred criterion when evaluating goodness of fit; penalizes additional variables more heavily, and lower values indicate a better model
nested models
one model, called the full model or unrestricted model, has a higher number of independent variables; the other model uses only a subset of them
Restricted model
subset of the independent variables
F-statistic
F = [(SSER − SSEU) / q] / [SSEU / (n − k − 1)]
For the overall significance test (restricted model = intercept only, so q = k):
F = (RSSU / k) / (SSEU / (n − k − 1))
reject H0 if
F (test-statistic) > Fc (critical value)
F-test evaluates whether
the relative decrease in SSE due to the inclusion of q additional variables is statistically justified.
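A numpy sketch of the nested-model F-test above, on synthetic data (names and coefficients are illustrative):

```python
# Nested-model F-test: compare SSE of restricted vs. unrestricted OLS fits.
import numpy as np

rng = np.random.default_rng(1)
n, k, q = 100, 3, 2
X = rng.normal(size=(n, k))
y = 0.5 + 1.0 * X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(size=n)

def sse_of(X, y):
    """Sum of squared residuals from an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    return resid @ resid

sse_u = sse_of(X, y)           # unrestricted model: all k variables
sse_r = sse_of(X[:, :1], y)    # restricted model: drops q = 2 variables

F = ((sse_r - sse_u) / q) / (sse_u / (n - k - 1))
# Reject H0 if F exceeds the critical value F_c from an
# F table with (q, n - k - 1) degrees of freedom.
```

Since the restricted model is a subset of the unrestricted one, SSER ≥ SSEU always, so F is nonnegative; here the dropped variables are relevant, so F is large.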
Regression model specification
selection of the explanatory (independent) variables to be included in a model
Examples of Misspecification of Functional Form
Misspecification #1: Omitting a Variable
Misspecification #2: Variable Should Be Transformed
Misspecification #3: Inappropriate Scaling of the Variable
Misspecification #4: Incorrectly Pooling Data
Omission of important independent variable(s) effect
biased and inconsistent regression parameters; may also induce serial correlation or heteroskedasticity in the residuals
Inappropriate variable form effect
heteroskedasticity
Inappropriate variable scaling effect
heteroskedasticity or multicollinearity
Data improperly pooled effect
heteroskedasticity or serial correlation
Omission of important independent variable(s)
one or more variables that should have been included are omitted.
Inappropriate variable form
The relationship between the dependent and independent variables may be non-linear.
Inappropriate variable scaling
Variables may need to be transformed
Data improperly pooled
Sample has periods of dissimilar economic environments
Heteroskedasticity
variance of the residuals is not the same across all observations
Unconditional heteroskedasticity
heteroskedasticity is not related to the level of the independent variables
Conditional heteroskedasticity
related to (i.e., conditional on) the level of the independent variables
Effect of Heteroskedasticity on Regression Analysis
1 - standard errors of the coefficients are unreliable (typically understated), making Type I errors more likely.
2 - the F-test is unreliable.
Detecting Heteroskedasticity
scatter plots
Breusch Pagan chi-square (χ2) test
BP chi-square test statistic=
n × R2resid, with k degrees of freedom, where R2resid is the R2 from a second regression of the squared residuals on the independent variables
n=the number of observations
k=the number of independent variables
To correct for conditional heteroskedasticity
robust standard errors, used to recalculate the t-statistics
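The Breusch–Pagan statistic above can be sketched in numpy on synthetic heteroskedastic data (the data-generating process is illustrative):

```python
# Breusch-Pagan test statistic: n * R^2 from regressing squared
# residuals on the independent variables.
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 2
X = rng.normal(size=(n, k))
# Conditional heteroskedasticity: error std deviation rises with X1
y = 1.0 + X[:, 0] + rng.normal(size=n) * np.exp(0.5 * X[:, 0])

# First regression: y on X, keep the residuals
Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ b

# Second regression: squared residuals on the same independent variables
e2 = resid ** 2
b2, *_ = np.linalg.lstsq(Xd, e2, rcond=None)
r2_resid = 1 - np.sum((e2 - Xd @ b2) ** 2) / np.sum((e2 - e2.mean()) ** 2)

bp_stat = n * r2_resid   # compare to chi-square with k degrees of freedom
```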
Serial correlation / autocorrelation
regression residual terms are correlated with one another
Effect of Serial Correlation
Consider a model that employs a lagged value of the dependent variable as one of the independent variables. Residual autocorrelation in such a model causes the estimates of the slope coefficients to be inconsistent. If the model does not have any lagged dependent variables, then the estimates of the slope coefficient will be consistent.
Effect on Standard Errors (Serial Correlation)
Positive serial correlation results in coefficient standard errors that are too small, causing t-statistics (and F-statistic) to be larger than they should be, leads to Type I errors.
Detecting Serial Correlation
Durbin–Watson (DW) statistic
Breusch–Godfrey (BG) test
Breusch–Godfrey (BG) test
regresses the regression residuals against the original set of independent variables, plus one or more additional variables representing lagged residual(s)
Correction for Serial Correlation
robust standard errors used to recalculate the t-statistics using the original regression coefficients.
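A quick numpy sketch of the Durbin–Watson statistic mentioned above, applied to simulated residuals with positive serial correlation (the AR(1) setup is illustrative):

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
import numpy as np

rng = np.random.default_rng(3)
# AR(1) residuals with positive serial correlation (rho = 0.8)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.8 * e[t - 1] + rng.normal()

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
# DW near 2: no first-order serial correlation;
# near 0: positive correlation; near 4: negative correlation.
```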
Multicollinearity
two or more of the independent variables in a multiple regression are highly correlated with each other.
Effect of Multicollinearity on Model Parameters
slope coefficients are imprecise and unreliable
inflates standard errors and lowers t-stats.
Effect on Standard Errors (Multicollinearity)
Standard errors are too high. This leads to Type II errors.
Detection of Multicollinearity
variance inflation factor (VIF)
variance inflation factor (VIF) formula
VIFj = 1 / (1 – R2j)
VIF values
greater than 5 (i.e., R2j > 80%) warrants further investigation;
above 10 (i.e., R2j > 90%) indicates severe multicollinearity.
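A numpy sketch of the VIF formula above, with one variable deliberately built as a near-copy of another to induce multicollinearity (all names illustrative):

```python
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing X_j
# on the remaining independent variables.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # highly correlated with x1
x3 = rng.normal(size=n)                   # unrelated to x1, x2
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor for column j of X."""
    others = np.delete(X, j, axis=1)
    Xd = np.column_stack([np.ones(len(X)), others])
    b, *_ = np.linalg.lstsq(Xd, X[:, j], rcond=None)
    resid = X[:, j] - Xd @ b
    r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2_j)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

The first two VIFs come out well above 10 (severe multicollinearity), while the VIF for the independent x3 stays near 1.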
Correction for Multicollinearity
omit one or more of the correlated independent variables
increase sample size
We can identify outliers using
studentized residuals
studentized residuals steps
1 - delete one observation at a time and re-estimate the regression on the remaining observations.
2 - use the re-estimated model to compute the residual for the deleted observation.
3 - divide that residual by its estimated standard error; values beyond the critical t-value flag the observation as an outlier.
Detecting Influential Data Points
Cook’s distance
Cook’s distance
composite metric (i.e., it takes into account both the leverage and outliers) for evaluating if a specific observation is influential.
Cooks Distance Formula
Di = [e2i / (k × MSE)] × [hi / (1 − hi)2]
k= number of independent variables
MSE = mean square error of the regression model
h= leverage value
Cooks Distance (Di values)
greater than √(k/n) indicate an influential data point
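A numpy sketch of Cook's distance per the formula above, with one observation deliberately shifted so it shows up as influential (data and names illustrative):

```python
# Cook's distance: D_i = [e_i^2 / (k * MSE)] * [h_i / (1 - h_i)^2],
# where h_i is the leverage (diagonal of the hat matrix).
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 2
X = rng.normal(size=(n, k))
y = 1.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)
y[0] += 10.0                                   # plant an influential outlier

Xd = np.column_stack([np.ones(n), X])
H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T       # hat matrix; diagonal = leverage
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
e = y - Xd @ b
mse = (e @ e) / (n - k - 1)                    # MSE = SSE / (n - k - 1)
h = np.diag(H)

D = (e ** 2 / (k * mse)) * (h / (1 - h) ** 2)  # Cook's distance
flag = D > np.sqrt(k / n)                      # influential if D_i > sqrt(k/n)
```

The shifted first observation gets flagged; the rest typically fall below the √(k/n) threshold.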
Logistic regression (logit) models
ln(p / (1 − p)) = b0 + b1X1 + b2X2 + … + ε
likelihood ratio (LR)
LR = −2 × (log likelihood of restricted model − log likelihood of unrestricted model)
likelihood ratio (LR) values
LR values are negative; values closer to 0 indicate a better-fitting restricted model.
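A short numpy sketch of the log-odds (logit) transform that forms the left-hand side of the model above (the inverse-logit helper is illustrative):

```python
# The logit transform ln(p / (1 - p)) maps probabilities in (0, 1)
# onto the whole real line; p = 0.5 maps to 0.
import numpy as np

p = np.array([0.1, 0.5, 0.9])
log_odds = np.log(p / (1 - p))

def inv_logit(z):
    """Inverse of the logit: recovers p from the log-odds."""
    return 1 / (1 + np.exp(-z))

p_back = inv_logit(log_odds)
```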
high-leverage points
extreme observations of the independent or ‘X’ variables
Influential data points
extreme observations that when excluded cause a significant change in model coefficients
Qualitative independent variables (dummy variables)
capture the effect of a binary (0 or 1) independent variable on the dependent variable