Quantitative Methods Flashcards
Multiple Regression
A model that allows for consideration of multiple underlying influences (independent variables) on the dependent variable.
What is multiple regression used for?
- Identify relationships between variables
- Forecast variables
- Test existing theories
Multiple Regression model
The general multiple linear regression model is:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi
where:
Yi = ith observation of the dependent variable Y, i = 1, 2, …, n
Xj = jth independent variable, j = 1, 2, …, k
Xji = ith observation of the jth independent variable
b0 = intercept term
bj = slope coefficient for each of the independent variables
εi = error term for the ith observation
n = number of observations
k = number of independent variables
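As a concrete illustration of this notation, here is a minimal Python sketch (statsmodels, with simulated data and made-up coefficients) that fits a model of this form:

```python
# Minimal sketch: fit Y = b0 + b1*X1 + b2*X2 + error on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 46                                   # number of observations
X = rng.normal(size=(n, 2))              # two independent variables (k = 2)
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

X = sm.add_constant(X)                   # adds the column for the intercept b0
results = sm.OLS(y, X).fit()             # ordinary least squares estimates
print(results.params)                    # estimated b0, b1, b2
```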
For Level II, to interpret regression results we can use the p-value (as an alternative to comparing a t-statistic with a critical value) to evaluate the null hypothesis that a slope coefficient is equal to zero.
The p-value is the smallest level of significance for which the null hypothesis can be rejected. We test the significance of coefficients by comparing the p-value to the chosen significance level:
If the p-value is less than the significance level, the null hypothesis can be rejected.
If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
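Continuing the sketch above (the fitted `results` object is purely illustrative), the coefficient p-values can be compared with the chosen significance level directly:

```python
# Compare each coefficient's p-value with the chosen significance level.
alpha = 0.05
for name, p in zip(results.model.exog_names, results.pvalues):
    decision = "reject H0: coefficient = 0" if p < alpha else "fail to reject H0"
    print(f"{name}: p-value = {p:.4f} -> {decision}")
```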
Formulating the Multiple Regression Equation
The authors formulated the following regression equation using annual data (46 observations):
EG10 = b0 + b1PR + b2YCS + ε
where EG10 is the subsequent 10-year real earnings growth rate, PR is the dividend payout ratio, and YCS is the yield curve slope.
The results of this regression are shown in the table below.
Coefficient and Standard Error Estimates for Regression of EG10 on PR and YCS
|           | Coefficient | Standard Error |
|-----------|-------------|----------------|
| Intercept | –11.6%      | 1.657%         |
| PR        | 0.25        | 0.032          |
| YCS       | 0.14        | 0.280          |
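One way to gauge significance from a table like this is to form t-statistics as coefficient divided by standard error. A quick sketch using the values above (assuming the usual n − k − 1 = 46 − 2 − 1 = 43 degrees of freedom):

```python
# t-statistics implied by the coefficient table (coefficient / standard error).
from scipy import stats

estimates = {"Intercept": (-11.6, 1.657), "PR": (0.25, 0.032), "YCS": (0.14, 0.280)}
df = 46 - 2 - 1                              # n - k - 1 degrees of freedom
for name, (coef, se) in estimates.items():
    t = coef / se
    p = 2 * stats.t.sf(abs(t), df)           # two-tailed p-value
    print(f"{name}: t = {t:.2f}, p = {p:.4f}")
```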
Intercept Term
is the value of the dependent variable when the independent variables are all equal to zero.
Intercept term: If the dividend payout ratio is zero and the slope of the yield curve is zero, we would expect the subsequent 10-year real earnings growth rate to be –11.6%.
partial slope coefficients
Multiple regression is sometimes called this because each slope coefficient is the estimated change in the dependent variable for a 1-unit change in that independent variable, holding the other independent variables constant.
PR coefficient: If the payout ratio increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.25%, holding YCS constant.
YCS coefficient
If the yield curve slope increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.14%, holding PR constant.
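Putting the interpretations together, a fitted value is the intercept plus each slope coefficient times its variable. A small sketch using the estimated equation (PR = 60% and YCS = 1% are made-up inputs for illustration):

```python
# Fitted value from the estimated equation EG10 = -11.6 + 0.25*PR + 0.14*YCS
# (all quantities in percent).
def predicted_eg10(pr: float, ycs: float) -> float:
    return -11.6 + 0.25 * pr + 0.14 * ycs

print(predicted_eg10(pr=60.0, ycs=1.0))      # hypothetical inputs -> 3.54 (%)
```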
Q-Q Plot
A normal Q-Q plot (often called simply a Q-Q plot) is used to compare a variable's distribution to a normal distribution. We can use a Q-Q plot to evaluate the standardized residuals of a regression model: if they are normally distributed, the points should lie along the diagonal. Recall that 5% of normally distributed observations should fall below –1.65 standard deviations.
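As an illustration, a Q-Q plot of the standardized residuals can be produced with statsmodels (assuming a fitted OLS result such as the illustrative `results` object from the earlier sketch):

```python
# Q-Q plot of standardized residuals against the normal distribution.
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = results.resid                        # residuals from the fitted model
standardized = (resid - resid.mean()) / resid.std()
sm.qqplot(standardized, line="45")           # points should hug the 45-degree line
plt.show()
```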
Coefficient of Determination, R2
R2 evaluates the overall effectiveness of the entire set of independent variables in explaining the dependent variable.
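As a reminder of the standard definition:
R2 = explained variation / total variation = RSS / SST = 1 − SSE / SST
where RSS is the regression (explained) sum of squares, SSE is the sum of squared errors (residuals), and SST = RSS + SSE is the total sum of squares.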
ANOVA TABLE
The results of the ANOVA procedure are presented in an ANOVA table, which accompanies a multiple regression output.
Analysis of variance (ANOVA)
Is a statistical test that compares the means of more than two groups and separates the variability into random and systematic factors.
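For a regression with k independent variables and n observations, the ANOVA table is typically laid out as below; the F-statistic, F = MSR / MSE, tests whether all slope coefficients are jointly equal to zero.

| Source     | df        | Sum of Squares | Mean Sum of Squares     |
|------------|-----------|----------------|-------------------------|
| Regression | k         | RSS            | MSR = RSS / k           |
| Error      | n − k − 1 | SSE            | MSE = SSE / (n − k − 1) |
| Total      | n − 1     | SST            |                         |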
Heteroskedasticity
occurs when the variance of the residuals is not the same across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample.
Overfitting
is a concept in data science that occurs when a statistical model fits its training data too closely (in the extreme, exactly). When this happens, the model cannot perform accurately on unseen data, defeating its purpose. Broadly speaking, overfitting means the training has focused so heavily on the particular training set that the model misses the broader pattern and cannot adapt to new data.
Unconditional heteroskedasticity
occurs when the heteroskedasticity is not related to the level of the independent variables, which means that it doesn’t systematically increase or decrease with changes in the value of the independent variable(s). While this is a violation of the equal variance assumption, it usually causes no major problems with the regression.
Nested Models
Models in which one model, called the full or unrestricted model, includes the complete set of independent variables, while another model, called the restricted model, includes only a subset of those independent variables.
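Nested models are typically compared with a joint F-test on the q excluded variables (stated here for reference):
F = [(SSE_restricted − SSE_unrestricted) / q] / [SSE_unrestricted / (n − k − 1)]
where k is the number of independent variables in the unrestricted model. A significant F-statistic suggests that the excluded variables jointly add explanatory power.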
Conditional heteroskedasticity
is heteroskedasticity that is related to (i.e., conditional on) the level of the independent variables. For example, conditional heteroskedasticity exists if the variance of the residual term increases as the value of the independent variable increases: the residual variance associated with larger values of the independent variable, X, is larger than the residual variance associated with smaller values of X. Conditional heteroskedasticity does create significant problems for statistical inference.
Effect of Conditional Heteroskedasticity on Regression Analysis
There are two effects of conditional heteroskedasticity that you should be aware of:
- The standard errors are usually unreliable estimates. (For financial data, these standard errors are usually underestimated, resulting in Type I errors.)
- The F-test for the overall model is also unreliable.
Breusch-Pagan (BP) test
Used to detect conditional heteroskedasticity. The BP test calls for the squared residuals (as the dependent variable) to be regressed on the original set of independent variables.
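Under the null of no conditional heteroskedasticity, the BP statistic, n × R² from this auxiliary regression, is chi-square distributed with k degrees of freedom. A quick sketch with statsmodels (the fitted `results` object is the illustrative one from the earlier sketches):

```python
# Breusch-Pagan test: regress squared residuals on the original regressors.
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print(f"BP statistic = {lm_stat:.3f}, p-value = {lm_pvalue:.4f}")

# Equivalent "by hand" version: n * R-squared from the auxiliary regression.
aux = sm.OLS(results.resid ** 2, results.model.exog).fit()
print(f"Manual BP statistic = {aux.nobs * aux.rsquared:.3f}")
```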
Serial correlation
Also known as autocorrelation, refers to a situation in which regression residual terms are correlated with one another; that is, they are not independent. Serial correlation can pose a serious problem in regressions that use time series data.
NOTE: Serial correlation observed in financial data (not residuals, which is our discussion here) indicates a pattern that can be modeled. This idea is covered in our reading on time series analysis.
Positive serial correlation
exists when a positive residual in one time period increases the probability of observing a positive residual in the next time period.
Negative serial correlation
occurs when a positive residual in one period increases the probability of observing a negative residual in the next period.
Breusch-Godfrey (BG) test
The BG test regresses the regression residuals against the original set of independent variables, plus one or more additional variables representing lagged residuals. It can therefore detect serial correlation at multiple lags.
Durbin-Watson (DW) statistic
Residual serial correlation at a single lag can be detected using the Durbin-Watson (DW) statistic.
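Both checks are available in statsmodels; a short sketch, again assuming the illustrative fitted `results` object:

```python
# Detecting serial correlation in the residuals of a fitted OLS model.
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
from statsmodels.stats.stattools import durbin_watson

# Breusch-Godfrey test with, e.g., 2 lagged residuals; null = no serial correlation.
bg_stat, bg_pvalue, _, _ = acorr_breusch_godfrey(results, nlags=2)
print(f"BG statistic = {bg_stat:.3f}, p-value = {bg_pvalue:.4f}")

# Durbin-Watson statistic: values near 2 suggest no first-order serial correlation.
print(f"DW statistic = {durbin_watson(results.resid):.3f}")
```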