Quantitative Methods Flashcards
Multiple Regression
A model that allows for consideration of multiple underlying influences (independent variables) on the dependent variable.
What is multiple regression used for?
- Identify relationships between variables
- Forecast Variables
- Test existing theories
Multiple Regression model
The general multiple linear regression model is:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi
where:
Yi= ith observation of the dependent variable Y, i = 1, 2, …, n
Xj= independent variables, j = 1, 2, …, k
Xji= ith observation of the jth independent variable
b0= intercept term
bj= slope coefficient for each of the independent variables
εi= error term for the ith observation
n= number of observations
k= number of independent variables
For Level II, in order to interpret regression results, we can alternatively use the p-value to evaluate the null hypothesis that a slope coefficient is equal to zero.
The p-value is the smallest level of significance for which the null hypothesis can be rejected. We test the significance of coefficients by comparing the p-value to the chosen significance level:
If the p-value is less than the significance level, the null hypothesis can be rejected.
If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
Formulating the Multiple Regression Equation
The authors formulated the following regression equation using annual data (46 observations):
EG10 = b0 + b1PR + b2YCS + ε
The results of this regression are shown in Coefficient and Standard Error Estimates for Regression of EG10 on PR and YCS.
Coefficient and Standard Error Estimates for Regression of EG10 on PR and YCS
Variable     Coefficient   Standard Error
Intercept    –11.6%        1.657%
PR           0.25          0.032
YCS          0.14          0.280
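The significance of each estimate can be checked by dividing the coefficient by its standard error. The sketch below uses the coefficients and standard errors from the table; the critical value of about 2.017 (two-tailed, 5% significance, 46 − 2 − 1 = 43 degrees of freedom) is an assumption taken from a standard t-table, not from the cards.

```python
# Coefficients and standard errors copied from the table above.
estimates = {
    "Intercept": (-11.6, 1.657),
    "PR": (0.25, 0.032),
    "YCS": (0.14, 0.280),
}

# Assumed two-tailed 5% critical t-value for n - k - 1 = 43 degrees of freedom.
CRITICAL_T = 2.017


def t_statistic(coefficient, standard_error):
    """t-statistic for the null hypothesis that the coefficient equals zero."""
    return coefficient / standard_error


t_stats = {name: t_statistic(b, se) for name, (b, se) in estimates.items()}
significant = {name: abs(t) > CRITICAL_T for name, t in t_stats.items()}

for name in estimates:
    print(f"{name}: t = {t_stats[name]:.2f}, significant = {significant[name]}")
```

Under these assumptions the intercept and PR are significantly different from zero, while YCS is not.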
Intercept Term
is the value of the dependent variable when the independent variables are all equal to zero.
Intercept term: If the dividend payout ratio is zero and the slope of the yield curve is zero, we would expect the subsequent 10-year real earnings growth rate to be –11.6%.
partial slope coefficients
Slope coefficients in a multiple regression are sometimes called this because each slope coefficient is the estimated change in the dependent variable for a 1-unit change in that independent variable, holding the other independent variables constant.
PR coefficient: If the payout ratio increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.25%, holding YCS constant.
YCS coefficient
If the yield curve slope increases by 1%, we would expect the subsequent 10-year earnings growth rate to increase by 0.14%, holding PR constant.
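As a rough illustration of how the estimated equation produces a forecast, the sketch below plugs values into EG10 = b0 + b1PR + b2YCS using the coefficients from the table. The input values (a payout ratio of 50% and a yield curve slope of 2%) are hypothetical and chosen only for the arithmetic.

```python
def predict_eg10(pr, ycs, b0=-11.6, b1=0.25, b2=0.14):
    """Forecast 10-year real earnings growth (in %) from PR and YCS (in %)."""
    return b0 + b1 * pr + b2 * ycs


# Hypothetical inputs: payout ratio of 50%, yield curve slope of 2%.
forecast = predict_eg10(pr=50.0, ycs=2.0)

# Partial slope: raise PR by one unit while holding YCS constant.
pr_effect = predict_eg10(pr=51.0, ycs=2.0) - forecast

print(f"Forecast EG10: {forecast:.2f}%")
print(f"Change from a 1-unit increase in PR: {pr_effect:.2f}%")
```

The second number reproduces the PR coefficient, which is what "holding YCS constant" means in practice.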
Q-Q Plot
A normal Q-Q plot (normally called simply a Q-Q plot) is used to compare a variable’s distribution to that of a normal distribution. We can employ a Q-Q plot to evaluate the standardized residuals of a regression model: the residuals should lie along a diagonal if they follow a normal distribution. Recall that 5% of normally distributed observations should be below –1.65 standard deviations.
Coefficient of Determination, R²
R² evaluates the overall effectiveness of the entire set of independent variables in explaining the dependent variable.
ANOVA TABLE
The results of the ANOVA procedure are presented in an ANOVA table, which accompanies a multiple regression output.
Analysis of variance (ANOVA)
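Is a statistical procedure that separates the variability of a variable into systematic (explained) and random (unexplained) factors. In its classic form it compares the means of more than two groups; in regression it splits the total variation of the dependent variable into the part explained by the model and the part left in the residuals.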
Heteroskedasticity
occurs when the variance of the residuals is not the same across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample.
Overfitting
Is a concept in data science, which occurs when a statistical model fits its training data too closely. When this happens, the model cannot perform accurately against unseen data, defeating its purpose. Broadly speaking, overfitting means the training has focused on the particular training set so much that the model has learned its noise rather than the underlying relationship, so it is not able to adapt to new data.
Unconditional heteroskedasticity
occurs when the heteroskedasticity is not related to the level of the independent variables, which means that it doesn’t systematically increase or decrease with changes in the value of the independent variable(s). While this is a violation of the equal variance assumption, it usually causes no major problems with the regression.
Nested Models
models such that one model, called the full model or unrestricted model, has a higher number of independent variables while another model, called the restricted model, has only a subset of the independent variables.
Conditional heteroskedasticity
is heteroskedasticity that is related to (i.e., conditional on) the level of the independent variables. For example, conditional heteroskedasticity exists if the variance of the residual term increases as the value of the independent variable increases, as shown in Conditional Heteroskedasticity.
Conditional Heteroskedasticity
In the figure, the residual variance associated with the larger values of the independent variable, X, is larger than the residual variance associated with the smaller values of X. Conditional heteroskedasticity does create significant problems for statistical inference.
Effect of Conditional Heteroskedasticity on Regression Analysis
There are two effects of conditional heteroskedasticity that you should be aware of:
- The standard errors are usually unreliable estimates. (For financial data, these standard errors are usually underestimated, resulting in Type I errors.)
- The F-test for the overall model is also unreliable.
Breusch-Pagan (BP) test
Used to detect conditional heteroskedasticity. The BP test calls for the squared residuals (as the dependent variable) to be regressed on the original set of independent variables.
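A common form of the BP test statistic, which the card does not state, is n × R² from the regression of squared residuals on the independent variables, compared against a chi-square critical value with k degrees of freedom. The sketch below assumes that form; the R² values are hypothetical and the critical values come from a standard chi-square table.

```python
# 5% chi-square critical values for 1 to 3 degrees of freedom (standard table).
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}


def breusch_pagan_statistic(n_obs, r_squared_aux):
    """BP statistic: n times the R-squared of the squared-residual regression."""
    return n_obs * r_squared_aux


def has_conditional_heteroskedasticity(n_obs, r_squared_aux, k):
    """Reject the null of no conditional heteroskedasticity at the 5% level."""
    return breusch_pagan_statistic(n_obs, r_squared_aux) > CHI2_CRITICAL_5PCT[k]


# Hypothetical example: 46 observations, 2 independent variables.
bp = breusch_pagan_statistic(46, 0.15)
reject = has_conditional_heteroskedasticity(46, 0.15, k=2)
print(f"BP statistic = {bp:.2f}, reject null = {reject}")
```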
Serial correlation
Also known as autocorrelation, refers to a situation in which regression residual terms are correlated with one another; that is, they are not independent. Serial correlation can pose a serious problem with regressions using time series data.
NOTE: Serial correlation observed in financial data (not residuals, which is our discussion here) indicates a pattern that can be modeled. This idea is covered in our reading on time series analysis.
Positive serial correlation
exists when a positive residual in one time period increases the probability of observing a positive residual in the next time period.
Negative serial correlation
occurs when a positive residual in one period increases the probability of observing a negative residual in the next period.
Durbin-Watson (DW) statistic
Residual serial correlation at a single lag can be detected using the Durbin-Watson (DW) statistic.
Breusch-Godfrey (BG) test
The BG test regresses the regression residuals against the original set of independent variables, plus one or more additional variables representing lagged residuals.
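For the Durbin-Watson statistic, the usual formula (not given on the card) is the sum of squared changes in consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order serial correlation, values well below 2 suggest positive serial correlation, and values well above 2 suggest negative serial correlation. The residual series below are hypothetical.

```python
def durbin_watson(residuals):
    """DW = sum of squared first differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: one series that drifts slowly, one that flips sign.
persistent = [1.0, 0.8, 0.6, -0.5, -0.7, -0.9]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)

print(f"Persistent residuals:  DW = {dw_persistent:.2f}")
print(f"Alternating residuals: DW = {dw_alternating:.2f}")
```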
robust standard errors
Standard errors (also called Newey–West corrected standard errors or heteroskedasticity-consistent standard errors) used to correct for serial correlation in regression residuals.
Multicollinearity
refers to the condition when two or more of the independent variables (or linear combinations of three or more independent variables) in a multiple regression are highly correlated with each other. This condition inflates standard errors and lowers t-statistics.
variance inflation factor (VIF)
We can quantify multicollinearity using the variance inflation factor (VIF) for each of the independent variables. We start by regressing one independent variable, j, against the remaining independent variables and take the R² of that regression (Rj²). Then VIFj = 1 / (1 − Rj²).
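The VIF calculation can be sketched as follows, using the standard formula VIFj = 1 / (1 − Rj²), where Rj² is the R² from regressing independent variable j on the other independent variables. The Rj² inputs are hypothetical; a VIF of 1 means no correlation with the other variables, and larger values indicate more multicollinearity.

```python
def variance_inflation_factor(r_squared_j):
    """VIF for variable j, given the R-squared of j regressed on the other X's."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("r_squared_j must be in the range [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


# Hypothetical auxiliary R-squared values for three independent variables.
vifs = {r2: variance_inflation_factor(r2) for r2 in (0.0, 0.8, 0.9)}

for r2, vif in vifs.items():
    print(f"Rj^2 = {r2:.1f} -> VIF = {vif:.1f}")
```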
high-leverage points
are the extreme observations of the independent (or ‘X’) variables.
Outliers
are extreme observations of the dependent (or ‘Y’) variable
Leverage
is a measure of the distance between the ith observation of an independent variable and that variable's sample mean. Leverage takes a value between 0 and 1. The higher the leverage, the greater the distance, and hence the higher the potential influence of the observation on the estimated regression parameters.
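For a regression with a single independent variable, leverage has a simple closed form: h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The sketch below assumes that one-variable case (the card does not give a formula) and uses hypothetical X values in which the last observation sits far from the mean.

```python
def leverage(xs):
    """Leverage of each observation in a one-variable regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    sum_sq_dev = sum((x - mean_x) ** 2 for x in xs)
    return [1.0 / n + (x - mean_x) ** 2 / sum_sq_dev for x in xs]


# Hypothetical X values; the last one is far from the rest.
xs = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverage(xs)

for x, h_i in zip(xs, h):
    print(f"x = {x:5.1f}  leverage = {h_i:.3f}")
```

The extreme X value gets a leverage close to 1, while the clustered observations stay near 1/n.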
Influential data points
are extreme observations that, when excluded, cause a significant change to model coefficients.
studentized residuals
Used to identify outliers
Cook’s distance (Di)
is a composite metric (i.e., it takes into account both the leverage and outliers) for evaluating if a specific observation is influential.
influence plot
visually shows all three metrics (leverage, studentized residuals, and Cook’s distance) for each observation.
dummy variables
A dummy variable is a variable that takes values of 0 and 1, where the values indicate the presence or absence of something (e.g., a 0 may indicate a placebo and 1 may indicate a drug).
intercept dummy
An intercept dummy variable is a dummy variable that shifts the constant term in a regression model. It allows for a change in the intercept between different groups.
linear trend
is a time series pattern that can be graphed using a straight line. A downward sloping line indicates a negative trend, while an upward sloping line indicates a positive trend
Trend
A time series has a trend if a consistent pattern can be seen by plotting the data on a graph.
Time Series
A set of observations for a variable over successive periods of time (e.g., monthly stock market returns for the past 10 years)
slope dummy
A slope dummy variable is a dummy variable that changes the slope of the relationship between y and x.
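Intercept and slope dummies can be combined in one equation, Y = b0 + d0·D + b1·X + d1·(D·X), so that the group with D = 1 gets intercept b0 + d0 and slope b1 + d1. The sketch below illustrates this; all coefficient values are hypothetical.

```python
def predict_with_dummies(x, d, b0, b1, d0, d1):
    """Prediction with an intercept dummy (d0) and a slope dummy (d1).

    d is the dummy variable and must be 0 or 1.
    """
    intercept = b0 + d0 * d  # intercept dummy shifts the constant term
    slope = b1 + d1 * d      # slope dummy changes the effect of x
    return intercept + slope * x


# Hypothetical coefficients.
coeffs = dict(b0=1.0, b1=0.5, d0=2.0, d1=0.25)

base_group = predict_with_dummies(x=4.0, d=0, **coeffs)
dummy_group = predict_with_dummies(x=4.0, d=1, **coeffs)

print(f"D = 0: {base_group}")
print(f"D = 1: {dummy_group}")
```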
qualitative dependent variable
a categorical variable, usually a binary variable, which takes on a value of either zero or one. An example of an application requiring the use of a qualitative dependent variable is a model that attempts to estimate the probability of default for a bond issuer. In this case, the dependent variable may take on a value of one in the event of default and zero in the event of no default.
Linear vs. Log-Linear Trend Models
shows a time series that is best modeled with a log-linear trend model rather than a linear trend model.
When should you use Logistic Regression Models?
If the dependent Y variable is discrete
If our dependent Y variable is qualitative (categorical), such as default vs. no default
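A logistic regression models the log-odds of the event as a linear function of the independent variables, and the logistic function converts those log-odds into a probability between 0 and 1. The sketch below shows that conversion for a default-probability model; the intercept, slopes, and inputs are hypothetical.

```python
import math


def logistic_probability(log_odds):
    """Convert log-odds into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def default_probability(x_values, intercept, slopes):
    """Probability of default from a fitted logistic regression."""
    log_odds = intercept + sum(b * x for b, x in zip(slopes, x_values))
    return logistic_probability(log_odds)


# Hypothetical coefficients and issuer characteristics.
p_default = default_probability(
    x_values=[0.6, 1.5], intercept=-3.0, slopes=[2.0, 0.4]
)
print(f"Estimated probability of default: {p_default:.3f}")
```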
When should you use multiple regression models?
When the dependent variable is continuous (not discrete) and there is more than one explanatory variable (more than one independent variable).
When multiple independent variables determine the outcome of a single dependent variable.
* Dependent Y variable is continuous
* We have more than one independent X variable
Assumptions of Regression Models
L.I.I.N.H.
Linearity: Relationship between dependent Y variable and independent X variable is linear
Independence of Errors: Regression residuals are uncorrelated across observations
Independent: Independent X variable is not random, and there is no exact linear relationship between 2 or more independent variables
Normality: Regression residuals are normally distributed
Homoskedasticity: Constant variance of regression residuals
How do you determine whether a variable is significant?
|t-stat| > critical t-value (about 2 at the 5% significance level for large samples)
Degrees of Freedom for SSR
k
Degrees of Freedom for SST
n − 1
Degrees of Freedom for SSE
n − (k + 1) = n − k − 1
What will happen to adjusted R-Square if we add an insignificant variable?
Adjusted R-Square decreases if the variable adds too little explanatory power (strictly, when its |t-stat| is below 1)
R-Square formula
SSR/SST = Explained Variation / Total Variation
1 − (Unexplained Variation / Total Variation) = 1 − SSE/SST
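The fit statistics can be tied together in one small calculation. The sketch below assumes SSR is the explained (regression) sum of squares with k degrees of freedom and SSE is the unexplained (error) sum of squares with n − k − 1 degrees of freedom; the adjusted R-Square and F-statistic formulas are the standard ones, and the input numbers are hypothetical.

```python
def regression_fit_summary(ssr, sse, n, k):
    """R-squared, adjusted R-squared, and F-statistic from sums of squares."""
    sst = ssr + sse        # total variation
    df_regression = k      # degrees of freedom for SSR
    df_error = n - k - 1   # degrees of freedom for SSE
    df_total = n - 1       # degrees of freedom for SST
    r_squared = ssr / sst
    adjusted_r_squared = 1.0 - (df_total / df_error) * (1.0 - r_squared)
    f_statistic = (ssr / df_regression) / (sse / df_error)
    return {
        "df_error": df_error,
        "r_squared": r_squared,
        "adjusted_r_squared": adjusted_r_squared,
        "f_statistic": f_statistic,
    }


# Hypothetical sums of squares with 46 observations and 2 independent variables.
summary = regression_fit_summary(ssr=60.0, sse=40.0, n=46, k=2)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```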
What kind of test is this?
H0: bi = Bi
Ha: bi ≠ Bi
Two-tail test
What kind of test is this?
H0: bi ≤ Bi
Ha: bi > Bi
Right-tail test
Ha uses >, so the rejection region is in the right tail.
Model Misspecification - Omitted Variable
If we omit a significant variable from our model, the error term will capture the effect of the missing variable.
Which of the following charts, when drawn on a grid, has the O column in alternation with the X column, but most likely does not have the column representing volume or time?
A. Candlestick Chart
B. Bar Chart
C. Point and Figure Chart
C. Point and Figure Chart
You need graph paper to draw a point and figure chart. The X columns and O columns alternate, but the chart does not represent volume or time.
You are an analyst and you need to present some stocks to your supervisor after rating them as outperform, neutral, and underperform. What is the best scale to represent this data?
A. Interval Scale
B. Ordinal Scale
C. Ratio Scale
B. Ordinal Scale
The ratings (outperform, neutral, underperform) rank the stocks by expected future performance but say nothing about the size of the performance differences between categories, hence an ordinal scale would be the best option.
When you are analyzing mutually exclusive projects, why shouldn’t you choose the IRR rule over NPV?
A. When using the IRR ranking, you assume the possibility of reinvestment at the opportunity cost of capital, which is not relevant economically, hence less realistic.
B. Discount rates and interest rates from external factors influence NPV rankings
C. NPV uses more conservative reinvestment rates, making it a relevant option
B. Discount rates and interest rates from external factors influence NPV rankings
The NPV rule depends heavily on external market forces to determine the discount rate, because it assumes that cash flows are reinvested at the opportunity cost of capital. The IRR rule assumes instead that any cash flow is reinvested at the project's own IRR, and for that reason IRR rankings are not influenced by external discount or interest rates.
What kind of test is this?
H0: bi ≥ Bi
Ha: bi < Bi
Left-tail test
Ha uses <, so the rejection region is in the left tail.
In the last 24 months, you have obtained the following information concerning the return on an investment:
Mean Return = 15%
Standard Deviation of Returns = 9%
Assuming a 4% risk-free rate, what is the closest figure to the Sharpe ratio for this particular investment?
A. 1.02
B. 1.22
C. 0.33
B. 1.22
The Sharpe ratio is calculated as follows:
(0.15 - 0.04) / 0.09 = 1.22
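The same arithmetic as a small reusable function, a minimal sketch using the figures from the question (the function name is only illustrative):

```python
def sharpe_ratio(mean_return, risk_free_rate, std_dev):
    """Excess return over the risk-free rate per unit of total risk."""
    return (mean_return - risk_free_rate) / std_dev


# Inputs from the question: 15% mean return, 4% risk-free rate, 9% std. dev.
sharpe = sharpe_ratio(mean_return=0.15, risk_free_rate=0.04, std_dev=0.09)
print(f"Sharpe ratio: {sharpe:.2f}")
```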
For a given present value and interest rate, the future value:
A. Increases as the number of compounding periods per year increases.
B. Decreases as the number of compounding periods per year increases.
C. Remains the same as the number of compounding periods per year increases.
D. Remains the same as the number of compounding periods per year decreases.
For a given future value and interest rate, the present value:
Jim Wilson planning to purchas