CFA L2 Quant Flashcards
True or false: With financial instruments, we can typically use a one-factor linear regression model?
False, typically we need a multiple regression model.
Multiple regression model
Regression models that allow us to see the effects of multiple independent variables on one dependent variable.
Ex: Can the 10-year growth in the S&P 500 (dependent variable (Y)) be explained by the trailing dividend payout ratio of the index’s stocks (independent variable 1 (X1)) and the yield curve slope (independent variable 2 (X2))?
What are the uses of multiple regression models?
- Identify relationships between variables.
- Forecast variables. (ex: forecast CFs or forecast probability of default)
- Test existing theories.
Standard error
A statistical measure of how well a sample statistic estimates the corresponding population parameter.
Residual (ε)
The difference between the observed Y value and the predicted Y value (ŷ).
ε = Y - ŷ
OR
Y - (b0 + b1x1 + b2x2 … + bnxn)
P-value
The smallest level of significance for which the null hypothesis can be rejected.
- If the p-value is less than the significance level (α), the null hypothesis can be rejected; if it's greater, we fail to reject it.
If the significance level is 5% and the p-value is .06, do we reject the null hypothesis?
No, we fail to reject the null hypothesis.
Assumptions underlying a multiple regression model:
- A linear relationship exists between the dependent and independent variables.
- The residuals are normally distributed.
- The variance of the error terms is constant.
- The residual of one observation ISN’T correlated w/ another.
- The independent variables ARE NOT random
- There is no linear relationship between any two or more independent variables.
Q-Q plot
A plot used to compare a variable’s distribution to a normal distribution. The residuals should lie along the diagonal line if they follow a normal distribution.
True or false: For a standard normal distribution, only 5% of the observations should fall below -2 standard deviations from 0?
False, only 5% of the observations should fall below -1.65 standard deviations.
Analysis of variance (ANOVA)
A statistical test used to assess the difference between the means of more than two groups. At its core, ANOVA allows you to simultaneously compare arithmetic means across groups. You can determine whether the differences observed are due to random chance or if they reflect genuine, meaningful differences.
- A one-way ANOVA uses one independent variable.
- A two-way ANOVA uses two or more independent variables.
Coefficient of determination (R^2)
The percentage of the total variation in the dependent variable explained by the independent variable(s).
R^2 = SSR/SST
OR
(SST - SSE) / SST
Ex: R^2 of 0.63 means that the model explains 63% of the variation in the dependent variable.
SSR = regression sum of squares. The sum of the squared differences between the predicted values and the mean of the dependent variable; it’s the variation in the dependent variable explained by the independent variable(s).
SSE = sum of squared errors. The sum of the squared residuals (the unexplained variation).
SST = total sum of squares. The total variation in the dependent variable; SST = SSR + SSE.
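A minimal Python sketch of the sum-of-squares decomposition and R^2 defined above, using hypothetical data and plain NumPy (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical data: one independent variable x and a dependent variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 6.1])

b1, b0 = np.polyfit(x, y, 1)          # OLS slope and intercept
yhat = b0 + b1 * x                    # predicted values

sst = np.sum((y - y.mean()) ** 2)     # total variation
sse = np.sum((y - yhat) ** 2)         # unexplained variation
ssr = np.sum((yhat - y.mean()) ** 2)  # explained variation

r2 = ssr / sst                        # equals (sst - sse) / sst for an OLS fit with an intercept
print(f"SST={sst:.4f}  SSE={sse:.4f}  SSR={ssr:.4f}  R^2={r2:.4f}")
```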
Adjusted R^2
Since R^2 never decreases (and almost always increases) as more independent variables are added to the model, we must adjust it to penalize additional variables.
- If adding an additional independent variable causes the adjusted R^2 to decrease, it’s not worth adding that variable.
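A small sketch of the standard adjustment formula, assuming n observations and k independent variables (the R^2 values below are hypothetical):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Adding a variable that barely raises R^2 can leave adjusted R^2 flat or lower
print(adjusted_r2(0.63, n=40, k=3))   # ~0.599
print(adjusted_r2(0.64, n=40, k=4))   # ~0.599 -> the extra variable adds essentially nothing
```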
Overfitting
When R^2 is high because there is a large # of independent variables, rather than strong explanatory power.
Akaike’s information criterion (AIC)
Compares multiple regression models to determine which will forecast best.
Calculation: (n * ln(SSE/n)) + 2(k+1)
- Lower values indicate a better model.
- Higher k values result in higher values of the criteria.
Schwarz’s Bayesian information criteria (BIC)
Looks at multiple regression models and determines which has a better goodness of fit.
Calculation: (n * ln(SSE/n)) + (ln(n)*(k+1))
- Lower values indicate a better model.
- Higher k values result in higher values of the criteria.
- BIC imposes a higher penalty for overfitting than AIC.
- AIC and BIC are alternatives to R^2 and adjusted R^2 to determine the quality of the regression model.
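A small sketch computing both criteria from SSE, n, and k using the formulas above (the SSE values and model labels are hypothetical):

```python
import math

def aic(sse: float, n: int, k: int) -> float:
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(sse: float, n: int, k: int) -> float:
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# Hypothetical comparison of two candidate models fit to the same data (n = 60)
for label, sse, k in [("Model A", 120.0, 3), ("Model B", 112.0, 5)]:
    print(label, round(aic(sse, n=60, k=k), 2), round(bic(sse, n=60, k=k), 2))
# Lower values are better; BIC penalizes Model B's extra variables more heavily (ln(60) > 2).
```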
Nested models
Models in which one model (the restricted model) uses a subset of the independent variables of another model (the full, or unrestricted, model).
Full model vs restricted model
Full model= A linear regression model that uses all k independent variables
Restricted model= A linear regression model that only uses some of the k independent variables
Joint F-Test
Measures how well a set of independent variables, as a group, explains the variation in the dependent variable. Put simply, it tests overall model significance.
Calculation: [ (SSErestricted - SSEunrestricted) / Q ] / [ (SSEunrestricted) / (n - k - 1) ]
* Q = # of excluded variables in the restricted model.
* Decision rule: reject the null hypothesis if F-stat > F critical value.
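A minimal sketch of the joint F-test using the formula above. The SSE values are hypothetical, and SciPy's F distribution supplies the one-tailed critical value:

```python
from scipy.stats import f

# Hypothetical inputs
sse_restricted = 150.0    # SSE of the model that omits q variables
sse_unrestricted = 120.0  # SSE of the full model with k independent variables
n, k, q = 60, 5, 2        # observations, variables in the full model, excluded variables

f_stat = ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))
f_crit = f.ppf(0.95, dfn=q, dfd=n - k - 1)   # one-tailed test at 5% significance

print(f"F-stat = {f_stat:.2f}, critical value = {f_crit:.2f}")
print("Reject H0 (excluded variables are jointly significant)" if f_stat > f_crit
      else "Fail to reject H0")
```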
True or false: We could also use a t-test to evaluate the significance to see which variables are significant?
True, but the F-test provides a more meaningful evaluation since there is likely some amount of correlation among independent variables.
True of false: The F-test will tell us if at least one of the slope coefficients in a multiple regression model is statistically different from 0?
TRUE
True or false: When testing the hypothesis that all the regression coefficients are simultaneously equal to 0, the F-test is always a two tailed test?
False, when testing the hypothesis that all the regression coefficients are simultaneously equal to 0, the F-test is always a one tailed test.
True or false: We can use the regression equation to make predictions about the dependent variable based on forecasted values of the independent variable?
True, we can make predictions.
Predicting the dependent variable from forecasted values of the independent variable:
ŷ = predicted value of the intercept + (X1 * estimated slope coefficient for X1) + (X2 * estimated slope coefficient for X2)…
Functional form misspecifications (A regression suffers from misspecification of the functional form when the functional form of the estimated regression model differs from the functional form of the population regression function):
- Omission of important independent variables: may lead to biased and inconsistent regression parameters OR serial correlation or heteroskedasticity in the residuals.
- Inappropriate variable form (ex: you may need to take the natural log of a variable): may lead to heteroskedasticity in the residuals. This can happen if there is no linear relationship between the independent & dependent variables.
- Inappropriate variable scaling (ex: common-size financial statements): May lead to heteroskedasticity in the residuals or multicollinearity.
- Data improperly pooled: May lead to heteroskedasticity or serial correlation in the residuals.
Heteroskedasticity
When the variance of the residuals is not constant across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample.
Unconditional heteroskedasticity
When the heteroskedasticity is not related to the level of the independent variables, meaning the variance of the residuals does not change systematically with the values of the independent variables.
- Although it’s a violation of our assumptions, it is usually not a big problem.
Conditional heteroskedasticity
Heteroskedasticity that is related to the level of the independent variables. Creates significant problems for statistical inference if not corrected properly.
- Conditional heteroskedasticity DOES NOT affect the slope coefficients. It DOES affect the computed F-stat and computed t-stat.
Effects of conditional heteroskedasticity
- If the pattern of heteroskedasticity is low (most observations on the plot have low residual variance): the standard errors of the coefficients become unreliable, usually by being underestimated. This makes the t-stats too large too often, so the null is rejected too often, a.k.a. Type I error.
- For the F-test (MSR/MSE), MSE is underestimated, so the F-stat is often too large, again leading to the null being rejected too often, a.k.a. Type I error.
- If the pattern of heteroskedasticity is high (most observations on the plot have high residual variance): the same errors occur, but in the opposite direction.
How to detect conditional heteroskedasticity
There are two methods of detection: examining scatter plots of the residuals and by using the Breusch-Pagan chi-square test.
How to use scatterplots to detect heteroskedasticity?
Look at a scatterplot of the residuals vs the independent variables. If the variation is constant there is no heteroskedasticity. If it’s not constant, there is heteroskedasticity.
Breusch-Pagan Chi-Square (BP) Test
A test used to detect heteroskedasticity. The BP test calls for the squared residuals (as the dependent variable) to be regressed on the original set of independent variables. If conditional heteroskedasticity is present, the independent variables will significantly contribute to the explanation of the variability in the squared residuals.
- We want a small R^2 from this regression (the BP test statistic = n × R^2 of the squared-residual regression).
- This is a one-tailed test because we are only concerned w/ large values.
- Use a chi-square dist. with k df
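A minimal sketch of the BP test done by hand with NumPy: regress the squared residuals on the original independent variables, then compare n × R^2 to a chi-square critical value with k df. The data below are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
resid = rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 0]))  # hypothetical heteroskedastic residuals

# Auxiliary regression: squared residuals on the original independent variables (with intercept)
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, resid ** 2, rcond=None)
fitted = X1 @ beta
r2 = 1 - np.sum((resid ** 2 - fitted) ** 2) / np.sum((resid ** 2 - np.mean(resid ** 2)) ** 2)

bp_stat = n * r2                       # BP test statistic
crit = chi2.ppf(0.95, df=k)            # one-tailed chi-square critical value, k df
print(bp_stat, crit, bp_stat > crit)   # True -> reject H0 of no conditional heteroskedasticity
```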
How to correct heteroskedasticity?
We can use robust standard errors (also called White-corrected or heteroskedasticity-consistent standard errors).
Serial correlation/autocorrelation
When residuals are correlated with each other.
- Poses serious problems when using time series data.
Positive serial correlation
When a positive residual in one time period increases the probability of observing a positive residual in the next time period.
- This type of correlation typically results in coefficient standard errors that are too small, causing T-stats or F-stats to be too large, which will lead to type 1 errors.
Effect of serial correlation on model parameters
If one of the independent variables is a lagged value of the dependent variable, serial correlation causes the estimates of the slope coefficients to be inconsistent. If there is no such lag, the slope coefficient estimates remain consistent (though the standard errors are still affected).
How to detect serial correlation?
First, we can use a scatter plot, though this reveals only very dramatic cases. We can also use the Durbin-Watson (DW) statistic or a Breusch-Godfrey (BG) test. The DW statistic detects serial correlation at a single lag, whereas the BG test detects serial correlation at multiple lags.
- The lower limit for the DW table is 15 observations.
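A minimal sketch of computing the DW statistic from a series of residuals, assuming the standard formula DW = Σ(e_t − e_(t−1))² ÷ Σe_t² (a value near 2 indicates no first-order serial correlation); the residual series below are simulated:

```python
import numpy as np

def durbin_watson(resid: np.ndarray) -> float:
    """DW = sum of squared successive residual differences / sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
white_noise = rng.normal(size=200)             # uncorrelated residuals
print(durbin_watson(white_noise))              # ~2.0
print(durbin_watson(np.cumsum(white_noise)))   # strongly positively correlated -> near 0
```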
Breusch-Godfrey (BG) Test
The BG Test regresses the residuals against the original set of independent variables, plus one or more additional variables representing lagged residuals.
Calculation: ε_t = a0 + a1X1t + a2X2t + … + ρ1ε_(t-1) + … + ρpε_(t-p) + u_t
- The null under the BG test is that there is no serial correlation (i.e., ρ1 = 0).
How to correct for serial correlation?
We can calculate robust standard errors, i.e., Newey-West corrected standard errors (also called heteroskedasticity- and serial-correlation-consistent standard errors).
Multicollinearity
When independent variables in a multiple regression are correlated w/ each other
- This inflates standard errors and lowers t-stats leading to the null failing to be rejected more often (type 2 error).
- Also causes the model’s coefficients to become unreliable.
- Multicollinearity has no effect on an F-stat
Effect of multicollinearity on model parameters
Multicollinearity DOES NOT affect the consistency of slope coefficients. Multicollinearity DOES make those estimates imprecise and unpredictable.
How to detect multicollinearity?
The most easily observable sign is when the t-tests indicate that none of the individual coefficients is significantly different from zero, but the F-test indicates that at least one coefficient is statistically significant and the R^2 is high. Individually the variables appear to explain little of the variation in the dependent variable, yet together they explain a lot; their high correlation with each other washes out the individual effects. More formally, we compute a variance inflation factor (VIF) for each of the independent variables.
Variance inflation factor (VIF)
Estimates how much the variance of an estimated regression coefficient is inflated by multicollinearity. We start by regressing one of the independent variables (making it the dependent variable) against the remaining independent variables.
VIF= 1 / (1 - Rj^2)
* A VIF of 1 (the minimum value) indicates the variable is uncorrelated with the other independent variables.
* VIF values >5 indicate further investigation.
* VIF values >10 indicate high correlation.
Rj^2 is the R^2 from regressing independent variable j on the remaining independent variables.
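A minimal sketch of the VIF computation: regress each independent variable on the others and apply VIF = 1 / (1 − Rj^2). The data are simulated, with x2 deliberately made collinear with x1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # deliberately collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X: np.ndarray, j: int) -> float:
    """Regress column j on the remaining columns (with intercept) and return 1/(1 - Rj^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    rj2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - rj2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])  # x1 and x2 show high VIFs
```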
How to correct multicollinearity?
The most common method to correct for multicollinearity is to omit one or more of the highly correlated independent variables. You can also use a proxy for one of the variables or increase the sample size.
True or false: The coefficient on a variable in a multiple regression is the amount of return attributable to the variable?
TRUE
True or false: Using actual instead of expected inflation will improve model specification?
False, using actual instead of expected inflation is likely to result in model misspecification.
Outliers vs high-leverage points
Outliers: Extreme observations in the dependent (Y) variable
High-leverage points: Extreme observations in the independent (X) variable
Leverage (in statistics)
A way of identifying extreme observations in the independent variable. Leverage measures the distance of the ith observation of an independent variable from its sample mean. Leverage values lie between 0 and 1; the closer to 1, the farther the observation is from the mean. If an observation’s leverage is greater than three times the average leverage (3 × (k+1)/n), it is considered potentially influential.
Studentized residuals
An alternative to leverage for identifying outliers. The studentized residual is the # of standard deviations a data point is from the regression line: for each data point, the residual ÷ its standard deviation gives the studentized residual. There are four main steps to this process:
1. Estimate the regression model using the original sample size and then delete one observation and re-estimate the regression. Perform this sequentially deleting a new observation each time.
2. Compare the actual Y values of the deleted observation to the predicted y-values. ei= Y-ŷ
3. The studentized residual is the residual in #2 ÷ standard deviation. t= ei / s
4. Compare the studentized residuals to critical values in a t-table using n-k-2 df. Points that fall in the rejection region are termed outliers and potentially influential.
Influential data points
Extreme observations that, when excluded, cause a significant change to model coefficients.
True or false: All outliers and high-leverage points are influential on the regression?
FALSE
Cook’s Distance
A composite metric for evaluating if a high leverage and/or outlier is influential. Cook’s distance measures how much the estimated values of the regression change if certain high leverage points or outliers are deleted from the sample.
Calculation:
D_i = [ e_i^2 / ((k+1) * MSE) ] * [ h_i / (1 - h_i)^2 ]
* h_i = leverage value for the ith observation
* e_i = the residual for the ith observation
- Values greater than √(k/n) indicate the observation is highly likely to be an influential data point.
- Generally, values > 1 indicate highly influential, whereas values > 0.5 indicate the need for further investigation.
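A minimal sketch of Cook's distance using the formula above, with leverages taken from the hat matrix of a small simulated regression (one observation is deliberately distorted so it stands out):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k variables
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 8.0                                    # plant a potential outlier

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
mse = np.sum(resid ** 2) / (n - k - 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix; leverages are on its diagonal
h = np.diag(H)

cooks_d = (resid ** 2 / ((k + 1) * mse)) * (h / (1 - h) ** 2)
print(np.argmax(cooks_d), cooks_d.max())       # observation 0 should stand out
```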
Dummy variables
Binary variables with only two options
- When assigning a numerical value, it can only be 0 and 1.
- Always use (n-1) dummy variables to avoid multicollinearity (i.e., 3 dummy variables for 4 quarters in a year).
- Ex: true/false; on/off
Dummy variables example:
EPS for four quarters:
EPS = 1.25 + 0.75Q1 - 0.20Q2 + 0.10Q3
Question 1: What is the predicted EPS for Q4?
Answer 1: EPS = 1.25 + 0.75(0) - 0.20(0) + 0.10(0) = 1.25
* omitted quarter shows as the intercept
Question 2: What is the predicted value for Q1?
Answer 2: EPS = 1.25 + 0.75(1) - 0.20(0) + 0.10(0) = 2.00
Question 3: What is the predicted EPS for Q1 of next year?
Answer 3: EPS = 1.25 + 0.75(1) - 0.20(0) + 0.10(0) = 2.00
* This simple model uses average EPS for any specific quarter over the past ten years as a forecast of EPS in its respective quarter of the following year.
Logistic regression (logit) model
Estimates the probability of a DISCRETE binary variable occurring.
Calculation: ln(p/(1-p)) = b0 + b1x1 + b2x2 … + ε
* The intercept value is an estimate of log odds when the values of all independent variables is zero.
* The change in log odds when one of the independent variables change is dependent on the curvature of the function.
* Odds = p/(1-p) = e^ŷ
* Probability = odds / (1 + odds) = 1 / (1 + e^(-ŷ))
- Logit models assume that residuals have a logistic distribution- similar to a normal distribution but with fatter tails.
- Logit models are nonlinear
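A minimal sketch converting a fitted log-odds value into odds and a probability, per the relationships above (the coefficients and inputs are hypothetical):

```python
import math

# Hypothetical fitted logit model: log-odds = b0 + b1*x1 + b2*x2
b0, b1, b2 = -2.0, 0.8, 1.5
x1, x2 = 1.2, 0.5

log_odds = b0 + b1 * x1 + b2 * x2
odds = math.exp(log_odds)           # odds = p / (1 - p)
prob = odds / (1 + odds)            # equivalently 1 / (1 + e^(-log_odds))

print(round(log_odds, 3), round(odds, 3), round(prob, 3))
```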
Likelihood ratio (LR) test
Similar to joint F-test but for logit models. Measures the goodness of fit of a logit model.
Calculation= -2 * (log likelihood restricted model - log likelihood unrestricted model).
- Recall, the restricted model has fewer independent variables.
- Log-likelihood values are always negative; values closer to 0 indicate a better-fitting model.
- The LR test statistic itself is non-negative and is compared to a chi-square distribution.
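A small sketch of the LR statistic computed from two hypothetical log-likelihoods and compared to a chi-square critical value with df equal to the number of restricted variables:

```python
from scipy.stats import chi2

ll_restricted = -260.4     # hypothetical log-likelihood, logit model with fewer variables
ll_unrestricted = -254.1   # hypothetical log-likelihood, full logit model
q = 2                      # number of restricted (excluded) variables

lr_stat = -2 * (ll_restricted - ll_unrestricted)   # = 12.6; always non-negative
crit = chi2.ppf(0.95, df=q)
print(lr_stat, crit, lr_stat > crit)               # True -> the extra variables improve fit
```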
Time-series data
A set of observations taken periodically (most often at equal intervals) at different points in time.
- A key feature of a time series is that new data can be added w/o affecting the existing data.
- Trends can be found by plotting these observations on a graph.
Linear trend
One of two broad types of trend models. A time-series trend that can be graphed using a straight line. The independent variable is time. A downward-sloping line indicates a negative trend, and an upward-sloping line a positive trend.
Simplest form: y_t = b0 + b1(t) + ε
Log-linear trend model
One of two broad types of trend models. Used to model positive or negative exponential growth, i.e., growth at some constant rate. Exponential growth plots as a convex curve.
Simplest form: y_t = e^(b0 + b1(t))
* b1 is the constant rate of growth.
* Rather than trying to fit the nonlinear data with a linear (straight-line) regression, we take the natural log of both sides and transform it into a linear trend line called the log-linear model. This improves the predictive ability of the model.
Form: ln(y_t) = b0 + b1(t) + ε
- Financial time series data is often modeled using log-linear trend models.
How to determine if a linear or log-linear trend model should be used?
Plot the data. A linear trend model may be used if the data points are equally distributed above and below the regression line (ex: inflation data is usually modeled with a linear trend model). If, when plotted, the data plots with a curved shape, use a log-linear trend model (ex: financial data- stock indices and stock prices- are often modeled with log-linear trend models).
- If there is serial correlation, we will use an autoregressive model.
True or false: For a time series model without serial correlation, the DW statistic should be approximately equal to 0?
False, for a time series model without serial correlation, the DW statistic should be approximately equal to 2. A DW that significantly differs from 2 suggests that the residuals are correlated.
Autoregressive (AR) model
A time-series model that regresses the dependent variable against one or more lagged values of itself.
Ex: A regression of the sales of a firm against the sales of the firm in the previous month. In this model, past values are used to predict the current value of the variable.
Simplest form: x_t = b0 + b1·x_(t-1) + … + bp·x_(t-p) + ε
* Xt= value of time series at time t
* X_t-1= value of time series at time t-1
- DW test stat cannot be used to test for serial correlation in AR model.
Covariance stationary
An AR model is covariance stationary if:
* There is a constant and finite expected value: the expected value is constant over time.
* Constant and finite variance: the volatility around the time series’ mean is constant over time.
* The covariance between any two observations w/ equal distance apart will be equal.
True or false: A nonstationary time series can still produce meaningful results sometimes?
False, we need stationary covariance. A nonstationary time series will produce meaningless results.
True or false: We can use a DW or BG test to test for serial correlation in AR models?
False, we must use a t-test
- We can use a DW or BG test for a TREND model.
T-stat for residual autocorrelations in AR model:
correlation of the error term with the kth lagged error term ÷ (1 ÷ √n)
Standard error = 1 ÷ √n
* n = # of observations; compare the t-stat to critical values with n - 2 degrees of freedom.
- If data is monthly, check for 12 lags to see if there’s serial correlation. If quarterly, check for 4 lags.
- When there is statistically significant serial correlation in an AR model, it means that the model is incomplete. There’s still some pattern of data in the residuals that the model has failed to reveal.
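A minimal sketch of the t-test for residual autocorrelation at a given lag in an AR model, assuming a standard error of 1/√n as above. The residual series is simulated white noise, so most t-stats should be small:

```python
import numpy as np

def residual_autocorr_t(resid: np.ndarray, lag: int) -> float:
    """t-stat = autocorrelation at the given lag / (1 / sqrt(n))."""
    n = len(resid)
    r = np.corrcoef(resid[:-lag], resid[lag:])[0, 1]
    return r / (1 / np.sqrt(n))

rng = np.random.default_rng(4)
resid = rng.normal(size=120)                                          # hypothetical monthly AR-model residuals
t_stats = [residual_autocorr_t(resid, lag) for lag in range(1, 13)]   # check 12 lags for monthly data
print([round(t, 2) for t in t_stats])                                 # |t| > ~2 would flag serial correlation
```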
Mean reversion
When a time-series has a tendency to move towards its mean. In other words, the dependent variable has a tendency to decline when the current value is above the mean and rise when the current value is below the mean. If a time series is at its mean reverting level, the model predicts the next value of the time series will be the same as its current value.
Mean reverting level calculation
Xt = b0 ÷ (1 - b1)
- The model will not be covariance stationary if b1 = 1
- If Xt > than the mean reverting level, the model predicts that x_t+1 will be lower than Xt and vice versa.
- All covariance stationary time series have a finite mean-reverting level.
- As forecasts become more distant, the value of the forecast will be closer to the mean reverting level.
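A small sketch showing AR(1) forecasts converging toward the mean-reverting level b0 ÷ (1 − b1), with hypothetical coefficients:

```python
b0, b1 = 1.0, 0.6
mean_reverting_level = b0 / (1 - b1)    # = 2.5

x = 5.0                                 # current value above the mean-reverting level
forecasts = []
for _ in range(6):
    x = b0 + b1 * x                     # each one-step forecast moves closer to 2.5
    forecasts.append(round(x, 3))
print(mean_reverting_level, forecasts)  # [4.0, 3.4, 3.04, 2.824, 2.694, 2.617]
```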
In-sample forecasts
Forecasts that are within the range of data used to estimate the model. This is where we compare how accurate our model is in forecasting the actual data we used to develop the model.
Out-of-sample forecasts
Forecasts that are made outside of the sample period. This is where we compare how accurate a model is in forecasting the y-variable value for a time period outside the period used to develop the model.
Root mean squared error (RMSE)
Used to compare the accuracy of autoregressive models in forecasting out-of-sample values.
Ex: We have two AR models. To determine which model will more accurately forecast future values, we calculate the RMSE for the out-of-sample data.
- The model with the lower RMSE for the out-of-sample data will have lower forecast error and will be expected to have better predictive power in the future.
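A minimal sketch of RMSE computed on out-of-sample data for two hypothetical models' forecasts:

```python
import numpy as np

def rmse(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Root mean squared error: sqrt of the mean squared forecast error."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

actual = np.array([3.2, 3.5, 3.1, 3.8, 4.0])          # hypothetical out-of-sample values
model_a = np.array([3.0, 3.6, 3.3, 3.6, 4.2])
model_b = np.array([3.5, 3.1, 2.8, 4.2, 3.6])

print(rmse(actual, model_a), rmse(actual, model_b))   # the lower RMSE suggests better predictive power
```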
True or false: Financial and economic time series inherently exhibit some form of instability or nonstationarity.
True. Since financial/economic conditions are dynamic, the coefficients in one period may differ from those in another period. Models estimated over shorter time periods are usually more stable for this reason. When selecting a time-series sample, analysts should understand regulatory changes, changes to the economic environment, etc. If there have been large changes, the model may not be accurate.
True or false: There is a trade-off between statistical reliability in the long run and statistical stability in the short run?
True. Longer sample periods give more data and thus more statistical reliability, but shorter sample periods are more likely to be stable (the coefficients are less likely to have shifted over the period).
Random walk
When, in an AR model, the value of the dependent variable in one period is equal to the value of the series in the previous period plus a random error term.
Form: Xt = X_t-1 + ε
* b0 = 0
* b1 = 1
Random walk with a drift
The same concept as a random walk, but the intercept term is not equal to zero. Thus, the time series is expected to change each period by the drift (intercept) term plus the error term.
Form: Xt = b0 + X_t-1 + ε
* b1 = 1
True or false: A random walk with or w/o a drift is NOT covariance stationary?
True, random walks will always have a unit root which makes them not covariance stationary.
Why are unit roots problematic?
A unit root is when b1 = 1. If this occurs, then the mean reverting level (b0 ÷ (1 - b1)) is undefined.
How to determine whether a time series is covariance stationary:
- We can run an AR model and examine autocorrelations
- Perform a Dickey-Fuller test
- We cannot use a standard t-test of b1 = 1 (the Dickey-Fuller test transforms the regression and uses its own critical values).