Quantitative Methods Flashcards
1.1 Basics of Multiple Regression and Underlying Assumptions
– describe the types of investment problems addressed by multiple linear regression and the regression process
– formulate a multiple linear regression model, describe the relation between the dependent variable and several independent variables, and interpret estimated regression coefficients
– explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions
Uses of Multiple Linear Regression
Multiple linear regression is a statistical method used to analyze relationships between a dependent variable (explained variable) and two or more independent variables (explanatory variables). This method is often employed in financial analysis, such as examining the impact of GDP growth, inflation, and interest rates on real estate returns.
1- Nature of Variables:
– The dependent variable (Y) is the outcome being studied, such as rate of return or bankruptcy status.
– The independent variables (X) are the predictors or explanatory factors influencing the dependent variable.
2- Continuous vs. Discrete Dependent Variables:
– If the dependent variable is continuous (e.g., rate of return), standard multiple linear regression is appropriate.
– For discrete outcomes (e.g., bankrupt vs. not bankrupt), logistic regression is used instead.
– Independent variables can be either continuous (e.g., inflation rates) or discrete (e.g., dummy variables).
3- Forecasting Future Values:
– Regression models are built to forecast future values of the dependent variable based on the independent variables.
– This involves an iterative process of testing, refining, and optimizing the model.
– A robust model must satisfy the assumptions of multiple regression, ensuring it provides a statistically significant explanation of the dependent variable.
4- Model Validation:
– A good model exhibits an acceptable goodness of fit, meaning it explains the variation in the dependent variable effectively.
– Models must undergo out-of-sample testing to verify predictive accuracy and robustness in real-world scenarios.
5- Practical Applications:
– Regression models are widely used in finance, economics, and business to understand complex relationships and make informed decisions.
– For example, they can assess the drivers of asset performance, predict financial distress, or evaluate economic impacts on industries.
Key Steps in the Regression Process:
Define the Relationship:
– Begin by determining the variation of the dependent variable (Y) and how it is influenced by the independent variables (X).
Determine the Type of Regression:
– If the dependent variable is continuous (e.g., rates of return): Use multiple linear regression.
– If the dependent variable is discrete (e.g., bankrupt/not bankrupt): Use logistic regression.
Estimate the Regression Model:
– Build the model based on the selected independent and dependent variables.
Analyze Residuals:
– Residuals (errors) should follow normal distribution patterns.
– If they don’t, adjust the model to improve fit.
Check Regression Assumptions:
– Ensure assumptions like linearity, independence of residuals, and homoscedasticity (constant variance of residuals) are satisfied.
– If not, return to model adjustment.
Examine Goodness of Fit:
– Use measures like R^2 and adjusted R^2 to evaluate how well the independent variables explain the variation in the dependent variable.
Check Model Significance:
– Use hypothesis testing (e.g., p-values, F-tests) to assess the overall significance of the model.
Validate the Model:
– Determine if this model is the best possible fit among alternatives by comparing performance across different validation metrics.
Use the Model:
– If all criteria are met, use the model for analysis and prediction.
Iterative Nature:
If at any step the assumptions or fit are not satisfactory, analysts refine the model by adjusting variables, transforming data, or exploring alternative methodologies. This ensures the final model is both robust and accurate for predictive or explanatory purposes.
Basics of Multiple Linear Regression
1- Objective:
– The purpose of multiple linear regression analysis is to explain the variation in the dependent variable (Y), which is measured by the sum of squares total (SST).
2- SST Formula:
SST = ∑_{i=1}^{n} (Yi - Y_bar)^2
– Where:
— Yi: Observed value of the dependent variable.
— Y_bar: Mean of the observed dependent variable.
3- General Regression Model:
Yi = b0 + b1X1i + b2X2i + … + bkXki + ei
– Where:
— Yi: Dependent variable for observation i.
— b0: Intercept, representing the expected value of Yi when all X values are zero.
— b1, b2, …, bk: Slope coefficients, which quantify the effect of a one-unit change in the corresponding X variable on Y, while holding other variables constant.
— X1i, X2i, …, Xki: Values of the independent variables for the i-th observation.
— ei: Error term for observation i, representing random factors not captured by the model.
4- Key Features:
– A model with k partial slope coefficients will include a total of k+1 regression coefficients (including the intercept).
– The intercept (b0) and slope coefficients (b1, b2, …, bk) together describe the relationship between the independent variables (X) and the dependent variable (Y).
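To make the notation concrete, here is a minimal sketch of estimating such a model with Python's statsmodels package. The variable names and simulated data (GDP growth, inflation, real estate returns) are hypothetical illustrations of the earlier example, not figures from the curriculum.

```python
# Minimal sketch of estimating Yi = b0 + b1*X1i + b2*X2i + ei with ordinary least squares.
# Variable names (gdp_growth, inflation, re_returns) are hypothetical illustrations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 60
gdp_growth = rng.normal(2.5, 1.0, n)        # X1: simulated GDP growth (%)
inflation = rng.normal(3.0, 1.5, n)         # X2: simulated inflation (%)
re_returns = 1.0 + 0.8 * gdp_growth - 0.5 * inflation + rng.normal(0, 1, n)  # Y

X = sm.add_constant(np.column_stack([gdp_growth, inflation]))  # adds the intercept column
model = sm.OLS(re_returns, X).fit()

print(model.params)     # [b0, b1, b2]: intercept and partial slope coefficients
print(model.summary())  # full regression output (R^2, t-stats, F-test, etc.)
```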
Assumptions for Valid Statistical Inference in Multiple Regression
1- Linearity:
– The relationship between the dependent variable (Y) and the independent variables is linear.
2- Homoskedasticity:
– The variance of the residuals (e_i) is constant for all observations.
3- Independence of Observations:
– The pairs (X, Y) are independent of each other.
– Residuals are uncorrelated across observations.
4- Normality:
– Residuals (e_i) are normally distributed.
5- Independence of the Independent Variables:
– 5a. Independent variables are not random.
– 5b. There is no exact linear relationship (no multicollinearity) between any of the independent variables.
Assessing Violations of Regression Assumptions (Graphical Approach)
1- Linearity:
– Respected: The scatterplot of the dependent variable against each independent variable shows a clear linear trend (straight-line pattern).
– Not Respected: The scatterplot shows a non-linear pattern (curved or irregular trends), indicating the need for transformations or additional variables to capture the relationship.
2- Homoskedasticity: Dots randomly below and above the 0 line (no patterns)
– Respected: A plot of residuals against predicted values shows points evenly scattered around a horizontal line, with no discernible pattern or clusters.
– Not Respected: The plot shows a cone-shaped or fan-shaped pattern, indicating that the variance of residuals increases or decreases systematically (heteroskedasticity).
3- Independence of Observations: Dots randomly below and above the 0 line (no patterns)
– Respected: Residuals plotted against independent variables or observation order show a flat trendline with no clustering or patterns.
– Not Respected: The plot reveals systematic trends, cycles, or clustering in residuals, suggesting autocorrelation or dependence between observations.
4- Normality:
– Respected: A Q-Q plot shows residuals closely aligned along the straight diagonal line, indicating they follow a normal distribution.
– Not Respected: Residuals deviate significantly from the diagonal line, particularly at the tails, indicating a non-normal distribution (e.g., “fat-tailed” or skewed residuals).
5- Independence of Independent Variables (Multicollinearity):
– Respected: Pairs plots between independent variables show no clear clustering or trendlines, suggesting low correlation.
– Not Respected: Pairs plots show strong clustering or clear linear relationships between independent variables, indicating multicollinearity that could distort regression results.
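As a practical aid, the sketch below produces two of the diagnostic plots described above: residuals versus fitted values (homoskedasticity and independence) and a Q-Q plot of residuals (normality). It assumes the fitted statsmodels result named model from the earlier OLS sketch.

```python
# Sketch of the graphical checks described above, assuming a fitted statsmodels
# OLS result named `model` (as in the earlier sketch).
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Homoskedasticity / independence: residuals vs. fitted values should scatter
# randomly around the zero line with no funnel shape or trend.
axes[0].scatter(fitted, resid)
axes[0].axhline(0, linestyle="--")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")

# Normality: the Q-Q plot of residuals should hug the 45-degree line.
sm.qqplot(resid, line="45", fit=True, ax=axes[1])

plt.tight_layout()
plt.show()
```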
Note: Outliers can have very large impacts on regression results, but they are not necessarily bad data points.
1.2 Evaluating Regression Model Fit and Interpreting Model Results
– evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
– formulate hypotheses on the significance of two or more coefficients in a multiple regression model and interpret the results of the joint hypothesis tests
– calculate and interpret a predicted value for the dependent variable, given the estimated regression model and assumed values for the independent variable
Coefficient of Determination (R^2)
Definition:
– R^2 measures the proportion of the variation in the dependent variable (Y) that is explained by the independent variables in a regression model.
– It reflects how well the regression line fits the data points.
Formula:
– R^2 = (Sum of Squares Regression) / (Sum of Squares Total)
Key Notes:
– R^2 can also be computed by squaring the Multiple R value provided by regression software.
– It is a common measure for assessing the goodness of fit in regression models but has limitations in multiple linear regression:
Limitations of R^2 in Multiple Linear Regression:
– It does not indicate whether the coefficients are statistically significant.
– It fails to reveal biases in the estimated coefficients or predictions.
– It does not reflect the model’s overall quality.
– High-quality models can have low R^2 values, while low-quality models can have high R^2 values, depending on context.
Adjusted R^2
Definition:
– Adjusted R^2 accounts for the degrees of freedom in a regression model, addressing the key problem of R^2: its tendency to increase when additional independent variables are included, even if they lack explanatory power.
– It helps prevent overfitting by penalizing models for unnecessary complexity.
Formula:
– Adjusted R^2 = 1 - [(Sum of Squares Error) / (n - k - 1)] ÷ [(Sum of Squares Total) / (n - 1)]
– Alternate Formula: Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] × (1 - R^2)
Key Implications:
– Adjusted R^2 is always less than or equal to R^2 because it penalizes models with more independent variables.
– Unlike R^2, Adjusted R^2 can be negative if the model explains less variation than expected.
– Adjusted R^2 increases only if the added independent variable improves the model’s explanatory power significantly.
Relationship with t-statistic:
– If the absolute value of the new coefficient’s t-statistic is greater than 1, Adjusted R^2 will increase.
– If the absolute value of the new coefficient’s t-statistic is less than 1, Adjusted R^2 will decrease.
Analysis of Variance (ANOVA)
Purpose:
– ANOVA is a statistical method used in regression analysis to break down the variation in a dependent variable into two components: explained and unexplained variance.
Key Components:
– Sum of Squares Total (SST): The total variation in the dependent variable.
– Sum of Squares Regression (SSR): The portion of the variation explained by the regression model.
– Sum of Squares Error (SSE): The residual variation not explained by the model.
Relationship: SST = SSR + SSE.
Application:
– The data in an ANOVA table can be used to calculate R^2 and Adjusted R^2 for a regression model.
– Example: If SST = 136.428, SSR = 53.204, and SSE = 83.224, these values can be plugged into the formula for Adjusted R^2 to verify its accuracy.
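A quick numeric check of this example, assuming n = 40 observations and k = 3 independent variables (the values used for Model B later in this reading):

```python
# Worked check of R^2 and adjusted R^2 from the ANOVA figures above,
# assuming n = 40 observations and k = 3 independent variables.
sst, ssr, sse = 136.428, 53.204, 83.224
n, k = 40, 3

r2 = ssr / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(round(r2, 3))      # ~0.390
print(round(adj_r2, 3))  # ~0.339
```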
Limitations of Adjusted R^2:
– Unlike R^2, Adjusted R^2 cannot be interpreted as the proportion of variance explained.
– Adjusted R^2 does not indicate the significance of regression coefficients or the presence of bias.
– Both R^2 and Adjusted R^2 are limited in assessing a model’s overall fit.
What is Parsimonious?
A parsimonious model includes as few independent variables as possible while still adequately explaining the variance of the dependent variable.
Measures of Parsimony in Regression Models
– A high-quality multiple linear regression model is parsimonious, meaning it includes as few independent variables as possible to adequately explain the variance of the dependent variable.
– Two key measures of parsimony are:
Akaike’s Information Criterion (AIC).
Schwarz’s Bayesian Information Criterion (SBC) (also referred to as the Bayesian Information Criterion, or BIC).
AIC and SBC Formulas:
AIC Formula:
– AIC = n * ln(SSE/n) + 2 * (k + 1)
SBC Formula:
– SBC = n * ln(SSE/n) + ln(n) * (k + 1)
Explanation of Components:
– n: Number of observations.
– k: Number of independent variables.
– SSE: Sum of squared errors.
Key Notes:
– Both AIC and SBC penalize for additional independent variables to discourage overfitting.
– These measures differ in the penalty term:
– AIC applies a penalty of 2 * (k + 1).
– SBC applies a penalty of ln(n) * (k + 1).
– Lower scores are better for both measures, as they indicate a better model fit relative to complexity.
Mathematical Differences and Practical Implications:
– SBC is more conservative than AIC because ln(n) grows larger than 2 for datasets with more than 7 observations. This means SBC imposes stricter penalties for adding variables.
– These scores are meaningless in isolation and should instead be used to compare models as independent variables are added, removed, or replaced.
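A minimal sketch of computing and comparing AIC and SBC from the formulas above; the SSE values and variable counts for the two candidate models are hypothetical.

```python
# Compare two candidate models with AIC and SBC (BIC) using the formulas above.
import math

def aic(n, k, sse):
    return n * math.log(sse / n) + 2 * (k + 1)

def sbc(n, k, sse):
    return n * math.log(sse / n) + math.log(n) * (k + 1)

n = 40
candidates = {"restricted (k=3)": (3, 83.2), "expanded (k=5)": (5, 81.0)}  # hypothetical SSEs

for name, (k, sse) in candidates.items():
    print(name, round(aic(n, k, sse), 3), round(sbc(n, k, sse), 3))
# Lower is better for both; SBC penalizes the extra variables more heavily
# because ln(40) ≈ 3.69 > 2.
```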
Application of AIC and SBC:
– AIC is the preferred measure when a model is meant for predictive purposes.
– SBC is better suited for assessing a model’s goodness of fit for descriptive purposes.
Applications and Key Insights
These measures help compare models as independent variables are added, removed, or replaced.
Example:
– Model A has the lowest SBC score, indicating it is the most parsimonious model for fit.
– Model B has the lowest AIC score, suggesting it is best for forecasting.
Important Note: AIC and SBC scores are relative and should not be interpreted in isolation. They are used to compare models within the same dataset.
t-Test for Individual Coefficients
In regression analysis, a t-test is used to evaluate the statistical significance of individual slope coefficients in a multiple regression model. The goal is to determine if a given independent variable has a meaningful impact on the dependent variable.
Null and Alternative Hypotheses:
– To assess whether a slope coefficient is statistically significant, analysts test the following hypotheses:
– Null hypothesis (H₀): bᵢ = Bᵢ
(The slope coefficient is equal to the hypothesized value.)
– Alternative hypothesis (Hₐ): bᵢ ≠ Bᵢ
(The slope coefficient differs from the hypothesized value.)
– Default Hypothesis:
– Most often, Bᵢ = 0 is tested, which means the independent variable has no effect on the dependent variable.
t-Statistic Formula:
– The t-statistic is calculated using:
t = (bᵢ - Bᵢ) / s₍bᵢ₎
– Where:
– bᵢ = Estimated value of the slope coefficient.
– Bᵢ = Hypothesized value of the slope coefficient.
– s₍bᵢ₎ = Standard error of the slope coefficient.
Testing the t-Statistic
Comparison with Critical Value:
– The calculated t-statistic is compared with a critical value based on the desired level of significance (α) and degrees of freedom (df).
– Degrees of freedom = n - k - 1, where:
– n = Number of observations.
– k = Number of independent variables.
p-Value Approach:
– Statistical software often calculates a p-value, which indicates the lowest level of significance at which the null hypothesis can be rejected.
– For example:
– If the p-value is 0.03, the null hypothesis can be rejected at a 5% significance level but not at a 1% level.
Interpreting Results:
If the t-statistic’s absolute value exceeds the critical value or the p-value is smaller than the chosen significance level (e.g., α = 0.05):
– Reject H₀: The coefficient is statistically significant, suggesting the independent variable has an effect on the dependent variable.
If the t-statistic’s absolute value does not exceed the critical value or the p-value is larger than α:
– Fail to Reject H₀: The coefficient is not statistically significant.
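A minimal sketch of the t-test mechanics using scipy; the coefficient estimate, its standard error, and the sample size below are hypothetical.

```python
# t-test for a single slope coefficient against a hypothesized value of zero.
from scipy import stats

b_hat, b_hyp, se_b = 0.82, 0.0, 0.31   # estimated slope, hypothesized value, standard error
n, k = 40, 3
df = n - k - 1

t_stat = (b_hat - b_hyp) / se_b
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))   # two-tailed p-value
t_crit = stats.t.ppf(0.975, df)                    # critical value at alpha = 5%

print(round(t_stat, 3), round(p_value, 4), round(t_crit, 3))
# Reject H0 if |t_stat| > t_crit or p_value < 0.05.
```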
F-Test for Joint Hypotheses
The F-test is used to evaluate whether groups of independent variables in a regression model collectively explain the variation of a dependent variable. Instead of testing each independent variable separately, it tests their combined explanatory power to ensure that they are collectively meaningful.
1- Concept Overview:
Purpose: To determine if adding a group of independent variables significantly improves the explanatory power of the regression model.
2- Comparison of Models:
Unrestricted Model: Includes all independent variables being tested.
Restricted Model: Excludes the variables being tested for joint significance.
These models are referred to as nested models because the restricted model is essentially a subset of the unrestricted model.
Hypotheses:
Null Hypothesis (H₀): The additional variables (e.g., SNPT and LEND in the example below) do not add explanatory power:
b_SNPT = b_LEND = 0
Alternative Hypothesis (Hₐ): At least one of the additional variables has a statistically significant impact:
b_SNPT ≠ 0 and/or b_LEND ≠ 0
F-Statistic Formula:
The F-statistic is calculated using:
F = [(SSE_R - SSE_U) / q] ÷ [SSE_U / (n - k - 1)]
Where:
SSE_R: Sum of squared errors for the restricted model.
SSE_U: Sum of squared errors for the unrestricted model.
q: Number of restrictions (variables excluded from the restricted model).
n: Number of observations.
k: Number of independent variables in the unrestricted model.
Steps to Perform the F-Test:
1- Compute the F-Statistic:
Calculate the difference in SSE between the restricted and unrestricted models.
Adjust for the number of restrictions (q) and the degrees of freedom in the unrestricted model.
2- Compare with Critical Value:
Find the critical F-value from the F-distribution table based on the significance level (e.g., 5%) and degrees of freedom (numerator: q, denominator: n - k - 1).
3- Decision:
If F > critical value, reject H₀. The additional variables collectively add explanatory power.
If F ≤ critical value, fail to reject H₀. The additional variables do not significantly improve the model.
Example Calculation:
Given data:
SSE_R = 83.224 (Restricted Model B)
SSE_U = 81.012 (Unrestricted Model D)
q = 2 (Two additional variables: SNPT and LEND)
n = 40 (Observations)
k = 5 (Independent variables in unrestricted model)
Compute the F-statistic:
F = [(83.224 - 81.012) / 2] ÷ [81.012 / (40 - 5 - 1)]
F = 0.464
Compare with critical value:
At 5% significance level, critical F-value = 3.276 (for q = 2 and df = 34).
Since F = 0.464 < 3.276, fail to reject H₀.
Conclusion:
The F-test shows that the additional variables (SNPT and LEND) do not significantly improve the explanatory power of the model. Thus, a more parsimonious model (Model B) may be preferred.
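The same test can be reproduced in a few lines of Python (a sketch using the figures given above, with scipy supplying the critical value):

```python
# Joint F-test: restricted Model B vs. unrestricted Model D, using the figures above.
from scipy import stats

sse_r, sse_u = 83.224, 81.012
q, n, k = 2, 40, 5

f_stat = ((sse_r - sse_u) / q) / (sse_u / (n - k - 1))
f_crit = stats.f.ppf(0.95, q, n - k - 1)   # 5% significance level

print(round(f_stat, 3))   # ~0.464
print(round(f_crit, 3))   # ~3.276
# f_stat < f_crit, so we fail to reject H0: SNPT and LEND add no explanatory power.
```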
General Linear F-Test
The General Linear F-Test is used to assess the overall significance of an entire regression model. Also known as the goodness-of-fit test, it evaluates the null hypothesis that none of the slope coefficients are statistically different from zero. This test determines whether the independent variables, collectively, explain a significant proportion of the variance in the dependent variable.
Formula for the F-Statistic:
F = Mean Square Regression (MSR) ÷ Mean Square Error (MSE)
MSR is the mean square regression, which measures the explained variation.
MSE is the mean square error, which measures the unexplained variation.
Steps for Using the General Linear F-Test
1- Set Up Hypotheses:
Null Hypothesis (H₀): All slope coefficients are equal to zero, meaning the independent variables have no explanatory power.
Alternative Hypothesis (Hₐ): At least one slope coefficient is statistically different from zero.
2- Calculate the F-Statistic:
Divide the MSR by the MSE.
MSR and MSE are provided in the ANOVA table, which shows key data outputs, including the degrees of freedom (df), sum of squares (SS), and their corresponding mean squares.
3- Compare with the Critical Value:
Determine the critical F-value from the F-distribution table using the df numerator (k) and df denominator (n - k - 1) at the desired level of significance (e.g., 5% or 1%).
4- Make a Decision:
If F > critical value, reject H₀. This indicates that at least one independent variable has explanatory power.
If F ≤ critical value, fail to reject H₀, implying no evidence that the independent variables collectively explain the variance.
Example Analysis
Model B (from the ANOVA table):
Regression df (numerator): 3
Residual df (denominator): 36
MSR = 17.735
MSE = 2.312
F-Statistic Calculation: F = 17.735 ÷ 2.312 = 7.671
Critical Values:
5% significance level: 2.866
1% significance level: 4.377
Conclusion: Since F = 7.671 > 4.377, reject the null hypothesis (H₀). There is strong evidence that at least one slope coefficient is statistically significant, indicating that the model has explanatory power.
Key Takeaways :
– The General Linear F-Test assesses the collective significance of all independent variables in a regression model.
– An F-statistic greater than the critical value suggests that the model has predictive utility.
– Results rely on ANOVA outputs, which provide the required data to compute MSR, MSE, and degrees of freedom.
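A short sketch reproducing the Model B calculation above with scipy's F distribution:

```python
# Overall F-test for Model B from the ANOVA figures above.
from scipy import stats

msr, mse = 17.735, 2.312
df_num, df_den = 3, 36

f_stat = msr / mse
print(round(f_stat, 3))                              # ~7.671
print(round(stats.f.ppf(0.95, df_num, df_den), 3))   # ~2.866 (5% level)
print(round(stats.f.ppf(0.99, df_num, df_den), 3))   # ~4.377 (1% level)
# f_stat exceeds both critical values, so reject H0.
```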
Using Multiple Regression Models for Forecasting
After testing and refining a multiple regression model, it can be employed to forecast the dependent variable by assuming specific values for the independent variables.
Steps to Forecast the Dependent Variable:
Obtain estimates of the parameters:
– Include the intercept (b_0) and slope coefficients (b_1, b_2, …, b_k).
Determine the assumed values for the independent variables:
– Use X_1i, X_2i, …, X_ki as inputs.
Compute the estimated value of the dependent variable:
– Plug the parameters and assumed values into the multiple regression formula.
Considerations for Forecasting:
– Use assumed values for all independent variables, even for those not statistically significant, since correlations between variables are considered in the model.
– Include the intercept term when predicting the dependent variable.
– For valid predictions:
– Ensure the model meets all regression assumptions.
– Assumed values for independent variables should not exceed the data range used to create the model.
– Model error: Reflects the random (stochastic) component of the regression. This contributes to the standard error of the forecast.
– Sampling error: Results from using assumed values derived from external forecasts. This uncertainty affects the out-of-sample predictions.
– Combined model and sampling errors widen the prediction interval for the dependent variable, making it broader than the within-sample error.
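A minimal sketch of producing a point forecast from estimated coefficients and assumed values of the independent variables; all numbers are hypothetical, and the statsmodels calls mentioned in the comments are one way to obtain the same forecast plus a prediction interval.

```python
# Point forecast: plug assumed X values into the estimated regression equation.
import numpy as np

b = np.array([1.20, 0.80, -0.50, 0.30])      # [b0, b1, b2, b3] from the fitted model (hypothetical)
x_assumed = np.array([1.0, 2.4, 3.1, 0.9])   # leading 1 pairs with the intercept

y_hat = b @ x_assumed                         # b0 + b1*X1 + b2*X2 + b3*X3
print(round(y_hat, 3))
# With a statsmodels result, model.predict([x_assumed]) gives the same point forecast;
# model.get_prediction([x_assumed]).summary_frame() also reports a prediction interval.
```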
1.3 Model Misspecification
– describe how model misspecification affects the results of a regression analysis and how to avoid common forms of misspecification
– explain the types of heteroskedasticity and how it affects statistical inference
– explain serial correlation and how it affects statistical inference
– explain multicollinearity and how it affects regression analysis
Principles of Model Specification
Model specification entails carefully selecting variables for inclusion in a regression model. Following these principles minimizes specification errors and improves the model’s reliability and usability.
Key Principles of Model Specification:
Economic Reasoning:
– The model should be grounded in economic logic, ensuring relationships are not artificially discovered through data mining.
Parsimony:
– A well-specified model is parsimonious, meaning it achieves meaningful results with the minimum necessary variables.
– Remove superfluous or irrelevant variables to avoid overcomplicating the model.
Out-of-Sample Performance:
– The model should perform effectively when applied to out-of-sample data, demonstrating generalizability (e.g., if the model is estimated on data from 1980–2000, test it on data from 2001 onward).
– Overfitting to in-sample data renders the model impractical for real-world applications.
Appropriate Functional Form:
– The functional form of the variables should match their relationships.
– Adjustments, such as transformations, may be needed if the relationship between variables is non-linear.
Compliance with Multiple Regression Assumptions:
– Ensure the model adheres to all multiple regression assumptions (e.g., linearity, homoskedasticity).
– Revise the model if any violations of these assumptions are detected.
Misspecified Functional Form
Misspecified functional forms occur when a regression model’s structure fails to accurately represent the relationships between variables, leading to biased, inconsistent, or inefficient results. Several common specification errors can result in these issues.
- Omitted Variables:
– Definition: Important independent variables are excluded from the model.
– Consequences:
1 - Uncorrelated Omitted Variables:
- Residuals reflect the impact of the omitted variable.
- Slope coefficients for included variables are unbiased, but the intercept is biased.
- Residuals are not normally distributed, and their expected value is non-zero.
2 - Correlated Omitted Variables:
- Error term becomes correlated with included independent variables.
- All regression coefficients (intercept and slopes) are biased and inconsistent.
- Estimated residuals and standard errors are unreliable, invalidating statistical tests.
– Diagnostic Tool:
- A scatter plot of residuals against the omitted variable reveals a strong relationship.
- Inappropriate Form of Variables:
– Definition: A variable that has a non-linear relationship with other variables is included in the model without appropriate transformation.
– Solution: Convert the variable into a suitable form, such as using natural logarithms for financial data.
– Consequences of Error:
- Heteroskedasticity (variance of residuals is not constant).
- Inappropriate Scaling of Variables:
– Definition: Variables in the model have different scales (e.g., data in millions versus billions).
– Solution: Normalize or scale data, such as converting financial statement values into common size terms.
– Consequences of Error:
- Heteroskedasticity.
- Multicollinearity (correlation among independent variables).
- Inappropriate Pooling of Data:
– Definition: Combining data points from different structural regimes or periods when conditions were fundamentally different.
– Example: A dataset combining pre- and post-policy change fixed-income returns without accounting for the regime change.
– Solution: Use data only from the period most representative of expected forecast conditions.
– Consequences of Error:
- Heteroskedasticity.
- Serial correlation (error terms are correlated over time).
Homoskedasticity and Heteroskedasticity in Regression
Regression analysis assumes homoskedasticity, meaning the variance of the error term remains constant across all values of the independent variable. If this assumption is violated, the model exhibits heteroskedasticity, where the variance of the error term changes depending on the value of the independent variable.
- Homoskedasticity (Assumption Met):
– Definition: Variance of the error term is constant across all values of the independent variable.
– Graphical Indication:
- Residuals are evenly spread around the regression line across all values of the independent variable (as seen in the left scatterplot).
- No observable pattern in the dispersion of residuals.
- Heteroskedasticity (Violation of Assumption):
– Definition: Variance of the error term increases or decreases with the value of the independent variable.
– Graphical Indication:
- Residuals exhibit a funnel-shaped pattern, either expanding or contracting as the value of the independent variable changes (as seen in the right scatterplot).
- Variance of the residuals is not consistent; larger or smaller variance correlates with higher or lower values of the independent variable.
The Consequences of Heteroskedasticity
Heteroskedasticity can affect the reliability of regression analysis, and its consequences vary depending on whether it is unconditional or conditional.
1- Unconditional Heteroskedasticity:
– Occurs when the error variance is not correlated with the independent variables.
– While it violates the homoskedasticity assumption, it does not cause significant problems for making statistical inferences.
2- Conditional Heteroskedasticity:
– Occurs when the variance of the model’s residuals is correlated with the values of the independent variables.
– This type creates significant problems for statistical inference.
Key Issues:
– Mean Squared Error (MSE): Becomes a biased estimator.
– F-Test for Model Significance: Becomes unreliable.
– Standard Errors: Biased estimates of the standard errors for individual regression coefficients.
– t-Tests for Coefficients: Unreliable due to biased standard errors.
Specific Consequences:
– Underestimated Standard Errors: Leads to inflated t-statistics.
– Increased Risk of Type I Errors: Analysts are more likely to reject null hypotheses that are actually true, resulting in finding relationships that do not exist.
Correcting for Heteroskedasticity
Heteroskedasticity is not typically expected in perfectly efficient markets where prices follow a random walk. However, many financial datasets exhibit heteroskedastic residuals due to phenomena like volatility clustering. When heteroskedasticity is present, it is essential to correct it to ensure reliable statistical inferences.
Methods for Correction:
1- Robust Standard Errors:
– Adjusts the standard errors of the model’s coefficients to account for heteroskedasticity while leaving the coefficients themselves unchanged.
– Often referred to as heteroskedasticity-consistent standard errors or White-corrected standard errors.
– Most statistical software packages provide options to calculate robust standard errors.
2- Generalized Least Squares (GLS):
– Modifies the regression equation to directly address heteroskedasticity in the dataset.
– A more advanced technique whose details are beyond the CFA curriculum’s scope.
Key Takeaways:
– Cause and Correction: Conditional heteroskedasticity often arises in financial data but can be addressed through robust standard errors or GLS.
– Forecasting Potential: Heteroskedasticity, when properly understood, can sometimes reveal inefficiencies that allow analysts to forecast future returns.
– Practical Tools: Statistical software simplifies the implementation of these corrections.
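A minimal sketch, using simulated heteroskedastic data, of requesting White-corrected (heteroskedasticity-consistent) standard errors in statsmodels:

```python
# Robust (White-corrected) standard errors: coefficients are unchanged,
# only the standard errors are adjusted.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
x = rng.normal(0, 1, (n, 2))
y = 1.0 + x @ np.array([0.6, -0.3]) + rng.normal(0, 1 + np.abs(x[:, 0]))  # heteroskedastic noise
X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()                      # conventional standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")     # White-corrected standard errors

print(ols_fit.bse)
print(robust_fit.bse)   # same coefficients as ols_fit.params, different standard errors
```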
Serial correlation, also known as autocorrelation, is commonly observed when working with time-series data. It violates the assumption that errors are uncorrelated across observations, and its impact is potentially more serious than a violation of the homoskedasticity assumption.
Consequences of Serial Correlation
Serial correlation (also called autocorrelation) occurs when the error term for one observation is correlated with the error term of another. This can have significant implications for regression analysis, depending on whether any independent variables are lagged values of the dependent variable.
Key Consequences:
1- If Independent Variables Are NOT Lagged Values of the Dependent Variable:
– Standard error estimates will be invalid.
– Coefficient estimates will remain valid.
2- If Independent Variables ARE Lagged Values of the Dependent Variable:
– Both standard error estimates and coefficient estimates will be invalid.
Types of Serial Correlation:
– Positive Serial Correlation:
A positive error in one period increases the likelihood of a positive error in the next period. This is the most common type and is often assumed to be first-order serial correlation, meaning it primarily affects adjacent observations.
– Negative Serial Correlation:
A positive error in one period increases the likelihood of a negative error in the subsequent period.
The implication of positive serial correlation is that the sign of the error term tends to persist across periods.
Practical Effects:
– Positive serial correlation does not affect the consistency of regression coefficient estimates if the independent variables are not lagged values of the dependent variable. However, statistical tests lose validity:
— The F-statistic for overall significance may be overstated because the mean squared error is underestimated.
— t-statistics for individual coefficients may be inflated, increasing the risk of Type I errors (rejecting a true null hypothesis).
Market Implications:
– Like heteroskedasticity, serial correlation should not occur in an efficient market. Persistent patterns caused by serial correlation would create opportunities for excess returns, which would eventually be exploited and eliminated.
Key Takeaways:
– Positive vs. Negative Serial Correlation: Positive correlation is more common and leads to the persistence of error term signs, while negative correlation reverses them.
– Impact on Tests: Invalid standard errors inflate test statistics, leading to unreliable statistical inferences.
– Market Efficiency: Serial correlation suggests inefficiencies in financial markets, offering exploitable opportunities until corrected by market participants.
Testing for Serial Correlation
To detect the presence of serial correlation in regression models, two common methods are the Durbin-Watson (DW) Test and the Breusch-Godfrey (BG) Test. While the DW test is limited to detecting first-order serial correlation, the BG test is more robust as it can detect serial correlation over multiple lag periods.
Steps for the Breusch-Godfrey (BG) Test:
1- Generate Residuals:
– Run the regression model and compute its residuals.
2- Regress Residuals:
– Use the residuals as the dependent variable and regress them against:
— The independent variables from the original model.
— The lagged residuals from the original model.
3- Chi-Squared Test:
– Use the chi-squared (χ^2) statistic to test the null hypothesis:
— Null Hypothesis (H₀): There is no serial correlation in the residuals up to lag p.
— Alternative Hypothesis (Hₐ): At least one lag is serially correlated with the residuals.
Key Takeaways:
– The Durbin-Watson Test is simple but limited to first-order serial correlation detection.
– The Breusch-Godfrey Test is more comprehensive and suitable for higher-order serial correlation detection.
– If serial correlation is detected, model adjustments may be needed to ensure valid statistical inferences.
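A minimal sketch of running the Breusch-Godfrey test with statsmodels on simulated data whose errors are deliberately serially correlated:

```python
# Breusch-Godfrey test for serial correlation up to a chosen lag.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(3)
n = 120
x = rng.normal(0, 1, n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal(0, 1)      # AR(1) errors: positive serial correlation
y = 2.0 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(fit, nlags=4)
print(round(lm_stat, 2), round(lm_pvalue, 4))
# A small p-value rejects H0 of no serial correlation up to lag 4.
```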
Correcting for Serial Correlation
The most common method of correcting for serial correlation is to adjust the coefficients’ standard errors. While this does not eliminate serial correlation, the adjusted standard errors account for its presence.
Multicollinearity in Regression Models
A key assumption of multiple linear regression is that there are no exact linear relationships between any independent variables. If this assumption is violated, the regression equation cannot be estimated. The concepts of perfect collinearity and multicollinearity are central to understanding this issue:
Perfect Collinearity
– Occurs when one independent variable is an exact linear combination of other independent variables.
– Example: A regression including sales, cost of goods sold (COGS), and gross profit would exhibit perfect collinearity because:
— Sales = COGS + Gross Profit.
Multicollinearity
– Refers to situations where two or more independent variables are highly correlated, but not perfectly.
– The relationship between these variables is approximately linear.
– Multicollinearity is common in financial datasets, where variables often share strong relationships.
Implications
– Perfect collinearity prevents the estimation of regression coefficients.
– Severe multicollinearity can inflate standard errors, reduce the precision of coefficient estimates, and hinder the ability to identify statistically significant predictors.
Detecting and addressing multicollinearity is critical to ensuring the validity of regression analyses.
Consequences of Multicollinearity
While it is possible to estimate a regression model with multicollinearity, this issue has several important implications for the reliability of the regression results:
Key Points:
1- Estimation is Still Feasible:
– Multicollinearity does not prevent the estimation of a regression equation, as long as the relationships between independent variables are less than perfect.
2- Consistency of Estimates:
– Multicollinearity does not affect the consistency of regression coefficient estimates. However, the precision of these estimates is significantly impacted.
3- Imprecise and Unreliable Estimates:
– When multicollinearity is present, the standard errors of the regression coefficients become inflated, leading to:
— Difficulty in determining which independent variables are significant predictors.
— Reduced ability to reject the null hypothesis with a t-test.
4- Interpretation Challenges:
– The inflated standard errors make it problematic to interpret the role and significance of independent variables.
Detecting Multicollinearity
Multicollinearity occurs when independent variables in a regression model are highly correlated, making it difficult to determine the individual effect of each variable on the dependent variable. Below are the key points and detection methods.
Key Indicators of Multicollinearity:
1- High R^2 and Significant F-Statistic, but Insignificant t-Statistics:
– The regression model explains a large portion of the variation in the dependent variable, but individual variables are not statistically significant.
– This suggests the independent variables collectively explain the dependent variable well, but their individual contributions are unclear due to high correlations among them.
2- Presence Without Obvious Pairwise Correlations:
– In models with multiple independent variables, multicollinearity can exist even if pairwise correlations among variables are low.
– This can happen when groups of independent variables collectively exhibit hidden correlations.
Method for Detection: Variance Inflation Factor (VIF):
1- Purpose of VIF:
– The VIF quantifies how much the variance of a regression coefficient is inflated due to multicollinearity.
2- Calculation of VIF:
– The VIF for a variable j is calculated as:
— VIF = 1 / (1 - R^2_j)
— Where R^2_j is the R^2 value when the variable j is regressed against all other independent variables.
3- Interpreting VIF Values:
– The lowest possible VIF value is 1, indicating no correlation with other independent variables.
– A higher R^2_j leads to a higher VIF value, signaling greater multicollinearity.
– Thresholds for Concern:
— A VIF value of 5 or higher suggests potential multicollinearity that requires further investigation.
— A VIF value above 10 is considered a serious problem and indicates a need for corrective action, such as removing or combining variables.
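A minimal sketch of computing VIFs with statsmodels on simulated data in which two regressors are deliberately near-collinear:

```python
# Variance inflation factors for each independent variable.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + 0.1 * rng.normal(0, 1, n)   # nearly collinear with x1
x3 = rng.normal(0, 1, n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Skip column 0 (the intercept); report VIF for each independent variable.
vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print([round(v, 1) for v in vifs])   # x1 and x2 should show VIFs well above 10
```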
Correcting for Multicollinearity
The best way to correct for multicollinearity is to exclude one or more of the independent variables. This is often done by trial and error. Other potential solutions include increasing the sample size and using different proxies for an independent variable. Ultimately, multicollinearity may not be a significant concern for analysts who simply wish to predict the value of the dependent variable without requiring a strong understanding of the independent variables.
The Newey-West method, which involves making adjustments to standard errors, is used to correct for both serial correlation and heteroskedasticity.
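A minimal sketch of requesting Newey-West (HAC) standard errors in statsmodels; the data are simulated and the maxlags choice is illustrative.

```python
# Newey-West (HAC) standard errors adjust for both serial correlation and heteroskedasticity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 120
x = rng.normal(0, 1, n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.5 * e[t - 1] + rng.normal(0, 1)     # serially correlated errors
y = 1.0 + 0.7 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(fit.params)   # coefficient estimates are the same as plain OLS
print(fit.bse)      # Newey-West standard errors
```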
1.4 Extensions of Multiple Regression
– describe influence analysis and methods of detecting influential data points
– formulate and interpret a multiple regression model that includes qualitative independent variables
– formulate and interpret a logistic regression model
Influential Data Points
Influential data points are observations in a dataset that can significantly alter the results of a regression analysis when included. Analysts must identify and assess these points to understand their influence on the regression model.
Types of Influential Data Points:
1- High-Leverage Points:
– These are extreme values for one or more of the independent variables.
– High-leverage points can disproportionately impact the slope of the regression line.
2- Outliers:
– These are extreme values for the dependent variable.
– Outliers can affect the goodness of fit and the statistical significance of regression coefficients.
Impact on Regression Analysis:
1- High-Leverage Points:
– Tend to distort the slope of the regression line by pulling it toward themselves.
– Removing these points can result in a more accurate representation of the data’s underlying relationships.
2- Outliers:
– Influence the goodness-of-fit measures and regression coefficients.
– May inflate or deflate the significance of certain variables.
Considerations for Analysts:
– Visual Inspection:
— Influential data points can be identified using scatterplots. Dotted regression lines can illustrate how these points tilt the overall slope.
– Further Investigation:
— Extreme values near the regression line may not adversely impact the model but should still be examined to ensure they reflect real-world conditions.
By recognizing and accounting for influential data points, analysts can refine their models to improve accuracy and reliability.
Detecting Influential Points
Influential data points can significantly affect regression results. While scatterplots are helpful for visual identification, quantitative methods are necessary to reliably detect these points.
High-Leverage Points:
1- Leverage Measure (h_ii):
– Quantifies the distance between the i-th value of an independent variable and its mean.
– Values range from 0 to 1, with higher values indicating more influence.
2- Threshold for Influence:
– An observation is considered influential if its leverage measure exceeds:
— h_ii > 3 × (k + 1) / n
– For a model with three independent variables and 40 observations:
— Threshold = 3 × (3 + 1) / 40 = 0.3
Outliers:
1- Studentized Residuals:
– Calculated by removing one observation at a time and measuring its effect on the model.
2- Steps to Calculate Studentized Residuals:
– Run the regression with n observations.
– Remove one observation and rerun the regression with n - 1 observations.
– Repeat this for every observation in the dataset.
– For each observation i:
— Calculate the deleted residual (e_i*) as the difference between the observed dependent variable (Y_i) and its predicted value (Y_hat_i) from the model estimated without observation i.
— Compute the standard deviation of these deleted residuals (s_e*).
— Calculate the studentized deleted residual (t_i*) using:
—- t_i* = e_i* / s_e*
3- Threshold for Outliers:
– An observation is classified as an outlier if the absolute value of its studentized residual exceeds the critical t-value for n - k - 2 degrees of freedom at the chosen significance level.
– If |t_i*| > 3, the observation is immediately labeled as an outlier.
General Criteria for Influence:
1- Leverage:
– If h_ii > 3 × (k + 1) / n, the observation is potentially influential.
2- Studentized Residuals:
– If t_i* exceeds the critical t-value, the observation is potentially influential.
Key Takeaways:
– Influence Metrics: Leverage measures influence of independent variables, while studentized residuals measure influence of dependent variables.
– Quantitative Tools: Use h_ii and t_i* to detect influential points and assess their impact on the regression model.
– Further Investigation: Influential points may indicate data errors or model misspecification, such as omitted variables.
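A minimal sketch of these influence diagnostics using statsmodels on simulated data with one planted outlier:

```python
# Leverage (h_ii) and studentized deleted residuals via statsmodels influence diagnostics.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n, k = 40, 3
X = sm.add_constant(rng.normal(0, 1, (n, k)))
y = X @ np.array([1.0, 0.5, -0.3, 0.8]) + rng.normal(0, 1, n)
y[5] += 8.0                                        # plant an outlier in the dependent variable

fit = sm.OLS(y, X).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag               # h_ii values
t_star = influence.resid_studentized_external      # studentized deleted residuals

leverage_flags = leverage > 3 * (k + 1) / n        # high-leverage threshold
outlier_flags = np.abs(t_star) > 3                 # quick outlier flag

print(leverage_flags.sum(), outlier_flags.sum())
```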
Defining a Dummy Variable
When performing multiple regression analysis, financial analysts often create dummy variables (also known as indicator variables) to represent qualitative data. A dummy variable takes the value of 1 if a condition is true and 0 if it is false.
Reasons for Using Dummy Variables:
1- Reflecting Data’s Inherent Properties:
– A dummy variable can indicate inherent characteristics, such as whether a firm belongs to the technology industry (dummy variable = 1) or not (dummy variable = 0).
2- Capturing Identified Characteristics:
– Dummy variables can distinguish observations based on specific conditions. For example:
— Data recorded before the COVID-19 pandemic (dummy variable = 0).
— Data recorded after the onset of the pandemic (dummy variable = 1).
3- Representing True/False Conditions in a Dataset:
– Dummy variables can indicate binary outcomes, such as whether a firm’s revenue exceeds USD 1 billion (dummy variable = 1) or does not (dummy variable = 0).
By converting qualitative data into quantitative format, dummy variables enable analysts to incorporate categorical variables into regression models effectively.
Dummy Variables with Multiple Categories
To distinguish between n categories, it is necessary to create n - 1 dummy variables. The unassigned category serves as the “base” or “control” group, and the coefficients of the dummy variables are interpreted relative to this base group.
Key Consideration:
– If n dummy variables are used instead of n - 1, the regression model will violate the assumption of no exact linear relationships (perfect multicollinearity) between the independent variables.
This ensures that the model remains properly specified and avoids redundancy among the dummy variables.
Visualizing and Interpreting Dummy Variables
General Case: To understand the use of dummy variables in regression, consider a simple linear regression model with one independent variable:
Y = b0 + b1X
– In this formula:
— Y: Dependent variable
— X: Continuous independent variable
— b0: Intercept
— b1: Slope coefficient
Adding a dummy variable can impact either the intercept, the slope, or both.
Intercept Dummies:
– A dummy variable (D) takes the value of:
— 1 if a particular condition is met.
— 0 if the condition is not met.
– Modifying the formula with an intercept dummy becomes:
Y = b0 + d0D + b1X
– Where:
— d0: The vertical adjustment to the intercept when D = 1.
Cases:
1- When the condition is NOT met (D = 0):
Y = b0 + b1X
– The regression follows the original formula.
2- When the condition is met (D = 1):
Y = (b0 + d0) + b1X
– The intercept is adjusted by d0, effectively shifting the regression line up or down.
Visualization:
– If d0 > 0: The line shifts upward.
– If d0 < 0: The line shifts downward.
Intercept and Slope Dummies
This scenario arises when differences between two groups affect both the intercept and the slope of the regression model. The general formula that includes adjustments for both is:
Y = b0 + d0D + b1X + d1(D · X)
Explanation of the Formula:
1- When D = 0 (Control Group):
– The formula reverts to:
Y = b0 + b1X
2- When D = 1 (Non-Control Group):
– The formula adjusts to:
Y = (b0 + d0) + (b1 + d1)X
Interpretation:
b0: The intercept for the control group.
d0: The vertical adjustment to the intercept for the non-control group.
b1: The slope of the regression line for the control group.
d1: The adjustment to the slope for the non-control group.
Visualization: The intercept shifts up or down by d0 when moving between the control and non-control groups.
The slope pivots by d1, causing the regression line to steepen or flatten.
For example:
– If d0 > 0, the intercept moves upward.
– If d1 < 0, the slope becomes flatter for the non-control group.
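A minimal sketch of estimating an intercept-and-slope dummy specification, Y = b0 + d0·D + b1·X + d1·(D·X), on simulated data; all values are illustrative.

```python
# Intercept-and-slope dummy regression with an interaction term D*X.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 80
x = rng.normal(0, 1, n)
d = (rng.random(n) > 0.5).astype(float)              # dummy: 1 if condition met, else 0
y = 1.0 + 0.5 * d + 2.0 * x - 0.8 * d * x + rng.normal(0, 0.5, n)

X = sm.add_constant(np.column_stack([d, x, d * x]))  # columns: const, D, X, D*X
fit = sm.OLS(y, X).fit()
print(fit.params)   # estimates of [b0, d0, b1, d1]
```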
Testing for Statistical Significance
Dummy variables help distinguish between categories of data. Their statistical significance can be assessed using t-statistics or their corresponding p-values, with the following thresholds:
At the 5% significance level, a slope coefficient is statistically different from zero if the p-value is less than 0.05.
At the 1% significance level, the slope coefficient must have a p-value less than 0.01 to conclude that it has a non-zero value.
Using p-values is generally quicker than directly interpreting t-statistics for assessing statistical significance.
Qualitative Dependent Variables
Qualitative dependent variables, also called categorical dependent variables, are used in forecasting when outcomes are finite. For instance, predicting bankruptcy (1 for bankruptcy, 0 otherwise) or other outcomes with multiple categories.
Key Points:
1- Logit Model:
– Preferred for qualitative dependent variables as it accounts for probabilities and avoids assuming a linear relationship between the dependent and independent variables.
Probability and Odds:
– Probabilities for all outcomes must sum to 1.
– For binary variables:
— If the probability of an outcome is represented by p, the probability of the other outcome is 1 - p.
– Formula for Odds:
– Odds of an event occurring = p / (1 - p)
Where:
— p: Probability of the event occurring.
— 1 - p: Probability of the event not occurring.
This approach ensures probabilities are modeled appropriately, especially for qualitative outcomes.
Logistic Regression (Logit):
In logistic regression, the dependent variable represents the log odds of an event occurring. This is calculated by taking the natural logarithm of the odds:
– Formula for Log Odds:
Log odds = ln(p / (1 - p))
Where:
— ln: Natural logarithm.
— p: Probability of the event occurring.
— 1 - p: Probability of the event not occurring.
The logistic transformation linearizes the relationship between the dependent variable and independent variables, constraining probability estimates to values between 0 and 1.
Assuming three independent variables, the regression equation is:
ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + e
Where:
— b0: Intercept.
— b1, b2, b3: Coefficients of independent variables.
— X1, X2, X3: Independent variables.
— e: Residual/error term.
To determine the probability implied by the dependent variable, the equation is rearranged as follows:
– Probability Formula:
p = 1 / [1 + exp(-(b0 + b1X1 + b2X2 + b3X3))]
Where:
— exp: Exponential function.
— p: Probability of the event occurring.
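A quick numeric check of the log-odds and probability formulas above; the coefficients and X values are hypothetical.

```python
# Convert a fitted log-odds value into a probability between 0 and 1.
import math

b0, b1, b2, b3 = -2.0, 0.8, -0.4, 1.1
x1, x2, x3 = 1.5, 2.0, 0.5

log_odds = b0 + b1 * x1 + b2 * x2 + b3 * x3
p = 1 / (1 + math.exp(-log_odds))

print(round(log_odds, 3), round(p, 3))   # p is constrained between 0 and 1
```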
Logit Model vs Linear Probability Model
Key Differences Between Models:
1- Linear Probability Model (LPM):
– Probability estimates can go below 0 or above 1, which is unrealistic.
– The relationship between the independent variable (e.g., X1) and the dependent variable (Y = 1) is linear.
2- Logit Model:
– Constrains probabilities between 0 and 1, producing more realistic results.
– The relationship between X1 and the dependent variable (Y = 1) is non-linear and follows an S-shaped curve (logistic function).
Estimation of Coefficients:
1- Method:
– Coefficients in a logit model are estimated using maximum likelihood estimation (MLE) rather than ordinary least squares (OLS), as used in linear regression.
2- Interpretation of Coefficients:
– Logit coefficients quantify the change in the log odds of the event occurring (Y = 1) per unit change in the independent variable, holding other variables constant.
– Interpretation is less intuitive than linear regression coefficients.
Assessing Model Fit:
1- Likelihood Ratio (LR) Test:
– Similar to the F-test in linear regression, used to assess the overall fit of the logistic regression model.
– The test statistic is based on the difference between the log likelihoods of the restricted and unrestricted models.
2- Null Hypothesis (H₀):
– The restricted model (fewer predictors) is a better fit than the unrestricted model.
– Reject H₀ only if the LR test statistic exceeds the critical chi-square value (one-sided test).
Key Takeaways:
– LPM vs Logit Models: Logit models are superior for ensuring probabilities are constrained between 0 and 1.
– Coefficient Estimation: Logit models use MLE, making coefficient interpretation more complex.
– Model Fit: Use the LR test and pseudo-R^2 to assess and compare logistic regression models.
Pseudo-R^2:
– Logistic regression does not use OLS, so it does not provide an R^2 measure. Instead, a pseudo-R^2 is used to compare model specifications.
– Limitations:
— Not comparable across different datasets.
— Should only be used to compare models with the same dataset.
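A minimal sketch of fitting a logit model with statsmodels on simulated data and reading the maximum-likelihood coefficients, the likelihood ratio test, and McFadden's pseudo-R^2:

```python
# Logistic regression estimated by maximum likelihood, with LR test and pseudo-R^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = sm.add_constant(rng.normal(0, 1, (n, 3)))            # intercept + three predictors
true_b = np.array([-0.5, 1.2, -0.8, 0.4])
p = 1 / (1 + np.exp(-(X @ true_b)))
y = rng.binomial(1, p)                                    # binary dependent variable

logit_fit = sm.Logit(y, X).fit(disp=0)
print(logit_fit.params)                      # change in log odds per unit change in each X
print(logit_fit.llr, logit_fit.llr_pvalue)   # likelihood ratio test vs. intercept-only model
print(logit_fit.prsquared)                   # McFadden's pseudo-R^2
```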
How to Interpret ANOVA and Regression Statistics
1- Significance F (p-value for the F-test):
– Definition: Measures the probability that all variable coefficients are equal to zero (i.e., the independent variables collectively have no explanatory power for the dependent variable).
– Interpretation: A low Significance F (e.g., below 0.05 or 5%) suggests the model is statistically significant. A high Significance F (e.g., 0.563 or 56.3%) indicates the model is not significant and that the independent variables do not explain the variation in the dependent variable.
– Example: In Exhibit 2, a Significance F of 56.3% shows that there is a high probability that all variable coefficients are zero, implying no meaningful seasonality in portfolio returns.
2- R-squared (Coefficient of Determination):
– Definition: Represents the proportion of variance in the dependent variable explained by the independent variables.
– Interpretation: A higher R-squared value indicates a better fit of the model. However, it does not account for the number of variables included.
– Example: In Exhibit 2, an R-squared value of 10.3% means that only 10.3% of the variance in portfolio returns is explained by the independent variables (monthly dummy variables). This is a weak explanatory power.
3- Adjusted R-squared:
– Definition: Adjusts the R-squared value for the number of independent variables in the model, penalizing the inclusion of superfluous variables.
– Interpretation: A negative Adjusted R-squared indicates that the model performs worse than a model with no predictors at all.
– Example: In Exhibit 2, the Adjusted R-squared is -0.014, showing that the inclusion of monthly dummy variables does not meaningfully explain excess portfolio returns and may even degrade model performance.
4- F-statistic:
– Definition: Tests the joint significance of all the independent variables in the model.
– Interpretation: A low F-statistic, paired with a high Significance F, indicates the independent variables collectively have little explanatory power.
– Example: In Exhibit 2, the F-statistic of 0.879, paired with the high Significance F, reinforces the conclusion that the model does not explain the dependent variable effectively.
5- t-Statistic and p-Values for Individual Coefficients:
– Definition: The t-statistic tests the significance of individual coefficients. A high absolute t-statistic (e.g., above 2 in magnitude) and a low p-value (e.g., below 0.05) indicate the variable is statistically significant.
– Interpretation: Non-significant coefficients suggest the variable does not contribute meaningfully to explaining the dependent variable.
– Example: In Exhibit 2, all monthly dummy variables have t-statistics close to zero and high p-values (all above 0.05), confirming that no month has a statistically significant impact on portfolio returns.
6- Coefficients:
– Definition: Represent the magnitude and direction of the relationship between each independent variable and the dependent variable.
– Interpretation: A positive coefficient indicates a positive relationship, while a negative coefficient suggests a negative relationship. However, if the coefficient is not statistically significant, its interpretation is unreliable.
– Example: In Exhibit 2, the coefficients for all months are statistically insignificant, so their values (e.g., -3.756 for February) cannot be interpreted as meaningful relationships.
Additional Metrics to Interpret
1- Multiple R (Correlation Coefficient):
– Definition: The correlation between the observed values of the dependent variable and the values fitted by the regression; it equals the square root of R-squared.
– Interpretation: A value closer to 1 indicates a strong relationship, while a value closer to 0 indicates a weak relationship. However, it does not measure causation or statistical significance.
– Example: In Exhibit 2, Multiple R = 0.321, showing a weak correlation between portfolio returns and the monthly dummy variables.
2- SS (Sum of Squares):
– Definition: Quantifies the variation in the data. The ANOVA table decomposes it as Total SS = Regression SS + Residual SS:
Regression SS: Variation explained by the independent variables.
Residual SS: Unexplained variation (error).
Total SS: Total variation in the dependent variable.
– Interpretation: Higher Regression SS relative to Total SS indicates that the independent variables explain more variation in the dependent variable.
– Example: In Exhibit 2:
Regression SS = 634.679, representing the variation explained by the monthly dummies.
Residual SS = 5511.369, indicating that most of the variation is unexplained by the model.
Total SS = 6146.048, showing the total variation in portfolio returns.
The high Residual SS relative to Total SS confirms that the model does not explain the variation in returns effectively.
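Example (hedged Python sketch): these sums of squares also reproduce the R-squared reported above.
regression_ss, total_ss = 634.679, 6146.048
r_squared = regression_ss / total_ss  # explained share of total variation
print(round(r_squared, 3))            # about 0.103, i.e. the 10.3% shown in the exhibit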
3- MS (Mean Square):
– Definition: Represents the average variation. It is calculated as SS divided by degrees of freedom (df):
Regression MS = Regression SS / df (Regression)
Residual MS = Residual SS / df (Residual)
– Interpretation: A Regression MS that is high relative to the Residual MS (i.e., an F-ratio well above 1) indicates a better model fit.
– Example: In Exhibit 2:
Regression MS = 57.698, showing the average explained variation per independent variable.
Residual MS = 65.612, indicating higher average unexplained variation, further confirming the model’s poor fit.
4- F (F-statistic):
– Definition: Measures the ratio of explained variance to unexplained variance. It is calculated as F = Regression MS / Residual MS.
– Interpretation: Higher F values suggest the model explains a significant portion of the variance. However, its significance is evaluated using the Significance F.
– Example: In Exhibit 2, F = 0.879, which is very low, paired with a high Significance F of 56.3%, indicating the model is not statistically significant.
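Example (hedged Python sketch): the F-statistic and its p-value can be reproduced from the ANOVA figures, assuming 11 regression and 84 residual degrees of freedom as implied by the mean squares.
from scipy.stats import f
regression_ms = 634.679 / 11                   # about 57.698
residual_ms = 5511.369 / 84                    # about 65.612
f_stat = regression_ms / residual_ms           # about 0.879
significance_f = f.sf(f_stat, dfn=11, dfd=84)  # right-tail p-value of the F-test
print(round(f_stat, 3), round(significance_f, 3))  # about 0.879, with a p-value close to the 56.3% Significance F in the exhibit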
1.5 Time-Series Analysis
– calculate and evaluate the predicted trend value for a time series, modeled as either a linear trend or a log-linear trend, given the estimated trend coefficients
– describe factors that determine whether a linear or a log-linear trend should be used with a particular time series and evaluate limitations of trend models
– explain the requirement for a time series to be covariance stationary and describe the significance of a series that is not stationary
– describe the structure of an autoregressive (AR) model of order p and calculate one- and two-period-ahead forecasts given the estimated coefficients
– explain how autocorrelations of the residuals can be used to test whether the autoregressive model fits the time series
– explain mean reversion and calculate a mean-reverting level
– contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of different time-series models based on the root mean squared error criterion
– explain the instability of coefficients of time-series models
– describe characteristics of random walk processes and contrast them to covariance stationary processes
– describe implications of unit roots for time-series analysis, explain when unit roots are likely to occur and how to test for them, and demonstrate how a time series with a unit root can be transformed so it can be analyzed with an AR model
– describe the steps of the unit root test for nonstationarity and explain the relation of the test to autoregressive time-series models
– explain how to test and correct for seasonality in a time-series model and calculate and interpret a forecasted value using an AR model with a seasonal lag
– explain autoregressive conditional heteroskedasticity (ARCH) and describe how ARCH models can be applied to predict the variance of a time series
– explain how time-series variables should be analyzed for nonstationarity and/or cointegration before use in a linear regression
– determine an appropriate time-series model to analyze a given investment problem and justify that choice
– You must be careful to understand seasonal effects, such as revenue increases over the holiday season, and changing variances over time.
– Challenges of Working with Time Series:
— Residual errors may be serially correlated. When this happens in an autoregressive model (which uses lagged values of the series as regressors), the parameter estimates are inconsistent.
— The mean or variance of the series may change over time (nonstationarity), which makes the output of an autoregressive model invalid.
– A linear trend model for a time series is written as: y_t = b0 + b1*t + e_t, where t = 1, 2, …, T.
– y_t: Value of the time series at time t (dependent variable).
– b0: The y-intercept term.
– b1: The slope coefficient.
– t: Time, the independent or explanatory variable.
– e_t: A random error term.
– The slope and intercept parameters in the regression equation can be estimated using ordinary least squares.
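Example (hedged Python sketch): estimating b0 and b1 by OLS on a time index, using a simulated series whose trend parameters are made up for illustration.
import numpy as np
rng = np.random.default_rng(1)
T = 40
t = np.arange(1, T + 1)                          # time index t = 1, 2, ..., T
y = 3.0 + 2.3 * t + rng.normal(scale=2, size=T)  # simulated series with a linear trend
b1, b0 = np.polyfit(t, y, 1)    # OLS slope and intercept of y on t
print(round(b0, 2), round(b1, 2))  # estimates should be close to the true 3.0 and 2.3
y_hat_next = b0 + b1 * (T + 1)  # one-period-ahead trend forecast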
Example: Predicting a Value with Regression Equation
– The intercept and slope of a linear regression are b0 = 3 and b1 = 2.3.
– Calculate the predicted value of y after three periods.
Solution:
– y3 = 3 + 2.3(3) = 9.9.
– A linear trend may not correctly model the growth of a time series. Fitting a linear model to such a series produces persistent estimation errors, with serially correlated residuals (i.e., differences between the time series and the trend). Such cases call for a specification other than a linear trend.
– A log-linear trend often works well for financial time series. Such a model assumes a constant growth rate in the dependent variable. It can be modeled in the following manner: y_t = e^(b0 + b1*t + e_t), where t = 1, 2, …, T.
– The exponential growth is at a constant rate of [e^(b1) - 1].
– This can be transformed into a linear model by taking the natural log of both sides of the equation: ln(y_t) = b0 + b1*t + e_t, where t = 1, 2, …, T.
– A linear trend model can then be used on the transformed equation to determine the parameters b0 and b1.
Example: Predicting a Value with Log-Linear Regression Equation
– The intercept and slope of a log-linear regression are b0 = 2.8 and b1 = 1.4.
– Calculate the predicted value of y after three periods.
Solution:
– ln(y3) = 2.8 + 1.4(3) = 7
– y3 = e^7 = 1,096.63.
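Example (hedged Python sketch): reproducing this calculation; taking exp undoes the log transform (the values are those from the example above).
import math
b0, b1, t = 2.8, 1.4, 3
ln_y = b0 + b1 * t              # fitted value on the log scale: 7.0
y_hat = math.exp(ln_y)          # transform back to the original scale
print(round(y_hat, 2))          # about 1096.63
growth_rate = math.exp(b1) - 1  # implied constant growth rate per period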
The assumptions behind regression analysis must be satisfied in order for the results to be valid. With these models, a common violation is the correlation of regression errors across observations.
The Durbin-Watson statistic can be used to test for serial correlation in the model. When testing a model’s DW statistic, the null hypothesis is that no serial correlation is present.
A DW statistic significantly below 2 is strong evidence of positive serial correlation, while a DW statistic significantly above 2 indicates negative serial correlation. A value near 2 is consistent with no first-order serial correlation, since DW ≈ 2*(1 - r), where r is the first-order autocorrelation of the residuals.
If evidence of serial correlation is detected, the data may need to be transformed, or another estimation technique may need to be applied.
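Example (hedged Python sketch): checking trend-model residuals with the DW statistic, using a simulated series with highly persistent noise; statsmodels provides the statistic directly.
import numpy as np
from statsmodels.stats.stattools import durbin_watson
rng = np.random.default_rng(2)
t = np.arange(1, 61)
y = 5.0 + 0.4 * t + np.cumsum(rng.normal(size=60))  # linear trend plus highly persistent noise
b1, b0 = np.polyfit(t, y, 1)       # fit the linear trend by OLS
residuals = y - (b0 + b1 * t)      # trend-model residuals
print(durbin_watson(residuals))    # typically well below 2 here, suggesting positive serial correlation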