Quants Flashcards
What is Regression Analysis?
A statistical process in which we infer the influence of one or more (independent) variables on a single (dependent) variable, or predict a dependent variable (criterion) from other independent variables (predictors).
Simple linear regression vs Multiple linear regression?
Simple linear regression has one dependent variable and one independent variable; multiple linear regression has a single dependent variable and two or more independent variables.
What should be an Analyst’s focus?
The heavy computational work is done by statistical software like Excel, Python, R, etc.
An analyst should focus on:
A) Specifying the model correctly,
B) Interpreting the output of the software.
Uses of Multiple Linear Regression?
A) To identify relationships between variables
B) To test existing theories
C) To forecast/predict a criterion
What is the general form of the regression equation? What is the intercept coefficient and what are the slope coefficients?
Yi = b0 + b1X1i + b2X2i + … + bkXki + ei
b0 is the intercept coefficient; it represents the expected value of Y (the criterion) when all the predictors are zero.
b1, b2, …, bk are the partial/regression slope coefficients, which measure how much the criterion changes when the corresponding independent variable changes by one unit, holding all other independent variables constant. There are always k slope coefficients, where k = number of independent variables.
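A minimal sketch in Python (using statsmodels, with made-up data) of how the intercept and partial slope coefficients would be read off a fitted model; the data and variable names here are hypothetical:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: 50 observations, two predictors X1 and X2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=50)

X_const = sm.add_constant(X)        # adds the column for the intercept b0
model = sm.OLS(y, X_const).fit()    # ordinary least squares estimation

print(model.params)                 # [b0, b1, b2]: intercept and partial slope coefficients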
Assumptions under multiple linear regression?
There are 5 in total:
A) Linearity - The relationship between the criterion and each of the predictors should be linear. (A plot of the data should show the regression line fitting the points across their full range, with no systematic curvature.)
B) Homoskedasticity - The variance of the error terms is constant across all observations. (Plot the predicted value of the criterion on the X-axis and the errors/residuals on the Y-axis; the errors should stay within a constant band with no pattern. A residual-plot sketch follows this list.)
C) Independence of Errors - The observations should be independent of one another. Regression residuals should be uncorrelated across observations.
D) Normality - The error terms should be normally distributed. (On a normal Q-Q plot of the residuals, deviations from the diagonal past +/-2 standard deviations indicate that the distribution is fat-tailed.)
E) Independence of Independent Variables - The independent variables are not random, and there is no exact linear relationship between two or more of the independent variables or combinations of the independent variables.
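A hedged sketch (continuing from the fitted statsmodels model above) of how the homoskedasticity and normality assumptions are commonly eyeballed from the residuals:

import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = model.resid                 # regression residuals (estimated error terms)
fitted = model.fittedvalues         # predicted values of the criterion

# Homoskedasticity check: residuals vs fitted values should show no pattern
plt.scatter(fitted, resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normality check: Q-Q plot; points far off the line suggest fat tails
sm.qqplot(resid, line="s")
plt.show()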
What is the goodness of fit?
Goodness of fit shows us how well a particular regression model fits the given data.
What is the simplest measure for goodness of fit?
R^2 or R-squared is the simplest measure to check/determine the goodness of fit.
In a simple regression model, R^2 (R-squared), the coefficient of determination, measures the goodness of fit of an estimated regression to the data.
How do we calculate the coefficient of determination?
R^2 = (sum of squares regression)/(sum of squares total) = (explained variation)/(total variation)
= Σ[(Y-hat_i) - (Y-bar)]^2 / Σ[(Yi) - (Y-bar)]^2, where Y-hat_i is the predicted Y value, Yi is the actual Y value, and Y-bar is the average value of Y.
*notice the denominator isn’t based on the regression model.
The highest value of R^2 can be 1 and the lowest can be zero. (The higher the better)
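A small sketch (reusing y and model from the earlier example) of computing R^2 directly from the sums of squares and comparing it with the value the software reports:

import numpy as np

y_hat = model.fittedvalues          # predicted Y values
y_bar = y.mean()                    # average value of Y

ss_regression = np.sum((y_hat - y_bar) ** 2)   # explained variation
ss_total = np.sum((y - y_bar) ** 2)            # total variation

r_squared = ss_regression / ss_total
print(r_squared, model.rsquared)    # the two values should agree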
R^2 works well for a simple regression model, but what is the problem with multiple linear regression?
As we add predictors to our model, R^2 increases (or at least does not decrease) even if the added predictors are not statistically significant (have no explanatory power).
This leads to an overfitting problem, which gives us an overly complex model.
So, how do we estimate the goodness of fit for a multiple linear regression model?
We use adjusted R^2.
How to calculate adjusted R^2?
Adjusted R^2 = 1 - [(n-1)/(n-k-1)] * (1 - R^2) where n = no. of observations and k = no. of predictors.
*(n-k-1) is the degrees of freedom.
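A one-function sketch of the formula (the helper name is hypothetical), checked against the value statsmodels reports for the earlier fitted model:

def adjusted_r2(r2, n, k):
    # n = number of observations, k = number of predictors
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

print(adjusted_r2(model.rsquared, n=50, k=2))   # should match model.rsquared_adj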
What happens to adjusted R^2 when we add new predictors in our regression model?
Adjusted R^2 increases if the added coefficient's t-statistic is greater than 1 in absolute value, and
Adjusted R^2 decreases if the added coefficient's t-statistic is less than 1 in absolute value.
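A hypothetical worked illustration using the adjusted_r2 helper above (the numbers are made up): with n = 30, a small bump in R^2 of the kind a weak predictor (|t| < 1) would add still lowers adjusted R^2:

print(adjusted_r2(0.50, n=30, k=2))   # ~0.463 with two predictors
print(adjusted_r2(0.51, n=30, k=3))   # ~0.453 after adding a weak third predictor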
Additional remarks about Adjusted R^2?
Adjusted R^2 can be negative (whereas R^2 has a lower bound of zero).
A high adjusted R^2 means that the model fits the data well, but it doesn't mean that the model is well specified (i.e., that it uses all the right predictors and that the predictors are in the correct form).
What are the shortcomings of Adjusted R^2?
In multiple regression, there is no neat interpretation of adjusted R^2 (unlike R^2 in simple regression, which is explained variation/total variation).
It doesn't indicate whether the coefficients are significant or whether the predictions are biased.
Also, it's not generally suitable for testing the significance of the model's fit (for which we explore ANOVA further, calculating the F-statistic and other goodness-of-fit metrics).