Quants Flashcards
What is Regression Analysis?
A statistical process in which we infer the influence of one or more (independent) variables on a single (dependent) variable, or predict a dependent variable (criterion) from other independent variables (predictors).
Simple linear regression vs Multiple linear regression?
Simple linear regression has one dependent variable and one independent variable; multiple linear regression has a single dependent variable and two or more independent variables.
What should be an Analyst’s focus?
The heavy computational work is done by statistical software like Excel, Python, R, etc.
An analyst should focus on:
A) Specifying the model correctly,
B) Interpreting the output of the software.
Uses of Multiple Linear Regression?
A) To identify relationships between variables
B) To test existing theories
C) To forecast/predict a criterion
What is the general form of the regression equation? What is the intercept coefficient and what are the slope coefficients?
Yi = b0 + b1X1i + b2X2i + … + bkXki + ei
b0 is the intercept coefficient; it represents the expected value of Y (the criterion) when all the predictors are zero.
b1, b2, …, bk are the partial/regression slope coefficients, which measure how much the criterion changes when the corresponding independent variable changes by one unit, holding all other independent variables constant. There are always k slope coefficients, where k = number of independent variables.
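A minimal sketch in Python (using statsmodels, with made-up data) of how the intercept and partial slope coefficients would be read off a fitted model; the data and variable names here are hypothetical:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: 50 observations, two predictors X1 and X2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=50)

X_const = sm.add_constant(X)        # adds the column for the intercept b0
model = sm.OLS(y, X_const).fit()    # ordinary least squares estimation

print(model.params)                 # [b0, b1, b2]: intercept and partial slope coefficients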
Assumptions under multiple linear regression?
There are 5 in total:
A) Linearity - The relationship between the criterion and each of the predictors should be linear. (A plot of the data should show the regression line fitting the points across their full range, with no systematic curvature.)
B) Homoskedasticity - The variance of the error terms is constant across all observations. (Plot the predicted value of the criterion on the X-axis and the errors/residuals on the Y-axis; the errors should stay within a constant band with no pattern. A residual-plot sketch follows this list.)
C) Independence of Errors - The observations should be independent of one another. Regression residuals should be uncorrelated across observations.
D) Normality - The error terms should be normally distributed. (On a normal Q-Q plot of the residuals, deviations from the diagonal past +/-2 standard deviations indicate that the distribution is fat-tailed.)
E) Independence of Independent Variables - The independent variables are not random, and there is no exact linear relationship between two or more of the independent variables or combinations of the independent variables.
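A hedged sketch (continuing from the fitted statsmodels model above) of how the homoskedasticity and normality assumptions are commonly eyeballed from the residuals:

import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = model.resid                 # regression residuals (estimated error terms)
fitted = model.fittedvalues         # predicted values of the criterion

# Homoskedasticity check: residuals vs fitted values should show no pattern
plt.scatter(fitted, resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normality check: Q-Q plot; points far off the line suggest fat tails
sm.qqplot(resid, line="s")
plt.show()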
What is the goodness of fit?
Goodness of fit shows us how well a particular regression model fits the given data.
What is the simplest measure for goodness of fit?
R^2 or R-squared is the simplest measure to check/determine the goodness of fit.
In a simple regression model, R^2 (R-squared), the coefficient of determination, measures the goodness of fit of an estimated regression to the data.
How do we calculate the coefficient of determination?
R^2 = (sum of squares regression)/(sum of squares total) = (explained variation)/(total variation)
= Σ[(Y-hat_i) - (Y-bar)]^2 / Σ[(Yi) - (Y-bar)]^2, where Y-hat_i is the predicted Y value, Yi is the actual Y value, and Y-bar is the average value of Y.
*notice the denominator isn’t based on the regression model.
The highest value of R^2 can be 1 and the lowest can be zero. (The higher the better)
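A small sketch (reusing y and model from the earlier example) of computing R^2 directly from the sums of squares and comparing it with the value the software reports:

import numpy as np

y_hat = model.fittedvalues          # predicted Y values
y_bar = y.mean()                    # average value of Y

ss_regression = np.sum((y_hat - y_bar) ** 2)   # explained variation
ss_total = np.sum((y - y_bar) ** 2)            # total variation

r_squared = ss_regression / ss_total
print(r_squared, model.rsquared)    # the two values should agree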
R^2 works well for a simple regression model, but what is the problem with multiple linear regression?
As we add predictors to our model, R^2 increases (or at least does not decrease) even if the added predictors are not statistically significant (have no explanatory power).
This leads to an overfitting problem, which gives us an overly complex model.
So, how do we estimate the goodness of fit for a multiple linear regression model?
We use adjusted R^2.
How to calculate adjusted R^2?
Adjusted R^2 = 1 - [(n-1)/(n-k-1)] * (1 - R^2) where n = no. of observations and k = no. of predictors.
*(n-k-1) is the degrees of freedom.
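A one-function sketch of the formula (the helper name is hypothetical), checked against the value statsmodels reports for the earlier fitted model:

def adjusted_r2(r2, n, k):
    # n = number of observations, k = number of predictors
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

print(adjusted_r2(model.rsquared, n=50, k=2))   # should match model.rsquared_adj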
What happens to adjusted R^2 when we add new predictors in our regression model?
Adjusted R^2 increases if the added coefficient's t-statistic is greater than 1 in absolute value, and
Adjusted R^2 decreases if the added coefficient's t-statistic is less than 1 in absolute value.
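A hypothetical worked illustration using the adjusted_r2 helper above (the numbers are made up): with n = 30, a small bump in R^2 of the kind a weak predictor (|t| < 1) would add still lowers adjusted R^2:

print(adjusted_r2(0.50, n=30, k=2))   # ~0.463 with two predictors
print(adjusted_r2(0.51, n=30, k=3))   # ~0.453 after adding a weak third predictor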
Additional remarks about Adjusted R^2?
Adjusted R^2 can be negative (whereas R^2 has a lower bound of zero).
A high adjusted R^2 means that the model fits the data well, but it doesn't mean that the model is well specified (i.e., that it uses all the right predictors and that the predictors are in the correct form).
What are the shortcomings of Adjusted R^2?
In multiple regression, there is no neat interpretation of adjusted R^2 (unlike R^2 in simple regression, which is explained variation/total variation).
It doesn't indicate whether the coefficients are significant or whether the predictions are biased.
Also, it's not generally suitable for testing the significance of the model's fit (for which we explore ANOVA further, calculating the F-statistic and other goodness-of-fit metrics).