Quant 2.2 Flashcards
What is the goodness of fit? What is the measure of goodness of fit in a simple linear regression?
- As the name suggests, it measures how well the data fit our regression model.
In a simple linear regression model (one dependent variable and one independent variable), the coefficient of determination, a.k.a. R-squared or R^2, is the measure of the goodness of fit of an estimated regression to the data.
What is R-squared?
- R-squared = (Sum of squares regression) / (Sum of squares total)
It can also be viewed as (Explained variation) / (Total variation).
The maximum value of R-squared is 1 (the higher, the better the fit).
Y-hat is the predicted value of the dependent variable based on the regression model (equation).
Y-bar is the average of the actual Yi values which have been observed or gathered in the data.
Yi is the i-th observed value of the dependent variable in the data.
What is explained variation & total variation?
- Explained variation (the regression sum of squares) is the sum of squared differences between the values of Y predicted by the regression model (Y-hat) and the average of the observed Y values (Y-bar).
- Total variation (the total sum of squares) is the sum of squared differences between the actual observed Y values in our data and the average value of Y (Y-bar).
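A minimal numpy sketch (with made-up data, not from the curriculum) showing how Y-hat, Y-bar, the two sums of squares, and R-squared fit together:

```python
import numpy as np

# Hypothetical data: one independent variable X and dependent variable Y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])

# Fit a simple linear regression by least squares
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x                 # predicted values (Y-hat)
y_bar = y.mean()                              # average of observed Y (Y-bar)

ss_total = np.sum((y - y_bar) ** 2)           # total variation (SST)
ss_regression = np.sum((y_hat - y_bar) ** 2)  # explained variation (SSR)

r_squared = ss_regression / ss_total          # goodness of fit
print(round(r_squared, 4))
```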
What is adjusted R-squared and why do we need it?
- As we add independent variables to the regression model (equation), R^2 increases (or at least does not decrease) even if the amount they explain is not statistically significant.
As a solution to this problem, we have the adjusted R^2 (R^2 with a bar on top, i.e. R-bar^2), which most statistical software will report.
What is the formula of R-bar^2?
- R-bar^2 = 1 - [ (n-1) / (n-k-1)] * (1 - R^2)
here n = no. of observations & k = no. of independent variables.
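A small helper (hypothetical function name and example numbers) applying this formula, assuming R^2, n and k are already known:

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r_squared)

# e.g. R^2 = 0.60 with n = 30 observations and k = 4 independent variables
print(adjusted_r_squared(0.60, n=30, k=4))   # ~0.536, lower than the raw R^2
```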
What happens when we add a new independent variable to a regression model?
- When a new independent variable (say X2) is added to the regression model, the regression software estimates a slope coefficient (regression coefficient) for it. This b2 of the new X2 is the change in Y per unit change in X2 (as we already know), holding the other variables constant. Along with the new coefficient, the software also reports a t-statistic for it, and this t-statistic has an impact on the value of R-bar^2 (adjusted R^2).
How does the t-statistic of the slope co-efficient of the new independent variable affect the adjusted R^2?
- Adjusted R^2 increases if the new coefficient's t-statistic is greater than 1 in absolute value (|t| > 1), and decreases if |t| < 1.
Another important point to note is that adjusted R^2 can also take negative values.
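A rough sketch with statsmodels (synthetic data, hypothetical variable names) showing that R^2 never falls when a pure-noise variable is added, while adjusted R^2 can fall because the new coefficient's |t| is below 1:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
noise_x2 = rng.normal(size=n)                 # unrelated to y
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise_x2]))).fit()

print(small.rsquared, small.rsquared_adj)
print(big.rsquared, big.rsquared_adj)   # R^2 up slightly, adjusted R^2 may fall
print(big.tvalues[-1])                  # t-stat of the new coefficient
```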
What does a high value of adjusted r^2 or r^2 tell us?
- It only tells us how well the regression model fits the data and nothing more. An important point to remember is that it tells us nothing about how well specified our regression model is.
What are some limitations of the Adjusted r^2?
- There is no clear interpretation of adjusted R^2 in multiple linear regression (unlike R^2, it is not the share of variation in Y explained by the model).
- Adjusted R^2 doesn't address whether the regression coefficients are significant or whether the predictions are biased; assessing those requires examining residual plots and looking at other statistics.
- Neither R^2 nor adjusted R^2 is generally suitable for testing the significance of the model's fit; they only show how well the regression model fits the data.
So are there other statistical measures which can be used to measure the goodness of fit of the regression model?
- Yes, besides R-bar^2, there are a couple of other measures:
Akaike's information criterion (AIC) & the Bayesian information criterion (BIC).
Both AIC and BIC evaluate the quality of model fit "among competing models for the same dependent variable." Lower values indicate a better model under either criterion.
When do we prefer one over the other?
- AIC is preferred when the goal is a better forecast, i.e. the model is used for prediction purposes; BIC is preferred when the goal is the best goodness of fit.
No need to remember the formulae, but an important point to know is that k (the no. of independent variables) acts as a penalty parameter in both criteria: higher values of k result in higher values of the criteria, all else equal.
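A hedged statsmodels sketch (synthetic data): both criteria are reported on the fitted results, and the model with the lower value is preferred:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # adds no real explanatory power
y = 1.0 + 0.8 * x1 + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Lower AIC / BIC indicates the better model among competing models
print(m1.aic, m1.bic)
print(m2.aic, m2.bic)   # extra useless variable -> typically higher AIC/BIC
```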
How do we run a hypothesis test for a single regression co-efficient?
- Just as in simple linear regression, we run a hypothesis test on each individual coefficient.
*A t-test is used (the t-statistic will be given to us in the exam, so we won't have to calculate it).
What are the steps to do a hypothesis test?
- There are 6 steps
a. State the hypotheses - H0 (null hypothesis): b1 = 0 vs Ha: b1 ≠ 0
b. Identify the appropriate test statistic - for a single coefficient this is the t-statistic (as noted above; covered in Level 1)
c. State the level of significance (e.g. 5%, corresponding to a 95% confidence interval)
d. State the decision rule - determine the critical value to compare with the test statistic; reject the null if |test statistic| > critical value
e. Calculate the test statistic
f. Make a decision
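A minimal worked sketch of steps d-f with made-up numbers (a hypothetical estimated b1, its standard error, and sample size):

```python
from scipy import stats

b1 = 0.52           # estimated slope coefficient (hypothetical)
se_b1 = 0.18        # its standard error (hypothetical)
n, k = 40, 2        # observations and independent variables
alpha = 0.05        # level of significance

t_stat = (b1 - 0) / se_b1                           # e. calculate the test statistic
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)   # d. two-tailed critical value

reject_null = abs(t_stat) > t_crit                  # f. make a decision
print(t_stat, t_crit, reject_null)
```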
What is a Joint F-test?
- It's a hypothesis test used to jointly test a subset of independent variables in a multiple regression.
It involves two models: the unrestricted model (which includes all the independent variables) and the restricted/nested model (which excludes the subset of independent variables we want to test, i.e. their coefficients are restricted to zero).
The formula for the test statistic is as follows:
F = [(Sum of squared errors of the restricted/nested model - Sum of squared errors of the unrestricted model) / q] / [(Sum of squared errors of the unrestricted model) / (n - k - 1)]
Here, n is the no. of observations, k is the no. of independent variables in the unrestricted model, and (n - k - 1) is the error degrees of freedom of the unrestricted model.
q = the no. of independent variables we are restricting, i.e. the variables left out of the restricted model (the number of restrictions).
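A minimal sketch of the formula with hypothetical sums of squares plugged in, compared against an F critical value:

```python
from scipy import stats

# Hypothetical values
sse_restricted = 520.0     # SSE of the restricted / nested model
sse_unrestricted = 460.0   # SSE of the unrestricted (full) model
n, k, q = 50, 5, 2         # observations, independent vars in full model, restrictions

f_stat = ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))
f_crit = stats.f.ppf(0.95, dfn=q, dfd=n - k - 1)   # 5% significance level

print(f_stat, f_crit, f_stat > f_crit)   # reject the null if F > critical value
```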
What is a general linear F-test?
- It is an extension of the joint F-test where we test the significance of not just some restricted subset of variables but all the independent variables at once, i.e. the whole regression model.
H0 (null): all slope coefficients = 0 vs Ha: at least one slope coefficient ≠ 0.
Test statistic formula:
F = Mean regression sum of squares / Mean squared error = MSR / MSE
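A minimal sketch with hypothetical ANOVA-table numbers (SSR, SSE, n, k assumed, not from the curriculum):

```python
from scipy import stats

# Hypothetical ANOVA quantities
ssr, sse = 240.0, 360.0    # regression and error sums of squares
n, k = 40, 3               # observations and independent variables

msr = ssr / k              # mean regression sum of squares
mse = sse / (n - k - 1)    # mean squared error

f_stat = msr / mse
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)
print(f_stat, f_crit, f_stat > f_crit)   # reject H0 if F > critical value
```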