Linear regression Flashcards
SSR, MSR, R SQ
How to quantify the quality of a model and its predictions?
By calculating the Sum of Squared Residuals (SSR).
How do you calculate the sum of squared residuals?
- First calculate the residuals by finding the differences between observed and predicted values.
- Then square the residuals and sum up the squared residuals.
Sum of squared residuals formula
SSR = Σ(observed − predicted)²
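The two steps above (subtract, then square and sum) can be sketched in a few lines of Python, using made-up observed and predicted values:

```python
# Sketch of the SSR calculation with made-up example data.
observed = [1.0, 3.0, 2.0, 5.0]
predicted = [1.5, 2.5, 2.0, 4.0]  # hypothetical model predictions

# residual = observed - predicted; square each residual and sum them up
ssr = sum((o - p) ** 2 for o, p in zip(observed, predicted))
print(ssr)  # 1.5
```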
To what kinds of models can we apply SSR?
All kinds of models - straight lines or curves.
How do we measure the residuals - by the vertical or the perpendicular distance to the model?
By the vertical distance.
The perpendicular distance to the model is also called the
Shortest distance.
Why do we use vertical distance instead of the shortest distance
Because the perpendicular (shortest) distance does not preserve the x-values: the vertical distance compares the observed and predicted y-values at the same x-value, which is what we want.
What is the problem of SSR?
SSR is not easy to interpret because it depends on how much data we have. For example, a model fit to three data points might have SSR = 14 while a model fit to five data points has SSR = 22; that does not mean the second model is worse than the first. More data means more residuals to sum, so SSR tends to grow with dataset size even when the fit is just as good.
Should SSR be low or high
The smaller the value of SSR, the better the model fits the data. If SSR is zero, the model fits perfectly to the data.
How can we compare two models that may be fit to different-sized datasets?
Mean Squared Error
Formula for Mean squared error
MSE = SSR / n = Σ(observed − predicted)² / n, where n is the number of observations.
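The MSE formula is the SSR from before divided by the number of observations; a minimal sketch with made-up data:

```python
# Sketch: MSE is the SSR averaged over the number of observations.
observed = [1.0, 3.0, 2.0, 5.0]
predicted = [1.5, 2.5, 2.0, 4.0]  # hypothetical model predictions

ssr = sum((o - p) ** 2 for o, p in zip(observed, predicted))
mse = ssr / len(observed)
print(mse)  # 0.375
```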
What does MSE compute, intuitively?
The average squared residual. So, unlike SSR, MSE does not grow just because we add more data.
Why are MSEs still difficult to interpret?
When comparing two models, the MSE values depend on the scale of the measurements. For example, the same predictions measured in mm might give MSE = 4.7, while measured in meters they give MSE = 0.0000047.
How to overcome the disadvantage of MSE
Using R squared
How R squared overcomes the issue with MSE
R squared is independent of both size of the dataset and scale
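This scale independence can be demonstrated with a small sketch (made-up measurements): rescaling from millimetres to metres changes the MSE by a factor of a million, but leaves R squared untouched.

```python
# Demonstration with made-up data: changing units changes MSE but not R squared.
observed_mm = [120.0, 150.0, 170.0, 200.0]
predicted_mm = [125.0, 145.0, 175.0, 195.0]  # hypothetical predictions

def mse(obs, pred):
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)

def r_squared(obs, pred):
    mean = sum(obs) / len(obs)
    ssr_mean = sum((o - mean) ** 2 for o in obs)          # SSR around the mean
    ssr_fit = sum((o - p) ** 2 for o, p in zip(obs, pred))  # SSR around the model
    return (ssr_mean - ssr_fit) / ssr_mean

# The same measurements expressed in metres instead of millimetres
observed_m = [v / 1000 for v in observed_mm]
predicted_m = [v / 1000 for v in predicted_mm]

print(mse(observed_mm, predicted_mm))  # 25.0
print(mse(observed_m, predicted_m))    # 2.5e-05
print(r_squared(observed_mm, predicted_mm))  # identical in both units
print(r_squared(observed_m, predicted_m))
```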
How is R squared calculated?
- R squared compares the SSR (or MSE) around the mean y-axis value to the SSR (or MSE) around the model we are interested in. R squared therefore gives the percentage by which the predictions improved by using the model instead of just the mean.
What is the range of R squared values
0 to 1
When R squared is closer to one it means
The model fits the data better than using the mean y-axis value.
R squared formula
(SSR(mean) − SSR(fitted_line)) / SSR(mean)
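The formula can be sketched directly with made-up data: compute SSR around the mean y-value, SSR around a hypothetical fitted line, and take the ratio.

```python
# Sketch: R squared from SSR around the mean vs SSR around the fitted line.
observed = [2.0, 4.0, 3.0, 5.0, 6.0]
predicted = [2.5, 3.5, 3.5, 5.0, 5.5]  # hypothetical fitted-line predictions

mean_y = sum(observed) / len(observed)
ssr_mean = sum((o - mean_y) ** 2 for o in observed)
ssr_fit = sum((o - p) ** 2 for o, p in zip(observed, predicted))

r_squared = (ssr_mean - ssr_fit) / ssr_mean
print(r_squared)  # 0.9 -> residuals shrank 90% using the line instead of the mean
```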
SSR(mean) − SSR(fitted_line) - what does it mean?
The amount by which the residuals around the mean shrank when we used the fitted line; dividing by SSR(mean) turns this shrinkage into a percentage.
Rsquare = 1 means
Fitted line fits data perfectly
Rsquare = 0 means
SSR(mean) = SSR(fitted_line) - they are both equally good or bad
SSR(fitted_line) = 0 means
Fitted line fits data perfectly
In what scenarios do R squared results have low confidence?
When the dataset is small. A small dataset can produce a high R squared (close to 1) by chance, so any time we see a trend in a small dataset, it is difficult to be confident that the high R squared is not due to random chance.
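An extreme illustration of the small-data problem: with only two points, a straight line always passes through both of them exactly, so R squared is always 1 whether or not the trend is real. A minimal sketch:

```python
# Sketch: with only two data points, the fitted line passes through both,
# so R squared is always 1 regardless of whether the trend is meaningful.
x = [1.0, 2.0]
y = [3.0, 7.0]  # made-up values; any two points give the same result

# Fit the straight line through the two points
slope = (y[1] - y[0]) / (x[1] - x[0])
intercept = y[0] - slope * x[0]
predicted = [slope * xi + intercept for xi in x]

mean_y = sum(y) / len(y)
ssr_mean = sum((yi - mean_y) ** 2 for yi in y)
ssr_fit = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
r_squared = (ssr_mean - ssr_fit) / ssr_mean
print(r_squared)  # 1.0
```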
When does R squared result have high confidence
When there is large amount of data.
Is intuition the only way to have confidence in R squared results?
No. Even with a large dataset, intuition alone is not enough, so statisticians developed p-values.
R squared formula using MSE
(MSE(mean) − MSE(fitted_line)) / MSE(mean)
Does R squared always compare the mean to a straight fitted line?
No. Comparing the mean to a fitted line is the most common way to calculate R squared, but R squared can compare any two models - for example, a square wave to a sine wave.