r squared and regression to the mean
r squared: the amount of variability explained
the proportion of variability explained is the square of the correlation coefficient
r squared is the proportion of the variability in the y-values explained by the line of best fit
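a minimal sketch (assuming numpy and made-up data) showing that r squared equals the proportion of variability explained, 1 - SSE/SST:

```python
import numpy as np

# hypothetical data: x = hours studied, y = test score
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([52.0, 55.0, 61.0, 70.0, 71.0, 78.0])

r = np.corrcoef(x, y)[0, 1]                # correlation coefficient

slope, intercept = np.polyfit(x, y, 1)     # line of best fit
residuals = y - (slope * x + intercept)
sse = np.sum(residuals**2)                 # unexplained variability
sst = np.sum((y - y.mean())**2)            # total variability in y
print(r**2, 1 - sse / sst)                 # the two values agree
```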
regression to the mean: another interpretation of r
regression means the predicted value moves closer to the mean
given x, the line of best fit predicts a value of y, which we call y hat (ŷ)
if x lies k standard deviations from the mean of x, then ŷ lies only rk standard deviations from the mean of y
r = 0: no association between x and y, so ŷ is just the mean of y (complete regression to the mean)
r = 1: there is no regression to the mean (ŷ lies exactly as far from its mean, in standard deviations, as x does)
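a minimal sketch of regression to the mean, assuming numpy and the same made-up data: the prediction's distance from the mean of y, measured in standard deviations, is r times x's distance from the mean of x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([52.0, 55.0, 61.0, 70.0, 71.0, 78.0])

slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]

x_new = 6.0                          # a value well above the mean of x
y_hat = slope * x_new + intercept    # predicted value

k_x = (x_new - x.mean()) / x.std()   # x's distance from its mean, in SDs
k_y = (y_hat - y.mean()) / y.std()   # y hat's distance from its mean, in SDs
print(k_y, r * k_x)                  # equal: y hat has regressed towards the mean
```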
multiple linear regression
there are multiple potential explanatory variables for a response variable
linear regression for two explanatory variables: ŷ = a + β1x1 + β2x2
by the principle of least squares, the plane of best fit has the values of a, β1 and β2 that minimise the sum of squared residuals
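a minimal least-squares sketch for two explanatory variables, assuming numpy and hypothetical data (the column of ones in the design matrix gives the intercept a):

```python
import numpy as np

# hypothetical data with two explanatory variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([5.1, 4.9, 9.2, 8.8, 13.1, 12.7])

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix

# lstsq returns the coefficients that minimise the sum of squared residuals
coef, sse, rank, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)
```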
MLR and matrices
the least-squares coefficients can be written in matrix form as β̂ = (XᵀX)⁻¹Xᵀy; you do not need to be able to do this by hand
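as a sketch only (assuming numpy), the coefficients solve the normal equations (XᵀX)β = Xᵀy, which gives the same answer as the lstsq call above:

```python
import numpy as np

# same hypothetical design matrix and response as above
X = np.column_stack([np.ones(6),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]])
y = np.array([5.1, 4.9, 9.2, 8.8, 13.1, 12.7])

beta = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(beta)                                # [a, b1, b2]
```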
MLR in excel
Data → Data Analysis → Regression
plot a residual plot for each explanatory variable
randomly distributed residuals indicate a good fit
a U or upside-down-U pattern in the residuals indicates a poor fit (the relationship is likely not linear)
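a minimal residual-plot sketch, assuming matplotlib and hypothetical residuals from a fitted model:

```python
import numpy as np
import matplotlib.pyplot as plt

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
residuals = np.array([0.3, -0.2, 0.4, -0.4, 0.1, -0.2])  # hypothetical

plt.scatter(x1, residuals)
plt.axhline(0, color="grey")   # good fits scatter randomly about this line
plt.xlabel("x1")
plt.ylabel("residual")
plt.show()
```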
usually try to choose the simplest model (with fewest explanatory variables)
- compromise between goodness of fit (small sum of squared residuals) and complexity (number of parameters)
AIC (Akaike information criterion)
balance complexity with goodness of fit
AIC = n ln(SSE/n) + 2(p + 1)
n is the number of data points
SSE is the sum of squared residuals (errors)
p is the number of explanatory variables
a small AIC is desired: choose the model with the smallest AIC
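a minimal sketch of the AIC formula above, assuming numpy and made-up SSE values for two candidate models:

```python
import numpy as np

def aic(sse, n, p):
    """AIC = n ln(SSE/n) + 2(p + 1); p is the number of explanatory variables."""
    return n * np.log(sse / n) + 2 * (p + 1)

# hypothetical comparison on the same n = 30 data points
print(aic(sse=12.4, n=30, p=1))   # simpler model
print(aic(sse=11.9, n=30, p=2))   # better fit but more complex
# prefer whichever model has the smaller AIC
```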
related types of regression
- MLR can fit polynomials to data by treating powers as additional explanatory variables
- MLR can consider interactions between variables: one explanatory variable changes the effect that the second explanatory variable has on the response (both ideas are sketched below)
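a minimal sketch of both ideas, assuming numpy and hypothetical data: a power (x1**2) and an interaction (x1*x2) are just extra columns in the design matrix:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([4.8, 3.9, 15.1, 11.8, 34.9, 28.2])

# extra explanatory variables: a polynomial term and an interaction term
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # a, b1, b2, b3 (for x1 squared), b4 (for x1*x2)
```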