Regression Flashcards
Requirements for Linear Regression
- Simple random sample (SRS)
- Pairs of (x, y) data have a bivariate normal distribution (for each value of x, the corresponding y values are normally distributed); this can be checked by examining a scatterplot and double-checking any outliers
- Homoscedasticity of residuals (equal variance)
LR is a good model if:
- regression line of scatterplot appears to fit the points well
- r indicates a linear correlation
- High: R-squared/adj R-squared/F-stat/absolute value of t-statistic
- Low: Std Error/AIC/BIC/MAPE/MSE
* If not a good model, the best predicted value of y is the sample mean (y-bar)
Goal of linear regression
Find the line that minimizes the sum of the squared residuals (the least-squares line)
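A minimal sketch of a least-squares fit in plain Python, assuming a single predictor. The data values and the helper name fit_line are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical sample data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_line(xs, ys)
# Sum of squared residuals: the quantity the fitted line minimizes
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
print(slope, intercept, sse)
```

Any other line through these points would give a larger sum of squared residuals.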
Why use a residual plot?
Scatterplot with the residuals on the y-axis; used to check correlation and regression results. A random scatter is what we want: any pattern suggests an underlying non-linear relationship, and a changing “thickness” of the band of points suggests unequal variance
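As a rough sketch, the points of a residual plot can be computed as below. The data are hypothetical, and the slope and intercept are assumed to be the least-squares values already fitted to that data.

```python
# Hypothetical data; slope and intercept are the least-squares fit for it
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept = 0.6, 2.2

# Residual = observed y minus predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# These (x, residual) pairs are what a residual plot displays
for x, e in zip(xs, residuals):
    print(f"x={x:.0f}  residual={e:+.2f}")

# For a least-squares line with an intercept, the residuals sum to zero
residual_sum = sum(residuals)
```

The residuals here switch sign with no obvious trend, which is the kind of randomness to look for, although five points are far too few to judge a real model.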
Regression process
- construct a histogram to initially gauge normality
- construct a scatterplot + quantile plot and verify that there is a linear pattern
- construct a residual plot and verify that there is no pattern
Prediction interval
Interval estimate for a value of a variable (such as a predicted y for a given x), as opposed to a confidence interval, which estimates a population parameter
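A sketch of a prediction interval for an individual y at a given x0 in simple linear regression, using the standard formula y-hat ± t * s_e * sqrt(1 + 1/n + (x0 - x-bar)^2 / Sxx). The data, the function name prediction_interval, and the choice to pass the critical t value in from a t table are all assumptions made for illustration.

```python
import math

def prediction_interval(xs, ys, x0, t_crit):
    """Prediction interval for one new y at x0 (simple linear regression)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))  # standard error, two coefficients
    y_hat = intercept + slope * x0
    margin = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - margin, y_hat + margin

# Hypothetical sample data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
T_CRIT = 3.182  # two-tailed 95% t value for n - 2 = 3 degrees of freedom (t table)

low, high = prediction_interval(xs, ys, 3.0, T_CRIT)
low_far, high_far = prediction_interval(xs, ys, 6.0, T_CRIT)
print(low, high)
print(low_far, high_far)
```

The interval widens as x0 moves away from x-bar, which is why predictions far outside the observed x range are less reliable.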
Total deviation of (x, y)
vertical distance y minus y-bar, which measures the distance between the point (x, y) and the sample mean (y-bar)
Explained deviation of (x, y)
vertical distance y-hat minus y-bar, which measures the distance between the predicted value (y-hat) and the sample mean (y-bar)
Unexplained deviation of (x, y)
vertical distance of y minus y-hat, which is the vertical distance between the point (x, y) and the regression line
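The three deviations fit together: for every point, total deviation = explained deviation + unexplained deviation, and for a least-squares line the same holds for the sums of squares. A small check on hypothetical data (the variable names are illustrative):

```python
# Hypothetical sample data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
y_hats = [intercept + slope * x for x in xs]

total = [y - y_bar for y in ys]                      # y minus y-bar
explained = [yh - y_bar for yh in y_hats]            # y-hat minus y-bar
unexplained = [y - yh for y, yh in zip(ys, y_hats)]  # y minus y-hat

# Sums of squared deviations (the "variations")
sst = sum(d ** 2 for d in total)
ssr = sum(d ** 2 for d in explained)
sse = sum(d ** 2 for d in unexplained)
print(sst, ssr, sse)
```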
coefficient of determination (r-squared)
proportion of the variation in the response variable that is explained by the model; R-squared = explained variation / total variation = 1 - unexplained variation / total variation
correlation coefficient (r)
measures the strength and direction of the linear correlation between x and y; ranges from -1 to 1
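For a regression with one predictor, squaring r gives the coefficient of determination. A quick sketch on hypothetical data, computing r from the deviations and R-squared as 1 - unexplained variation / total variation:

```python
import math

# Hypothetical sample data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)  # total variation
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Linear correlation coefficient
r = sxy / math.sqrt(sxx * syy)

# R-squared from the fitted least-squares line
slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy
print(r, r_squared)
```

The sign of r matches the sign of the slope; R-squared discards that direction and keeps only the proportion of variation explained.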
adjusted r-squared
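as you add more X variables to your model, R-squared never decreases, since new variables can only add to the total amount of explained variation; adjusted R-squared penalizes each added variable, so it rises only when the new variable improves the fit by more than would be expected by chance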
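A sketch of the usual adjustment, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - q), where q counts all coefficients including the intercept. The R-squared values below are invented to show the penalty at work.

```python
def adjusted_r_squared(r_squared, n, q):
    """Adjusted R-squared; q counts all coefficients, including the intercept."""
    return 1 - (1 - r_squared) * (n - 1) / (n - q)

# Hypothetical comparison: one extra predictor nudges R-squared from 0.60 to 0.61
adj_small = adjusted_r_squared(0.60, n=5, q=2)
adj_big = adjusted_r_squared(0.61, n=5, q=3)
print(adj_small, adj_big)
```

Here the larger model has the higher R-squared but the lower adjusted R-squared, so the extra variable is not pulling its weight.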
Standard Error
Absolute measure of the typical distance that points fall from the regression line, in the units of y; measure of goodness of fit; = Sqrt(MSE) = Sqrt[SSE/(n - q)], where q = number of coefficients in the model
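A short sketch of the calculation, assuming the predicted values are already available. The data and the function name standard_error are hypothetical.

```python
import math

def standard_error(ys, y_hats, q):
    """Sqrt(SSE / (n - q)), where q is the number of coefficients."""
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
    return math.sqrt(sse / (len(ys) - q))

# Hypothetical observed values and least-squares predictions
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hats = [2.8, 3.4, 4.0, 4.6, 5.2]

s_e = standard_error(ys, y_hats, q=2)  # slope + intercept
print(s_e)
```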
F-statistic
measure of overall goodness of fit; F = MSR / MSE, where MSR = sigma (pred - mean)^2 / (q - 1) and MSE = SSE / (n - q)
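A sketch of the F-statistic as the ratio MSR / MSE, with MSR = sum of (pred - mean)^2 / (q - 1) and MSE = SSE / (n - q). The data and the function name f_statistic are hypothetical.

```python
def f_statistic(ys, y_hats, q):
    """F = MSR / MSE, where q is the number of coefficients in the model."""
    n = len(ys)
    y_bar = sum(ys) / n
    ssr = sum((yh - y_bar) ** 2 for yh in y_hats)          # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # unexplained variation
    msr = ssr / (q - 1)
    mse = sse / (n - q)
    return msr / mse

# Hypothetical observed values and least-squares predictions
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hats = [2.8, 3.4, 4.0, 4.6, 5.2]

f_stat = f_statistic(ys, y_hats, q=2)
print(f_stat)
```

A larger F means the model explains much more variation per coefficient than it leaves unexplained per degree of freedom.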
AIC
Akaike’s Information Criterion; measures goodness of fit of an estimated statistical model and can be used for model selection; lower is better
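One common form of AIC for least-squares models with normally distributed errors, dropping additive constants, is n * ln(SSE / n) + 2k, where k is the number of estimated parameters. This is a sketch under that assumption; conventions for counting k differ (some also count the error variance), and the SSE values below are invented.

```python
import math

def aic_least_squares(sse, n, k):
    """AIC up to an additive constant, assuming normally distributed errors."""
    return n * math.log(sse / n) + 2 * k

# Hypothetical comparison: the bigger model lowers SSE only slightly
aic_simple = aic_least_squares(sse=2.4, n=5, k=2)
aic_bigger = aic_least_squares(sse=2.3, n=5, k=3)
print(aic_simple, aic_bigger)
```

Only differences in AIC between models fitted to the same data are meaningful; here the simpler model has the lower AIC and would be preferred.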