Chapter 8 Linear Regression Flashcards
Linear regression
Statistical method for fitting a line to data where the relationship between two variables can be modeled by a straight line with some error.
y = b0 + b1x + e
e = the error term; it is usually dropped when writing the fitted line
Only used for linear trends. Nonlinear (curved) trends call for other methods
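The model above can be sketched as a simple prediction function; the coefficients here are made up for illustration, and the error term e is dropped when predicting:

```python
# Minimal sketch of the linear model y = b0 + b1*x.
# b0 and b1 below are invented example coefficients, not fitted values.
def predict(x, b0=4.0, b1=2.5):
    """Predicted y for a given x under the line y = b0 + b1*x."""
    return b0 + b1 * x

print(predict(10))  # 4.0 + 2.5 * 10 = 29.0
```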
Residuals
Residuals are leftover variation in the data after accounting for the model fit.
Data = fit plus residual
Each observation will have a residual.
Residual for an observation (x, y) in the data set = y - y^
A positive residual means the predicted value is lower than the observed value, so the model underestimates
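A quick sketch of computing residuals; the observed and predicted values below are invented for illustration:

```python
# Residual = observed y minus predicted y^ for each observation.
observed  = [5.0, 7.5, 6.0]   # made-up observed y values
predicted = [4.8, 8.0, 6.0]   # made-up predictions y^ from some fitted line
residuals = [round(y - yhat, 2) for y, yhat in zip(observed, predicted)]
print(residuals)  # [0.2, -0.5, 0.0] -> positive residual: model underestimated
```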
Residual plots
In a residual plot, the regression line is shown as a horizontal line at zero, and each residual is plotted at its original horizontal location, with the vertical coordinate equal to the residual
Residual plots identify characteristics or patterns still apparent in the data after fitting a model, e.g. whether the fit is better at one end of the line than at the other
Correlation
Describes the strength of a linear relationship. Always takes values between -1 and 1. Denoted R.
R = 1/(n-1) * sum over all observations of ((x_i - mean x) / s_x) * ((y_i - mean y) / s_y)
A negative R means a negative trend.
The stronger the linear trend, the closer R is to -1 or 1. If there is no apparent correlation, R is closer to zero.
Note: strong nonlinear trends sometimes produce correlations that don't reflect their strength.
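The correlation formula above, written out directly; the toy data are made up to show a fairly strong positive trend:

```python
from statistics import mean, stdev

def correlation_r(xs, ys):
    """R = 1/(n-1) * sum of ((x - mean x)/s_x) * ((y - mean y)/s_y)."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    return sum((x - xbar) / sx * (y - ybar) / sy
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]   # made-up explanatory values
ys = [2, 4, 5, 4, 5]   # made-up response values
print(round(correlation_r(xs, ys), 3))  # 0.775
```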
Least squares regression
Line that minimizes the sum of the squared residuals. The conditions for a least square line generally require:
-Linearity
- Nearly normal residuals (watch out for outliers far from the line)
- constant variability (variability is roughly consistent- variability of y should not be higher when X is larger)
- Independent observations - be cautious applying regression to time series data, as there may be underlying structure that should be considered
y^ = b0 + b1x
b1 = (s_y / s_x) * R (slope)
b0 = mean y - b1 * mean x (from point-slope form: the line passes through (mean x, mean y))
(s_y & s_x are the sample standard deviations; R is the correlation)
R squared
Correlation squared; describes the strength of the fit.
Measures the fraction of the variation in y that is explained by the regression line y^ = b0 + b1x.
Rule of thumb is to use R2 not R to comment on the strength of an association
R^2 = 0.49 means the regression explains about half (49%) of the variation in y
Leverage
Points that fall horizontally far from the center of the cloud tend to pull harder on the line, so they are called points with high leverage.
If such a point does appear to exert influence on the slope of the line, it is called an influential point.
What happens if we switch explanatory and response variables
The correlation, R, stays the same, but the regression line does not.
If we switched them, the least squares line would minimize the horizontal distances of the residuals rather than the vertical ones.
Extrapolation
Extrapolation means using a regression line to predict y-values for x-values outside the range of the data. Don't do it.
In regression analysis, what 3 numbers are connected with every individual?
X= the value of the explanatory variable
Y = the observed value of the response variable
Y^ = the predicted value of the response variable
Y and y^ are almost never equal. The difference y- y^ is the prediction error (residual)