19 - Fitting Lines to Data Flashcards
Least Squares Line
the “best line” minimizes the sum of the squares of the vertical distances from the points to the line
Least Squares estimates of the slope and intercept
fitted values written as y^, using the line y^ = b0 + b1*x
residual (e) = the difference, y - y^
least squares estimates:
b1 = r * (sy)/(sx)
b0 = ybar - b1*xbar
(r = correlation between y & x)
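*a minimal NumPy sketch of these formulas, with made-up example data (the numbers are purely illustrative):*
```python
import numpy as np

# hypothetical example data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]              # sample correlation between y and x
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * sy/sx
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1*xbar
print(b0, b1)                            # fitted line: y^ = b0 + b1*x
```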
fitted model
slope
intercept
fitted model → y^ = b0 + b1*x
slope → b1
*understand the units on b1. They are the units of y over the units of x*
intercept → b0
*has units of y*
Residual
e; the vertical distance from the point to the least squares line
always look at a plot of e against x → the residual plot
*the residuals should have no structure at all, should look like a random swarm of points*
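*a sketch of a residual plot (matplotlib; same illustrative data as above — np.polyfit gives the same least squares fit):*
```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)   # least squares slope and intercept
e = y - (b0 + b1 * x)          # residuals: e = y - y^

plt.scatter(x, e)              # plot e against x
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual e")
plt.show()                     # a good fit: a structureless swarm around 0
```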
numerical summaries of the residuals
- sample mean of the residuals always = 0
- sample standard deviation of the residuals:
se = sqrt[(e1^2 + … + en^2)/(n-2)]
- (n-2) → because we have estimated both slope and intercept in the regression
- se → measures unexplained variation in y
- low values of se are good
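*a sketch of these summaries, using the same illustrative setup:*
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)   # least squares fit
e = y - (b0 + b1 * x)          # residuals

n = len(x)
print(e.mean())                       # always ~0 (up to rounding)
se = np.sqrt(np.sum(e**2) / (n - 2))  # n-2: slope and intercept both estimated
print(se)                             # measures unexplained variation in y
```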
Root Mean Squared Error (RMSE)
se
Data = Signal + Noise paradigm
y = y^ + e
the model splits the observed data, y, into 2 parts: a systematic part, y^, and a random component, e
R2
R2 = r^2, the sample correlation squared
the proportion of variability in y explained by the regression model
- 0 <= R2 <= 1
- R2 = 1 → perfect linear association
- R2 = 0 → no linear association
- R2 has no measurement units
- we prefer models with a higher R2
R2 ≈ ____
R2 ≈ 1 - (se^2/sy^2)
- if the variance of the residuals is small compared to the variance of the raw data y, that is good: the model has explained a lot of the variation in y
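*a sketch checking both versions: R2 = r^2 exactly, and R2 ≈ 1 - se^2/sy^2 (approximate, since se divides by n-2 while sy divides by n-1); same illustrative data:*
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)

r = np.corrcoef(x, y)[0, 1]
R2 = r**2                                  # proportion of variability explained
se = np.sqrt(np.sum(e**2) / (len(x) - 2))
sy = y.std(ddof=1)
print(R2, 1 - se**2 / sy**2)               # close, but not identical
```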
Spurious association
driven by an omitted variable
Regression only identifies _____ and not _______
association, not causation