Linear Regression (L9 & 10) Flashcards
what are the 2 types of linear regression?
simple linear regression:
outcome <— 1 predictor
multiple linear regression:
outcome <— 2+ predictors
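As a quick illustration, here is a minimal Python sketch (made-up data, assuming numpy is available) fitting one model of each type: one predictor for simple regression, three for multiple regression.

```python
# A minimal sketch with invented data: simple vs. multiple linear regression.
import numpy as np

rng = np.random.default_rng(0)

# Simple linear regression: ONE predictor -> one outcome.
x = rng.normal(size=50)                   # the single predictor
y = 2.0 + 1.5 * x + rng.normal(size=50)   # outcome = line + noise
b1, b0 = np.polyfit(x, y, deg=1)          # returns slope, then intercept
print(f"simple: intercept={b0:.2f}, slope={b1:.2f}")

# Multiple linear regression: SEVERAL predictors -> one outcome.
X = rng.normal(size=(50, 3))                          # three predictors
y2 = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=50)
X_design = np.column_stack([np.ones(50), X])          # add an intercept column
coefs, *_ = np.linalg.lstsq(X_design, y2, rcond=None)
print("multiple: intercept + 3 slopes =", np.round(coefs, 2))
```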
correlation:
- Only provides the direction and magnitude (effect) of a relationship; it says nothing about causation.
- The x and y variables are interchangeable.
correlation vs regression:
correlation:
Y (variable 1) <—> X (variable 2)
regression:
Y (outcome) <— X (predictor)
Linear regression takes us a step beyond correlation, and a step closer to causation. It does this by allowing us to predict an 'outcome' variable by knowing a 'predictor' variable.
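A small sketch of that asymmetry (invented data, assuming scipy is available): the correlation r is the same whichever variable comes first, but the regression slope changes when you swap outcome and predictor.

```python
# Correlation is symmetric; regression is not (invented data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

print(stats.pearsonr(x, y)[0])         # r for (x, y)
print(stats.pearsonr(y, x)[0])         # identical r for (y, x)
print(stats.linregress(x, y).slope)    # slope predicting y from x
print(stats.linregress(y, x).slope)    # a DIFFERENT slope predicting x from y
```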
what formula do almost all statistical tests follow?
Outcome = model + error
outcome: the thing you want to know (information, prediction, etc.)
model: the formula you will use to find it
error: how much you are "off"
Example correlation: what is the relationship between 10 rep max back squat and 1 rep max?
Example regression: can I predict someone's max squat by knowing their 10 rep max?
Example correlation: what is the relationship between number of hours studied and exam performance?
Example regression: can I predict a person's test score from the number of hours they studied?
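A sketch of the studying example in Python; the hours and scores below are invented purely for illustration.

```python
# Predicting a test score from hours studied (made-up numbers).
from scipy import stats

hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 72, 79, 83]

fit = stats.linregress(hours, scores)
# Predict the score for someone who studied 5.5 hours.
predicted = fit.intercept + fit.slope * 5.5
print(f"predicted score for 5.5 hours studied: {predicted:.1f}")
```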
how to find the line that best fits the data?
WE USE THE 'ORDINARY LEAST SQUARES' METHOD
*This is how we find the line that best fits the data.
*The line that goes through, or as close to, as many of the data points as possible.
*It's called least squares because we are summing the squared distances!
outcome = model + error
Yi = (b0 + b1Xi) + εi
Yi = outcome
εi = error: the actual score minus the score the model predicts
b0 + b1Xi = model: the intercept (b0) plus the slope (b1) times Xi (the ith person's score on the predictor variable)
Note: b0 and b1 are typically referred to as parameters or regression coefficients.
A note here: "i" is just a placeholder for whichever person (observation) you are dealing with.
b0 = intercept (the value of y when x = 0)
b1 = slope (the direction and steepness of the relationship; a negative b1 means a negative relationship)
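A minimal sketch of the least-squares estimates computed by hand on toy data, using the standard closed-form OLS formulas and checked against numpy's built-in fit:

```python
# Ordinary least squares "by hand" on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates of the slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# The same answer from a library fit, as a sanity check.
slope, intercept = np.polyfit(x, y, deg=1)
print(b0, b1)            # hand-computed
print(intercept, slope)  # library-computed (should match)
```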
In regression, these differences between the predicted and actual values, or εi, are referred to as RESIDUALS.
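Continuing the toy example above, a short sketch of the residuals: the model's predictions, then observed minus predicted for each point.

```python
# Residuals = observed values minus the line's predictions (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, deg=1)

y_hat = b0 + b1 * x       # predicted scores from the fitted line
residuals = y - y_hat     # one epsilon_i per person
print(np.round(residuals, 2))
```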
HOW DO WE CHECK A MODEL?
*Just because we found the best possible line for our data doesn't mean that the line does a great job of fitting our data, and therefore of making predictions.
*We have to check!
*We compare against the mean of the outcome (dependent) variable.
*We look at the difference between the observed values and the mean of Y.
*Remember, we square each value and then sum! This gives the total sum of squares, or SST.
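On the same toy outcome values, that calculation is one line (a sketch):

```python
# SST: squared differences from the mean of Y, summed (toy data).
import numpy as np

y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sst = np.sum((y - y.mean()) ** 2)
print(sst)
```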
HOW GOOD IS OUR REGRESSION LINE?
CALCULATING THE SSR
*This is called the SSR, or the sum of squares of the residuals (the εi from our line equation).
*We find the difference between the actual data and the regression line.
*The degree of inaccuracy when our best model is fitted to the data.
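And the SSR on the same toy data (a sketch): sum the squared distances between the observed values and the fitted line.

```python
# SSR: squared residuals around the regression line, summed (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, deg=1)

ssr = np.sum((y - (b0 + b1 * x)) ** 2)
print(ssr)
```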
THE IMPORTANT STEP: IS OUR MODEL BETTER THAN THE MEAN?
*If the SSM (the model sum of squares) is LARGE, then our model is much better than the mean: it has reduced the error (residuals) drastically, and therefore improved prediction!
*If the SSM is SMALL, then our model is no better than the mean.
SSM = SST − SSR
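Putting the pieces together on the toy data (a sketch, not the lecture's numbers):

```python
# SSM = SST - SSR: how much the line improves on the mean (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, deg=1)

sst = np.sum((y - y.mean()) ** 2)        # error around the mean
ssr = np.sum((y - (b0 + b1 * x)) ** 2)   # error around the fitted line
ssm = sst - ssr                          # the improvement over the mean
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSM={ssm:.2f}")
```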
when comparing 2 groups on a continuous (outcome) variable:
We can have different groups of people exposed to different conditions.
We can have the same group of people measured at different times, and possibly exposed to different conditions.
We can also have more than two groups, in either of the designs mentioned above (different or same people).
WE CAN USE LINEAR REGRESSION TO DO THIS
the coefficient (slope) is the difference between the group means (see the sketch below)
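A sketch of that claim with invented scores: dummy-code the two groups as 0 and 1, fit a regression, and the slope equals the difference between the group means (the intercept equals the mean of group 0).

```python
# Comparing two groups with regression via a 0/1 dummy predictor (invented data).
import numpy as np

group = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # e.g. control vs. training group
scores = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 17.0, 16.0, 18.0])

b1, b0 = np.polyfit(group, scores, deg=1)
print(b0)  # intercept: the mean of group 0 (11.5)
print(b1)  # slope: the difference between the group means (5.0)
print(scores[group == 1].mean() - scores[group == 0].mean())  # same number
```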