UNIT 2 - REGRESSION Flashcards
How do you describe a scatterplot?
DIRECTION
FORM
STRENGTH
and STRANGE (unusual features, such as outliers)
How do you describe a scatterplot’s strength?
give the r value (if straight),
or say…
“tightly packed… loosely packed”
how do you describe direction?
positive or negative
how do you describe form of a scatterplot?
straight or curved?
What is wrong with “for each additional hour studied, a person’s test score will go up by five points?”
This implies CAUSATION, but this is just a correlation. You should say “on average, students with an additional hour of study time tend to score five points higher.” These are all different students, and we don’t know if the extra studying CAUSED the higher scores. We can only show causation with an EXPERIMENT. This is an observational study. If students were randomly assigned hours of study time, then you could discuss causation because that would be an experiment.
Difference between association and correlation?
association is talking about a relationship.
If you see a pattern in the scatterplot, there is an association.
Correlation is an actual calculated number, r, between two quantitative variables.
Why is it called the “least squares regression line?”
the LSRL?
Because, after you find the mean-mean point, you fix the line so that it minimizes the squared vertical distances to that line from each point.
It minimizes the squared residuals, the least squares….
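A minimal sketch of the "least squares" idea, using made-up study-time data (the numbers and function names here are illustrative assumptions, not from the course). It fits the line, then shows that tilting the line around the mean-mean point makes the total of the squared residuals bigger.

```python
from statistics import mean

# Hypothetical data: hours studied (x) and test scores (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 68, 85, 101, 114]

def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares regression line."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # puts the line through (x_bar, y_bar)
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Add up (actual - predicted)^2 over every point."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

a, b = least_squares_line(hours, scores)
best = sum_squared_residuals(hours, scores, a, b)

# Two other lines through the mean-mean point, one steeper and one flatter.
x_bar, y_bar = mean(hours), mean(scores)
steeper = sum_squared_residuals(hours, scores, y_bar - (b + 1) * x_bar, b + 1)
flatter = sum_squared_residuals(hours, scores, y_bar - (b - 1) * x_bar, b - 1)

print(best < steeper and best < flatter)  # the LSRL has the smallest total
```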
How do you find outliers in regression?
they don’t follow the “flow”
(pinky trick: cover the point with your pinky.. Then uncover.. Does it follow the flow?)
What is homoscedasticity?
equal scatter along the regression line
What values can r be?
from -1 to +1
(r near 0 is WEAK)
What is the line that you plot?
IT IS A MODEL!
It is the LSRL and it is the model we are talking about
what is a linear model?
It is an equation you can use or a line of a graph,
but it is just a model that says what kind of happens,
and can be used to ESTIMATE WHAT MIGHT HAPPEN
What does r² tell us?
(r-squared)
It tells us the percent of variability in y that is explained by the model with x.
If study time vs test score equation is
predicted score = 40 + 15 (study time).
How would you interpret the slope?
The model finds that on average, for each additional hour of study time a student has, they tend to score about 15 points more.
If study time vs test score equation is
predicted score = 40 + 15 (study time).
How would you interpret y intercept?
The model predicts that a person who studies 0 hours would score around 40 points.
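The slope and y-intercept readings from the last two cards can be checked with a tiny sketch. The function name is an assumption for illustration; the equation itself is the one on the cards.

```python
def predicted_score(study_hours):
    """Model from the cards: predicted score = 40 + 15 (study time)."""
    return 40 + 15 * study_hours

# y-intercept: the prediction for someone who studies 0 hours.
print(predicted_score(0))

# Slope: how much the prediction changes for one additional hour.
print(predicted_score(3) - predicted_score(2))
```

The first line prints 40 and the second prints 15. Both are statements about what the model predicts on average, not guarantees for any one student.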
If a linear association between study time and test score has an
r² = 0.85,
How do you interpret this?
(r² is a.k.a. the coefficient of determination)
85% of the variability in test score can be explained by study time with the model.
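One way to see why r² is read as "percent of variability explained" is to compare it with 1 minus (leftover variability in the residuals) over (total variability in y). This is a sketch on made-up data, not the 0.85 example from the card; all names are illustrative.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: hours studied (x) and test scores (y).
hours = [1, 2, 3, 4, 5, 6]
scores = [60, 58, 80, 85, 110, 105]

x_bar, y_bar = mean(hours), mean(scores)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in scores)  # total variability in y

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Variability still left in the residuals after using the model.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, scores))

r = sxy / sqrt(sxx * syy)
explained_fraction = 1 - sse / syy  # share of y's variability the model explains

print(round(r ** 2, 4), round(explained_fraction, 4))  # the two values agree
```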
What if a scatterplot goes straight across horizontally?
NO ASSOCIATION.
That would be like height and IQ, they are independent so each height has about the same IQ.
What is the “coefficient of determination?”
A fancy name for r²
Does r² tell direction?
NO
r² is never negative, so you can’t use it to see if the relationship is negative.
Can there be a correlation between grade and music preference?
No, music preference is categorical.
There can be an association, however.
Does the regression line (LSRL) go through a lot of points?
No, usually it goes through NONE!
It just goes through the center of the cloud of points.
If r= -0.9 is there a strong, negative linear relationship?
Maybe not.
CHECK THE SCATTER. One outlier or typo can make the r value look STRONG.
what is the LSRL
the “least squares regression line”
that line you plot
OR
That equation
What does r tell us?
(r is a.k.a the correlation coefficient)
The direction (+/-) and how strong a LINEAR relationship is between two QUANTITATIVE variables… (when linear)
What is the “correlation coefficient?”
The r value
which is response?
y variable,
the Vertical axis..
It “responds” to the x
Lurking variable: Why are there more ice cream sales on days that there are more surfing accidents? Is the ice cream putting surfers at risk? are people buying ice cream because they got hurt?
The WEATHER is the lurking variable.
When it is a nice day, there are more surfers and more ice cream is sold.
So, the WEATHER causes both to go up and down together.
Give example of incorrectly using the word “correlation”
“there is a correlation between gender and video game playing”
This person should say “association.”
You can’t say correlation because gender is categorical.
What’s wrong: Age and height have a correlation of 2.7
WRONG.
Correlation must be between 1 and -1
If r² = 0.99 is there a strong positive association?
It could be negative. r² does not tell us the direction because squaring a negative r gives a positive number too.
What is tricky about a lot of scatterplots you see on the AP test?
They are often the residuals plots, and not the actual data. Watch out!
Read the diagrams carefully.
What should we look for in resid plot?
Curve or pattern- if you see this you need a better model.
Also, it should have equalish scatter from left to right
It should look RANDOM
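To see what a "pattern" in the residuals looks like, here is a rough sketch that fits a straight line to data that is actually curved (y = x squared, chosen only for illustration). The residuals come out positive, then negative, then positive again, which is the U shape that says "get a better model."

```python
from statistics import mean

# Hypothetical curved data: y is exactly x squared.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

# Fit the least squares line anyway.
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Residual = actual - predicted, listed from left to right.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)  # high at both ends, low in the middle: NOT random
```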
What if the scatterplot is curved?
Either straighten the scatter and fit a line,
or keep it and fit a curve
Try quadreg, cubicreg, lnreg, logreg and check the graph and the r.
What is extrapolation?
Making predictions outside of the x values you have.
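A small sketch of how you might flag extrapolation before trusting a prediction. The helper name and the data are hypothetical.

```python
def is_extrapolation(x_new, observed_x):
    """True when x_new lies outside the range of x values in the data."""
    return x_new < min(observed_x) or x_new > max(observed_x)

observed_hours = [1, 2, 3, 4, 5]  # hypothetical study times in the data set

print(is_extrapolation(3.5, observed_hours))  # inside the data: False
print(is_extrapolation(12, observed_hours))   # way past the data: True
```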
does correlation mean causation?
NO WAY DUDE
Just because variables go up and down together doesn’t mean one causes the other.
What’s up with extrapolation? Is it OK?
Not ideal. Sometimes it’s all you can do, but state CAUTION.
If something is associated is it correlated?
Not necessarily.
It can be associated and have a zero correlation
( parabolic scatterplot)
or categorical variables.
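The parabola case can be checked directly. In this sketch (made-up points, illustrative function name) y is completely determined by x, so there is clearly an association, yet r comes out as zero because the pattern is not linear.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Correlation coefficient r for two quantitative variables."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical parabolic scatterplot: y = x squared, symmetric around x = 0.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # zero correlation, even though y depends entirely on x
```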
Will residual plots always show outliers?
(will outliers always have large residuals?)
Usually, but not always. Some points have so much leverage that they pull the line toward themselves, so their residuals end up small…
How can you check if the scatterplot is “straight enough?” for a linear model?
Residuals plot fool!
check the resids
Give example of correlation without causation and explain the lurking variable.
Ski accidents are higher on days with more hot chocolate sales; therefore, hot chocolate must cause ski accidents. (Lurking variable: the number of people on the mountain.) What is happening is that on days when the mountain is crowded, there are more hot chocolate sales and more ski accidents. So the population on the mountain is causing both to rise and fall together.
How do you make a residuals plot? (find RESID?)
After running the regression, go to STAT PLOT and make a scatterplot, but instead of L1 vs L2, change L2 by putting the cursor on it and going to 2nd > LIST, then down to RESID.
You can plot L1 vs RESID
or you can plot L2 vs RESID
What are some strong r values and some weak r values
Strong r values are close to 1 or -1, like -0.83 or 0.94. Weak r values are close to zero like 0.10 or -0.06
What point is on every regression line?
the mean-mean point. (x bar, y bar).
This point is generally not one of the points on the scatterplot.
Usually none of the scatterplot points are on the regression line.
Which is explanatory variable?
the x
the horizontal axis.
it “explains” what happens to y
What do we want to see in a residuals plot in order to continue with the current model?
random scatter. No pattern.
if there is a pattern, then find a new model or proceed with caution.
What is a residual?
Vertical distance to the LSRL (to the model)
ACTUAL-PREDICTED,
A-P, like this class AP (get it?)
y - yhat
Take y data found and from that, subtract the y you get from plugging the x into the model (equation).
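Actual minus predicted, sketched with the study-time model from the earlier cards (predicted score = 40 + 15 (study time)). The two students and their scores are made up for illustration.

```python
def predicted_score(study_hours):
    """Model from the earlier cards: predicted score = 40 + 15 (study time)."""
    return 40 + 15 * study_hours

def residual(study_hours, actual_score):
    """Residual = ACTUAL - PREDICTED (A - P)."""
    return actual_score - predicted_score(study_hours)

# Two hypothetical students who each studied 2 hours (predicted score 70).
print(residual(2, 75))  # positive: the model underpredicted
print(residual(2, 62))  # negative: the model overpredicted
```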
If something is correlated is it associated?
Yes.
If it is correlated then it must be associated.
However, if it is associated, it may not be correlated.
is r sensitive to outliers?
Yes. A single outlier can make it seem like there is a relationship (if it is way out in the x direction), or even make it seem like there is no relationship.
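Here is a rough sketch of that effect on made-up data: a shapeless cloud with r near zero, then the same cloud plus one point far out in the x direction. Running it both ways is also the "with and without the outlier" habit.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Correlation coefficient r for two quantitative variables."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical cloud with no real pattern.
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 3, 6, 4, 5, 4]
r_without = pearson_r(xs, ys)

# Same cloud plus one point far out in the x direction.
r_with = pearson_r(xs + [20], ys + [20])

print(round(r_without, 2), round(r_with, 2))  # weak without, strong with
```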
what is leverage?
Far right or left from the middle.
leverage just means it is far away from x-bar
Some leverage points are not influential if they go along with the flow of the scatter.
Interpret residual:
Points below the line
(negative residual)
“the model overpredicted”
or
“Actual value was below the expected (or predicted)”
Interpret residual:
Points above the line
(positive residual)
“the model underpredicted” or “actual performance was above the expected performance”
If r= 0.8.
An x value that is 2 standard deviations above the mean in the x direction will have a predicted y value that is _______
1.6 standard deviations above the mean in the Y direction
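The rule behind this card is: predicted z-score of y = r times the z-score of x. A one-line sketch (the function name is an illustrative assumption):

```python
def predicted_z_y(r, z_x):
    """Predicted number of standard deviations from the mean in y."""
    return r * z_x

print(predicted_z_y(0.8, 2))  # 1.6 st. dev. above the mean of y
```

Because r is never more than 1 in size, the prediction is always closer to the mean (in standard deviations) than the x value was.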
Does high r squared mean a good model?
CHECK SCATTER FIRST..
Make sure model “FITS” the data.
You should check your scatterplot and residuals plot to make sure model is appropriate and no outliers present… then it means something
So YES, but after you check the resids.
How do you interpret slope?
For an increase of 1 [unit of x] there is an (increase/decrease) of [SLOPE] [units of y].
You can write “SLOPE UNITS Y/ ONE UNITS X” to help
How do you interpret the slope EQUATION?
slope = r (Sy/Sx)
For each increase of 1 st. dev. in the x direction,
you go r st. dev. in the y direction.
2 st. dev. in x, you go 2r st. dev. in y.
3 st. dev. in x, you go 3r st. dev. in y.
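As a check on the slope equation, this sketch computes the least squares slope directly and then again as r times Sy over Sx, on made-up data. The two agree (up to rounding).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: hours studied (x) and test scores (y).
hours = [1, 2, 3, 4, 5, 6]
scores = [60, 58, 80, 85, 110, 105]

x_bar, y_bar = mean(hours), mean(scores)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in scores)

r = sxy / sqrt(sxx * syy)

slope_least_squares = sxy / sxx                   # slope of the LSRL
slope_from_r = r * stdev(scores) / stdev(hours)   # r times Sy over Sx

print(round(slope_least_squares, 4), round(slope_from_r, 4))
```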
what does influential mean?
It impacts the SLOPE.
Influential points usually have leverage.
It means that the point, when added to or removed from the data, will change the SLOPE.
Generally these are outliers in the x direction. Far left or right.
if you switch x and y does r change?
NO. The strength stays the same.
Can you predict an X by using a Y?
NOT WITH THE SAME EQUATION!
BE CAREFUL!! Don’t just solve for x…
You have to change the entire equation and start from scratch.
(run LinReg L2, L1)
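A sketch of why "just solve for x" goes wrong, on made-up data. The second call to the hypothetical `fit` helper plays the role of re-running the regression with the lists swapped. It also shows that the new slope is not simply 1 over the old slope.

```python
from statistics import mean

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5, 6]
scores = [60, 58, 80, 85, 110, 105]

def fit(x, y):
    """Return (intercept, slope) of the least squares line predicting y from x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum(
        (a - x_bar) ** 2 for a in x
    )
    return y_bar - slope * x_bar, slope

a, b = fit(hours, scores)   # predicted score = a + b (hours)
c, d = fit(scores, hours)   # predicted hours = c + d (score), lists swapped

score = 100
solved_for_x = (score - a) / b   # algebra on the first equation: NOT the LSRL
regressed = c + d * score        # prediction from the swapped regression

print(round(solved_for_x, 2), round(regressed, 2))  # the two answers differ
print(round(d, 4), round(1 / b, 4))                 # new slope is not 1/old slope
```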
Interpret r squared
r² (as a percent) of the variability in y can be explained by the model with x. The rest is in the residuals…
If there is a crazy outlier, what can you do?
Run the analysis with and without the outlier and write about both.
how do you interpret y intercept?
The model predicts that if there were no [x stuff] this is how much [y stuff] you’d have
First step in interpreting slope
Write “slope units y over 1 unit x” and look at it.
Then say “for each unit of x there is a change of “slope” units of y”
How do you get equation from computer output?
dependent variable: age
variable    coeff
constant    7.2
Height      3.5
For this case:
predicted age = 7.2 + 3.5 (height)
Under “coeff” go down and left
If you switch x and y will slope change?
YES (but not just reciprocal)
Height and weight have an r value of 0.7. You would expect a person whose height is 2 st. dev. above the mean height to have a weight that is only ___ st. dev. above the mean weight.
only 1.4 S.D above the mean for weight.
(for each SD in the x direction you change r SD in the y direction)
Computer output:
What does “constant” mean?
It is the y intercept
Computer Output:
What is “S”
The average, or typical residual..
Standard deviation of the residuals
typical distance from actual value to the model’s prediction.
About how far off your prediction is likely to be.
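A sketch of where S comes from, assuming the usual formula for simple linear regression: square the residuals, add them up, divide by n minus 2, and take the square root. The data are made up.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: hours studied (x) and test scores (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 68, 85, 101, 114]

x_bar, y_bar = mean(hours), mean(scores)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / sum(
    (x - x_bar) ** 2 for x in hours
)
intercept = y_bar - slope * x_bar

# Residual = actual - predicted for each point.
residuals = [y - (intercept + slope * x) for x, y in zip(hours, scores)]

# S: standard deviation of the residuals, using n - 2 in the denominator.
n = len(hours)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(s, 2))  # typical size of a prediction error, in points
```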
How can you straighten data?
Do stuff to the y (square it, root it, log it, etc) and recheck the plot. Remember to put the transformation into your equation.
Example
Sqrt y = 4.33 - 2.03 x
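To get an actual y prediction out of a transformed model, undo the transformation at the end. A sketch using the example equation above; the function name and the range check are assumptions added for illustration.

```python
def predicted_y(x):
    """Undo the square root in: sqrt(y) = 4.33 - 2.03 x."""
    predicted_sqrt_y = 4.33 - 2.03 * x
    if predicted_sqrt_y < 0:
        # A square root cannot be negative, so the model does not apply here.
        raise ValueError("x is outside the range where this model makes sense")
    return predicted_sqrt_y ** 2  # square it to get back to y

print(round(predicted_y(1), 2))  # (4.33 - 2.03)^2, about 5.29
```

Forgetting the last step (squaring) is the usual mistake: 2.30 is the predicted square root of y, not the predicted y.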
If you add to, multiply, or divide the x’s or y’s (shift/scale), does r change?
No. The strength remains the same. (If you log or square it, r will change, but just adding or multiplying won’t change it. Multiplying by a negative number flips the sign of r, but not the strength.)
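A quick sketch of shifting and scaling on made-up height and weight data: converting inches to centimeters and subtracting 100 from every weight leaves r alone, while multiplying by a negative number flips its sign.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Correlation coefficient r for two quantitative variables."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds).
heights_in = [60, 62, 65, 68, 71]
weights_lb = [115, 120, 140, 155, 170]

r_original = pearson_r(heights_in, weights_lb)

heights_cm = [h * 2.54 for h in heights_in]       # scale the x's
weights_shifted = [w - 100 for w in weights_lb]   # shift the y's
r_rescaled = pearson_r(heights_cm, weights_shifted)

r_flipped = pearson_r(heights_in, [-w for w in weights_lb])  # negative multiplier

print(round(r_original, 4), round(r_rescaled, 4), round(r_flipped, 4))
```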
What other regressions does your calculator do?
Quadreg, cubicreg, lnreg, etc.
just be careful when substituting while writing the equation given.
How do you get equation from computer output?
dependent variable: doc
variable    coeff
constant    0.005
genet       -0.233
predicted doc = 0.005 - 0.233 (genet)