Lecture 19 - Regression: straight lines (and beyond)
Beyond correlation
- The lectures on correlation examined how to ask “is variable 1 related to variable 2?”.
- Does exam grade relate to the amount of practice?
- Are height and weight related?
- Correlation is great as far as it goes, but if there is a relationship between variable 1 and variable 2, then we may well want to know WHAT that relationship is.
- And once we know that, we can use it for predictions.
- What will my exam grade be if I do 20hrs practice? What will it be if I do 40hrs practice?
Regression is the answer to these sorts of questions.
Correlation: how “good” is the best fit line?
Regression: what is the best fit line?
A clearly stronger correlation would be reflected in a larger value for “r” (i.e. the correlations differ).
BUT – the best fit line is the same. Regression will give the same line for both.
Equation of a line
- The equation for a line relating “x” and “y” is: y = mx + c
- Or, y = b0 + b1x (or indeed many other notations).
- y = mx + c is the “typical” notation when linear algebra is taught at school; y = b0 + b1x is often the notation used in the context of regression (more on this later).
- The key components are:
- y – the thing being “predicted” (i.e. the “outcome”).
- x – the thing doing the “prediction” (i.e. the “predictor”).
- c (or b0) – the “constant” or “intercept”.
- That is, the predicted value of y when x = 0.
- m (or b1) – the “slope” of the line
- That is, a measure of how much y changes with a change in x: slope = (change in y) / (change in x).
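To make the notation concrete, here is a minimal Python sketch (not from the lecture – the coefficient values are invented for illustration):

```python
def predict(x, intercept, slope):
    """Predict y from x using the line y = b0 + b1*x."""
    return intercept + slope * x

# Hypothetical coefficients: exam grade = 20 + 1.5 * hours of practice
b0, b1 = 20.0, 1.5
print(predict(20, b0, b1))  # predicted grade after 20 hrs practice: 50.0
print(predict(40, b0, b1))  # predicted grade after 40 hrs practice: 80.0
```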
Illustrating intercept and slope
c (or b0) – the intercept – can in principle be any number.
m (or b1) – the slope – can also be any number.
Positive slopes mean y and x are positively related (as x gets bigger, so does y); negative slopes mean they are negatively related (as x gets bigger y gets smaller).
Regression and straight lines
- In the context of straight lines, regression (like correlation) requires data in “pairs”: one variable as the predictor and the other as the outcome.
- Note, we can actually do regression with more than one predictor, or with “curved” lines. But we will get there later (very briefly).
- For any such set of paired data, we can draw “a” best fit line.
- It may not actually be a very good fit at all (i.e. when the correlation is weak, the best fit line is not actually a good fit) – but regardless, there will still be a best fit line (or perhaps a “least bad” fit line).
- Regression is the method to work out what that line is.
That is, working out what the slope and intercept are (also known as the “coefficients” of the line).
What is regression doing?
- Plotting x (horizontal axis) vs y (vertical axis).
- Also shows regression line relating x to y, and the “error” for each point.
The maths (which is not part of this course) behind regression will work out the line that minimises these errors (also known as “residuals”).
So actually, the joke in the last slide was true – regression really is about finding the “least bad” fit line (i.e. the line with the smallest total error).
Regression minimises the sum of squared residuals.
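As an illustrative sketch (the maths is not part of the course, and the data here are invented), the single-predictor least-squares line has a closed-form solution:

```python
import numpy as np

# Invented paired data: hours of practice (x) and exam grade (y)
x = np.array([5, 10, 15, 20, 25, 30, 35, 40], dtype=float)
y = np.array([35, 42, 48, 55, 58, 66, 70, 78], dtype=float)

# Closed-form least-squares estimates for the line y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(b0, b1)                  # intercept and slope of the best fit line
print(np.sum(residuals ** 2))  # the minimised sum of squared residuals
```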
More of what regression is doing
- The previous slide showed the “error” between points and the regression line. The sum of these squared errors = SSR (aka Sum of Squares – Residuals): the error between the data points and the regression line.
- Can also work out SST (aka Total Sum of Squares). That is, the errors between the observed data and the mean of y (so essentially the total error if not using x as a predictor at all – the mean is the best predictor of y with no other information).
- And finally, can work out SSM (aka Sum of Squares – Model; or Sum of Squares – Regression). That is, the error between the mean of y and the regression line.
Why? Because all these Sums of Squares are used in working out what the regression line is, and how “good” it is.
Again, you don’t need to know the maths, but it is helpful to understand some of what is going on behind the scenes.
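Continuing the sketch above (same invented x, y and fitted b0, b1), the three sums of squares can be computed directly, and the partition SST = SSM + SSR checked numerically:

```python
y_hat = b0 + b1 * x  # predictions from the fitted regression line

SSR = np.sum((y - y_hat) ** 2)         # residual: data points vs line
SST = np.sum((y - y.mean()) ** 2)      # total: data points vs mean of y
SSM = np.sum((y_hat - y.mean()) ** 2)  # model: line vs mean of y

print(SSR, SST, SSM)
print(np.isclose(SST, SSM + SSR))  # True: total error partitions cleanly
```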
example in notes
SPSS output
images in notes
- SPSS first gives correlation – here clearly a significant positive correlation (consistent with impression from scatter plot).
- Gives the 1-tailed significance value (not sure why…).
- Next, “model summary”: key parts are a repeat of the correlation coefficient (R), and R squared (the proportion of the total variance explained by the “model”).
- As an aside, R2 = SSM/SST.
- R2 is also the square of the correlation coefficient.
The “model” is the regression line – so R squared is the proportion of the variance explained by the line.
- Yes, regression does produce an F-ratio like ANOVA (and “significance” is, as usual, the probability of the observed data given the assumption of the null hypothesis – here that there is no “line”).
The F ratio is the Mean Square for the model (i.e. the regression line) over the Mean Square for the residuals (i.e. for the errors between the data points and the regression line).
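Continuing the same sketch, R2 and the F ratio follow directly from the sums of squares (with 1 degree of freedom for a single-predictor model and n - 2 for the residuals):

```python
n = len(x)

R2 = SSM / SST            # proportion of variance explained by the line
MS_model = SSM / 1        # one predictor -> 1 degree of freedom
MS_resid = SSR / (n - 2)  # n - 2 residual degrees of freedom
F = MS_model / MS_resid

print(R2)  # also equals the squared correlation coefficient
print(F)   # compared against an F distribution for the significance test
```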
The actual line
- Finally – SPSS gets to the “coefficients” (i.e. slope & intercept of the regression line). SPSS reports these as “Beta”s (hence the b0 and b1 notation from earlier).
- Constant = intercept. Next is the slope – here called “practice” because that is the predictor variable.
Also assesses the significance of the coefficients individually (note, these can differ from the significance of the overall model – especially when we move beyond a single predictor).
The slope row is simply the one that is not the constant – it is labelled with the name of the predictor variable.
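SPSS is the package used in the course, but the same coefficients and tests can be reproduced in Python with statsmodels – a sketch, reusing the invented x and y from the earlier blocks:

```python
import statsmodels.api as sm

X = sm.add_constant(x)      # adds the constant (intercept) column
model = sm.OLS(y, X).fit()  # ordinary least squares: y predicted from x

print(model.params)    # b0 (constant/intercept) and b1 (slope)
print(model.rsquared)  # R squared
print(model.fvalue)    # F ratio for the overall model
print(model.pvalues)   # significance of each coefficient individually
```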
example in notes
What line?
- Correlation of X with Y = correlation of Y with X.
- Indeed, these are simply two ways of expressing “correlation between X and Y”.
- But for regression, it does matter which variable is the predictor and which is the outcome.
- Regression line predicting Y with X IS NOT the same as the line predicting X with Y.
The amount of variance explained (R2) and the overall significance of the regression model will be the same when predicting X from Y or Y from X – but the model (i.e. line) will not be the same in both cases (unless R = 1).
- If (using the same data as previously) we use exam score to predict practice, we find Slope = 0.413 & Intercept = -7.80.
- Now, often (like here) it only makes sense if one variable is the outcome.
Would be weird to “predict” practice from exam score given practice happens first.
BUT, mathematically, no principled difference between predicting X from Y vs Y from X.
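A quick sketch of this asymmetry (invented data, so the numbers will not match the 0.413 / -7.80 above):

```python
import numpy as np

rng = np.random.default_rng(0)
practice = rng.uniform(5, 40, size=50)
exam = 20 + 1.5 * practice + rng.normal(0, 8, size=50)  # noisy linear relation

# np.polyfit with degree 1 fits a straight line; returns (slope, intercept)
slope_yx, intercept_yx = np.polyfit(practice, exam, 1)  # exam from practice
slope_xy, intercept_xy = np.polyfit(exam, practice, 1)  # practice from exam

# The second line is NOT the first line rearranged: rearranging
# exam = a + b*practice would give slope 1/b, but regression gives r**2/b.
print(slope_xy, 1 / slope_yx)  # differ unless the correlation is perfect
```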
Briefly beyond a line
- Everything up to this point assumes a single predictor variable. But linear regression works perfectly well with multiple predictor variables (indeed, it uses the same general methods).
- The current lecture course does not assume knowledge of linear regression with multiple predictors, but it is useful to know such things exist (the textbook covers this).
- Non-linear regression is also possible – can fit lines that are not straight.
- But non-linear regression does require very different methods to linear regression (and the textbook does not even cover it).
Again, the course does not assume knowledge of non-linear regression, but it is useful to know such things exist.
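As a taste of what those very different methods look like (well beyond this course), scipy’s curve_fit fits an arbitrary curve by iterative optimisation rather than a closed-form formula – an invented example:

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential(x, a, b):
    """A curved model: y = a * exp(b * x)."""
    return a * np.exp(b * x)

# Invented data following a curve rather than a straight line
x = np.linspace(0, 4, 30)
y = 2.0 * np.exp(0.8 * x) + np.random.default_rng(1).normal(0, 1, 30)

params, _ = curve_fit(exponential, x, y, p0=[1.0, 0.5])
print(params)  # estimated a and b, found by iterative least squares
```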
Limits and assumptions of linear regression
- Only “works” for a straight line – but linear regression will always give the coefficients for “a” line.
- Need to consider other ways to assess whether a linear regression is appropriate/informative (e.g. a scatter plot to visualise the relationship between the variables, examining the pattern of residuals, examining the significance of the model).
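A minimal sketch of one such check – plotting residuals against the predictor (matplotlib assumed available); any clear pattern here suggests a straight line is the wrong model:

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented curved data, to which we (wrongly) fit a straight line
x = np.linspace(0, 10, 40)
y = (x - 5) ** 2 + np.random.default_rng(2).normal(0, 1, 40)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

plt.scatter(x, residuals)
plt.axhline(0, color="grey")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()  # a U-shaped pattern flags a non-linear relationship
```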
- Regression makes essentially the same assumptions as correlation.
The textbook goes into a lot of detail about the assumptions and how to test them. But for the moment this is “for information” – you don’t need it for the current course.
Example - liking and similarity to humans
- Does similarity to human form predict liking ratings?
“Collected” (i.e. made up) liking ratings of images that varied from not at all human (e.g. robots like R2D2), to actual humans, via a number of intermediate steps (e.g. robots like C3P0, or AI generated images of people).
image in notes
- Regression line exists – but clearly does not “fit” the data well – relationship appears non-linear.
- Data was made up – but reflects a real effect – the “uncanny valley”.
Describes dislike or discomfort experienced when seeing things that are “almost human”.
More generally – ALWAYS visualise the data and see if it fits the analysis you are considering.