Lecture 36 - Correlation Flashcards
What does the correlation coefficient (r) summarize?
The strength of a linear relationship between variables as well as the direction of this relationship
How do you interpret r, i.e. what do different values mean?
- r is always between -1 and +1
- A positive r value means Y tends to increase as X increases
- A negative r value means Y tends to decrease as X increases (and vice versa: whatever one variable does, the other tends to do the opposite).
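A minimal sketch of how the sign of r behaves, using small made-up vectors (the data and the names y_up and y_down are hypothetical, not from the lecture slides):

```r
# Hypothetical data, made up purely for illustration
x <- c(1, 2, 3, 4, 5)
y_up <- c(2, 4, 5, 4, 6)     # tends to rise as x rises
y_down <- c(9, 7, 6, 4, 2)   # tends to fall as x rises

r_up <- cor(x, y_up)         # positive: the variables move together
r_down <- cor(x, y_down)     # negative: the variables move in opposite directions

print(r_up)
print(r_down)
```

Both values should land inside the -1 to +1 range, with r_down close to -1 because those points lie almost exactly on a falling line.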
What does r=0 mean?
There is no linear relationship between the variables
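One way to see that r = 0 rules out only a linear relationship is a perfectly curved one. The sketch below uses made-up data where y is completely determined by x, yet the correlation comes out as zero:

```r
# Hypothetical data: y depends on x exactly, but along a symmetric curve
x <- c(-2, -1, 0, 1, 2)
y <- x^2

r <- cor(x, y)   # effectively 0: no linear trend, despite an exact relationship
print(r)
```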
What does a strong/weak relationship look like visually?
- Weak= more scatter
- Strong= points clustered tightly around the line of best fit
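The difference in scatter can be sketched with simulated data. Everything here is an assumption for illustration: the underlying line y = 2x, the noise levels, and the seed are arbitrary choices, not values from the lecture.

```r
set.seed(1)                               # arbitrary seed, for reproducibility
x <- 1:100

# Same underlying line, different amounts of random scatter around it
y_strong <- 2 * x + rnorm(100, sd = 5)    # little scatter
y_weak <- 2 * x + rnorm(100, sd = 100)    # lots of scatter

r_strong <- cor(x, y_strong)
r_weak <- cor(x, y_weak)
print(c(strong = r_strong, weak = r_weak))

# Uncomment to compare the two scatterplots side by side:
# par(mfrow = c(1, 2))
# plot(x, y_strong, main = "Strong")
# plot(x, y_weak, main = "Weak")
```

The tightly clustered data should give an r much closer to 1 than the heavily scattered data.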
Calculate the correlation coefficient using the data on slide 694 and the equation found here (you don't need to memorize it).
Answers on slide
What function in R calculates the correlation coefficient?
cor(x,y)
How do you set up data in R?
x=c(data)
y=c(data)
Note: "data" stands for a comma-separated list of values; you can assign with = or with a backwards arrow (<-)
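Putting the two cards above together, a short sketch with made-up measurements (the numbers are hypothetical) shows both assignment styles followed by the call to cor():

```r
# Hypothetical measurements, made up for illustration
x <- c(1.2, 2.3, 3.1, 4.8, 5.0)   # backwards-arrow assignment
y = c(2.0, 2.9, 4.2, 5.1, 6.3)    # = does the same job here

# cor() needs the two vectors to have the same length
stopifnot(length(x) == length(y))

r <- cor(x, y)
print(r)
```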
What is S_xy (S with subscript xy)?
The sample covariance between x and y
What can the correlation coefficient ‘r’ be rewritten as?
r = S_xy / (S_x × S_y)
Note: S_x and S_y are the sample standard deviations of the x and y variables
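The covariance form of r can be checked numerically. This sketch assumes made-up data and uses R's built-in cov() and sd(), which compute the sample covariance and sample standard deviation:

```r
# Hypothetical data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

s_xy <- cov(x, y)   # sample covariance between x and y
s_x <- sd(x)        # sample standard deviation of x
s_y <- sd(y)        # sample standard deviation of y

r_manual <- s_xy / (s_x * s_y)

# Should agree with the built-in function up to rounding error
print(all.equal(r_manual, cor(x, y)))
```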
Can a correlation coefficient be used for prediction? Why or why not?
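No, because it's not a model: it only summarizes the strength and direction of the linear relationship and gives no equation for computing Y from X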
What is meant by the statement that the correlation coefficient is symmetric in variables?
Correlation between x and y is the same as correlation between y and x
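A quick check of the symmetry, again on made-up data. For contrast, the sketch also fits a regression in each direction with lm(); that comparison is an addition for illustration, showing that regression slopes, unlike r, do depend on which variable is treated as the response.

```r
# Hypothetical data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

r_xy <- cor(x, y)
r_yx <- cor(y, x)
print(all.equal(r_xy, r_yx))   # same value either way round

# Regression is not symmetric: the slope changes when the roles swap
slope_y_on_x <- unname(coef(lm(y ~ x))[2])
slope_x_on_y <- unname(coef(lm(x ~ y))[2])
print(c(slope_y_on_x, slope_x_on_y))
```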
What is R^2?
- The coefficient of determination: how well our regression model describes the data
- Is the squared correlation between the observed and predicted responses
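The second definition can be verified directly. This sketch assumes a simple linear regression fitted with lm() on made-up data, and compares the R^2 reported by summary() with the squared correlation between the observed and fitted responses:

```r
# Hypothetical data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

fit <- lm(y ~ x)                     # simple linear regression of y on x

r2_model <- summary(fit)$r.squared   # R^2 as reported by R
r2_corr <- cor(y, fitted(fit))^2     # squared correlation: observed vs predicted

print(c(r2_model, r2_corr))          # the two should match
```

With a single predictor, both values also equal cor(x, y)^2.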
How do you interpret R^2, i.e. what do the numbers mean?
- Close to 1= regression model describes the data well
- Low value (close to 0) indicates a regression that describes the data poorly
(R^2 can only be between 0 and 1; there is no such thing as a negative R^2 value, because squaring by nature removes negative signs)
What does the total sum of squares describe in contrast to R^2?
- Total sum of squares (TSS)= overall variation in the response variable
- R^2 is instead the proportion of variation in the response that is explained by the predictor variable i.e. how good our model is
What is the residual sum of squares (RSS)?
- The total variation of the data points about the regression line, i.e. how far our measured y values are from the predictions of our fitted model
- In other words RSS is the variation not explained by the regression model
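TSS, RSS and R^2 fit together as follows. This is a sketch on made-up data that assumes an ordinary least-squares fit with an intercept, in which case the explained proportion is 1 - RSS/TSS:

```r
# Hypothetical data, made up for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)    # overall variation in the response
rss <- sum(residuals(fit)^2)   # variation left unexplained by the model

r2 <- 1 - rss / tss            # proportion of variation explained
print(c(TSS = tss, RSS = rss, R2 = r2))
```

The r2 computed this way should match summary(fit)$r.squared.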