Correlation, Linear Regression Flashcards
What is the first thing you should do to look for associations between continuous variables?
Produce a scatter plot
When do you perform a linear regression?
When it is clear that y may be affected by x, and x could not be affected by y
When do you perform a correlation analysis?
When it is unclear which variable affects the other
What do both types of analysis look for?
Linear relationship
If the relationship appears to be non-linear, the data must be transformed first
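As a rough illustration of transforming data, the sketch below uses made-up values that follow an exponential curve (the numbers and the choice of a log transform are assumptions for the example, not part of the original cards). Taking the log of y turns the curve into a straight line.

```python
import math

# Hypothetical data following y = 3 * exp(0.8 * x): curved on the raw scale.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * math.exp(0.8 * x) for x in xs]

# Transform y with the natural log: log(y) = log(3) + 0.8 * x is a straight line.
log_ys = [math.log(y) for y in ys]

# For equally spaced x, the step between neighbouring y values is constant
# only when the relationship is linear.
raw_steps = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
log_steps = [log_ys[i + 1] - log_ys[i] for i in range(len(log_ys) - 1)]

print(raw_steps)  # steps keep growing: non-linear
print(log_steps)  # steps all close to 0.8: linear after the transform
```

Which transformation is suitable depends on the shape seen in the scatter plot; the log is only one common choice.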
What are the assumptions of Pearson’s Correlation Test?
Correlation analysis:
Parametric test: assumes a normal distribution
Assumptions:
1) The relationship (if there is one) is linear
2) There is no clear direction of causation between X and Y; we only want to see whether they are associated
3) The (x,y) pairs come from a bivariate normal distribution
4) The data are continuous
If X and Y are independent, there is no association and the correlation is 0
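What is the test statistic for Pearson’s Correlation Coefficient? What are the null and alternative hypotheses?
Test statistic: r, the sample estimate of the population correlation coefficient ρ
Null hypothesis: ρ = 0 (no linear association)
Alternative hypothesis: ρ is different from 0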
What are the degrees of freedom for Pearson’s Correlation Test?
n-2, where n is the number of (x,y) pairs
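A minimal sketch of the calculation, using made-up data. The function names are hypothetical, and the conversion of r to a t value with n-2 degrees of freedom is the standard textbook formula, assumed here rather than taken from the cards.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t value for testing rho = 0; compare with a t distribution on n - 2 df.
    Undefined when r is exactly 1 or -1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical (x, y) pairs
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

r = pearson_r(xs, ys)
df = len(xs) - 2  # n - 2 degrees of freedom
print(r, t_statistic(r, len(xs)), df)
```

The t value would then be compared with a critical value from t tables for the chosen significance level.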
What is the non-parametric version of the Pearson Correlation Test?
How do you carry out this test?
Spearman Rank Correlation Test: uses the ranks of the observations rather than the observations themselves
Null hypothesis: No association between X and Y
Test statistic: rs = 1 - (6 Σ di²) / (n(n² - 1))
where n is the number of pairs and di is the difference between the ranks of x and y for each (x,y) pair
The higher the correlation between X and Y, the closer the ranks of x and y will be within each pair, so the di are smaller and rs is higher
Degrees of freedom: n-2
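The rank-based calculation can be sketched as follows, assuming the usual formula for data without ties, rs = 1 - 6Σdi²/(n(n² - 1)). The helper names and the data are hypothetical, and tied values would need averaged ranks, which this sketch does not handle.

```python
def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rs(xs, ys):
    """Spearman rank correlation from the squared rank differences d_i."""
    n = len(xs)
    rank_x = ranks(xs)
    rank_y = ranks(ys)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical (x, y) pairs
xs = [10, 20, 30, 40, 50]
ys = [1.2, 0.9, 3.5, 3.1, 8.0]

print(ranks(ys))            # [2, 1, 4, 3, 5]
print(spearman_rs(xs, ys))  # sum of d_i squared is 4, so rs = 1 - 24/120
```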
What are the assumptions for linear regression?
1) Variation in y is normally distributed about its mean at each x (the fitted line)
2) Variation in y is equal for all x
3) X-values are measured without error
What are the null and alternative hypotheses for linear regression?
Null: No association between X and Y
Alternative: Knowing X tells us something about Y
How do you carry out linear regression?
Sum of squares:
Find the mean of Y
Find the difference between each Y value and the mean
Square each difference and add them together to give SStotal
Fit a straight line y = a + bx through the data, where a is the y-intercept and b is the slope of the line
Using the estimated a and b, calculate the predicted y for each x value (the estimated points)
Find the difference between each estimated point and the actual value, square it, and add them all together to give SSresidual
The best-fitting line is the one with the smallest SSresidual
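The sum-of-squares steps above can be sketched as below. The data are made up, the function name is hypothetical, and the closed-form expressions for a and b are the standard least-squares estimates, assumed here because the cards do not say how the line is estimated.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

# SStotal: squared differences between each y and the mean of y, summed
mean_y = sum(ys) / len(ys)
ss_total = sum((y - mean_y) ** 2 for y in ys)

# Estimated points from the fitted line
a, b = least_squares(xs, ys)
fitted = [a + b * x for x in xs]

# SSresidual: squared differences between actual and estimated y, summed
ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))

print(a, b, ss_total, ss_residual)
```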
What is R squared (R2)?
It is the coefficient of determination
Indicates how much variation (measured as SS) can be explained by assuming a linear relationship with X
SSregression = SStotal - SSresidual
R2 = SSregression / SStotal
Larger values of R2 mean the line describes the data better, i.e. the relationship is closer to linear
R2 = 1 implies that all the data lie on a line with non-zero slope
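A short sketch of R2 computed from the sums of squares. The function name is hypothetical, and the fitted values are made-up numbers standing in for the estimated points from some fitted line.

```python
def r_squared(ys, fitted):
    """Coefficient of determination: SSregression / SStotal."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_regression = ss_total - ss_residual
    return ss_regression / ss_total

# Hypothetical observed values and estimated points from a fitted line
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
fitted = [2.06, 4.03, 6.00, 7.97, 9.94]

print(r_squared(ys, fitted))  # close to 1: the line explains most of the variation
print(r_squared(ys, ys))      # exactly 1.0: every point lies on the line
```

If the fitted values were all equal to the mean of y (a flat line), SSresidual would equal SStotal and R2 would be 0.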