lecture 14 - correlation - additional considerations and ordinal data Flashcards
Pearson’s r correlation coefficient
- This asks, how strongly related are two continuous variables measuring interval (or ratio) data?
- Can take values from:
- -1 (perfect negative relationship),
- through 0 (no relationship),
- to +1 (perfect positive relationship),
- and all values in between.
r = COVxy/ SxSy
understanding covariance - scatterplots
COVxy = ∑(X - X̄)(Y - Ȳ) / (N - 1)
X̄ and Ȳ are the means of each variable
graphs in notes
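top right quadrant = positive numbers
X - X̄ is positive (right of X̄) and Y - Ȳ is positive (above Ȳ), so their product is positive
bottom left quadrant = positive numbers
X - X̄ is negative (left of X̄) and Y - Ȳ is negative (below Ȳ), so their product is positive
top left quadrant = negative numbers
X - X̄ is negative (left of X̄) and Y - Ȳ is positive (above Ȳ), so their product is negative
bottom right quadrant = negative numbers
X - X̄ is positive (right of X̄) and Y - Ȳ is negative (below Ȳ), so their product is negative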
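A minimal sketch of the covariance calculation, showing how each point's quadrant sets the sign of its contribution. The five data points are made up for illustration and are not from the lecture.

```python
def covariance_with_products(xs, ys):
    """Return the sample covariance and each point's (X - mean_x)(Y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1), products

# Hypothetical data: both means are 3
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 3, 2, 5]

cov_xy, products = covariance_with_products(xs, ys)

for x, y, p in zip(xs, ys, products):
    # Points in the top-right or bottom-left quadrant give positive products,
    # points in the top-left or bottom-right quadrant give negative products.
    print(f"({x}, {y}): product = {p:+.1f}")

print("COVxy =", cov_xy)
```

Here (2, 4) sits in the top left quadrant and (4, 2) in the bottom right, so both pull the covariance down, while (1, 1) and (5, 5) pull it up.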
normalising covariance
r = COVxy/ SxSy
the larger Sx, the larger COVxy tends to be (in absolute size)
the larger Sy, the larger COVxy tends to be (in absolute size)
COVxy = ∑(X - X̄)(Y - Ȳ) / (N - 1)
Sy = √( ∑(Y - Ȳ)² / (N - 1) ), and likewise for Sx
so if we divide COVxy by both Sx and Sy,
we will ‘normalize’ its value.
after normalisation, its value cannot exceed +1
and cannot be smaller than -1
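A rough sketch of why the division matters, using made-up numbers (not lecture data): multiplying X by 10, as if changing its units, multiplies the covariance by 10 but leaves r unchanged.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Standard deviation with N - 1 in the denominator
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def sample_cov(xs, ys):
    # Covariance with N - 1 in the denominator
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    # r = COVxy / (Sx * Sy)
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 3, 2, 5]
xs_scaled = [10 * x for x in xs]  # same data, different units

cov = sample_cov(xs, ys)
cov_scaled = sample_cov(xs_scaled, ys)
r = pearson_r(xs, ys)
r_scaled = pearson_r(xs_scaled, ys)

print("covariance:", cov, "->", cov_scaled)
print("r:", round(r, 3), "->", round(r_scaled, 3))
```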
Relationship effect sizes
Pearson’s r is already an effect size!
Cohen’s rules of thumb for the effect size of r:
r = 0.1 small effect size
r = 0.3 medium effect size
r = 0.5 large effect size.
More nuanced research conclusion: the correlation between tulip and rose ratings was very large and positive, r = 0.803, and was marginally significantly different from zero, r(4) = 0.803, p = 0.054, two-tailed.
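The p-value in this conclusion can be checked approximately. The sketch below assumes the usual conversion t = r√(df / (1 - r²)) with df = N - 2 (so r(4) means N = 6 pairs); the notes do not say how the lecture obtained p. The area under the t distribution is found by simple numerical integration so that only the standard library is needed.

```python
import math

def t_from_r(r, df):
    # Standard conversion of a correlation to a t statistic (assumed here)
    return r * math.sqrt(df / (1 - r ** 2))

def t_pdf(x, df):
    # Density of Student's t distribution with df degrees of freedom
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    # Simpson's rule for the area between 0 and |t| (steps must be even);
    # the two tails hold whatever is left outside -|t| .. +|t|.
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    return 1 - 2 * central_area

t = t_from_r(0.803, 4)
p = two_tailed_p(t, 4)
print(round(t, 2), round(p, 3))  # about 2.69 and 0.054
```

With so few pairs, even a very large r only just misses the conventional 0.05 cut-off, which is why the effect size and the significance test are reported together.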
Assumptions for Pearson’s r correlation coefficient - a parametric test
- Random and independent samples
- Normality (of the residuals….) with interval or ratio data
This normality is more complicated here for reasons you’ll understand better later when you’ll learn about the General Linear Model.
For now, looking to see if the distributions of the variables (histograms) are reasonably normal is still a reasonable thing to do…
- The variables have a linear (straight line) relationship
An additional worry: Look for outliers in the scatterplot and histograms….
Because Pearson’s r is parametric (involves means, etc.), outliers can have a strong influence on the outcome of the statistical test.
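A small illustration of how much one outlier can move r. The numbers are invented for the example: five points with a weak negative relationship, plus one extreme point.

```python
import math

def pearson_r(xs, ys):
    # r = COVxy / (Sx * Sy); the (N - 1) terms cancel, so sums of
    # deviations are enough.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 2, 3, 1]

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [15], ys + [15])  # add a single outlier at (15, 15)

print("without outlier:", round(r_without, 2))  # about -0.32
print("with outlier:   ", round(r_with, 2))     # about 0.93
```

One point is enough to flip a weak negative correlation into a very large positive one, which is why the scatterplot needs checking first.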
Practical considerations: It’s hard to assess normality with this little data. Also, what if the data is only ordinal and not interval?
wavy lines
- Pearson’s r correlation coefficient assumes a linear (straight line) relationship between the variables.
- BUT, there are many other possible relationships between the variables that aren’t linear.
- You can see this if you’ve plotted a scatterplot ☺,
but if you’ve only calculated r you can’t ☹
an epidemic initially grows exponentially
however, in the longer term the disease dies out again
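A sketch of the rise-and-fall case, with invented case counts that climb and then drop symmetrically (a simplification, not real epidemic data). The two variables are clearly related, yet r comes out at essentially zero.

```python
import math

def pearson_r(xs, ys):
    # r = COVxy / (Sx * Sy); the (N - 1) terms cancel
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: cases rise to a peak at week 3, then fall again
weeks = [0, 1, 2, 3, 4, 5, 6]
cases = [1, 4, 9, 12, 9, 4, 1]

r = pearson_r(weeks, cases)
print("r is essentially zero:", abs(r) < 1e-9)
```

The positive products from the rising half cancel the negative products from the falling half, so r alone would wrongly suggest no relationship.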
what if the data is ordinal rather than interval? Spearman’s rho
rank each variable separately.
calculate the means and SDs of the ranks.
work out COVxy of the ranks.
Spearman’s rho for ordinal data
rho = COVxy/ SxSy (calculated on the ranks)
with rho, the uncertainty about the data being interval or ordinal doesn’t matter in practice
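The steps for rho can be sketched as: rank each variable, then run the Pearson formula on the ranks. Giving tied values the average of their ranks is an assumption here (a common convention, but ties are not covered in these notes), and the data are made up.

```python
import math

def ranks(values):
    """Rank values from 1 upward; tied values share the average rank (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson_r(xs, ys):
    # r = COVxy / (Sx * Sy); the (N - 1) terms cancel
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    # Pearson's formula applied to the ranks
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: Y always increases with X, but not in a straight line
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print("r   =", round(r, 3))
print("rho =", round(rho, 3))
```

Because the ranks of X and Y agree perfectly, rho is 1 even though r falls short of 1 for the curved relationship.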
Note. We haven’t given you a table of critical values for Spearman’s rho…..
Practically when should you consider using Spearman’s rho?
When…
… you aren’t sure whether the data is interval or ordinal….
… when you’re unsure if the relationship is linear….
… when you’re uncertain about the normality assumption for Pearson’s r….
… when the data potentially has outliers….
There’s little downside to looking at rho almost any time you look at r….
If they’re similar….. Good ☺
If they’re not …. think hard…. Talk to Field, etc.