Correlation Flashcards
Correlation
Correlation is not causation
Correlation coefficient = measures the strength and direction of the linear association between two numerical variables (reflects the amount of scatter / variation in the association – it does NOT “fit a line” to the data)
Simpson's paradox
A correlation that appears within separate groups of data can disappear or reverse when the groups are combined
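A minimal numeric sketch (made-up numbers, using numpy) of how this can happen – each group has a perfect positive correlation, but pooling the groups gives a negative one:

```python
import numpy as np

# Two made-up groups, each with a perfect positive x-y correlation,
# but group B sits higher on y and lower on x than group A,
# so the pooled correlation comes out negative.
group_a_x = np.array([1, 2, 3, 4, 5])
group_a_y = np.array([2, 3, 4, 5, 6])        # r = +1 within group A
group_b_x = np.array([-2, -1, 0, 1, 2])
group_b_y = np.array([8, 9, 10, 11, 12])     # r = +1 within group B

x = np.concatenate([group_a_x, group_b_x])
y = np.concatenate([group_a_y, group_b_y])

print(np.corrcoef(group_a_x, group_a_y)[0, 1])  # +1.0
print(np.corrcoef(group_b_x, group_b_y)[0, 1])  # +1.0
print(np.corrcoef(x, y)[0, 1])                  # negative: the correlation reverses when pooled
```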
Calculating the correlation coefficient and CIs
1) Calculate the correlation coefficient (r for the sample, ρ (rho) for the population) using the equation -> an SE can also be calculated for r
Can't use this SE to estimate CIs: the sampling distribution of r is NOT normal because the correlation coefficient is bounded between -1 and 1
2) Use Fisher's z transformation to calculate z (making the sampling distribution approximately normal)
3) Calculate the SE of z
4) Use the SE to calculate the 95% CI on the z scale:
z - 1.96 SE_z < ζ (transformed population correlation) < z + 1.96 SE_z
5) Back-transform z and the CI limits to the r scale to see where the CI actually lies in the data
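A minimal sketch of steps 1–5 in Python (the function name is illustrative; it assumes the Pearson r from np.corrcoef and the standard approximation SE_z = 1 / sqrt(n - 3)):

```python
import numpy as np
from scipy import stats

def fisher_ci(x, y, alpha=0.05):
    """CI for a correlation coefficient via Fisher's z transformation (sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)

    # 1) sample correlation coefficient r
    r = np.corrcoef(x, y)[0, 1]

    # 2) Fisher's z transformation: z = 0.5 * ln((1 + r) / (1 - r)) = arctanh(r)
    z = np.arctanh(r)

    # 3) SE of z (depends only on sample size)
    se_z = 1.0 / np.sqrt(n - 3)

    # 4) CI on the z scale (critical value is 1.96 for a 95% CI)
    crit = stats.norm.ppf(1 - alpha / 2)
    z_lo, z_hi = z - crit * se_z, z + crit * se_z

    # 5) back-transform the limits to the r scale (tanh is the inverse of arctanh)
    return r, np.tanh(z_lo), np.tanh(z_hi)

# toy usage with made-up data
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 0.6 * x + rng.normal(scale=0.8, size=30)
print(fisher_ci(x, y))
```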
Bootstrap method
Repeatedly draw samples of x-y pairs with replacement from the data to create a sampling distribution for r; take the CI from the percentiles of that distribution
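A sketch of a percentile bootstrap CI for r (the function name and the choice of 2000 resamples are illustrative; x and y are resampled as pairs):

```python
import numpy as np

def bootstrap_r_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the correlation coefficient (sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n = len(x)
    boot_r = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)              # draw x-y pairs with replacement
        boot_r[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.percentile(boot_r, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi                                     # CI from the bootstrap distribution's percentiles
```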
Hypothesis testing: Is this correlation significantly different from 0?
Null hypothesis H0: ρ (rho) = 0
Alternative hypothesis HA: ρ ≠ 0
Test using a t test with Student's t distribution and n - 2 degrees of freedom -> n - 2 df because two summaries of the data, X-bar and Y-bar, are used to calculate r
t = r / SE_r
SE_r = sqrt((1 - r^2) / (n - 2))
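A sketch of this test in Python (the function name is illustrative; scipy.stats.pearsonr should give an equivalent r and p-value):

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """Two-sided t test of H0: rho = 0 (sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    se_r = np.sqrt((1 - r**2) / (n - 2))    # SE_r = sqrt((1 - r^2) / (n - 2))
    t = r / se_r                            # t = r / SE_r
    p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-sided p-value, Student's t with n - 2 df
    return r, t, p
```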
Assumptions of testing correlations
- Both variables are on an interval or ratio level of measurement
- The bivariate data must be normally distributed, i.e. both variables are normally distributed or approach normality after data transformation
- Your data have no outliers
- Your data is from a random or representative sample
- You expect a linear relationship between the two variables
- Homogeneity of variance (e.g. the scatter does not form a funnel shape)
Correlation analyses for non-linear relationships
Use a non-parametric test (no assumptions about the sampling distribution) -> Spearman's rank correlation (non-parametric: uses ranks rather than the raw values)
Spearman's rank correlation measures the strength + direction of the linear association between the ranks of 2 variables
Method:
- Rank data for both variables, smallest to largest
- R = rank for the X variable
- S = rank for the Y variable
- Calculate r_s, the correlation coefficient of the ranks R and S (see the sketch below)
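A sketch of the method (rank both variables, then compute the Pearson correlation of the ranks; scipy.stats.spearmanr should agree):

```python
import numpy as np
from scipy import stats

def spearman_rs(x, y):
    """Spearman's r_s: the Pearson correlation of the ranks R and S (sketch)."""
    R = stats.rankdata(x)    # ranks of the X variable, smallest to largest
    S = stats.rankdata(y)    # ranks of the Y variable
    return np.corrcoef(R, S)[0, 1]

# toy usage: a monotonic but non-linear relationship (made-up data)
rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = np.exp(x) + rng.normal(scale=0.1, size=20)
print(spearman_rs(x, y), stats.spearmanr(x, y)[0])
```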
Hypothesis testing: is the correlation (r_s) significantly different from 0?
t_s = r_s / SE_rs
SE_rs = sqrt((1 - r_s^2) / (n - 2))
When n is 100 or less, use the G table
When n is > 100, use the t table
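For the large-sample case, a minimal sketch of the t statistic for r_s (the function name is illustrative):

```python
import numpy as np
from scipy import stats

def spearman_t_test(rs, n):
    """t statistic for H0: population rank correlation = 0, for n > 100 (sketch)."""
    se_rs = np.sqrt((1 - rs**2) / (n - 2))  # SE_rs = sqrt((1 - rs^2) / (n - 2))
    t_s = rs / se_rs
    p = 2 * stats.t.sf(abs(t_s), df=n - 2)  # two-sided, n - 2 df
    return t_s, p
```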
Spearman's rank assumptions
- Random sample
- Linear relationship between ranks of numerical variables
Measurement error
Measurement error = when a variable is not measured perfectly
Measurement error in X and/or Y makes the observed correlation weaker than the true correlation
When the sample correlation coefficient r underestimates the value of ρ (rho), this is called attenuation; it is caused by imprecision in measurement, which is why r often underestimates ρ
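A small simulation sketch (made-up parameters) showing attenuation – adding independent measurement error to X and Y pulls the observed correlation below the error-free one:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# "true" values of the two variables, correlated by construction
true_x = rng.normal(size=n)
true_y = 0.7 * true_x + rng.normal(scale=0.5, size=n)

# observed values = true values + independent measurement error
obs_x = true_x + rng.normal(scale=0.8, size=n)
obs_y = true_y + rng.normal(scale=0.8, size=n)

print(np.corrcoef(true_x, true_y)[0, 1])   # error-free correlation (about 0.81 here)
print(np.corrcoef(obs_x, obs_y)[0, 1])     # noticeably smaller (about 0.47): attenuation
```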