STATS 15- correlation and regression Flashcards
Correlation research design
- Experiments may be impractical or unethical for some research questions
- “Does cholesterol affect the probability of heart disease?”
- “Does smoking shorten people’s life expectancy?”
- But we can look for relationships between such variables
Relationship between 2 variables
- Is there a relationship between IQ and exam marks?
- Bivariate data: two different variables are measured for each participant, and we look for any relationship between them (cf. within-subject design)

Start with a scatter plot
- Height and intelligence
- Data suggests no relationship between height & intelligence

Strong positive correlation, r = +1
- r = +1 is the strongest possible (perfect) positive correlation

Strong negative correlation, r = -1

No correlation- Height and intelligence

Non-linear correlation
- NB: correlations are not about how steep any slope is but about the variation of values around the slope (how well values fit the slope)
- 1= perfect fit
- Strong positive 0-8
- Strong negative 8-14

Correlations: Hypothesis testing
- Null hypothesis
- No relationship between variables X and Y above that expected by chance alone
Measuring the degree of relationship
- Pearson product-moment correlation (r): how well does a straight line fit the data?
- r = -1 (perfect negative relationship): X decreases as Y increases
- r = +1 (perfect positive relationship): X increases as Y increases
- r = 0 (no linear relationship)
- Is affected by outliers and by number of pairs of data
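A minimal sketch of how r can be computed from paired data, using the standard product-moment definition (sum of cross-products of deviations divided by the square root of the product of the summed squared deviations). The function name `pearson_r` and the example values are illustrative assumptions, not taken from the notes. The example also shows that the steepness of the slope does not change r.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data
x = [1, 2, 3, 4, 5]
steep = [10 * v for v in x]        # steep rising line, perfect fit
shallow = [0.1 * v for v in x]     # shallow rising line, perfect fit
falling = [-2 * v + 3 for v in x]  # falling line, perfect fit

r_steep = pearson_r(x, steep)
r_shallow = pearson_r(x, shallow)
r_falling = pearson_r(x, falling)
print(r_steep, r_shallow, r_falling)
```

Both rising lines give r of about +1 despite very different slopes, and the falling line gives r of about -1 (up to floating-point rounding).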
Pearson’s r
- Assumptions
- Linear relationship between X and Y
- Continuous random variables
- Both variables must be normally distributed
- Each pair of observations must be independent of the other pairs
Measuring the degree of the relationship
- Pearson’s r
- The + or - sign indicates the direction of the relationship
- The value indicates strength: values close to either -1 or +1 reflect a strong correlation; values close to 0 mean a weak or no correlation
- r= -0.65 quite strong negative correlation
- r= +0.65 equally strong positive correlation
- P value indicates significance
Some real data
- Positive but not very strong relationship
- With quite a lot of variability

And if we collect more data

The null hypothesis (r=0)
- If there really is no correlation (true r = 0), what is the chance of getting a value of r at least as large as ours by chance alone?
- If that chance is small (p < 0.05), we reject the null hypothesis
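One way to make "getting our value of r by chance" concrete is a permutation test: shuffle one variable many times so that any real pairing is destroyed, and count how often the shuffled data give an r at least as large as the observed one. This is a sketch under assumptions, not the method the notes use; the function names, the data, and the number of shuffles are all hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the null hypothesis of no correlation."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break any real X-Y pairing
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Illustrative (made-up) data with a strong positive relationship
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
p_value = permutation_p(x, y)
print(p_value)
```

A small p-value here means that shuffled (unrelated) data almost never produce a correlation as strong as the observed one.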

What it all means
- r= +0.701 ; p<0.001 (N=30)
- 0.701 => How close the points are to a straight line
- p<0.001 => How likely it is that we would get an r value this large by chance if the true correlation coefficient were actually zero (no correlation)
- N=30 => How many pairs of points there are
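As a sketch of how r and N combine to give a significance level: a common test converts r into a t statistic, t = r * sqrt((N - 2) / (1 - r^2)), with N - 2 degrees of freedom. The notes do not say which test produced the p-value, so using this one is an assumption; the function name `t_from_r` is hypothetical.

```python
import math

def t_from_r(r, n):
    """t statistic for testing whether the true correlation is zero (n - 2 df)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Values quoted in the card above: r = +0.701, N = 30
t_value = t_from_r(0.701, 30)
degrees_of_freedom = 30 - 2
print(round(t_value, 2), degrees_of_freedom)
```

For these values t comes out at about 5.2 on 28 degrees of freedom, well beyond the two-tailed critical value of roughly 3.67 for p = 0.001, which is consistent with the reported p<0.001. The same r with a much smaller N would give a smaller t, which is why N must be quoted.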
Interpret correlations cautiously
- Correlation does NOT imply causality
- Correlations are affected by range restrictions
- E.g. Height… you don’t get many people above 7ft
- Correlations are affected by outliers
- Correlation only measures the degree of LINEAR relationships
- Plot the graph to see if linear
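Two of these cautions can be sketched with made-up numbers: a perfect but U-shaped relationship that gives r = 0, and a single outlier that turns a weak correlation into an apparently strong one. All data and names below are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 1. Non-linear relationship: y is fully determined by x, but not by a line
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [v ** 2 for v in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# 2. Outlier: a scattered cloud of points, then the same cloud plus one extreme pair
cloud_x = [1, 2, 3, 4, 5, 6]
cloud_y = [3, 1, 4, 1, 5, 2]
r_without = pearson_r(cloud_x, cloud_y)
r_with = pearson_r(cloud_x + [20], cloud_y + [20])

print(r_curve, r_without, r_with)
```

The U-shaped data give r of 0 even though the relationship is perfect, and adding the single extreme point moves r from weak to above 0.9. Both problems are obvious on a scatter plot, which is why plotting comes first.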
Correlation and causation
- Ice cream sales and the number of shark attacks on swimmers are correlated
- The number of cavities in primary school children and vocabulary size have a strong positive correlation
- Can’t say vocabulary causes cavities (both probably increase with age)
- The more TVs per citizen, the longer the average life expectancy of a country
- Patients operated on by surgeons with cleaner hands live longer
Non-parametric alternatives
- Spearman’s Rho
- Spearman’s rank correlation coefficient
- Kendall’s tau
- Cross tabulation, c.f. Chi squared
- Fewer assumptions, robust to outliers, but also less sensitive
Advantages of non-parametric correlation I: ranking
- Ranking can convert non-linear data (as long as it consistently rises or falls) to linear, which allows us to perform a linear statistical test on the ranks

Advantages II
- Less sensitive to outliers
- ranking can distribute the data more evenly
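A small sketch of the ranking idea, with made-up data: y rises with x along a strong curve, so a straight line fits the raw values imperfectly, but the ranks of x and y line up exactly. The helper names `ranks` and `pearson_r` are hypothetical, and the ranking helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank each value from 1 (smallest) upward; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

# Illustrative (made-up) data: always rising, but strongly curved
x = [1, 2, 3, 4, 5, 6]
y = [v ** 4 for v in x]

r_raw = pearson_r(x, y)                    # straight line fitted to raw values
r_ranked = pearson_r(ranks(x), ranks(y))   # straight line fitted to ranks
print(r_raw, r_ranked)
```

The raw correlation is clearly below 1, while the correlation of the ranks is 1, because ranking keeps only the order of the values and discards the curvature.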

Spearman’s Rho (ρ, rs)
- Tests for a relationship between the ranks of 2 variables
- So put paired variables in tables and rank each one
- Compare differences in ranks
- E.g. Is there a relationship between age and shoe size
Spearman’s Rho- Formula
- d= Difference in rank
- N = Number of participants
- Σ = sum of
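The formula itself did not survive in these notes. The symbols defined above match the standard textbook form, rs = 1 - (6 Σd²) / (N(N² - 1)), so the sketch below assumes that form. The ages and shoe sizes are made up for illustration (they are not the data from the lecture example), and the ranking helper assumes no tied values.

```python
def ranks(values):
    """Rank each value from 1 (smallest) upward; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho from rank differences: 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(xs)
    # d = difference between the two ranks of each participant
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical data for N = 5 participants
age = [5, 8, 11, 14, 17]
shoe_size = [1, 3, 6, 5, 9]

rho = spearman_rho(age, shoe_size)
print(rho)
```

Here only two participants swap places in the rankings, so Σd² = 2 and rs = 1 - 12/120 = 0.9.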

Example- Positive correlation


Reporting the result
- The Spearman’s Rho test was applied to the data and a significant positive correlation was found between age and shoe size
- (rs= 0.9, n=5, p<0.05)
- As shoe size increased, age increased
- No implication of causality
Another rank correlation: Suitable for contingency table data
- A contingency table relates one variable to another
- E.g. age of students vs. classification of the degree they got

Why not use chi-squared
- Because three cells contain fewer than 5 cases
- See nonparametric testing lectures
- Can use Kendall’s Tau
- A method for measuring the association between variables in cross tabulations
How does Tau work
- If there was a positive correlation between age and classification
- youngest = 1st / oldest = 3rd
- We would expect most of the data to fall between the tram lines if it was positively correlated

How does it work
- we would expect most of the data to fall between the tram lines if it was negatively correlated
- This doesn’t happen

How does it work
- Essentially, Kendall’s Tau is a statistic which for each cell compares the number of cases below and to the right of the cell with those above and to the right
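The cell-by-cell comparison described above can be sketched as follows. For each cell, cases below and to the right count as agreeing (concordant) pairs, and cases above and to the right count as disagreeing (discordant) pairs. The notes do not say which variant of tau is meant, so the tau-b scaling used here is an assumption, and the table of counts is entirely made up.

```python
import math

def kendall_tau_b(table):
    """Kendall's tau-b for an ordered contingency table (list of rows of counts)."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_rows):
                for m in range(j + 1, n_cols):   # only cells to the right
                    if k > i:                    # below and to the right
                        concordant += table[i][j] * table[k][m]
                    elif k < i:                  # above and to the right
                        discordant += table[i][j] * table[k][m]
    n = sum(sum(row) for row in table)
    all_pairs = n * (n - 1) / 2
    # Pairs tied on the row variable / on the column variable
    row_ties = sum(t * (t - 1) / 2 for t in (sum(row) for row in table))
    col_ties = sum(t * (t - 1) / 2 for t in (sum(col) for col in zip(*table)))
    return (concordant - discordant) / math.sqrt(
        (all_pairs - row_ties) * (all_pairs - col_ties)
    )

# Hypothetical counts: rows = age group (young to old), columns = degree class (1st to 3rd)
example_table = [
    [10, 8, 2],
    [6, 9, 5],
    [3, 7, 10],
]
tau = kendall_tau_b(example_table)
print(tau)
```

With these made-up counts most cases sit near the top-left to bottom-right diagonal, so tau is positive. A table with all cases on that diagonal gives +1, and one with all cases on the opposite diagonal gives -1.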

How does it work
- There is not a similar balance in the data meaning it is not correlated

Reporting the result
- Tau tells us both the size and direction (like Spearman’s Rho) of the correlation
- Computers can also compute the significance value for the sample size used
- E.g. A Kendall Tau test for ordered contingency tables suggested no significant relationship between age and degree class (tau = -0.44, N=180, p>0.05)
Reporting correlations
- Describe data (Including scatterplot)
- Describe relationship in words
- Quote N: with a large N we are more likely to detect a real correlation (or a real lack of one)
- Quote the coefficient (with the value of p): p shows how significant our result is
Summary of correlations
- Pearson’s product moment: r
- Spearman’s Rho: rs, suitable for non-parametric data
- Kendall’s Tau: τ, for data represented in a contingency table
- When to use each
- What research design
- Normal data