L21 Part 1 - Correlation (chapter 7) Flashcards

1
Q

What are some takeaways so far from the Stats part of SSR?

A
  1. Always plot your data - useful to visualise the patterns
  2. Predicting the DV - look at the predictor variables (one-way, factorial, visual prediction in RM ANOVA) and assess whether they give an accurate approximation/prediction of the DV (high explained variance?)
    ↪ Total variability of DV = explained variance + unexplained variance → DV = model prediction + error
    ↪ More explained variance than the unexplained variance (e.g. F>1) = success
  3. Keep your models simple/interpretable
2
Q

What is Pearson correlation coefficient?

A
  • Referred to as Pearson’s r (Pearson’s product moment correlation)
  • Measure of the linear correlation between two continuous variables
  • It has a value between +1 and -1, where +1 is a perfect positive correlation, 0 is no linear correlation, and -1 is a perfect negative correlation
  • Useful because we can immediately tell whether the correlation is strong, weak or absent (= standardization → we can interpret the data without knowing what variables are being measured or on what scale)
3
Q

What is a bivariate correlation?

A

Association between two variables
- We can also have association between more variables (later)

4
Q

What is the formula for correlation?

No need to remember exactly, just understand it

A

Picture 1
Where S is the standard deviation and COV is the covariance
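
Picture 1 is not reproduced in this text version. Assuming it shows the usual definition, with cov_xy the covariance and s_x, s_y the standard deviations of the two variables, the formula would read roughly as follows (the symbol names are my own choice):

```latex
r = \frac{\mathrm{cov}_{xy}}{s_x \, s_y}
  = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x \, s_y}
```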

5
Q

What does covariance represent in the formula and why do we divide it by standard deviations of the two variables?

A

The covariance is the unstandardized version of the correlation between the two variables
We divide it by the product of the two variables’ standard deviations to standardise it

6
Q

How can we plot the covariance?

A

Picture 2
We can plot each data point on a graph and divide it into 4 quadrants

  • The quadrant a data point falls in tells us whether it contributes to a positive or a negative correlation (red quadrants - neg. corr.: scored higher than average on one variable but lower on the other; green quadrants - pos. corr.: scored higher or lower than average on both variables)
  • The quadrants are defined by the difference between the data point and the mean of each variable
7
Q

How do we calculate the covariance?

A

Picture 3
- For each data point, we calculate the distance from one variable’s mean (the deviation) and multiply it by the corresponding deviation on the second variable (cross-product deviation)
- We then sum those cross-product deviations
- Finally, we divide the sum by the number of observations - 1 to get the covariance, since we want an average value of the combined deviations for the two variables
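
The steps above can be sketched in a few lines of Python. This is only an illustration: the data values and the function names are made up for the example and are not from the lecture.

```python
# Illustrative paired data (an assumption for this sketch, not lecture data).
x = [5, 4, 4, 6, 8]
y = [8, 9, 10, 13, 15]

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance: sum of cross-product deviations divided by N - 1."""
    mx, my = mean(xs), mean(ys)
    # Deviation from each variable's mean, multiplied pairwise.
    cross_products = [(a - mx) * (b - my) for a, b in zip(xs, ys)]
    return sum(cross_products) / (len(xs) - 1)

cov_xy = covariance(x, y)
print(cov_xy)
```

A positive result means that, on average, data points sit in the "green" quadrants (both deviations have the same sign).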

8
Q

What is standardisation, why do we use it and what is the formula?

A
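  • Converting each score into a z-score: its deviation from the mean, expressed in standard deviation units
  • After standardisation, both variables have a mean of 0 and an SD of 1
  • We use it to be able to compare two variables measured on different scales
  • Also used to assess outliers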

Formula - picture 4

9
Q

If we plot the standard deviation of each variable, what is it going to look like before and after standardisation?

A

Picture 5 - Before standardisation (lot more variability)
Picture 6 - After standardisation

10
Q

What does z-score represent?

A

How many standard deviations is this observation greater or smaller than the mean
- can be positive (> mean) or negative (< mean)
- thanks to this we can detect extreme scores = outliers
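
A minimal sketch of computing z-scores, assuming the sample standard deviation (N - 1 in the denominator) is used; the data and the function name are illustrative only.

```python
import math

scores = [5, 4, 4, 6, 8]  # illustrative data, not from the lecture

def z_scores(values):
    """Return each value as its distance from the mean in SD units."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]

z = z_scores(scores)
print([round(v, 2) for v in z])
```

Scores above the mean get a positive z, scores below it a negative z, and the standardised variable ends up with mean 0 and SD 1.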

11
Q

What is another way to calculate covariance using z-scores?

A

Because z-scores have a mean of 0, their deviations from the mean are the z-scores themselves. We multiply the z-scores for each pair of observations, sum those products and divide by N - 1 - this gives us the standardised covariance, which is the correlation
↪ Correlation can be seen as the covariance between the z-scores
Picture 7
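
To check that the two routes agree, here is a small sketch that computes r once as covariance divided by the product of the SDs and once as the covariance between z-scores. The data and helper names are assumptions made for the example.

```python
import math

x = [5, 4, 4, 6, 8]       # illustrative data
y = [8, 9, 10, 13, 15]

def mean(values):
    return sum(values) / len(values)

def sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)

def z_scores(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]

# Route 1: standardise the covariance afterwards.
r_from_cov = covariance(x, y) / (sd(x) * sd(y))
# Route 2: standardise the variables first, then take their covariance.
r_from_z = covariance(z_scores(x), z_scores(y))
print(round(r_from_cov, 3), round(r_from_z, 3))
```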

12
Q

How do we use the frequentist framework (p-values) to say whether there is evidence against the null hypothesis?

the step-by-step process to get to p-value

A
  1. State our null hypothesis and alternative hypothesis
    ↪ H0: tr = 0 (no association between the two variables)
    ↪ Ha: tr ≠ 0 (two-sided test)
    ↪ Ha: tr > 0 (pos. association)
    ↪ Ha: tr < 0 (neg. association)
  2. We convert our correlation (r) to a t-statistic so that we can use the t-distribution as our sampling distribution (picture 8); calculate df (N - 2)
  3. Locate the t in our sampling distribution and find p-value
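
The three steps can be sketched with the standard library alone. This assumes picture 8 shows the usual conversion t = r√(N − 2)/√(1 − r²); the sample size of 103 is an assumed value for illustration, and the p-value is obtained by numerically integrating the t density rather than by any particular package's routine.

```python
import math

def r_to_t(r, n):
    """Convert a Pearson r from n paired observations to t with df = n - 2."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

def t_pdf(t, df):
    """Density of Student's t distribution (log-gamma for stability)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + t * t / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p: 1 minus the area between -|t| and +|t| (Simpson's rule)."""
    a, b = -abs(t), abs(t)
    if a == b:
        return 1.0
    h = (b - a) / steps  # steps must be even for Simpson's rule
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return max(0.0, 1 - total * h / 3)

# r = -0.441 is the exam anxiety example; n = 103 is an assumption.
t, df = r_to_t(-0.441, 103)
p = two_sided_p(t, df)
print(round(t, 2), df, p)
```

For a one-sided alternative hypothesis in the direction of the observed r, the p-value would be half the two-sided one.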
13
Q

What does our p-value represent?

A

The probability of observing a statistic at least as extreme as this one if there is no association between the two variables (i.e. if H0 is true)

14
Q

What is the formula to calculate the p-value using z scores?

A

Picture 12
We don’t necessarily need to know this because Johnny didn’t even mention it and in the book they said that z-scores are almost never used with correlation. I included it here just because we can use it to calculate the confidence intervals which we will go over in the next flashcard

15
Q

What is a confidence interval for r and how do we calculate it?

A

Confidence intervals represent the likely correlation in the population (assuming that our sample is one of the 95% for which the confidence interval contains the true value)
Picture 13

Again, not that important to know
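
For reference, a sketch of one common way to get a confidence interval for r: the Fisher z transformation. Whether picture 13 uses exactly this approach is an assumption, and so is the sample size of 103. Note that a bootstrap CI, as used in reporting examples, is computed differently.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via Fisher's z (z_crit = 1.96 for 95%)."""
    z_r = math.atanh(r)          # transform r to an approximately normal z
    se = 1 / math.sqrt(n - 3)    # standard error of z_r
    lower = math.tanh(z_r - z_crit * se)  # transform the limits back to r
    upper = math.tanh(z_r + z_crit * se)
    return lower, upper

# r = -0.441 is the exam anxiety example; n = 103 is an assumption.
lower, upper = fisher_ci(-0.441, 103)
print(round(lower, 3), round(upper, 3))
```

The interval is not symmetric around r because the back-transformation compresses values near ±1.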

16
Q

To see how covariance and correlation change when the standard deviations change, look at picture 9 - it has my annotations on Johnny’s applet so if you can’t read something, let me know

Apologies for the size of my writing, you will have to zoom in the picture hihi

A
17
Q

What is the coefficient of determination, R^2?

A

A measure of the amount of variability in one variable that is shared by the other
- We calculate it by squaring r and we can then convert it to percentages by multiplying it by 100
- E.g. r between exam performance and exam anxiety is -0.441 → (-0.441)^2 = 0.194 → exam anxiety shares 19.4% of the variability in exam performance (80.6% of the variability is unexplained)
- The remainder of the percentages is the unexplained variability

18
Q

An example

A

Variables - exam performance (Y), exam anxiety (X) and time spent revising (Z) (the variable we will control for)
- If we want to see the unique relationship between exam anxiety and exam performance we need to account for revision time
- This can be done in two ways: semi-partial correlation and partial correlation

The corresponding correlations of the pairs are displayed in picture 14; squaring them and multiplying by 100 turns them into percentages of shared variance
Now, the different shared variances can be shown by the diagram in picture 15. It’s a bit complicated so take some time to understand it
- E.g. The correlation between exam performance and revision time tells us that they share 15.8% of the variance, but in reality only 1.5% (area B) is unique to revision time, whereas the remaining 14.3% (area C) is also shared with exam anxiety

19
Q

What is semi-partial correlation?

A

Expresses the unique relationship between two variables as a function of their total variance

20
Q

How can we see semi-partial correlation with the example?

A

The semi-partial correlation squared (sr^2) between exam performance and exam anxiety is the uniquely overlapping area (A) expressed as a function of the total area of exam performance (A + B + C + E)
↪ A/(A+B+C+E) - picture 16
Semi-partial correlation is the relationship between X and Y (area A) accounting for the overlap in X and Z (area C), but not the overlap in Y and Z (area B)
In our specific example, the semi-partial correlation for exam performance and exam anxiety (area A) quantifies their relationship accounting for the overlap between exam anxiety and revision time (area C) but not the overlap between exam performance and revision time (area B)

21
Q

What is partial correlation?

A

Quantifies an association between two variables while taking into account a third variable that we want to control for
Picture 10 - venn diagram that demonstrates this

22
Q

Explain the venn diagram in terms of shared variance between the different variables

A
  • If we want to compute the correlation between X and Y, we look at the overlap of their shared variance (50%) - the larger the overlap, the more variance they share, the better they predict each other
  • If we look at this overlap and take into account the variance explained by Z, we see that that decreases the variance shared purely by X and Y (to 25%) because of the association between Z and Y, and Z and X
  • Partial correlation: variance in Y accounted for by variable X after removing effects of variable Z
23
Q

How can we fit our example with partial correlation?

A

We can express the unique variance in terms of the variance in Y left over when other variables have been considered
↪ A/(A+E) - picture 17
Partial correlation squared (pr^2) between exam performance and exam anxiety is the uniquely overlapping area (A) expressed as a function of the area of exam performance that does not overlap with revision time (A + E)
- the partial correlation for exam performance and exam anxiety (area A) adjusts for both the overlap in exam anxiety and revision time (area C) and the overlap in exam performance and revision time (area B)

24
Q

What is the difference between semi-partial correlation and partial correlation?

A

By ignoring the variance in Y that overlaps with Z, a partial correlation adjusts for both the overlap that X and Z have and the overlap in Y and Z, whereas a semi-partial correlation adjusts only for the overlap that X and Z have
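
The difference shows up in the denominators when both are computed from the three pairwise correlations. The sketch below uses the standard first-order formulas; r_xy = -0.441 is the value from the R^2 card, r_yz = 0.397 is assumed because its square is about 15.8%, and r_xz = -0.709 is an assumed value that does not appear in these cards.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """X-Y correlation with Z removed from both X and Y."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def semi_partial_r(r_xy, r_xz, r_yz):
    """X-Y correlation with Z removed from X only (Y keeps all its variance)."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_xz ** 2)

r_xy = -0.441  # exam anxiety (X) vs exam performance (Y)
r_yz = 0.397   # exam performance (Y) vs revision time (Z): assumed value
r_xz = -0.709  # exam anxiety (X) vs revision time (Z): assumed value

pr = partial_r(r_xy, r_xz, r_yz)
sr = semi_partial_r(r_xy, r_xz, r_yz)
print(round(pr, 3), round(pr ** 2, 3))  # pr^2 corresponds to A / (A + E)
print(round(sr, 3), round(sr ** 2, 3))  # sr^2 corresponds to A / (A + B + C + E)
```

Because the partial correlation divides by a smaller quantity, its absolute value is never smaller than that of the semi-partial correlation.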

25
Q

Formula for partial correlation?

A

Picture 11
The little dot between xy and z indicates that we are partialing out (controlling for) a third variable z
↪ the simple correlation between our variables is adjusted to account for the correlations between each main variable and the confounding variable z (numerator)
- The way we do that is by taking the three correlations between each of the three variables
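
Picture 11 is not reproduced in this text version. Assuming it shows the standard first-order partial correlation, built from the three pairwise correlations, the formula would be:

```latex
r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{\left(1 - r_{xz}^{2}\right)\left(1 - r_{yz}^{2}\right)}}
```

The numerator adjusts the simple correlation for the two correlations with z, and the denominator rescales by the variance in x and in y that z does not explain.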

26
Q

What does it mean if the partial correlation is higher or lower than the simple correlation?

It’s a long flashcard but I wanted to put it together because it makes sense to compare them while reading it

A

Picture 18 shows the two scenarios

  • Higher → we are able to control for the meddling of the third variable (i.e. its inclusion just creates more confusion in the data and isn’t valuable since the relationship between the two main variables is already strong enough) (scenario 1)
    ↪ the third variable removed from the equation, reveals a clearer, potentially more accurate relationship between the two main variables (the confounding variable suppresses the interaction between the two main variables)
  • Lower = when the third variable shares a lot of overlap with the A(X) and B(Y) variables (explains a lot of variability in A(X) and in B(Y); it’s a confounding variable)
    ↪ The simple correlation includes the shared variance introduced by C(Z), boosting the observed association between A and B. After controlling for C, the partial correlation isolates the direct association between A and B, which is weaker than the simple correlation in this scenario (scenario 2)
27
Q

How do we check the significance of partial correlation?

A

The same way as for simple correlation (t-statistic converted to p-value, reject/not H0)
The null hypothesis is that there is no association between the two variables after controlling for effects of confounding variables

28
Q

Assumptions for the Pearson correlation coefficient

A
  1. Continuous variables, measured at an interval or ratio level
  2. Normal distribution of the variables (and their residuals)
29
Q

Which non-parametric statistics can we use when assumptions of correlation are violated?

A
  1. Spearman’s correlation coefficient (rho)
  2. Kendall’s tau
30
Q

What is Spearman’s rho and when do we use it?

A
  • When our data have outliers, are not normal or our variables are measured at the ordinal level
  • Computed by first ranking the data and then applying Pearson’s equation to those ranks
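
The "rank first, then Pearson" recipe can be sketched as follows. The data and function names are illustrative assumptions; tied values receive the average of their ranks, which is the usual convention.

```python
def average_ranks(values):
    """Rank from 1 (smallest); tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]  # monotonic but not linear, with one extreme value
rho = spearman(x, y)
r = pearson(x, y)
print(round(rho, 3), round(r, 3))
```

Here the extreme value pulls Pearson's r below 1, while the ranks are in perfect agreement, so rho is unaffected.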
31
Q

When do we use Kendall’s tau instead of Spearman’s rho?

A

When we have a small data set with a large number of tied ranks

32
Q

What is point-biserial correlation and biserial correlation?

This was only in the book written in 1 page so I don’t think it’s that important but I’m putting it here just in case and so that you are aware that there is something like this as well

A

Both are used when one of the variables is dichotomous (categorical with only two categories)
- The difference between the two is whether the dichotomous variable is discrete or continuous
Point-biserial correlation - discrete dichotomous variable, i.e. one or the other (e.g. being pregnant or being dead - you can only be dead or alive, nothing in between)

Biserial correlation - continuous dichotomous variable, i.e. there is a continuum between the categories (e.g. passing or failing an exam - you can just barely pass or excel completely)

Biserial cannot be calculated in JASP, you have to transform the data to point-biserial coefficient and then adjust the value (I don’t think we need to know this it was under the extra info box)

33
Q

How to report correlation coefficients

A

Report the r value, confidence interval, and significance value (no effect size since coefficient is already effect size)

E.g. Creativity was significantly related to a person’s placing in the World’s Biggest Liar competition, r = −0.30, 95% Bootstrap CI [−0.48, −0.11], p = 0.001

34
Q

Correlation in JASP is in picture 19

If you have any questions ask me

A