L21 Part 1 - Correlation (chapter 7) Flashcards
What are some takeaways so far from Stats part from SSR?
- Always plot your data - useful to visualise the patterns
- Predicting the DV - look at the predictor variables (one-way, factorial, visual prediction in RM ANOVA) and assess whether they give an accurate approximation/prediction of the DV (high explained variance?)
↪ Total variability of DV = explained variance + unexplained variance → DV = model prediction + error
↪ More explained variance than unexplained variance (e.g. F>1) = success
- Keep your models simple/interpretable
What is Pearson correlation coefficient?
- Referred to as Pearson’s r (Pearson’s product moment correlation)
- Measure of the linear correlation between two continuous variables
- It has a value between +1 and -1, where +1 is a perfect positive correlation, 0 is no linear correlation, and -1 is a perfect negative correlation
- Cool because we can immediately tell if the correlation is strong or weak, or no correlation (= standardization → interpret the data without knowing what variables are being measured and with what scale)
What is a bivariate correlation?
Association between two variables
- We can also have association between more variables (later)
What is the formula for correlation?
No need to remember exactly, just understand it
Picture 1
Where S is the standard deviation and COV is the covariance
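Picture 1 is not reproduced in these notes. Assuming it shows the usual definition, with x and y as the two variables and N as the number of paired observations (these symbol names are an assumption), it would read roughly as:

```latex
r = \frac{\mathrm{cov}_{xy}}{s_x s_y}
  = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}
```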
What does covariance represent in the formula and why do we divide it by standard deviations of the two variables?
The covariance is the unstandardised version of the correlation between the two variables (its size depends on the units of measurement)
We divide it by the product of the two variables' standard deviations to standardise it
How can we plot the covariance?
Picture 2
We can plot each data point on a graph and divide it into 4 quadrants
- According to the quadrant in which a data point is located, we can say whether it contributes to a positive or negative correlation (red quadrants - neg. corr.; scored higher than average on one variable but lower on the other, green quadrants - pos. corr.; scored higher or lower than average on both variables)
- The distance from a data point to each mean line represents the difference between the data point and the mean of that variable (its deviation)
How do we calculate the covariance?
Picture 3
- For each data point, we calculate its distance from one variable's mean (the deviation) and multiply this by the corresponding deviation on the second variable (cross-product deviation)
- Then we sum those cross-product deviations
- Finally, we divide by the number of observations - 1, since we want an average value of the combined deviations for the two variables; this average is the covariance
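The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code; the function name `covariance` and the data values are made up for the example.

```python
def covariance(xs, ys):
    """Sample covariance: summed cross-product deviations divided by N - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Cross-product deviation for each pair of observations
    cross_products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(cross_products) / (n - 1)


# Hypothetical data, for illustration only
x = [5, 4, 4, 6, 8]
y = [8, 9, 10, 13, 15]
print(covariance(x, y))
```

A positive result means the points mostly fall in the quadrants where both deviations share the same sign.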
What is standardisation, why do we use it and what is the formula?
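- Converting raw scores to z-scores: subtract the mean and divide by the standard deviation (the SD is a measure of the average deviation from the mean)
- After standardising, both variables have a mean of 0 and an SD of 1
- Use it to be able to compare two variables measured on different scales
- Also used to assess outliers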
Formula - picture 4
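Picture 4 is not reproduced here. Assuming it shows the usual z-score formula, z = (score - mean) / SD, a minimal Python sketch looks like this (the function name `z_scores` and the data are hypothetical):

```python
import statistics


def z_scores(values):
    """Standardise a sample: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (N - 1 in the denominator)
    if sd == 0:
        raise ValueError("all values are identical, so the SD is 0")
    return [(v - mean) / sd for v in values]


# Hypothetical data, for illustration only
scores = [5, 4, 4, 6, 8]
standardised = z_scores(scores)
print(standardised)
print(statistics.mean(standardised), statistics.stdev(standardised))
```

The last line should print a mean of (approximately) 0 and an SD of (approximately) 1, whatever the original scale was.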
If we plot the standard deviation of each variable, what is it going to look like before and after standardisation?
Picture 5 - Before standardisation (lot more variability)
Picture 6 - After standardisation
What does z-score represent?
How many standard deviations an observation lies above or below the mean
- can be positive (> mean) or negative (< mean)
- thanks to this we can detect extreme scores = outliers
What is another way to calculate covariance using z-scores?
For each pair of observations, we multiply the two z-scores (z-scores have a mean of 0, so each one is already a deviation from its mean), sum these products and divide by N - 1 - this gives us the standardised covariance, which is the correlation
↪ Correlation can be seen as the covariance between the z-scores
Picture 7
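To check that the two routes agree, the sketch below computes r once from z-scores and once as covariance divided by the product of the SDs. The function names and the data are made up for the illustration.

```python
import statistics


def z_scores(values):
    """Standardise a sample using its mean and sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def r_from_z(xs, ys):
    """Correlation as the covariance of the z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


def r_from_cov(xs, ys):
    """Correlation as covariance divided by the product of the two SDs."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical data, for illustration only
x = [5, 4, 4, 6, 8]
y = [8, 9, 10, 13, 15]
print(r_from_z(x, y), r_from_cov(x, y))
```

Both calls should print the same value, up to floating-point rounding.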
How do we use the frequentist framework (p-values) to say whether there is evidence against the null hypothesis?
The step-by-step process to get to the p-value:
- State our null hypothesis and alternative hypothesis
↪ H0: tr = 0 (no association between the two variables)
↪ Ha: tr ≠ 0 (two-sided test)
↪ Ha: tr > 0 (pos. association)
↪ Ha: tr < 0 (neg. association)
- We convert our correlation (r) to a t-statistic so that we can use its distribution as our sampling distribution (picture 8); calculate df
- Locate the t in our sampling distribution and find p-value
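Picture 8 is not reproduced here. Assuming it shows the usual conversion t = r * sqrt(N - 2) / sqrt(1 - r^2) with df = N - 2, the steps can be sketched with the standard library only. The function names are hypothetical, and the p-value comes from a simple numerical integration of the t density rather than from a statistics package, so treat it as an approximation.

```python
import math


def t_from_r(r, n):
    """Convert a correlation r from n pairs into a t-statistic and its df."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Approximate two-sided p-value via Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


# Hypothetical example: r = 0.5 observed in a sample of 27 pairs
t, df = t_from_r(0.5, 27)
print(t, df, two_sided_p(t, df))
```

For a one-sided alternative in the direction of the observed r, the two-sided value would be halved.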
What does our p-value represent?
The probability of observing a statistic at least as extreme as this one if there is no association between the two variables (i.e. if H0 is true)
What is the formula to calculate the p-value using z scores?
Picture 12
We don’t necessarily need to know this because Johnny didn’t even mention it and in the book they said that z-scores are almost never used with correlation. I included it here just because we can use it to calculate the confidence intervals which we will go over in the next flashcard
What is a confidence interval for r and how do we calculate it?
Confidence intervals represent the likely correlation in the population (assuming that our sample is one of the 95% for which the confidence interval contains the true value)
Picture 13
Again, not that important to know
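Pictures 12 and 13 are not reproduced here. Assuming they show the common approach based on Fisher's z-transformation (z_r = atanh(r), with standard error 1 / sqrt(N - 3)), a minimal sketch of both the z-based p-value and the confidence interval could look like this. The function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_p_value(r, n):
    """Two-sided p-value for H0: no correlation, using Fisher's z."""
    z = math.atanh(r) * math.sqrt(n - 3)  # z_r divided by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fisher_ci(r, n, level=0.95):
    """Confidence interval for r, built on the z scale and transformed back."""
    z_r = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    lower = math.tanh(z_r - z_crit * se)
    upper = math.tanh(z_r + z_crit * se)
    return lower, upper


# Hypothetical example: r = 0.5 observed in a sample of 28 pairs
print(fisher_p_value(0.5, 28))
print(fisher_ci(0.5, 28))
```

The interval is not symmetric around r because the back-transformation with tanh compresses values near +1 and -1.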