Week 3: Correlation Flashcards
A general approach in regression is that our outcomes can be predicted by a model, and what remains is the error.
The i in the general regression model (outcome_i = model + error_i) indexes each observation
e.g., outcome 1 is equal to the model plus error 1, outcome 2 is equal to the model plus error 2, and so on…
For correlation, the outcome is modelled by
scaling (multiplying by a constant) another variable
Equation of correlation model: outcome_i = (b1 * X_i) + error_i
What does this equation of the correlation model mean, and what does b1 represent? - (2)
‘The outcome for an entity is predicted from their score on the predictor variable plus some error.’
The model is described by a parameter, b1, which in this context represents the relationship between the predictor variable (X) and the outcome.
If you have one continuous variable which meets the assumptions of parametric tests, then you can conduct a
Pearson correlation or regression
Variance is a feature of the outcome measurements we have obtained; in correlation/regression we want to predict it with a model that…
captures the effect of the predictor variables we have manipulated or measured
Variance of a single variable represents the
average amount that the data vary from the mean
Variance is the standard deviation
squared (s squared)
Variance formula - (2) (see the sketch below)
- x_i minus the average of all participants’ scores, squared, then divided by the total number of participants minus 1
- done for each participant and summed (sigma): s^2 = Σ(x_i − x̄)^2 / (n − 1)
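A minimal sketch in Python of what this formula does (the scores are made-up data for illustration; numpy's ddof=1 applies the same n − 1 denominator):

```python
import numpy as np

# Hypothetical scores for five participants (made-up data)
scores = np.array([3.0, 5.0, 4.0, 6.0, 7.0])

# Sample variance: sum of squared deviations from the mean, divided by n - 1
n = len(scores)
manual_var = np.sum((scores - scores.mean()) ** 2) / (n - 1)

# np.var with ddof=1 uses the same n - 1 denominator
assert np.isclose(manual_var, np.var(scores, ddof=1))
print(manual_var)  # 2.5
```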
Variance is SD squared, meaning that it captures the
average of the squared differences of the outcome values from the mean of all outcomes (this is what the variance formula does)
Covariance gathers information on whether
one variable covaries with another
In covariance, if we are interested in whether 2 variables are related, then we are interested in whether changes in one variable are met with changes in the other,
therefore… - (2)
when one variable deviates from its mean we
would expect the other variable to deviate from its mean in a similar way.
So, if one variable increases then the other, related variable should also increase (or, for a negative relationship, decrease) by a corresponding amount.
The simplest way to look at whether 2 variables are associated is to look at whether they.. which means.. - (2)
covary
How much scores of two variables deviate from their respective means
If one variable covaries with another variable then it means these 2 variables are
related
To get the SD from the variance you would
take the square root of the variance
What would you do in the covariance formula, in words? - (5) (see the sketch after this list)
- Calculate the error between the mean and each subject’s score for the first variable (x).
- Calculate the error between the mean and their score for the second variable (y).
- Multiply these error values.
- Add these products to get the sum of the cross-product deviations.
- The covariance is the average of these cross-product deviations (divide by N − 1).
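A minimal Python sketch of these five steps. The raw scores are chosen so they reproduce the covariance of 4.25 quoted later in these cards (they appear to be the adverts-watched/packets-bought example):

```python
import numpy as np

x = np.array([5.0, 4.0, 4.0, 6.0, 8.0])     # adverts watched
y = np.array([8.0, 9.0, 10.0, 13.0, 15.0])  # packets bought

# Steps 1-3: deviations from each mean, multiplied pairwise
cross_products = (x - x.mean()) * (y - y.mean())

# Steps 4-5: sum the cross-product deviations and divide by N - 1
cov_xy = cross_products.sum() / (len(x) - 1)

# np.cov uses the same N - 1 denominator by default
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])
print(cov_xy)  # 4.25
```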
Example of calculating covariance: what does the answer tell you?
The answer is positive: that tells us the x and y values tend to rise together.
What does each element of the covariance formula stand for? - (5)
X = the value of the ‘x’ variable
Y = the value of the ‘y’ variable
X̄ (X with a bar over it) = mean of ‘x’ - e.g., green
Ȳ (Y with a bar over it) = mean of ‘y’ - e.g., blue
n = the number of items in the data set
covariance will be large and positive when values below
the mean for one variable tend to pair with values below the mean for the other (and values above the mean pair with values above the mean)
What does a positive covariance indicate?
as one variable deviates from the mean, the other
variable deviates in the same direction.
What does this diagram show? - (5)
- Green line is the average number of packets bought
- Blue line is the average number of adverts watched; vertical lines represent deviations/residuals between the observed values and the means, and circles represent the means
- There is a similar pattern of deviations for both variables: when a person’s score is below the mean for one variable, their score on the other variable is below its mean too
- The similarity we are seeing between the two variables is quantified by calculating the covariance: divide the sum of the cross-product deviations (the multiplied deviations of the 2 variables) by the number of observations minus 1
- We divide by n − 1 because we are unsure of the true population mean; this is related to degrees of freedom.
What does negative covariance indicate?
a negative covariance indicates that as one variable deviates from the mean (e.g. increases), the other deviates from the mean in the opposite direction (e.g. decreases).
What is the problem with covariance as a measure of the relationship between 2 variables? - (5)
- It is dependent upon the units/scales of measurement used
- So covariance is not a standardised measure
- e.g., if 2 variables are measured in miles and the covariance is 4.25, then if we convert the data to kilometres we have to calculate the covariance again, and we see it increases to 11
- This dependence on the scale of measurement is a problem because we cannot compare covariances in an objective way: we cannot say whether one covariance is large or small relative to another data set unless both data sets are measured in the same units
- So we need to STANDARDISE it.
What is the process of standardisation?
To overcome the problem of dependence on the measurement scale, we need to convert
the covariance into a standard set of units
How do you standardise the covariance?
By dividing it by the product of the standard deviations of both variables.
Formula for standardising covariance
Same as the formula for covariance, but divided by the product of the SD of x and the SD of y
Formula of Pearson’s correlation coefficient, r: r = cov(x, y) / (s_x × s_y)
Example of calculating Pearson’s correlation coefficient, r - (5) (see the sketch below)
- The standard deviation for the number of adverts watched (s_x) was 1.67
- The SD of the number of packets of crisps bought (s_y) was 2.92
- If we multiply these together we get 1.67 × 2.92 = 4.88
- Now, all we need to do is take the covariance, which we calculated a few pages ago as being 4.25, and divide by these multiplied standard deviations
- This gives us r = 4.25/4.88 = .87
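The same calculation in Python, as a check (same made-up data as the covariance sketch above; scipy's stats.pearsonr gives the identical answer directly):

```python
import numpy as np
from scipy import stats

x = np.array([5.0, 4.0, 4.0, 6.0, 8.0])     # adverts watched
y = np.array([8.0, 9.0, 10.0, 13.0, 15.0])  # packets of crisps bought

cov_xy = np.cov(x, y)[0, 1]                 # 4.25
sx, sy = x.std(ddof=1), y.std(ddof=1)       # ~1.67 and ~2.92

r_manual = cov_xy / (sx * sy)               # standardised covariance
r_scipy, p_value = stats.pearsonr(x, y)

assert np.isclose(r_manual, r_scipy)
print(round(r_manual, 2))  # 0.87
```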
The standardised version of covariance is the
correlation coefficient, or Pearson’s r
Pearson’s R is … version of covariance meaning independent of units of measurement
standardised
What does correlation describe? - (2)
- Describes a relationship between variables
- Tells us, if one variable increases, what happens to the other variable
Pearson’s correlation coefficient r was also called the
product-moment correlation
A linear relationship, normally distributed data, and interval/ratio continuous data are assumed in
Pearson’s r correlation coefficient
Pearson Correlation Coefficient varies between
-1 and +1 (direction of relationship)
The larger the Pearson correlation coefficient value (r), the closer the values will
be to each other and to the mean
Smaller Pearson correlation coefficient values (r) indicate
there is unexplained variance in the data, which results in the data points being more spread out.
What do these two graphs show? - (2)
- The graph on the left shows an example of a high negative correlation. The data points are close together and are close to the mean.
- On the other hand, the graph on the right shows a low positive correlation. The data points are more spread out and deviate more from the mean.
The Pearson correlation coefficient measures the strength of a relationship
between one variable and another, hence its use in calculating effect size
A Pearson’s correlation coefficient of +1 indicates
two variablesare perfectly positively correlated, so as one variable increases, the other increases by a
proportionate amount.
A Pearson’s correlation coefficient of -1 indicates
a perfect negative relationship: if one variable increases, the other decreases by a proportionate amount.
Pearson’s r
+/- 0.1 means
small effect
Pearson’s r
+/- 0.3 means
medium effect
Pearson’s r
+/- 0.5 means
large effect
In Pearson’s correlation, we can test the hypothesis that - (2)
correlation coefficient is different from zero
(i.e., different from ‘no relationship’)
In Pearson’s correlation, we can test the hypothesis that the correlation is different from zero (i.e. different from ‘no relationship’).
If we find that our observed coefficient was very unlikely to happen if there was no effect in the population, then we gain confidence that the relationship
we have observed is statistically meaningful.
There are 2 ways to test this hypothesis
- Z scores
- T-statistic
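A hedged sketch of the t-statistic route (the z-score route uses Fisher's transformation instead). The formula t = r√(n − 2)/√(1 − r²) with df = n − 2 is the standard test of r against zero; the r and n values below are made up:

```python
import numpy as np
from scipy import stats

def r_to_t(r, n):
    """t-statistic for H0: population correlation = 0 (df = n - 2)."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p-value
    return t, p

# Hypothetical: the same r = .30 with 20 vs. 200 participants
print(r_to_t(0.30, 20))   # t ~ 1.33, p ~ .20 (not significant)
print(r_to_t(0.30, 200))  # t ~ 4.43, p < .001 (significant)
```

This also illustrates the later point that, as sample size increases, smaller values of r become significant.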
SPSS, for Pearson’s correlation coefficient r, does not compute
confidence intervals for r
Confidence intervals tell us something about the
likely value of the correlation in the population
Can calculate confidence intervals for Pearson’s correlation coefficient by transforming r into the z metric (Fisher’s transformation), applying the usual CI formula, and transforming back
Example of calculating CIs for Pearson’s correlation coefficient, r
If we have z_r = 1.33 and SE(z_r) = 0.71 - (4)
- LB = 1.33 − (1.96 × 0.71) = −0.062
- UB = 1.33 + (1.96 × 0.71) = 2.72
- We have to convert the LB and UB values, which are in the z metric, back to the r correlation coefficient using the formula in the diagram
- This gives a UB of 0.991 and an LB of −0.062 (since the value is so close to 0, the transformation from z to r has almost no impact)
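The same example in Python. Converting back from the z metric to r is r = tanh(z) (the inverse of Fisher's transformation); the 1.96 multiplier is the usual 95% value:

```python
import numpy as np

# Values from the flashcard example
z_r, se = 1.33, 0.71

lb_z = z_r - 1.96 * se   # -0.062
ub_z = z_r + 1.96 * se   #  2.72

# Back-transform from the z metric to the r metric
lb_r, ub_r = np.tanh(lb_z), np.tanh(ub_z)
print(round(lb_r, 3), round(ub_r, 3))  # -0.062 0.991
```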
As sample size increases, the value of r at which a significant result occurs
decreases; e.g., an r that is not p < 0.05 with n = 20 can be p < 0.05 with 200 participants
Example of a negative relationship
The link between the age at which you die and the number of cigarettes you have smoked
Pearson’s r = 0 means - (2)
indicates no linear relationship at all
so if one variable changes, the other stays the same.
Correlation coefficients give no indication of the direction of… + example - (2)
- causality
- e.g., although we might conclude that the number of adverts watched increases the number of toffees bought, we can’t say that watching the adverts caused us to buy the toffees
We have to be cautious about causality in terms of Pearson’s correlation r because - (2)
- Third-variable problem: causality between variables cannot be assumed from any correlation
- Direction of causality: correlation coefficients tell us nothing about which variable causes the other to change.
If you get a weak correlation between 2 variables (a weak effect), you need to take a lot of measurements for that relationship to be
significant
The correlation coefficient r gives the ratio of the
covariance to a measure of the separate variances (the product of the two standard deviations)
Example of correlations getting stronger
R squared is known as the
coefficient of determination
R^2 can be used to explain the
proportion of the variance in a dependent variable (outcome) that is explained by an independent variable (predictor)
Example of R^2, the coefficient of determination - (2)
X = exam anxiety
Y = exam performance
If R^2 = 0.194,
19.4% of the variability in exam performance can be explained by exam anxiety:
‘the variance in y accounted for by x’
R^2 calculates the amount of shared
variance
Example of r and R^2
If r = 0.1, then R^2 = 0.1 × 0.1 = 0.01 (1% shared variance)
R^2 gives you the true strength of
the correlation, but without an indication of its direction.
What are the three types of correlations? - (3)
- Bivariate correlations
- Partial correlations
- Semi-partial (part) correlations
What’s a bivariate correlation?
A correlation between 2 variables
What is a partial correlation?
looks at the relationship between two variables while ‘controlling’ the effect of one or more additional variables.
The partial correlation partials out
the effect of one or more variables on both X and Y
A partial correlation controls for a third variable, which works as follows - (3)
- A correlation calculates each data point’s distance from the line (the residuals)
- This is the error relative to the model (unexplained variance)
- A third variable might predict some of that variation in the residuals
The partial correlation compares the unique variation of one variable with the
unique variation of the other (the third variable is partialled out of both)
The partial correlation holds the
third variable constant (but we don’t manipulate these)
Example of partial correlation- (2)
For example, when studying the effect of a diet, the level of exercise might also influence weight loss
We want to know the unique effect of diet, so we need to partial out the effect of exercise
Example of Venn Diagram of Partial Correlation - (2)
Partial Correlation between IV1 and DV = D / D+C
Unique variance accounted for by the predictor (IV1) in the DV, after accounting for variance shared with other variables.
Example of Partial Correlation - (2)
Partial correlation: Purple / Red + Purple
If we were doing just a partial correlation, we would see how much exam anxiety is influencing both exam performance and revision time.
Example of partial correlation and semi-partial correlation - (2)
- The partial correlation that we calculated took account not only of the effect of revision on exam performance, but also of the effect of revision on anxiety.
- If we were to calculate the semi-partial correlation for the same data, then this would control for only the effect of revision on exam performance (the effect of revision on exam anxiety is ignored).
In partial correlation, the third variable is typically not considered as the primary independent or dependent variable. Instead, it functions as a
control variable—a variable whose influence is statistically removed or controlled for when examining the relationship between the two primary variables (IV and DV).
The partial correlation is
the amount of variance the variable explains,
relative to the amount of variance in the outcome that is left to explain after the contribution of other predictors has been removed from both the predictor and the outcome.
Partial correlations can be done when variables are dichotomous (including the third variable), e.g., - (2)
- we could look at the relationship between bladder relaxation (did the person wet themselves or not?) and the number of large tarantulas crawling up a person’s leg, controlling for fear of spiders
- (the first variable is dichotomous, but the second variable and the ‘controlled for’ variable are continuous).
What does this partial correlation output show?
Revision time = the controlled-for variable
Exam performance = DV
Exam anxiety = X - (5)
- First, notice that the partial correlation between exam performance and exam anxiety is −.247, which is considerably less than the correlation when the effect of revision time is not controlled for (r = −.441)
- Although this correlation is still statistically significant (its p-value is still below .05), the relationship is diminished
- The value of R^2 for the partial correlation is .06, which means that exam anxiety can now account for only 6% of the variance in exam performance
- When the effects of revision time were not controlled for, exam anxiety shared 19.4% of the variation in exam scores, so the inclusion of revision time has severely diminished the amount of variation in exam scores shared by anxiety
- As such, a truer measure of the role of exam anxiety has been obtained
Partial correlations are most useful for looking at the unique
relationship between two variables when
other variables are ruled out
In a semi-partial correlation we control for the
effect that
the third variable has on only one of the variables in the correlation
The semi-partial (part) correlation partials out the - (2)
- effect of one or more variables on only one of X or Y (not both)
- e.g., the amount revision explains exam performance after the contribution of anxiety has been removed from one variable only (usually the predictor, e.g. revision)
The semi-partial correlation compares the
unique variation of one variable with the unfiltered variation of the other.
Diagram of venn diagram of semi-partial correlation - (2)
- Semi-Partial Correlation between IV1 and DV = D / D+C+F+G
Unique variance accounted for by the predictor (IV1) in the DV, after accounting for variance shared with other variables.
Diagram of exam anxiety, exam performance and revision time for semi-partial correlation - (2)
- purple / (red + purple + white + orange)
- When we use semi-partial correlation to look at this relationship, we partial out the variance accounted for by exam anxiety (the orange bit) and look for the variance explained by revision time (the purple bit).
Summary of partial correlation and semi-partial correlation - (2)
A partial correlation quantifies the relationship between two variables while accounting for the effects of a third variable on both variables in the original correlation.
A semi-partial correlation quantifies the relationship between two variables while accounting for the effects of a third variable on only one of the variables in the original correlation.
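A hedged, residual-based sketch of both ideas with simulated data (variable names echo the exam example, but the numbers here are random; a real analysis would use a library such as pingouin, this just shows the logic):

```python
import numpy as np
from scipy import stats

def residuals(v, covariate):
    """What is left of v after regressing it on the covariate."""
    slope, intercept = np.polyfit(covariate, v, 1)
    return v - (slope * covariate + intercept)

rng = np.random.default_rng(42)
revision = rng.normal(size=100)                                # third variable
anxiety = -revision + rng.normal(size=100)                     # predictor
performance = revision - 0.5 * anxiety + rng.normal(size=100)  # outcome

# Partial: the third variable is removed from BOTH variables
partial_r, _ = stats.pearsonr(residuals(anxiety, revision),
                              residuals(performance, revision))

# Semi-partial (part): the third variable is removed from one variable only
semipartial_r, _ = stats.pearsonr(residuals(anxiety, revision), performance)

print(round(partial_r, 3), round(semipartial_r, 3))
```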
Pearson’s product-moment correlation coefficient (described earlier) and Spearman’s rho (see section 6.5.3) are examples
of bivariate correlation coefficients.
Non-parametric tests of correlation are… - (2)
- Spearman’s rho
- Kendall’s tau
In Spearman’s rho, the variables are not normally distributed and the measures are on an
ordinal scale (e.g., grades)
If your data are non-normal and not measured at the interval level, then
deselect the Pearson tick box (in SPSS)
Spearman’s rho works by
first ranking the data (the numbers are converted into ranks), and then running Pearson’s r on the ranked data
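This rank-then-correlate logic can be checked directly in Python (made-up monotonic data; scipy's spearmanr and pearsonr-on-ranks agree):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 2  # monotonic but not linear

# Spearman's rho is Pearson's r computed on the ranks of the data
rho, _ = stats.spearmanr(x, y)
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

assert np.isclose(rho, r_on_ranks)
print(rho)  # 1.0 here: the relationship is perfectly monotonic
```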
Spearman’s correlation coefficient, r_s, is a non-parametric statistic and so can be used when the
data have violated parametric assumptions, such as non-normally distributed data
The Spearman correlation coefficient is sometimes called
Spearman’s rho
For Spearman’s r_s we can also get R squared, but it is interpreted slightly differently: it is the
proportion of
variance in the ranks that the two variables share.
Kendall’s tau is used rather than Spearman’s coefficient when - (2)
- you have a small data set with a large number of tied ranks
- this means that if you rank all of the scores and many scores have the same rank, then Kendall’s tau should be used
Kendall’s tau test - (2)
For small datasets, many tied ranks
Better estimate of correlation in population than Spearman’s ρ
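A small made-up illustration of tau on data with many tied ranks (scipy's kendalltau handles ties; tau typically comes out smaller than rho on the same data):

```python
from scipy import stats

# Small data set with many tied ranks
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1, 2, 1, 3, 3, 4, 4, 4]

tau, p = stats.kendalltau(x, y)
rho, _ = stats.spearmanr(x, y)

print(round(tau, 3), round(rho, 3))  # tau is typically smaller in magnitude than rho
```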
Kendall’s tau is not numerically similar to r or r_s (Spearman’s), and so tau squared does not tell us about the
proportion of
variance shared by two variables (or by the ranks of those two variables).
Kendall’s tau is 66-75% smaller than both Spearman’s r_s and Pearson’s r, so
tau is not comparable to r and r_s
There is a benefit to using Kendall’s statistic rather than Spearman’s - (2)
- Kendall’s statistic is actually a better estimate of the correlation in the population
- we can draw more accurate generalisations from Kendall’s statistic than from Spearman’s.
What’s the decision tree for Spearman’s correlation? - (4)
- What type of measurement? Continuous
- How many predictor variables? One
- What type of continuous variable? Continuous
- Meets assumptions of parametric tests? No
The output of Kendall and Spearman can be interpreted the same way as
Pearson’s correlation coefficient r output box
The biserial and point-biserial correlation coefficients are used when
one of the two variables is dichotomous (e.g., a dichotomous variable is whether or not a woman is pregnant)
What is the difference between biserial and point-biserial correlations?
It depends on whether the dichotomous variable is a discrete or continuous dichotomy
The point–biserial correlation coefficient (rpb) is used when
one variable is a
discrete dichotomy (e.g. pregnancy),
The biserial correlation coefficient (r_b) is used
when - (2)
- one variable is a continuous dichotomy (e.g. passing or failing an exam)
- e.g., some people will only just fail while others will fail by a large margin; likewise some people will scrape a pass while others will clearly excel.
The biserial correlation coefficient cannot be calculated directly in SPSS - (2)
- you must calculate the point-biserial correlation coefficient
- and then use an equation to adjust that figure
Example of when the point-biserial correlation is used - (3)
- Imagine we are interested in the relationship between the gender of a cat and how much time it spends away from home
- Time spent away from home is measured at the interval level, so it meets the assumptions of parametric data
- Gender is a discrete dichotomous variable, coded 0 for male and 1 for female
What does this point-biserial correlation output from SPSS show? - (4)
- The point-biserial correlation coefficient is r = 0.378, with a p-value of 0.001
- The sign of the correlation coefficient depends on which category you assign to which code, so ignore the direction of the relationship
- R^2 = (0.378)^2 = 0.143
- We conclude that 14.3% of the variability in time spent away from home is explained by gender
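A hedged Python sketch of the same kind of analysis with simulated cat data (the 0 = male, 1 = female coding follows the flashcard; the numbers are random, not the SPSS example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

sex = rng.integers(0, 2, size=60)                        # 0 = male, 1 = female
time_away = 20 + 5 * sex + rng.normal(scale=6, size=60)  # hours away from home

# Point-biserial r: Pearson's r with a truly dichotomous variable
r_pb, p = stats.pointbiserialr(sex, time_away)

print(round(r_pb, 3), round(p, 3), round(r_pb ** 2, 3))  # r, p, R^2 (shared variance)
```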
Can convert the point-biserial correlation coefficient into the
biserial correlation coefficient
Point-biserial and biserial correlations differ in size:
the biserial correlation is bigger than the point-biserial
Example question for conducting Pearson’s r - (4)
- The researcher was interested in whether the amount someone gets paid and the number of holidays they take from work are related to their productivity at work
- Pay: annual salary
- Holiday: number of holiday days taken
- Productivity: productivity rating out of 10
Example of Pearson’s r scatterplot :
relationship between pay and productivity
If we have r = 0.313 what effect size is it?
medium effect size
±.1 = small effect
±.3 = medium effect
±.5 = large effect
What does this scatterplot show?
This indicates very little correlation between the 2 variables
What will a matrix scatterplot show?
the relationship between all possible combinations of your variables
What does this scatterplot matrix show? - (2)
- For pay and holiday, we can see the line is very flat, which indicates the correlation between the two variables is quite low
- For pay and productivity, the line is steeper, suggesting the correlation between these 2 variables is fairly substantial; the same goes for holidays and productivity
What is degrees of freedom for correlational analysis?
N-2
What does this Pearson’s correlation r output show? - (4)
- The correlation between pay and holidays is very low: r = −0.04
- Between pay and productivity, there is a medium-sized correlation of r = 0.313
- Between holidays and productivity there is a medium-to-large effect size of 0.435
- The relationships between pay and productivity and between holidays and productivity are significant, but the correlation between pay and holidays is not
Another example of a Pearson’s correlation r question - (3)
A student was interested in the relationship between the time spent preparing an essay, the interestingness of the essay topic and the essay mark received.
He got 45 of his friends and asked them to rate, using a scale from 1 to 7, how interesting they thought the essay topic was (1 - I’ll kill myself of boredom, 4 - it’s not too bad!, 7 - it’s the most interesting thing in the world!) (interesting).
He then timed how long they spent writing the essay (hours), and got their percentage score on the essay (essay).
Example of the interval/ratio continuous data needed for Pearson’s r for the IV and DV - (2)
- Interval scale: the difference between 10°C and 20°C is the same as the difference between 80°C and 90°C, but 0°C does not mean an absence of temperature
- Ratio scale: e.g., height, weight and time, where 0 means an absence of the quantity (0 cm means no height)
Pearson’s correlation r, Spearman and Kendall require
one IV and one DV
Spearman and Kendall are typically used on ordinal or ranked data - (3)
- values are ordered and ranked, but the distances between them are not uniform
- e.g., a Likert scale from strongly disagree to strongly agree
- education levels like elementary school, high school
- rankings like 1st place to 10th place
What does this SPSS output show?
A. There was a non-significant positive correlation between interestingness of topic and the amount of time spent writing. There was a non-significant positive correlation between time spent writing an essay and essay mark. There was a significant positive correlation between interestingness of topic and essay mark, with a medium effect size.
B. There was a significant positive correlation between interestingness of topic and the amount of time spent writing, with a small effect size. There was a significant positive correlation between time spent writing an essay and essay mark, with a large effect size. There was a non-significant positive correlation between interestingness of topic and essay mark.
C. There was a significant negative correlation between interestingness of topic and the amount of time spent writing, with a medium effect size. There was a non-significant positive correlation between time spent writing an essay and essay mark. There was a non-significant positive correlation between interestingness of topic and essay mark.
D. There was a significant positive correlation between interestingness of topic and the amount of time spent writing, with a large effect size. There was a non-significant positive correlation between time spent writing an essay and essay mark. There was a non-significant positive correlation between interestingness of topic and essay mark.
Answer: D
r = 0.21 effect size is..
in between small and medium effect
Effect size is only meaningful if you evaluate it with regard to
your own research area
Biserial correlation is when
one variable is dichotomous, but there is an underlying continuum (e.g. pass/fail on an exam)
Point-biserial correlation is when
one variable is dichotomous, and it is a true dichotomy (e.g. pregnancy)
Example of a dichotomous relationship - (3)
- Comparing heights of males and females is an example of a true dichotomous relationship.
- We can compare the differences in height between males and females.
- Use the dichotomous predictor of gender