Correlations Flashcards
Pearson’s correlation
- Assesses how well an association between two variables can be described using a linear relationship.
- It tells you both the direction and the strength of the relationship between the two variables.
- Specifically, Pearson’s correlation is used to determine whether an increase in one variable corresponds to an increase in the other variable.
- It will further tell you how strongly increases in one variable correspond to increases in the other variable.
- Such an association is known as a positive relationship.
- Pearson’s correlation can also assess whether increases in one variable correspond to decreases in the other variable.
- The second type of association is known as a negative or an inverse relationship.
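As a quick illustration, here is a minimal Python sketch using SciPy's pearsonr; the two variables (hours_studied, exam_score) and their values are invented for this example:

```python
from scipy.stats import pearsonr

hours_studied = [2, 4, 5, 7, 8, 10, 11, 13]       # hypothetical X
exam_score = [52, 55, 61, 64, 70, 72, 79, 83]     # hypothetical Y

r, p = pearsonr(hours_studied, exam_score)
print(f"r = {r:.2f}, p = {p:.3f}")  # positive r: scores rise together with hours
```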
When to use Pearson’s correlation
- Use Pearson’s correlation when you have two variables that are scored at the interval or ratio level of measurement (continuous variables).
- You can also use Pearson’s correlation when you have one variable that has been scored at the interval or ratio level but your second variable is a dichotomous variable. This latter type is also known as a point-biserial correlation.
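A minimal sketch of the point-biserial case, using SciPy's pointbiserialr (mathematically equivalent to running pearsonr on the same data); the group codes and scores below are invented:

```python
from scipy.stats import pointbiserialr

group = [0, 0, 0, 0, 1, 1, 1, 1]                   # dichotomous variable
score = [3.1, 2.8, 3.5, 3.0, 4.2, 4.6, 4.1, 4.8]   # interval/ratio variable

r_pb, p = pointbiserialr(group, score)
print(f"r_pb = {r_pb:.2f}, p = {p:.3f}")
```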
When to use the Spearman rank-order correlation coefficient
- If one of your two variables has been measured using an ordinal scale with more than two levels, then you need to use the Spearman rank-order correlation coefficient.
- Spearman’s is also used when the underlying assumptions of Pearson’s cannot be met; it is a nonparametric alternative.
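A minimal sketch using SciPy's spearmanr; the ordinal ratings and the second variable below are invented:

```python
from scipy.stats import spearmanr

rating = [1, 2, 2, 3, 4, 4, 5, 5]          # ordinal scale, more than two levels
income = [21, 25, 24, 30, 38, 35, 44, 50]  # second variable

rho, p = spearmanr(rating, income)
print(f"rho = {rho:.2f}, p = {p:.3f}")
```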
Reporting Pearson’s correlation coefficient
- The symbol, r (note it is lower case! We reserve uppercase R for regression), is used to represent a Pearson’s correlation coefficient.
- Pearson’s correlation coefficients range from +1.00 to –1.00.
- A coefficient of +1.00 represents a perfect positive linear relationship whilst a coefficient of –1.00 represents a perfect negative linear relationship.
- An r of 0.00 represents an absence of a linear relationship.
- The main output you need for reporting Pearson’s correlation coefficients is the size and direction of r, and the p value.
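As a sketch, a small helper for APA-style formatting (APA drops the leading zero for statistics that cannot exceed ±1, such as r and p); the values of r, p, and n here are invented:

```python
def apa(value, decimals=2):
    """Format a statistic bounded by +/-1 without the leading zero (APA style)."""
    return f"{value:.{decimals}f}".replace("0.", ".", 1)

r, p, n = 0.42, 0.013, 30   # hypothetical results
print(f"r({n - 2}) = {apa(r)}, p = {apa(p, 3)}")  # -> r(28) = .42, p = .013
```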
For hypothesis testing, Pearson’s correlation coefficients need to be assessed for statistical significance. What are Cohen’s cut-off values?
- Correlation coefficients with magnitudes less than 0.30, irrespective of their statistical significance, are not usually considered very informative or meaningful.
- Such correlations indicate that less than 10% of the variance is shared between the two variables. In other words, correlations that fall between –0.30 and +0.30 indicate a weak relationship between two variables.
- Note that this correlation of .30 corresponds to the lower bound for what Cohen (1988; 1992) considered a moderate effect (.10–.30 = small effect, .30–.50 = moderate effect, and >.50 = large effect).
- Pearson’s correlation coefficients are assessed for significance using a t-test.
Significance Test of r
- The t-test tells you whether the association between two variables is reliably different from r = 0 or whether it is due to chance factors and/or sampling error.
- However, with correlations the APA convention is not to report the associated t-test.
- A z-score approach (Fisher’s r-to-z transformation) can also be used; both tests are sketched below.
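Both tests written out from the standard formulas, as a sketch; r and n are invented example values:

```python
import numpy as np
from scipy.stats import t as t_dist, norm

r, n = 0.45, 30   # hypothetical correlation and sample size

# t-test: t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2
df = n - 2
t_stat = r * np.sqrt(df) / np.sqrt(1 - r**2)
p_t = 2 * t_dist.sf(abs(t_stat), df)

# z approach: Fisher's r-to-z transformation, with SE = 1 / sqrt(n - 3)
z_stat = np.arctanh(r) * np.sqrt(n - 3)
p_z = 2 * norm.sf(abs(z_stat))

print(f"t({df}) = {t_stat:.2f}, p = {p_t:.3f}")
print(f"z = {z_stat:.2f}, p = {p_z:.3f}")
```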
Coefficient of Determination
- The strength of the relationship between two variables is assessed by the coefficient of determination, often given the symbol r².
- The coefficient of determination is simply the squared value of the correlation coefficient.
- It tells you the amount of variability in one variable that is shared by the other, and can be expressed as a percentage of the total variability.
- For example, if a correlation coefficient between two variables is 0.50, then 0.25 or 25 percent of the variance in one variable is associated with the variance in the other variable. Another way of saying this is that the two variables share 25 percent of the same information.
- The coefficient of determination, r², that we encountered for bivariate correlation is also examined in linear regression, although we use capitals for r in regression (e.g., R and R² instead of r and r²) in order to distinguish between correlation and regression.
- In linear regression, R² is used to determine the proportion of variance in your dependent variable that is predicted from your independent variable(s). If R² is 0.25, then 25 percent of the variability in the dependent variable, Y, can be predicted from its association with X, the independent variable. The remaining, unpredicted portion, 1 – R², is defined as error variability.
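The worked example above as a short snippet:

```python
r = 0.50
r_squared = r ** 2                                  # coefficient of determination
print(f"shared variance: {r_squared:.0%}")          # -> shared variance: 25%
print(f"error variability: {1 - r_squared:.0%}")    # -> error variability: 75%
```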
Bivariate regression
- Use knowledge of one variable to predict scores on a second variable
- The variable you use to make the prediction is known as the independent variable or the predictor. The variable you are interested in predicting is known as the dependent variable or the criterion.
- It is not intended to imply a causal relationship.
- The stronger the correlation, the more reliable the prediction.
Bivariate Regression - Y intercept:
The Y-intercept, a, also referred to as the “constant”, indicates the value for a dependent variable when the value for an independent variable is equal to zero. It is particularly useful for describing a linear relationship as the Y-intercept marks the point at which the line crosses the vertical axis of the graph. The Y-intercept is also evaluated for statistical significance using the t-test. The t-test assesses whether the Y-intercept is significantly different from zero.
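A minimal bivariate-regression sketch using SciPy's linregress with invented data; linregress reports the slope's p value directly, so the intercept's t-test is computed by hand from its standard error (intercept_stderr, available in recent SciPy versions):

```python
import numpy as np
from scipy.stats import linregress, t as t_dist

x = np.array([2, 4, 5, 7, 8, 10, 11, 13])       # predictor (X)
y = np.array([52, 55, 61, 64, 70, 72, 79, 83])  # criterion (Y)

res = linregress(x, y)
print(f"Y' = {res.intercept:.2f} + {res.slope:.2f}X,  R^2 = {res.rvalue**2:.2f}")

# t-test of the Y-intercept against zero, df = n - 2
df = len(x) - 2
t_int = res.intercept / res.intercept_stderr
p_int = 2 * t_dist.sf(abs(t_int), df)
print(f"intercept: t({df}) = {t_int:.2f}, p = {p_int:.4f}")
```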
Factors that affect correlations
- These include the range restriction of either variable X or Y
- the use of heterogeneous samples
- the distribution of scores on a dichotomous measure and
- the presence of outliers.
Range Restriction
- A range restriction in the sample for any measure usually produces poor correlations between that measure and others. However, it can instead produce an inflated correlation, or even a correlation in the opposite direction to what would be found if you sampled across the whole range of possible values.
- The possibility of range restriction is indicated by a small standard deviation relative to the possible range of values for the measure.
- In effect, when you have a range restriction you are only examining a part of the relationship between two variables, and this can be misleading. Unless the measure is essential to the analysis, it is better deleted.
The following website provides an interactive demonstration of range restriction, and is worth having a look at to solidify this concept in your mind:
(https://emilkirkegaard.shinyapps.io/Understanding_restriction_of_range/)
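For a quick offline illustration, here is a small simulation sketch (the built-in correlation, the cut-off, and the seed are invented):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 0.6 * x + rng.normal(scale=0.8, size=2000)   # build in a real X-Y relationship

r_full, _ = pearsonr(x, y)
keep = x > 1.0                                   # restrict X to its upper range
r_restricted, _ = pearsonr(x[keep], y[keep])
print(f"full range: r = {r_full:.2f}; restricted: r = {r_restricted:.2f}")
```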
The use of heterogeneous samples
- The use of heterogeneous samples can also distort the nature of the relationship between two variables.
- If you have two distinct groups in your study but you examine the relationship between variables across the combined sample, this can distort the findings.
- It may lead to higher or lower correlations than if you had examined the two groups separately.
- When you suspect that the X-Y relationship may differ across known groups in your dataset (e.g., gender), run the correlation separately for each group to check.
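A simulation sketch of the distortion, with invented groups: each group shows the same positive X-Y relationship, but a mean offset on Y between the groups deflates the pooled correlation:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
x1 = rng.normal(size=200); y1 = x1 + rng.normal(size=200)
x2 = rng.normal(size=200); y2 = x2 + rng.normal(size=200) + 4  # group 2: Y offset

r1, _ = pearsonr(x1, y1)
r2, _ = pearsonr(x2, y2)
r_pooled, _ = pearsonr(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
print(f"group 1: r = {r1:.2f}; group 2: r = {r2:.2f}; pooled: r = {r_pooled:.2f}")
```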
Unequal split of responses on a dichotomous measure
Another condition that can produce a poor correlation is a very unequal split of responses on a dichotomous measure (e.g., 90% v. 10%). The correlation between that measure and others will be deflated. Once again, the best solution is to delete the measure from further analysis.
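A simulation sketch of the deflation (group sizes, effect size, and seed are invented): the same underlying group difference produces a smaller point-biserial correlation under a 10/90 split than under a 50/50 split:

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(3)

def simulate_r(n_ones, n_zeros, d=0.8):
    """Point-biserial r for a fixed group difference d under a given split."""
    g = np.r_[np.ones(n_ones), np.zeros(n_zeros)]
    y = rng.normal(size=n_ones + n_zeros) + d * g
    return pointbiserialr(g, y)[0]

print(f"50/50 split: r = {simulate_r(500, 500):.2f}")
print(f"10/90 split: r = {simulate_r(100, 900):.2f}")
```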
The presence of outliers
- The presence of outliers for any measure will also affect correlations between that measure and others.
- Outliers are scores with extreme values, more than three standard deviations beyond the mean.
- A strong correlation can be heavily reduced by the presence of one or more outliers.
- Check that outlier values have been correctly entered and read (i.e., are not missing value indicators).
- If correct, either delete the case from further analysis, run the model with and without the outlier(s) (to see the extent to which these outliers distort the results and conclusions you draw from these results), or change the value so that it is less extreme.
- Common methods of adjustment are either to select a replacement value that is one unit more extreme than the value for the most extreme non-outlier, or to replace the value with the raw-score equivalent of a Z-score of 3 (i.e., three standard deviations beyond the mean).
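A sketch of an off-trend outlier collapsing a strong correlation, together with the second adjustment method mentioned above (replacing the value with the raw-score equivalent of a Z-score of 3); all data are invented:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10.0])
y = np.array([2, 3, 3, 5, 5, 6, 8, 8, 9, -30.0])   # last case is an outlier

print(f"with outlier:     r = {pearsonr(x, y)[0]:.2f}")
print(f"outlier deleted:  r = {pearsonr(x[:-1], y[:-1])[0]:.2f}")

# Adjust instead of delete: pull the value in to mean - 3 SD (a simplification:
# here the mean and SD are computed from the remaining, non-outlier scores)
floor = y[:-1].mean() - 3 * y[:-1].std(ddof=1)
y_adj = np.maximum(y, floor)
print(f"outlier adjusted: r = {pearsonr(x, y_adj)[0]:.2f}")
```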
Assumptions of the General Linear Model
- Normality
- Linearity
- Data Transformations