Correlations Flashcards

1
Q

Pearson’s correlation

A
  • Assesses how well an association between two variables can be described using a linear relationship.
  • It tells you the direction of the relationship and it also assesses the strength of the relationship between two variables.
  • Specifically, Pearson’s correlation is used to determine whether an increase in one variable corresponds to an increase in the other variable. Such an association is known as a positive relationship.
  • The size of the coefficient further tells you how closely increases in one variable track increases in the other.
  • Pearson’s correlation can also assess whether increases in one variable correspond to decreases in the other variable. This second type of association is known as a negative or inverse relationship.
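As a minimal sketch of the idea, Pearson’s r can be computed as the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and the data (study hours and test scores) are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's r from deviations around each variable's mean."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations (numerator)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable (denominator)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]             # hypothetical study hours
score_up = [52, 55, 61, 64, 70]     # rises with hours: positive relationship
score_down = [70, 64, 61, 55, 52]   # falls with hours: negative relationship

r_pos = pearson_r(hours, score_up)
r_neg = pearson_r(hours, score_down)
print(round(r_pos, 3), round(r_neg, 3))
```

The two coefficients have the same magnitude and opposite signs, which reflects that the sign carries the direction and the magnitude carries the strength.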
2
Q

When to use Pearson’s correlation

A
  • Use Pearson’s correlation when you have two variables that are scored at the interval or ratio level of measurement (continuous variables).
  • You can also use Pearson’s correlation when you have one variable that has been scored at the interval or ratio level but your second variable is a dichotomous variable. This latter type is also known as a point-biserial correlation.
3
Q

Spearman Rank Order Correlation Coefficient, when to use:

A
  • If one of your two variables has been measured using an ordinal scale with more than two levels, then you need to use the Spearman rank order correlation coefficient.
  • Spearman is also used when the underlying assumptions of Pearson’s cannot be met. It is a nonparametric alternative.
4
Q

Pearson’s correlation coefficient Reporting

A
  • The symbol, r (note it is lower case! We reserve uppercase R for regression), is used to represent a Pearson’s correlation coefficient.
  • Pearson’s correlation coefficients range from +1.00 to –1.00.
  • A coefficient of +1.00 represents a perfect positive linear relationship whilst a coefficient of –1.00 represents a perfect negative linear relationship.
  • An r of 0.00 represents an absence of a linear relationship.
  • The main output you need for reporting Pearson’s correlation coefficients is the size and direction of r, and the p value.
5
Q

For hypothesis testing, Pearson’s correlation coefficients need to be assessed for statistical significance. What are Cohen’s cut-off values?

A
  • Correlation coefficients with magnitudes less than 0.30, irrespective of their statistical significance, are not usually considered very informative or meaningful.
  • Such correlations indicate that less than 10% of the variance is shared between two variables (0.30² = 0.09). In other words, correlations that range between –0.30 and 0.30 indicate that there is a poor relationship between two variables.
  • Note that this correlation of .3 corresponds with the lower bound for what Cohen (1988; 1992) considered a moderate effect (.10-.30 = small effect, .30-.50 = moderate effect, and >.50 = large effect).
  • Pearson’s correlation coefficients are assessed for significance using a t-test.
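The bands quoted above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and how to treat values that fall exactly on a boundary (here .30 counts as moderate and .50 as moderate, following the wording ">.50 = large") is an assumption rather than something the card specifies.

```python
def cohen_label(r):
    """Label the magnitude of r using the bands quoted from Cohen (1988; 1992)."""
    size = abs(r)          # direction is ignored; only magnitude matters
    if size < 0.10:
        return "below small"
    if size < 0.30:
        return "small"
    if size <= 0.50:       # boundary handling is an assumption
        return "moderate"
    return "large"

for r in (0.05, -0.20, 0.30, 0.62):
    # r squared is the proportion of shared variance
    print(r, cohen_label(r), round(r ** 2, 3))
```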
6
Q

Significance Test of r

A
  • The t-test tells you whether the association between two variables is reliably different from r = 0 or whether it is due to chance factors and/or sampling error.
  • However, with correlations the APA convention is not to report the associated t-test.
  • A z-score approach can also be used.
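A sketch of both approaches follows. The t statistic uses the standard formula with n – 2 degrees of freedom. The card does not say which z-score approach it means, so Fisher’s r-to-z transformation is used here as an assumption; it relies on a normal approximation. The values of r and n are hypothetical.

```python
import math
from statistics import NormalDist

def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def fisher_z_p_value(r, n):
    """Two-tailed p for testing r against zero via Fisher's r-to-z transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)   # atanh(r) is Fisher's z; its SE is 1/sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

r, n = 0.45, 30        # hypothetical sample correlation and sample size
t = t_for_r(r, n)
p = fisher_z_p_value(r, n)
print(round(t, 2), round(p, 3))
```

The t value would be compared against a t distribution with n – 2 degrees of freedom; the standard library has no t distribution, so only the normal-approximation p value is computed here.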
7
Q

Coefficient of Determination

A
  • The strength of the relationship between two variables is assessed by the coefficient of determination, often given the symbol r².
  • The coefficient of determination is simply the squared value of the correlation coefficient.
  • It tells you the amount of variability in one variable that is shared by the other, and can be expressed as a percentage of the total variability.
  • For example, if a correlation coefficient between two variables is 0.50, then 0.25 or 25 percent of the variance in one variable is associated with the variance in the other variable. Another way of saying this is that the two variables share 25 percent of the same information.
  • The coefficient of determination, r², that we encountered for bivariate correlation is also examined in linear regression, although we use capitals in regression (i.e., R and R² instead of r and r²) in order to distinguish between correlations and regression.
  • In linear regression, R² is used to determine the proportion of variance in your dependent variable that is predicted from your independent variable(s). If R² is 0.25, then 25 percent of the variability in the dependent variable, Y, can be predicted from its association with X, the independent variable. The remaining, unpredicted portion, 1 – R², is defined as error variability.
8
Q

Bivariate regression

A
  • Use knowledge of one variable to predict scores on a second variable
  • The variable you use to make the prediction is known as the independent variable or the predictor. The variable you are interested in predicting is known as the dependent variable or the criterion.
  • It is not intended to imply a causal relationship
  • The stronger the correlation, the more reliable the prediction
9
Q

Bivariate Regression - Y intercept:

A

The Y-intercept, a, also referred to as the “constant”, indicates the value for a dependent variable when the value for an independent variable is equal to zero. It is particularly useful for describing a linear relationship as the Y-intercept marks the point at which the line crosses the vertical axis of the graph. The Y-intercept is also evaluated for statistical significance using the t-test. The t-test assesses whether the Y-intercept is significantly different from zero.
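As a minimal sketch, the slope and Y-intercept of a least-squares line can be computed directly, which makes it easy to check that the intercept is simply the predicted value when X equals zero. The function names and data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope b and Y-intercept a for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x    # the line passes through the point of means
    return b, a

def predict(x_value, b, a):
    """Predicted score on the dependent variable for a given X."""
    return b * x_value + a

x = [0, 1, 2, 3, 4]   # hypothetical independent variable (predictor)
y = [2, 4, 5, 4, 5]   # hypothetical dependent variable (criterion)
b, a = fit_line(x, y)
print(round(b, 2), round(a, 2), predict(0, b, a) == a)
```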

10
Q

Factors that affect correlations

A
  • Range restriction of either variable X or Y
  • The use of heterogeneous samples
  • The distribution of scores on a dichotomous measure
  • The presence of outliers
11
Q

Range Restriction

A
  • A range restriction in the sample for any measure usually produces poor correlations between that measure and others, although it can instead produce an inflated correlation or even produce a correlation in the opposite direction to what would be found if you sampled across the whole range of possible values.
  • The possibility of range restriction is indicated by a small standard deviation relative to the possible range of values for the measure.
  • In effect when you have a range restriction you are only examining a part of the relationship between two variables and this can be misleading. Unless the measure is essential to the analysis, it is better deleted.

The following website provides an interactive demonstration of range restriction, and is worth having a look at to solidify this concept in your mind:

(https://emilkirkegaard.shinyapps.io/Understanding_restriction_of_range/)

12
Q

The use of heterogeneous samples

A
  • The use of heterogeneous samples can also distort the nature of the relationship between two variables.
  • If you have two distinct groups in your study but examine the relationship between variables for the whole sample, the pooled correlation can be misleading.
  • It may be higher or lower than the correlations you would find if you examined the two groups separately.
  • When you suspect that the X-Y relationship may differ across known groups in your dataset (e.g., gender), run the correlation separately for each group to check.
13
Q

Unequal split of responses on a dichotomous measure

A

Another condition that can produce a poor correlation is a very unequal split of responses on a dichotomous measure (e.g., 90% v. 10%). The correlation between that measure and others will be deflated. Once again, the best solution is to delete the measure from further analysis.

14
Q

The presence of outliers

A
  • The presence of outliers for any measure will also affect correlations between that measure and others.
  • Outliers are scores with extreme values of more than three standard deviations beyond the mean.
  • A strong correlation can be heavily reduced by the presence of one or more outliers.
  • Check that outlier values have been correctly entered and read (i.e., are not missing value indicators).
  • If correct, either delete the case from further analysis, run the model with and without the outlier(s) (to see the extent to which these outliers distort the results and conclusions you draw from these results), or change the value so that it is less extreme.
  • Common methods of adjustment are either to select a replacement value that is one unit more extreme than the value for the most extreme non-outlier or to replace the value with the raw score equivalent of a Z-score three standard deviations beyond the mean.
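The screening and the second adjustment method can be sketched as follows. The data are hypothetical, the sample standard deviation is used (an assumption; the card does not say which SD to use), and the variable names are illustrative only.

```python
import statistics

scores = [10, 11, 9, 10, 12, 8] * 3 + [10, 40]   # hypothetical data; 40 is the suspect score
mean = statistics.mean(scores)
sd = statistics.stdev(scores)                    # sample SD (assumption)

def z_score(value):
    """Distance of a score from the mean in SD units."""
    return (value - mean) / sd

# Flag scores more than three SDs from the mean
outliers = [s for s in scores if abs(z_score(s)) > 3]

# Replace flagged scores with the raw-score equivalent of z = +3 or z = -3
upper_cap = mean + 3 * sd
lower_cap = mean - 3 * sd
adjusted = [min(max(s, lower_cap), upper_cap) for s in scores]

print(outliers, round(upper_cap, 2), round(max(adjusted), 2))
```

Note that the outlier itself inflates the mean and SD used for screening, so in small samples an extreme score may not reach the three-SD threshold at all.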
15
Q

Assumptions of the General Linear Model

A
  • Normality
  • Linearity
  • Data Transformations
16
Q

Normality:

A

Data for a given variable may deviate from a nice, normal distribution (bell curve shape) as a result of skewness, kurtosis, or both.

Tabachnick and Fidell (2007, see reproduced figure below) show frequency distributions of variables which are respectively normal, skewed, and have kurtosis.

Note that a negatively skewed distribution has the tail toward the left hand side (usually denoting the smaller values on that scale).

Screening the data to ensure that the variables are normally distributed can be accomplished in a variety of different ways.

The most common approaches are:

(1) a significance test comparing our data’s distribution against a normal distribution (this is the Kolmogorov-Smirnov test),
(2) evaluating components of normality (skewness and kurtosis) separately, aided by Z tests, and
(3) visual inspection of histograms.
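Approach (2) can be sketched as below. Skewness and excess kurtosis are computed from the moments of the data and divided by the commonly quoted large-sample standard errors, √(6/N) and √(24/N). Those standard errors are approximations and an assumption here, since the card does not give the formulas; statistical packages often use slightly different small-sample versions. The function name and data are hypothetical.

```python
import math

def normality_z_scores(values):
    """Return (z_skew, z_kurt) using large-sample standard errors."""
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3      # 0 for a normal distribution
    z_skew = skewness / math.sqrt(6 / n)
    z_kurt = excess_kurtosis / math.sqrt(24 / n)
    return z_skew, z_kurt

symmetric = [1, 2, 3, 4, 5, 6, 7]           # hypothetical symmetric scores
skewed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 10]    # hypothetical scores with a long right tail

print([round(z, 2) for z in normality_z_scores(symmetric)])
print([round(z, 2) for z in normality_z_scores(skewed)])
```

A positive z for skewness indicates a tail toward the larger values; a negative z indicates a tail toward the smaller values.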

17
Q

Linearity:

A

The solutions for correlation and multivariate analyses are based upon straight line relationships between the variables.

The solution will not capture any part of a relationship between variables that is not linear.

Screening the data for linearity requires an examination of scatterplots between all pairs of variables.

Nonlinear relationships are evident by departure from an oval distribution of data points on the plot.

Tabachnick and Fidell (2007, p. 78, see reproduced figure below) show U-shaped and J-shaped scatterplots that respectively demonstrate curvilinear and part-curvilinear relationships between variables.

Before considering any action for nonlinearity, first ensure that data are normally distributed by appropriate transformation because non-normality in one variable produces nonlinearity in its relationships with other variables.

Strong nonlinearity after transformation may be handled by converting the offending measure into a dichotomous variable.

Values are allocated to one of two categories by means of a median split.

The rationale for this procedure is that a dichotomous variable can only have a linear relationship with other variables.

Note also that if the non-linear relationship is legitimate (as is likely the case below), transformation of data (to make it fit a linear relationship) may not be the best way to go.

Instead, find an analytic approach that can handle the shape of the relationship.

18
Q

Homogeneity of variance in arrays:

A

Another assumption underlying correlation and regression when used for hypothesis testing is the homogeneity of variance in arrays, also known as the assumption of homoscedasticity. This is the assumption that “as you go through levels of one variable, the variance of the other should not change” (Field, 2018). This means that the variability of scores for Y is roughly the same for all values of X. This assumption is typically assessed using scatterplots. An example where this assumption is met is given in Field (2018, p. 238 or 2013, p. 175). Transformations can also be used to correct for any violations of the homogeneity of variance in arrays.

19
Q

Residual Variance

A

Residual variance is error in prediction within the context of regression. Each residual is the difference between the observed score on a DV and the score on the DV as predicted by the regression equation (Ŷ = bX + a); residual variance summarises the spread of these differences.

20
Q

Linear relation as an equation

A

Ŷ = bX + a [alternatively expressed as: yᵢ = b₀ + b₁Xᵢ]

Where Ŷ is the predicted value of the dependent variable, b is the slope, a is the Y-intercept of the line, and X refers to each subject’s value on the independent variable. In the alternative notation, b₀ is the intercept (a) and b₁ is the slope (b).