Chapter 8: Correlation Flashcards
covariance
- simple measure of association
- we want to see if two variables are related/associated (do they covary)
- if two variables are related, we should expect deviations on one variable to be met with deviations on another
- positive covariance means two variables have a positive relationship. a negative covariance means the two variables have a negative relationship. a covariance of 0 indicates no linear relationship
- the covariance between 2 variables is heavily influenced by the units of measurement and is not easily interpretable
standardization of the covariance
Pearson correlation coefficient (r)
- pearson’s r is the covariance standardized
- it can be obtained by dividing the covariance by the product of the two SDs
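A minimal sketch of the two cards above, using made-up height and weight data (the values and variable names are assumptions for illustration only). It shows that the covariance changes when the units change, while r, the covariance divided by the product of the two SDs, does not.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance: summed cross-products of deviations over n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def sd(values):
    """Sample standard deviation (square root of a variable's covariance with itself)."""
    return math.sqrt(covariance(values, values))

def pearson_r(x, y):
    """Covariance standardized by the product of the two SDs."""
    return covariance(x, y) / (sd(x) * sd(y))

# Hypothetical data: the same heights expressed in meters and in centimeters.
height_m = [1.60, 1.65, 1.70, 1.75, 1.80]
weight_kg = [55.0, 62.0, 60.0, 72.0, 75.0]
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)    # covariance with height in meters
cov_cm = covariance(height_cm, weight_kg)  # 100 times larger: depends on units
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)     # same as r_m: r is unit-free

print(cov_m, cov_cm, r_m, r_cm)
```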
Pearson’s r
- a measure of linear association between 2 variables
- ranges from -1.00 to +1.00
- .1 is small, .3 is medium, and .5 is large (guidelines)
r squared
- shared variance between two variables
- just square r
- interpretation: 25% of the variance in the outcome can be accounted for by the variance in the predictor
- can inform judgments about practical and scientific significance
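A small illustration of "just square r", applied to the small/medium/large guideline values. The function name `shared_variance` is a hypothetical helper, not a library call.

```python
def shared_variance(r):
    """Proportion of variance shared by two variables: r squared."""
    return r ** 2

# Guideline values for small, medium, and large correlations.
for r in (0.1, 0.3, 0.5):
    pct = 100 * shared_variance(r)
    print(f"r = {r:.1f} -> r squared = {shared_variance(r):.2f} ({pct:.0f}% shared variance)")
```

With r = .5 this gives the 25% figure used in the interpretation above.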
curvilinear relationship
an observed curvilinear relationship may be due to a ceiling or floor effect, so consider this possibility
- ceiling effect: independent variable no longer has an effect on the dependent variable because scores are already at the top of the scale
factors that influence the observed r
- sampling error
- measurement error
- range restriction (direct, indirect, self-selection)
sampling error
statistic - parameter
- occurs because we have samples, not the whole population
- r could be lower or higher than rho
- correlation in the sample is actually a biased estimate of rho
- affected by sample size
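A rough simulation sketch of sampling error, assuming a bivariate normal population with rho = .5 (the population, seed, sample sizes, and helper names are all assumptions chosen for illustration). It draws many samples and compares how much r varies from sample to sample at N = 10 versus N = 100.

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sample_r(rho, n, rng):
    """Draw one sample of size n from a population with correlation rho; return r."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    return pearson_r(x, y)

def spread_of_r(rho, n, reps, rng):
    """Standard deviation of r across repeated samples of size n."""
    rs = [sample_r(rho, n, rng) for _ in range(reps)]
    m = sum(rs) / reps
    return math.sqrt(sum((r - m) ** 2 for r in rs) / (reps - 1))

rng = random.Random(42)  # fixed seed so the run is repeatable
spread_small = spread_of_r(0.5, 10, 500, rng)
spread_large = spread_of_r(0.5, 100, 500, rng)
print(spread_small, spread_large)  # r bounces around rho far more when N is small
```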
measurement error
observed value - true value
- decreases the observed correlation, r
- possible to correct for if certain assumptions are met
- shorter tests have more measurement error
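One common way to "correct for" measurement error is the classical correction for attenuation, which divides the observed r by the square root of the product of the two reliabilities. The sketch below assumes that formula; the reliability values and the function name `disattenuate` are hypothetical.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y).

    rel_x and rel_y are the reliabilities of the two measures (0 < rel <= 1).
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: observed r = .30, reliabilities of .80 and .70.
observed_r = 0.30
corrected_r = disattenuate(observed_r, rel_x=0.80, rel_y=0.70)
print(round(corrected_r, 2))  # larger than the observed r
```

With perfectly reliable measures (both reliabilities equal to 1) the correction leaves r unchanged, which matches the idea that measurement error can only lower the observed correlation.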
range restriction
- occurs when you have reduced variability in your sample, often as a result of using cutoff scores
- full range of values of a variable not present in the sample
- decreases the observed correlation, r. it underestimates the utility of using that selection instrument
- three types: direct, indirect, self selection
range restriction types
- direct: occurs when applicants are selected on X (variable of interest)
- indirect: occurs when applicants are selected on a third variable, Z, that is correlated with X (e.g., ACT/SAT)
- self-selection: occurs when people selectively do not apply for positions they believe they are not qualified for (e.g., Harvard only takes high SAT scores, so people w/ low SAT scores aren’t going to apply; this leaves only the people in the upper range and reduces variability)
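A simulation sketch of direct range restriction, assuming a bivariate normal applicant pool with rho = .6 and a cutoff of 1 SD above the mean on X (the population, seed, and cutoff are assumptions for illustration). Only cases above the cutoff are "selected", and r is recomputed in that restricted group.

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable
rho, n = 0.6, 5000

# Full applicant pool: X is the selection instrument, Y is the outcome.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
r_full = pearson_r(x, y)

# Direct range restriction: keep only applicants above the cutoff on X.
cutoff = 1.0
selected = [(xi, yi) for xi, yi in zip(x, y) if xi > cutoff]
r_restricted = pearson_r([p[0] for p in selected], [p[1] for p in selected])

print(len(selected), r_full, r_restricted)  # r is noticeably lower among those selected
```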
units of analysis
individual vs. group
- associations at group and individual levels can differ because the processes that drive the association at each level are different
- if you assume an association at one unit of analysis is going to hold across another unit of analysis, this is a fallacy
- atomistic fallacy: concluding that an association at individual level must also exist at the group level
- ecological fallacy: concluding that an association at the group level must also exist at the individual level
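A toy example of why an association at one unit of analysis need not hold at another. The three groups and their scores are invented so that the group means line up perfectly (r = +1) while the relationship inside every group runs the opposite way (r = -1).

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (x, y) scores for individuals in three groups.
groups = {
    "A": [(1, 3), (2, 2), (3, 1)],
    "B": [(4, 6), (5, 5), (6, 4)],
    "C": [(7, 9), (8, 8), (9, 7)],
}

# Group level: correlate the group means.
mean_x = [sum(p[0] for p in pts) / len(pts) for pts in groups.values()]
mean_y = [sum(p[1] for p in pts) / len(pts) for pts in groups.values()]
r_group = pearson_r(mean_x, mean_y)

# Individual level, within each group.
r_within = {
    name: pearson_r([p[0] for p in pts], [p[1] for p in pts])
    for name, pts in groups.items()
}

print(r_group)   # positive at the group level
print(r_within)  # negative inside every group
```

Inferring the within-group relationship from `r_group` here would be an ecological fallacy; inferring the group-level relationship from `r_within` would be an atomistic fallacy.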
alternative measures of association
- Spearman’s rho: non parametric statistic used w/ skewed data and many outliers. used to minimize the effects of extreme scores and violations of assumptions
- Kendall’s tau: non parametric statistic used to minimize the effects of extreme scores and violations of assumptions. used when you have a small data set and a large number of tied ranks
- biserial correlation: used when one continuous variable is artificially dichotomized (makes r smaller). corrects for artificial dichotomy and estimates the correlation had the variable been measured continuously. needs at least 100 observations. a lot of info is lost
- tetrachoric correlation: used when both variables are artificially dichotomized. needs at least 400 observations. estimates what r would be if variables had been properly measured on an interval or ratio scale
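A sketch of how Spearman’s rho limits the effect of an extreme score: it is Pearson’s r computed on ranks (with tied values given their average rank). The data are invented, with one large outlier in y; the helper names are hypothetical.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: perfectly monotonic, but the last y is an extreme score.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 4, 5, 6, 100]

r = pearson_r(x, y)       # pulled well below 1 by the outlier
rho_s = spearman_rho(x, y)  # 1: the ranks agree perfectly
print(r, rho_s)
```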
why does correlation not equal causation?
to determine that X causes Y, three conditions must be met:
- X precedes Y in time (temporal precedence)
- there is an association between X and Y
- alternative explanations for the association between X and Y are ruled out
spurious correlations
- if there is no causal relationship between X and Y, but X and Y correlate, the correlation is said to be spurious
- often caused by a third variable (a variable that causes both X and Y)
- mismatch between correlations and causal relations is possible. correlation can be positive when the real relationship is negative (can happen when looking between units)
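A simulation sketch of a spurious correlation produced by a third variable. By construction (an assumption of this toy setup), Z causes both X and Y, and X and Y have no causal effect on each other, yet they still correlate.

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 2000

z = [rng.gauss(0, 1) for _ in range(n)]  # third variable
x = [zi + rng.gauss(0, 1) for zi in z]   # X depends on Z plus its own noise
y = [zi + rng.gauss(0, 1) for zi in z]   # Y depends on Z plus its own noise

r_xy = pearson_r(x, y)
print(r_xy)  # clearly positive even though X never enters the equation for Y
```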
inferences about rho and CIs
- the higher rho is, the more negatively skewed the sampling distribution of r becomes
- the higher rho is, the more the estimates (r) are underestimates of rho
- for anything not rho = 0, the correlation coeff tends to be biased (underestimates)
- the higher rho is, the greater the bias is
- as N increases, the more precise the estimates become
- r is a consistent estimator: with higher N, you get more and more precise estimates of the population value. this is okay because it can be overcome by collecting bigger sample size
- skewness for all values of rho except 0 causes issues for making inferences about rho (difficult to make because the sampling distribution of r is not normal)
- fisher r to z transformation: used to transform observed r into a z and place limits around the z. this extends out the tail to make it normal. these limits are transformed back into correlation coefficients to give CIs around r
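A minimal sketch of the Fisher r-to-z steps described above, assuming the usual large-sample standard error of 1 / sqrt(N - 3) for z and a critical value of 1.96 for a 95% interval. The function name `fisher_ci` and the example values (r = .50, N = 50) are hypothetical.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for rho via the Fisher r-to-z transformation.

    Steps: transform r to z, place limits around z, transform the limits back to r.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)            # r-to-z transformation
    se = 1 / math.sqrt(n - 3)    # assumed standard error of z
    z_low, z_high = z - z_crit * se, z + z_crit * se
    return math.tanh(z_low), math.tanh(z_high)  # back to the r metric

lower, upper = fisher_ci(0.50, 50)
print(round(lower, 2), round(upper, 2))
```

The limits are symmetric around z but not around r: the lower limit sits farther from .50 than the upper limit, which reflects the negative skew of r when rho is positive.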
r is a biased and consistent estimator of rho