Relationship between Variables: Correlation and Regression Flashcards
We are interested in finding a way to represent association between scores.
association
The Regression Line
The first and most obvious way to summarize data when we are examining the relationship between two variables.
Scatterplot
The Regression Line
We put one variable on the x-axis and another on the y-axis, and we draw a point for each person showing their scores on the _____
two variables.
The Regression Line
When we want to tell people about our results, we don’t have to draw a lot of _____
scatterplots
Children were asked to listen to a word and repeat it. They were then asked which of these 3 words started with the same sound.
X
Initial phoneme detection
reading score, a standard measure of reading ability.
Y
British Ability Scale (BAS)
We usually summarize and represent the relationship between two variables with a number
correlation coefficient
We also calculate the ______ for this number, and we want to be able to find out if the relationship is statistically significant.
Thus, we want to know the _______ of finding a relationship at least this strong if the null hypothesis (that there is no relationship in the population) is true.
Confidence Intervals
probability
a best fitting line used for prediction.
Line of best fit or Regression Line
Predicting the_____ in Y as a function of the ______ in X.
variation
How steep the line is.
slope
the position or height of the line.
intercept
By ____ we give the height at the point where the line hits the y-axis.
convention
The height is called the ____ or often just the _____ (or sometimes the constant).
y-intercept or intercept
The intercept represents the expected score of a person who scored zero on the ______
x-axis variable.
It is often the case that the intercept doesn’t make any sense. After all, no one usually scores ____
0 or close to 0.
We can use the two values of______ to calculate the expected value of any person’s score on Y, given their score on X
slope and intercept
formula for Expected Y score
Expected Y score = intercept + slope × (score on X)
Where x is the x-axis variable. This equation is called the ______
regression equation.
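As a minimal sketch of the regression equation, the snippet below computes an expected Y score from an intercept, a slope, and a score on X. The function name and the numeric values are hypothetical, chosen only for illustration; they are not results from the reading study.

```python
def expected_y(intercept, slope, x_score):
    """Expected Y score = intercept + slope * (score on X)."""
    return intercept + slope * x_score


# Hypothetical intercept and slope, not taken from any real data set
intercept = 2.0
slope = 1.5

# Expected Y for a person who scored 4 on X
prediction = expected_y(intercept, slope, 4)
print(prediction)
```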
Making Sense of Regression Lines
thinking about the relationship between______ can be very useful.
two variables
Making Sense of Regression Lines
We can make a ____ about one score from another score.
prediction
Problem: if we don’t understand the scale(s), regression lines and equations are _____
meaningless
When there is a relationship between two variables, we can _____ one from the other.
We cannot say that one _____ the other.
predict
explains
The correlation coefficient
We need some way of making the scales have some sort of meaning, and the way to do this is to convert the data into _____
standard deviation units.
Thus we could ask: “If the score on ___ is one SD higher, how many SDs higher would we expect the ____score to be?”
x
y
Talking in terms of SDs means that we are talking about _____
standardized scores
Because we are talking about standardized regression slopes, we call it______
standardized slope.
Correlation coefficient – a more important name for the ______
standardized slope.
Where σx is the SD of the variable on the x-axis (the horizontal one) of the scatterplot, σy is the SD of the variable on the y-axis (the vertical one), and r is the correlation.
The letter r actually stands for ______, but most people ignore that because it is confusing.
regression
If we know the slope, we can calculate the correlation using the formula:
r = β × σx / σy
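The conversion from slope to correlation can be checked numerically. This is a rough sketch on made-up data: the helper names (`mean`, `sd`, `slope`) and the x and y values are assumptions for illustration. It computes the least-squares slope, rescales it by σx / σy, and compares the result with the correlation computed directly.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation (divides by N - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b = slope(x, y)
r_from_slope = b * sd(x) / sd(y)

# Direct calculation of r for comparison
mx, my = mean(x), mean(y)
r_direct = sum((a - mx) * (c - my) for a, c in zip(x, y)) / math.sqrt(
    sum((a - mx) ** 2 for a in x) * sum((c - my) ** 2 for c in y)
)
print(b, r_from_slope, r_direct)
```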
Residual
In correlation, we want to know how well the ______line fits the data
That is, how far away the points are from the line.
regression
The closer the points are to the ____ the stronger the relationship between the two variables. (how do we measure this?)
line
When we had one variable and we wanted to know the spread of the points around the mean, we calculated the____
SD (σ)
The square of the SD is the ____
variance
We can do the same thing with our regression data, but instead of making d the difference between the mean and the score, we can make it the difference between the value that we would expect the person to have, given their score on the x-variable, and the score they actually got. We can calculate their predicted scores, using:
y = b0 + b1x
For each person, we can therefore calculate their predicted BAS reading score, and the difference between their predicted score and their actual score. The difference is called the _____
Residual.
The difference between the score they got and the score we thought they would get based on their initial phoneme detection score.
residual score
If we want to calculate the equivalent of the variance, we need to ____ each person’s residual score.
square
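The residual calculation described in these cards can be sketched as follows. The data and the `fit_line` helper are hypothetical; the sketch fits y = b0 + b1x by least squares, computes each person's predicted score and residual, and then squares the residuals.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]

# Residual = actual score minus predicted score
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Squaring removes the sign, as when calculating a variance
squared_residuals = [res ** 2 for res in residuals]
sum_squared_residuals = sum(squared_residuals)
print(residuals, sum_squared_residuals)
```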
The value of the standardized slope and the value of the square root of the proportion of variance explained will ___ be the same value.
always
We therefore have two equivalent ways of thinking about correlation.
The first way is the _____. It is the expected increase in one variable, when the other variable increases by 1 SD.
standardized slope
We therefore have two equivalent ways of thinking about correlation.
The second way is the ______. If you square a correlation, you get the ______ in one variable that is explained by the other variable
proportion of variance
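A small numerical check of the second interpretation, assuming made-up data and hypothetical helper names: the squared correlation is compared with 1 minus the ratio of residual variation to total variation in y.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def proportion_explained(x, y):
    """1 - (sum of squared residuals) / (total sum of squares of y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    b0 = my - b1 * mx
    ss_residual = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - my) ** 2 for b in y)
    return 1 - ss_residual / ss_total


# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
explained = proportion_explained(x, y)
print(r ** 2, explained)
```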
Interpreting Correlations
A correlation is both ____ and ____
descriptive and inferential statistics
We can find the probability estimate and we can also use it to describe the ____
strength of the relationship.
strength of relationship
magnitude
positive, negative, curvilinear etc.
direction
* r = 0.1 = small correlation
* r = 0.3 = medium correlation
* r = 0.5 = large correlation
Note that these only really apply in what ____ called the social and behavioral sciences.
Cohen’s effect size
Common mistake in interpreting correlations
A correlation around 0.5 is a _____
- A correlation does not have to ____ 0.5 to be large.
- If you have a correlation of r = 0.45, you have a correlation which is approximately ___ to a large correlation.
- It’s not a ______ correlation just because it hasn’t quite reached 0.5
- large correlation
- exceed
- equal
- medium
calculating the correlation coefficient
Also known as Pearson Product moment correlation
Pearson Correlation Coefficient
The Pearson correlation coefficient was developed by ____
Karl Pearson
_____ correlation and makes the same assumptions made by other _____ tests.
Parametric
The Pearson correlation coefficient is used with _______ data.
continuous and normally distributed
The moment is the length from the fulcrum multiplied by the weight on the lever.
physics
The total moment is equal to the length from the center, multiplied by the weight.
seesaw analogy
The same principle applies with _____
correlation
We find the length from the center for each of the variables. In this case the center is the _____
mean
So, we calculate the difference between ______ and _____ for each variable (these are the moments) and then we multiply them together (this is the product).
the score and the mean
Because this value is ____ on the number of people, we need to divide it by N.
dependent
And because it is related to the _____, we actually divide by N-1
This is called _____, and if we call the two variables x and y
standard deviation
covariance
Finally, finding _____ is laborious, and we do not want to do it more than we have to.
square roots
So instead of finding the square roots and then multiplying them together, it is easier to ______ together, and then find the square root.
multiply the two values
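The product-moment calculation in these cards can be sketched on made-up data. The helper names are assumptions for illustration. The covariance divides the summed products of deviations by N-1, and the correlation divides the covariance by the two SDs, either by taking two square roots or by multiplying the variances first and taking one square root.

```python
import math


def covariance(x, y):
    """Sum of products of deviations from the mean, divided by N - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    products = [(a - mean_x) * (b - mean_y) for a, b in zip(x, y)]
    return sum(products) / (n - 1)


def variance(values):
    # The variance is the covariance of a variable with itself
    return covariance(values, values)


# Made-up scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)

# Two square roots, then multiply
r_two_roots = cov_xy / (math.sqrt(variance(x)) * math.sqrt(variance(y)))

# Multiply the two variances, then one square root
r_one_root = cov_xy / math.sqrt(variance(x) * variance(y))
print(cov_xy, r_two_roots, r_one_root)
```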
importance of scattergraph or plot
It will show us approximately what the correlation should be. So if it looks like a strong ______, and our analysis shows it is -0.60, we have made a mistake.
positive correlation
importance of scattergraph or plot
It will help us detect any____ in our data, for example data entry errors.
errors
importance of scattergraph or plot
- It will help us get a feel of our ____
data
The_____ for a statistic tell us the likely range of a value in the population
confidence intervals
calculating confidence intervals
The sampling distribution of a correlation is _____
tricky.
calculating confidence interval
It is not symmetrical, which means we can’t _____ or _____ CIs in the usual way.
add and subtract
calculating the pearson correlation
transformation used which makes the distribution symmetrical.
- Fisher’s z transformation
calculating the pearson correlation
Used to calculate the CIs and then transform back to _____
correlations
It is called a_____ because it makes the distribution of the correlation into a z distribution which is a normal distribution with a mean of 0 and SD of 1.
z transformation
step _
Carry out Fisher’s transformation.
step 1
step _
calculate the Standard Error
step 2
step __
And now the CIs. We use the formula:
* CI = z’ ± zα/2 × se
step 3
Where zα/2 is the value for the ______ which includes the percentage of values that we want to cover.
normal distribution
the value for the 95% confidence is (as always) ____
1.96
Step _
Convert back to correlation.
step 4
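The four steps can be sketched in Python. The cards do not state the standard error formula; the sketch assumes the usual one for Fisher's z, se = 1 / √(N − 3). The function name and the example values of r and N are hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for a correlation via Fisher's z transformation."""
    # Step 1: Fisher's transformation, z' = 0.5 * ln((1 + r) / (1 - r))
    z_prime = math.atanh(r)
    # Step 2: standard error (assumed standard formula, not given in the cards)
    se = 1 / math.sqrt(n - 3)
    # Step 3: CI on the z scale
    lower_z = z_prime - z_crit * se
    upper_z = z_prime + z_crit * se
    # Step 4: convert back to correlations
    return math.tanh(lower_z), math.tanh(upper_z)


# Hypothetical example: r = 0.5 from a sample of 50 people
lower, upper = fisher_ci(0.5, 50)
print(lower, upper)
```

Note that the resulting interval is not symmetrical around r, which is why the limits cannot be found by simply adding and subtracting on the correlation scale.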
If we really want to know the_____ then we can convert the value for r into a value for t.
p-value
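The card does not give the conversion itself. A minimal sketch, assuming the standard formula t = r × √(N − 2) / √(1 − r²) with N − 2 degrees of freedom, is shown below; the function name and example values are hypothetical. The p-value would then be looked up in a t distribution, which is not done here.

```python
import math


def r_to_t(r, n):
    """Convert a correlation r from n pairs into a t value with n - 2 df."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df


# Hypothetical example: r = 0.5 from a sample of 27 people
t_value, df = r_to_t(0.5, 27)
print(t_value, df)
```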
When we know the correlation we can also calculate the position of the _______
regression line.
We can use the two values _______ to create a regression equation which will allow us to predict y _____ from x ______.
- slope and intercept
- (display behavior)
- (desirability)
If variables are both dichotomous (for example, yes/no, top, bottom) we can use the ____
Pearson correlation formula.
If one of your variables is continuous and the other is dichotomous we can use the ___
Point Biserial Formula
This is when one variable is categorical and has just two all-inclusive values.
* Examples: Male/Female, Car owner/Non-Car owner, and so on.
Point Biserial Correlation
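As a rough illustration, the point biserial correlation gives the same value as the Pearson formula applied to the dichotomous variable coded 0/1. The cards do not give the point biserial formula; the sketch assumes the common form (M1 − M0) / s × √(n1 × n0 / n²), where s is the SD computed with N in the denominator. The data and function names are made up.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def point_biserial(groups, scores):
    """Point biserial correlation; groups must be coded 0 or 1."""
    n = len(scores)
    ones = [s for g, s in zip(groups, scores) if g == 1]
    zeros = [s for g, s in zip(groups, scores) if g == 0]
    mean_all = sum(scores) / n
    # SD with N (not N - 1) in the denominator, as this form requires
    sd_n = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd_n * math.sqrt(len(ones) * len(zeros) / n ** 2)


# Made-up data: 0 = non-car owner, 1 = car owner
groups = [0, 0, 0, 1, 1, 1]
scores = [1, 2, 3, 4, 5, 6]

r_pb = point_biserial(groups, scores)
r_pearson = pearson_r(groups, scores)
print(r_pb, r_pearson)
```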
Non-Parametric Correlations
* Used when the data do not satisfy the assumptions of the Pearson Correlation because they are not normally distributed or are only ordinal in nature.
Spearman Correlation, Kendall Correlation
Three ways to deal with this problem:
- Ignore it. It does not make a lot of difference.
- Use the Pearson Formula on the ranks (although the calculation is harder than the Spearman formula).
- Use a correction.
If we use a non-parametric test, such as a _____ we tend to lose power.
Spearman correlation
Although we could be strict and say that rating data are strictly measured at an ordinal level, in reality when there isn’t a problem with the distributions, we would always prefer to use a ____
Pearson Correlation
A_____ gives a better chance of a significant result.
Pearson correlation
A curious thing about the _____ is how to interpret it.
Spearman
We can’t say that it is the_____, that is the relative difference in the SDs, because the SDs don’t really exist as there is not necessarily any relationship between the score and the SD.
standardized slope
We also can’t say that it is the ____ explained, because the variance is a parametric term, and we are using ranks.
proportion of variance
All we can really say about the Spearman is that it is the Pearson correlation between the ___
ranks
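A minimal sketch of this idea: rank each variable, then apply the Pearson formula to the ranks. The helper names and data are hypothetical, and tied values are given the average of the ranks they span, which is one common convention.

```python
import math


def ranks(values):
    """Rank from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    # Spearman correlation as the Pearson correlation between the ranks
    return pearson_r(ranks(x), ranks(y))


# Made-up data: y always increases with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(spearman(x, y), pearson_r(x, y))
```

In this example the ranks agree perfectly, so the Spearman value is 1, while the Pearson value on the raw scores is below 1 because the relationship is not linear.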
An alternative nonparametric correlation that does have a more sensible interpretation (advantage: meaningful interpretation).
Very rarely used, however.
Kendall’s Tau-a (τ – Greek Letter)
Kendall’s Tau-a is rarely used for two reasons:
* Difficult to calculate if you do not have a computer.
* It is usually lower than a Spearman correlation for the same data (though the p-values are usually very similar). Because people like their correlations to be high, they tend to use it less.
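With a computer the calculation is short. The cards do not give the formula; the sketch below assumes the usual definition of tau-a, the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs, n(n − 1) / 2. The function name and data are hypothetical.

```python
def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / (n * (n - 1) / 2)."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1  # the pair is ordered the same way on x and y
            elif direction < 0:
                discordant += 1  # the pair is ordered in opposite ways
            # tied pairs count as neither in tau-a
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up data with one swapped pair
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

tau = kendall_tau_a(x, y)
print(tau)
```

For these four made-up points the Spearman correlation works out to 0.8, so the tau-a value is the lower of the two.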
The fact that two variables correlate does not mean there is a causal relationship between them.
* Though it is often very tempting to believe that there is.
Correlation and Causality
- Correlation does not mean ____, but ____ does mean correlation
causality
In general, if one variable is a purely category-type measure, then correlation cannot be carried out, unless the variable is _____
dichotomous
Correlation is also a measure of association between_____
two variables
What we can do with nominal/categorical data is reduce the measured variable to nominal level and conduct a _____ on the resulting frequency table.
chi-square test
A lack of relationship is signified by a value close to ___
zero
A value of zero however could occur for a ____
curvilinear relationship.
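A small made-up example of this: y is completely determined by x through y = x², yet the Pearson correlation is zero because the relationship is curved and symmetrical rather than linear. The helper name and data are illustrative only.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up U-shaped data: y = x squared
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r_curved = pearson_r(x, y)
print(r_curved)
```

This is one reason to look at the scatterplot before interpreting a correlation near zero.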
Strength is a measure of the____
correlation.