[L8] Relationship between Variables Correlation and Regression Flashcards
We are interested in finding a way to represent ___
between scores.
association
Types of Correlation
Bivariate; Multivariate Correlation
Correlation does not prove __
Causality
Multivariate correlations have more ____ validity
Ecological
IGT & RMT = test of difference
Correlation = test of ___
correlation/association
___ – first and most obvious way to summarize
data where we are examining the relationship between
two variables
Scatterplot
We put one variable on the x-axis and another on the y-axis,
and we ___ for each person showing their
scores on the two variables.
draw a point
A test of correlation involves administering ___ tests to the same group of participants
2 or more different
When we want to tell people about our results, we ___
have to draw a lot of scatterplots.
don’t
___ – Children were asked to listen
to a word and repeat it. They were then asked which of
these 3 words started with the same sound.
Initial phoneme detection.
____ reading score, a standard
measure of reading ability.
British Ability Scale (BAS)
We usually summarize and represent the relationship
between two variables with a ___
number (correlation coefficient).
We also calculate the ____ for this
number, and we want to be able to find out if the
relationship is ___
Confidence Intervals; statistically significant
Thus, we want to know what is the probability of finding
a relationship at least this strong if the ____ that
there is no relationship in the population is true.
null hypothesis
___ – a best-fitting line
used for prediction
Line of best fit or Regression Line
Predicting the variation in Y as a ___
function of the variation in X.
___ – how steep the line is
Slope
___ – the position or height of the line.
Intercept
By convention we give the height at the point where the
line ___
hits the y-axis.
The height is called the ___ or often just the intercept
y-intercept (or sometimes the constant)
The intercept represents the ___ of a person
who scored ___ on the x-axis variable.
expected score ; zero
y = b0 + b1x
Regression expression, predicting the behavior of y as a function of x;
useful for raw scores
It is often the case that the intercept ___. After all, no one usually scores ___
doesn’t make any sense; 0 or close to 0.
We can use the ___ of slope and ___ to
calculate the expected value of any person’s score on Y,
given their score on X.
two values; intercept
y = β0 + β1x (sometimes it is y = a + bx or y = mx + c)
Where x is the x-axis variable. This equation is called the
___
regression equation.
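A minimal sketch of how the regression equation can be used for prediction. The scores, the function names `fit_line` and `predict`, and the choice of ordinary least squares to find the line are assumptions made for illustration; they are not taken from the notes.

```python
from statistics import mean

# Hypothetical scores, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def fit_line(x, y):
    """Return (b0, b1): least-squares intercept and slope for y = b0 + b1*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx        # slope: how steep the line is
    b0 = my - b1 * mx     # intercept: expected y when x is zero
    return b0, b1

def predict(b0, b1, x_new):
    """Expected score on y for a person scoring x_new on x."""
    return b0 + b1 * x_new

b0, b1 = fit_line(x, y)
print(round(b0, 3), round(b1, 3), round(predict(b0, b1, 3.0), 3))
```

With these made-up scores the slope is 0.6 and the intercept is 2.2, so a person scoring 3 on x has an expected score of 4 on y.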
We can make a ___ about one score from
another score
prediction
Problem: if we don’t understand the ___, regression
lines and equations are ___.
scale(s), meaningless
thinking about the relationship between two variables can
be very useful
Making Sense of Regression Lines
When there is a relationship between two variables, we
can ___ one from the other.
predict
We cannot say that one ___ the other.
explains
We need some way of making the scales have some sort
of meaning, and the way to do this is to
___ the data into ___
convert; standard deviation units.
Talking in terms of SDs means that we are talking about
___
standardized scores.
Because we are talking about standardized regression
slopes, we call it the ___
standardized slope.
___ – a more important name for the
standardized slope.
Correlation coefficient
In order to convert the units, we need to know the ___ of
each of the measures.
SD
If we know the ___, we can calculate the correlation
using the formula: r = β × σx / σy
slope
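A short sketch of the formula r = β × σx / σy, assuming invented scores and a least-squares slope. The variable names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical scores, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mx, my = mean(x), mean(y)

# Raw (unstandardized) least-squares slope of y on x.
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))

# Convert the slope into SD units: r = slope * SD(x) / SD(y).
r = slope * stdev(x) / stdev(y)
print(round(r, 4))
```

The slope is in raw units (0.6 here); multiplying by the ratio of the SDs rescales it to roughly 0.77, the correlation.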
The letter r actually stands for ___, but most people
ignore that because it is confusing
regression
Thus, if we know the ___ we can calculate the correlation
slope
3 ways to calculate the correlation coefficient "r"
- regression line
- standardized slope
- proportion of variance
In correlation, we want to know how well the regression
line ___
fits the data.
That is, how ___ the points are from the line.
far away
The __ the points are to the line, the stronger the
relationship between the two variables.
closer
When we had one variable and we wanted to know the
spread of the points around the mean, we calculated the
___
SD (σ).
The square of the SD is the ___.
variance
We can do the same thing with our regression data, but
instead of making d the difference between the mean and
the score, we can make it the difference between the value
that we would expect the person to have, given their score
on the x-variable, and the score they actually got. We can
calculate their ___ and then
the difference between their
predicted score and their actual score. The difference is
called the ___.
predicted scores; residual
Their ____ (the difference between the score they
got and the score we thought they would get based on
their initial phoneme score)
residual score
if we want to calculate the equivalent of the
variance, we need to ___ each person’s score
square
___ = d squared
Residual squared
The value of the standardized slope and the value of the
square root of the proportion of variance explained will
___ be the same value.
always
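A sketch, on invented scores, of the residual calculation and of the claim that the standardized slope matches the square root of the proportion of variance explained. All names are hypothetical. The square root is always positive, so for a negative relationship the sign has to be taken from the slope.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mx, my = mean(x), mean(y)
b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx

predicted = [b0 + b1 * a for a in x]               # score we expect
residuals = [b - p for b, p in zip(y, predicted)]  # actual minus predicted

ss_residual = sum(d ** 2 for d in residuals)       # sum of squared residuals
ss_total = sum((b - my) ** 2 for b in y)           # spread around the mean
proportion_explained = 1 - ss_residual / ss_total

standardized_slope = b1 * stdev(x) / stdev(y)
print(round(sqrt(proportion_explained), 4), round(standardized_slope, 4))
```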
We therefore have ___ of thinking about
correlation.
two equivalent ways
The first way is the ___
It is the expected
increase in one variable, when the other variable increases
by 1 SD.
standardized slope.
The second way is the ___. If you
square a correlation, you get the proportion of variance in
one variable that is explained by the other variable.
proportion of variance.
A correlation is both ___ statistics.
descriptive and inferential
We can find the ___ and we can also use
it to describe the ___
probability estimate ; strength of the relationship
- ___ – strength of relationship
Magnitude
___ – positive, negative, curvilinear etc.
Direction
Cohen’s effect size:
- r = 0.1 = small correlation
- r = 0.3 = medium correlation
- r = 0.5 = large correlation
Note that these only really apply in what Cohen called the
___
Social and Behavioral sciences.
Common mistake
- A correlation around 0.5 is a large correlation.
- A correlation does not have to exceed 0.5 to be large.
- If you have a correlation of r = 0.45, you have a
correlation which is approximately equal to a large
correlation.
- It’s not a medium correlation just because it hasn’t quite
reached 0.5
Pearson Correlation Coefficient
* Also known as ___
Pearson Product moment correlation.
Pearson Product moment correlation developed by
Karl Pearson
Pearson Correlation Coefficient
is a _____ and makes the ___
made by other parametric tests.
Parametric correlation; same assumptions
level of measurement for Pearson Correlation Coefficient
Continuous and normally distributed data
to determine r
- standardized slope
- proportion of variance
- pearson product moment correlation
Optional Extra: Product Moments
* ___: the moment is the __ from the fulcrum
multiplied by the weight on the lever.
Physics; length
___ the total moment is equal to the length
from the center, multiplied by the weight. The same principle applies with ___.
Seesaw analogy: correlation
The same principle applies with correlation: the scores need to be balanced (raw to standard score) to be
___
comparable
- We find the ___ for each of the
variables. In this case the center is the
___.
length from the center; mean
So, we calculate the difference between the score and the
mean for each variable (these are the ___) and then
we multiply them together (this is the ___).
moments; product
Because this value is dependent on the ___,
we need to divide it by N.
number of people
And because it is related to the ___, we
actually divide by N-1.
standard deviation
This is called ___, and if we call the two variables
x and y,
covariance
Just as before, we need to __ this value by
dividing by the ___
standardize; standard deviations.
Calculating the Correlation
Coefficient:
we need to divide by ___, so we
multiply them together
both SDs
So instead of finding the square roots and then
multiplying them together, it is easier to multiply the two
values together, and then find the ____
square root.
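The product-moment steps above can be sketched as follows, assuming invented scores; `covariance` and `pearson_r` are hypothetical helper names. The last step follows the shortcut on the card: multiply the two variances together, then take one square root.

```python
from math import sqrt
from statistics import mean

# Hypothetical scores, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def covariance(x, y):
    """Sum of products of the moments (score minus mean), divided by N - 1."""
    mx, my = mean(x), mean(y)
    products = [(a - mx) * (b - my) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)

def pearson_r(x, y):
    """Standardize the covariance by dividing by both SDs."""
    var_x = covariance(x, x)   # variance of x
    var_y = covariance(y, y)   # variance of y
    return covariance(x, y) / sqrt(var_x * var_y)

print(round(covariance(x, y), 4), round(pearson_r(x, y), 4))
```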
Importance of a scattergraph or plot:
- It will show us approximately what the correlation should
be.
- It will help us detect any errors in our data, for example
data entry errors.
- It will help us get a feel for our data.
The confidence intervals for a statistic tell us the likely
___
range of a value in the population.
The sampling distribution of a correlation is ___. It is not
___, which means we can’t add and
subtract CIs in the usual way.
tricky; symmetrical
___ – a transformation which
makes the distribution symmetrical.
Fisher’s z transformation
Used to calculate the CIs and then transform back to
correlations.
Fisher’s z transformation
It is called a ____, because it makes the
distribution of the correlation into a z distribution which
is a normal distribution with a mean of 0 and SD of 1.
z transformation
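A rough sketch of the transform, build the interval, transform back routine. It assumes the standard textbook standard error for Fisher's z, 1 / sqrt(N - 3), and a critical value of 1.96 for a 95% interval; the function name `fisher_ci` and the example values are hypothetical.

```python
from math import atanh, sqrt, tanh

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r from n people."""
    z = atanh(r)                 # Fisher's z transformation of r
    se = 1 / sqrt(n - 3)         # assumed standard error on the z scale
    lower_z = z - z_crit * se    # symmetric interval on the z scale
    upper_z = z + z_crit * se
    return tanh(lower_z), tanh(upper_z)   # transform back to correlations

lower, upper = fisher_ci(0.5, 30)
print(round(lower, 3), round(upper, 3))
```

For r = 0.5 and N = 30 the interval runs from about 0.17 to about 0.73, so it reaches further below r than above it. This is the asymmetry that rules out adding and subtracting in the usual way.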
There are ___ to find the p-value associated with a
correlation.
2 ways
Calculating the p-value
- Use table in Appendix 3.
If we really want to know the p-value, then we can
convert the value for r into a ___
value for t.
We can use this t-value to obtain the ___ using
a ___
exact p-value; computer program
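A sketch of the conversion from r to t, assuming the usual formula t = r × sqrt(N - 2) / sqrt(1 - r²) with N - 2 degrees of freedom. The function name `r_to_t` and the example values are hypothetical. The exact p-value would then come from a t distribution in a statistics program, which is not shown here.

```python
from math import sqrt

def r_to_t(r, n):
    """Convert a correlation r from n people into (t, degrees of freedom)."""
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    return t, df

t_value, df = r_to_t(0.5, 27)
print(round(t_value, 3), df)
```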
When we know the __ we can also calculate the
___ of the regression line
correlation; position
We can use the two values ___ to create
a regression equation which will allow us to predict y
(display behavior) from x (desirability).
(slope and intercept)
We can use the
___ to draw a graph with the line
of best fit on it
predictions
we have extended the line to __ – we would not
normally do this.
zero
If variables are both dichotomous (for example, yes/no,
top, bottom) we can use the ___
Pearson correlation formula.
Dichotomous Variables - A much easier way is to calculate the value of ___
and then use the ___, which will
give the same answer as using the r correlation.
chi-square; phi (φ) correlation formula
The p-value of the correlation will be the same as the p-value
for the ___ because the two tests are just
different ways of thinking about the same thing.
chi-square test
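A sketch comparing phi with the Pearson correlation for two dichotomous variables. It assumes an invented 2×2 table, a chi-square without continuity correction, and phi = sqrt(chi-square / N). The helper names are hypothetical. The square root drops the sign, so the comparison is made on the size of r.

```python
from math import sqrt
from statistics import mean

# Hypothetical 2x2 frequency table: rows are x = 1, 0; columns are y = 1, 0.
table = [[3, 1],
         [1, 3]]

def chi_square(table):
    """Chi-square (no continuity correction) and total N for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

chi2, n = chi_square(table)
phi = sqrt(chi2 / n)

# Expand the table into one 0/1 score pair per person.
x, y = [], []
for x_code, row in zip((1, 0), table):
    for y_code, count in zip((1, 0), row):
        x += [x_code] * count
        y += [y_code] * count

r = pearson_r(x, y)
print(round(phi, 4), round(r, 4))
```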
If one of your variables is continuous and the other is
dichotomous we can use the ___
Point Biserial Formula:
*These formulae give exactly the same answer as the
___, but they are just easier to use.
Pearson Formula
On special occasions we can correlate using a
___. This is when one variable is categorical and has just two
all-inclusive values.
dichotomous variable.
Here, we may give an ___ according to
membership of the categories, e.g. 0 for female and 1 for
male.
arbitrary value
Dichotomous Variables- We then proceed with the ___ as usual.
Pearson Correlation
The Point Biserial is written as
rpb.
This value can be turned into an ordinary ___
t-value.
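A sketch of the point biserial on invented data: the dichotomous variable is coded 0/1 (arbitrary values), the Pearson calculation proceeds as usual, and the result is turned into a t-value. The comparison with a pooled-variance independent groups t is added here as a check and is an assumption beyond what the cards state; all names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical data: group membership coded 0/1, plus a continuous score.
group = [0, 0, 0, 1, 1, 1]
score = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Point biserial: the ordinary Pearson formula with the 0/1 codes.
r_pb = pearson_r(group, score)

# Turn r_pb into a t-value with N - 2 degrees of freedom.
n = len(score)
t_from_r = r_pb * sqrt(n - 2) / sqrt(1 - r_pb ** 2)

# Check: pooled-variance t for the difference between the two groups.
g0 = [s for g, s in zip(group, score) if g == 0]
g1 = [s for g, s in zip(group, score) if g == 1]
pooled_var = ((len(g0) - 1) * variance(g0)
              + (len(g1) - 1) * variance(g1)) / (n - 2)
t_groups = (mean(g1) - mean(g0)) / sqrt(pooled_var * (1 / len(g0) + 1 / len(g1)))

print(round(r_pb, 4), round(t_from_r, 4), round(t_groups, 4))
```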
This may sound like a cheat because the Pearson’s was a
____ type of statistic and that the level of
measurement should be at least interval.
parametric
This is true only if we want to make ___
from our results about underlying populations
certain assumptions
- Used when the data do not satisfy the assumptions of the
Pearson Correlation because they are not normally
distributed or are only ordinal in nature
Non-Parametric Correlations
Non-Parametric Correlations (2)
Spearman Correlation
Kendall Correlation
Two ways to find the Spearman Correlation.
calculate the Spearman correlation
steps of spearman correlation
- Draw a scatterplot
- Rank the data in each group separately
- Find the difference between the ranks for each
person, which we call d.
- Use the formula for Spearman
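The steps above can be sketched as follows (the scatterplot step is left out). This assumes invented scores with no ties and the usual Spearman shortcut formula, 1 - 6 × Σd² / (N × (N² - 1)); the helper names are hypothetical. A Pearson correlation on the ranks is included for comparison.

```python
from math import sqrt
from statistics import mean

# Hypothetical scores with no tied values, invented for illustration only.
x = [10, 20, 30, 40, 50]
y = [12, 25, 20, 48, 40]

def ranks(values):
    """Rank from 1 (smallest) upward; only valid when there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    d = [a - b for a, b in zip(rx, ry)]   # difference between ranks
    n = len(x)
    return 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rho = spearman_rho(x, y)
r_on_ranks = pearson_r(ranks(x), ranks(y))
print(round(rho, 4), round(r_on_ranks, 4))
```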
The first way to calculate the Spearman correlation is just
to calculate a (Pearson) correlation, using the ___
ranked data.
The problem is that the Pearson Formula is a bit __,
especially if a computer is not used.
fiddly
A simplification of the Pearson Formula is available,
developed by Spearman, which works in the case where
there are ___.
ranks
Find the ___ for each
person, which we call d.
difference between the ranks
The ____ is on a scale, however it is not on a scale
we understand
d-score total
We need to convert the scale into one that we do
understand such as the
____, which goes from
-1.00 to +1.00.
correlation scale
However, there is a slight complication because the
formula as we have given it is only valid when there are
___
no ties in the data.
Three ways to deal with this problem:
- Ignore it. It does not make a lot of difference.
- Use the Pearson Formula on the ranks (although the
calculation is harder than the Spearman formula).
- Use a correction. (book suggests a site)
We calculate the significance of the Spearman in the same
way as the significance of the __ Correlation.
Pearson
___– not at all straightforward or easy to
calculate.
Confidence Intervals
If we use a non-parametric test, such as a __
correlation, we tend to lose power.
Spearman
By converting data to ___, information about the actual
scores is thrown away
ranks
Although we could be strict and say that rating data are
strictly measured at an ordinal level, in reality when there
isn’t a problem with the distributions, we would always
prefer to use a ___
Pearson Correlation.
A ___ correlation gives a better chance of a
significant result
Pearson
A curious thing about the Spearman is ___
how to interpret it.
We can’t say that it is the ___, that is the
relative difference in the SDs, because the SDs don’t
really exist as there is not necessarily any relationship
between the score and the SD.
standardized slope
We also can’t say that it is the ___
explained, because the variance is a ___ term, and
we are using ranks
proportion of variance; parametric
All we can really say about the Spearman is that it is the
___
Pearson correlation between the ranks.
Non-Parametric Correlations
Spearman Rank Correlation Coefficient
Spearman Rank Correlation Coefficient - shows how closely the ___ are related.
ranked data
___ – alternative nonparametric
correlation, which does have a more sensible
interpretation. (advantage: meaningful interpretation)
Kendall’s Tau-a (τ – Greek Letter)
Very rarely used however.
Kendall’s Tau-a
Kendall’s Tau-a is rarely used for two reasons:
- Difficult to calculate if you do not have a computer.
- It is always lower than a spearman correlation, for the
same data (but the p-values are always exactly the same).
Kendall’s Tau-a - Because people like their correlations to be ___, they
tend to use it less.
high
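A sketch of Kendall's tau-a on invented scores with no ties, using the standard definition: concordant pairs minus discordant pairs, divided by the total number of pairs. The function name `kendall_tau_a` and the data are hypothetical.

```python
from itertools import combinations

# Hypothetical scores with no tied values, invented for illustration only.
x = [10, 20, 30, 40, 50]
y = [12, 25, 20, 48, 40]

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / all possible pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both variables order this pair the same way
            concordant += 1
        elif product < 0:    # the two variables order this pair oppositely
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

tau = kendall_tau_a(x, y)
print(round(tau, 4))
```

For these scores tau-a is 0.6, while the Spearman shortcut formula gives 0.8 for the same data, in line with the card saying tau-a comes out lower.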
The fact that two variables correlate does not mean there
is a ___ relationship between them.
causal
Correlation does not mean causality, but ___ does
mean correlation.
causality
Correlation when one variable is categorical
Chi-square test
In general, if one variable is a purely category-type
measure, then correlation cannot be carried out, unless
the variable is ___.
DICHOTOMOUS
- We can however use the Chi-square test since it is called
a ____
test of association.
___ is also a measure of ___ between two
variables
Correlation; association
What we can do with nominal/categorical data is
reduce the measured variable to ___ level and
conduct a ___ test on the resulting frequency
table.
nominal; chi-square
This is only possible, however, where you have gathered
several ___ in each category.
cases
We can find the ___ for the measured data and
record how many responses/frequencies were ___
overall mean; above and below this mean
*Typical variables that cannot be correlated (unless a
rational attempt to order categories is made) are: marital
status, ethnicity, place of residence, handedness,
sexuality, degree subject and so on.
*Typical variables
A lack of relationship is signified by a value ___
close to zero.
A value of zero, however, could occur for a ___ relationship.
curvilinear
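A small sketch of a curvilinear relationship giving a correlation of zero. It assumes invented scores where y is exactly x squared, with x values balanced around zero; `pearson_r` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean

# Hypothetical scores: y depends perfectly on x, but along a curve.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]   # 4, 1, 0, 1, 4

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(x, y)
print(r)
```

The products of the moments cancel out, so r is zero even though y is completely determined by x. A scatterplot would show the curve straight away.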
___ is a measure of the correlation.
Strength