Term 2: Lecture 8 Correlations and Linear Regression Flashcards
Relationships between variables: What's the difference between
association and correlation?
what are they?
what levels of data are they appropriate for?
Association:
• When two variables are related to one another, in the sense that they vary together
• Appropriate for nominal, ordinal, interval or ratio-level variables.
Correlation:
• A correlation is a linear association between variables: a relationship that can be represented by a straight line.
• It can be measured by Pearson’s correlation coefficient (Pearson’s r). Pearson’s r is appropriate for interval and ratio-level variables.
A correlation is…
a linear association between variables: a relationship that can be represented by a straight line.
What can a correlation be measured by, and what level of data is it appropriate for?
It can be measured by Pearson’s correlation coefficient (Pearson’s r). Pearson’s r is appropriate for interval and ratio-level variables.
An association is…
and what level of data is it appropriate for?
When two variables are related to one another, in the sense that they vary together
Appropriate for nominal, ordinal, interval or ratio-level variables.
What is Pearson’s r?
What does it range between?
What numbers denote perfect positive, perfect negative, and no correlation?
• A measure of the linear relationship (correlation) between two variables
• Ranges between -1 and 1
• Tells us how well the data fit a straight line
r = 1 → a perfect positive correlation
r = –1 → a perfect negative correlation
r = 0 → no correlation
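As a rough illustration of how r behaves at its extremes, here is a minimal Python sketch that computes Pearson's r from its usual definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the small data sets are made up for this example.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                     # illustrative data only
print(pearson_r(x, [2, 4, 6, 8, 10]))   # points on a rising straight line
print(pearson_r(x, [10, 8, 6, 4, 2]))   # points on a falling straight line
print(pearson_r(x, [2, 1, 3, 1, 2]))    # no linear trend
```

The three calls should give values of approximately 1, -1 and 0 respectively (up to floating-point rounding).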
The correlation coefficient (r) is a ds (like the mean or the standard deviation).
so what do we need to be careful of?
what is it subject to?
what do we need?
descriptive statistic
Therefore, we need to be careful when drawing conclusions from a correlation coefficient computed from sample data:
the correlation coefficient is subject to random sampling error.
We need a significance test.
what would the null hypothesis of a correlation analysis be?
The two variables are linearly independent in the population.
That is, the null hypothesis we will be testing is that the correlation
is zero in the population.
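One common way to test this null hypothesis (an assumption here, since these notes do not say which test is used) is to convert r into a t statistic with n - 2 degrees of freedom: t = r × sqrt(n - 2) / sqrt(1 - r²). A minimal sketch, using the r and n quoted later in these notes for the psychological distress example; the function name is made up:

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# r and n taken from the psychological distress example in these notes
t = correlation_t_statistic(r=0.222, n=184)
print(round(t, 2))
```

The statistic is then compared against a t distribution with n - 2 degrees of freedom to obtain a p-value.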
Correlations: Statistical significance and strength
It is important to distinguish the SS of a correlation from the S of a correlation.
statistical significance
strength
Statistical significance means…..
This says X about the strength of the correlation.
“We have evidence against the null hypothesis that the correlation is zero in the population”.
nothing
(A correlation may be non-zero, but small)
The following values are often used to evaluate the strength (the effect size) of the correlation coefficient:
Small
Medium
Large
Small .10
Medium .30
Large .50
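The benchmarks above can be applied mechanically, as in this small sketch. The function name is hypothetical, and treating each benchmark as a lower cut-off (so that, say, .22 counts as "small") is an assumption; the notes give only the three reference values.

```python
def correlation_strength(r):
    """Label |r| using the .10 / .30 / .50 benchmarks as lower cut-offs (an assumption)."""
    size = abs(r)  # strength ignores the direction of the correlation
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below small"

print(correlation_strength(0.222))
print(correlation_strength(-0.60))
```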
Confidence Interval for a Correlation Coefficient
It is possible to calculate confidence intervals for a correlation coefficient.
The 95% confidence interval for the correlation between Psych Distress at 16 and Psych Distress at 34 is (0.080 to 0.355). How do we interpret this?
For a given pe, the confidence interval of a Pearson correlation will be X, the X the sample size.
Note: SPSS does not have an automatic function to calculate confidence intervals for Pearson correlations. There are online calculators that can work out confidence intervals given the PE (here: r = 0.222) and the SS (here: n = 184)
We are 95% confident that the interval between 0.080 and 0.355 contains the true correlation coefficient.
point estimate
narrower
larger
Point estimate
Sample size
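A confidence interval like the one above can be approximated with the Fisher z-transformation. This is a sketch of one standard method, not necessarily the one the online calculators use; the function name is made up. It takes the point estimate r and the sample size n, as described in the note above.

```python
import math
from statistics import NormalDist

def pearson_confidence_interval(r, n, level=0.95):
    """Approximate CI for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)                      # transform r to the z scale
    se = 1 / math.sqrt(n - 3)              # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

lo, hi = pearson_confidence_interval(0.222, 184)
print(f"{lo:.3f} to {hi:.3f}")

# Same point estimate, larger sample: the interval gets narrower
lo_big, hi_big = pearson_confidence_interval(0.222, 1000)
print(f"{lo_big:.3f} to {hi_big:.3f}")
```

With r = 0.222 and n = 184 this gives roughly 0.080 to 0.355, matching the interval quoted in the card.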
What are the assumptions of Pearson’s r?
when do we not need to make assumptions? why?
when do we need to make assumptions?
CI ST
what do we assume? BND
we do not make any assumptions about the distribution of the two variables, X and Y, whose correlation we measure.
X and Y don’t need to be normally distributed for Pearson’s r to be a meaningful measure of correlation, as it only measures the linear relationship between two variables.
We need to make assumptions when we want to calculate a confidence interval for r, or perform a significance test.
We then assume that the two variables follow a
bivariate normal distribution. In practice this is usually checked by looking at whether X and Y are each normally distributed (strictly speaking, this is necessary for bivariate normality but does not guarantee it).
Alternative to Pearson’s r
when would you use?
want to carry out ST
LoD
NND
If X and Y are ordinal measures (rather than interval or ratio), or if either X or Y is not normally distributed, but we want to carry out a significance test
what are spurious correlations?
When a correlation is found between things that have no causal relationship with each other
What are the non-parametric equivalents (npe) of Pearson’s r?
What type of data do they work on?
What do the coefficients vary between?
- Spearman’s correlation coefficient (“Spearman’s rho”)
- Kendall’s tau
Both Spearman’s coefficient and Kendall’s Tau work on ranked data (Spearman’s rho is, in fact, the Pearson correlation coefficient carried out on ranked data).
Spearman’s coefficient and Kendall’s tau vary between -1 and +1, and their interpretation is analogous to Pearson’s r.
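To illustrate the point that Spearman's rho is Pearson's r carried out on ranks, here is a minimal sketch. The function names and data are made up for the example, and tied values are given the average of their ranks, which is the usual convention but an assumption here.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]        # illustrative data only
y = [1, 4, 9, 16, 25]      # rises with x, but not along a straight line
print(spearman_rho(x, y))  # ranks agree perfectly
print(pearson_r(x, y))     # less than 1, because the relationship is curved
```

Because y always increases with x, the ranks match exactly and rho is approximately 1, while Pearson's r on the raw values falls short of 1.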
what is a regression used for?
What does it assume?
what can a regression not do?
Simple Linear Regression
In regression, we use one variable to predict another.
Linear regression assumes that the relationship may be represented as a straight line.
Regression analysis can help to establish whether a given set of data is consistent with a predictive assumption. However, regression by itself cannot prove that X causes Y
what is a residual?
A residual is the vertical distance of a point from the regression line.
what is the regression line? how does it link to residuals?
The regression line is the line that minimises the sum of the squared residuals.
The regression line is therefore called the line of best fit, or least squares regression line.
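The idea of a least squares line can be sketched in a few lines of Python. The closed-form slope and intercept used here are the standard least squares solution for one predictor; the function names and the tiny data set are made up for illustration.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

x = [1, 2, 3, 4, 5]   # illustrative data only
y = [2, 4, 5, 4, 5]
intercept, slope = least_squares_line(x, y)
best = sum_squared_residuals(x, y, intercept, slope)
print(intercept, slope, best)

# Any other line gives a larger sum of squared residuals
print(sum_squared_residuals(x, y, intercept + 0.5, slope))
print(sum_squared_residuals(x, y, intercept, slope + 0.1))
```

Nudging the intercept or slope away from the fitted values increases the sum of squared residuals, which is what "line of best fit" means here.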
In a regression what does Y denote?
• The predicted variable, usually denoted by Y, is called the dependent variable, or the outcome.
In a regression what does X denote?
• The predictor variable, usually denoted by X, is called the independent variable. In regression, we model the dependent variable as a function of the independent variable.
Y = b0 + (b1 × X) + e
What do the letters stand for
Y the predicted variable (DV)
X the predictor variable (IV)
b0 = the intercept of the regression line, also called the constant
b1 = the slope of the regression line
e = the residual (the prediction error specific to each individual)
From this model we can derive a prediction of __ given __
what is Y hat?
remember that e is not included when making a prediction
Y X
Ŷi = b0 + (b1 × Xi), where Ŷi (“y-hat”) is the predicted value of Y for individual i.
What does the slope of a regression line tell us?
How much difference in Y we can predict for a 1-unit change in X.
Here: for a 1 inch increase in the parents’ height, a child is predicted to be 0.65 inches taller, on average.
What does the Intercept of a regression line tell us?
The intercept is the predicted value of Y when X = 0
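The slope and intercept interpretations can be checked with a small prediction sketch. The slope of 0.65 comes from the height example in these notes; the intercept of 24.0 is a hypothetical value chosen only so the code runs, and the function name is made up.

```python
SLOPE = 0.65       # from the notes: predicted change in child height per 1 inch of parent height
INTERCEPT = 24.0   # hypothetical value, not given in the notes

def predict_child_height(parent_height):
    """Predicted child height (y-hat); the residual e is not part of a prediction."""
    return INTERCEPT + SLOPE * parent_height

# Slope: a 1 inch difference in X changes the prediction by 0.65 inches
print(predict_child_height(69) - predict_child_height(68))

# Intercept: the predicted value of Y when X = 0
print(predict_child_height(0))
```

The first line prints approximately 0.65 and the second prints the intercept. In this example X = 0 is not a realistic parent height, so the intercept mainly serves to position the line rather than to describe anyone.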