Stats - correlation and regression Flashcards
What is correlation used for?
Correlation is used to test for association between variables (e.g. whether salary and IQ are related).
How is regression used, and how does it relate to correlation?
Once a correlation between two variables has been shown, regression can be used to predict values of the dependent variable from the independent variable. Regression is not used unless the two variables have first been shown to correlate.
What are the 3 basic categories of correlation?
Linear
Non-linear
No correlation
Variables that are correlated through a linear relationship can display either positive or negative correlation.
What is the difference between these two?
Positively correlated variables vary directly (as one increases so does the other).
Negatively correlated variables vary as opposites (as the value of one variable increases the other decreases).
How do you measure the strength of correlation?
The strength of the association can be estimated by observing a scatter graph of the variables. The correlation type is independent of the strength.
The association can be strong, moderate or weak.
How do you measure the strength of a linear relationship?
What symbols are given to the sample and the population correlation coefficients?
Correlation coefficient (Pearson’s correlation coefficient).
The sample correlation coefficient is given the symbol r.
The population correlation coefficient has the symbol ρ (rho).
The sign of the correlation coefficient tells us the direction of the linear relationship. How do positive and negative correlations appear?
If r is negative (r < 0) the correlation is negative and the trend line slopes down. If r is positive (r > 0) the correlation is positive and the trend line slopes up.
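As a rough illustration of how the sign of r follows the direction of the trend, here is a minimal sketch that computes the sample Pearson coefficient from its standard definition. The function name `pearson_r` and the data are made up for this example and are not part of the cards.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up data: one series tends to rise with x, the other tends to fall
x_values = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]
falling = [6, 4, 5, 4, 2]

r_up = pearson_r(x_values, rising)     # positive r: trend line slopes up
r_down = pearson_r(x_values, falling)  # negative r: trend line slopes down
print(round(r_up, 3), round(r_down, 3))
```

The two series are mirror images of each other, so the two coefficients have the same magnitude and opposite signs.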
The size (magnitude) of the correlation coefficient tells us the strength of a linear relationship.
What value does r have in a:
1) very strong linear association
|r| = 0.8-1 (the bands apply to the size of r, whatever its sign)
The size (magnitude) of the correlation coefficient tells us the strength of a linear relationship.
What value does r have in a:
2) strong correlation
0.6-0.79
The size (magnitude) of the correlation coefficient tells us the strength of a linear relationship.
What value does r have in a:
3) moderate correlation
0.4-0.59
The size (magnitude) of the correlation coefficient tells us the strength of a linear relationship.
What value does r have in a:
4) weak correlation
0.2-0.39
The size (magnitude) of the correlation coefficient tells us the strength of a linear relationship.
What value does r have in a:
5) very weak linear association
0-0.19
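The five bands above can be turned into a small lookup. This is only a sketch: the helper name `describe_strength` is hypothetical, and it assumes that the bands apply to the magnitude of r and that values falling between the printed ranges (e.g. 0.795) belong to the lower band.

```python
def describe_strength(r):
    """Label the strength of a linear association using the bands on the cards."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength depends on magnitude; the sign only gives direction
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "very weak"

for r in (0.85, -0.65, 0.45, -0.25, 0.1):
    print(r, describe_strength(r))
```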
Parametric statistic procedures rely on assumptions about the shape of the distribution.
What 3 characteristics do parametric data assume?
1) normal distribution
2) measured on an interval/ratio scale
3) conditions or groups have equal variance
How is a complete absence of correlation expressed?
r = 0
How do we summarise correlation using:
a) parametric variables
b) non-parametric variables
What are the symbols for:
c) the samples
d) the population
a) Pearson’s
b) Spearman’s rank
c) parametric - r, non-parametric - rs
d) ρ (rho) for both
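To show how the two coefficients can differ, the sketch below computes Spearman's rs as the Pearson coefficient of the ranks, which is one common way to define it. The helper names and the data are invented for this example, and tied values are given their average rank.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's rank coefficient: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Made-up data: y always increases with x, but not along a straight line
x_values = [1, 2, 3, 4, 5]
y_values = [1, 8, 27, 64, 125]

r = pearson_r(x_values, y_values)
rs = spearman_rs(x_values, y_values)
print(round(r, 3), round(rs, 3))
```

Because the ordering of the two variables agrees perfectly, rs reaches 1 even though r does not.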
What is linear regression?
In contrast to the correlation coefficient, linear regression may be used to predict how much one variable changes when a second variable is changed. A regression equation may be formed, y = a + bx, where
y = the variable being calculated (predicted value of response variable)
a = the intercept value (value of y when x = 0)
b = the slope of the line, or regression coefficient. Simply put, how much y changes for a one-unit change in x
x = the second variable
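The equation y = a + bx can be sketched in code as follows. The cards do not say how a and b are estimated, so this example assumes the usual least-squares fit; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate intercept a and slope b of y = a + bx by least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when x is constant")
    b = sxy / sxx            # slope: change in y per one-unit change in x
    a = mean_y - b * mean_x  # intercept: value of y when x = 0
    return a, b

# Made-up data lying exactly on the line y = 1 + 2x
x_values = [1, 2, 3, 4, 5]
y_values = [3, 5, 7, 9, 11]

a, b = fit_line(x_values, y_values)
predicted = a + b * 6  # predicted y for a new x value of 6
print(a, b, predicted)
```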
What kind of graph is used in correlation and regression analysis?
What goes on the x and y axis?
Scatter graphs
They assist in determining, visually, whether variables are associated. They may also show the nature of a relationship and help to identify any outliers that may be affecting the distribution.
X-axis = independent variable
Y-axis = dependent variable
What are the:
A) dependent variable
B) independent variable
Dependent variable: The variable being measured in an experiment, which depends on the changes in the independent variable (Y-axis)
Independent variable: The variable that is manipulated or controlled by the experimenter to observe its effect (X-axis)
What type of regression is used with dichotomous variables (i.e. binary outcomes like employed vs unemployed)?
Logistic regression is a statistical method for analysing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). In other words, it predicts the probability of an event occurring by fitting the data to a logistic function, which is where the name comes from. Because its outcome is binary, it can be used to model the likelihood of a disease or health condition occurring.
Unlike linear regression, it does not assume a linear relationship between the independent variables and the outcome.
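A minimal sketch of the logistic function itself may help here. It does not fit a model; the coefficients a and b, the predictor (age) and the function name `logistic_probability` are all hypothetical values chosen only to show how any input is mapped to a probability between 0 and 1.

```python
import math

def logistic_probability(x, a, b):
    """Probability of the event for predictor value x, given coefficients a and b."""
    z = a + b * x  # linear combination, as in y = a + bx
    return 1.0 / (1.0 + math.exp(-z))  # logistic function squeezes z into (0, 1)

# Hypothetical coefficients, not estimated from any real data
a = -4.0
b = 0.1

p_age_20 = logistic_probability(20, a, b)
p_age_40 = logistic_probability(40, a, b)
p_age_60 = logistic_probability(60, a, b)
print(round(p_age_20, 3), round(p_age_40, 3), round(p_age_60, 3))
```

With these made-up coefficients the predicted probability rises with age along an S-shaped curve rather than a straight line, and it never leaves the 0 to 1 range.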