Regression and Correlation Flashcards
What is correlation?
Correlation quantifies the strength of the association between two quantitative variables. Pearson’s correlation coefficient measures how closely the points scatter around an underlying linear trend between the two variables.
What is linear regression?
Linear regression studies the linear relationship between two quantitative variables when one is modelled as depending on the other. The model allows us to make specific predictions about what we expect to see among individuals who have not had the dependent variable measured.
Describe ρ
ρ measures the strength of the linear relationship between two variables on a scale from -1 to 1, where ρ = +1 is a perfect positive linear relationship
ρ = 0 means no linear relationship
ρ = -1 is a perfect negative linear relationship
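The three reference values of the correlation coefficient can be checked with a small sketch. The data and the helper name `pearson_r` are made up for illustration. The curved example shows why a value of zero only rules out a linear relationship: the points lie exactly on a parabola, yet the coefficient is zero.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only (not from any real study).
x = [-2, -1, 0, 1, 2]
r_positive = pearson_r(x, [2 * v + 1 for v in x])   # points exactly on a rising line
r_negative = pearson_r(x, [-3 * v + 4 for v in x])  # points exactly on a falling line
r_curved = pearson_r(x, [v ** 2 for v in x])        # exact parabola, no linear trend

print(r_positive, r_negative, r_curved)
```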
What is 100r²?
The percentage of the variability in X or Y that is explained by the linear relationship between them
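As a rough illustration of the "percentage explained" reading, the sketch below computes 100 times the squared sample correlation in two ways: directly, and as the share of the variability in y that is not left over in the residuals of a least squares line. The data are invented and the variable names are assumptions for this example.

```python
import math

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)   # total variability in y

# Route 1: square the sample correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
percent_explained = 100 * r ** 2

# Route 2: fit the least squares line and compare residual to total variability.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
percent_from_residuals = 100 * (1 - sse / syy)

print(percent_explained, percent_from_residuals)
```

Both routes give the same percentage for simple linear regression with one predictor.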
What is the difference between r and ρ?
r is the sample correlation; ρ is the population correlation, which r estimates.
An approximate 95% confidence interval for ρ is: estimate ± (1.96 × SE)
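The card does not say on which scale the standard error is taken. One common approach, used here as an assumption rather than as the card's own method, is Fisher's z transformation: apply the "estimate ± 1.96 × SE" rule to atanh(r), where the standard error is approximately 1/sqrt(n - 3), then transform back. The function name and the input values are hypothetical.

```python
import math

def correlation_ci_95(r, n):
    """Approximate 95% CI for the population correlation via Fisher's z."""
    z = math.atanh(r)              # transform r to the z scale
    se = 1 / math.sqrt(n - 3)      # approximate standard error on the z scale
    lower_z = z - 1.96 * se
    upper_z = z + 1.96 * se
    # Back-transform so the limits are on the correlation scale again.
    return math.tanh(lower_z), math.tanh(upper_z)

# Hypothetical sample: r = 0.6 observed in n = 30 individuals.
low, high = correlation_ci_95(r=0.6, n=30)
print(low, high)
```

Working on the z scale keeps both limits inside the range -1 to 1, which a symmetric interval built directly around r would not guarantee.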
What are some important points about correlation?
The correlation coefficient is not dependent on the units of measurement of the variables
Always look at a plot of the data first to make sure there is some sort of linear relationship
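The point about units can be checked numerically. In this sketch the heights and weights are invented, and the helper `pearson_r` is defined only for this example; converting both variables to different units leaves the coefficient unchanged apart from floating-point rounding.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up measurements in metric units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 90]

# The same measurements in imperial units.
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]

r_metric = pearson_r(heights_cm, weights_kg)
r_imperial = pearson_r(heights_in, weights_lb)
print(r_metric, r_imperial)
```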
What can be done if the sample is not normally distributed before calculating the Pearson coefficient?
transforming the data (by taking logs)
using a rank correlation coefficient such as Spearman’s
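Both options can be sketched on a deliberately skewed, made-up data set in which y grows exponentially with x. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks (with tied values sharing their average rank); the helper names are assumptions for this example.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Made-up data: y increases exponentially with x, so it is strongly skewed.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]

r_raw = pearson_r(x, y)                          # weakened by the curvature
r_log = pearson_r(x, [math.log(v) for v in y])   # after a log transformation
r_spearman = spearman_r(x, y)                    # uses only the ordering
print(r_raw, r_log, r_spearman)
```

The log transformation straightens this particular relationship, while the rank-based coefficient depends only on the ordering of the values.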
What is the best fitting line in regression?
The one that makes the sum of the squares of the residuals as small as possible, which is equivalent to minimising the variance of the residuals. This line is known as the least squares linear regression line.
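A minimal sketch of the least squares idea, using invented data and hypothetical helper names: the slope and intercept come from the standard closed-form formulas, and any other line tried on the same data gives a larger sum of squared residuals. The fitted line can then be used to predict y for a new x.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = least_squares_line(x, y)
best_sse = sum_squared_residuals(x, y, intercept, slope)

# Nudging either coefficient away from the fitted value increases the sum.
steeper_sse = sum_squared_residuals(x, y, intercept, slope + 0.1)
shifted_sse = sum_squared_residuals(x, y, intercept + 0.1, slope)

# Prediction for an individual with x = 6 whose y has not been measured.
predicted_y = intercept + slope * 6
print(intercept, slope, best_sse, steeper_sse, shifted_sse, predicted_y)
```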
What are the assumptions for linear regression?
CLINE: Constant variance, Linearity, Independent observations, Normality of residuals, Error-free values for x
In the residual plots, what do graphs A and B assess?
The assumption that the residuals are normally distributed
What does graph C assess?
Whether the relationship is linear and whether the spread of response is the same for all values of x
What does graph D show?
Some indication of a lack of independence