Topic 5: Linear Regression Flashcards
Given bivariate data, what are the steps for a linear regression framework?
1) Produce scatter plot
2) Produce regression line
3) Calculate correlation coefficient
4) Produce residual plot
5) Check assumptions
6) Perform predictions
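A minimal sketch of the six steps above, using only the Python standard library. The data values are made up for illustration, and the slope formula r * (sd of y) / (sd of x) is the usual least-squares one, which these cards do not state explicitly. Plotting is left out; a plotting library would normally handle steps 1 and 4.

```python
from statistics import mean, pstdev

# Hypothetical bivariate data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def correlation(xs, ys):
    """Mean of the product of the two variables in standard units."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)])

# Step 1: scatter plot of x against y (omitted here).

# Step 3 (computed first because the slope uses it): correlation coefficient.
r = correlation(x, y)

# Step 2: regression line y = intercept + slope * x.
# Assumption: the standard least-squares slope, r * sd_y / sd_x.
slope = r * pstdev(y) / pstdev(x)
intercept = mean(y) - slope * mean(x)

# Step 4: residuals (observed y minus fitted y), which a residual plot would show.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# Step 5: check assumptions by eye (linear scatter plot, random-looking residuals).

# Step 6: prediction for a new x value.
def predict(new_x):
    return intercept + slope * new_x

print(round(r, 3), round(slope, 3), round(intercept, 3))
```

The regression line always passes through the point of averages, so predict(mean(x)) returns mean(y) up to rounding.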
What does checking assumptions involve? What happens if the assumptions are true?
Check that the scatter plot looks linear
Ensure residual plot looks random
If these assumptions are true, the linear model is appropriate for use
What does a residual plot look at?
Looks at the residuals: the vertical gaps between each observed point and the regression line
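A small sketch of what a residual plot would reveal when the linear assumption fails. The data here are invented (y is exactly x squared), and the line is fitted with the usual least-squares formulas, which is an assumption beyond what the cards state.

```python
from statistics import mean

# Hypothetical curved data: y = x**2, so a straight line is the wrong model.
x = [0, 1, 2, 3, 4]
y = [a * a for a in x]

mx, my = mean(x), mean(y)

# Least-squares slope and intercept (assumed fitting method).
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

# Residual = observed y minus the y value on the regression line.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# residuals come out as [2.0, -1.0, -2.0, -1.0, 2.0]
print(residuals)
```

The residuals are positive at both ends and negative in the middle. That U shape is a pattern rather than random scatter, so the linear model would not be appropriate for this data.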
What is a scatter plot?
It is the graphical summary of 2 variables on the same plane, resulting in a cloud of points
What is linear association between 2 variables?
Describes how tightly the points cluster around a line
What are strong and weak associations?
Strong: Cloud of points tightly clustered around a line
Weak: Points aren’t tightly clustered around the line
What is a positive and negative association?
Positive association: as one variable increases, the other tends to increase as well
Negative association: as one variable increases, the other tends to decrease
What are the 5 things that a scatter plot can be summarised by?
mean of x
mean of y
sd of x
sd of y
correlation coefficient (r)
What is the centre of the cloud represented by?
By the point of averages (mean of x, mean of y)
What is the horizontal spread of cloud measured by?
sd of x
What is the vertical spread of cloud measured by?
sd of y
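A short sketch of the centre and spread summaries, on made-up data. It assumes the population SD (statistics.pstdev); the cards do not say whether the population or sample SD is intended.

```python
from statistics import mean, pstdev

# Hypothetical data, invented for illustration.
x = [2, 4, 6, 8]
y = [1, 3, 5, 7]

# Centre of the cloud: the point of averages.
point_of_averages = (mean(x), mean(y))

# Horizontal and vertical spread of the cloud.
# Assumption: population SD (divide by n), not sample SD (divide by n - 1).
sd_x = pstdev(x)
sd_y = pstdev(y)

print(point_of_averages, sd_x, sd_y)
```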
What is the correlation coefficient?
A numerical summary which measures clustering around a line. Indicates sign and strength of linear association. It is between -1 and 1
What is the population correlation coefficient?
The mean of the product of the two variables in standard units
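The definition above can be sketched step by step as follows. The data are invented, and "standard units" is taken to mean (value minus mean) divided by the population SD.

```python
from statistics import mean, pstdev

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def standard_units(values):
    """Convert each value to (value - mean) / population SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

zx = standard_units(x)
zy = standard_units(y)

# r is the mean of the products of the paired standard units.
r = mean([a * b for a, b in zip(zx, zy)])

print(round(r, 4))
```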
How does r (correlation coefficient) measure association?
The point of averages (centre) divides the scatter plot into 4 quadrants
Majority of points in the upper right and lower left quadrants –> overall positive r
Majority of points in the upper left and lower right quadrants –> overall negative r
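A rough sketch of the quadrant idea on invented data: count how many points fall in each quadrant around the point of averages. Points in the upper right and lower left have deviations of the same sign, so they contribute positive products to r.

```python
from statistics import mean

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

mx, my = mean(x), mean(y)  # point of averages

counts = {"upper right": 0, "lower left": 0, "upper left": 0, "lower right": 0}
for a, b in zip(x, y):
    if a > mx and b > my:
        counts["upper right"] += 1
    elif a < mx and b < my:
        counts["lower left"] += 1
    elif a < mx and b > my:
        counts["upper left"] += 1
    elif a > mx and b < my:
        counts["lower right"] += 1
    # Points lying exactly on a dividing line are not counted.

positive_side = counts["upper right"] + counts["lower left"]
negative_side = counts["upper left"] + counts["lower right"]
print(counts, positive_side, negative_side)
```

Here most points land in the upper right and lower left quadrants, which matches a positive r. The counts are only a guide to the sign: r itself also depends on how far each point sits from the centre.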
What are some properties of the correlation coefficient (r)?
It is a pure number (no units)
It lies between -1 and 1
r = 0 means there is no linear association; the variables can still be related in other (non-linear) ways
The correlation coefficient isn't affected by interchanging the variables (swapping the x and y axes)
The correlation coefficient is shift and scale invariant (it doesn't change when a constant is added to a variable, or when a variable is multiplied by a positive constant)
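These properties can be checked numerically. The sketch below uses invented data and a hypothetical helper named correlation, which computes r as the mean of the product in standard units. It also shows one detail beyond the cards: multiplying a single variable by a negative constant flips the sign of r.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Mean of the product of the two variables in standard units."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)])

# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 4.0, 7.0, 9.0]
y = [3.0, 5.0, 4.0, 8.0, 11.0]

r = correlation(x, y)
r_swapped = correlation(y, x)                                      # interchange variables
r_shifted = correlation([a + 100 for a in x], [b - 7 for b in y])  # shift both
r_scaled = correlation([2.5 * a for a in x], [0.1 * b for b in y]) # positive scaling
r_flipped = correlation([-a for a in x], y)                        # negative scaling

# A perfect but non-linear relationship (y = x**2) with no linear association.
px = [-2, -1, 0, 1, 2]
py = [a * a for a in px]
r_parabola = correlation(px, py)

print(r, r_swapped, r_shifted, r_scaled, r_flipped, r_parabola)
```

The first four values agree up to rounding, r_flipped is the negative of r, and r_parabola is zero even though y is completely determined by x.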
What are the two options for a line which represents relationship between two variables?
SD line and regression line