Topic 5: Linear Regression Flashcards
Given bivariate data, what are the steps for a linear regression framework?
1) Produce scatter plot
2) Produce regression line
3) Calculate correlation coefficient
4) Produce residual plot
5) Check assumptions
6) Perform predictions
What does checking assumptions involve? What happens if the assumptions are true?
Check that the scatter plot looks linear
Ensure the residual plot looks random (no pattern)
If these assumptions are true, the linear model is appropriate for use
What does a residual plot look at?
It shows the vertical gaps (residuals) between the regression line and each data point
What is a scatter plot?
It is the graphical summary of 2 variables on the same plane, resulting in a cloud of points
What is linear association between 2 variables?
Describes how tightly the points cluster around a line
What are strong and weak associations?
Strong: Cloud of points tightly clustered around a line
Weak: Points aren’t tightly clustered around the line
What is a positive and negative association?
Positive association: as one variable increases, the other tends to increase as well
Negative association: as one variable increases, the other tends to decrease
What are the 5 things that a scatter plot can be summarised by?
mean of x
mean of y
sd of x
sd of y
correlation coefficient (r)
What is the centre of the cloud represented by?
By the point of averages (mean of x, mean of y)
What is the horizontal spread of cloud measured by?
sd of x
What is the vertical spread of cloud measured by?
sd of y
What is the correlation coefficient?
A numerical summary which measures clustering around a line. Indicates sign and strength of linear association. It is between -1 and 1
What is population correlation coefficient?
Mean of the product of variables in standard units
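A minimal sketch of this definition, assuming NumPy is available. The data values and the function name `correlation` are made up for illustration and are not from the course material.

```python
import numpy as np

def correlation(x, y):
    """Population correlation: mean of the product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()   # np.std defaults to the population SD (ddof=0)
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

# Illustrative data only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
```

The result should agree with `np.corrcoef(x, y)[0, 1]`, since the choice of population or sample SD cancels out in r as long as it is used consistently.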
How does r (correlation coefficient) measure association?
The point of averages (centre) divides the scatter plot into 4 quadrants
Majority of points in the upper right and lower left quadrants –> overall positive r
Majority of points in the upper left and lower right quadrants –> overall negative r
What are some properties of the correlation coefficient (r)?
It is a pure number (no units)
lies between -1 and 1
r = 0 means there is no linear association: the points don’t cluster around a line. The variables can still be strongly related in a nonlinear way (e.g. along a curve)
Correlation coefficient isn’t affected by interchanging the variables (switching x and y axis)
Correlation coefficient is shift and scale invariant (adding a constant to a variable, or multiplying it by a positive constant, doesn’t change r; multiplying one variable by a negative constant flips the sign of r)
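These properties can be checked numerically. The sketch below assumes NumPy and uses simulated data; the helper name `r` and the particular shifts and scales are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)      # simulated data with a positive association

def r(a, b):
    """Correlation coefficient of two arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_xy = r(x, y)
r_yx = r(y, x)                                  # interchange the variables
r_shift_scale = r(3 * x + 10, 0.5 * y - 2)      # shift and positive rescale
r_negative_scale = r(-2 * x, y)                 # negative rescale of one variable
```

Under these assumptions `r_yx` and `r_shift_scale` match `r_xy`, while `r_negative_scale` has the same size with the opposite sign.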
What are the two options for a line which represents relationship between two variables?
SD line and Regression line
What is the SD line and why isn’t it preferred?
Connects the point of averages (mean of x, mean of y) to (mean of x + sd of x, mean of y + sd of y) (for r > 0)
OR
(mean of x, mean of y) to (mean of x + sd of x, mean of y - sd of y) (for r < 0)
It isn’t preferred because it is insensitive to the amount of clustering around the line, and thus (for r > 0) underestimates on the left-hand side and overestimates on the right-hand side at the extremes
What is the regression line and why is it a better option?
Connects the point of averages to (mean of x + SD of x, mean of y + r * SD of y)
Accounts for extremes and clustering through use of the correlation coefficient
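A short sketch comparing the two lines, assuming NumPy. Both lines pass through the point of averages; only the slope differs (SD of y / SD of x for the SD line, r times that for the regression line). The function names and data are illustrative.

```python
import numpy as np

def sd_line(x, y):
    """Return (slope, intercept) of the SD line through the point of averages."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * y.std() / x.std()    # one SD of y per SD of x, signed by r
    return float(slope), float(y.mean() - slope * x.mean())

def regression_line(x, y):
    """Return (slope, intercept) of the regression line through the point of averages."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope = r * y.std() / x.std()             # r SDs of y per SD of x
    return float(slope), float(y.mean() - slope * x.mean())

# Illustrative data only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sd_slope, sd_intercept = sd_line(x, y)
reg_slope, reg_intercept = regression_line(x, y)
```

Because |r| is at most 1, the regression line is never steeper than the SD line, which is why the SD line overshoots at the extremes.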
What is the point of averages?
The coordinate (mean of x, mean of y)
What is the graph of averages?
Plots the average y for each x value
Regression line is a smoothed out version of a graph of averages
What is a residual?
vertical distance or ‘gap’ of a point above or below the regression line
Represents error between actual value and prediction
What do residual plots graph?
Graphs residuals vs x
How do we know if linear fit is appropriate based on residual plots?
There shouldn’t be a pattern –> the residuals should look random, with roughly constant spread. A pattern (e.g. a curve or a fan shape) means the residuals aren’t random, which violates the assumptions
What are 11 common mistakes of regression?
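A minimal sketch of computing residuals, assuming NumPy and illustrative data. `np.polyfit` with degree 1 gives the least-squares line, which is the regression line. Plotting `residuals` against `x` with any plotting library would give the residual plot.

```python
import numpy as np

# Illustrative data only
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

slope, intercept = np.polyfit(x, y, 1)      # least-squares (regression) line
predicted = slope * x + intercept
residuals = y - predicted                   # vertical gaps: actual minus predicted

mean_residual = float(residuals.mean())
# Average product of centred x and residuals: zero means no linear trend is left over
leftover_trend = float(np.mean((x - x.mean()) * residuals))
```

For a regression line the residuals average to zero and have no linear trend against x, so any visible pattern in the residual plot points to curvature or changing spread rather than a leftover slope.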
1) r doesn’t mean percentile. I.e. r = 0.8 doesn’t mean 80% of points clustered around line
2) r = 0.8 doesn’t mean that points are twice as tightly clustered as when r = 0.4
3) Outliers can overly influence the correlation coefficient
4) Nonlinear associations can’t be detected by correlation coefficients
5) The same correlation coefficient can arise from very different data –> still need to be careful
6) Rates or averages tend to inflate the correlation coefficient. I.e. a line fitted to group means tends to overestimate the strength of association between the two variables for individuals
7) Association doesn’t mean causation
8) Small SDs can make correlation look bigger
9) Beware of extrapolating beyond the range of the regression line
10) A high correlation coefficient doesn’t guarantee that the data are linear: r can be large even when the underlying relationship is curved –> check the scatter plot and residual plot
11) Beware of refitting – r stays the same if x and y are switched, but the regression line does not, so refit the model with the variable to be predicted as y, according to the context
What is the prediction error?
Difference between the line of regression (predicted value) and a certain point (actual value)
What can be used to measure prediction error?
RMS error. It represents the average gap between the points and the regression line. I.e. ‘standard deviation for the line’
What is the formula for RMS error (pop)?
root of (mean of (gaps)^2)
= root(1 - r^2) x (SD of y)
What are some important notes of RMS error?
Perfect correlation ( r = -1, 1) –> RMS error = 0
r = 0 –> RMS error = SD of y
If we use mean of y for any x (baseline prediction) –> RMS error = SD of y
The sample and population versions differ only in which SD of y is multiplied by root(1 - r^2): popsd for the population version, sd for the sample version
What are 4 methods of making predictions?
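The two forms of the population RMS error, and the baseline case, can be checked against each other. This is a sketch assuming NumPy, with illustrative data and variable names.

```python
import numpy as np

# Illustrative data only
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

r = np.corrcoef(x, y)[0, 1]
slope = r * y.std() / x.std()               # population SDs (ddof=0)
intercept = y.mean() - slope * x.mean()
gaps = y - (intercept + slope * x)          # residuals from the regression line

rms_from_gaps = float(np.sqrt(np.mean(gaps ** 2)))
rms_from_formula = float(np.sqrt(1 - r ** 2) * y.std())

# Baseline prediction: always predict the mean of y
baseline_rms = float(np.sqrt(np.mean((y - y.mean()) ** 2)))
```

Here `rms_from_gaps` and `rms_from_formula` agree, and `baseline_rms` equals the population SD of y, so the regression line improves on the baseline by the factor root(1 - r^2).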
Baseline prediction
Prediction in strip
Regression line
Predicting percentile ranks
What does baseline prediction involve?
Given a certain value of x, basic prediction of y would be the avg of y over all x values in the data
What does prediction in strip involve?
Average of all y values in the data whose x values fall in a narrow vertical strip around that x value
What does prediction through regression line involve?
Use the given equation for regression line to predict y
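The first three prediction methods can be compared side by side. This is a sketch assuming NumPy; the data, the function names, and the strip half-width of 0.5 are all illustrative assumptions.

```python
import numpy as np

# Illustrative data only
x = np.array([1, 1, 2, 2, 3, 3], dtype=float)
y = np.array([2, 4, 3, 5, 6, 8], dtype=float)

def baseline_prediction(y):
    """Ignore x entirely: predict the overall average of y."""
    return float(y.mean())

def strip_prediction(x, y, x0, half_width=0.5):
    """Average y over the points whose x lies within half_width of x0."""
    in_strip = np.abs(x - x0) <= half_width
    return float(y[in_strip].mean())

def regression_prediction(x, y, x0):
    """Predict y at x0 from the regression line through the point of averages."""
    r = np.corrcoef(x, y)[0, 1]
    slope = r * y.std() / x.std()
    return float(y.mean() + slope * (x0 - x.mean()))

baseline = baseline_prediction(y)
in_strip = strip_prediction(x, y, 2.0)
on_line = regression_prediction(x, y, 2.0)
```

At the mean of x the regression line returns the mean of y, so it coincides with the baseline there; the methods differ more as x moves away from its mean.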
What does prediction through percentile ranks involve?
If x is at a certain percentile of all x’s, find the percentile which we could predict y to be in
Steps:
Find z score in x direction (Zx)
Find predicted z score in y direction ( = r x Zx)
Translate z score in y direction back to percentile in y direction
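The steps above can be sketched with Python's standard library alone. The sketch assumes both variables are roughly normal so that percentiles and z scores can be converted with the standard normal curve; the function name `predict_percentile` is made up for illustration.

```python
from statistics import NormalDist

def predict_percentile(x_percentile, r):
    """Predict the percentile rank of y from the percentile rank of x.

    Assumes x and y are each roughly normal. x_percentile must be strictly
    between 0 and 100.
    """
    standard_normal = NormalDist()                       # mean 0, SD 1
    z_x = standard_normal.inv_cdf(x_percentile / 100)    # step 1: z score of x
    z_y = r * z_x                                        # step 2: predicted z score of y
    return 100 * standard_normal.cdf(z_y)                # step 3: back to a percentile

predicted = predict_percentile(90, 0.5)
```

With x at the 90th percentile and r = 0.5, the predicted percentile for y comes out at about the 74th: closer to the 50th than x is, which is the regression effect.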
What does homoscedastic mean?
Variance of residual or error term is constant
What does heteroscedastic mean?
Unequal variance of residual
How do we know if something is homoscedastic?
A rule of thumb: look at the ratio between the largest and the smallest variance of the residuals. If the ratio is 1.5 or smaller, the regression is treated as homoscedastic.
If vertical strips on a scatter plot show equal spread in the y direction it is homoscedastic
RMS error can be used as a measure of spread for individual strips
How do we know if something is heteroscedastic?
If vertical strips don’t show equal spread in the y direction around the regression line (or in the residual plot)
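One way to compare spread across vertical strips is sketched below, assuming NumPy. The function name `strip_variance_ratio`, the choice of 4 equal-count strips, and the simulated residuals are all assumptions for illustration, not a standard test.

```python
import numpy as np

def strip_variance_ratio(x, residuals, n_strips=4):
    """Ratio of the largest to the smallest residual variance across vertical strips.

    Strips are formed by sorting on x and cutting into n_strips groups of
    (nearly) equal size.
    """
    x = np.asarray(x, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(x)
    strips = np.array_split(residuals[order], n_strips)
    variances = [strip.var() for strip in strips]
    return float(max(variances) / min(variances))

rng = np.random.default_rng(1)                      # fixed seed for repeatability
x = rng.uniform(1, 10, size=2000)
even_residuals = rng.normal(0, 1, size=2000)        # same spread at every x
fanning_residuals = rng.normal(0, 1, size=2000) * x # spread grows with x

ratio_even = strip_variance_ratio(x, even_residuals)
ratio_fanning = strip_variance_ratio(x, fanning_residuals)
```

A ratio near 1 is consistent with homoscedasticity, while the fanning residuals give a much larger ratio. The ratio depends on how many strips are used, so it is best read alongside the residual plot.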
What are the implications of homoscedasticity?
Normal approximation can be used within the vertical strips
How do we get the normal distribution of the strip?
New mean of y in new strip = mean of y + r x (Z score of x) x (SD of y)
New SD of y = RMS error (assume population)
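A sketch of the strip calculation using Python's standard library. The summary statistics passed in at the bottom are hypothetical numbers chosen for illustration, and the function name `strip_distribution` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def strip_distribution(x_value, mean_x, sd_x, mean_y, sd_y, r):
    """Normal approximation for y within the vertical strip at x_value.

    Assumes the scatter plot is homoscedastic and uses the population RMS error.
    """
    z_x = (x_value - mean_x) / sd_x
    strip_mean = mean_y + r * z_x * sd_y        # regression prediction at x_value
    strip_sd = sqrt(1 - r ** 2) * sd_y          # RMS error of the regression line
    return NormalDist(mu=strip_mean, sigma=strip_sd)

# Hypothetical summary statistics
strip = strip_distribution(x_value=72, mean_x=68, sd_x=4, mean_y=70, sd_y=5, r=0.6)
prob_above_75 = 1 - strip.cdf(75)               # chance that y exceeds 75 in this strip
```

With these numbers x is one SD above its mean, so the strip is centred at 73 with SD 4, and the chance of y exceeding 75 is the area above z = 0.5 under the normal curve, about 31%.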