Topic 5: Linear Regression Flashcards by James Makmur

Given bivariate data, what are the steps for a linear regression framework?

1) Produce scatter plot

2) Produce regression line

3) Calculate correlation coefficient

4) Produce residual plot

5) Check assumptions

6) Perform predictions

What does checking assumptions involve? What happens if the assumptions are true?

Check that the scatter plot looks linear

Check that the residual plot looks random (no pattern)

If these assumptions hold, the linear model is appropriate for use

What does a residual plot look at?

Looks at the vertical gaps (residuals) between the regression line and the data points

What is a scatter plot?

It is the graphical summary of 2 variables on the same plane, resulting in a cloud of points

What is linear association between 2 variables?

Describes how tightly the points cluster around a line

What are strong and weak associations?

Strong: Cloud of points tightly clustered around a line

Weak: Points aren’t tightly clustered around the line

What is a positive and negative association?

Positive association: as one variable increases, the other tends to increase as well

Negative association: as one variable increases, the other tends to decrease

What are the 5 things that a scatter plot can be summarised by?

mean of x
mean of y
sd of x
sd of y
correlation coefficient (r)

What is the centre of the cloud represented by?

By the point of averages (mean of x, mean of y)

What is the horizontal spread of cloud measured by?

sd of x

What is the vertical spread of cloud measured by?

sd of y

What is the correlation coefficient?

A numerical summary which measures clustering around a line. Indicates sign and strength of linear association. It is between -1 and 1

What is the population correlation coefficient?

The mean of the products of the two variables in standard units
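
As a rough illustration of this definition, the sketch below converts each variable to standard units (using the population SD) and averages the products. The data and the helper names `standard_units` and `correlation` are made up for this example and are not part of the original cards.

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units using the population SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """Population r: the mean of the products of x and y in standard units."""
    zx, zy = standard_units(x), standard_units(y)
    return mean([a * b for a, b in zip(zx, zy)])

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 4))  # 0.7746
```

Using the sample SD instead would need the sum of products divided by n - 1 rather than n to give the same r.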

How does r (correlation coefficient) measure association?

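Divide the scatter plot into 4 quadrants at the point of averages (the centre)

Points in the upper right and lower left quadrants have a positive product of standard units; points in the upper left and lower right quadrants have a negative product

Majority of points in the upper right and lower left quadrants --> overall positive r

Majority of points in the upper left and lower right quadrants --> overall negative r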

What are some properties of the correlation coefficient (r)?

It is a pure number (no units)

It lies between -1 and 1

r = 0 means the points show no linear clustering; the variables can still be associated in other (nonlinear) ways

r isn’t affected by interchanging the variables (switching the x and y axes)

r is shift and scale invariant (adding a constant to a variable, or multiplying it by a positive constant, doesn’t change r; multiplying by a negative constant only flips the sign)
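
These properties can be checked numerically. The sketch below is only an illustration on made-up data: the `correlation` helper is a hypothetical name, and r is computed as the mean of products in standard units with population SDs.

```python
from statistics import mean, pstdev

def correlation(x, y):
    """Population r: mean of the products of x and y in standard units."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return mean([((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)])

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)                      # interchange the variables
r_shifted = correlation([xi + 10 for xi in x], y)  # shift x by a constant
r_scaled = correlation(x, [3 * yi for yi in y])    # scale y by a positive constant
r_negated = correlation(x, [-2 * yi for yi in y])  # negative scale flips the sign

# A perfect but nonlinear (parabolic) relationship still gives r = 0
u = [-2, -1, 0, 1, 2]
v = [ui ** 2 for ui in u]
r_parabola = correlation(u, v)

print(round(r, 4), round(r_swapped, 4), round(r_shifted, 4), round(r_scaled, 4))
print(round(r_negated, 4), abs(round(r_parabola, 4)))
```

The first four values agree, the negated case has the same size with the opposite sign, and the parabola gives 0 even though v is completely determined by u.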

What are the two options for a line which represents relationship between two variables?

SD line and regression line

What is the SD line and why isn’t it preferred?

For r > 0: connects the point of averages (mean of x, mean of y) to (mean of x + SD of x, mean of y + SD of y)

For r < 0: connects the point of averages (mean of x, mean of y) to (mean of x + SD of x, mean of y - SD of y)

It isn’t preferred because it is insensitive to the amount of clustering around the line, so for r > 0 it underestimates at the low-x extreme (LHS) and overestimates at the high-x extreme (RHS)

What is the regression line and why is it a better option?

Connects the point of averages to (mean of x + SD of x, mean of y + r * SD of y)

Accounts for extremes and clustering through use of the correlation coefficient
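
To make the contrast between the two lines concrete, here is a minimal sketch that builds both through the point of averages: the SD line with slope ±(SD of y / SD of x) and the regression line with slope r x (SD of y / SD of x). The data and the function names `sd_line` and `regression_line` are assumptions for illustration only.

```python
from statistics import mean, pstdev

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)       # point of averages
sx, sy = pstdev(x), pstdev(y)   # population SDs
r = mean([((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)])

# SD line: rises (or falls, if r < 0) one SD of y per SD of x
sd_slope = sy / sx if r > 0 else -sy / sx
# Regression line: rises only r SDs of y per SD of x
reg_slope = r * sy / sx

def sd_line(x_new):
    return my + sd_slope * (x_new - mx)

def regression_line(x_new):
    return my + reg_slope * (x_new - mx)

print(round(sd_slope, 4), round(reg_slope, 4))  # 0.7746 0.6
```

Because |r| < 1 here, the regression line is flatter than the SD line, so on this data the SD line sits below it at the smallest x and above it at the largest x.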

What is the point of averages?

The point with coordinates (mean of x, mean of y)

What is the graph of averages?

Plots the average y for each x value

Regression line is a smoothed out version of a graph of averages

What is a residual?

The vertical distance or ‘gap’ of a point above or below the regression line: residual = actual y - predicted y (positive above the line, negative below)

Represents the error between the actual value and the prediction
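
A small sketch of how residuals could be computed, assuming made-up data and a least-squares slope (algebraically the same as r x SD of y / SD of x). The variable names are illustrative only.

```python
from statistics import mean, pvariance

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)
# Least-squares slope: mean product of deviations divided by the variance of x
slope = mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)]) / pvariance(x)
intercept = my - slope * mx

predicted = [intercept + slope * xi for xi in x]
# Residual = actual y minus predicted y (positive above the line, negative below)
residuals = [yi - p for yi, p in zip(y, predicted)]

print([round(e, 2) for e in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

Plotting these residuals against x gives the residual plot; for a least-squares line they always sum to (numerically) zero.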

What do residual plots graph?

Graphs residuals vs x

How do we know if linear fit is appropriate based on residual plots?

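There shouldn’t be a pattern: the residuals should look like a random scatter around 0 with roughly constant spread (constant variance). A visible pattern (e.g. a curve or fanning out) means the residuals aren’t random and the assumptions are violated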

What are 11 common mistakes of regression?

1) r is not a percentage, i.e. r = 0.8 doesn’t mean 80% of points are clustered around the line

2) r = 0.8 doesn’t mean that the points are twice as tightly clustered as for r = 0.4

3) Outliers can overly influence the correlation coefficient

4) Nonlinear associations can’t be detected by the correlation coefficient

5) The same correlation coefficient can arise from very different data --> always look at the scatter plot

6) Rates or averages tend to inflate the correlation coefficient, i.e. a correlation computed from group means tends to overstate the strength of association between the two variables for individuals

7) Association doesn’t mean causation

8) Small SDs can make the correlation look bigger (the cloud looks more tightly clustered even though r is unchanged)

9) Beware of extrapolating beyond the range of the data used to fit the regression line

10) A high correlation coefficient doesn’t guarantee that the data are linear, so still check the scatter plot and residual plot

11) Beware of refitting: even though the correlation coefficient is the same if x and y are switched, the regression line is not, so the model must be refitted to match which variable is being predicted in context

What is the prediction error?

The difference between the actual value of a point and the value predicted by the regression line

What can be used to measure prediction error?

RMS error. It represents the average gap between the points and the regression line. I.e. 'standard deviation for the line'

What is the formula for RMS error (population)?

RMS error = sqrt(mean of (gaps)^2) = sqrt(1 - r^2) x (SD of y)

What are some important notes of RMS error?

Perfect correlation (r = -1 or r = 1) --> RMS error = 0

r = 0 --> RMS error = SD of y

If we use the mean of y for any x (baseline prediction) --> RMS error = SD of y

For a population vs a sample, the only difference is which SD of y is multiplied: the population SD (popsd) for a population, the sample SD (sd) for a sample
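
As a numerical check, the sketch below computes the RMS error two ways on made-up data: directly from the gaps, and from the shortcut sqrt(1 - r^2) x (SD of y), using population SDs throughout. It also shows that the baseline prediction (always predicting the mean of y) has RMS error equal to the SD of y. All variable names are illustrative assumptions.

```python
import math
from statistics import mean, pstdev

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)
sx, sy = pstdev(x), pstdev(y)
r = mean([((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)])

slope = r * sy / sx
intercept = my - slope * mx

# RMS error directly from the gaps between the points and the regression line
gaps = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rms_from_gaps = math.sqrt(mean([g ** 2 for g in gaps]))

# RMS error from the shortcut formula
rms_from_r = math.sqrt(1 - r ** 2) * sy

# Baseline prediction: always predict the mean of y
baseline_gaps = [yi - my for yi in y]
rms_baseline = math.sqrt(mean([g ** 2 for g in baseline_gaps]))

print(round(rms_from_gaps, 4), round(rms_from_r, 4))  # 0.6928 0.6928
print(round(rms_baseline, 4), round(sy, 4))           # 1.0954 1.0954
```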

What are 4 methods of making predictions?

Baseline prediction
Prediction in strip
Regression line
Predicting percentile ranks

What does baseline prediction involve?

For any given value of x, the basic prediction of y is the average of y over all the data, ignoring x

What does prediction in strip involve?

The average of all y values in the data corresponding to that x value (i.e. within the vertical strip at that x)
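
The baseline and strip predictions can be sketched as follows. The data are made up so that x values repeat, and the helper names `baseline_prediction`, `strip_prediction` and `graph_of_averages` are hypothetical. The strip here is taken as exact matches on x; with continuous data a narrow range around x would be needed instead.

```python
from collections import defaultdict
from statistics import mean

def baseline_prediction(y):
    """Predict the overall average of y, ignoring x."""
    return mean(y)

def strip_prediction(x, y, x_value):
    """Predict the average of the y values whose x equals x_value."""
    strip = [yi for xi, yi in zip(x, y) if xi == x_value]
    return mean(strip)

def graph_of_averages(x, y):
    """Average y for each distinct x value."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    return {xi: mean(ys) for xi, ys in sorted(groups.items())}

# Hypothetical data with repeated x values, made up for illustration
x = [1, 1, 2, 2, 2, 3, 3]
y = [2, 4, 3, 5, 4, 6, 8]

print(round(baseline_prediction(y), 3))  # 4.571
print(strip_prediction(x, y, 2))         # 4
print(graph_of_averages(x, y))           # {1: 3, 2: 4, 3: 7}
```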

What does prediction through regression line involve?

Use the given equation for regression line to predict y

What does prediction through percentile ranks involve?

If x is at a certain percentile of all x's, find the percentile at which we would predict y to be

Steps:

1) Find the z score in the x direction (Zx) from the percentile of x

2) Find the predicted z score in the y direction (= r x Zx)

3) Translate the z score in the y direction back to a percentile in the y direction
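
The three steps might look like this in code. This is a sketch that assumes both variables are roughly normally distributed, so that percentiles and z scores can be converted with the standard normal curve; the function name `predict_percentile` and the example values are made up.

```python
from statistics import NormalDist

def predict_percentile(x_percentile, r):
    """Predicted percentile rank of y, given the percentile rank of x and r.

    Assumes both variables follow an approximately normal distribution.
    """
    std_normal = NormalDist()                      # mean 0, SD 1
    z_x = std_normal.inv_cdf(x_percentile / 100)   # step 1: percentile -> z score of x
    z_y = r * z_x                                  # step 2: predicted z score of y
    return 100 * std_normal.cdf(z_y)               # step 3: z score -> percentile of y

# Hypothetical example: x is at the 90th percentile and r = 0.5
print(round(predict_percentile(90, 0.5)))  # roughly 74
```

With r below 1 the predicted percentile is pulled towards the 50th, which is the regression effect; with r = 0 the prediction is always the 50th percentile.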

What does homoscedastic mean?

Variance of residual or error term is constant

What does heteroscedastic mean?

Unequal variance of residual

How do we know if something is homoscedastic?

A rule of thumb: look at the ratio between the largest variance and the smallest variance. If the ratio is 1.5 or smaller, the regression is treated as homoscedastic

If vertical strips on a scatter plot show equal spread in the y direction, it is homoscedastic

RMS error can be used as a measure of spread for individual strips

How do we know if something is heteroscedastic?

If vertical strips don’t show equal spread in the y direction on the scatter plot (or the residual plot)

What are the implications of homoscedasticity?

Normal approximation can be used within the vertical strips

How do we get the normal distribution of the strip?

New mean of y in the strip = mean of y + r x (z score of x) x (SD of y)

New SD of y in the strip = RMS error (assuming a population)
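
A minimal sketch of the normal approximation within a strip, assuming homoscedastic data and population SDs. The summary statistics and the function name `strip_distribution` are invented for illustration.

```python
import math
from statistics import NormalDist

def strip_distribution(x_value, mean_x, sd_x, mean_y, sd_y, r):
    """Normal approximation for y within the vertical strip at x_value."""
    z_x = (x_value - mean_x) / sd_x
    strip_mean = mean_y + r * z_x * sd_y       # new mean of y in the strip
    strip_sd = math.sqrt(1 - r ** 2) * sd_y    # new SD of y in the strip = RMS error
    return NormalDist(strip_mean, strip_sd)

# Hypothetical summary statistics, made up for illustration
strip = strip_distribution(x_value=70, mean_x=60, sd_x=10,
                           mean_y=65, sd_y=10, r=0.6)

# Here the strip mean is 65 + 0.6 x 1 x 10 = 71 and the strip SD is 0.8 x 10 = 8
prob_above_79 = 1 - strip.cdf(79)  # chance that y exceeds 79 within this strip
print(round(strip.mean, 2), round(strip.stdev, 2), round(prob_above_79, 4))
```

Since 79 is one strip SD above the strip mean, the probability comes out at about 16%.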
