Topic 5 - Linear Model Flashcards
LO
LO5 Model and explain the relationship between two variables using linear regression.
Key steps in linear regression
- Produce a scatterplot
- Calculate the correlation coefficient
- Produce a regression line
- Produce a residual plot
- Check that the assumptions hold
- Perform predictions with the data
Step 1: Produce a scatterplot
Pair of variables (x = independent variable (IV), y = dependent variable (DV))
- A scatterplot supports initial data analysis (IDA) and gives a first impression of whether a linear model is appropriate
Step 2: Calculate correlation coefficient
Linear correlation
- How tightly the ‘cloud’ of points clusters around a line through the middle
- Tight cluster = strong correlation
Correlation Coefficient
- ‘r’ is a numerical summary which measures the clustering of points around a line
- It indicates both the sign and strength of the linear association
- Between -1 and 1
Population correlation coefficient
‘rpop’ is the mean of the product of the variables in standard units
Population Vs Sample
rpop = whole population
rsample = sample of population
- Both formulas give the same result, provided each uses its matching SD (population SD divides by n, sample SD divides by n - 1)
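A minimal sketch of the two formulas, assuming the population version divides by n and the sample version divides by n - 1 throughout. The function name `correlation`, the `ddof` parameter and the small dataset are illustrative and not part of the flashcards.

```python
import math

def correlation(xs, ys, ddof=0):
    """Correlation as the averaged product of x and y in standard units.

    ddof=0 uses population SDs and divides by n;
    ddof=1 uses sample SDs and divides by n - 1.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - ddof))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - ddof))
    # Convert each value to standard units, then multiply the pairs.
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - ddof)

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_pop = correlation(xs, ys, ddof=0)
r_sample = correlation(xs, ys, ddof=1)
print(round(r_pop, 4), round(r_sample, 4))  # both are about 0.7746
```

The n and n - 1 factors cancel between the SDs and the averaging step, which is why the two values agree.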
Properties of the correlation coefficient
Value:
- When r = +/- 1, all points lie exactly on the regression line
Symmetry:
- Correlation coefficient is NOT affected by interchanging variables (swapping the x & y axes gives the same r value)
Scaling/ Shifting:
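- ‘r’ stays the same if either variable is shifted (a constant added) or multiplied by a positive constant; multiplying one variable by a negative constant flips the sign of ‘r’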
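The symmetry and scaling/shifting properties can be checked numerically. This is a rough sketch on made-up data; the helper `correlation` and the chosen constants (10, 3, -3) are assumptions for illustration.

```python
import math

def correlation(xs, ys):
    """Population correlation coefficient of paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * sd_x * sd_y)

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)

r_swapped = correlation(ys, xs)                     # symmetry: swap x and y
r_shifted = correlation([x + 10 for x in xs], ys)   # shift x by a constant
r_scaled = correlation([3 * x for x in xs], ys)     # multiply x by a positive constant
r_flipped = correlation([-3 * x for x in xs], ys)   # multiply x by a negative constant

print(round(r, 4), round(r_swapped, 4), round(r_shifted, 4),
      round(r_scaled, 4), round(r_flipped, 4))
```

The first four values match, while the last has the same size but the opposite sign.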
Step 3: Produce a regression line
Uses the 5 summaries (x̄, ȳ, SDx, SDy, r)
Regression line connects (x̄, ȳ) to (x̄ + SDx, ȳ + r × SDy), so its slope is r × SDy / SDx
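A short sketch of building the line from the five summaries, assuming population SDs (dividing by n). The slope is r × SDy / SDx and the line passes through the point of averages. The names `five_summaries` and `regression_line` and the data are illustrative.

```python
import math

def five_summaries(xs, ys):
    """Return (mean_x, mean_y, sd_x, sd_y, r) using population SDs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = cross / (n * sd_x * sd_y)
    return mean_x, mean_y, sd_x, sd_y, r

def regression_line(xs, ys):
    """Return (slope, intercept) of the line for predicting y from x."""
    mean_x, mean_y, sd_x, sd_y, r = five_summaries(xs, ys)
    slope = r * sd_y / sd_x              # rise of r * SDy for a run of SDx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = regression_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x")  # y = 2.20 + 0.60 x
```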
Step 4: Produce a residual plot
Residual:
- Is the vertical distance/gap of a point above or below the regression line
- Represents the error between the actual values and the prediction
Residual plot
- Graphs the residuals Vs. ‘x’
If a linear regression is appropriate, then:
- The residual plot should show no pattern
- Should be random about a horizontal line at zero
- Should have constant variance within vertical strips along the x axis
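As a rough illustration, the residuals behind a residual plot can be computed as actual y minus predicted y. The sketch below prints them as text instead of drawing a graph; `fit_line` and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept (the same line as slope = r * SDy / SDx)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)

# Residual = actual y minus the y predicted by the line at the same x.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# A text stand-in for the residual plot: each row is one (x, residual) point.
for x, e in zip(xs, residuals):
    print(f"x = {x}: residual = {e:+.2f}")
```

The residuals of a fitted regression line always sum to zero (up to rounding), so the check that matters is whether they show a pattern or a changing spread across x.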
Step 5: Check assumptions
2 main diagnostic checks:
- Does the scatterplot look linear?
- Does the residual plot look random, with constant spread (homoscedasticity)?
Step 6: Perform Predictions
Make predictions only once the checks in Step 5 are satisfied
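One possible way to sketch prediction is to plug a new x into the fitted line while refusing values outside the observed x range. The `predict` helper and its range guard are a hypothetical design for illustration, not a prescribed method.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(x_new, xs, ys):
    """Predict y at x_new, refusing to extrapolate beyond the observed x range."""
    if not min(xs) <= x_new <= max(xs):
        raise ValueError(f"x = {x_new} is outside the observed range")
    slope, intercept = fit_line(xs, ys)
    return intercept + slope * x_new

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(predict(3, xs, ys), 2))  # 4.0
```

Calling `predict(10, xs, ys)` raises a `ValueError`, because 10 lies outside the x values the line was fitted on.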
Most common mistakes in regression
- Interpreting ‘r’ as a percentage
- Comparing 2 values of ‘r’ as percentages
- Underestimating the effect of outliers on ‘r’
- Assuming that strong correlation means good fit for the regression line
- Assuming that 3 datasets with similar r values will be similar to each other
- Inflating the linear association by grouping data
- Mistaking association for causation
- Rearranging the regression equation rather than refitting (e.g. to predict x from y)
- Extrapolating without justification
- Forgetting to check the scatterplot
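To illustrate the rearranging mistake, the sketch below fits y on x, then compares the slope obtained by algebraically rearranging that line with the slope obtained by refitting x on y. The helper `fit_line` and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope_yx, intercept_yx = fit_line(xs, ys)  # predicts y from x
slope_xy, intercept_xy = fit_line(ys, xs)  # refit: predicts x from y

# Rearranging y = a + b x gives x = (y - a) / b, whose slope is 1 / b.
rearranged_slope = 1 / slope_yx

print(round(slope_xy, 2), round(rearranged_slope, 2))  # 1.0 1.67
```

The two slopes agree only when r = +/- 1, so predicting x from y needs its own fitted line.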
Prediction error (RMS Error)
RMS error:
- Represents the typical (root-mean-square) size of the gap between the points and the regression line
RMS error = √(1 − r²) × SDy
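A small check, assuming population SDs (dividing by n), that the shortcut formula matches the root-mean-square of the residuals computed directly. The variable names and data are illustrative.

```python
import math

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x, mean_y = sum(xs) / n, sum(ys) / n
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

# Shortcut formula: sqrt(1 - r^2) * SDy.
rms_formula = math.sqrt(1 - r ** 2) * sd_y

# Direct calculation: root of the mean of the squared residuals.
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
rms_direct = math.sqrt(sum(e ** 2 for e in residuals) / n)

print(round(rms_formula, 4), round(rms_direct, 4))  # both are about 0.6928
```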