Lecture 10 Sabina Flashcards
Why is the residual important?
Whatever is left over after the predictors have done their work is the residual. This unexplained variance can be larger than the variance you DID explain, and it carries important information.
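A minimal sketch (the simulated homework/achievement data and the statsmodels library are my assumptions, not the lecture's) of pulling residuals out of a fitted model and comparing explained with unexplained variance:

```python
import numpy as np
import statsmodels.api as sm

# Simulated example: a fairly weak predictor, so unexplained variance exceeds explained variance
rng = np.random.default_rng(0)
homework = rng.normal(5, 2, 200)                               # hypothetical IV
achievement = 50 + 1.5 * homework + rng.normal(0, 10, 200)     # hypothetical DV

fit = sm.OLS(achievement, sm.add_constant(homework)).fit()

residuals = fit.resid                       # e = Y - Y': whatever the model did NOT explain
print("R^2 (explained):", round(fit.rsquared, 3))
print("1 - R^2 (unexplained):", round(1 - fit.rsquared, 3))
```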
How do you know that the regression relationship is not linear?
By looking at the residuals in a scatterplot.
How will you know if you’re testing the same population or different populations?
By looking at the residuals; this is why you MUST test the assumptions.
What are the assumptions underlying MR?
- The dependent variable is a linear function of the IVs
- Each observation is drawn independently
- Homoscedasticity of variance
- Errors are normally distributed, with the mean = 0
How do you know if you have a DV that is a linear function of the IVs? (first assumption)
Why is it important?
- Plot the DV against the IV
- Fit quadratic (etc.) term(s) in the regression and test them
- For a more detailed examination, use scatterplots of the residuals
- Plot the residuals against the IV
Why?
- e = Y - Y′ (the errors in prediction)
- Any departure from linearity is magnified in a plot of the residual terms (see the sketch below)
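A minimal sketch of plotting the residuals against the IV, using the same assumed simulated homework example, so that a departure from linearity stands out:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
homework = rng.uniform(0, 10, 200)                                            # hypothetical IV
achievement = 40 + 4 * homework - 0.3 * homework**2 + rng.normal(0, 5, 200)   # mildly curved DV

fit = sm.OLS(achievement, sm.add_constant(homework)).fit()   # straight-line model
residuals = fit.resid                                        # e = Y - Y'

plt.scatter(homework, residuals, alpha=0.5)
plt.axhline(0, color="grey")          # flat line at the mean of the residuals
plt.xlabel("homework (IV)")
plt.ylabel("residual")
plt.title("Curvature in this cloud signals a violation of linearity")
plt.show()
```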
How do you find a non-parametric best-fit line (one not forced to be straight) in SPSS?
Use the Lowess fit-line function in SPSS; it is not constrained to be a straight line. If there is no pattern left in the residuals, the Lowess line should be roughly straight. If there is a pattern, this line is telling you something important.
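The SPSS menu steps are not reproduced here; as a hedged stand-in, the lowess smoother in statsmodels plays the same role as the SPSS Lowess fit line:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Reusing the hypothetical curved homework/achievement example from the previous sketch
rng = np.random.default_rng(1)
homework = rng.uniform(0, 10, 200)
achievement = 40 + 4 * homework - 0.3 * homework**2 + rng.normal(0, 5, 200)
fit = sm.OLS(achievement, sm.add_constant(homework)).fit()

smoothed = lowess(fit.resid, homework, frac=0.6)   # pairs of [x, smoothed residual], sorted by x

plt.scatter(homework, fit.resid, alpha=0.4)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")   # Lowess line: roughly straight = no pattern left
plt.axhline(0, color="grey")
plt.xlabel("homework (IV)")
plt.ylabel("residual")
plt.show()
```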
How do you diagnose a violation of the linearity assumption?
The horizontal line is the mean of the e's.
- It is flat because the effect of the IV of interest (homework) was removed, so homework now has nothing to do with the residuals
- When two variables are unrelated, the best-fit line is just the mean of Y
› If the assumption of linearity is not violated, there is little or no departure from the regression line
› If the assumption of linearity is violated, the Lowess fit line will look somewhat curvilinear
If there is no departure from linearity in the plot of unstandardised residuals against predicted scores…
…the Lowess line should be close to the regression line (see the sketch below).
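A hedged sketch (simulated data with two made-up IVs) of the unstandardised-residuals-against-predicted-scores plot with a Lowess line laid over it:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(200, 2)))              # two hypothetical IVs plus a constant
y = X @ np.array([10.0, 2.0, -1.5]) + rng.normal(0, 3, 200)
fit = sm.OLS(y, X).fit()

predicted = fit.fittedvalues
smooth = lowess(fit.resid, predicted, frac=0.6)

plt.scatter(predicted, fit.resid, alpha=0.4)
plt.plot(smooth[:, 0], smooth[:, 1], color="red")   # should hug the flat zero line if linearity holds
plt.axhline(0, color="grey")
plt.xlabel("predicted scores")
plt.ylabel("unstandardised residuals")
plt.show()
```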
What is the problem with checking the assumption of linearity underlying MR?
As you just saw, if there is only a slight departure from linearity, you can easily miss it when using scatterplots.
› So, it is beneficial to use all of the methods.
› If theory and data suggest non-linearity, build non-linear term(s) into the regression equation and test them for statistical significance (a sketch follows).
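A minimal sketch, assuming simulated data and made-up variable names, of building a quadratic term into the equation and testing it for statistical significance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
homework = rng.uniform(0, 10, 300)
achievement = 40 + 4 * homework - 0.3 * homework**2 + rng.normal(0, 5, 300)

# Restricted (straight-line) model vs. full model with a quadratic term
linear = sm.OLS(achievement, sm.add_constant(homework)).fit()
X_quad = sm.add_constant(np.column_stack([homework, homework**2]))
quadratic = sm.OLS(achievement, X_quad).fit()

# t test on the squared term, and an F test comparing the two nested models
print("p for the homework^2 term:", quadratic.pvalues[2])
print("F, p, df for adding the term:", quadratic.compare_f_test(linear))
```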
What happens if the data are not drawn independently (e.g., there is a possibility of clusters)?
There is a risk of violating the assumption that the residual terms are independent.
› Violation of this assumption affects SEb (the standard error of the regression coefficient)
- Underestimating the errors is dangerous for hypothesis testing
› This danger lessens with larger N and "sophisticated" sampling techniques (one quick check is sketched below)
Box plots of the residuals for different groups will show differently sized boxes.
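One quick, hedged check for non-independent residuals is the Durbin-Watson statistic (it assumes the cases have a meaningful order, e.g., time); genuinely clustered data usually calls for multilevel models, which are beyond this sketch:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
x = rng.normal(size=200)

# Hypothetical autocorrelated errors to imitate non-independent observations
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 3 + 2 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(fit.resid)   # ~2 suggests independence; values well below 2 suggest positive autocorrelation
print("Durbin-Watson:", round(dw, 2))
```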
What do we need to watch out for with Homoscedasticity of variance?
A butterfly (fan) pattern or two large clusters in the residual plot violates the assumption: the variances are totally different.
Keith's rule of thumb: if the ratio of the highest to the lowest variance is < 10, it is not a problem (see the sketch below).
But there are other tests.
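A hedged sketch of Keith's ratio rule of thumb; splitting the cases into thirds along the predicted scores is my assumption about how to form the "high" and "low" variance groups:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = 5 + 2 * x + rng.normal(0, 1 + 0.8 * x, 300)   # error spread grows with x (heteroscedastic)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Split the cases into thirds by predicted score and compare their residual variances
thirds = np.array_split(np.argsort(fit.fittedvalues), 3)
variances = [fit.resid[idx].var(ddof=1) for idx in thirds]
ratio = max(variances) / min(variances)
print("high/low variance ratio:", round(ratio, 1), "-> worry if it reaches 10 (rule of thumb)")
```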
If the residuals are normally distributed…
…the scatter (e.g., in a normal P-P or Q-Q plot) is close to the "ideal" line.
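A minimal sketch of checking this with a Q-Q plot of the residuals (SPSS's normal P-P plot serves the same purpose); the data are simulated:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = 1 + 2 * x + rng.normal(0, 3, 200)
fit = sm.OLS(y, sm.add_constant(x)).fit()

# Points close to the reference line indicate approximately normal residuals
sm.qqplot(fit.resid, line="s")
plt.show()
```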
Most common errors in data
- a problem with coding
- different samples compared accidentally
- some subgroups need to be eliminated
Distance, leverage and influence
For some people the model will over-predict or under-predict; you should look at the size of these residuals.
The outlier with the largest residual is identified by its distance from the actual value.
Don't remove a case just because of its error component (that is unethical), but if you did remove it, the results would change. Such a person could, for example, have a learning disability (a variable not controlled for).
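A hedged sketch of flagging cases by distance using externally studentized residuals; the |value| > 2 cut-off is a common convention, not necessarily the lecture's:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 2 + 1.5 * x + rng.normal(0, 1, 100)
y[0] += 8                                         # one badly under-predicted case

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()
student = influence.resid_studentized_external    # each case's distance from its predicted value, in SE units

flagged = np.where(np.abs(student) > 2)[0]
print("cases with unusually large residuals:", flagged)
```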
Leverage
Refers to a ‘suspicious’ pattern of values in IVs
Diagnostic techniques:
- Graph the IVs against each other
- SPSS provides leverage statistics, so you can examine them
- A rule of thumb is based on the statistic (k+1)/N, where k = number of IVs and N = sample size; see Keith for more details (a sketch follows)
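A minimal sketch of examining leverage (hat) values outside SPSS; the cut-off of 2(k+1)/N is one common variant of the (k+1)/N-based rule of thumb, so check Keith for the exact version:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
N, k = 100, 2
X = rng.normal(size=(N, k))
X[0] = [5, -5]                                    # a 'suspicious' combination of IV values
y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(0, 1, N)

fit = sm.OLS(y, sm.add_constant(X)).fit()
leverage = fit.get_influence().hat_matrix_diag    # analogous to the leverage statistics SPSS saves

cutoff = 2 * (k + 1) / N                          # assumed variant of the (k+1)/N rule of thumb
print("high-leverage cases:", np.where(leverage > cutoff)[0])
```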