Regression Diagnostics and Assumptions Flashcards
Describe the data type assumptions of regression
- DV must be interval/ratio
- Predictors must be interval/ratio or dichotomous (only two categories)
- Categorical predictors with more than two categories can be used, but they have to be recoded into dichotomous dummy variables.
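A minimal sketch of dummy coding in plain Python. The variable `group`, its category labels and the helper `dummy_code` are made up for illustration. It assumes the usual convention of one 0/1 variable per category except a reference category.

```python
def dummy_code(values, reference):
    """Recode a categorical variable into 0/1 dummies, one per non-reference category."""
    categories = sorted(set(values) - {reference})
    return {f"is_{cat}": [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical predictor with three categories; "control" is the reference group.
group = ["control", "drug_a", "drug_b", "drug_a", "control"]
dummies = dummy_code(group, reference="control")
print(dummies)
# {'is_drug_a': [0, 1, 0, 1, 0], 'is_drug_b': [0, 0, 1, 0, 0]}
```

Cases in the reference group score 0 on every dummy variable.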
What do you do to assess whether the regression model is accurate for the sample?
Perform regression diagnostics
Look at outliers
What do you do to check if the regression model can be generalised?
Assumptions assessment
Look at distribution of residuals
Outliers
A case that differs substantially from the main trend of the data.
Problematic because it can affect the precision of the estimated regression coefficients.
Spotted by conducting regression diagnostics
What are the approaches to regression diagnostics?
Searching for large residuals. Good to use a scatter plot.
Searching for influential cases
What should you do if you find a small residual?
A case with a small residual can still be hugely influential, so also search for influential cases, which is a different check from searching for large residuals.
How do you detect outliers by searching for large residuals?
Look at the standardised residuals (residuals converted to z-scores) for individual cases, as this makes their size easier to interpret.
Standardised residuals below -3.0 or above +3.0 are a cause for concern.
Should also be concerned if more than 5% of cases have a standardised residual below -2.0 or above +2.0, because that exceeds what we would normally expect.
Look at casewise diagnostics in SPSS. Any participant with a large residual is displayed.
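A minimal sketch of the large-residual check in plain Python rather than SPSS. The data are made up, with one outlier planted on purpose. The sketch assumes a single predictor and residuals standardised by dividing by the standard error of the estimate.

```python
from math import sqrt

def standardised_residuals(x, y):
    """Residuals from a one-predictor least-squares fit, divided by the standard error of the estimate."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # n - 2: intercept and slope estimated
    return [e / s for e in residuals]

# Hypothetical data: a clean linear trend with small alternating noise.
x = list(range(1, 21))
y = [1 + 2 * xi + (0.5 if xi % 2 else -0.5) for xi in x]
y[9] += 15  # plant an outlier at case index 9 (x = 10)

z = standardised_residuals(x, y)
large = [i for i, zi in enumerate(z) if abs(zi) > 3]      # cases beyond +/-3
share_beyond_2 = sum(abs(zi) > 2 for zi in z) / len(z)    # should not exceed 0.05
print(large, share_beyond_2)
```

Only the planted case is flagged here. With real data, the flagged cases would be the ones to inspect first.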
How do you detect outliers by searching for influential cases?
Look at Cook’s distance in the residuals statistics. This reflects how much the predicted scores for the cases would change if the case in question were not included.
Cook’s distance should not exceed 1.
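A minimal sketch of Cook’s distance for a one-predictor regression in plain Python. The data are made up: the last case sits far from the others on x and pulls the line towards itself, so its residual stays small while its influence is large. The formula uses leverage and the mean squared error, which is an assumption about how the statistic is computed rather than SPSS output.

```python
def cooks_distances(x, y):
    """Return (Cook's distances, standardised residuals) for a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # parameters estimated: intercept and slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    distances = [e ** 2 / (p * mse) * h / (1 - h) ** 2
                 for e, h in zip(residuals, leverage)]
    z = [e / mse ** 0.5 for e in residuals]
    return distances, z

# Hypothetical data: nine cases on a straight line plus one extreme case on x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

distances, z = cooks_distances(x, y)
influential = [i for i, d in enumerate(distances) if d > 1]
print(influential)                 # [9]
print(max(abs(zi) for zi in z))    # below 2, so the residual check alone flags nothing
```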
What should you do if you find outliers?
- Ensure that outliers aren’t due to data entry error.
- Transform data (usually not a good idea)
- Consider deleting the case responsible for the outlier, but only if it produces a very large distortion. If in doubt, report results for the sample with and without the outlier.
Explain the assumptions about residuals in regression
Normality - Residuals should be normally distributed. Check with a histogram or scatterplot.
Linearity - Residuals should have a straight line relationship with predicted outcome scores. Look at scatterplots.
Homoscedasticity - The variance of the residuals should be roughly the same at all levels of the predicted outcome (equal spread along the regression line).
Explain the assumption of independence of error
For any two observations, the errors of prediction (residuals) should be uncorrelated. Error produced by two observations should not be due to the same reason.
Look at the Durbin-Watson statistic, which tests whether adjacent residuals are correlated. It varies between 0 and 4, with 2 meaning a lack of correlation.
- Above 2 = negative correlation
- Below 2 = positive correlation
The size of the statistic depends on the number of predictors and the number of observations.
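A minimal sketch of the Durbin-Watson statistic in plain Python, computed directly from residuals in case order. The two residual series are made up to show each direction. The function name `durbin_watson` is chosen here for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared differences between adjacent residuals, divided by the sum of squared residuals."""
    numerator = sum((curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:]))
    return numerator / sum(e ** 2 for e in residuals)

runs = [1, 1, 1, -1, -1, -1]     # neighbours tend to share a sign (positive correlation)
flips = [1, -1, 1, -1, 1, -1]    # neighbours tend to alternate (negative correlation)

d_runs = durbin_watson(runs)     # about 0.67, below 2
d_flips = durbin_watson(flips)   # about 3.33, above 2
print(d_runs, d_flips)
```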
What should you do if you violate a regression assumption?
Re-run regression using bootstrapping.
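A minimal sketch of one way to bootstrap a regression slope in plain Python: resample whole cases with replacement, refit, and take percentiles of the refitted slopes. The data, the 2000 resamples, the seed and the percentile interval are all assumptions for illustration. SPSS may use a different interval by default.

```python
import random

def slope(x, y):
    """Least-squares slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx

def bootstrap_slope_ci(x, y, n_boot=2000, seed=0):
    """Approximate 95% percentile interval for the slope from case resampling."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample cases with replacement
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # slope is undefined when every resampled x is identical
        slopes.append(slope(xs, [y[i] for i in idx]))
    slopes.sort()
    return slopes[int(0.025 * n_boot)], slopes[int(0.975 * n_boot) - 1]

# Hypothetical data: y is roughly 1 + 2x with small noise.
x = list(range(1, 13))
noise = [0.3, -0.2, 0.1, -0.4, 0.5, -0.1, 0.2, -0.3, 0.4, -0.5, 0.0, 0.1]
y = [1 + 2 * xi + e for xi, e in zip(x, noise)]

lower, upper = bootstrap_slope_ci(x, y)
print(slope(x, y), lower, upper)
```

The interval comes from the resampled data rather than from a formula that relies on the residual assumptions.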
Multicollinearity
Occurs when two predictors are very strongly correlated.
Check the correlation output to make sure correlations between predictors are not too high (.80 or above).
High multicollinearity suggests that the variables are measuring the same thing. Best to remove one of them if the correlations are high. Always a good idea to inspect.
Not a problem when the purpose of research is to predict the outcome variable from a set of predictors. A problem when you are interested in the separate effects of different predictors on an outcome.
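A minimal sketch of the correlation check in plain Python. The predictor names and values are made up, and the .80 cut-off is the one from this card. Every pair of predictors is correlated and the pairs at or above the cut-off are listed.

```python
from itertools import combinations
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(ss_a * ss_b)

# Hypothetical predictors for six cases.
predictors = {
    "hours_revised": [2, 4, 5, 7, 8, 10],
    "pages_read": [21, 39, 52, 68, 83, 99],
    "sleep_hours": [7, 8, 6, 8, 6, 7],
}

flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson_r(predictors[a], predictors[b])) >= 0.80
]
print(flagged)  # [('hours_revised', 'pages_read')]
```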
Tipping Effect
Concerns multicollinearity.
When two predictors differ slightly in their bivariate relationship with the outcome, they may end up differing greatly in their regression coefficients.
What are the solutions to multicollinearity?
- Delete one or more variables from the model. Can be done if the correlation between two predictors is very high, e.g. .90.
- Combine the collinear variables into a composite variable. Good when the correlation is around .80.
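A minimal sketch of one way to build a composite variable in plain Python: convert each collinear predictor to z-scores, then average them case by case. The variable names and values are made up. Averaging z-scores is an assumption here, since the card does not say how the composite should be formed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardise a variable to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def composite(*variables):
    """Average the z-scores of several variables for each case."""
    standardised = [z_scores(v) for v in variables]
    return [mean(case) for case in zip(*standardised)]

# Hypothetical collinear predictors for six cases.
hours_revised = [2, 4, 5, 7, 8, 10]
pages_read = [21, 39, 52, 68, 83, 99]

study_effort = composite(hours_revised, pages_read)
print(study_effort)
```

Standardising first stops the variable with the larger scale from dominating the composite.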