Lecture 12_Influence of outliers, assumptions, and multicollinearity in MLR Flashcards
What is an outlier?
An outlier is an observation, or case, with such an extreme value on one (or more) variables, that it distorts statistics.
What is the relationship between outliers and sample size?
The influence of outliers will be greater the smaller the sample size (Correlation and Regression procedures are very sensitive to outliers).
What can be done to minimize the impact of chance outliers?
Use a reasonably large sample size (N = 100 or more)
• This recommendation is not based upon statistical power, but on the Law of Large Numbers, the Central Limit Theorem, and common sense
What are some ways of detecting Multivariate Outliers using Case Statistics in MLR?
- Mahalanobis Distance
- Cook’s Distance
- DFFIT
- DFBETA
- Standardized DFFIT and DFBETA (interpreted as z-scores)
What is Mahalanobis Distance?
the distance of a case from the centroid of all the cases.
• You can think of it as a multivariate z-score.
• It can be evaluated for significance with the χ² [Chi square] distribution (using α = 0.001 and df = # of variables).
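A minimal Python sketch of the χ² screen described above, assuming NumPy is available. The function name `mahalanobis_sq`, the simulated data, and the planted extreme case are illustrative assumptions, not part of the lecture. The sketch works with the squared distance, which is the quantity compared against χ². With exactly two variables the χ² critical value has the closed form −2 ln(α); for other df a quantile function such as `scipy.stats.chi2.ppf(1 - alpha, df)` would be needed instead.

```python
import math
import numpy as np

def mahalanobis_sq(data):
    """Squared Mahalanobis distance of each row from the centroid."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = np.cov(data, rowvar=False)      # sample covariance (n - 1 denominator)
    inv_cov = np.linalg.inv(cov)
    # Row-wise quadratic form d_i' S^-1 d_i
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

# Simulated data (assumption): 100 cases, 2 variables, one planted extreme case
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))
data[0] = [6.0, -6.0]

d2 = mahalanobis_sq(data)
alpha = 0.001
critical = -2.0 * math.log(alpha)         # chi-square critical value, valid for df = 2 only
flagged = np.flatnonzero(d2 > critical)   # indices of cases beyond the critical value
print("critical value:", round(critical, 3))
print("flagged cases:", flagged)
```

Plotting `d2` against case ID with a horizontal line at `critical` gives the visual check of potential outliers.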
What is the trick to get SPSS to calculate Mahalanobis Distance based on all variables?
- As calculated by SPSS REGRESSION, it is based only on the predictor variables (X’s) included.
- a “trick” to get it based on all variables is to regress Sample ID (“outcome” variable) onto all X’s and Y (Y is treated as a predictor).
- This should be done before conducting substantive multiple regression analyses.
- Graph Sample ID (x-axis) and Mahalanobis Distance (y-axis) to get a visual of potential outliers (points above critical value of Chi Square)
What is Cook’s Distance?
a global (composite) measure of a case’s influence within the regression analysis.
• It expresses how much the regression coefficients (b's) would change if a case were excluded.
• It is affected by a case being an outlier on Y and/or on the set of predictors (X's).
• Values > 1.0 are generally considered large (a common cutoff), but it can also be tested using the F distribution.
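As a rough illustration, Cook's distance can be computed from the residuals and leverages of an ordinary least-squares fit. The sketch below assumes NumPy; the function name `cooks_distance`, the simulated data, and the planted case are hypothetical. It uses the standard closed form D_i = e_i² / (p · MSE) · h_ii / (1 − h_ii)², where p counts the intercept plus the slopes.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each case in an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add intercept column
    n, p = design.shape                               # p = intercept + slopes
    b = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ b
    mse = resid @ resid / (n - p)
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'
    xtx_inv = np.linalg.inv(design.T @ design)
    leverage = np.einsum("ij,jk,ik->i", design, xtx_inv, design)
    return (resid**2 / (p * mse)) * leverage / (1.0 - leverage) ** 2

# Simulated data (assumption) with one case that is extreme on X and on Y
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=50)
X[0] = [4.0, 4.0]
y[0] = -10.0

d = cooks_distance(X, y)
large = np.flatnonzero(d > 1.0)   # cases above the 1.0 rule of thumb
print("cases with Cook's D > 1:", large)
```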
What is DFFIT?
a global measure of influence on the regression equation as a whole (indicates how much the case’s fitted value (Y′) will change if the case is excluded).
– provides information interchangeable with Cook’s distance (but on the scale of the Dependent Variable)
What is DFBETA?
a coefficient-specific measure of influence
– indicates how much each regression coefficient, intercept and slopes (a and b’s), will change if a case is excluded.
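Both definitions can be made concrete by literally refitting the model with each case left out. The sketch below assumes NumPy and uses hypothetical names (`dffit_dfbeta`, the simulated data, the planted case). The sign convention (full-sample value minus case-deleted value) is an assumption; statistical packages may report the opposite sign.

```python
import numpy as np

def ols(design, y):
    """Least-squares coefficients for a design matrix that includes the intercept."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

def dffit_dfbeta(X, y):
    """Unstandardized DFFIT (per case) and DFBETA (per case and coefficient)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    n = len(y)
    b_full = ols(design, y)
    dfbeta = np.empty_like(design)   # shape (n, p): column 0 = intercept, rest = slopes
    dffit = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_del = ols(design[keep], y[keep])   # refit without case i
        dfbeta[i] = b_full - b_del           # change in each coefficient
        dffit[i] = design[i] @ dfbeta[i]     # change in case i's fitted value
    return dffit, dfbeta

# Simulated data (assumption) with one planted influential case
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=40)
X[0] = [3.0, 3.0]
y[0] = 20.0

dffit, dfbeta = dffit_dfbeta(X, y)
worst = int(np.argmax(np.abs(dffit)))
print("case with largest |DFFIT|:", worst)
print("its DFBETAs (intercept, slopes):", dfbeta[worst])
```

The refit loop is slow for large samples but mirrors the definitions directly; closed-form versions based on leverage give the same numbers.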
What is an advantage for using standardized values for DFFIT and DFBETA?
may be interpreted as z-scores (SD’s away from the mean), so absolute values > 2 are a concern.
In practice, what is a good habit to adopt when deciding whether to keep or exclude outliers?
run both ways (with and without outliers) and see if there are any substantial differences (aka. “sensitivity analysis”)
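A small sketch of such a sensitivity analysis, assuming NumPy. The data are simulated, the planted case imitates a data-entry error, and the `suspect` index list is a hypothetical stand-in for whatever cases the case statistics flagged.

```python
import numpy as np

def fit_ols(X, y):
    """Return [intercept, slopes...] from an OLS fit."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Simulated data (assumption): true intercept 2.0, true slope 1.5
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 1))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=0.5, size=60)
X[0, 0], y[0] = 5.0, -8.0        # planted data-entry-style error

suspect = np.array([0])          # hypothetical: cases flagged as outliers
keep = np.ones(len(y), dtype=bool)
keep[suspect] = False

b_all = fit_ols(X, y)                    # run with the suspect cases
b_trimmed = fit_ols(X[keep], y[keep])    # run without them
for name, with_, without in zip(["intercept", "slope"], b_all, b_trimmed):
    print(f"{name}: with = {with_:.3f}, without = {without:.3f}, "
          f"change = {with_ - without:+.3f}")
```

If the two runs lead to the same substantive conclusions, the decision to keep or exclude matters little; if they differ, both sets of results are worth reporting.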
What is a common cause of outliers?
Human error in data entry (always check your raw data!)
Which is the most helpful case statistic?
Cook’s distance (not affected by either transformation: centering predictor variables or standardizing all variables); use DFFIT and DFBETAs with caution.
- DFFIT not affected by centering predictors, but is affected by standardizing all variables.
- DFBETA (centering predictors): intercept affected, but slopes are not.
- DFBETA (standardizing all variables): intercept and slopes are affected.
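These invariance claims can be checked numerically. The sketch below assumes NumPy, uses simulated data, and computes unstandardized case statistics by leave-one-out refits (full-sample value minus case-deleted value, an assumed sign convention). The helper names `case_stats`, `zscore`, and `same` are hypothetical.

```python
import numpy as np

def case_stats(X, y):
    """Cook's D, DFFIT and DFBETA for each case via leave-one-out refits."""
    design = np.column_stack([np.ones(len(y)), X])
    n, p = design.shape
    b = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ b
    mse = resid @ resid / (n - p)
    xtx = design.T @ design
    cook, dffit, dfbeta = np.empty(n), np.empty(n), np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        b_del = np.linalg.lstsq(design[keep], y[keep], rcond=None)[0]
        dfbeta[i] = b - b_del                            # change in coefficients
        dffit[i] = design[i] @ dfbeta[i]                 # change in fitted value
        cook[i] = dfbeta[i] @ xtx @ dfbeta[i] / (p * mse)
    return cook, dffit, dfbeta

def zscore(a):
    """Standardize each column (or a vector) to mean 0, SD 1."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def same(a, b):
    return bool(np.allclose(a, b))

# Simulated data (assumption): predictors deliberately not centered at zero
rng = np.random.default_rng(4)
X = rng.normal(loc=10.0, scale=3.0, size=(40, 2))
y = 5.0 + 0.8 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=2.0, size=40)

raw = case_stats(X, y)
cen = case_stats(X - X.mean(axis=0), y)      # centered predictors
std = case_stats(zscore(X), zscore(y))       # all variables standardized

print("statistic         unchanged by centering / by standardizing")
print("Cook's D         ", same(raw[0], cen[0]), same(raw[0], std[0]))
print("DFFIT            ", same(raw[1], cen[1]), same(raw[1], std[1]))
print("DFBETA intercept ", same(raw[2][:, 0], cen[2][:, 0]), same(raw[2][:, 0], std[2][:, 0]))
print("DFBETA slopes    ", same(raw[2][:, 1:], cen[2][:, 1:]), same(raw[2][:, 1:], std[2][:, 1:]))
```

If the statements above hold, only Cook's D should show `True` in both columns; DFFIT and the DFBETA slopes should be unchanged by centering only, and the DFBETA intercept by neither.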
Dr. Becksted’s Recommended approach…
• Prior to conducting substantive MLR, regress a variable such as subject ID (as the “outcome”) onto all relevant variables (X’s and Y) and request Mahalanobis Distances.
• Exclude any identified outliers based on a conservative critical value of χ² at α = .001 and df = # of variables.
• Then run substantive MLR and request Cook, DFFIT, DFBETA & standardized versions of DFFIT & DFBETA.
• Examine the distribution of these case statistics for influential outliers using DESCRIPTIVES command. If these statistics concur in identifying problem cases, exclude these cases and re-run substantive MLR.
[caution: once outliers are excluded, others may be identified on the re-analysis, so use caution and common sense].
Violation of an assumption in MLR may potentially lead to one of which 2 problems?
– Estimates of the regression coefficients may be biased.
– Estimates of the standard errors of the regression coefficients may be biased.
(Bias = the estimate based on the sample data will not, on average, equal the true value)