Lecture 12: Influence of outliers, assumptions, and multicollinearity in MLR (Flashcards)

1
Q

What is an outlier?

A

An outlier is an observation (case) with such an extreme value on one or more variables that it distorts statistics.

2
Q

What is the relationship between outliers and sample size?

A

The smaller the sample size, the greater the influence of outliers (correlation and regression procedures are very sensitive to outliers).

3
Q

What can be done to minimize the impact of chance outliers?

A

Use a reasonably large sample size (N = 100 or more).
• This recommendation is not based on statistical power, but on the Law of Large Numbers, the Central Limit Theorem, and common sense.

4
Q

What are some ways of detecting Multivariate Outliers using Case Statistics in MLR?

A
  • Mahalanobis Distance
  • Cook’s Distance
  • DFFIT
  • DFBETA
  • Standardized DFFIT and DFBETA (interpreted as z-scores)
5
Q

What is Mahalanobis Distance?

A

The distance of a case from the centroid (multivariate mean) of all the cases.
• You can think of it as a multivariate z-score.
• It can be evaluated for significance with the χ² [Chi square] distribution (using α = 0.001 and df = # of variables).
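
For illustration, a minimal Python sketch of this significance check (hedged: the lecture works in SPSS; the data matrix here is hypothetical):

    import numpy as np
    from scipy.stats import chi2

    # Hypothetical data matrix: rows = cases, columns = variables
    X = np.random.default_rng(1).normal(size=(100, 3))

    # Squared Mahalanobis distance of each case from the centroid
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

    # Compare to chi-square with df = number of variables at alpha = .001
    critical = chi2.ppf(1 - 0.001, df=X.shape[1])
    print('Potential outliers:', np.where(d2 > critical)[0])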

6
Q

What is the trick to get SPSS to calculate Mahalanobis Distance based on all variables?

A
  • As calculated by SPSS REGRESSION, it is based only on the predictor variables (X’s) included in the model.
  • A “trick” to get it based on all variables is to regress Sample ID (treated as the “outcome” variable) onto all X’s and Y (i.e., Y is treated as a predictor).
  • This should be done before conducting the substantive multiple regression analyses.
  • Graph Sample ID (x-axis) against Mahalanobis Distance (y-axis) to get a visual of potential outliers (points above the critical value of chi-square).
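
A Python sketch of the equivalent computation and plot (hedged: the lecture does this via SPSS REGRESSION; the variables here are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import chi2

    rng = np.random.default_rng(2)
    x1, x2 = rng.normal(size=(2, 100))      # hypothetical predictors
    y = x1 + x2 + rng.normal(size=100)      # hypothetical outcome

    # Computing the distance over the X's and Y together is what the
    # ID-regression trick yields in SPSS
    XY = np.column_stack([x1, x2, y])
    diff = XY - XY.mean(axis=0)
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(np.cov(XY, rowvar=False)), diff)

    plt.scatter(np.arange(1, 101), d2)                            # Sample ID vs. distance
    plt.axhline(chi2.ppf(0.999, df=XY.shape[1]), linestyle='--')  # chi-square cut-off
    plt.xlabel('Sample ID'); plt.ylabel('Mahalanobis distance')
    plt.show()
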
7
Q

What is Cook’s Distance?

A
A global (composite) measure of a case’s influence within the regression analysis.
• It expresses how much the regression coefficients (b's) would change if the case were excluded.
• It is affected by a case being an outlier on Y and/or on the set of predictors (X's).
• Values > 1.0 are generally considered large (a common cut-off), but it can also be tested using the F distribution.
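
A minimal Python sketch of obtaining Cook’s distance, assuming statsmodels and hypothetical data (the lecture requests it through SPSS):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 2))
    y = X @ [0.5, 0.3] + rng.normal(size=100)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    cooks_d, p_values = fit.get_influence().cooks_distance  # F-distribution p-values
    print('Large influence:', np.where(cooks_d > 1.0)[0])   # the > 1.0 rule of thumb
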
8
Q

What is DFFIT?

A

a global measure of influence on the regression equation as a whole (indicates how much the case’s fitted value (Y′) will change if the case is excluded).
– provides information interchangeable with Cook’s distance (but on the scale of the Dependent Variable)

9
Q

What is DFBETA?

A

a coefficient-specific measure of influence

– indicates how much each regression coefficient, intercept and slopes (a and b’s), will change if a case is excluded.

10
Q

What is an advantage of using standardized values for DFFIT and DFBETA?

A

They may be interpreted as z-scores (SD’s away from the mean), so absolute values > 2 are a concern.
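
A sketch of the standardized versions in Python, assuming statsmodels (whose dffits and dfbetas attributes are the standardized forms; the data are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 2))
    y = X @ [0.5, 0.3] + rng.normal(size=100)

    influence = sm.OLS(y, sm.add_constant(X)).fit().get_influence()
    sdffit, _ = influence.dffits        # standardized DFFIT, one value per case
    sdfbeta = influence.dfbetas         # standardized DFBETA, one column per coefficient
    flagged = (np.abs(sdffit) > 2) | (np.abs(sdfbeta) > 2).any(axis=1)
    print('Cases beyond |2|:', np.where(flagged)[0])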

11
Q

In practice, what is a good habit to adopt when deciding whether to keep or exclude outliers?

A

Run the analysis both ways (with and without the outliers) and see whether there are any substantial differences (a.k.a. a “sensitivity analysis”).

12
Q

What is a common cause of outliers?

A

Human error in data entry (always check your raw data!)

13
Q

Which is the most helpful case statistic?

A

Cook’s distance (it is not affected by either transformation: centering the predictor variables or standardizing all variables); use DFFIT and DFBETAs with caution.

  • DFFIT is not affected by centering the predictors, but is affected by standardizing all variables.
  • DFBETA (centering predictors): the intercept is affected, but the slopes are not.
  • DFBETA (standardizing all variables): both the intercept and the slopes are affected.
14
Q

Dr. Becksted’s Recommended approach…

A

• Prior to conducting the substantive MLR, regress a variable such as subject ID onto all relevant variables (X’s and Y) and request Mahalanobis Distances.
• Exclude any identified outliers based on the conservative critical value of χ² at α = .001 and df = # of variables.
• Then run the substantive MLR and request Cook’s distance, DFFIT, DFBETA, and the standardized versions of DFFIT and DFBETA.
• Examine the distributions of these case statistics for influential outliers using the DESCRIPTIVES command. If these statistics concur in identifying problem cases, exclude those cases and re-run the substantive MLR.
[Caution: once outliers are excluded, others may be identified in the re-analysis, so use caution and common sense.]
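
A condensed Python sketch of this two-stage screen (hedged: the lecture’s workflow uses SPSS REGRESSION and DESCRIPTIVES; the data and variable names here are illustrative):

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 2))
    y = X @ [0.5, 0.3] + rng.normal(size=100)

    # Stage 1: Mahalanobis screen over all variables (X's and Y), alpha = .001
    XY = np.column_stack([X, y])
    diff = XY - XY.mean(axis=0)
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(np.cov(XY, rowvar=False)), diff)
    keep = d2 <= chi2.ppf(0.999, df=XY.shape[1])

    # Stage 2: substantive MLR on the retained cases, then the case statistics
    infl = sm.OLS(y[keep], sm.add_constant(X[keep])).fit().get_influence()
    print('Max Cook:', infl.cooks_distance[0].max())
    print('Max |standardized DFFIT|:', np.abs(infl.dffits[0]).max())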

15
Q

Violation of an assumption in MLR may potentially lead to one of which 2 problems?

A

– Estimates of the regression coefficients may be biased.
– Estimates of the standard errors of the regression coefficients may be biased.
(Bias = the estimate based on the sample data will not, on average, equal the true value)

16
Q

What are 7 assumptions of MLR?

A
  1. The dependent variable is a linear function of the independent variables.
  2. The observations are independent of one another.
  3. Homoscedasticity – the variance of the residuals is not a function of any independent variables.
  4. The residuals (prediction errors) are normally distributed.
  5. The dependent variable does not influence any of the independent variables.
  6. The independent variables are measured without error.
  7. The regression model must include all common causes of the presumed causes (IVs) and the presumed effect (DV).
17
Q
  1. The DV is a linear function of the IV’s.
A

– Assumes that the mean of the Y scores at each value of X falls along a straight line in the population.
– If violated, ALL of the estimates (R², coefficients, and standard errors) may be biased.
– This is the most important of the assumptions.

18
Q
  2. The observations are independent of one another.
A

– Independence of the residuals in the population is assumed.
– There must be no relationship among the residuals for any subset of the cases in the analysis.
– The assumption is met in any random sample from a population; however, if the data are clustered (e.g., multiple sites) or temporally linked (e.g., pre- and post-test), the residuals may not be independent.
– If violated: the coefficients are fine, but the standard errors will be incorrect (usually too small).

19
Q
  3. Homoscedasticity (“Homo-ski-das-ticity”)
A

– The variance of the residuals is not a function of any predictor(s).
– When violated: the coefficients are fine, but the standard errors will be incorrect.
– In practice, the standard errors will be very close to the correct values unless there is a very large (10:1) ratio between the nonconstant conditional variances.

20
Q
  4. The residuals (prediction errors) are normally distributed.
A

– For any value of an IV, the residuals around the regression line are assumed to have a normal distribution.
– Violations do not affect the coefficients, but can bias standard errors, especially when sample sizes are small.
– (one more reason to use large sample sizes)

21
Q
  5. The dependent variable does not influence any of the independent variables.
A

– This is the assumption that the model is recursive, meaning that the causal flow is in one direction (X → Y).
– The model does not contain feedback loops (X ⇄ Y).

22
Q
  6. The independent variables are measured without error.
A

– IVs are assumed to have perfect reliability (rₓₓ = 1.0).
– When violated, the magnitudes of the regression coefficients and R² will be attenuated (i.e., biased toward 0).
– This never holds exactly in practice.

23
Q
  7. The regression model must include all common causes of the presumed causes (IVs) and the presumed effect (DV).
A

– All confounding variables need to be included for the effects of the key IVs to be estimated correctly.
– This is impossible in practice, but do the best you can.

24
Q

Which assumptions give us the most grief?

A

Assumptions 3 (homoscedasticity) and 4 (normality of the residuals).

25
Q

What are the key assumptions for unbiased estimation of the standard errors?

A

• Normality (4) and homoscedasticity (3) of the residuals.

[If these are violated, tests of significance and confidence intervals will not be correct.]

26
Q

How can assumptions regarding the distribution of the residuals be assessed?

A

graphically

27
Q

How can the normality assumption be assessed?

A

with histograms and normal quantile (Q-Q) plots of the residuals

28
Q

How can homoscedasticity be assessed?

A

using scatter plots (e.g., residuals against predicted values) and binned bar charts
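
A minimal Python sketch of these graphical checks, assuming statsmodels and hypothetical data (the lecture produces the plots in SPSS):

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    X = rng.normal(size=(100, 2))
    y = X @ [0.5, 0.3] + rng.normal(size=100)
    fit = sm.OLS(y, sm.add_constant(X)).fit()

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].hist(fit.resid, bins=15)                # normality: histogram of residuals
    sm.qqplot(fit.resid, line='s', ax=axes[1])      # normality: normal quantile plot
    axes[2].scatter(fit.fittedvalues, fit.resid)    # homoscedasticity: spread vs. fitted
    axes[2].axhline(0, linestyle='--')
    plt.show()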

29
Q

Does MLR make any assumptions about the distributions of the independent or dependent variables?

A

No; it makes assumptions only about the distribution of the residuals from the regression analysis (that they are normally distributed).

30
Q

What condition needs to be met so that MLR is fairly robust to modest violations of assumptions?

A

sample size is reasonably large

31
Q

What is Multicollinearity?

A

A property of the correlations among a set of predictor variables (IVs), including the bivariate correlations between predictors and the correlation between a given predictor and the linear combination of all the other predictors.

32
Q

With multicollinearity, what causes interpretation and computational problems for regression analysis?

A

When one or more of these correlations is large.
• When the correlation between two predictors is about |r| ≈ .70, there is ambiguity in interpreting the regression coefficients [note that r² ≈ .50, i.e., roughly 50% shared variance between the two predictors].
• When the correlation between two predictors is about |r| ≈ .90, estimates of the regression coefficients and their standard errors become quite unstable.

33
Q

What are some variable-specific diagnostic statistics for assessing multicollinearity?

A
  • VIF, the variance inflation factor
  • Tolerance

34
Q

What is the variance inflation factor (VIF)?

A

– Indicates the inflation in the variance of a regression coefficient as a consequence of the correlations among the IVs: VIFⱼ = 1 / (1 − Rⱼ²), where Rⱼ² comes from regressing predictor j on all the other predictors.

35
Q

What is Tolerance?

A

The proportion of variance in a predictor variable that is orthogonal to (independent of) the other predictor variables, and thus available to uniquely predict Y.
– i.e., Tolerance = 1 − Rⱼ² = 1 / VIF: how much of the predictor’s variance is left over to contribute uniquely to R².
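
A sketch of computing both diagnostics in Python, assuming statsmodels (the lecture reads them from SPSS collinearity output; the data here are hypothetical):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(7)
    x1 = rng.normal(size=100)
    x2 = 0.8 * x1 + rng.normal(scale=0.6, size=100)   # deliberately correlated predictors
    X = sm.add_constant(np.column_stack([x1, x2]))

    for j in (1, 2):                                  # skip the constant column
        vif = variance_inflation_factor(X, j)
        print(f"x{j}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")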

36
Q

What does the square root of the VIF tell us?

A

The factor by which the SE of the regression coefficient is inflated relative to the case where the IVs are uncorrelated (e.g., VIF = 4 means √4 = 2, so the SE is twice as large).

37
Q

As correlations among predictors increase, what happens to R²?

A

R² gets smaller