6. Regression Assumptions, Diagnostics and Influential Cases Flashcards

1
Q

How many assumptions are there for multiple linear regression?

A

9 mathematical
+2 design
=11

2
Q

What are the 2 design assumptions of multiple linear regression?

A

Independence (each participant only 1 score on each IV)

Interval Scale on IV and DV (or dichotomous IV)

3
Q

What are the 9 mathematical assumptions of multiple linear regression?

A

Normality (6 sub-assumptions)
No multicollinearity (3 ways to check)
Linearity
Normal distribution of residuals
Independent residuals
Residuals unrelated to predictors
Homogeneity of variance

4
Q

What are the 6 tests of normality?

A
Symmetry
Modality 
Skew 
Kurtosis 
Outliers 
Shapiro-Wilk
5
Q

What do you check with the assumption of symmetry?

A

Mean = Median = Mode

6
Q

What do you check for in modality?

A

Only 1 most frequently occurring score (Unimodal not multi/bimodal)

7
Q

What do you check for in skew and kurtosis?

A

Skew / SE(skew) and Kurtosis / SE(kurtosis) — the resulting z-values should fall within ±1.96

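The skew z-test on the card can be sketched in pure Python. This assumes the common approximation SE(skew) ≈ √(6/n) and moment-based skewness; |z| > 1.96 would suggest significant skew:

```python
import math

def skew_z(data):
    """z-statistic for skew: sample skewness divided by its approximate SE."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    skew = m3 / m2 ** 1.5                         # moment-based skewness
    se_skew = math.sqrt(6 / n)                    # approximate standard error
    return skew / se_skew

print(skew_z([1, 2, 3, 4, 5]))   # perfectly symmetric data -> 0.0
```

The same ratio with kurtosis and SE(kurtosis) ≈ √(24/n) gives the kurtosis z-test.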
8
Q

What constitutes an outlier?

A

95% of cases should have standardized scores (z) within ±1.96

No more than 3% of cases should be beyond ±2.58

If there are more… those cases are outliers

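The outlier screen on the card can be sketched with the stdlib `statistics` module: standardize the scores and report the proportion within ±1.96 and beyond ±2.58:

```python
import statistics

def outlier_report(data):
    """Return (proportion of |z| <= 1.96, proportion of |z| > 2.58)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)          # sample standard deviation
    zs = [(x - mean) / sd for x in data]
    n = len(zs)
    within_196 = sum(abs(z) <= 1.96 for z in zs) / n
    beyond_258 = sum(abs(z) > 2.58 for z in zs) / n
    return within_196, beyond_258

print(outlier_report(list(range(1, 11))))  # (1.0, 0.0): no outliers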
9
Q

What do you check for in the Shapiro-Wilk statistic?

A

That it is not significant (p > .05)

10
Q

What are the 3 checks for multicollinearity?

A

Pearson correlations between the IVs (should not be excessively high)

Tolerance

VIF

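For the special case of exactly two predictors, the R² from regressing one IV on the other is simply r², so Tolerance = 1 − r² and VIF = 1 / Tolerance. A pure-Python sketch (the example data are made up; a VIF above 10 is the usual warning threshold):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor when there are only two IVs."""
    r2 = pearson_r(x1, x2) ** 2
    tolerance = 1 - r2
    return 1 / tolerance   # VIF > 10 signals multicollinearity

x1 = [1, 2, 3, 4, 5]
x2 = [1, 3, 2, 5, 4]       # moderately correlated with x1 (r = 0.8)
print(vif_two_predictors(x1, x2))
```

With more than two predictors the same idea applies, but the R² comes from regressing each IV on all the others.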
11
Q

What does VIF stand for?

A

Variance inflation factor

12
Q

Where, and for what, do you look to check whether the residuals have a normal distribution?

A
Mean of residuals = 0
No skew ("snaking") and no kurtosis ("sagging") in the P-P plot and histogram
No outliers in the histogram
13
Q

Why are the residual statistics so important?

A

Because if they aren’t normally distributed, we can’t say that 68% of cases will fall within ±1 RMSE of the regression line

14
Q

How do we check linearity and why do we check it?

A

Using the Pearson correlation between each IV and the DV,

because if an IV is not related to the DV it can’t be a good predictor

15
Q

How is the Independence of Residuals tested?

A

Using the Durbin-Watson statistic

16
Q

What does the Durbin Watson show?

A

The independence of residuals

17
Q

When reading the Durbin-Watson, what are we looking for to meet our assumption of independent residuals?

A

Values between 1.5 and 2.5

The statistic actually ranges from 0 to 4: values near 0 indicate strong positive autocorrelation, 2 indicates independent residuals, and values near 4 indicate strong negative autocorrelation

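The Durbin-Watson statistic can be computed directly from the residuals as d = Σ(eₜ − eₜ₋₁)² / Σeₜ², which makes the 0-to-4 range concrete. A minimal sketch with made-up residual sequences:

```python
def durbin_watson(residuals):
    """Durbin-Watson d: near 2 = independent residuals,
    near 0 = strong positive autocorrelation, near 4 = strong negative."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

print(durbin_watson([1, 1, 1, 1]))    # 0.0 (perfectly positively correlated)
print(durbin_watson([1, -1, 1, -1]))  # 3.0 (strongly negatively correlated)
```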
18
Q

What do we look at to test homogeneity of variance?

A

The scatterplot of standardized predicted value against the studentized residual
(don’t want any funnelling or patterns)

19
Q

What is it called when there is funnelling (an unequal distribution on either side of x = 0) and the assumption of homogeneity is not met?

A

Heteroscedasticity

20
Q

What is evidence of homoscedasticity (i.e. that the assumption of homogeneity of variance is met)?

A

No pattern or funnelling

Equal distribution on either side of x=0 (divide graph in half)

21
Q

How is the ‘Residuals unrelated to predictors’ assumption checked?

A

By obtaining a Pearson correlation between each IV and the unstandardized residuals (RES_1)

22
Q

What should the correlation between the predictors and the residuals be?

A

0 and non sig.
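The check above gives r ≈ 0 by construction for OLS: the residuals from a least-squares fit are exactly uncorrelated with the predictors. A pure-Python sketch with assumed example data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_residuals(x, y):
    """Residuals from a simple least-squares fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    a0 = my - b * mx
    return [c - (a0 + b * a) for a, c in zip(x, y)]

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]
res = ols_residuals(x, y)
print(pearson_r(x, res))   # effectively 0, up to floating-point error
```

In practice the check matters because it is run on saved residuals (RES_1), where a non-zero correlation would indicate a mis-specified model rather than an OLS arithmetic fact.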

23
Q

What should we do if the assumptions are violated?

A

Question the validity of the model and be cautious about any interpretations drawn from it

24
Q

What three violations of assumptions cause the most problems for linear regression?

A

Normality (especially on the DV)
Homogeneity of Variance
Presence of outliers

25
Q

What are the 3 options if normality is violated?

A
Transformation — no (especially if skew is significant)
Bootstrapping — yes (gives less biased estimates)
Deal with outliers first (check their influence)
26
Q

What are considered extreme cases in a data set?

A

> 2SD from the mean

27
Q

Why are outliers a problem?

A

They affect the values of the estimated regression coefficients, resulting in a biased model

28
Q

Where are the problem cases located in SPSS output?

A

Casewise Diagnostics

29
Q

What should you look at to determine the amount of influence the outliers are having?

A
Studentized residuals (Y − Ŷ, i.e. the error)
Influential cases
30
Q

What does it mean if a case has a large residual?

A

It doesn’t fit the model well and should be checked as a possible outlier

31
Q

What are the 3 types of residuals?

A

Unstandardized
Standardized
Studentized (most precise)

32
Q

What are the 8 statistics that can be used to assess the influence of a particular case on a model?

A
Adjusted predicted value.
Deleted residual and the studentized deleted residual.
DFFit and standardized DFFit.
Cook’s distance.
Leverage.
Mahalanobis distances.
DFBeta and Standardized DFBeta.
Covariance ratio.
33
Q

What is the rule for Adj Pred Value?

A

It should be approximately equal to the predicted value

34
Q

What is the rule for the studentized deleted residual?

A

Within the range of -2 to 2

35
Q

What is the rule for Mahalanobis Distance?

A

When:
N = 500: values of 25+ are problematic
N = 100 and k = 3: values of 15+ are problematic
N = 30 and k = 2: values of 11+ are problematic

36
Q

What is the rule for Cook’s distance?

A

1.0+ = bad

Close to 0 = good
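For a one-predictor model, Cook's distance can be sketched from first principles as Dᵢ = eᵢ² / (p·MSE) × hᵢ / (1 − hᵢ)², with p = k + 1 parameters and leverage hᵢ = 1/n + (xᵢ − x̄)²/Sxx. The example data below are made up, with a deliberate outlier as the last case:

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a simple (one-predictor) regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    a0 = my - b * mx
    resid = [c - (a0 + b * a) for a, c in zip(x, y)]
    p = 2                                         # intercept + one predictor
    mse = sum(e ** 2 for e in resid) / (n - p)    # residual mean square
    hats = [1 / n + (a - mx) ** 2 / sxx for a in x]  # leverage values
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(resid, hats)]

d = cooks_distances([1, 2, 3, 4, 5], [1, 2, 3, 4, 10])
print(d)   # the last case has by far the largest distance (> 1.0 = bad)
```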

37
Q

What is the rule for Leverage values?

A

A case is a problem if its leverage value is more than twice the average leverage value

Average leverage value = (k + 1) / n
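The leverage rule can be sketched for a one-predictor model (k = 1), where each hat value is hᵢ = 1/n + (xᵢ − x̄)²/Sxx; cases exceeding 2 × (k + 1)/n are flagged. The example x values are made up, with one extreme case:

```python
def flag_high_leverage(x):
    """Flag cases whose leverage exceeds twice the average (k + 1) / n,
    for a simple regression with k = 1 predictor."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((a - mx) ** 2 for a in x)
    hats = [1 / n + (a - mx) ** 2 / sxx for a in x]  # hat (leverage) values
    avg = (1 + 1) / n                                # (k + 1) / n with k = 1
    return [h > 2 * avg for h in hats]

print(flag_high_leverage([1, 2, 3, 4, 5, 20]))  # only the extreme x is flagged
```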

38
Q

What is the rule for the covariance ratio?

A

If above the upper end of the range (> 1 + [3(k + 1)/n]): DON’T DELETE the case
If below the lower end of the range (< 1 − [3(k + 1)/n]): deleting the case may improve the precision of the model

39
Q

What is the rule for DFFit?

A

It depends on the range of the scale (e.g. 0-1 vs. 1-100), so judge it relative to that scale: the DFFit value should be close to 0 (on a 0-1 scale a value of 0.5 is terrible, but on a 1-100 scale it’s nothing)

40
Q

What is the rule for the SD DFFit?

A

Should be between -2 and 2

41
Q

What is the rule for the standardized DFBeta?

A

Values beyond ±2 = bad

42
Q

What should we do after removing the outliers?

A

Run the regression again and compare the new and old

43
Q

What should happen if the outliers have been correctly removed?

A

The RMSE should shrink
The Rsq should get larger
Assumptions should be closer to being met
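The before/after comparison can be sketched in pure Python: fit a simple regression with and without the outlier and check that RMSE shrinks and R² grows. The data are made up (points on a perfect line plus one outlier); RMSE here uses √(SSE/n) rather than the n − p variant some texts prefer:

```python
import math

def fit_stats(x, y):
    """Return (RMSE, R^2) for a simple least-squares fit y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    a0 = my - b * mx
    sse = sum((c - (a0 + b * a)) ** 2 for a, c in zip(x, y))
    sst = sum((c - my) ** 2 for c in y)
    return math.sqrt(sse / n), 1 - sse / sst

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 20]              # the last case is an outlier
before = fit_stats(x, y)
after = fit_stats(x[:-1], y[:-1])    # refit with the outlier removed
print(before, after)                 # after: (0.0, 1.0) — a perfect line
```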