assumptions and diagnostics Flashcards

1
Q

assumptions of the linear model (LINE)

A

L = linearity
I = independence of errors
N = normality of errors
E = equal variance (homoscedasticity)

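A minimal R sketch of what checking these assumptions might look like; the data frame df with outcome y and predictors x1 and x2 is hypothetical, and later cards reuse the model m fitted here:

```r
# hypothetical data frame df with outcome y and predictors x1, x2
m <- lm(y ~ x1 + x2, data = df)

# base R's default diagnostic plots cover most of the LINE checks:
# residuals vs fitted (linearity), QQ-plot (normality of errors),
# scale-location (equal variance), residuals vs leverage
# (independence is judged from the study design instead)
plot(m)
```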
2
Q

linearity

A

assumption = the relationship between x and y is linear

investigated with:
- scatterplots with loess lines (single predictor)
- component-residual plots (multiple predictors)

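A sketch of both checks, reusing the hypothetical df and model m from card 1 (crPlots() is from the car package):

```r
# single predictor: scatterplot with a loess line
scatter.smooth(df$x1, df$y)

# multiple predictors: component-residual plots
car::crPlots(m)
```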
3
Q

component-residual plots

A

used to test linearity between each predictor and the outcome, controlling for the other predictors (also called partial residual plots)
they plot the predictor values on the x-axis and the partial residuals on the y-axis.
the partial residual for a predictor x is:
the residual from the lm including all predictors + the partial (linear) relation between that x and y (i.e. b_x * x)

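A sketch of computing one partial residual by hand for the hypothetical model m from card 1; car::crPlots(m) produces the same plots automatically:

```r
# partial residual for x1 = residual from the full model
#                         + the linear component for x1 (b_x1 * x1)
pr_x1 <- residuals(m) + coef(m)["x1"] * df$x1

# plot against x1; a clearly curved loess line suggests non-linearity
scatter.smooth(df$x1, pr_x1)
```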
4
Q

Normally distributed errors

A

assumption = the errors are normally distributed around each predicted value

investigated with:
- QQ-plots
- histograms (plot the frequency distribution of the residuals)

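A sketch, reusing the hypothetical model m from card 1:

```r
# frequency distribution of the residuals;
# look for an approximately normal, bell-shaped histogram
hist(residuals(m))
```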
5
Q

QQ-plots

A

quantile comparison plots
used to test the normality of errors
- plots the standardised residuals from the model against their theoretical quantiles under normality. if the residuals are normally distributed, the points should fall neatly along the diagonal

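A sketch with base R, reusing the hypothetical model m from card 1:

```r
# QQ-plot of the standardised residuals against theoretical normal quantiles
qqnorm(rstandard(m))
qqline(rstandard(m))  # points should fall close to this diagonal
```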
6
Q

equal variance (homoscedasticity)

A

assumption = the variance of the errors is constant across values of the predictors (x) and across the fitted values (ŷ)
- heteroscedasticity refers to the violation of this assumption

investigated with:
- a plot of the residuals against the fitted values

categorical predictors should show a similar spread of residuals across their fitted values, while plots for continuous predictors should look like a random array of dots

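A sketch, reusing the hypothetical model m from card 1:

```r
# residuals vs fitted values; the vertical spread should be roughly
# constant across the x-axis, with no funnel shape
plot(fitted(m), residuals(m))
abline(h = 0, lty = 2)
```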
7
Q

independence of errors

A

assumption = the errors are NOT correlated with one another

difficult to test unless we know the potential source of correlation between the cases - for now, we evaluate this based on study design

8
Q

diagnostics of linear models

A
  • model outliers
  • high leverage cases
  • high influence cases
9
Q

model outliers

A

cases for which there is a large discrepancy between their predicted value (ŷ) and their observed value (y)
- they show large residuals and are judged on a case-by-case basis

10
Q

high leverage cases

A

cases with an unusual value of the predictor (x)

  • these have the potential to influence our estimated β coefficients
  • assessed with hat values
11
Q

high influence cases

A

cases that have a large impact on the estimation of the model

  • can have a strong influence on our β coefficients, and therefore on our model results and conclusions
  • assessed with Cook’s distance and DFbeta
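In R, one call summarises all of these measures for the hypothetical model m from card 1:

```r
# DFBETAs, DFFITs, COVRATIO, Cook's distance and hat values for every
# case, with potentially influential cases flagged
influence.measures(m)
```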
12
Q

studentised residuals

A

a version of standardised residuals in which each case is scaled using a model fit that excludes that case - values > +2 or < -2 indicate potential outliers

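A sketch, reusing the hypothetical model m from card 1:

```r
rs <- rstudent(m)   # externally studentised residuals
which(abs(rs) > 2)  # cases flagged as potential outliers
```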
13
Q

hat values (h_i)

A

used to assess leverage in linear models

equation (single-predictor case):
h_i = 1/n + ( (x_i - mean of x)^2 / sum of (x_j - mean of x)^2 )

the mean hat value is:
h̄ = (k + 1)/n, where k = number of predictors and n = number of cases

general rule = values more than 2 * the mean hat value are considered high leverage

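A sketch applying this rule to the hypothetical model m from card 1:

```r
h <- hatvalues(m)
hbar <- length(coef(m)) / nobs(m)  # mean hat value (k + 1)/n;
                                   # coef() includes the intercept
which(h > 2 * hbar)                # high leverage cases
```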
14
Q

Cook’s distance

A

Cook’s distance is the average distance the ŷ values would move if a given case were removed - used to determine whether cases are high influence

D_i = ( (studentised residual_i)^2 / (k + 1) ) * ( h_i / (1 - h_i) )
D_i = outlyingness * leverage

there are many different suggested cut-offs, but a common one is D_i > 4/(n - k - 1)

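A sketch applying the common cut-off to the hypothetical model m from card 1:

```r
d <- cooks.distance(m)
k <- length(coef(m)) - 1          # number of predictors
which(d > 4 / (nobs(m) - k - 1))  # high influence cases
```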
15
Q

DFfit

A

the difference between the predicted outcome value for a case with vs without that case included in the model

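In R (reusing the hypothetical model m from card 1), note that dffits() returns the standardised version of this quantity:

```r
# change in each case's fitted value when that case is deleted,
# standardised (DFFITS)
dffits(m)
```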
16
Q

DFbeta

A

the difference between the estimated value of a coefficient with and without a given case included - computed for each case and each coefficient

17
Q

DFbetas

A

standardised version of DFbeta - obtained by dividing DFbeta by an estimate of the standard error of the regression coefficient with the case removed

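Both versions are available in R for the hypothetical model m from card 1:

```r
dfbeta(m)   # raw change in each coefficient when each case is deleted
dfbetas(m)  # the standardised version described on this card
```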
18
Q

COVRATIO

A

a case’s influence on the standard errors is measured using the COVRATIO statistic
- a COVRATIO value < 1 shows that precision is decreased by the case (the SEs are increased)

cases with COVRATIO > 1 + [3(k+1)/n] or < 1 - [3(k+1)/n] can be considered to have a strong influence on the standard errors

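A sketch applying these cut-offs to the hypothetical model m from card 1:

```r
cr <- covratio(m)
bound <- 3 * length(coef(m)) / nobs(m)  # 3(k + 1)/n
which(cr > 1 + bound | cr < 1 - bound)  # strong influence on the SEs
```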
19
Q

Multi-collinearity

A

refers to correlation between the predictors. when there are large correlations between predictors, the standard errors of the coefficients are inflated

neither an assumption nor a diagnostic

20
Q

Variance inflation factor

A

VIF quantifies the extent to which the standard errors are inflated by predictor inter-correlations
we want VIF values to be close to 1; anything over 10 indicates a problem

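A sketch, reusing the hypothetical model m from card 1 (vif() is from the car package):

```r
# one VIF per predictor; values near 1 are good, over 10 are a problem
car::vif(m)
```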
21
Q

dealing with unusual cases

A
  • is there a data entry error? can it be corrected? if not, the case should be deleted
  • is the data legitimate but extreme? consider ways to reduce its influence
  • are the unusual values a result of skewness in the data?
22
Q

sensitivity analysis

A

refers to the idea of checking whether you get similar results irrespective of the methodological decisions you make - for example, whether conclusions hold with and without influential cases included
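A sketch of one such check for the hypothetical model m from card 1; the index vector flagged (cases identified by the diagnostics above) is hypothetical:

```r
# refit the model without the flagged cases
m2 <- update(m, data = df[-flagged, ])

# compare coefficients; similar values suggest conclusions are robust
cbind(original = coef(m), refit = coef(m2))
```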