assumptions and diagnostics Flashcards
assumptions of the linear model (LINE)
L = linearity
I = independence of errors
N = normality of errors
E = equal variance (homoscedasticity)
linearity
assumption = the relationship between x and y is linear
investigated with:
- scatterplots with loess lines (for single variables)
- component-residual plots (multiple predictors)
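e.g. a minimal R sketch for a single predictor (df, x and y are hypothetical example data, not from the card):
  set.seed(1)
  df <- data.frame(x = rnorm(100))           # hypothetical predictor
  df$y <- 2 * df$x + rnorm(100)              # hypothetical outcome
  plot(df$x, df$y)                           # scatterplot of x against y
  lines(lowess(df$x, df$y), col = "red")     # loess-style smoother; should look roughly straight if linearity holds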
component-residual plots
used for testing linearity between each predictor and the outcome, controlling for other predictors (also called partial residual plots)
have x values on the x axis and partial residuals on the y axis.
partial residuals for each x variable are:
residual from the lm including all predictors + b*x (that predictor's estimated linear relation to the outcome)
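e.g. a minimal R sketch, assuming the add-on car package (df2 and m are hypothetical example names):
  library(car)                                         # provides crPlots()
  df2 <- data.frame(x1 = rnorm(100), x2 = rnorm(100))  # hypothetical predictors
  df2$y <- 1 + 0.5 * df2$x1 - 0.3 * df2$x2 + rnorm(100)
  m <- lm(y ~ x1 + x2, data = df2)                     # lm with multiple predictors
  crPlots(m)                                           # one component + residual plot per predictor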
Normally distributed errors
assumption = the errors are normally distributed around each predicted value
investigated with:
- QQ-plots
- histograms (plot the frequency distribution of the residuals)
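e.g. in R, assuming a fitted lm called m (as in the sketches above):
  hist(resid(m), breaks = 20)    # histogram of residuals; should look roughly bell-shaped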
QQ-plots
quantile comparison plots
used to test the normality of errors
- plots the standardised residuals from the model against their theoretical quantiles under a normal distribution. if the residuals are normally distributed, the points should fall neatly along the diagonal
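e.g. in R, assuming a fitted lm called m:
  qqnorm(rstandard(m))    # standardised residuals vs theoretical normal quantiles
  qqline(rstandard(m))    # reference diagonal; normally distributed residuals fall along this line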
equal variance (homoscedasticity)
assumption = the variance of the errors is constant across values of the predictors (x) and across the fitted values (^y)
- heteroscedasticity refers to the violation of this assumption
investigated with:
- plot the residual values against the predicted (fitted) values:
categorical predictors should show a similar spread of residuals across their fitted values, while plots for continuous predictors should look like a random array of dots
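e.g. in R, assuming a fitted lm called m:
  plot(fitted(m), resid(m))    # residuals against fitted values
  abline(h = 0, lty = 2)       # under equal variance, the spread stays constant around this line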
independence of errors
assumption = the errors are NOT correlated with one another
difficult to test unless we know the potential source of correlation between the cases - for now, we evaluate this based on study design
diagnostics of linear models
- model outliers
- high leverage cases
- high influence cases
model outliers
cases for which there is a large discrepancy between their predicted value (^y) and their observed value (y)
- demonstrate large residuals, so are judged on a case-by-case basis
high leverage cases
cases with an unusual value of the predictor (x)
- these have the potential to influence our estimated β values
- assessed with hat values
high influence cases
cases that have a large impact on the estimation of the model
- can have strong influence on our β coefficients and therefore can influence our model results and conclusions
- assessed with cook’s distance and dfbeta
studentised residuals
provide a version of standardised residuals in which each case's residual is standardised using a model fit that excludes that case - values > +2 or < -2 indicate outlyingness
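e.g. in R, assuming a fitted lm called m:
  rs <- rstudent(m)      # studentised residuals
  which(abs(rs) > 2)     # flag cases beyond the +/- 2 rule of thumb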
hat values (hi)
used to assess leverage of linear models
equation:
hi = 1/n + ( (xi - mean of x)^2 / sum of (xj - mean of x)^2 )
the mean of the hat values is:
mean of hat = (k + 1)/n (where k = number of predictors and n = sample size)
general rule = values more than 2*mean of hat are considered high leverage
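e.g. in R, assuming a fitted lm called m:
  h <- hatvalues(m)          # hat value for each case; mean(h) equals (k + 1)/n
  which(h > 2 * mean(h))     # flag cases above twice the mean hat value as high leverage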
Cook’s distance
cook's distance refers to the average distance the ^y values will move if a given case is removed - used to determine whether cases are high influence
Di = ( (studentised residual i)^2 / (k + 1) ) * ( hi / (1 - hi) )
Di = outlyingness * leverage
there are many different suggested cut-offs, but a common one is Di > 4/(n - k - 1)
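e.g. in R, assuming a fitted lm called m:
  d <- cooks.distance(m)        # Cook's distance per case
  n <- nobs(m)                  # sample size
  k <- length(coef(m)) - 1      # number of predictors
  which(d > 4 / (n - k - 1))    # flag cases above the common cut-off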
DFfit
difference between the predicted outcome value (^y) for a case, with vs without the inclusion of that case in the model
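e.g. in R, assuming a fitted lm called m:
  dffits(m)    # change in a case's own fitted value when that case is deleted
  dfbeta(m)    # change in each β coefficient when each case is deleted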