assumptions and diagnostics Flashcards
assumptions of the linear model (LINE)
L = linearity
I = independence of errors
N = normality of errors
E = equal variance (homoscedasticity)
linearity
assumption = the relationship between x and y is linear
investigated with:
- scatterplots with loess lines (for single variables)
- component-residual plots (multiple predictors)
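e.g. a minimal R sketch for a single predictor (df, x and y are hypothetical example data, not from the card):
  set.seed(1)
  df <- data.frame(x = rnorm(100))           # hypothetical predictor
  df$y <- 2 * df$x + rnorm(100)              # hypothetical outcome
  plot(df$x, df$y)                           # scatterplot of x against y
  lines(lowess(df$x, df$y), col = "red")     # loess-style smoother; should look roughly straight if linearity holds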
component-residual plots
used for testing linearity between each predictor and the outcome, controlling for other predictors (also called partial residual plots)
have x values on the x axis and partial residuals on the y axis.
partial residuals for each x variable are:
residual from the lm including all predictors + b*x (that predictor's estimated linear relation to the outcome)
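e.g. a minimal R sketch, assuming the add-on car package (df2 and m are hypothetical example names):
  library(car)                                         # provides crPlots()
  df2 <- data.frame(x1 = rnorm(100), x2 = rnorm(100))  # hypothetical predictors
  df2$y <- 1 + 0.5 * df2$x1 - 0.3 * df2$x2 + rnorm(100)
  m <- lm(y ~ x1 + x2, data = df2)                     # lm with multiple predictors
  crPlots(m)                                           # one component + residual plot per predictor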
Normally distributed errors
assumption = the errors are normally distributed around each predicted value
investigated with:
- QQ-plots
- histograms (plot the frequency distribution of the residuals)
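e.g. in R, assuming a fitted lm called m (as in the sketches above):
  hist(resid(m), breaks = 20)    # histogram of residuals; should look roughly bell-shaped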
QQ-plots
quantile comparison plots
used to test the normality of errors
- plots the standardised residuals from the model against their theoretical quantiles under a normal distribution. if the residuals are normally distributed, the points should fall neatly along the diagonal
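e.g. in R, assuming a fitted lm called m:
  qqnorm(rstandard(m))    # standardised residuals vs theoretical normal quantiles
  qqline(rstandard(m))    # reference diagonal; normally distributed residuals fall along this line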
equal variance (homoscedasticity)
assumption = the variance of the errors is constant across values of the predictors (x) and across the fitted values (^y)
- heteroscedasticity refers to the violation of this assumption
investigated with:
- plot the residual values against the predicted (fitted) values:
categorical predictors should show a similar spread of residuals across their fitted values, while plots for continuous predictors should look like a random array of dots
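e.g. in R, assuming a fitted lm called m:
  plot(fitted(m), resid(m))    # residuals against fitted values
  abline(h = 0, lty = 2)       # under equal variance, the spread stays constant around this line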
independence of errors
assumption = the errors are NOT correlated with one another
difficult to test unless we know the potential source of correlation between the cases - for now, we evaluate this based on study design
diagnostics of linear models
- model outliers
- high leverage cases
- high influence cases
model outliers
cases for which there is a large discrepancy between their predicted value (^y) and their observed value (y)
- demonstrate large residuals, so are judged on a case-by-case basis
high leverage cases
cases with an unusual value of the predictor (x)
- these have the potential to influence our estimated β values
- assessed with hat values
high influence cases
cases that have a large impact on the estimation of the model
- can have strong influence on our β coefficients and therefore can influence our model results and conclusions
- assessed with cook’s distance and dfbeta
studentised residuals
provide a version of standardised residuals in which each case's residual is standardised using a model fit that excludes that case - values > +2 or < -2 indicate outlyingness
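e.g. in R, assuming a fitted lm called m:
  rs <- rstudent(m)      # studentised residuals
  which(abs(rs) > 2)     # flag cases beyond the +/- 2 rule of thumb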
hat values (hi)
used to assess leverage of linear models
equation:
hi = 1/n + ( (xi - mean of x)^2 / sum of (xj - mean of x)^2 )
the mean of the hat values is:
mean of hat = (k + 1)/n (where k = number of predictors and n = sample size)
general rule = values more than 2*mean of hat are considered high leverage
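e.g. in R, assuming a fitted lm called m:
  h <- hatvalues(m)          # hat value for each case; mean(h) equals (k + 1)/n
  which(h > 2 * mean(h))     # flag cases above twice the mean hat value as high leverage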
Cook’s distance
cook's distance refers to the average distance the ^y values will move if a given case is removed - used to determine whether cases are high influence
Di = ( (studentised residual i)^2 / (k + 1) ) * ( hi / (1 - hi) )
Di = outlyingness * leverage
there are many different suggested cut-offs, but a common one is Di > 4/(n - k - 1)
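e.g. in R, assuming a fitted lm called m:
  d <- cooks.distance(m)        # Cook's distance per case
  n <- nobs(m)                  # sample size
  k <- length(coef(m)) - 1      # number of predictors
  which(d > 4 / (n - k - 1))    # flag cases above the common cut-off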
DFfit
difference between the predicted outcome value (^y) for a case, with vs without the inclusion of that case in the model
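e.g. in R, assuming a fitted lm called m:
  dffits(m)    # change in a case's own fitted value when that case is deleted
  dfbeta(m)    # change in each β coefficient when each case is deleted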