9. Assumptions and Diagnostics Flashcards
What are the different linear model assumptions?
Linearity
Independence of Errors
Normality of Errors
Equal Variance
What happens if assumptions are violated?
Estimates, standard errors, and inferences drawn from the model may be inaccurate
Why is it useful to visualise assumptions?
It is easier to see the nature and magnitude of any assumption violation
What are some drawbacks of using statistical methods of assessing assumptions?
They can suggest assumptions are violated when the violations are actually trivially small (a consequence of high statistical power in large samples)
They also give no information about the nature of the actual problem
What is linearity?
Assumes the relationship between X and Y is linear
What happens if we estimate a linear relation when there isn’t one present?
Can result in underestimating that relation
How is linearity investigated?
Investigated via scatterplots with loess lines (single predictor)
or
Component-plus-residual plots (when we have multiple predictors) - the closer the loess line is to the straight (black) line, the more linear the relation
How do we test for non-linearity?
Need to check that the relation between each predictor and the outcome is linear, controlling for the other predictors
Partial residuals for predictor xj:
ei + bj*xij (the partial linear relation between xj and y)
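A minimal numpy sketch of the partial residuals described above, on simulated data (the dataset and variable names are illustrative, not from the notes). Plotting `partial_resid_x1` against `x1` gives the values shown in a component+residual plot:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Fit y on an intercept plus both predictors via least squares.
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Partial residual for x1: raw residual plus x1's linear contribution
# (ei + b1*xi1). Plotting these against x1 isolates the partial linear
# relation between x1 and y, controlling for x2.
partial_resid_x1 = resid + b[1] * x1
```

Because least-squares residuals are orthogonal to each predictor, regressing the partial residuals on x1 recovers exactly the coefficient b1 from the full model.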
What are normally distributed errors?
Assumes errors are normally distributed around each predicted value
How are normally distributed errors investigated?
Investigated via QQ plots (Quantile comparisons plots)
- Plot standardized residuals from the model against their theoretically expected values
- If normally distributed = Points should fall neatly on the diagonal line
- If non-normally distributed = Points deviate from the line, and the shape of the deviation indicates the nature of the problem (e.g. skew, heavy tails)
- Can also use histograms of the residuals to see the distribution
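A numeric sketch of the QQ comparison above, using only numpy and the standard library (the simulated data are illustrative). Sorted standardized residuals are compared against theoretical standard-normal quantiles; if the errors are normal, the two agree closely:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 0.5 + 1.2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
std_resid = (resid - resid.mean()) / resid.std(ddof=1)

# Theoretical standard-normal quantiles at plotting positions
# (i - 0.5) / n; a QQ plot graphs sorted residuals against these.
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
observed = np.sort(std_resid)

# With normal errors the points hug the diagonal, so the correlation
# between observed and theoretical quantiles is close to 1.
qq_corr = np.corrcoef(theoretical, observed)[0, 1]
```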
What is equal variance (Homoscedasticity) ?
Assumes the error variance is constant across values of the predictors x1, …, xk and across values of the fitted values ŷ
What is it called when homoscedasticity is violated?
Heteroscedasticity
How is equal variance investigated?
Using residual plot
When plotting residual values vs predicted values = Spread should be similar above and below the zero line, with no funnel shape
Categorical predictors = Each group should show a similar spread
Continuous predictors = Points should form an even band around the line, with no systematic widening or narrowing
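A rough numeric version of eyeballing a residuals-vs-fitted plot, sketched with numpy on simulated data (the dataset and the median-split check are illustrative assumptions, not a formal test): compare the residual spread in the lower and upper halves of the fitted values; similar spreads are consistent with homoscedasticity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=n)  # constant error variance

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
resid = y - fitted

# Split residuals at the median fitted value and compare spreads.
# A ratio near 1 is consistent with equal variance; a ratio far from 1
# suggests heteroscedasticity (e.g. a funnel shape in the plot).
lower = resid[fitted <= np.median(fitted)]
upper = resid[fitted > np.median(fitted)]
ratio = lower.std(ddof=1) / upper.std(ddof=1)
```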
What is independence of errors?
Assumes errors are not correlated with one another
How do we test independence of errors?
Difficult to test, unless we know the potential source of correlation between cases
In a between-persons design errors can usually be assumed uncorrelated; clustered or repeated-measures data induce correlation
Can use a variant of the linear model (e.g. a multilevel model) to account for non-independence
What are linear model diagnostics and what are their three key features?
Explore individual cases in context of model
Model outliers, High Leverage, High Influence
What are model outliers?
Cases that have unusual outcome values given their predictor values
(Show a large difference between predicted and observed)
What are outliers?
Cases with large residuals = May have a strong influence on the model
How do you determine an outlier?
By the size of the residual
Unstandardized residuals (same units as the DV):
ei = yi - ŷi (observed minus predicted)
Fine for comparison across cases within one lm, but difficult to compare across DVs with different units
What is the difference between standardised and studentized residuals?
Standardized residuals:
- Unstandardized residual / its estimated SD (converts to a z-score)
- The SD estimate uses the whole dataset, so it can itself be inflated by outliers
Studentized residuals:
- Standardized residuals where the SD estimate excludes the case in question, so an extreme case cannot mask itself; values > +2 or < -2 indicate outlyingness
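A numpy sketch of the standardized vs studentized distinction above, on simulated data (the dataset is illustrative). The internally standardized residual divides by an SD estimate computed from all cases; the externally studentized version adjusts so case i is excluded from its own denominator:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 2  # p = number of estimated coefficients (intercept + slope)
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Hat (leverage) values: diagonal of X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Internally standardized residuals: residual / its estimated SD.
# The variance estimate s2 uses ALL cases, including any outlier.
s2 = resid @ resid / (n - p)
standardized = resid / np.sqrt(s2 * (1 - h))

# Externally studentized residuals: equivalent to re-estimating the
# SD without case i, so an extreme case cannot inflate its own denominator.
studentized = standardized * np.sqrt((n - p - 1) / (n - p - standardized**2))

# |studentized| > 2 is the common flag for potential outliers.
outlier_flags = np.abs(studentized) > 2
```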
What are high leverage cases?
Have an unusual predictor value or combination of predictor values (e.g. x far away from x bar (the mean))
How do you find high leverage cases?
Hat values are used to assess leverage; they measure how far a case's predictor values are from the mean of the predictors
Hat value > 2 x mean hat value = High leverage
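A numpy sketch of the hat-value rule above, on simulated data with one deliberately extreme predictor value (the dataset is illustrative). The mean hat value is (k+1)/n, so the cut-off is 2(k+1)/n:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=n)
x[0] = 8.0  # one case far from the predictor mean -> high leverage

X = np.column_stack([np.ones(n), x])

# Hat values are the diagonal of the hat matrix H = X (X'X)^{-1} X';
# they measure distance from the centre of the predictor space.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Mean hat value is (k+1)/n; the rule of thumb flags h > 2 * mean
# as high leverage.
k = 1  # number of predictors
cutoff = 2 * (k + 1) / n
high_leverage = h > cutoff
```

The extreme case gets a hat value far above the cut-off even though nothing about its outcome value was specified: leverage depends on the predictors only.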
What are high influence cases?
Cases having a large impact on estimation of model
Have a strong effect on coefficients - e.g. if deleted a case, coefficient would change
How do we investigate high influence cases?
Degree of change = one way to judge magnitude of influence
Can also consider influence via Cook's distance and DFBETA
What is cook’s distance?
The aggregate distance the ŷ (fitted) values move if a given case is removed
Different cut-off suggestions:
Di > 1
Di > 4 / (n - k - 1)
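A numpy sketch of Cook's distance on simulated data with one planted influential case (the dataset is illustrative). The standard closed form combines the squared standardized residual (outlyingness) with leverage, showing that influence requires both:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 40, 1
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 4.0, -6.0  # unusual predictor AND outcome -> high influence

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

p = k + 1  # number of estimated coefficients
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = resid @ resid / (n - p)

# Cook's D_i = (squared standardized residual / p) * (h_i / (1 - h_i)):
# large only when a case is both outlying and high-leverage.
r2 = resid**2 / (s2 * (1 - h))
cooks_d = (r2 / p) * (h / (1 - h))

# The two cut-offs from the notes:
flag_large = cooks_d > 1
flag_relative = cooks_d > 4 / (n - k - 1)
```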
What are the three different ways you can look at Cook’s Distance in more detail (name only) ?
DFFIT
DFBETA
DFBETAS
What is DFFit?
Difference between the predicted outcome (ŷ) for a case with vs without that case included in the model
What is Dfbeta?
Difference between the value of a coefficient with vs without the case included
Unlike DFFIT, it focuses on the coefficients rather than the predicted value
What is DFBETAS?
Standardised version of DFBETA
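The case-deletion idea behind DFBETA can be sketched directly with numpy: refit the model without each case and record how much a coefficient moves (simulated data; the helper function is illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 3.0, -5.0  # a planted influential case

X = np.column_stack([np.ones(n), x])
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)

def dfbeta(i):
    # DFBETA for case i: coefficients with the case included minus
    # coefficients after deleting the case and refitting.
    keep = np.arange(n) != i
    b_drop, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return b_full - b_drop

# The influential case moves the slope far more than any typical case.
dfbeta_slope = np.array([dfbeta(i)[1] for i in range(n)])
```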
How do we examine influence of SE?
SE can impact inferences
Measure via COVRATIO
<1 = Precision is decreased by the case (SE increases)
>1 = Precision is increased by the case (SE decreases)
COVRATIO outside 1 -/+ 3(k+1)/n = the case has a strong influence on the SEs
What is multi-collinearity?
High correlation between predictors
Large correlations between predictors inflate the SEs of the coefficients, so we don't want predictors to be too strongly correlated
What do you do if multi-collinearity occurs?
If it happens = Combine the two predictors into a single composite, or drop the IV that is statistically redundant
How do you test for multicollinearity?
Variance inflation factor (VIF) - Measures how much SE(beta) is increased by predictor correlations
- VIF values increase as predictor inter-correlations increase
- VIFs > 10 = Issue - want values close to 1
- Always consider a variable's role in the model before deleting it
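A numpy sketch of VIF on simulated data (the dataset and helper function are illustrative): regress each predictor on the others, then VIF = 1 / (1 - R²), so a predictor that is nearly a linear function of the others gets a large VIF:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)  # highly correlated with x1
x3 = rng.normal(size=n)                          # independent predictor

def vif(target, others):
    # Regress one predictor on the rest (plus intercept); VIF = 1/(1 - R^2).
    Z = np.column_stack([np.ones(len(target))] + others)
    b, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ b
    r2 = 1 - resid @ resid / np.sum((target - target.mean())**2)
    return 1 / (1 - r2)

vif_x1 = vif(x1, [x2, x3])  # large: x2 nearly duplicates x1
vif_x3 = vif(x3, [x1, x2])  # near 1: x3 is unrelated to the others
```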
What are sensitivity analyses?
Checking whether you get similar results irrespective of methodological decisions
e.g. do the coefficients change if a certain case is included vs excluded?
What if the results from sensitivity analysis are similar?
Increased confidence that the results are not an artefact of methodological decisions but reflect a genuine effect
If a case has a high COVRATIO value but a low dfbeta, what is the most likely reason?
It has an extreme value on x but is not a regression outlier