The BEAST of Bias Flashcards
What are outliers/influential cases?
An atypical case that exerts undue influence on the b values (regression coefficients)
Can drag the regression line towards itself, biasing the model
How do you detect outliers/influential cases?
Look at residuals/deviations - the predicted value will be far from the observed value, so the residual will be large
Graphs - boxplots, histograms
Cook's distance - measures the influence of a single case on the model as a whole; values greater than 1 may be cause for concern
Standardised residuals
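The checks above can be sketched numerically. Below is a minimal numpy sketch with made-up data (the variable names and values are assumptions, not from the cards): the last case is atypical, and it shows up with a large standardised residual and a Cook's distance above the 1 cut-off.

```python
import numpy as np

# Hypothetical data (assumed example): the last case is atypical
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 30.0])

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS estimates of the b values
resid = y - X @ beta                             # raw residuals
n, p = X.shape
mse = resid @ resid / (n - p)                    # residual mean square

H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
h = np.diag(H)                                   # leverage of each case
std_resid = resid / np.sqrt(mse * (1 - h))       # standardised residuals
cooks = resid**2 / (p * mse) * h / (1 - h)**2    # Cook's distance for each case

print(std_resid.round(2))                        # last case stands out
print(cooks.round(2))                            # last case exceeds the 1 cut-off
```

Note that the influential case still shows up here because the remaining seven points pin the line down; with fewer points the same outlier could drag the line far enough to shrink its own residual, which is exactly why Cook's distance is checked alongside residuals.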
Why can’t you detect an influential case using residuals?
An influential case can be so influential that it drags the regression line towards itself, so its own residual ends up small
What are the assumptions of the linear model?
Linearity and additivity
Spherical errors - independent errors and homoscedastic errors
Normality - of the residuals and of the sampling distribution
What do linearity and additivity refer to?
The relationship between the predictors and the outcome is linear; otherwise you have used the wrong model
Additivity: the combined effect of several predictors is the sum of their individual effects
How to check for linearity and additivity?
Graphs - scatterplots of the outcome against predictors, residual plots
Consider whether the form of the model equation suits the data
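One informal way to probe linearity via the equation is to add a curvilinear term and see whether it clearly improves the fit. A minimal numpy sketch on simulated data (the data and parameters are assumptions for illustration) where the true relationship is quadratic:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(0, 2, 200)    # a truly curvilinear relationship

def r_squared(X, y):
    """Proportion of variance explained by an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

linear = np.column_stack([np.ones_like(x), x])          # straight-line model
curved = np.column_stack([np.ones_like(x), x, x**2])    # adds a quadratic term

print(round(r_squared(linear, y), 3))
print(round(r_squared(curved, y), 3))   # clearly better fit: linearity is violated
```

If the quadratic term adds essentially nothing, a linear model is defensible; if it clearly improves the fit, the straight-line model was the wrong one.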
What is the most important assumption?
Linearity and additivity
What do spherical errors refer to?
Errors should be independent - not related to each other (autocorrelation)
Errors should be homoscedastic - the variance of the errors should be constant at all levels of the predictor variable
What does violation of spherical errors mean?
b's are unbiased but no longer optimal (not minimum variance)
SEs are incorrect, so p-values and CIs will also be incorrect, as both are computed from the SE
How do you check for spherical errors?
Levene's test - if significant, the homogeneity of variance assumption is violated
Variances - similar size
Variance ratio - biggest variance divided by smallest
Graph of standardised residuals (ZRESID) against standardised predicted values (ZPRED)
Durbin-Watson - ranges from 0 to 4; values close to 2 indicate independent errors, values below 1 or above 3 are a concern
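The Durbin-Watson statistic and the variance ratio are simple enough to compute directly. A minimal numpy sketch on simulated data (names and numbers are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
resid = rng.normal(0, 1, 500)   # residuals from a model with independent errors

# Durbin-Watson: sum of squared successive differences over sum of squared residuals
dw = np.sum(np.diff(resid)**2) / np.sum(resid**2)
print(round(dw, 2))             # close to 2 when errors are independent

# Variance ratio: largest group variance divided by the smallest
groups = [rng.normal(0, sd, 100) for sd in (1.0, 1.1, 1.3)]
variances = [g.var(ddof=1) for g in groups]
ratio = max(variances) / min(variances)
print(round(ratio, 2))          # near 1 when variances are similar
```

Positively autocorrelated residuals would make successive differences small and push the statistic towards 0; negative autocorrelation pushes it towards 4.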
What do graphs of zresid against zpred show?
Funnel shape - heteroscedasticity
Curved (sausage) shape - non-linearity
What does normally distributed refer to?
Normality of residuals - the differences between observed and predicted values; if violated, the b's are still unbiased but other estimation methods perform better
Normality of the sampling distribution - the 1.96 used for confidence intervals comes from the normal distribution, so if the sampling distribution isn't normal the intervals will be wrong; the p-values associated with the b's also assume normality
What is the central limit theorem?
As long as the sample is big enough, the sampling distribution will be normal regardless of the shape of the population
So you only need to worry about normality in small samples
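The theorem can be seen in a small simulation: draw many samples from a heavily skewed population and look at the distribution of their means. A sketch with assumed parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=1.0, size=100_000)  # heavily right-skewed population

# 2000 samples of n = 50; their means form the sampling distribution
means = rng.exponential(scale=1.0, size=(2000, 50)).mean(axis=1)

def skew(a):
    """Standardised third moment: 0 for a symmetric distribution."""
    z = (a - a.mean()) / a.std()
    return float((z**3).mean())

print(round(skew(population), 2))  # around 2 for an exponential population
print(round(skew(means), 2))       # much closer to 0: approximately normal
```

Even though every individual observation comes from a strongly skewed distribution, the means of samples of 50 are already close to normally distributed.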
Exploring normality
Graphs - histograms, boxplots, P-P and Q-Q plots (points should fall along the diagonal; an S shape signals a problem)
Numbers - skew and kurtosis; you want them to be 0 (z-scores above about 2-2.5 indicate significant skew/kurtosis)
Test: Kolmogorov-Smirnov (K-S) test - but you don't want to rely on this (in large samples even trivial deviations come out significant)
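These numerical checks are available in scipy.stats; a minimal sketch on simulated normal and skewed data (the parameters are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_data = rng.normal(50, 10, 300)     # roughly normal scores
skewed_data = rng.exponential(10, 300)    # heavily skewed scores

for name, data in [("normal", normal_data), ("skewed", skewed_data)]:
    sk = stats.skew(data)                 # 0 for a perfectly normal distribution
    ku = stats.kurtosis(data)             # excess kurtosis: also 0 under normality
    z = (data - data.mean()) / data.std(ddof=1)
    d, p = stats.kstest(z, "norm")        # K-S test against a standard normal
    print(f"{name}: skew={sk:.2f} kurtosis={ku:.2f} K-S p={p:.3f}")
```

Note the caveat from the card: with a large n the K-S test flags even tiny, practically irrelevant deviations, which is why graphs and skew/kurtosis values are usually preferred.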
Ways of correcting problems
SD-based trim (trim values more than 2 SD from the mean) - a really bad idea
Transform the data - can create more problems than it solves
Winsorizing - substitute outliers with the highest value that isn't an outlier
Robust estimation - 20% trimmed means, M-estimators, the bootstrap, or adjusted standard errors
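Winsorizing, trimming, and the percentile bootstrap can be sketched in a few lines of numpy (the data and cut-offs here are assumptions for illustration; percentile clipping is one common winsorizing variant):

```python
import numpy as np

rng = np.random.default_rng(3)
scores = np.append(rng.normal(100, 15, 49), 500.0)   # one extreme outlier

# Winsorizing (percentile variant): clip values beyond the 5th/95th percentiles
lo, hi = np.percentile(scores, [5, 95])
winsorized = np.clip(scores, lo, hi)

# 20% trimmed mean: drop the top and bottom 20% of cases before averaging
n = scores.size
trimmed_mean = np.sort(scores)[int(0.2 * n):int(0.8 * n)].mean()

# Percentile bootstrap: resample with replacement, take the middle 95% of means
boot_means = np.array([rng.choice(scores, n, replace=True).mean()
                       for _ in range(5000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(round(winsorized.max(), 1))                      # outlier pulled back in
print(round(trimmed_mean, 1), round(scores.mean(), 1)) # trimmed mean resists the outlier
print(round(ci_low, 1), round(ci_high, 1))             # bootstrap 95% CI for the mean
```

The bootstrap makes no normality assumption because the interval comes from the empirical distribution of resampled means rather than from 1.96 times a standard error.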