Topic 3: Regression Diagnostics Flashcards
linear regression assumptions
- linearity
- normality
- homoscedasticity
- independence
- outliers
- multicollinearity
linearity
the relationship between x and y is linear
normality
the error term follows a normal distribution
homoscedasticity
the error term has a mean of 0 and a constant variance across all levels of the IVs
independence
the error terms are not related to each other
outliers
there are no outliers
multicollinearity
there are no high correlations among IVs
testing normality
skewness & kurtosis, shapiro-wilk test, normal quantile plot
skewness
the asymmetry of the distribution (how far it departs from symmetry around the mean)
kurtosis
how peaked or flat the distribution is relative to a normal distribution
interpreting skewness & kurtosis
if |t| for skewness or kurtosis (the statistic divided by its standard error) > 3.2, the respective assumption is violated
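A minimal sketch (not one of the cards), assuming Python with numpy/scipy and the usual large-sample approximations SE(skewness) ≈ √(6/n) and SE(kurtosis) ≈ √(24/n):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # toy sample; in practice, use the residuals

n = len(x)
t_skew = stats.skew(x) / np.sqrt(6 / n)        # skewness / SE(skewness)
t_kurt = stats.kurtosis(x) / np.sqrt(24 / n)   # excess kurtosis / SE(kurtosis)

print(f"t-skewness = {t_skew:.2f}, t-kurtosis = {t_kurt:.2f}")
print("violation" if max(abs(t_skew), abs(t_kurt)) > 3.2 else "no violation")
```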
shapiro-wilk test
tests for normality
null hypothesis of shapiro-wilk test
the sample comes from a normal distribution
interpreting shapiro-wilk results
significant result = the sample may not come from a normal distribution
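A minimal sketch of running the test, assuming Python with scipy (the toy data are an assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)  # toy sample; in practice, use the residuals

W, p = stats.shapiro(x)
print(f"W = {W:.3f}, p = {p:.3f}")
# significant (e.g., p < .05) = the sample may not come from a normal distribution
```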
normal quantile plot
sorts observations from smallest to largest, calculates z-scores of the sorted observations, and plots the observations against corresponding z-scores
interpreting normal quantile plot
if close to normal, the points will lie close to some straight line
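A minimal sketch of building the plot exactly as the card describes, assuming Python with numpy/scipy/matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=100))        # observations sorted smallest to largest

n = len(x)
probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions
z = stats.norm.ppf(probs)                # corresponding z-scores

plt.scatter(z, x, s=10)                  # near-linear pattern suggests normality
plt.xlabel("theoretical z-score")
plt.ylabel("sorted observation")
plt.show()
```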
dealing with non-normality
data transformation or resampling methods (e.g., bootstrap, jackknife)
bootstrap
uses resampling with replacement to emulate the process of obtaining new samples, so that we can estimate the variability of a parameter estimate without generating additional samples
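A minimal sketch of the idea, assuming Python with numpy and a toy simple regression; the 2,000 resamples are an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)             # toy data

slopes = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)         # resample cases with replacement
    b1, b0 = np.polyfit(x[idx], y[idx], 1)   # refit the regression
    slopes.append(b1)

print(f"bootstrap SE of the slope ~ {np.std(slopes, ddof=1):.3f}")
```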
what happens if homoscedasticity is violated?
- the variances of regression coefficient estimates tend to be under-estimated
- thus, t-ratios tend to be inflated
testing homoscedasticity
residual plots
residuals
the differences between observed Yi and predicted Ŷi
interpreting residual plots for homoscedasticity
funnel shape = violation of homoscedasticity
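A minimal sketch, assuming Python with numpy/matplotlib; the toy data are built so the error variance grows with x, producing the funnel shape (a curved pattern in the same plot would instead signal non-linearity):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)  # error variance grows with x

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)                  # Yi - Ŷi

plt.scatter(b0 + b1 * x, residuals, s=10)
plt.axhline(0, color="gray")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```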
dealing with heteroscedasticity
data transformation, other estimation methods, other regression methods
testing linearity
residual plots
interpreting residual plots for linearity
curve shape = violation of linearity
dealing with non-linearity
data transformation, add another IV to the equation (non-linear function of one of the other IVs), use non-linear methods
testing independence
Durbin-Watson (d) test of autocorrelation
Durbin-Watson test
tests the correlation between error terms ordered in time or space
interpreting Durbin-Watson test results
1.5-2.5 = acceptable (no serious autocorrelation)
below 1 or above 3 = cause for concern (likely autocorrelation)
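A minimal sketch computing d directly from its definition, assuming Python with numpy and residuals that are ordered in time:

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.normal(size=100)  # stand-in for time-ordered residuals

# d = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest no autocorrelation
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(f"Durbin-Watson d = {d:.2f}")
```

statsmodels.stats.stattools.durbin_watson(e) computes the same quantity.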
dealing with dependence
data transformation, use other estimation methods, use other regression methods
outlier
a data point disconnected from the rest of the data
checking outliers
Cook's distance
interpreting Cook's distance
Cook's D > 4/n (a common rule of thumb) suggests potentially serious outliers
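A minimal sketch, assuming Python with statsmodels and the 4/n rule of thumb; the planted outlier is an assumption of the toy data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)
y[0] += 10.0                        # plant one unusual case

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = fit.get_influence().cooks_distance[0]

print("flagged cases:", np.where(cooks_d > 4 / len(y))[0])
```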
dealing with outliers
if an unusual case is not likely to recur, delete the case or use robust regression
consequences of multicollinearity
- unstable regression coefficient estimates (lower t-ratios)
- a high R2 (or significant F) but few significant t-ratios
- unexpected signs of regression coefficients
- the matrix inversion problem
checking multicollinearity
tolerance, VIF, condition index
tolerance
1 − R2 from the regression of each IV on the other IVs, ignoring the DV
interpreting tolerance
values < 0.1 = multicollinearity problem
variance inflation factor (VIF)
1/tolerance
interpreting VIF
values > 10 = multicollinearity problem
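A minimal sketch, assuming Python with statsmodels; x2 is built to be nearly collinear with x1 so the VIF (and tolerance = 1/VIF) show a problem:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j, name in enumerate(["x1", "x2", "x3"], start=1):  # column 0 is the constant
    vif = variance_inflation_factor(X, j)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```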
condition index
measure of dependency of one variable on the others
interpreting condition index
values > 30 = multicollinearity
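A minimal sketch of one common recipe (condition indices from the singular values of the column-scaled design matrix; the exact computation is an assumption, since the card does not fix it), assuming Python with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear
X = np.column_stack([np.ones(100), x1, x2])

Xs = X / np.linalg.norm(X, axis=0)          # scale columns to unit length
s = np.linalg.svd(Xs, compute_uv=False)     # singular values
print("condition indices:", s.max() / s)    # largest one is checked against 30
```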
dealing with multicollinearity
drop a variable, incorporate more info (composite variable), or use other regression methods
r2 vs. R2 in linear regression
- Simple linear regression: r2 = R2
- Multiple linear regression: r2 ≠ R2
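A minimal sketch checking the simple-regression case numerically, assuming Python with numpy/statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

r = np.corrcoef(x, y)[0, 1]                    # correlation between x and y
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"r2 = {r ** 2:.4f}, R2 = {fit.rsquared:.4f}")  # identical with one IV
```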