Ch9 Regression Diagnostics Flashcards
What is the White Test and why use it?
If you have a linear regression with a constant and k regressors, y = a + b_1 x_1 + … + b_k x_k + e, estimated on n data points,
the White test tests the null that there is no link between the regressors and the variance of the error term, i.e. that the errors are homoscedastic. Thus it can help rule out a violation of the homoscedasticity assumption of an Ordinary Least Squares model.
Two-step process before the hypothesis test:
1) estimate the regression above
2) regress the squared residuals on a constant, plus the regressors, their squares, and all cross products. This is like a ‘Weierstrass’ polynomial approximation of some unknown nonlinear function linking the regressors to the error variance, i.e. postulating some link. The auxiliary regression therefore has the constant plus 2k + kC2 = k(k+3)/2 regressors, plus its own error. Get the R^2 of this second regression, R^2_2.
H0: the non-constant terms all have zero coefficients in the second regression (i.e. homoscedasticity).
Lagrange multiplier (LM) approach: nR^2_2 is approximately chi-squared with k(k+3)/2 degrees of freedom, k being the number of regressors.
Reject H0 (conclude heteroscedasticity) if this exceeds the chi-squared critical value at significance level alpha.
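A minimal sketch of running the White test with statsmodels’ het_white; the simulated data, coefficients, and sample size are illustrative assumptions, not anything from the card.

```python
# Hedged sketch: White test via statsmodels (het_white); data and names are illustrative.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(0)
n, k = 200, 2
X = rng.normal(size=(n, k))                      # k regressors
e = rng.normal(size=n) * (1 + np.abs(X[:, 0]))   # heteroscedastic errors for illustration
y = 1.0 + X @ np.array([0.5, -0.3]) + e

exog = sm.add_constant(X)                        # constant + k regressors
res = sm.OLS(y, exog).fit()                      # step 1: estimate the main regression

# step 2 + LM test: squared residuals regressed on levels, squares, cross products
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(res.resid, exog)
print(f"LM = nR^2_2 = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")  # reject H0 (homoscedasticity) if p < alpha
```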
What is multicollinearity, what are consequences and what is a tell for it?
Multicollinearity is present when some regressors are strongly correlated with one another.
If the correlation is ‘perfect’ (an exact linear relationship) it violates the OLS no-perfect-collinearity assumption, but some weaker/imperfect form is often present in real life.
Under perfect collinearity the slope coefficients cannot be separately estimated at all. Even with imperfect collinearity the estimates, while still unbiased, are noisier (inflated standard errors), so inference on individual coefficients is less reliable.
We suspect multicollinearity if the F test statistic is large (the null of no explanatory power across the whole set of regressors is rejected), yet the individual t statistics are weak. Drilling down per regressor, it is hard to attribute explanatory power to any single one. This may be because the model has too many ‘similar’ regressors and could explain the variation in the dependent variable more cleanly with fewer of them.
We can regress each suspected collinear variable on the other regressors and take the Variance Inflation Factor from each of these ‘junior’ regressions. If any of these VIFs is ‘excessive’, that is a tell of which factors are problematic.
VIF_j = 1/(1 - R_j^2), where R_j^2 is from regressing regressor j on the others.
Excessive if above 10, i.e. more than 90% of the variation in the suspected factor is explained by the other factors - i.e. collinearity.
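A minimal sketch of computing VIFs with statsmodels; the data frame, column names, and the near-collinear construction are illustrative assumptions.

```python
# Hedged sketch: Variance Inflation Factors with statsmodels; data and names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + 0.05 * rng.normal(size=300)     # nearly collinear with x1
x3 = rng.normal(size=300)                 # unrelated regressor
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

exog = sm.add_constant(X)
# VIF_j = 1 / (1 - R_j^2) from regressing regressor j on the others (skip the constant)
for j, name in enumerate(exog.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(exog.values, j)
    flag = "  <-- excessive (> 10)" if vif > 10 else ""
    print(f"VIF[{name}] = {vif:.1f}{flag}")
```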
What are consequences of Omitted variable bias?
- included variables may be correlated with the omitted one, so their estimates end up biased / inconsistent.
- residuals are larger since they still contain the variation due to the missed explanatory factor.
What if an irrelevant / extraneous regressor is included?
This adds no information but still costs a degree of freedom; plain R^2 never falls when a regressor is added, so use the adjusted R^2, which penalises here:
R^2_adj = 1 - adjustment * RSS/TSS
adjustment = (n-1)/(n-k-1)
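A tiny worked sketch of the penalty, assuming made-up RSS/TSS numbers: a regressor that barely reduces RSS can still lower adjusted R^2.

```python
# Hedged sketch: adjusted R^2 penalty; the numbers are made up for illustration.
def adjusted_r2(rss, tss, n, k):
    """R^2_adj = 1 - (n - 1) / (n - k - 1) * RSS / TSS, with k regressors (excl. constant)."""
    return 1.0 - (n - 1) / (n - k - 1) * rss / tss

# Adding a near-irrelevant regressor barely lowers RSS but tightens the penalty:
print(adjusted_r2(rss=40.0, tss=100.0, n=50, k=3))   # ~0.574
print(adjusted_r2(rss=39.8, tss=100.0, n=50, k=4))   # ~0.567 -- adjusted R^2 falls
```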
What is bias/variance trade-off when selecting variables?
when estimating parameters, bias/inconsistency means we miss the target on average.
variance/precision is noise around the expected number.
Sometimes we tolerate a bias if the noise is smaller, as a compromise that improves the odds of ‘hitting the target’ (lower mean squared error; see the sketch after the list below).
Some Factors driving the tradeoff:
data quality,
sample size,
whether the bias will be ‘conservative’ (context)
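A minimal simulation sketch of the trade-off, assuming a made-up shrinkage estimator of a mean: the biased estimator has lower variance and ends up with the lower MSE (MSE = bias^2 + variance).

```python
# Hedged sketch: a biased but lower-variance estimator can win on MSE; setup is illustrative.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 2.0, 5.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(trials, n))
unbiased = samples.mean(axis=1)          # sample mean: unbiased, higher variance
shrunk = 0.8 * unbiased                  # shrinkage toward zero: biased, lower variance

for name, est in [("unbiased", unbiased), ("shrunk", shrunk)]:
    bias = est.mean() - mu
    var = est.var()
    print(f"{name:9s} bias^2 = {bias**2:.3f}, variance = {var:.3f}, MSE = {bias**2 + var:.3f}")
```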
what is general to specific model selection?
Start large, including many candidate regressors; eliminate the least individually significant one, re-estimate, and repeat until only individually t-significant regressors remain.
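A minimal sketch of general-to-specific selection by backward elimination on p-values; the simulated data, the 5% threshold, and the stopping rule are illustrative assumptions.

```python
# Hedged sketch: backward elimination on individual significance; data and threshold are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 1.0 + 0.8 * X["x1"] - 0.5 * X["x2"] + rng.normal(size=n)   # x3..x5 are irrelevant

keep = list(X.columns)
while True:
    res = sm.OLS(y, sm.add_constant(X[keep])).fit()
    pvals = res.pvalues.drop("const")           # individual significance of each regressor
    worst = pvals.idxmax()
    if pvals[worst] <= 0.05 or len(keep) == 1:  # stop once all remaining are significant
        break
    keep.remove(worst)                          # drop the least significant and re-estimate

print("retained regressors:", keep)
```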
what is m-fold cross validation?
build some candidate models.
split the data set into M partition blocks
estimate the models using some of these blocks (the training set)
try to predict the remaining block of data (the validation set)
repeat M times, rotating which block is held out
select the model with the best average out-of-sample performance
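A minimal sketch of M-fold cross-validation comparing candidate OLS models by out-of-sample MSE; the data, M = 5, and the candidate sets are illustrative assumptions.

```python
# Hedged sketch: M-fold cross-validation over candidate OLS models; setup is illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, M = 200, 5
X = rng.normal(size=(n, 3))
y = 1.0 + 0.7 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(size=n)

candidates = {"x1 only": [0], "x1 + x2": [0, 1], "all three": [0, 1, 2]}
folds = np.array_split(rng.permutation(n), M)   # partition shuffled indices into M blocks

for name, cols in candidates.items():
    mse = []
    for m in range(M):
        valid = folds[m]
        train = np.concatenate([folds[j] for j in range(M) if j != m])
        res = sm.OLS(y[train], sm.add_constant(X[train][:, cols])).fit()
        pred = res.predict(sm.add_constant(X[valid][:, cols]))
        mse.append(np.mean((y[valid] - pred) ** 2))
    print(f"{name:10s} mean out-of-sample MSE = {np.mean(mse):.3f}")
```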
What is Cook’s distance?
Look at the fitted Y values, unrestricted and again when a given observation is dropped; sum the squared differences and divide by k·s^2 (#regressors × sample variance of the errors).
‘Inlier’ => small Cook’s distance, i.e. less than unity.
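A minimal sketch of getting Cook’s distances from statsmodels’ influence measures; the planted outlier and the data are illustrative assumptions.

```python
# Hedged sketch: Cook's distance via statsmodels influence measures; data are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)
x[0], y[0] = 8.0, -20.0                           # plant one influential outlier

res = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = res.get_influence().cooks_distance   # one distance per observation

print("max Cook's distance:", cooks_d.max())              # the planted point stands out
print("observations with D > 1:", np.where(cooks_d > 1)[0])
```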
when is OLS B.L.U.E?
1) E[u_i | X_1i, …, X_ki] = 0, i.e. errors have conditional mean zero
2) (X_1i, …, X_ki, Y_i) are i.i.d. across observations
3) large outliers ‘unlikely’: finite fourth moments. This lets us lean on central limit theorems for the sampling behaviour of the estimates
Note BLUE is not strictly needed, though desirable; Gauss-Markov efficiency additionally requires homoscedastic, non-autocorrelated errors.
what are the classical linear regression model assumptions?
- linearity in parameters
- error term has conditional mean zero
- no autocorrelation between errors
- regressors not correlated with the error term
- homoskedasticity
- no perfect collinearity between regressors
what plot visualisations help you consider data problems for multivariate regressions?
Y: Residuals vs X: Fitted values
(even spread and centred about zero is nice for OLS assumptions)
Y: Standardised Residuals vs X: Normal percentiles (‘Normal Q-Q’)
Points sitting on the straight line suggest normality; points above the line at one end suggest a fat tail on that end of the domain (i.e. maybe skew or kurtosis).
Y: √|Standardised Residuals| vs X: Fitted values (Scale-Location); a roughly flat, even spread suggests homoscedasticity.
Y: Standardised Residuals vs X: Leverage (or an explanatory factor, unscaled).
Look for points with large Cook’s distance (high leverage combined with a large residual).
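A minimal sketch of producing the four plots with matplotlib and statsmodels; the simulated data and figure layout are illustrative assumptions.

```python
# Hedged sketch: the four standard diagnostic plots; data and layout are illustrative.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 2))
y = 1.0 + 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=150)

res = sm.OLS(y, sm.add_constant(X)).fit()
infl = res.get_influence()
std_resid = infl.resid_studentized_internal
fig, ax = plt.subplots(2, 2, figsize=(10, 8))

ax[0, 0].scatter(res.fittedvalues, res.resid)             # Residuals vs Fitted
ax[0, 0].axhline(0); ax[0, 0].set_title("Residuals vs Fitted")

stats.probplot(std_resid, dist="norm", plot=ax[0, 1])     # Normal Q-Q
ax[0, 1].set_title("Normal Q-Q")

ax[1, 0].scatter(res.fittedvalues, np.sqrt(np.abs(std_resid)))   # Scale-Location
ax[1, 0].set_title("Scale-Location")

ax[1, 1].scatter(infl.hat_matrix_diag, std_resid)         # Residuals vs Leverage
ax[1, 1].set_title("Residuals vs Leverage")

plt.tight_layout(); plt.show()
```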