Ch9 Regression Diagnostics Flashcards

1
Q

What is the White Test and why use it?

A

Suppose you have a linear regression with a constant and k regressors, y = a + b_1 x_1 + … + b_k x_k + e, estimated on n data points.

The White test tests the null that there is no link between the regressors and the variance of the error term, ie that the errors are homoskedastic. Thus it can help detect a violation of the homoskedasticity assumption of an Ordinary Least Squares model.

It is a two-step process before the hypothesis test:

1) estimate the regression above
2) regress the squared residuals e^2 on a constant, plus the regressors, their squares, and all cross products. This is like a ‘Weierstrass’ polynomial approximation of some unknown nonlinear function linking the regressors to the error variance, ie postulating some link. The auxiliary regression therefore has the constant plus 2k + kC2 = k(k+3)/2 regressors, plus its own error term. Get the R^2 of this second regression, R^2_2.

H_0: the non-constant terms all have zero coefficients in the second regression.
Lagrange multiplier (LM) approach: nR^2_2 is approximately chi-squared with k(k+3)/2 degrees of freedom, k being the number of regressors.
Reject H_0 if this exceeds the chi-squared critical value at significance level alpha.
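A minimal sketch of running the test in Python (not from the source; it assumes numpy and statsmodels, with simulated data standing in for the regressors X and dependent variable y):

```python
# Sketch: White test via statsmodels' het_white on simulated heteroscedastic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(0)
n, k = 500, 2
X = rng.normal(size=(n, k))
e = rng.normal(size=n) * (1 + np.abs(X[:, 0]))   # error variance grows with x_1
y = 1.0 + X @ np.array([0.5, -0.3]) + e

exog = sm.add_constant(X)                 # constant + k regressors
fit = sm.OLS(y, exog).fit()               # step 1: estimate the original regression

# step 2 + LM test: het_white regresses the squared residuals on the regressors,
# their squares and cross products, and reports the n*R^2_2 statistic.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(fit.resid, exog)
print(f"LM stat = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
# Reject H_0 (homoskedasticity) if lm_pvalue < alpha.
```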

2
Q

What is multicollinearity, what are its consequences, and what is a tell for it?

A

Multicollinearity is present when some regressors are strongly correlated with one another.
If this is ‘perfect’ it violates the OLS assumptions outright, but it is often present in real life in some weaker/imperfect sense.

In extremis (perfect collinearity) the individual slope coefficients cannot be estimated at all. Short of that, the estimates are not biased, but they are much noisier: standard errors are inflated and individual coefficients are estimated imprecisely.

We suspect multicollinearity if the F-test statistic is large (the null of no joint explanatory power across the whole set is rejected), yet individual t-stats are weak. This is because, drilling down per regressor, it is hard to attribute explanatory power to any single one. The model may have too many ‘similar’ regressors and could explain the variation in the dependent variable just as well, and more cleanly, with fewer of them.

We can regress each suspected collinear variable on the other regressors and get the Variance Inflation Factor from each of these auxiliary regressions. If any of these VIFs are ‘excessive’, that is a tell of which factors are problematic.

VIF_j = 1/(1 - R_j^2), where R_j^2 is the R^2 of the auxiliary regression for regressor j.
Excessive if above 10, ie more than 90% of the variation in the suspected factor is explained by the other factors - ie collinearity.
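A minimal sketch of computing VIFs in Python (not from the source; it assumes numpy and statsmodels, with a deliberately near-collinear simulated design):

```python
# Sketch: variance inflation factors for each regressor.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)       # nearly collinear with x1
x3 = rng.normal(size=n)
exog = sm.add_constant(np.column_stack([x1, x2, x3]))

# variance_inflation_factor regresses column j on the other columns
# and returns 1 / (1 - R_j^2); values above ~10 flag problem regressors.
for j, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(exog, j), 1))
```

Here x1 and x2 should both show very large VIFs, while x3 stays near 1.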

3
Q

What are the consequences of omitted variable bias?

A
  • included variables may be correlated with the omitted one - so their estimates end up biased / inconsistent.
  • residuals are larger, since they still contain the variation due to the missed explanatory factor
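A minimal sketch of the bias in action (not from the source; it assumes numpy and statsmodels, and the true model and correlation below are purely illustrative):

```python
# Sketch: simulated omitted-variable bias. True model: y = 1 + 0.5*x1 + 0.8*x2 + u,
# with x2 correlated with x1; dropping x2 biases the estimated x1 coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # x2 correlated with x1
y = 1.0 + 0.5 * x1 + 0.8 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(y, sm.add_constant(x1)).fit()  # x2 omitted

print("full model b1 :", round(full.params[1], 3))   # close to the true 0.5
print("short model b1:", round(short.params[1], 3))  # ~0.5 + 0.8*0.7 = 1.06 (biased)
```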
4
Q

What if an irrelevant / extraneous regressor is included?

A

This adds no information. Use the adjusted R^2, which penalises extra regressors:
R^2_adj = 1 - adjustment * RSS/TSS
adjustment = (n - 1)/(n - k - 1), with n observations and k regressors (excluding the constant)
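A minimal sketch of the computation (not from the source; the RSS/TSS numbers below are illustrative):

```python
# Sketch: adjusted R^2 from RSS, TSS, n (observations) and k (regressors, excl. constant).
def adjusted_r2(rss: float, tss: float, n: int, k: int) -> float:
    r2 = 1.0 - rss / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same fit quality (R^2 = 0.6), but the penalty grows with k.
print(adjusted_r2(rss=40.0, tss=100.0, n=50, k=3))   # ~0.574
print(adjusted_r2(rss=40.0, tss=100.0, n=50, k=10))  # ~0.497
```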

5
Q

What is the bias/variance trade-off when selecting variables?

A

when estimating parameters, bias/inconsistency means we miss the target on average.
variance/precision is noise around the expected number.

Sometimes we tolerate a bias if the resulting noise is smaller, as a compromise made in the hope of improving the odds of ‘hitting the target’ (see the sketch after the list below).

Some Factors driving the tradeoff:
data quality,
sample size,
whether the bias will be ‘conservative’ (context)
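A minimal sketch of the trade-off (not from the source; the shrinkage estimator and numbers are purely illustrative), using the decomposition MSE = bias^2 + variance:

```python
# Sketch: accepting a small bias can lower overall mean squared error (MSE = bias^2 + variance).
import numpy as np

rng = np.random.default_rng(3)
true_mean, n, trials = 2.0, 10, 100_000

samples = rng.normal(loc=true_mean, scale=5.0, size=(trials, n))
unbiased = samples.mean(axis=1)    # sample mean: unbiased but noisy with n = 10
shrunk = 0.7 * unbiased            # shrink toward zero: biased, lower variance

for name, est in [("unbiased", unbiased), ("shrunk", shrunk)]:
    bias, var = est.mean() - true_mean, est.var()
    print(f"{name:8s} MSE ~ bias^2 + variance = {bias**2 + var:.2f}")
# The shrunk estimator typically shows the lower MSE despite its bias.
```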

6
Q

What is general-to-specific model selection?

A

Start large: include many candidate regressors, then eliminate the least individually significant one and re-estimate. Repeat until only individually t-significant regressors remain (a sketch follows below).
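A minimal sketch of the loop (not from the source; it assumes pandas and statsmodels, with y a Series and X a DataFrame of candidate regressors):

```python
# Sketch: general-to-specific (backward) elimination on individual p-values.
import pandas as pd
import statsmodels.api as sm

def general_to_specific(y: pd.Series, X: pd.DataFrame, alpha: float = 0.05):
    """Drop the least individually significant regressor until all pass at level alpha."""
    keep = list(X.columns)
    while keep:
        fit = sm.OLS(y, sm.add_constant(X[keep])).fit()
        pvals = fit.pvalues.drop("const")      # slope p-values only
        worst = pvals.idxmax()                 # least significant remaining regressor
        if pvals[worst] <= alpha:
            return fit                         # all survivors are individually significant
        keep.remove(worst)                     # eliminate it and re-estimate
    return None                                # nothing survived the pruning
```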

7
Q

What is m-fold cross-validation?

A

build some candidate models
split the data set into m partition blocks (folds)
estimate the models on all but one block (training set)
try to predict the held-out block (validation set)
repeat m times, once per held-out block, and average the out-of-sample error

select the best performer (a sketch follows below)
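A minimal sketch (not from the source; it assumes numpy and statsmodels, with X an n-by-k array and y a length-n array, and out-of-sample MSE as the performance measure):

```python
# Sketch: m-fold cross-validation of an OLS model, scored by out-of-sample MSE.
import numpy as np
import statsmodels.api as sm

def cv_mse(y, X, m=5, seed=0):
    """Average prediction MSE over m folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, m)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)                          # all other blocks
        fit = sm.OLS(y[train], sm.add_constant(X[train])).fit()  # estimate on training set
        pred = fit.predict(sm.add_constant(X[fold]))             # predict the held-out block
        errors.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errors)

# Usage idea: score each candidate column subset and keep the lowest CV error, e.g.
# best = min(candidate_column_sets, key=lambda cols: cv_mse(y, X[:, cols]))
```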

8
Q

What is Cook’s distance?

A

Look at the fitted Y values from the full regression, and again when a given observation is dropped. Sum the squared differences and divide by k·s^2 (number of estimated coefficients times the sample variance of the errors).

An ‘inlier’ => small Cook’s distance, ie less than unity.
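A minimal sketch (not from the source; it assumes numpy and statsmodels, with one influential point planted in simulated data):

```python
# Sketch: Cook's distances from a fitted statsmodels OLS result.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)
x[0], y[0] = 6.0, -20.0                      # plant one influential outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = fit.get_influence().cooks_distance
print("max Cook's D:", round(cooks_d.max(), 2))              # the planted point stands out
print("observations with D > 1:", np.where(cooks_d > 1)[0])
```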

9
Q

When is OLS B.L.U.E.?

A

1) E[u_i | X_1, …, X_n] = 0 (errors have conditional mean zero)
2) (X_1i, …, X_ki, Y_i) are i.i.d. across observations
3) large outliers are ‘unlikely’: finite fourth moments. This helps us lean on CLTs for the sampling behaviour of the estimates

Note: BLUE is not strictly needed, though desirable. (For the Gauss-Markov BLUE result itself, homoskedastic errors are also required; the three assumptions above deliver consistency and asymptotic normality.)

10
Q

What are the classical linear regression model assumptions?

A
linearity in parameters
error term has conditional mean zero
no autocorrelation between errors
regressors not correlated with the error term
homoskedasticity
no perfect collinearity between regressors
11
Q

What plot visualisations help you consider data problems for multivariate regressions?

A

Y: Residuals vs X: Fitted values
(even spread and centred about zero is nice for OLS assumptions)

Y: Standardised Residuals vs X: Normal percentiles (‘Normal Q-Q’)
Points sitting on the straight line suggest normality; points above the line at one end suggest a fat tail on that end of the domain (ie maybe skew or kurtosis)

Y: sqrt(|Standardised Residuals|) vs X: Fitted values (‘Scale-Location’)
(a roughly flat, even band suggests homoskedasticity)

Y: Standardised Residuals vs X: Leverage (‘Residuals vs Leverage’)
look for points with large Cook’s distance (influential observations); a sketch of all four plots follows below
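A minimal sketch producing the four plots (not from the source; it assumes numpy, matplotlib and statsmodels, with simulated data standing in for a fitted model):

```python
# Sketch: the four standard diagnostic plots for a fitted statsmodels OLS result.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=200)
fit = sm.OLS(y, sm.add_constant(X)).fit()

infl = fit.get_influence()
std_resid = infl.resid_studentized_internal
fig, ax = plt.subplots(2, 2, figsize=(10, 8))

ax[0, 0].scatter(fit.fittedvalues, fit.resid, s=10)      # Residuals vs Fitted
ax[0, 0].axhline(0, color="grey")
ax[0, 0].set(title="Residuals vs Fitted")

sm.qqplot(std_resid, line="45", ax=ax[0, 1])             # Normal Q-Q
ax[0, 1].set(title="Normal Q-Q")

ax[1, 0].scatter(fit.fittedvalues, np.sqrt(np.abs(std_resid)), s=10)
ax[1, 0].set(title="Scale-Location")

ax[1, 1].scatter(infl.hat_matrix_diag, std_resid, s=10)  # Residuals vs Leverage
ax[1, 1].set(title="Residuals vs Leverage")

plt.tight_layout()
plt.show()
```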
