03 Regression Diagnostics Flashcards

1
Q

Gauss-Markov Theorem: What does it state?

What does BLUE mean?

What’s the difference between bias, consistent and efficient?

A

The Gauss-Markov theorem states: in a linear regression model whose error terms have expected value zero, are uncorrelated and have constant variance, the OLS estimator is BLUE.

Best (lowest variance) Linear Unbiased (expected value of the estimator equals the true parameter value) ESTIMATOR

bias -> see above (a biased estimator's expected value differs from the true parameter)
consistent -> the estimator converges to the true value as the sample size grows (its variance shrinks with larger samples)
efficient -> the estimator has the lowest variance among unbiased estimators

2
Q

Assumptions of Gauss-Markov: What are they?

The OLS estimator is BLUE if and only if …

A

linear relationship in parameters beta

no perfect multicollinearity / no linear dependency between predictors

homoscedasticity -> residuals have constant variance

no autocorrelation-> no correlation between 2 residual terms

exogeneity: the expected value of the error vector is zero, i.e. cov(error term, X) = 0

3
Q

Assumption #1: Linear relationship in parameters beta.

How to remedy?

How to deal with outliers? Why do we get outliers anyway?

A

Reformulate using X^2, ln(X) or segmentation

Use Cook's distance to measure the effect of deleting observation k: D_k = sum over i of (ŷ_i - ŷ_i(k))^2, divided by (number of model parameters * MSE of the model), where ŷ_i(k) is the fitted value when observation k is left out (see the sketch below).

Reasons for outliers: an error in recording the value, or the point does not belong to the sample.
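
A minimal sketch of this check in Python with statsmodels (the library choice and the toy data are assumptions, not part of the card):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                     # hypothetical predictors
y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(size=50)
y[10] += 8                                       # inject an artificial outlier

model = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d, _ = model.get_influence().cooks_distance

# Common rule of thumb: flag observations with D > 4/n as influential.
print(np.where(cooks_d > 4 / len(y))[0])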

4
Q

Assumption #2: No multicollinearity between predictors

If two independent variables were dependent,
To check for linear dependencies of two columns in a matrix,…

Also high correlation between independent variables leads to issues wrt….

How to check for multicollinearity? (2 ways)
P.S. It is possible that the pairwise correlations are small, and yet a linear dependence exists among three or even more variables.

A

… one could easily omit one.

use the rank of the X matrix. If rank(X) < p+1 -> multicollinearity

the significance of predictors.

check the correlation coefficient of each predictor pair.
OR: use the Variance Inflation Factor (VIF).
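
Both checks sketched in Python/NumPy (toy data assumed for illustration):

import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2 * x1 - x2                                 # exact linear dependency
X = np.column_stack([np.ones(100), x1, x2, x3])  # design matrix with intercept (p + 1 = 4 columns)

print(np.linalg.matrix_rank(X))                  # 3 < p + 1 -> multicollinearity
print(np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False))  # pairwise correlations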

5
Q

Variance Inflation Factor: test for what? intuition? meaning of VIF = 10?

What is the consequence of non-significance?

A

Used to test for multicollinearity -> linear dependency between predictors.

Intuition: regress predictor k on the other predictors and find R^2_k (the % of its variance explained by the other predictors). VIF_k = 1 / (1 - R^2_k)

e.g. a VIF of 10 means that R^2_k is 90% -> 90% of the variance of this predictor is explained by the other predictors. REMOVE IT.

=> small t-value. Two cases:
1. high VIF -> the variable is related to the response but is not required in the regression, because it is strongly related to a third variable that is already in the regression, so we drop it from the model
2. low VIF -> the variable is not related to the response
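
A minimal VIF sketch using statsmodels' variance_inflation_factor (toy data assumed):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)             # strongly related to x1 -> high VIF expected
x3 = rng.normal(size=200)
exog = sm.add_constant(np.column_stack([x1, x2, x3]))

for k in range(1, exog.shape[1]):                # skip the constant column
    print(k, variance_inflation_factor(exog, k))
# A VIF near 10 for a column means R^2_k of about 0.9 for that predictor.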

6
Q

Assumption #3 Homoscedasticity.

6th GM assumption; what does violating this assumption mean?

What tests to use and how do those work?

A

Assume residuals are normally distributed with constant variance

violation => heteroscedasticity => standard errors are biased, and hence the p-values of significance tests are unreliable

Glejser test: regress the absolute values of the residuals on the predictors; significant coefficients mean the error variance depends on the predictors (heteroscedasticity).

White test: regress the squared residuals on the predictors, their squares and cross products; under the null of homoscedasticity, n*R^2 of this auxiliary regression follows a chi^2-distribution.
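
A sketch of both checks in Python/statsmodels (het_white is the library's White test; the Glejser step is hand-rolled on assumed toy data):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 3 + 2 * x + rng.normal(scale=x, size=200)    # error variance grows with x
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# White test: auxiliary regression of squared residuals on X, X^2 (and cross terms).
lm_stat, lm_pvalue, _, _ = het_white(fit.resid, X)
print(lm_pvalue)                                 # small p-value -> evidence of heteroscedasticity

# Glejser-style check: regress |residuals| on the predictors and inspect the t-tests.
glejser = sm.OLS(np.abs(fit.resid), X).fit()
print(glejser.pvalues)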

7
Q

Assumption #4: No autocorrelation

How to test?
Reasons for autocorrelation?

A

Autocorrelation can be detected by graphing the residuals against time, or with the Durbin-Watson statistic.

Reasons leading to autocorrelation:
* Omitted an important variable
* Functional misfit
* Measurement error in independent variable

Use the Durbin-Watson (DW) statistic to test for first-order autocorrelation. DW takes values within [0, 4]. For no serial correlation, a value close to 2 (e.g., 1.5-2.5) is expected (0: perfect positive autocorrelation, 4: perfect negative autocorrelation).

D = Σ_{i=2..n} (e_i - e_{i-1})^2 / Σ_{i=1..n} e_i^2

Can also use Breusch-Godfrey test for higher order auto-regressive schemes.
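
A minimal sketch computing DW with statsmodels (toy AR(1) data assumed):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
n = 200
x = np.arange(n, dtype=float)
e = np.zeros(n)
for t in range(1, n):                            # AR(1) errors -> positive autocorrelation
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))                  # well below 2 here -> positive autocorrelation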

8
Q

Breusch-Godfrey test

A

The Breusch-Godfrey test also considers higher-order autoregressive schemes:

Estimate the regression using OLS, obtain the residuals, regress the residuals on the original regressors plus the lagged residuals up to order p, and obtain the R^2 of this auxiliary regression.

If the sample size is large, Breusch and Godfrey have shown that (n - p)R^2 follows a chi^2-distribution with p degrees of freedom.

The null hypothesis is that there is no autocorrelation, i.e. ρ_j = 0 for all j.

If (n - p)R^2 exceeds the critical value at the chosen level of significance, we reject the null hypothesis; in that case at least one ρ_j is statistically different from zero.
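
A sketch using statsmodels' acorr_breusch_godfrey (toy AR(2) data assumed):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(2, n):                            # AR(2) errors: higher-order autocorrelation
    e[t] = 0.5 * e[t - 1] + 0.3 * e[t - 2] + rng.normal()
y = 2 + x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, _, _ = acorr_breusch_godfrey(fit, nlags=2)
print(lm_stat, lm_pvalue)                        # small p-value -> reject H0 of no autocorrelation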

9
Q

Assumption #5: Exogeneity (i.e. the expected value of the error terms is zero). Why might it be violated?

A

Reasons for endogeneity: measurement errors, simultaneity (variables that affect each other),
omitted variables(!)

10
Q

e) The computed R2 value is not very good (even with the intercept). What could be the reason?

ii) Which scales of measurement do the variables belong to (e.g., nominal, ordinal, interval or ratio)?

A

e) The model choice could be an inadequate match. Nonlinear transformations of the input variables (e.g., adding quadratic terms) could provide a better solution. In the solution script, you can find a least-squares model where quadratic terms were added.

  • Nominal – Categorical data with no meaningful order (e.g., colors, gender, names).
  • Ordinal – Categorical data with a meaningful order but no fixed intervals (e.g., rankings, satisfaction levels).
  • Interval – Numeric data with equal intervals but no true zero (e.g., temperature in Celsius, IQ scores).
  • Ratio – Numeric data with equal intervals and a true zero (e.g., weight, height, income).
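
A sketch of the quadratic-terms idea (plain OLS with an added x^2 column; the actual solution script is not reproduced here):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=150)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(size=150)            # truly quadratic relationship

linear = sm.OLS(y, sm.add_constant(x)).fit()
quadratic = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
print(linear.rsquared, quadratic.rsquared)                    # the quadratic model fits much better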