MLR - model assumptions Flashcards

1
Q

Assumption violations for MLR

A
  1. Residuals with non-zero averages (E[e] ≠ 0)
  2. Heteroscedasticity (Var(e) ≠ sigma^2)
  3. Dependent e’s
  4. Non-normal e’s
  5. Outliers
  6. Collinearity
  7. High dimensions

1-5 relate to residuals, where a residual is the difference between y and ŷ for an observation in the training set. Residuals should behave similarly to the assumed properties of e; contrary behavior indicates a poor model fit.

If 2-5 are violated, the results of t and F tests are questionable.
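As a concrete anchor for items 1-5, a minimal sketch (synthetic data; numpy and statsmodels assumed available) of fitting an MLR and extracting the training-set residuals that the checks below examine:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # three predictors
y = 1 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()    # OLS with an intercept
residuals = model.resid                        # y - y_hat on the training set
print(model.summary())                         # includes the t and F tests
```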

2
Q
  1. Residuals with non-zero averages
A

The expectation is for residuals to average close to 0. Averages are usually taken over groups of observations with similar predictions; each group's residual average should be close to 0. Averaging all residuals at once does not work as a check, because OLS with an intercept forces the overall residual average to be exactly 0.
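A minimal sketch of the binned check described above (synthetic data; numpy assumed available): bucket observations by similar predictions, then inspect each bucket's residual average.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
A = np.column_stack([np.ones(500), X])             # design matrix with intercept
y = A @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=500)

beta = np.linalg.lstsq(A, y, rcond=None)[0]        # OLS coefficient estimates
y_hat = A @ beta
resid = y - y_hat

edges = np.quantile(y_hat, np.linspace(0, 1, 11))  # 10 equal-count intervals
bins = np.clip(np.digitize(y_hat, edges[1:-1]), 0, 9)
for b in range(10):
    print(b, round(resid[bins == b].mean(), 3))    # each average should be near 0
```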

3
Q
  2. Heteroscedasticity
A

The opposite of homoscedasticity: the variance of e is not the same constant for all observations.
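The card does not prescribe a test, but one common numeric check is the Breusch-Pagan test; a hedged sketch on simulated funnel-shaped errors, assuming statsmodels is available:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=400)
y = 3 + 2 * x + rng.normal(scale=0.5 * x)          # error spread widens with x

exog = sm.add_constant(x)
fit = sm.OLS(y, exog).fit()
lm_stat, lm_pval, _, _ = het_breuschpagan(fit.resid, exog)
print(f"Breusch-Pagan p-value: {lm_pval:.3g}")     # small value flags heteroscedasticity
```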

4
Q
  3. Dependent e’s
A

If knowing e for one observation reveals new information about e for another observation, then the e’s are dependent. Dependence can be read as ‘predictable behavior’.
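The card names no specific diagnostic; for serial dependence (one common form), the Durbin-Watson statistic is a standard check. A hedged sketch with simulated AR(1) errors, assuming statsmodels is available; values near 2 suggest independence, values well below 2 suggest positive autocorrelation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 300
e = np.zeros(n)
for t in range(1, n):                          # AR(1): each e depends on the previous one
    e[t] = 0.8 * e[t - 1] + rng.normal()
x = np.linspace(0, 10, n)
y = 1 + 2 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))                # well below 2 for these errors
```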

5
Q
  4. Non-normal e’s
A

If the e’s are not normally distributed, the target is also not normally distributed. The severity of the mismatch is key.

For binary, count, or right-skewed target variables, the mismatch with a normal distribution is often substantial. For a positive-valued continuous target, the mismatch may be mild if the target values are all far above 0.
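A hedged sketch contrasting the two cases above with a Shapiro-Wilk normality test (scipy assumed available): a lognormal, right-skewed target yields clearly non-normal residuals, while a positive target far above 0 with modest noise is a mild mismatch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=500)
y_skewed = np.exp(1 + 0.5 * x + rng.normal(size=500))  # right-skewed target
y_mild = 100 + 3 * x + rng.normal(size=500)            # positive, far above 0

for name, y in [("skewed", y_skewed), ("mild", y_mild)]:
    coef = np.polyfit(x, y, 1)                         # simple OLS line
    resid = y - np.polyval(coef, x)
    stat, pval = stats.shapiro(resid)
    print(name, f"Shapiro-Wilk p-value: {pval:.3g}")   # tiny for the skewed case
```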

6
Q
  5. Outliers
A

In this context, an outlier is an observation with an extreme residual. Outliers inflate the SSE and, because OLS minimizes the SSE, exert outsized influence on the fitted coefficients.
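A minimal sketch of flagging outliers by standardized residual (synthetic data; a crude standardization by the residual standard deviation, not a studentized residual):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 2 + 3 * x + rng.normal(size=200)
y[10] += 25                                    # plant one extreme observation

coef = np.polyfit(x, y, 1)
resid = y - np.polyval(coef, x)
z = resid / resid.std(ddof=2)                  # crude standardized residuals
print(np.where(np.abs(z) > 3)[0])              # should flag observation 10
```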

7
Q
  6. Collinearity
A

Present when a predictor is close to being a linear combination of the other predictors (or of the intercept variable, which equals 1 for all observations). Similar predictors make it difficult for the model to distinguish which ones are truly meaningful. This can lead to unstable coefficient estimates, which may manifest as high p-values in t tests.

Perfect collinearity - a predictor is exactly a linear combination of the other predictors. The terms ‘singularities’ and ‘rank-deficient fit’ also describe this phenomenon. OLS will fail to find a unique set of coefficient estimates. One way this can occur is by including in the model a dummy variable for every level of a factor (alongside the intercept).
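A minimal sketch of the dummy-variable example above (numpy assumed available): with an intercept plus a dummy for every level of a factor, the dummies sum to the intercept column, so the design matrix is rank-deficient and OLS has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(6)
levels = rng.integers(0, 3, size=100)          # a factor with 3 levels
dummies = np.eye(3)[levels]                    # one dummy per level (all 3 kept)
A = np.column_stack([np.ones(100), dummies])   # intercept + all dummies

print(np.linalg.matrix_rank(A), "of", A.shape[1], "columns")  # 3 of 4: rank-deficient
```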

8
Q
  7. High dimensions
A

A dataset is considered high-dimensional when it has too many predictors relative to the number of observations. Overfitting is likely in this situation.

The issues of high dimensions are rooted in the curse of dimensionality: as the number of predictors increases, more observations are needed to retain the same wealth of information; otherwise that information becomes increasingly diluted.

This issue is not limited to MLR.
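A hedged sketch of the overfitting claim above (numpy assumed available): with more predictors than observations, least squares can drive the training residuals to roughly 0 even when the predictors are pure noise.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 40                                  # more predictors than observations
X = rng.normal(size=(n, p))                    # pure-noise predictors
y = rng.normal(size=n)                         # unrelated target

beta = np.linalg.lstsq(X, y, rcond=None)[0]    # minimum-norm least squares
print(np.abs(y - X @ beta).max())              # ~0: a perfect (over)fit to noise
```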

9
Q

Residual Analysis - Residuals against Predictions

A

Non-zero averages
->At reasonable intervals of the predictions, the residuals should average close to 0
->The spread of the residuals is mostly constant regardless of the predictions
->The points appear random, lacking any obvious trend

Heteroscedasticity
->A distinct widening or narrowing of the spread in the residuals
->An increasing funnel shape may be mitigated by log-transforming the target

Discernible trend
->Often makes the other two checks pointless
->The model may be missing a predictor that could explain the trend
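A hedged plotting sketch of this diagnostic (synthetic funnel-shaped data; matplotlib assumed available): residuals against predictions with a reference line at 0.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=300)
y = 1 + 2 * x + rng.normal(scale=0.3 * x)      # spread grows with x (funnel)

coef = np.polyfit(x, y, 1)
y_hat = np.polyval(coef, x)

plt.scatter(y_hat, y - y_hat, s=10)
plt.axhline(0, color="red")
plt.xlabel("Predictions")
plt.ylabel("Residuals")
plt.show()                                     # expect a widening funnel here
```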

10
Q

Residual Analysis - QQ Plot of Residuals

A

Goal - see whether the distribution of the standardized residuals resembles a standard normal distribution. The closer the points are to the superimposed line, the more similar the shapes of the two distributions.

About 95% of standardized residuals are expected to fall between -2 and 2
About 99.7% are expected to fall between -3 and 3
An outlier may be defined as an observation whose standardized residual is more extreme than ±3
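A hedged sketch of this plot using scipy's probplot (scipy and matplotlib assumed available), which plots sample quantiles against standard normal quantiles and superimposes a fitted line:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=300)
y = 1 + 2 * x + rng.normal(size=300)

coef = np.polyfit(x, y, 1)
resid = y - np.polyval(coef, x)
z = resid / resid.std(ddof=2)                  # standardized residuals

stats.probplot(z, dist="norm", plot=plt)       # points should hug the line
plt.show()
```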
