MLR - model assumptions Flashcards

1
Q

Assumption violations for MLR

A
  1. Residuals with non-zero averages (E[e] ≠ 0)
  2. Heteroscedasticity (Var(e) ≠ sigma^2)
  3. Dependent e’s
  4. Non-normal e’s
  5. Outliers
  6. Collinearity
  7. High dimensions

1-5 relate to residuals, where a residual is the difference between y and ŷ for an observation in the training set. Residuals should behave similarly to the assumed properties of e; contrary behavior indicates a poor model fit.

If 2-5 are violated, the results of t and F tests are questionable.
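As a concrete anchor for items 1-5, a minimal sketch (synthetic data; numpy and statsmodels assumed available) of fitting an MLR and extracting the training-set residuals that the checks below examine:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # three predictors
y = 1 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()    # OLS with an intercept
residuals = model.resid                        # y - y_hat on the training set
print(model.summary())                         # includes the t and F tests
```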

2
Q
  1. Residuals with non-zero averages
A

The expectation is for residuals to average close to 0. Averages are usually taken over groups of observations with similar predictions; each group's residual average should be close to 0. Averaging all residuals at once does not work as a check, because OLS with an intercept forces the overall residual average to be exactly 0.
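A minimal sketch of the binned check described above (synthetic data; numpy assumed available): bucket observations by similar predictions, then inspect each bucket's residual average.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
A = np.column_stack([np.ones(500), X])             # design matrix with intercept
y = A @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=500)

beta = np.linalg.lstsq(A, y, rcond=None)[0]        # OLS coefficient estimates
y_hat = A @ beta
resid = y - y_hat

edges = np.quantile(y_hat, np.linspace(0, 1, 11))  # 10 equal-count intervals
bins = np.clip(np.digitize(y_hat, edges[1:-1]), 0, 9)
for b in range(10):
    print(b, round(resid[bins == b].mean(), 3))    # each average should be near 0
```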

3
Q
  2. Heteroscedasticity
A

The opposite of homoscedasticity: the variance of e is not the same constant for all observations.
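The card does not prescribe a test, but one common numeric check is the Breusch-Pagan test; a hedged sketch on simulated funnel-shaped errors, assuming statsmodels is available:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=400)
y = 3 + 2 * x + rng.normal(scale=0.5 * x)          # error spread widens with x

exog = sm.add_constant(x)
fit = sm.OLS(y, exog).fit()
lm_stat, lm_pval, _, _ = het_breuschpagan(fit.resid, exog)
print(f"Breusch-Pagan p-value: {lm_pval:.3g}")     # small value flags heteroscedasticity
```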

4
Q
  3. Dependent e’s
A

If knowing e for one observation reveals new information about e for another observation, then the e’s are dependent. Dependence can be read as ‘predictable behavior’.
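The card names no specific diagnostic; for serial dependence (one common form), the Durbin-Watson statistic is a standard check. A hedged sketch with simulated AR(1) errors, assuming statsmodels is available; values near 2 suggest independence, values well below 2 suggest positive autocorrelation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 300
e = np.zeros(n)
for t in range(1, n):                          # AR(1): each e depends on the previous one
    e[t] = 0.8 * e[t - 1] + rng.normal()
x = np.linspace(0, 10, n)
y = 1 + 2 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))                # well below 2 for these errors
```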

5
Q
  4. Non-normal e’s
A

If the e’s are not normally distributed, the target is also not normally distributed. The severity of the mismatch is key.

For binary, count, or right-skewed target variables, the mismatch with a normal distribution is often substantial. For a positive-valued continuous target, the mismatch may be mild if the target values are all far above 0.
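A hedged sketch contrasting the two cases above with a Shapiro-Wilk normality test (scipy assumed available): a lognormal, right-skewed target yields clearly non-normal residuals, while a positive target far above 0 with modest noise is a mild mismatch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=500)
y_skewed = np.exp(1 + 0.5 * x + rng.normal(size=500))  # right-skewed target
y_mild = 100 + 3 * x + rng.normal(size=500)            # positive, far above 0

for name, y in [("skewed", y_skewed), ("mild", y_mild)]:
    coef = np.polyfit(x, y, 1)                         # simple OLS line
    resid = y - np.polyval(coef, x)
    stat, pval = stats.shapiro(resid)
    print(name, f"Shapiro-Wilk p-value: {pval:.3g}")   # tiny for the skewed case
```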

6
Q
  5. Outliers
A

In this context, an outlier is an observation with an extreme residual. Outliers inflate the SSE and, because OLS minimizes the SSE, exert outsized influence on the fitted coefficients.
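A minimal sketch of flagging outliers by standardized residual (synthetic data; a crude standardization by the residual standard deviation, not a studentized residual):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 2 + 3 * x + rng.normal(size=200)
y[10] += 25                                    # plant one extreme observation

coef = np.polyfit(x, y, 1)
resid = y - np.polyval(coef, x)
z = resid / resid.std(ddof=2)                  # crude standardized residuals
print(np.where(np.abs(z) > 3)[0])              # should flag observation 10
```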

7
Q
  6. Collinearity
A

Present when a predictor is close to being a linear combination of the other predictors (or of the intercept variable, which equals 1 for all observations). Similar predictors make it difficult for the model to distinguish which ones are truly meaningful. This can lead to unstable coefficient estimates, which may manifest as high p-values in t tests.

Perfect collinearity - a predictor is exactly a linear combination of the other predictors. The terms ‘singularities’ and ‘rank-deficient fit’ also describe this phenomenon. OLS will fail to find a unique set of coefficient estimates. One way this can occur is by including in the model a dummy variable for every level of a factor (alongside the intercept).
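A minimal sketch of the dummy-variable example above (numpy assumed available): with an intercept plus a dummy for every level of a factor, the dummies sum to the intercept column, so the design matrix is rank-deficient and OLS has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(6)
levels = rng.integers(0, 3, size=100)          # a factor with 3 levels
dummies = np.eye(3)[levels]                    # one dummy per level (all 3 kept)
A = np.column_stack([np.ones(100), dummies])   # intercept + all dummies

print(np.linalg.matrix_rank(A), "of", A.shape[1], "columns")  # 3 of 4: rank-deficient
```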

8
Q
  7. High dimensions
A

A dataset is considered high-dimensional when it has too many predictors relative to the number of observations. Overfitting is likely in this situation.

The issues of high dimensions are rooted in the curse of dimensionality: as the number of predictors increases, more observations are needed to retain the same wealth of information; otherwise that information becomes increasingly diluted.

This issue is not limited to MLR.
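A hedged sketch of the overfitting claim above (numpy assumed available): with more predictors than observations, least squares can drive the training residuals to roughly 0 even when the predictors are pure noise.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 40                                  # more predictors than observations
X = rng.normal(size=(n, p))                    # pure-noise predictors
y = rng.normal(size=n)                         # unrelated target

beta = np.linalg.lstsq(X, y, rcond=None)[0]    # minimum-norm least squares
print(np.abs(y - X @ beta).max())              # ~0: a perfect (over)fit to noise
```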

9
Q

Residual Analysis - Residuals against Predictions

A

Non-zero averages
->At reasonable intervals of the predictions, the residuals should average close to 0
->The spread of the residuals is mostly constant regardless of the predictions
->The points appear random, lacking any obvious trend

Heteroscedasticity
->A distinct widening or narrowing of the spread in the residuals
->An increasing funnel shape may be mitigated by log-transforming the target

Discernible trend
->Often makes the other two checks pointless
->The model may be missing a predictor that could explain the trend
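A hedged plotting sketch of this diagnostic (synthetic funnel-shaped data; matplotlib assumed available): residuals against predictions with a reference line at 0.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=300)
y = 1 + 2 * x + rng.normal(scale=0.3 * x)      # spread grows with x (funnel)

coef = np.polyfit(x, y, 1)
y_hat = np.polyval(coef, x)

plt.scatter(y_hat, y - y_hat, s=10)
plt.axhline(0, color="red")
plt.xlabel("Predictions")
plt.ylabel("Residuals")
plt.show()                                     # expect a widening funnel here
```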

10
Q

Residual Analysis - QQ Plot of Residuals

A

Goal - see whether the distribution of the standardized residuals resembles a standard normal distribution. The closer the points are to the superimposed line, the more similar the shapes of the two distributions.

About 95% of standardized residuals are expected to fall between -2 and 2
About 99.7% are expected to fall between -3 and 3
An outlier may be defined as an observation whose standardized residual is more extreme than ±3
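A hedged sketch of this plot using scipy's probplot (scipy and matplotlib assumed available), which plots sample quantiles against standard normal quantiles and superimposes a fitted line:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=300)
y = 1 + 2 * x + rng.normal(size=300)

coef = np.polyfit(x, y, 1)
resid = y - np.polyval(coef, x)
z = resid / resid.std(ddof=2)                  # standardized residuals

stats.probplot(z, dist="norm", plot=plt)       # points should hug the line
plt.show()
```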
