Chapter 3 - Problems with Linear Regression Flashcards

1
Q

Potential issues in linear regression (7)

A

1) interactions between predictors 2) non-linear relationships 3) correlation of error terms 4) non-constant variance of error (heteroscedasticity) 5) outliers 6) high leverage points 7) collinearity

2
Q

Potential Issue with Linear Regression : Interactions between predictors

A

linear regression makes an additive assumption (e.g., that an increase in TV ads causes a fixed increase in sales regardless of how much you spend on radio ads). Visualizing the data can show that this assumption is false.

Solution: add an interaction (multiplicative) term to the model (e.g., … + Beta3 * (TV * radio) + …). In R, tv:radio denotes the interaction term alone, so lm(sales ~ tv * radio) is shorthand for lm(sales ~ tv + radio + tv:radio)
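A minimal R sketch, assuming a data frame named advertising with columns sales, tv, and radio (names hypothetical, in the spirit of the ISLR Advertising data):

    # tv * radio expands to the main effects plus the interaction tv:radio
    fit <- lm(sales ~ tv * radio, data = advertising)
    summary(fit)  # a significant tv:radio coefficient supports the interaction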

3
Q

Potential Issue with Linear Regression : Non-linear relationships

A

a scatterplot may reveal a non-linear relationship. With more than two predictors you can't see this directly, so plot residuals against fitted values instead: a systematic pattern (e.g., residuals running positive-negative-positive, a U-shape) signals non-linearity.

Solution: include polynomial terms in the model (e.g., … + Beta2 * horsepower^2 + …)
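A minimal R sketch, assuming the Auto data from the ISLR package (mpg vs horsepower, as in the book's example):

    library(ISLR)
    # I() makes ^2 a literal square rather than formula syntax
    fit <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
    plot(fitted(fit), resid(fit))  # check that the curvature pattern is gone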

4
Q

Potential Issue with Linear Regression : Correlation of Error terms

A

We assumed that the error terms are uncorrelated (classically, independent and identically distributed, i.e., “i.i.d.”). If the errors are correlated, the estimated standard errors are too small, which invalidates confidence intervals and hypothesis tests. Real-life examples: time series and spatial data (plot residuals against time or location to check).

Solution: use a model that accounts for the correlation structure (e.g., generalized least squares or a time-series model)
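A quick diagnostic sketch in R, assuming a hypothetical data frame dat whose rows are in time order:

    fit <- lm(y ~ x, data = dat)
    # Plot residuals in observation (time) order; long runs of same-sign
    # residuals ("tracking") suggest correlated errors
    plot(resid(fit), type = "l")
    abline(h = 0, lty = 2)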

5
Q

Potential Issue with Linear Regression : Non-constant variance of error

A

also known as heteroscedasticity: the variance of the error terms depends on the input (you see a funnel/cone shape or a trend in the residuals-vs-fitted plot).

Solution: if the trend in the variance is relatively simple, transform the response with a concave function such as the logarithm, i.e., fit log(Y) and re-check the plot of residuals vs fitted values on the transformed scale
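A minimal R sketch, assuming a hypothetical data frame dat with a strictly positive response y:

    fit     <- lm(y ~ x, data = dat)
    fit_log <- lm(log(y) ~ x, data = dat)
    par(mfrow = c(1, 2))
    plot(fitted(fit), resid(fit), main = "raw Y")           # funnel shape
    plot(fitted(fit_log), resid(fit_log), main = "log(Y)")  # should flatten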

6
Q

Potential Issue with Linear Regression : Outliers

A

definition = points with very large residuals (the response is far from the fitted line). Studentized residuals divide each residual by its estimated standard error, so they read as "how many standard deviations the point is from the line"; |studentized residual| > 3 is a common flag. Outliers may barely change the fit itself, but they do affect our estimates of model quality (e.g., they inflate the RSE and shrink R^2).

Solutions: remove the point if it stems from a data-collection error. Otherwise the outlier could point to the need for a more complex model
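A minimal R sketch, assuming a hypothetical data frame dat; rstudent() is the base-R studentized-residual function:

    fit <- lm(y ~ x, data = dat)
    rs  <- rstudent(fit)   # residual divided by its estimated standard error
    which(abs(rs) > 3)     # common rule-of-thumb cutoff for flagging outliers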

7
Q

Potential Issue with Linear Regression : High Leverage points

A

Samples with extreme input values have an outsized effect on the estimated coefficients (Beta-hat). You can find them with the leverage statistic h_i, which measures how far an observation's inputs are from the mean of the inputs.

For simple linear regression there is a closed form: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2.
For multiple linear regression: h_i is the i-th diagonal entry H_ii of the hat matrix H = X (X'X)^{-1} X'. Leverage always lies between 1/n and 1 and averages (p + 1)/n, so a point (x_i, y_i) has high leverage if H_ii greatly exceeds that average.
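A minimal R sketch, assuming a hypothetical data frame dat; hatvalues() returns the diagonal of the hat matrix:

    fit <- lm(y ~ x1 + x2, data = dat)
    h   <- hatvalues(fit)          # leverage statistic h_i for each observation
    p   <- length(coef(fit)) - 1   # number of predictors
    which(h > 3 * (p + 1) / nrow(dat))  # flag points well above average leverage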

8
Q

Potential Issue with Linear Regression : Collinearity

A

two predictors are collinear if one explains the other well (e.g., limit = a * rating + b: they contain nearly the same information). The problem is that the coefficients become unidentifiable; many coefficient settings give essentially the same fit. E.g., in balance = a + b1 * limit + b2 * limit, the fit {a, b1, b2} is exactly as good as {a, b1 + 100, b2 - 100}.

Solution: if 2 variables are collinear, we can easily diagnose this from the correlation matrix. If q variables are multicollinear (i.e., they jointly contain less information than q independent variables), pairwise correlations can miss it; use the VIF (variance inflation factor), VIF(Bhat_j) = 1 / (1 - R^2_{Xj|X-j}), where R^2_{Xj|X-j} is the R^2 from regressing Xj onto the remaining predictors. A high VIF flags an Xj that is well explained by the others
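A minimal R sketch using vif() from the car package (assumed installed), with the Credit data from the ISLR package:

    library(ISLR)
    library(car)
    fit <- lm(Balance ~ Limit + Rating + Age, data = Credit)
    vif(fit)  # VIF_j = 1 / (1 - R^2 of X_j on the other predictors);
              # values above ~5-10 are commonly treated as problematic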

9
Q

Comparing Linear Regression and KNN (definition, when is KNN better? when not?)

A

1) linear regression is the prototypical parametric method; KNN regression is the prototypical non-parametric method 2) KNN beats linear regression only when f isn't (close to) linear; its flexibility comes at the price of higher variance 3) however, if p >> n, linear regression wins even when f is very non-linear, because in high dimensions each sample effectively has no nearby neighbors (known as the curse of dimensionality)
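A small simulation sketch in R, using knn.reg() from the FNN package (assumed installed), illustrating KNN beating lm when f is non-linear in one dimension:

    library(FNN)
    set.seed(1)
    x <- runif(200)
    y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)  # strongly non-linear f
    lin <- lm(y ~ x)
    knn <- knn.reg(train = matrix(x), test = matrix(x), y = y, k = 9)
    mean((y - fitted(lin))^2)  # a straight line cannot track the curve
    mean((y - knn$pred)^2)     # the KNN fit is much closer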
