Linear Models Flashcards
What do the AIC and BIC hope to accomplish?
Provide an indirect estimate of the test error by adjusting the training error (e.g., the SSE in MLR) with a penalty for the number of parameters in the model.
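A minimal sketch of the idea with statsmodels (the data and variable names below are made up for illustration; the OLS results object exposes aic and bic directly):

```python
# Sketch: comparing two candidate models by AIC/BIC (lower is better).
# Assumes numpy and statsmodels are installed; the data are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 3 * x1 + rng.normal(size=n)          # x2 is pure noise

X_small = sm.add_constant(np.column_stack([x1]))
X_big = sm.add_constant(np.column_stack([x1, x2]))

fit_small = sm.OLS(y, X_small).fit()
fit_big = sm.OLS(y, X_big).fit()

# Training SSE alone would favor the bigger model; AIC/BIC penalize the extra parameter.
print(fit_small.aic, fit_big.aic)
print(fit_small.bic, fit_big.bic)
```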
How does a partial F-Test differ from a regular F-Test?
A regular F-Test tests whether the model as a whole is statistically significant (think: is at least one of the coefficients nonzero?). A partial F-Test tests whether the full model is significantly better than some nested (reduced) model, i.e., whether the extra predictors add explanatory power.
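A minimal sketch using statsmodels' formula interface with synthetic data (the variable names are illustrative); anova_lm performs the nested (partial F) comparison:

```python
# Sketch: regular F-test (overall significance) vs. partial F-test (nested comparison).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=80), "x2": rng.normal(size=80)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=80)

reduced = smf.ols("y ~ x1", data=df).fit()
full = smf.ols("y ~ x1 + x2", data=df).fit()

print(full.fvalue, full.f_pvalue)   # regular F-test: is any coefficient nonzero?
print(anova_lm(reduced, full))      # partial F-test: does adding x2 help beyond x1?
```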
Why is R^2 a poor measure for model comparison in MLR?
R^2 never decreases, and will almost always increase, simply by adding more predictors to the model, even irrelevant ones. A better measure is adjusted R^2, which penalizes for the number of predictors.
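For reference, the standard adjusted R^2 formula for a model with n observations and p (non-intercept) predictors:

```latex
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1}
```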
What is the difference between a T-Test and an F-Test?
A T-Test tests one coefficient at a time, whereas an F-Test tests hypotheses involving multiple coefficients simultaneously (e.g., whether all slope coefficients are zero).
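A useful link between the two: when the hypothesis involves just a single coefficient, the tests agree, because the square of the t-statistic equals the partial F-statistic for dropping that one predictor:

```latex
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}, \qquad t_j^{\,2} = F_{1,\; n-p-1}
```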
Why is the prediction interval always going to be wider than the confidence interval?
The prediction interval takes into account both (1) the error in the estimate of f(x) (the reducible error: inaccuracy in the coefficient estimates and any bias from the model assumptions) AND (2) the random, irreducible error.
What is the difference between the confidence interval and the prediction interval?
The confidence interval reflects the error in estimating the expected value. With an infinite amount of data, the error goes to zero. The prediction interval reflects the uncertainty in the predicted observation. That uncertainty is independent of the sample size and thus the interval cannot go to zero. The additional uncertainty in making predictions about a future value as compared to estimating its expected value leads to a wider interval.
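A minimal sketch with statsmodels and synthetic data; get_prediction returns both intervals, and the obs_ci (prediction) columns come out visibly wider than the mean_ci (confidence) columns:

```python
# Sketch: confidence vs. prediction interval at the same x values (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=60)
y = 1 + 0.5 * x + rng.normal(scale=2, size=60)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

frame = res.get_prediction(X).summary_frame(alpha=0.05)
# mean_ci_* is the confidence interval for E[y|x]; obs_ci_* is the (wider) prediction interval.
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]].head())
```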
For a simple linear relationship, will the confidence interval=prediction interval if the error term=0?
Yes; since the prediction interval adds the irreducible error to the width of the confidence interval, if that error term equals 0 the two intervals coincide.
What are problems that could occur when p>n, or p approximates n, when we try to use least squares regression to fit the model?
This is an example of high dimensionality. (1) Least squares becomes too flexible and will simply overfit the data (when p ≥ n it can interpolate the training observations exactly). (2) R^2 becomes unreliable as p increases, and one can easily attain an adjusted R^2 of 1. (3) Cp, AIC, and BIC are inappropriate because the estimate of the error variance they rely on is (essentially) zero. (4) Extreme multicollinearity among the predictors.
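A minimal sketch of the breakdown, using statsmodels and synthetic pure-noise data:

```python
# Sketch: when the number of parameters reaches n, least squares interpolates the data:
# R^2 hits 1 and the training SSE is ~0, so Cp/AIC/BIC have nothing reliable to work with.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, p = 20, 19                       # p predictors + intercept = n parameters
X = sm.add_constant(rng.normal(size=(n, p)))
y = rng.normal(size=n)              # pure noise: there is nothing real to fit

res = sm.OLS(y, X).fit()
print(res.rsquared)                 # ~1.0 (perfect in-sample fit)
print(res.ssr)                      # ~0 training SSE despite y being noise
```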
What are methods one could use to combat high dimensional problems?
Use less flexible approaches such as forward stepwise selection, ridge regression, the lasso, and principal components regression (PCR), which are useful for performing regression in high-dimensional settings.
Explain the curse of dimensionality.
Adding more features (larger p) can in principle improve a model, but redundant and “noisy” predictors carry the risk of overfitting. One must be wary of whether the variance incurred by estimating additional coefficients outweighs the reduction in bias that they bring.
In what situation would a ridge regression outperform a lasso in terms of predictive accuracy?
Lasso regression, which performs variable selection, implicitly assumes that some of the coefficients are zero; ridge regression does not. If the true relationship between the response and the predictors involves only a small number of them, the lasso will outperform ridge regression in terms of prediction accuracy. If the true relationship involves all of the predictors, ridge regression will outperform the lasso.
How is the tuning parameter lambda in ridge and lasso regressions inversely related to flexibility?
Lambda, the tuning parameter, controls the relative impact of the shrinkage penalty versus the fit to the data (the SSE term). As lambda increases, the penalty dominates and the estimated association of each variable with the response is shrunk toward zero (the coefficients approach 0 as lambda goes to infinity). Larger lambda therefore means less flexibility: variance is reduced at the cost of some bias, and the training SSE rises.
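For reference, the two penalized objectives (note that the intercept β0 is not penalized; see the next card):

```latex
\text{Ridge:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2
\qquad
\text{Lasso:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
```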
T/F: The shrinkage penalty is applied to non-intercept coefficients only.
True.
In what circumstance would a ridge regression be most useful?
A ridge regression is best used where the least squares estimates have high variance, such as in high-dimensional problems where p is close to n or exceeds it. In this case, ridge regression performs well by trading off a small increase in bias for a large decrease in variance.
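A minimal sketch with scikit-learn and synthetic data where p > n; RidgeCV picks the penalty (scikit-learn calls lambda `alpha`) by cross-validation:

```python
# Sketch: ridge regression remains usable when p > n, where plain least squares breaks down.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
n, p = 50, 200                          # many more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                          # only a few predictors truly matter
y = X @ beta + rng.normal(size=n)

# Cross-validation chooses the penalty, trading a little bias for a large drop in variance.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print(ridge.alpha_)
```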
Why would the lasso regression be used over the ridge regression?
A ridge regression will always generate a model involving all p predictors, which becomes a problem when there are many predictors but only a few are important. With the lasso, the ℓ1 penalty forces some of the coefficient estimates to be exactly zero when lambda is sufficiently large, so variable selection occurs. The lasso therefore yields sparser, more interpretable models than ridge.
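A minimal sketch with scikit-learn and synthetic data showing coefficients driven exactly to zero (the alpha value here is arbitrary, chosen for illustration):

```python
# Sketch: the lasso's L1 penalty sets many coefficients exactly to zero (variable selection).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only 2 predictors matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ == 0))        # most of the noise coefficients are exactly zero
print(lasso.coef_[:5])
```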
What is the bias of OLS estimates?
Zero; under the standard linear model assumptions (the model is correctly specified and the errors have mean zero given X), the OLS coefficient estimates are unbiased.
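A one-line sketch of why, under the assumption that y = Xβ + ε with E[ε | X] = 0:

```latex
\hat{\beta} = (X^\top X)^{-1} X^\top y
            = (X^\top X)^{-1} X^\top (X\beta + \varepsilon)
            = \beta + (X^\top X)^{-1} X^\top \varepsilon
\;\Rightarrow\;
\mathbb{E}\big[\hat{\beta} \mid X\big] = \beta
```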