Linear Models Flashcards

1
Q

What do the AIC and BIC hope to accomplish?

A

Provide an indirect estimate of the test error by adjusting the training error (e.g., the SSE in MLR) with a penalty for the number of predictors.

2
Q

How does a partial F-Test differ from a regular F-Test?

A

A regular F-Test tests whether the model is statistically significant (think: is at least one of the coefficients important). A partial F-Test tests whether the full model is significantly better than some nested model.

3
Q

Why is R^2 a poor measure for model comparison in MLR?

A

R^2 always increases when more predictors are added to the model, even if they are pure noise. A better measure is adjusted R^2.
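As a quick numeric sketch (the formula is standard; the R^2 values and counts are hypothetical): adding noise predictors can raise R^2 while adjusted R^2 falls.

```python
def adjusted_r2(r2, n, p):
    # Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1);
    # the (n - p - 1) denominator penalizes each added predictor.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical fits on n = 50 observations: adding 5 noise predictors
# nudges R^2 from 0.70 to 0.71, yet adjusted R^2 drops.
print(round(adjusted_r2(0.70, 50, 5), 3))   # → 0.666
print(round(adjusted_r2(0.71, 50, 10), 3))  # → 0.636
```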

4
Q

What is the difference between a T-Test and an F-Test?

A

A t-test tests one coefficient at a time, whereas an F-test tests multiple coefficients simultaneously.

5
Q

Why is the prediction interval always wider than the confidence interval?

A

The prediction interval accounts for both (1) the error in the estimate of f(x) (for example, the inaccuracy of the coefficient estimates and the model assumptions, i.e. reducible error) AND (2) the random, irreducible error.

6
Q

What is the difference between the confidence interval and the prediction interval?

A

The confidence interval reflects the error in estimating the expected value. With an infinite amount of data, the error goes to zero. The prediction interval reflects the uncertainty in the predicted observation. That uncertainty is independent of the sample size and thus the interval cannot go to zero. The additional uncertainty in making predictions about a future value as compared to estimating its expected value leads to a wider interval.
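A sketch of the two interval half-widths in SLR (the inputs below are hypothetical summary statistics, not from a real fit): the prediction interval's extra "+1" under the square root is the irreducible-error term, which does not shrink with sample size.

```python
import math

def interval_halfwidths(s, n, x0, xbar, sxx, t_crit):
    # s: residual standard error; sxx: sum((x_i - xbar)^2); x0: new point
    se_mean = s * math.sqrt(1/n + (x0 - xbar)**2 / sxx)      # confidence interval
    se_pred = s * math.sqrt(1 + 1/n + (x0 - xbar)**2 / sxx)  # prediction interval
    return t_crit * se_mean, t_crit * se_pred

ci, pi = interval_halfwidths(s=2.0, n=30, x0=5.0, xbar=4.0, sxx=50.0, t_crit=2.048)
print(pi > ci)  # → True: the prediction interval is wider whenever s > 0
```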

7
Q

For a simple linear relationship, will the confidence interval equal the prediction interval if the error term is 0?

A

Yes. Since the prediction interval includes the irreducible error, if that error is 0, the two intervals coincide.

8
Q

What are problems that could occur when p>n, or p approximates n, when we try to use least squares regression to fit the model?

A

This is an example of high dimensionality. (1) Least squares would overfit the data, becoming TOO flexible. (2) R^2 becomes unreliable as p increases; adjusted R^2 could easily equal 1. (3) Cp, AIC, and BIC are inappropriate because the estimate of the error variance would be 0. (4) Extreme multicollinearity.

9
Q

What are methods one could use to combat high dimensional problems?

A

Use less flexible approaches such as forward stepwise, ridge, lasso, and PCR that are useful for performing regression in high-dimensional settings.

10
Q

Explain the curse of dimensionality.

A

You can add more features p to a model to try to improve it, but adding redundant, "noisy" predictors risks overfitting. One must be wary of whether the variance incurred by adding coefficients outweighs the reduction in bias that they bring.

11
Q

In what situation would a ridge regression outperform a lasso in terms of predictive accuracy?

A

Lasso regression, which performs variable selection, implicitly assumes that some of the coefficients are zeros. Ridge regression, on the other hand, does not. If the true relationship between the response involves a small number of predictors, then lasso regression will outperform ridge regression in terms of prediction accuracy. If the true relationship involves all of the predictors, then ridge regression will outperform lasso regression.

12
Q

How is the tuning parameter lambda in ridge and lasso regressions inversely related to flexibility?

A

Lambda, the tuning parameter, controls the relative impact of the shrinkage penalty on the SSE. As lambda increases, the coefficient estimates are "shrunk" toward zero (reaching 0 as lambda goes to infinity), reducing flexibility and thus variance, at the cost of a higher training SSE.
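A minimal sketch of the shrinkage, using the closed-form one-predictor ridge estimate on centered data (the Sxy and Sxx values are hypothetical): the slope shrinks toward 0 as lambda grows.

```python
def ridge_slope(sxy, sxx, lam):
    # One-predictor ridge with centered data: slope = Sxy / (Sxx + lambda).
    # lam = 0 recovers the OLS slope; lam → infinity drives the slope to 0.
    return sxy / (sxx + lam)

for lam in [0, 1, 10, 100]:
    print(lam, round(ridge_slope(sxy=40.0, sxx=20.0, lam=lam), 3))
# → 0 2.0, 1 1.905, 10 1.333, 100 0.333
```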

13
Q

T/F: The shrinkage penalty is applied to non-intercept coefficients only.

A

True.

14
Q

In what circumstance would a ridge regression be most useful?

A

Ridge regression is best used where the least squares estimates have high variance, such as in high-dimensional problems where p is close to n or p > n. In this case, ridge regression performs well by trading a small increase in bias for a large decrease in variance.

15
Q

Why would the lasso regression be used over the ridge regression?

A

Ridge regression always generates a model involving all predictors, which becomes a problem when there are many predictors but only a few are important. The lasso performs variable selection: its "ell-1" penalty forces some coefficient estimates to be exactly zero when lambda is sufficiently large. The lasso therefore produces sparser, more interpretable models than ridge.
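A sketch of why exact zeros appear, under the simplifying assumption of an orthonormal design (where the lasso solution is just the soft-thresholded OLS coefficient):

```python
def soft_threshold(b, lam):
    # Lasso's ell-1 penalty shrinks an OLS coefficient b toward 0 and
    # sets it exactly to 0 whenever |b| <= lambda (variable selection).
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

print([soft_threshold(b, 1.5) for b in [3.0, 0.8, -0.4, -2.5]])
# → [1.5, 0.0, 0.0, -1.0]
```

Ridge, by contrast, multiplies coefficients by a factor strictly between 0 and 1, so none of them ever reach exactly zero.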

16
Q

What is the bias of OLS estimates?

A

Zero. OLS is unbiased.

17
Q

What is a good direct estimate for the test error?

A

Cross-Validation

18
Q

T/F: When comparing two subsets with the same n, p, and MSE (of the full model with all g predictors), the AIC and the BIC will select the same subset as the better model.

A

T. When n, p, and MSE(g) are the same between two subsets, the problem reduces to comparing the SSE of the two models; both the AIC and the BIC will choose the one with the lower SSE.
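A sketch using Cp-style forms of AIC and BIC (one common parameterization; additive constants vary by text): with n, p, and the variance estimate held fixed, both criteria rank two subsets purely by SSE.

```python
import math

def aic(sse, n, p, sigma2):
    # Cp-style AIC; sigma2 plays the role of MSE(g) (hypothetical value here).
    return (sse + 2 * p * sigma2) / (n * sigma2)

def bic(sse, n, p, sigma2):
    # BIC swaps the 2 for log(n), a heavier penalty for larger n.
    return (sse + math.log(n) * p * sigma2) / (n * sigma2)

# Two subsets sharing n = 40, p = 3, sigma2 = 4.0 but differing in SSE:
n, p, s2 = 40, 3, 4.0
print(aic(100.0, n, p, s2) < aic(120.0, n, p, s2))  # → True
print(bic(100.0, n, p, s2) < bic(120.0, n, p, s2))  # → True
```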

19
Q

What is the similarity and difference between a likelihood ratio test in a GLM setting vs. a partial F-test?

A

A likelihood ratio test and a partial F-test both require that the models in comparison be nested. The difference is that a partial F-test requires the response to be normally distributed.

20
Q

What is deviance in a GLM measuring?

A

The performance (goodness of fit) of the model. In MLR, the deviance is the SSE.

21
Q

What is the purpose of Goodness-of-Fit tests?

A

It is to determine whether a simpler model than a GLM is sufficient. The null hypothesis is that a simple model is sufficient. Rejecting the null suggests that a GLM may be necessary.

22
Q

What is the SSR of a null model?

A

SSR = 0, because there are no predictors to explain the variability.

23
Q

Using the same dataset, can the SST of one linear model be the same as another?

A

Yes, it should be the same. The SST is calculated using observed values against the average of the dataset, so adding/taking away predictors should NOT impact SST.

24
Q

What is the purpose of a weighted least squares approach?

A

It allows the MLR framework to be used when the homoscedasticity assumption is violated: observations are weighted (typically by the inverse of their error variances) so that the remaining assumptions can be maintained.
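A sketch of the weighted slope for one predictor (the weights would typically be 1/Var(error_i); the values below are hypothetical): with equal weights it reduces to the ordinary OLS slope.

```python
def wls_slope(xs, ys, ws):
    # Weighted SLR slope: weighted means, then weighted cross-products.
    sw = sum(ws)
    xw = sum(w * x for w, x in zip(ws, xs)) / sw
    yw = sum(w * y for w, y in zip(ws, ys)) / sw
    num = sum(w * (x - xw) * (y - yw) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - xw) ** 2 for w, x in zip(ws, xs))
    return num / den

xs, ys = [1, 2, 3, 4], [2.0, 3.0, 5.0, 6.0]
print(wls_slope(xs, ys, [1, 1, 1, 1]))  # equal weights → OLS slope, 1.4
print(wls_slope(xs, ys, [4, 2, 2, 1]))  # down-weights noisier observations
```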

25
Q

T/F: KNN regression performs well in high-dimensions.

A

False. It performs poorly in high dimensions.

26
Q

Does centering and/or scaling the predictors intrinsically change the fitted equation?

A

No; Centering and/or scaling the predictors does not intrinsically change the fitted equation.

27
Q

When centered predictors are used, what happens to the fitted equation?

A

When centered predictors are used, the intercept estimate will equal the sample mean of the response, while the rest of the estimated regression coefficients remain unchanged. However, if you expand the newly centered equation, you will find that the equation is the same as the original, un-centered equation.
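A sketch with a toy SLR fit (hypothetical data): centering x leaves the slope unchanged and makes the intercept equal the sample mean of y.

```python
def slr_fit(xs, ys):
    # Ordinary least squares for one predictor.
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

xs, ys = [1, 2, 3, 4], [2.0, 2.5, 3.9, 4.4]
b0, b1 = slr_fit(xs, ys)
xbar = sum(xs) / len(xs)
c0, c1 = slr_fit([x - xbar for x in xs], ys)
print(c1 == b1)                     # → True: slope unchanged
print(c0 == sum(ys) / len(ys))      # → True: intercept = sample mean of y
```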

28
Q

When scaled predictors are used, what happens to the fitted equation?

A

When scaled predictors are used, the intercept estimate does not change. The other coefficient estimates, however, change by a factor equal to the sample standard deviation of the associated explanatory variable. If you expand the newly scaled equation, you will find that it is the same as the original, unscaled equation.

29
Q

T/F: For partial least squares with g directions, the regression's variance decreases as g decreases.

A

True. The number of directions used in partial least squares corresponds to flexibility. Decreasing g leads to lower flexibility, which leads to lower variance.

30
Q

What happens when g=p in partial least squares?

A

No dimension reduction occurs, and the PLS regression is no different from the original OLS regression.

31
Q

T/F: Partial least squares improves over ordinary least squares in the sense that it is not as biased.

A

False. Partial least squares improves over ordinary least squares by reducing dimensions and thus reducing variance. By the bias-variance trade-off, this means partial least squares is more biased than ordinary least squares.

32
Q

T/F: Partial least squares improves over weighted least squares in the sense that it is not as biased.

A

False. Weighted least squares is not a method concerned with flexibility; from that aspect, there is no way to compare it to partial least squares.

33
Q

T/F: The partial least squares directions are a supervised, low-dimensional representation of the original features.

A

True. The directions reduce the features to a lower dimension in a way that makes use of the response variable.

34
Q

T/F: Partial least squares performs variable selection on the original features.

A

False. All features are used to compute the partial least squares directions. No part of the partial least squares procedure forces any of the features to be excluded from the model.

35
Q

How is partial least squares different from principal components analysis?

A

PLS directions summarize the original predictors using information from the response, y (see the loading formula). While PCA is also a dimension reduction method, it does not rely on the response, y, when calculating new predictors. PLS is a supervised learning method, whereas PCA is not (though PCA can be incorporated into a supervised workflow via principal components regression (PCR)).
Also: The difference between Partial Least Squares and Principal Components Regression is that Principal Components Regression focuses on variance while reducing dimensionality. Partial Least Squares on the other hand focuses on covariance while reducing dimensionality.

36
Q

At what point in the PCA determination process is a "PC" not considered a distinct PC?

A

When it has zero variance. A distinct PC is one where the variance is non-zero.

37
Q

T/F: Loadings are unique in PCA.

A

False. Loadings are unique only up to a sign flip (the flip must be consistent: the loading of every variable in the component flips together). Remember that the loadings of a principal component are calculated to maximize the variance captured. The sum of the SQUARED loadings is constrained to 1 (normalization), and because they are squared, negating all of the loadings yields the same principal component.
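A toy sketch (hypothetical data and a hypothetical unit-norm loading vector): negating every loading flips the scores' signs but leaves the component's variance, and hence the PC itself, unchanged.

```python
def variance(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

rows = [(1.0, 2.0), (2.0, 0.5), (3.0, 1.0), (4.0, 2.5)]
v = (0.8, 0.6)  # squared loadings sum to 1: 0.64 + 0.36
scores  = [ v[0] * a + v[1] * b for a, b in rows]
flipped = [-v[0] * a - v[1] * b for a, b in rows]
print(variance(scores) == variance(flipped))  # → True: same PC, opposite sign
```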

38
Q

Are flexibility and predictive accuracy related?

A

No. Flexibility is not directly related to prediction accuracy. Remember that we measure prediction accuracy using test error, not training error.

39
Q

Of the three model selection methods - best subset, forward, and backward - which method produces the smallest training error?

A

The model identified by best subset selection has the smallest training residual sum of squares. Since forward stepwise selection and backward stepwise selection are greedy, there is no guarantee that the k-predictor model identified by either stepwise method has the smallest training residual sum of squares.

40
Q

T/F: R2 is the squared sample correlation of y and y hat. This is true for both SLR and MLR.

A

This statement is true. It should not be confused with the correlation between x and y: in SLR, R^2 also equals the squared correlation of x and y, but that does not hold in MLR.

41
Q

Why is it wrong to strictly rely on t-stats and p-values to test the significance of a variable in a fitted model? How can we combat this issue?

A

If we consider a model with n = 100 observations and a lot of predictors, we expect a small number of variables to have small p-values purely by chance, even in the absence of any true association between the predictors and the response. To combat this, we use the F-statistic, because it adjusts for the number of predictors. A caveat: the F-statistic only works well when p is relatively small compared to n; it breaks down under high dimensionality.

42
Q

Of the three model selection methods - best subset, forward, and backward - which method falls

A