2. Linear Models Flashcards
Difference between prediction and confidence interval in MLR.
Confidence: range for the mean response
Prediction: range for a single new response value
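A minimal statsmodels sketch of the two intervals for a new observation (all data and names below are made up for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 2 + 3 * x + rng.normal(0, 1, 50)

    res = sm.OLS(y, sm.add_constant(x)).fit()
    X_new = np.array([[1.0, 5.0]])  # intercept column plus x = 5
    # mean_ci_* columns: confidence interval for the mean response
    # obs_ci_* columns: prediction interval for a single new response
    print(res.get_prediction(X_new).summary_frame(alpha=0.05))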
What is the hierarchical principle with interaction terms in MLR?
A significant interaction term implies that its individual terms should also be in the model, regardless of the t tests associated with the individual terms
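In the statsmodels formula API, the "x1 * x2" shorthand enforces this automatically (data here is made up):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
    df["y"] = 1 + 2 * df.x1 - df.x2 + 0.5 * df.x1 * df.x2 + rng.normal(size=100)

    # "x1 * x2" expands to x1 + x2 + x1:x2, so the interaction always
    # enters together with both main effects (hierarchical principle)
    fit = smf.ols("y ~ x1 * x2", data=df).fit()
    print(fit.params)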
Model diagnostics: what is a misspecified model equation? Give an example.
Incorrectly assuming that the true form of f follows your model.
Example: fitting a straight line when there is evidence of a higher-order polynomial relationship.
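A sketch of the fix using a patsy formula (made-up data, truly quadratic, so the straight line is misspecified):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    df = pd.DataFrame({"x": np.linspace(-3, 3, 80)})
    df["y"] = 1 + df.x**2 + rng.normal(0, 0.5, 80)

    line = smf.ols("y ~ x", data=df).fit()            # misspecified
    quad = smf.ols("y ~ x + I(x**2)", data=df).fit()  # matches the true form
    print(line.rsquared, quad.rsquared)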
Model diagnostics: residuals with non-zero averages. What does this mean?
Residuals are realizations of the true error terms, which are assumed to come from a normal distribution with mean zero; a non-zero average means some aspect of the linear regression model is incorrect.
Model diagnostics: heteroscedasticity. This leads to an unreliable _____
The variance of the error term is not constant; there is evidence of more than one variance parameter.
This leads to an unreliable MSE, so all outputs that rely on the MSE are also unreliable.
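One way to check for it is the Breusch-Pagan test in statsmodels (a sketch on made-up data where the spread grows with x):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(3)
    x = rng.uniform(1, 10, 200)
    y = 2 + 3 * x + rng.normal(0, x)   # error spread grows with x
    X = sm.add_constant(x)

    res = sm.OLS(y, X).fit()
    # Small p-value -> evidence against constant error variance
    lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
    print(lm_pval)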
Model diagnostics: dependent errors, what does this mean in terms of the Y’s? The ___ will also be underestimated, leading to ____ CI and PI intervals.
This means that the Y’s have non-zero covariances.
The standard errors will be underestimated, making the intervals narrower and the p-values smaller than they should be.
Model diagnostics: why is it bad if error terms are non-normal?
Then we are unable to make inferences based on the F and t distributions.
Model diagnostics: multicollinearity. What is this, and what does it lead to?
When one predictor is strongly correlated with another predictor (or with a combination of the others). This makes the estimates of the regression coefficients unstable.
Does multicollinearity affect the predictive power of y-hat, MSE or F test results?
No
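Variance inflation factors are the usual way to quantify it (a sketch; made-up data where x2 is nearly a copy of x1):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(4)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(0, 0.1, 100)   # nearly a copy of x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    # A large VIF (a common rule of thumb is > 10) flags a predictor
    for i in range(1, X.shape[1]):      # skip the intercept column
        print(variance_inflation_factor(X, i))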
Model diagnostics: unusual points. What are these and what do they do to the model?
Outliers: extreme residuals
High leverage point: an observation with an unusual set of predictor values. The bj's are sensitive to these points, so a single such point can greatly affect the shape of the fitted model.
Model diagnostics: high dimensions, what does this mean?
There are too many predictors relative to the number of observations, so the model is too flexible and overfits the data.
Which of the following can challenge the interpretation of a regression coefficient?
Misspecified model equation, multicollinearity, high leverage points
Misspecified model equation does make interpreting the bj’s problematic.
Multicollinearity masks which predictors are actually meaningful to the model.
High leverage points have a strong influence over the bj’s.
Answer: all
What is the formula for leverage? In SLR?
Formula sheet
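For reference, the standard textbook forms (assuming the usual notation; verify against the course formula sheet):

    h_{ii} = \mathbf{x}_i^\top (X^\top X)^{-1} \mathbf{x}_i
    \text{SLR: } h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{k=1}^{n} (x_k - \bar{x})^2}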
High leverage point is given by what inequality?
h > 3((p+1)/n)
What is a studentized residual? They can be a realization of what distribution?
A unitless version of a residual: the raw residual divided by an appropriate standard error.
They can be realizations of a t distribution with df = n - p - 1.
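In the standard textbook notation (MSE is the residual mean square and h_ii the leverage; verify against the course formula sheet):

    r_i = \frac{e_i}{\sqrt{\mathrm{MSE}\,(1 - h_{ii})}}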
What is the formula for Cook's Distance? What does it measure, and what distribution is it a realization of? An observation has typical influence if D = ?
Formula sheet
Measures influence, realization of the F distribution with ndf = p+1 and ddf = n-p-1
Typical influence if D = 1/n
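The standard textbook form (assuming the usual notation; verify against the course formula sheet):

    D_i = \frac{e_i^2}{(p+1)\,\mathrm{MSE}} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}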
In the plot of e vs y-hat; what makes residuals well behaved?
- Points are randomly scattered and lacking trends. If the residuals seem to be acting as a function of y-hat, the model is likely missing a predictor that can explain the trend (ex. U-shaped: add a positive quadratic term).
- Non-zero average of residuals: check that points are equally spread above and below the 0 line.
- Heteroscedasticity: check for inconsistent spread in the residuals (cone-like shapes).
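A minimal sketch of the e vs. y-hat plot (made-up, well-behaved data):

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 10, 100)
    y = 2 + 3 * x + rng.normal(0, 1, 100)
    res = sm.OLS(y, sm.add_constant(x)).fit()

    # Well behaved: random scatter around the 0 line with even spread
    plt.scatter(res.fittedvalues, res.resid)
    plt.axhline(0, color="red")
    plt.xlabel("fitted values (y-hat)")
    plt.ylabel("residuals (e)")
    plt.show()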
How may we solve the issue of heteroscedasticity?
Cone shape opening toward infinity: transform the response using log or square root (any concave function).
Cone shape opening toward 0: use weighted least squares.
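Sketches of both fixes (the WLS weights below are an assumption for illustration; in practice they come from whatever variance structure you believe holds):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    x = rng.uniform(1, 10, 200)
    y = np.exp(1 + 0.3 * x + rng.normal(0, 0.2, 200))
    X = sm.add_constant(x)

    # Spread grows with the response: concave transform of y
    fit_log = sm.OLS(np.log(y), X).fit()

    # Assumed Var(error_i) proportional to 1/x_i, so weight_i = x_i
    # (statsmodels weights are proportional to 1/variance)
    fit_wls = sm.WLS(y, X, weights=x).fit()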
How may we solve the issue of dependent errors?
Use a time series model.
How may we solve the issue of non-normal errors?
This typically occurs when the response is discrete in nature; use a model suited to the response type, such as a generalized linear model (e.g., logistic or Poisson regression).
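For a binary response, a logistic GLM sketch in statsmodels (made-up data):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    x = rng.normal(size=200)
    p = 1 / (1 + np.exp(-(0.5 + 2 * x)))
    y = rng.binomial(1, p)              # discrete 0/1 response

    # Logistic regression replaces the normal-errors assumption
    fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
    print(fit.params)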
How may we solve the issue of multicollinearity?
- Exclude all but one of those predictors from the model
- Combine the predictors
- Do nothing and report its presence
- Use orthogonal predictors, then we know that they are uncorrelated
What is a suppressor variable? Should we add these to our model?
This is a case where multicollinearity is acceptable.
This is a predictor that is weakly correlated with the response, but due to being related to other predictors, it enhances their usefulness. This means that adding a suppressor variable leads to a better model, even if it produces multicollinearity.
What happens when the residuals exhibit a predictable pattern from observation to observation?
Use a time series model
What is the e vs i plot used to detect?
Dependence of error terms
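The Durbin-Watson statistic is a quick numeric companion to this plot (a sketch on made-up data with autocorrelated errors; values near 2 suggest uncorrelated errors):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(8)
    x = np.arange(100, dtype=float)
    y = 2 + 0.5 * x + np.cumsum(rng.normal(0, 1, 100))  # dependent errors
    res = sm.OLS(y, sm.add_constant(x)).fit()

    # Near 2: no first-order autocorrelation; near 0 or 4: dependence
    print(durbin_watson(res.resid))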
Is forward selection a greedy approach?
Yes, because at each step it only adds the single next-best predictor to the current model, rather than searching for the best subset of predictors overall.
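A greedy forward-selection sketch (made-up data; using AIC as the criterion is an assumption, any score works):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(9)
    df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
    y = 1 + 2 * df.x1 - df.x3 + rng.normal(0, 1, 100)

    selected, remaining, best_aic = [], list(df.columns), np.inf
    while remaining:
        # Greedy step: try each remaining predictor, keep the single best
        scores = {c: sm.OLS(y, sm.add_constant(df[selected + [c]])).fit().aic
                  for c in remaining}
        cand = min(scores, key=scores.get)
        if scores[cand] >= best_aic:
            break                       # no predictor improves the model
        best_aic = scores[cand]
        selected.append(cand)
        remaining.remove(cand)
    print(selected)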