Chapter 5 - Cross Validation Flashcards

1
Q

Validation Set Approach

A

Randomly split the data into two parts, a training set and a validation (hold-out) set. Fit the supervised learning method on the training part and estimate its test error on the validation part. The split can be repeated many times, averaging the error estimates over the different splits.
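
A minimal sketch of the idea in Python with scikit-learn (the simulated dataset, the 50/50 split, and the choice of a linear model are illustrative assumptions, not part of the card):

```python
# Validation-set approach: repeat a random train/validation split and average the errors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

errors = []
for seed in range(10):  # repeat the random split several times
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    errors.append(mean_squared_error(y_val, model.predict(X_val)))

print(f"validation-set MSE estimate: {np.mean(errors):.2f}")
```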

2
Q

LOOCV

A

Leave-one-out cross-validation: maximally uneven splits. Hold out a single observation, fit the model on the remaining n - 1 observations, and compute the test error on the held-out point; repeat for every observation and average the n errors.

PROBLEM: this is expensive, because the model must be refit n times.

SOLUTION: for least-squares linear (and polynomial) regression there is a shortcut. Fit the model only once on all the data and compute CV(n) = (1/n) * sum_i ((y_i - yhat_i) / (1 - h_ii))^2, where h_ii is the leverage statistic (the i-th diagonal element of the hat matrix).
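
A sketch comparing brute-force LOOCV with the leverage shortcut for least squares; the simulated data are an assumption used only for illustration:

```python
# LOOCV for linear regression: brute force vs. the one-fit leverage shortcut.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]   # fit once on all the data
resid = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
h = np.diag(H)                                # leverage statistics h_ii

# Shortcut: CV(n) = (1/n) * sum_i ((y_i - yhat_i) / (1 - h_ii))^2
cv_shortcut = np.mean((resid / (1.0 - h)) ** 2)

# Brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b_i) ** 2)
cv_brute = np.mean(errs)

print(cv_shortcut, cv_brute)  # the two estimates agree
```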

3
Q

k-fold CV

A

Randomly split the data into K roughly equal-sized folds. For i = 1, ..., K, fit the model on the other K - 1 folds, compute the test error on the i-th fold, and average the K error estimates.
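
A minimal k-fold CV sketch with scikit-learn (dataset and model are again illustrative assumptions):

```python
# 10-fold cross-validation of a linear model's mean squared error.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring="neg_mean_squared_error")
print(f"10-fold CV MSE estimate: {-scores.mean():.2f}")
```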

4
Q

compare LOOCV with k-fold CV

A

The k-fold CV estimate depends on the particular split, and each fit uses less data than is available, which biases the estimate upward (the estimated test error is higher than the true test error). LOOCV has almost no such bias, but its n training sets are nearly identical, so the n error estimates are highly correlated and the variance of the LOOCV estimate is higher. k = 5 or 10 is a common compromise between bias and variance.

5
Q

Choosing an optimal model

A

Even if the CV error estimates are off in absolute terms, choosing the model with the minimum CV error often also identifies the model with (close to) the minimum true test error. In classification problems the picture looks much the same, with CV applied to the misclassification error rate.

6
Q

The one-standard error rule

A

Setting: forward stepwise selection, where we must decide how many variables to include in the model (this is the CV-error-curve-with-error-bars diagram). Choose the simplest model whose CV error is no more than one standard error above that of the model with the lowest CV error; e.g., if the minimum occurs at 10 variables, pick a model with 9 or fewer variables provided its CV error stays within one standard error of that minimum.
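
A sketch of the rule under assumed CV results (the cv_errors and cv_se arrays are made-up numbers standing in for the output of forward stepwise selection with CV):

```python
# One-standard-error rule: pick the smallest model within one SE of the minimum CV error.
import numpy as np

# CV error and its standard error for models with 1..10 predictors (illustrative values)
cv_errors = np.array([9.0, 7.5, 6.4, 5.9, 5.6, 5.5, 5.5, 5.6, 5.7, 5.8])
cv_se     = np.array([0.4, 0.4, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3])

best = np.argmin(cv_errors)                            # model with the lowest CV error
threshold = cv_errors[best] + cv_se[best]              # one SE above the minimum
chosen = np.min(np.where(cv_errors <= threshold)[0])   # simplest model under the threshold

print(f"minimum-CV model size: {best + 1}, one-SE-rule model size: {chosen + 1}")
```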

7
Q

The wrong way to do CV

A

WRONG: first select the 20 most important predictors using a z-test on the full dataset, then run 10-fold CV with logistic regression on those predictors. The computed CV error comes out around 3% when the true error should be ~50%, because every fold only ever sees predictors that were chosen using all of the data, including the observations held out in that fold.
RIGHT: do the variable selection separately within each fold, after the folds have been assigned, and refit the model each time (see the sketch below).

Every aspect of the fitting procedure that involves the data must be cross-validated.
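
A sketch of the "right way" using a scikit-learn Pipeline, so the screening step runs inside every fold; SelectKBest is an assumed stand-in for the card's z-test filter, and the random data mimic the no-signal setting:

```python
# Feature selection inside the CV loop: labels are pure noise, so CV error should be ~50%.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))        # many noise predictors, few samples
y = rng.integers(0, 2, size=50)        # labels unrelated to X: true error rate is ~50%

pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=10)   # selection is refit inside each fold
print(f"10-fold CV error with selection inside the folds: {1 - scores.mean():.2f}")
```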

8
Q

The learning curve and choosing k

A

Learning curve: the test performance of a learning method plotted against the size of the training set; its shape depends on both the type of data and the method.

In K-fold CV, increasing K decreases the bias of the CV error estimate but increases its variance. If the learning curve has flattened out, the bias is small: with 5-fold CV on a dataset of 200 observations, the test error of a model trained on 160 observations is already close to that of a model trained on all 200. LOOCV (K = n) has the least bias.
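
A sketch of a learning curve computed with scikit-learn (the simulated dataset is an assumption; the point is that once the curve has flattened, training on 160 rather than 200 observations barely changes the test error):

```python
# Learning curve: CV-estimated test error as a function of training-set size.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

sizes, _, test_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="neg_mean_squared_error")

for n_train, mse in zip(sizes, -test_scores.mean(axis=1)):
    print(f"train size {n_train:3d}: CV MSE {mse:.1f}")
```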

9
Q

CV vs bootstrap

A

CV gives an estimate of the test error of a learning method.

The bootstrap gives the standard error (variability) of an estimator.

10
Q

Standard Errors in Linear Regression (classical assumptions)

A

Assume x_1, ..., x_n are drawn from a normal distribution, and plug in the sample estimates: the true variance is approximated by sigma_hat^2 and the true mean by x_bar. The sampling distribution of the estimator then has a known form, and the standard deviation of that sampling distribution is the standard error; for the sample mean, SE(x_bar) = sigma_hat / sqrt(n).

11
Q

Limits of the classical approach

A

If x_1, ..., x_n are not normally distributed, or if the estimator does not have a simple closed-form sampling distribution, the classical formulas no longer apply.

SOLUTION: the bootstrap!

12
Q

Bootstrap Standard error

A

The bootstrap standard error is the standard deviation of the estimator computed across the bootstrap samples (i.e., the standard deviation of the bootstrap replicates).
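
A minimal non-parametric bootstrap sketch; the sample and the choice of the median as the estimator are illustrative assumptions:

```python
# Bootstrap SE: resample with replacement, recompute the estimator, take the SD of the replicates.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # assumed sample for illustration

B = 2000
boot_stats = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)  # sample with replacement
    boot_stats[b] = np.median(resample)

print(f"bootstrap SE of the median: {boot_stats.std(ddof=1):.3f}")
```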

13
Q

Why do we sample with replacement?

A

Sampling with replacement gives the non-parametric bootstrap (the version used with supervised learning methods). We sample with replacement so that each bootstrap sample consists of n independent draws from the original data (the empirical distribution), mimicking repeatedly drawing fresh datasets of the same size from the population; the models fit to these samples can then be treated as approximately independent replicates.
