Chapter 5 - Cross Validation Flashcards
Validation Set Approach
randomly split the data into two parts (training and validation), fit the supervised learning method on the training part, and estimate the test error on the held-out part. Repeat the split several times and average the error over the different splits (the estimate varies a lot from split to split).
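A minimal sketch of the validation set approach, assuming scikit-learn; the simulated X, y and the linear model are just placeholders:

```python
# Validation set approach: one random train/validation split, then repeat it
# several times to see how variable the error estimate is.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # toy predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

mses = []
for seed in range(10):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    mses.append(mean_squared_error(y_val, model.predict(X_val)))

print(mses)                  # one estimate per split: notice the spread
print(np.mean(mses))         # average over the splits
```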
LOOCV
leave-one-out cross-validation. Extremely uneven splits: hold out a single observation, fit on the remaining n − 1, and compute the test error on that one point. Repeat for every observation and average the n errors.
PROBLEM: this is expensive, because you have to fit the model n times.
SOLUTION: for least-squares linear (or polynomial) regression there is a shortcut. Fit the model once on all the data and use the leverage statistic: CV(n) = (1/n) Σ_i ((y_i − ŷ_i) / (1 − h_i))², where ŷ_i is the i-th fitted value and h_i is the i-th leverage (diagonal of the hat matrix).
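A sketch comparing brute-force LOOCV with the leverage shortcut for least-squares regression (toy data; numpy and scikit-learn assumed):

```python
# LOOCV for least-squares regression: brute force (n fits) vs the leverage shortcut (1 fit).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Brute force: fit the model n times, one held-out point each time.
loo_mse = -cross_val_score(LinearRegression(), X, y,
                           cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()

# Shortcut: fit once, then CV(n) = (1/n) * sum(((y - yhat) / (1 - h))**2),
# where h is the leverage (diagonal of the hat matrix).
Xd = np.column_stack([np.ones(n), X])               # design matrix with intercept
H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
h = np.diag(H)
yhat = H @ y
shortcut_mse = np.mean(((y - yhat) / (1 - h)) ** 2)

print(loo_mse, shortcut_mse)                        # the two numbers agree
```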
k-fold CV
split the data into k roughly equal folds; for each fold i, fit on the other k − 1 folds and compute the test error MSE_i on fold i; the CV estimate is the average, CV(k) = (1/k) Σ_i MSE_i.
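A minimal k-fold CV sketch (toy data; scikit-learn assumed):

```python
# 10-fold CV: hold out each fold in turn, average the per-fold errors.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mse = -cross_val_score(LinearRegression(), X, y, cv=kf,
                            scoring="neg_mean_squared_error")
print(fold_mse)              # one MSE per fold
print(fold_mse.mean())       # CV(k) = average of the k fold errors
```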
compare LOOCV with k-fold CV
the k-fold estimate depends on the particular split, and because each fit uses less data than is available, the estimate is biased upward (estimated test error is higher than the true test error). In LOOCV the n training sets are almost identical, so the fitted models are highly correlated, which increases the variance of the test-error estimate.
Choosing an optimal model
even if the error estimates are off in absolute terms, choosing the model with minimum CV error often picks the model with (close to) minimum test error. The picture is similar for classification problems, using error rate instead of MSE.
The one-standard-error rule
in forward stepwise selection, how many variables should we include in the model? This is the error-bars diagram. Choose the simplest model whose CV error is no more than one standard error above that of the model with the lowest CV error (so if the minimum occurs at 10 variables, you may pick a model with 9 or fewer as long as its CV error stays within one standard error of the minimum).
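A sketch of the one-standard-error rule, using polynomial degree as the complexity axis instead of the number of variables in forward stepwise selection (toy data; scikit-learn assumed):

```python
# One-standard-error rule: pick the simplest model whose CV error is within
# one SE of the minimum CV error.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=300)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(scale=2.0, size=300)
X = x.reshape(-1, 1)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
degrees = range(1, 11)
cv_mean, cv_se = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    scores = -cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    cv_mean.append(scores.mean())
    cv_se.append(scores.std(ddof=1) / np.sqrt(len(scores)))   # SE of the CV error

cv_mean, cv_se = np.array(cv_mean), np.array(cv_se)
best = cv_mean.argmin()
threshold = cv_mean[best] + cv_se[best]
chosen = min(d for d, m in zip(degrees, cv_mean) if m <= threshold)
print(chosen)                # simplest degree within one SE of the minimum
```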
The wrong way to do CV
wrong: first screen all of the data to pick the 20 predictors most strongly associated with the response (e.g., by z-test), then run 10-fold CV with logistic regression on just those predictors. The calculated CV error can be ~3% when the true error is ~50% (pure noise), because every fold is evaluated on predictors that were chosen using the held-out observations too.
right: do the variable selection inside the CV loop, separately on each set of k − 1 training folds, and refit the model each time.
Every aspect of the fitting process that involves the data must be cross-validated.
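A sketch of the wrong vs right way on pure-noise data (scikit-learn assumed; the ANOVA F-test screening here is a stand-in for the z-test screening described above):

```python
# Wrong vs right CV when variable selection uses the data.
# Labels are independent of the predictors, so an honest estimate is ~50% accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
n, p = 50, 5000
X = rng.normal(size=(n, p))
y = np.tile([0, 1], n // 2)                  # labels unrelated to X

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# WRONG: screen predictors on the full data, then cross-validate on the survivors.
keep = SelectKBest(f_classif, k=20).fit(X, y).get_support()
wrong = cross_val_score(LogisticRegression(max_iter=1000), X[:, keep], y, cv=cv).mean()

# RIGHT: put the screening inside the CV loop via a pipeline, refit per fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
right = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"wrong-way accuracy ~ {wrong:.2f}, right-way accuracy ~ {right:.2f}")
# wrong-way accuracy is misleadingly high; right-way is near 0.5 (chance level).
```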
The learning curve and choosing k
learning curve: the test performance of a learning method as a function of the training-set size. Its shape depends on both the data and the method.
In k-fold CV, as we increase k we decrease the bias of the CV error estimate but increase its variance. (5-fold CV already has little bias on a dataset of n = 200 if the learning curve has flattened out: the test error when training on 160 points is about the same as when training on all 200.) LOOCV has the least bias.
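A sketch of reading the bias off a learning curve with scikit-learn's learning_curve (toy data; where the curve flattens will differ for real data and methods):

```python
# If the learning curve has flattened by the largest training size used in k-fold CV
# (160 of 200 points for 5-fold), the CV estimate has little bias.
import numpy as np
from sklearn.model_selection import learning_curve, KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=200)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error")

for n_train, mse in zip(sizes, -val_scores.mean(axis=1)):
    print(n_train, round(mse, 3))    # error should change little near the largest sizes
```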
CV vs bootstrap
CV gives an estimate of the test error of a method.
The bootstrap gives the standard error (variability) of a parameter estimate.
Standard Errors in Linear Regression (classical assumptions)
assume x_1, …, x_n are drawn from a normal distribution; estimate the true variance by σ̂² and the true mean by x̄. The estimator then has a known sampling distribution, and the standard deviation of that sampling distribution is its standard error, e.g., SE(x̄) = σ̂ / √n.
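A quick numeric check of the classical formula SE(x̄) = σ̂/√n against repeated sampling (simulated data; numpy assumed):

```python
# Classical standard error of the sample mean vs the spread of xbar over many
# simulated datasets drawn from the same population.
import numpy as np

rng = np.random.default_rng(6)
n, mu, sigma = 100, 5.0, 2.0

x = rng.normal(mu, sigma, size=n)
se_classical = x.std(ddof=1) / np.sqrt(n)

means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
print(se_classical, means.std(ddof=1))   # both close to sigma / sqrt(n) = 0.2
```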
Limits of the classical approach
if x_1, …, x_n are not normal, or if the estimator doesn't have a simple closed form, the classical formulas no longer apply.
SOLUTION: Bootstrap!
Bootstrap standard error
the standard deviation of the estimator computed across the B bootstrap samples.
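A sketch of the bootstrap standard error, here for the sample median, which has no simple closed-form SE (numpy assumed; the data and B are placeholders):

```python
# Nonparametric bootstrap SE: resample the data with replacement B times,
# recompute the estimator each time, take the SD of the B estimates.
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=200)       # stand-in for any observed sample

B = 2000
boot_stats = np.empty(B)
for b in range(B):
    sample = rng.choice(x, size=len(x), replace=True)   # resample with replacement
    boot_stats[b] = np.median(sample)

se_boot = boot_stats.std(ddof=1)               # bootstrap SE of the median
print(se_boot)
```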
Why do we sample with replacement?
sampling with replacement is the non-parametric bootstrap (the version used with supervised learning methods). We sample with replacement so that each bootstrap sample of size n behaves like an independent fresh draw from the data's empirical distribution; drawing n points without replacement would just return the original sample (see the sketch below).
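A small sketch of why replacement matters: without replacement, a size-n draw is just a permutation of the original data (numpy assumed):

```python
# With replacement: duplicates appear and some points are left out.
# Without replacement at size n: every "resample" is the original data reordered.
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=100)

with_repl = rng.choice(x, size=len(x), replace=True)
without_repl = rng.choice(x, size=len(x), replace=False)

print(np.unique(with_repl).size)       # noticeably fewer than 100 distinct values
print(np.unique(without_repl).size)    # 100: just a permutation
print(np.allclose(np.sort(without_repl), np.sort(x)))   # True
```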