Resampling methods Flashcards

1
Q

Compare model assessment vs model selection.

A

Model assessment refers to evaluating a model’s performance: how well does it predict, i.e., how low is its test error rate?

Model selection refers to choosing the best model among candidates, i.e., selecting the model with the appropriate level of flexibility.

2
Q

What is the validation set approach? (Cross-validation)

A

A strategy for estimating the test error rate of a model. It involves randomly dividing the data into two subsets: one used to train the model and one held out for testing (the validation set).
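
A minimal sketch of the validation set approach, using illustrative toy data and a simple least-squares line (the data and variable names are my own, not from the source):

```python
import random

# Hypothetical toy data: y is roughly 2*x plus Gaussian noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2 * x + random.gauss(0, 0.5) for x in xs]

# Randomly split the indices into two halves: training and validation.
idx = list(range(len(xs)))
random.shuffle(idx)
train, valid = idx[:50], idx[50:]

# Fit a simple least-squares line y = a + b*x on the training half.
xt = [xs[i] for i in train]
yt = [ys[i] for i in train]
xbar = sum(xt) / len(xt)
ybar = sum(yt) / len(yt)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xt, yt)) / sum((x - xbar) ** 2 for x in xt)
a = ybar - b * xbar

# Estimate the test MSE on the held-out half.
mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in valid) / len(valid)
print(round(mse, 3))
```

With noise of standard deviation 0.5, the validation MSE should land near 0.25, the irreducible error.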

3
Q

What are the 2 drawbacks to the simple validation set approach?

A
  1. The validation estimate of the test error rate can be highly variable, depending on which observations end up in the training set vs the validation set.
  2. Only a subset of the data (the training set) is used to fit the model. Since statistical methods generally perform better as the sample size increases, fitting on a limited subset means we tend to overestimate the test error rate relative to fitting the model on the entire data set.
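Drawback 1 can be seen directly by repeating the random split several times; a small illustrative demo (toy data, my own names), assuming the same simple least-squares fit as before:

```python
import random

# Toy data: y is roughly 2*x plus noise (illustrative, not from the source).
random.seed(1)
xs = [i / 10 for i in range(60)]
ys = [2 * x + random.gauss(0, 1.0) for x in xs]

def validation_mse(seed):
    """Validation-set MSE for one particular random half/half split."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    train, valid = idx[:30], idx[30:]
    xt = [xs[i] for i in train]
    yt = [ys[i] for i in train]
    xbar = sum(xt) / 30
    ybar = sum(yt) / 30
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xt, yt)) / sum((x - xbar) ** 2 for x in xt)
    a = ybar - b * xbar
    return sum((ys[i] - (a + b * xs[i])) ** 2 for i in valid) / 30

# Different splits give noticeably different test error estimates.
estimates = [validation_mse(s) for s in range(5)]
print([round(e, 2) for e in estimates])
```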
4
Q

What is leave-one-out cross-validation (LOOCV)?

A

It is a cross-validation method that attempts to correct for the drawbacks of simply dividing one’s data set into a training set and a test set. LOOCV trains the model on every observation except one, then computes the prediction of outcome y for the single excluded observation x. It calculates the squared error (assuming a regression model) by comparing the actual outcome y to its predicted value. This process is then repeated for every single observation in the data set, and the test error rate is estimated by taking the average of all n resulting errors (assuming regression and not classification), one from each fitted model.
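
The loop described above can be sketched as follows for simple linear regression; the data are illustrative toy values, not from the source:

```python
import random

# Toy data: y is roughly 1 + 3*x plus Gaussian noise (illustrative).
random.seed(2)
xs = [i / 5 for i in range(30)]
ys = [1 + 3 * x + random.gauss(0, 1.0) for x in xs]
n = len(xs)

def fit(xt, yt):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    xbar = sum(xt) / len(xt)
    ybar = sum(yt) / len(yt)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xt, yt)) / sum((x - xbar) ** 2 for x in xt)
    return ybar - b * xbar, b

squared_errors = []
for i in range(n):
    # Train on all observations except i, then predict the held-out point.
    xt = xs[:i] + xs[i + 1:]
    yt = ys[:i] + ys[i + 1:]
    a, b = fit(xt, yt)
    squared_errors.append((ys[i] - (a + b * xs[i])) ** 2)

# The LOOCV estimate is the average of the n held-out squared errors.
cv_estimate = sum(squared_errors) / n
print(round(cv_estimate, 3))
```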

5
Q

What advantages does LOOCV have over the basic validation set approach?

A
  1. LOOCV has far less bias because nearly the entire data set is used to fit each model, whereas the validation set approach uses only about half. Consequently, LOOCV tends not to overestimate the test error rate as much as the validation set approach.
  2. LOOCV always yields the same result, whereas the validation set approach depends greatly upon which observations are randomly selected.
6
Q

What is k-fold cross validation?

How does it relate to leave one out cross validation (LOOCV)?

A

You divide your data set into k groups (folds), then fit the model k times, each time training on k - 1 folds and using the remaining fold as the validation set. Typically k is set at 5 or 10. The MSE (assuming regression) is computed on each validation fold. This results in k estimates of the MSE, which are then averaged to arrive at the estimate of the test MSE.

LOOCV is a special case of k-fold CV where k = n
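
A k-fold CV sketch with k = 5, reusing the same kind of toy regression data (illustrative values, not from the source):

```python
import random

# Toy data: y is roughly 1 + 3*x plus Gaussian noise (illustrative).
random.seed(3)
xs = [i / 5 for i in range(50)]
ys = [1 + 3 * x + random.gauss(0, 1.0) for x in xs]
k = 5

# Shuffle the indices and deal them into k roughly equal folds.
idx = list(range(len(xs)))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]

def fit(xt, yt):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    xbar = sum(xt) / len(xt)
    ybar = sum(yt) / len(yt)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xt, yt)) / sum((x - xbar) ** 2 for x in xt)
    return ybar - b * xbar, b

fold_mses = []
for j in range(k):
    held_out = set(folds[j])
    train = [i for i in idx if i not in held_out]
    a, b = fit([xs[i] for i in train], [ys[i] for i in train])
    # MSE on the held-out fold.
    fold_mses.append(sum((ys[i] - (a + b * xs[i])) ** 2 for i in folds[j]) / len(folds[j]))

# Average the k fold MSEs to get the CV estimate of the test MSE.
cv_estimate = sum(fold_mses) / k
print(round(cv_estimate, 3))
```

Setting k = len(xs) here would recover LOOCV as the special case k = n.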

7
Q

How does k-fold CV with k < n compare to LOOCV in terms of the bias-variance trade-off?

A

From the perspective of bias reduction, LOOCV is clearly better, since it trains each model with n - 1 observations, essentially the whole data set.

However, LOOCV has higher variance than k-fold CV with k < n. This is because the n models fit in LOOCV are highly correlated: each is trained on an almost identical data set. In contrast, k-fold CV averages only 5 or 10 models, which are considerably less correlated with one another. Since the mean of many highly correlated quantities has higher variance than the mean of quantities that are less correlated, the test error estimate from LOOCV tends to have higher variance than that from k-fold CV.

In short, the LOOCV estimate can vary a lot from one data set to another. So while LOOCV has low bias, it can have high variance.

For this reason, k-fold CV with k = 5 or 10 is typically used, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

8
Q

What is the “bootstrap”?

A

A statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

We emulate the process of obtaining new sample sets and use them to estimate the variability of an estimator. Rather than repeatedly obtaining independent data sets from the population (which is usually infeasible), we instead obtain distinct data sets by repeatedly sampling observations from the original data set.

Sampling is done with replacement.
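
A minimal bootstrap sketch: estimating the standard error of the sample mean by resampling with replacement. The data are illustrative (drawn from a known normal so the answer can be checked against theory); none of this comes from the source:

```python
import random
import statistics

# Illustrative data: 100 draws from N(10, 2), so the true SE of the
# sample mean is about sigma / sqrt(n) = 2 / 10 = 0.2.
random.seed(4)
data = [random.gauss(10, 2) for _ in range(100)]

B = 1000  # number of bootstrap samples
boot_means = []
for _ in range(B):
    # Resample the original data WITH replacement, same size as the original.
    resample = random.choices(data, k=len(data))
    boot_means.append(sum(resample) / len(resample))

# The spread of the bootstrap means estimates the standard error.
se_boot = statistics.stdev(boot_means)
print(round(se_boot, 3))
```

The same recipe works for estimators with no convenient closed-form standard error (e.g. a median or a ratio), which is where the bootstrap earns its keep.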

9
Q

How does bootstrapping compare to the traditional approach in statistics?

A

Most of the time, bootstrap results will closely match those of the traditional approach. However, bootstrapping does not assume any particular underlying distribution of the data.

The traditional procedure requires one to have a test statistic that satisfies particular assumptions in order to achieve valid results, and this is largely dependent on the experimental design. The traditional approach also uses theory to tell what the sampling distribution should look like, but the results fall apart if the assumptions of the theory are not met. The bootstrapping method, on the other hand, takes the original sample data and then resamples it to create many [simulated] samples. This approach does not rely on the theory since the sampling distribution can simply be observed, and one does not have to worry about any assumptions. This technique allows for accurate estimates of statistics, which is crucial when using data to make decisions.

https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307
