Cross-Validation Flashcards
cross-validation vs. train-test split (1 mark)
Cross-validation extends the train-test split approach to model scoring (or "model validation"). Compared to train_test_split, cross-validation gives you a more reliable measure of your model's quality, though it takes longer to run.
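For contrast, here is a minimal sketch of the single score a train-test split produces, assuming scikit-learn; the synthetic dataset and random-forest model are illustrative choices, not part of the original flashcard:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (illustrative only)
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# A single 80/20 split yields exactly one quality score,
# which depends on which rows happened to land in the holdout
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_valid, model.predict(X_valid)))
```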
cross-validation
In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality. For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset; in this case, we say we have broken the data into 5 "folds."
how does cross-validation work?
- We run an experiment called Experiment 1, which uses the first fold as a holdout set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set, much as we would get from a simple train-test split.
- We then run Experiment 2, holding out data from the second fold (and using everything except the second fold to train the model). This gives us a second estimate of model quality. We repeat this process, using every fold once as the holdout, so that 100% of the data is used as a holdout at some point.
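A minimal sketch of this fold-by-fold procedure using scikit-learn's cross_val_score; the synthetic dataset, model, and MAE metric are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# cv=5 divides the data into 5 folds; each fold serves as the
# holdout exactly once, so we get 5 experiments in total.
# scikit-learn reports errors as negated scores, hence the -1.
scores = -1 * cross_val_score(model, X, y, cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE per fold:", scores)        # one score per experiment
print("Average MAE:", scores.mean())  # single summary measure
```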
Trade-offs Between Cross-Validation and Train-Test Split
- Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it takes longer to run, because it fits the model once for each fold and therefore does more total work.
- On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with a single train-test split. So, if your dataset is smaller, you should run cross-validation.
- A simple train-test split is sufficient for larger datasets. It will run faster, and you may have enough data that there is little need to reuse some of it as a holdout.
- If your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer to run, cross-validation may slow down your workflow more than it's worth.
- You can run cross-validation and see whether the scores for each experiment are close. If each experiment yields similar results, a train-test split is probably sufficient (see the first sketch after this list).
- Using cross-validation gives much better measures of model quality, with the added benefit of cleaning up our code when paired with a pipeline (see the second sketch below).
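A minimal sketch of that agreement check, reusing the `scores` array from the cross_val_score sketch earlier; the 5% threshold is an arbitrary illustration, not a standard rule:

```python
# Reuses `scores` from the cross_val_score sketch above.
# If the per-fold scores barely differ, a single train-test split
# would likely have told a similar story.
relative_spread = scores.std() / scores.mean()
print("Relative spread across folds:", relative_spread)
if relative_spread < 0.05:  # 5% threshold is an arbitrary illustration
    print("Fold scores agree closely; a train-test split is probably enough.")
else:
    print("Fold scores vary; cross-validation is the safer choice.")
```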
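The code-cleanup benefit comes from bundling preprocessing and the model into one object, so cross-validation can re-fit the whole workflow on each fold with no manual bookkeeping of separate training and validation sets. A minimal sketch, where the imputer-plus-forest pipeline is an illustrative choice:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# One object holds the full workflow: preprocessing + model.
my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# No X_train/X_valid variables to track; cross_val_score handles
# the splitting and re-fitting for every fold.
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5,
                              scoring='neg_mean_absolute_error')
print("Average MAE:", scores.mean())
```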