Overfitting Flashcards
What’s the main thing that controls the complexity of a model? Give examples.
- # of predictors (including quadratic and higher-order powers, and interaction terms)
- Hyperparameters. Examples:
Decision trees: tree depth (deeper = more overfitting)
k-nearest neighbors: number of neighbors (fewer = more overfitting)
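For example (a minimal sketch; the regressor variants and the specific values are just illustrative), these knobs show up directly as sklearn constructor arguments:

from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Deeper tree = more complex model = more prone to overfitting
tree = DecisionTreeRegressor(max_depth=10)
# Fewer neighbors = more complex model = more prone to overfitting
knn = KNeighborsRegressor(n_neighbors=1)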
What is in-sample error and how does it increase/decrease if you make your model more complex?
It is the total error (e.g., MSE) of the TRAINING data (sub)set. It always decreases with higher model complexity.
What is out-of-sample error and how does it increase/decrease if you make your model more complex?
It is model error measured on data which the model did not see during training (i.e., the test/validation data).
It is lowest at the point where the model is optimally complex, and higher when the model is either not complex enough (underfitting) or too complex (overfitting).
Therefore, out-of-sample error is the thing you want to minimize to reach the optimal model. Different model families (e.g., regression vs decision tree) and different hyperparameters within each (e.g., “max_depth”) will achieve different out-of-sample errors.
What does train_test_split do and how do you use it all syntactical-like?
It randomly splits your data (both the predictors X and outcomes y) into training (most of the data) vs test (rest of the data) samples.
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
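A quick sketch of how in-sample vs out-of-sample error fall out of that split (LinearRegression is just an arbitrary example model):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, y_train)

# In-sample (training) error
train_mse = mean_squared_error(y_train, model.predict(X_train))
# Out-of-sample (validation) error -- the one you actually want to minimize
valid_mse = mean_squared_error(y_valid, model.predict(X_valid))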
What are the two components of out-of-sample error?
out-of-sample error (aka test data error) ≈ bias + variance (plus irreducible noise). Bias roughly tracks in-sample error (aka training data error).
Bias is the extent to which the model systematically misses the underlying signal (underfits) and always decreases w/model complexity.
Variance is the extent to which the model captures noise (its fit changes a lot from one training sample to another) and always increases w/model complexity.
“The more signal you pick up, the more noise you also pick up.”
Since out-of-sample error is the sum of the two, it is lowest somewhere in the middle.
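For reference, the textbook version of this decomposition (for squared-error loss) is:

E[(y - \hat{f}(x))^2] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2_{\mathrm{noise}}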
What is cross-validation, aka K-fold validation? And what is its output metric?
Divide the dataset into K parts; each part takes one turn as the test/validation subset, with the remaining K-1 parts forming the training subset.
Output metric: cross-validation error (the mean of the K out-of-sample errors). The goal is to choose the model and hyperparameters that minimize it.
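A minimal sketch of computing it in sklearn (the decision tree estimator, K=5, and MSE as the error metric are just example choices):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# sklearn scorers are "higher is better", so MSE comes back negated
scores = cross_val_score(DecisionTreeRegressor(max_depth=3), X, y,
                         cv=5, scoring="neg_mean_squared_error")
cv_error = -scores.mean()  # the cross-validation error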
What is cross-validation error?
It’s the output of cross-validation (deuhrrr) - specifically the mean of the K out-of-sample (validation fold) errors. The goal is to choose the model and hyperparameters that minimize it.
At a high-level, what is the order of operations for running k-fold validation?
- Shuffle both X and y the same way first, using sklearn’s built-in method:
from sklearn.utils import shuffle
X_random_order, y_random_order = shuffle(X, y, random_state=42)
This is because sklearn’s cross_val_score simply takes consecutive chunks of the data it’s fed (it does not shuffle by default).
- Vary the hyperparameter and output the cross-validation error each time.
- Choose the hyperparameter value with the lowest cross-validation error and create the final model using that hyperparameter value. Use the ENTIRE data this time, not just a training chunk.
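A rough end-to-end sketch of those steps (k-nearest neighbors and the candidate n_neighbors values are just an example hyperparameter search):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils import shuffle

X_random_order, y_random_order = shuffle(X, y, random_state=42)

cv_errors = {}
for k in [1, 3, 5, 10, 20]:
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k),
                             X_random_order, y_random_order,
                             cv=5, scoring="neg_mean_squared_error")
    cv_errors[k] = -scores.mean()  # cross-validation error for this k

best_k = min(cv_errors, key=cv_errors.get)
# Refit on the ENTIRE (shuffled) dataset using the winning hyperparameter
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_random_order, y_random_order)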
What is the most automated way to run many variations of a model, figure out the best variation, and train the final model all at once?
GridSearchCV. It also lets you check which hyperparameters turned out to be optimal.
What is “grid search”?
How does it relate to the training/test data split?
An automated way to run many variations of a model (within a hyperparameter space that you specify), figure out the best variation, and train the final model all at once. It relates to the training/test split because it runs K-fold cross-validation internally: for each hyperparameter combination in the grid, it splits the data you give it into training and validation folds and records the cross-validation error. It also lets you check which hyperparameters turned out to be optimal.
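A minimal GridSearchCV sketch (the decision tree and the max_depth grid are arbitrary examples):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {"max_depth": [2, 3, 5, 10, None]}
search = GridSearchCV(DecisionTreeRegressor(), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)           # runs K-fold CV for every combination in the grid

search.best_params_        # which hyperparameters turned out to be optimal
search.best_estimator_     # the final model, refit on all the data it was given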
Optimally, should you run cross-validation and/or grid search on your entire dataset?
No.
The intuitive answer is “yes” because those two techniques split the data you give them into training and “test” subsets. But ideally, you also leave aside a “super test” data chunk that those techniques NEVER see or train on. After cross-validation or grid search finishes and spits out the optimal model, you THEN run that model on the “super test” X to see how well it predicts the “super test” y.
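A sketch of that workflow (the 20% hold-out fraction and the grid-search setup are just illustrative):

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Carve off the "super test" chunk first; CV/grid search never sees it
X_rest, X_supertest, y_rest, y_supertest = train_test_split(
    X, y, test_size=0.2, random_state=42)

search = GridSearchCV(DecisionTreeRegressor(), {"max_depth": [2, 3, 5, 10]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_rest, y_rest)

# Only now touch the held-out data, to estimate real-world performance
supertest_mse = mean_squared_error(y_supertest, search.predict(X_supertest))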