Model Accuracy pt 2 Flashcards
Validation set
For this exam, the test set is actually the validation set.
Observations in a validation set are those used for validating model fits but not for fitting/training. This includes choosing between models and suggesting a preferred model setup, thus indirectly contributing to f^.
In a situation with all three sets, a test set would then refer to observations used for a final evaluation of the chosen model at the end of the analysis, thus never influencing the modeling process.
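A minimal sketch of such a three-way split, assuming scikit-learn; the column names and split proportions are hypothetical and purely illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "target": range(100)})  # illustrative data

# First carve off the test set, which is held back until the final evaluation
train_val, test = train_test_split(df, test_size=0.2, random_state=0)

# Then split the remainder into training (fitting) and validation (model selection)
train, val = train_test_split(train_val, test_size=0.25, random_state=0)
```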
Data Partition - Seed
After setting a seed (an arbitrary number), a random number generator will produce random values that are reproducible every time the seed is reset to the same value.
Being able to reproduce randomly generated values allows results to be replicated. One such result is the random partitioning of a dataset into training and test sets.
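A minimal sketch of seed-based reproducibility, assuming scikit-learn, where random_state plays the role of the seed; the toy DataFrame is hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "target": [i % 3 for i in range(100)]})

# random_state acts as the seed: rerunning with the same value reproduces the same split
train, test = train_test_split(df, test_size=0.25, random_state=42)
train2, test2 = train_test_split(df, test_size=0.25, random_state=42)

assert train.equals(train2)  # identical partitions because the seed is the same
```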
Data Partition - Stratified Sampling
Stratification is the act of creating/forming distinct groups called strata. Stratified sampling is the act of randomly sampling from each stratum.
It is common to obtain training and test sets by stratified sampling. One way is to stratify the dataset by the target, so that both sets have a similar variety of target values. If the training and test sets do not have a similar variety of target values, variance is added to the model; the impact is similar to overfitting to noise in the training set.
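A hedged sketch of a stratified split by the target, assuming scikit-learn and a categorical target (a numeric target would typically be binned first); the toy data are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(120), "target": ["A", "B", "C"] * 40})

# stratify=df["target"] keeps the target's class proportions similar in both sets
train, test = train_test_split(
    df, test_size=0.25, random_state=1, stratify=df["target"]
)

print(train["target"].value_counts(normalize=True))
print(test["target"].value_counts(normalize=True))  # roughly the same proportions
```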
Curse of dimensionality
The problem created when the number of explanatory variables, or the number of levels in explanatory factor variables, is large compared to the volume of data (i.e., the number of observations).
As the number of predictors increases, more observations are needed to retain the same wealth of information; otherwise, that information becomes increasingly diluted.
In a GLM, each additional level in a factor variable results in the creation of an additional binary explanatory variable. Having many variables or variable levels can result in model complexity that is greater than that of the underlying process being modeled. This extra complexity ends up fitting the idiosyncratic noise in the training data.
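An illustrative sketch, assuming pandas, of how non-baseline factor levels become binary (dummy) explanatory variables and widen the design matrix; the variable names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East", "West", "North"]})

# drop_first=True mimics a GLM's baseline level: 4 levels -> 3 binary columns
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)
# Each additional level adds another column, so the design matrix widens quickly
# relative to the number of observations.
```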
Cross-Validation
k-fold cross-validation is a way to tune one or more hyperparameters without using a validation/test set.
1. Randomly divide the observations into k groups called folds of roughly equal size.
2. For v = 1, …, k, obtain the vth fit by training with all observations except those in the vth fold.
3. For v = 1, …, k, use y^ from the vth fit to calculate an accuracy or error metric (e.g. RMSE) with observations in the vth fold.
4. Average the k metrics in step 3 to calculate the CV metric.
5. Repeat steps 2 to 4 for all other combinations of hyperparameter values; the best combination produces the best CV metric.
This approach involves dividing the training data into folds, training the model on all but one of the folds, and measuring performance on the remaining fold. This process is repeated so that each fold serves once as the holdout, producing a distribution of performance values for a given hyperparameter (e.g. cp) value. The hyperparameter value that yields the best cross-validation performance metric (e.g. RMSE or AUC) is then selected, as sketched below.
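A minimal sketch of steps 1 to 5, assuming scikit-learn; here ccp_alpha stands in for a cost-complexity hyperparameter like rpart's cp, the data are simulated, and RMSE is the CV metric:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # step 1: form the k folds

results = {}
for alpha in [0.0, 0.01, 0.1, 1.0]:                    # candidate hyperparameter values
    fold_rmse = []
    for train_idx, val_idx in kf.split(X):
        model = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
        model.fit(X[train_idx], y[train_idx])           # step 2: fit on all but one fold
        pred = model.predict(X[val_idx])                # step 3: error on the held-out fold
        fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
    results[alpha] = np.mean(fold_rmse)                 # step 4: average the k metrics

best_alpha = min(results, key=results.get)              # step 5: best CV metric wins
print(results, best_alpha)
```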
One-Standard-Error Rule
For hyperparameters that measure flexibility (e.g. lambda), rather than selecting the value that produces the best CV metric, the one-standard-error rule is an alternative that selects the value corresponding to the lowest flexibility among those producing CV metrics within one standard error of the best metric.
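A hedged sketch of the one-standard-error rule, assuming RMSE (lower is better), per-fold scores stored in a hypothetical dictionary cv_scores, and that a larger lambda means lower flexibility:

```python
import numpy as np

# per-fold CV RMSEs for each candidate lambda (illustrative numbers)
cv_scores = {0.01: [5.1, 5.3, 5.0], 0.1: [5.2, 5.2, 5.1], 1.0: [5.4, 5.6, 5.5]}

means = {lam: np.mean(v) for lam, v in cv_scores.items()}
best_lam = min(means, key=means.get)

# standard error of the best candidate's CV metric
se = np.std(cv_scores[best_lam], ddof=1) / np.sqrt(len(cv_scores[best_lam]))
threshold = means[best_lam] + se

# among candidates within one SE of the best, pick the least flexible (largest lambda here)
chosen = max(lam for lam, m in means.items() if m <= threshold)
print(best_lam, chosen)
```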