Model Selection and Validation Flashcards
Validation: statement of bound on generalization error using validation set
Validation: parittion the training set into two sets. One is used for training each of the candidate models, and the second is used for deciding which of them yields the best results.
Once you pick a hypothesis, use new data to estimate its true error.
Let h be some predictor and assume that the loss function is in [0,1]. Then, for every delta in (0,1) with probability of at least 1-delta (over the choice of a validation set V of size mv):
|Lv(h) - Ld(h)| <= sqrt(log(2/delta)/2mv)
Compared to VCdim (comparing it with the fundamental theorem of learning), if mv is in the order of m, validtion is more accurate.
Validation for model selection: statement of bound on generalization error using validation set
Choose the predictor that minimizes the error over the validation set. In other words, we apply ERMh over the validation set.
Assume that the loss function is in [0,1]. Then, for every delta in 0,1 wiht probability >= 1-delta over the choice of a validation set V of size mv:
|Lv(h) - Ld(h)| <= sqrt(log(2|H|/delta)/2mv)
Model-selection curve, train-validation-test split, k-fold cross validation
The model selection curve shows the training error and validation error as a function of the complexity of the model considered. If training error decreases but validation increases it means overfitting
Train validation test split (estimate the true risk after model selection)
- Training set: used to learn the best model hi from each Hi
- Validation set: used to pick one hypothesis h from {h1,…,hr}
- Test set: used to estimate the true risk Ld(h). Not involved in the choice of h
K-fold cross validation: used when we have not much data
- partition (traiing) set into k folds of size m/k
- for each fold:
- train on union of other folds
-estimate error (for learned hypothesis) from the fold
- estimate the true error = average of the estimated errors above
Pseudocode:
input:
training set S = x1y1… xmym
set of parameter values
learning algorithm A
integer k
partition: S into S1…Sk
for each: parameter in the set of parameter
for i 1…k
hi,par = A(S\Si, par)
error (par) = 1/k SUM Ls(hi,par)
output
par* = argmin[error(par)]
hpar = A(S;par)
If learning fails
- if you have parameters to tune, plot model-selection curve to make sure they are tuned appropriately
- if training error is excessively large consider:
- enlarge H
- change H
- change feature representation of the data
- if training error is small, use learning curves to understand wheter problem is approximation error (or estimation error)