Week 3: Performance Evaluation Flashcards
What is the error function for a classification problem?
The misclassification error: the indicator function of y.hat != y, i.e., value 1 if the point is misclassified, 0 if the predicted class equals the observed outcome.
What is the difference between the error function and the loss function?
The loss function is used when learning/training the model while the error function is used to evaluate the performance of an already learned model.
What is leave-one-out cross-validation?
A special case of the k-fold cross-validation, where the number of batches to validate on (k) equals the number of observations (n). k = n.
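A minimal sketch of this equivalence, assuming scikit-learn and NumPy are available (the toy array is purely illustrative):

```python
# Leave-one-out CV is the k = n special case of k-fold CV.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)   # n = 10 toy inputs
n = len(X)

loo_splits = list(LeaveOneOut().split(X))
kfold_splits = list(KFold(n_splits=n).split(X))

# Both produce n splits, each holding out exactly one observation.
print(len(loo_splits), len(kfold_splits))            # 10 10
print(all(len(val) == 1 for _, val in loo_splits))   # True
```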
What is cross-validation?
A way of estimating E_new by repeatedly splitting the available data into training and hold-out validation parts, training the model on one part, computing the error on the other, and averaging the hold-out errors over the splits.
What is the error function for a regression problem?
The squared error;
(y.hat - y)^2.
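A hedged sketch of the two error functions from the previous cards, averaged over a dataset; the function names are my own, not from the course material:

```python
import numpy as np

def misclassification_error(y_hat, y):
    # Indicator I{y_hat != y}, averaged over the data points.
    return np.mean(y_hat != y)

def squared_error(y_hat, y):
    # (y_hat - y)^2, averaged over the data points.
    return np.mean((y_hat - y) ** 2)

# Example usage with toy predictions
y_class = np.array([0, 1, 1, 0])
y_class_hat = np.array([0, 1, 0, 0])
print(misclassification_error(y_class_hat, y_class))  # 0.25

y_reg = np.array([1.0, 2.0, 3.0])
y_reg_hat = np.array([1.1, 1.9, 3.5])
print(squared_error(y_reg_hat, y_reg))                # 0.09
```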
Why can we not approximate E_new with E_train?
Because E_train is computed on the same data the model was trained on, so it is a biased (overly optimistic) estimate of E_new: a flexible model can fit, or even memorise, the training data and achieve a small E_train while still performing poorly on new data.
Why are we interested in estimating E_new? (4)
1) Judge whether the performance is satisfactory (E_new small), 2) help choose between different methods and models, 3) choose hyperparameters (e.g., k in kNN, the regularisation parameter in ridge regression or the number of hidden layers in deep learning) in order to minimize E_new, 4) serve as a measure of the expected performance when the model is presented to a customer.
Which value of E_new and E_train will, in general, be lower?
E_train.
What is VERY important when splitting the complete dataset into training, validation (and possibly also test) parts?
To split the data randomly. E.g., shuffle the data points around before selecting some percentage as training and rest as validation, in order to avoid potential (sorted) patterns or trend in the dataset.
This is most clearly seen in classification: if the data are sorted by class, the training data could contain only one class while the hold-out validation data contains only the other.
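A small sketch of such a random split, assuming NumPy; the toy dataset is deliberately sorted by class to mimic the failure mode described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset, stored sorted by class (all class 0 first, then all class 1).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

perm = rng.permutation(len(X))   # shuffle indices to break the sorted order
n_train = int(0.8 * len(X))      # e.g. 80 % training, 20 % hold-out validation

train_idx, val_idx = perm[:n_train], perm[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# Without the shuffle, the validation set would contain only class 1.
print(np.bincount(y_val, minlength=2))
```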
What is the training error (E_train) for a classification problem with k = 1? Why?
Zero.
When k = 1, the prediction for a point is the label of its nearest neighbour in the training set. When we evaluate on the training data, each point is its own nearest neighbour, so it is always classified correctly. For this reason, the training error is zero when k = 1, irrespective of the dataset.
Note the one logical assumption: the training set must not contain identical inputs with different labels, i.e., conflicting information.
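A quick demonstration of this, assuming scikit-learn; the random data are illustrative (with continuous features, duplicate inputs are essentially impossible, so the assumption above holds):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = rng.integers(0, 2, size=100)   # even random labels get "memorised"

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
E_train = np.mean(knn.predict(X_train) != y_train)
print(E_train)  # 0.0: each training point is its own nearest neighbour
```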
What is the overall goal in supervised machine learning?
To minimize E_new (expected new data error).
Which value of k is usually used for k-fold cross-validation?
k = 5 or 10.
Which two forms of cross-validation can we use?
1) k-fold cross-validation, 2) leave-one-out cross-validation.
What does the training error, E_train, tell us?
How well the method is able to “reproduce” the data from which it was learned (not very informative in ML).
Does the chosen loss function and chosen error function need to be the same?
No!
In which situations does it hold that E_hold-out is an unbiased estimate of E_new?
When we assume all data points (training+valid) to be drawn from p(x,y).
If the entire procedure (hold-out) is repeated multiple times, each time with new data, the average E_hold-out will be E_new.
What is the trade-off between a large validation data chunk and a large training data chunk?
If we increase the hold-out validation data, the variance of E_hold-out will decrease -> we expect it to be closer to E_new for a single experiment.
However, by increasing the validation data we also leave less data to train the model on, which commonly results in a larger E_new than necessary.
Use k-fold cross-validation instead!
Describe the procedure of k-fold cross-validation.
1) Split all available data into k batches of similar size (l=1…,k batches total),
2) take batch l as hold-out validation data and the remaining k-1 batches as training data,
3) train the model on the training data and compute E^(l)_hold-out as the average error on hold-out validation batch l,
4) if l < k, set l <- l+1 and repeat 2)-3). If l = k, compute the k-fold cross-validation error E_k-fold as the average (1/k * sum over l) of E^(l)_hold-out over all k batches,
5) train the model again, using the entire dataset.
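A from-scratch sketch of the procedure above, assuming NumPy and scikit-learn; kNN is just a stand-in for any model with fit/predict, and the toy data at the end are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def k_fold_cv_error(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))              # 1) shuffle, then split into k batches
    batches = np.array_split(perm, k)

    errors = []
    for l in range(k):                          # steps 2)-4), looping over the k batches
        val_idx = batches[l]
        train_idx = np.concatenate([batches[m] for m in range(k) if m != l])
        model = KNeighborsClassifier(n_neighbors=3)
        model.fit(X[train_idx], y[train_idx])   # 3) train on the remaining k-1 batches
        errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))

    E_k_fold = np.mean(errors)                  # 4) average of the k hold-out errors

    final_model = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # 5) retrain on all data
    return E_k_fold, final_model

# Toy usage
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)
E, model = k_fold_cv_error(X_toy, y_toy, k=5)
print(E)
```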
What do we take the expectation with respect to when creating E_new?
Expectation over all possible test data points w/ respect to the (unknown, theoretical) distribution p(x, y).
What is the risk of using E_k-fold (or E_hold-out) to choose the ultimate hyperparameter?
It will invalidate E_k-fold (or E_hold-out) as an estimator of E_new as it tends to overfit to the validation data. The result is that E_k-fold is overly optimistic as an estimator of the new data error.
What must we do if we still want to use E_k-fold as an approximation of E_new after hyperparameter selection?
Set aside another hold-out dataset (besides the validation data), i.e., the test set. The test set should only be used ONCE (after selecting the hyperparameters) to estimate E_new for the final model.
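A sketch of this full workflow, assuming scikit-learn; the dataset, the candidate values of k and the 80/20 split are illustrative choices, not prescribed by the course material:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Set the test set aside first; it plays no part in training or selection.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

# Choose k in kNN by 5-fold cross-validation on the non-test data.
candidate_ks = [1, 3, 5, 9, 15]
cv_errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_rest, y_rest, cv=5).mean()
             for k in candidate_ks]
best_k = candidate_ks[int(np.argmin(cv_errors))]

# Final model on all non-test data; the test set is used exactly once.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
E_test = np.mean(final_model.predict(X_test) != y_test)
print(best_k, E_test)
```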