Week 3: Performance Evaluation Flashcards
What is the error function for a classification problem?
The misclassification (0-1) error, i.e., the indicator function of y.hat ≠ y: value 1 if the prediction is misclassified, 0 if the predicted class equals the observed outcome.
What is the difference between the error function and the loss function?
The loss function is used when learning/training the model while the error function is used to evaluate the performance of an already learned model.
What is leave-one-out cross-validation?
A special case of the k-fold cross-validation, where the number of batches to validate on (k) equals the number of observations (n). k = n.
What is cross-validation?
A way of estimating E_new by repeatedly splitting the available data into training and hold-out validation parts, training the model on one part, evaluating it on the other, and averaging the resulting hold-out errors.
What is the error function for a regression problem?
The squared error;
(y.hat - y)^2.
Why can we not approximate E_new with E_train?
Because E_train is computed on the very data the model was trained on: the model has adapted to exactly those points, so E_train is systematically optimistic (generally lower than E_new) and says little about the performance on new, unseen data.
Why are we interested in estimating E_new? (4)
1) Judge if performance is satisfactory (E_new small), 2) Help choose between different methods and models, 3) Choose hyperparameters (e.g., k in kNN, regularisation parameter in ridge reg. or no of hidden layers in DL) in order to minimize E_new, 4) Serve as a good measure of the expected performance when presented to a customer.
Which value of E_new and E_train will, in general, be lower?
E_train.
What is VERY important when splitting the complete dataset into training, validation (and possibly also test) parts?
To split the data randomly: e.g., shuffle the data points before selecting some percentage as training and the rest as validation, in order to avoid potential (sorted) patterns or trends in the dataset.
This is most illustrative in the case of classification: if the data are sorted by class, the training part could end up containing (almost) only one class while the hold-out validation part contains the other.
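A minimal NumPy sketch of such a shuffled split; the toy data, the variable names and the 80/20 ratio are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy data, deliberately sorted by class to mimic an ordered dataset.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle the indices so the class ordering is destroyed before splitting.
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))              # e.g. 80 % training, 20 % hold-out validation

X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_val, y_val = X[idx[n_train:]], y[idx[n_train:]]
```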
What is the training error (E_train) for a classification problem with k = 1? Why?
Zero.
When k = 1, the prediction for a point is the label of its single closest training sample. When the model is evaluated on the training data itself, every point is its own closest neighbour, so it is always classified correctly and never makes a mistake. For this reason, the training error is zero when k = 1, irrespective of the dataset.
Note, there is one logical assumption here: the training set must not contain identical inputs belonging to different classes, i.e., conflicting information.
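A small scikit-learn sketch illustrating this; the random toy data and the choice of KNeighborsClassifier are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 2))            # toy inputs (no duplicate points)
y = rng.integers(0, 2, size=100)         # even random labels are "memorised" by 1-NN

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
E_train = np.mean(knn.predict(X) != y)   # fraction of misclassified training points
print(E_train)                           # 0.0, since every point is its own nearest neighbour
```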
What is the overall goal in supervised machine learning?
To minimize E_new (expected new data error).
What value of k is usually used for k-fold cross-validation?
k = 5 or 10.
Which two ways of cross-validation can we use?
1) k-fold cross-validation, 2) leave-one-out cross-validation.
What does the training error, E_train, tell us?
How well the method is able to “reproduce” the data from which it was learned (not very informative in ML).
Does the chosen loss function and chosen error function need to be the same?
No!
In which situations does it hold that E_hold-out is an unbiased estimate of E_new?
When we assume all data points (training+valid) to be drawn from p(x,y).
If the entire procedure (hold-out) is repeated multiple times, each time with new data, the average E_hold-out will be E_new.
What is the trade-off between a large validation data chunk and a large training data chunk?
If we increase the hold-out validation data, the variance of E_hold-out will decrease -> we expect it to be closer to E_new for a single experiment.
However, increasing the validation data, we also have less data to train the model on, possibly and commonly resulting in a larger E_new than we want.
Use k-fold C-V instead!
Describe the procedure of k-fold cross-validation.
1) Split all available data into k batches of similar size (l = 1, ..., k batches in total),
2) take batch l as hold-out validation data and the remaining k − 1 batches as training data,
3) train the model on the training data and compute E^(l)_hold-out as the average error on the hold-out validation data for batch l,
4) if l < k, set l <- l + 1 and repeat 2-3. If l = k, compute the k-fold cross-validation error as the average (1/k * sum) of E^(l)_hold-out over all k batches (l = 1, ..., k),
5) train the model again, using the entire dataset.
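A minimal sketch of this procedure; the toy regression data, the ridge-regression model and the squared-error loss are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge   # stand-in for whatever model is being evaluated

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

k = 5
idx = rng.permutation(len(X))            # shuffle first (data assumed independent, unordered)
batches = np.array_split(idx, k)         # step 1: k batches of similar size

E_hold_out = []
for l in range(k):                       # steps 2-4: each batch is hold-out validation once
    val = batches[l]
    train = np.concatenate([batches[j] for j in range(k) if j != l])
    model = Ridge(alpha=1.0).fit(X[train], y[train])
    E_hold_out.append(np.mean((model.predict(X[val]) - y[val]) ** 2))

E_k_fold = np.mean(E_hold_out)           # step 4: average of the k hold-out errors
final_model = Ridge(alpha=1.0).fit(X, y) # step 5: retrain on the entire dataset
```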
What do we take the expectation with respect to when creating E_new?
Expectation over all possible test data points w/ respect to the (unknown, theoretical) distribution p(x, y).
What is the risk of using E_k-fold (or E_hold-out) to choose the ultimate hyperparameter?
It will invalidate E_k-fold (or E_hold-out) as an estimator of E_new as it tends to overfit to the validation data. The result is that E_k-fold is overly optimistic as an estimator of the new data error.
What is the ultimate task if wanting to use E_k-fold as approximator of E_new?
To set aside another (besides the validation data) hold-out dataset, i.e., the test set. Test set should only be used ONCE (after selecting hyperparameters) to estimate E_new for the final model.
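A sketch of that workflow with scikit-learn, where k-fold CV selects the hyperparameter on the training+validation data and the test set is touched exactly once at the end; the kNN model, the candidate values and the split sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy classification labels

# Set the test set aside first; it is used exactly once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Choose the hyperparameter (number of neighbours) with 5-fold CV on the remaining data.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X_rest, y_rest)

# Single use of the test set: estimate E_new for the selected final model.
E_test = np.mean(search.predict(X_test) != y_test)
```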
What are the three common techniques for creating artificial datasets?
1) To duplicate data and add noise to the duplicates,
2) Simulate data,
3) Use data from a different but related problem.
What is the E.bar_NEW?
The average E_new if we were to train the model multiple times using different datasets (all size n), i.e., E.bar_NEW is average over all possible training datasets.
E.bar_NEW = Exp_TRAIN [E_NEW(TRAIN)]
Why do we use E.bar_NEW and E.bar_TRAIN, rather than E_NEW and E_TRAIN?
Because it is easier to reason about the average behaviour of E.bar_NEW and E.bar_TRAIN than about the errors E_NEW and E_TRAIN obtained when the model is trained on one specific training dataset.
What does it mean (in ML lingo) for a model to generalise from training data?
To be able to perform well on unseen data after being trained.
What is the (expected) generalisation gap?
The difference E.bar_new - E.bar_train.
I.e., the difference between the expected performance on training data and the expected performance “in production” on new, previously unseen data.
Which are the two decompositions of E.bar_NEW we can do?
1) Training Error-Generalisation Gap decomposition and 2) Bias-Variance decomposition.
Why don't we want to increase the model complexity in order to let E.bar_train decrease further?
Due to the risk of overfitting.
What is a rough definition of model complexity?
The number of parameters that are learned from the data.
How can we get a value for E_train in a classification problem using kNN?
By calculating the fraction of misclassified training data points.
Draw the graph of E.bar_new and E.bar_train with model complexity on x, error on y.
Why is the bias-variance composition of E.bar_new irrelevant when comparing different (ML) algorithms, but relevant for choosing (among) tuning parameters for one single method?
Irrelevant when comparing methods because different methods adapt to the data in different ways, so their bias and variance terms are not directly comparable. Relevant when tuning a single method because its tuning parameter typically controls the model complexity, i.e., it moves the model along the bias-variance trade-off.
Which three components can we split E.bar_new into?
1) variance term, 2) (squared) bias term and 3) irreducible error term
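Written out in the notation used elsewhere in these cards (the standard decomposition, with sigma^2 denoting the irreducible error): E.bar_new = (bias)^2 + variance + sigma^2, where sigma^2 comes from the noise in the data itself and cannot be removed by any choice of model.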
Why can't we use k-fold CV on time series data?
Because the data points in a time series are dependent and have a natural (temporal) order; shuffling and splitting them randomly would let the model be trained on observations from the future of the points it is validated on, so the estimate of E_new would be misleading.
What type of cross-validation should we use on time series data?
Rolling cross-validation.
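A minimal sketch of rolling (expanding-window) cross-validation using scikit-learn's TimeSeriesSplit; the toy series and the ridge-regression model are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(seed=0)
t = np.arange(200)
y = np.sin(0.1 * t) + rng.normal(scale=0.1, size=200)  # toy time series
X = t.reshape(-1, 1).astype(float)

# Each split trains on an initial segment and validates on the block that follows,
# so the model never sees "future" observations during training.
errors = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    errors.append(np.mean((model.predict(X[val_idx]) - y[val_idx]) ** 2))

E_rolling = np.mean(errors)
```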
Which loss function do we actually use when training a classifier? Why not the misclassification?
The cross-entropy loss. Instead of formulating the loss only in terms of the hard class prediction y.hat, it also incorporates the predicted class probability g(x).
We do not use misclassification since:
1) using cross-entropy can result in a model that generalizes better from the training data, since the final prediction y.hat does not reveal all aspects of the classifier (intuitively, cross-entropy pushes the decision boundaries further away from the training data points),
2) misclassification would give us a piecewise constant cost function, which is unusable for numerical optimization since the gradient is zero everywhere (except where it is undefined).
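For a binary classifier that outputs the class probability g(x) = P(y = 1 | x), a minimal sketch of the cross-entropy loss; the 0/1 label coding and the toy numbers are illustrative assumptions:

```python
import numpy as np

def cross_entropy(y, g, eps=1e-12):
    """Binary cross-entropy for labels y in {0, 1} and predicted probabilities g = P(y=1|x)."""
    g = np.clip(g, eps, 1 - eps)         # avoid log(0) for overconfident predictions
    return -np.mean(y * np.log(g) + (1 - y) * np.log(1 - g))

y = np.array([1, 0, 1, 1])
g = np.array([0.9, 0.2, 0.6, 0.99])      # predicted class probabilities, not hard predictions
print(cross_entropy(y, g))               # small loss: confident and mostly correct predictions
```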
Name the major downside with k-fold CV.
It is computationally very costly.
Does k-fold CV improve the upward bias situation that we have for E_hold out when estimating E_new?
No, but it reduces variance.
What are the two conflicting goals regarding T_v and T_train?
That we need a large T_train to reduce the (upward) bias of the estimate, but we also need a large T_v to reduce the variance of the estimate, i.e., to make E_hold-out less uncertain.
When do we use the rolling CV?
For time series data which is dependent and has a natural order.
What is a necessary assumption for the data points when using k-fold CV?
That the data points are independent and unordered. Therefore, assign data randomly before any splitting into training and validation data.
What is the TPR (true pos. rate)?
TPR is the probability that an actual positive will be predicted as positive.
What is the TPR also called?
The sensitivity.
What is the TNR (true negative rate)?
The probability that an actual negative will be predicted as negative.
What is the TNR also called?
The specificity.
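A small sketch computing TPR (sensitivity) and TNR (specificity) from a set of predictions; the toy labels and predictions are illustrative:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # actual classes
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # predicted classes

TP = np.sum((y_pred == 1) & (y_true == 1))    # true positives
FN = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
TN = np.sum((y_pred == 0) & (y_true == 0))    # true negatives
FP = np.sum((y_pred == 1) & (y_true == 0))    # false positives

TPR = TP / (TP + FN)   # sensitivity: P(predicted positive | actually positive)
TNR = TN / (TN + FP)   # specificity: P(predicted negative | actually negative)
print(TPR, TNR)        # 0.75 0.75 for this toy example
```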
What is the fundamental principle of data usage in ML?
Don’t use the same data for two different purposes.
What does it mean that E_hold out is an upward biased estimate of E_new? Why?
That the estimate E_hold out is expected to be larger than the true value of E_new.
Why: the hold-out estimate is computed for a model trained on only part of the data (T_train), whereas the final model is trained on all available data. With less training data the model performs somewhat worse on average, so E_hold-out tends to overestimate the E_new of the final model.
Definition of the training error?
It is the error obtained when the model is evaluated on the same data it was learned from.
Does the generalization gap decrease or increase with model complexity (think of plot with E.new and E.train difference)? Draw a graph!
Increase.
What is true regarding (sq) bias and variance for a model with low complexity?
It has low variance but high bias.
What is true regarding (sq) bias and variance for a model with high complexity?
It has high variance but low bias.
Which error functions do we use?
The misclassification error for classification problems and the squared error for regression problems.
What is the definition of the error function?
A measure of how much our output prediction for a single data point misses the true value of exactly this data point.
What is the definition of E.bar_new?
The average E_new if we were to train the model on different training data sets.
Illustrate the generalization gap.