3.2 Model evaluation Flashcards
What is random holdout?
Randomly partition the data into “training” and “test” sets
• A portion of the data is “held out”; never seen during training
• Model is tested only on the unseen “holdout” data
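A minimal random-holdout sketch in Python using scikit-learn (the synthetic dataset and the 80/20 split are illustrative assumptions, not part of the flashcard):

```python
# Random holdout: train on one partition, test only on the unseen one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```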
What is precision?
The proportion of positively classified instances that are actually relevant. A test can cheat and maximize this by returning “positive” only for the one result it is most confident in.
What is specificity?
How good a test is at avoiding false alarms. A test can cheat and maximize this by always returning “negative”.
What is Sensitivity/recall?
How good a test is at detecting the positives. A test can cheat and maximize this by always returning “positive”.
While sensitivity identifies the rate at which observations from the positive class are correctly predicted, precision indicates the rate at which positive predictions are correct.
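A plain-Python sketch with made-up labels that computes all three measures from the TP/FP/TN/FN counts and demonstrates the two “cheats” described above:

```python
# Compute precision, recall (sensitivity), and specificity from raw counts.
def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    return precision, recall, specificity

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # 3 positives, 5 negatives

# "Always positive" cheat: recall is perfect, specificity collapses.
print(metrics(y_true, [1] * 8))   # (0.375, 1.0, 0.0)

# "Always negative" cheat: specificity is perfect, recall collapses.
print(metrics(y_true, [0] * 8))   # (0.0, 0.0, 1.0)
```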
What is cross-validation?
k-fold cross-validation has a single hyperparameter, k, that controls the number of subsets a dataset is split into. Once split, each subset takes a turn as the test set while all the other subsets together are used as the training set.
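A minimal k-fold sketch (scikit-learn; the synthetic data and k = 5 are illustrative assumptions):

```python
# k-fold cross-validation sketch: k controls the number of folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Each of the k=5 folds serves once as the test set.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```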
What is leave-one-out cross-validation?
LOOCV is an extreme version of k-fold cross-validation with the maximum computational cost: it requires one model to be created and evaluated for each example in the dataset.
The benefit of fitting and evaluating so many models is a more robust estimate of model performance, as each row of data is given an opportunity to represent the entirety of the test set.
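A LOOCV sketch under the same assumptions (scikit-learn, synthetic data), kept small since it fits one model per row:

```python
# LOOCV sketch: one model per example, so n fits for n rows.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)

# Each single row takes a turn as the entire test set.
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())  # average of 50 single-row tests
```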
Which partitioning method tends to be the fastest?
Random holdout is likely to be the fastest because it involves just one random partition into train and test sets, so the model would only need to be trained and tested once. All the other methods involve multiple partitions, so the steps (partition, train, test) would be repeated multiple times.
Match each measure (precision, recall, sensitivity, specificity) to its formula. In these formulas, TP = “true positives,” FP = “false positives,” TN = “true negatives,” and FN = “false negatives.”
Precision is the accuracy rate when the model predicts “positive”: Precision = TP / (TP + FP).
Recall (or sensitivity) is the proportion of the true positive cases that the model is able to detect: Recall = Sensitivity = TP / (TP + FN).
Specificity is the proportion of the true negative cases that the model is able to detect: Specificity = TN / (TN + FP).
Which are true of a model with very high precision but very low recall? (Select all that are true)
A model with very high precision and low recall is unlikely to produce false positives, but very likely to produce false negatives. It requires quite a lot of evidence to say that a case is “positive” and as a result is likely to miss many positive cases. When detecting hazards, recall is very important but precision is less important (it’s important to detect every “positive” case, but a few false positives are usually acceptable).
Which of these tasks would be easiest for a 0R classifier (i.e., on which task would its accuracy be highest), assuming the training and test sets have exactly the same distribution of categories?
Classify A vs. B, when A is 2/3 of the dataset and B is 1/3
0R is correct if the test instance belongs to the most common class and incorrect if it belongs to any other class, so the larger the most common class, the higher the accuracy of 0R.
What is the lowest expected performance (in theory, if the classifier had an infinite number of test instances) for a 0R classifier distinguishing between 4 categories, assuming the distribution of categories is exactly the same in the training and test set?
The worst-case scenario for 0R is when all categories are equally common (each of the N classes has a prior probability of 1/N). For 4 classes, the worst case is that each class is 25% of the dataset, and 0R’s expected theoretical accuracy would be 25%.
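A sketch of both answers using scikit-learn’s DummyClassifier as a stand-in for 0R (the datasets are fabricated to match the class proportions discussed above):

```python
# 0R sketch: always predict the most common training class.
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((300, 1))  # 0R ignores the features entirely

# Easiest case above: A is 2/3, B is 1/3 -> accuracy ~ 2/3.
y = np.array(["A"] * 200 + ["B"] * 100)
print(DummyClassifier(strategy="most_frequent").fit(X, y).score(X, y))

# Worst case for 4 classes: each is 25% -> accuracy = 0.25.
y4 = np.array(["A", "B", "C", "D"] * 75)
print(DummyClassifier(strategy="most_frequent").fit(X, y4).score(X, y4))
```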
What is repeated random subsampling?
• Random holdout repeated multiple times:
• Randomly assign data to “training” and “test” (usually with a fixed split, like 50-50)
• Train a new model on the “training” data
• Test on the “test” data
• Final evaluation: average over all iterations
• Slower, but the result should be more reliable than one random holdout
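A sketch using scikit-learn’s ShuffleSplit, which performs exactly this kind of repeated random 50-50 holdout (the dataset and the choice of 10 repetitions are illustrative):

```python
# Repeated random subsampling: random holdout repeated n_splits times.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# 10 independent random 50-50 splits, averaged at the end.
splitter = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=splitter)
print("mean over 10 holdouts:", scores.mean())
```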
How does ZeroR compare to OneR?
ZeroR only uses the frequency of the classes to make predictions; it ignores all attributes.
OneR steps:
• For each attribute: assign each level of that attribute to the most likely class, then calculate the error rate of this approach.
• Compare the error rates of all attributes and select the attribute with the highest classification accuracy (lowest error rate) – this attribute is the “one rule”.
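A self-contained OneR sketch in plain Python (the toy attributes, levels, and labels are invented for illustration):

```python
# OneR sketch: pick the single attribute whose level -> majority-class
# rule gives the fewest training errors.
from collections import Counter

def one_r(rows, labels):
    best = None
    for attr in rows[0]:  # assumes every row has the same attributes
        # Majority class for each level of this attribute.
        by_level = {}
        for row, label in zip(rows, labels):
            by_level.setdefault(row[attr], []).append(label)
        rule = {lvl: Counter(ls).most_common(1)[0][0]
                for lvl, ls in by_level.items()}
        errors = sum(rule[row[attr]] != label
                     for row, label in zip(rows, labels))
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best  # the "one rule": (attribute, training errors, level -> class)

rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "rain", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain", "windy": "yes"}]
labels = ["play", "stay", "play", "stay"]
print(one_r(rows, labels))  # ('outlook', 0, {'sunny': 'play', 'rain': 'stay'})
```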