8-Model evaluation 1 Flashcards
What is holdout evaluation strategy?
Partition the data into training and test sets. Train only on the training data, evaluate only on the test data. Usually an 80-20 or 90-10 split
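A minimal sketch of a holdout evaluation, assuming scikit-learn is available; the toy dataset, decision tree learner, and seeds are illustrative choices, not part of the flashcard:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # toy data

# 80-20 holdout: train only on the training split, evaluate only on the test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```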
What are the advantages of holdout strategy?
Simple to work with and implement
High reproducibility
What are the disadvantages of holdout strategy?
The size of the split affects model behaviour. Too many test instances leaves too little data for learning; too many training instances leaves too little data for reliable evaluation.
What is repeated random subsampling?
Run holdout strategy multiple times on random set of training and test elements. Evaluate by averaging across the iterations.
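A sketch of repeated random subsampling under the same assumptions (scikit-learn, toy data); the number of repetitions and split size are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # toy data

# Run the holdout strategy 10 times with a different random split each time,
# then average the per-iteration scores
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores.append(DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
print("mean accuracy over 10 runs:", np.mean(scores))
```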
What is the advantage of repeated random subsampling?
Produces more reliable results than holdout strategy alone
What are the disadvantages of repeated random subsampling?
Difficult to reproduce
Slower than holdout strategy
A poor choice of training and test set sizes can lead to highly misleading results
What is cross validation?
Data is split into several partitions (m >= 2) and iteratively one partition is used as test data, while the other m-1 partitions are used as training data. The evaluation metric is aggregated across the m iterations.
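A sketch of m = 10 cross-validation using scikit-learn (an assumption; the partitioning can also be coded by hand):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # toy data

# 10-fold cross-validation: each partition is the test set exactly once,
# the other 9 partitions are training data; the scores are then aggregated
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("mean CV accuracy:", scores.mean())
```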
What are the pros of cross validation?
Very reproducible
Takes roughly the same time as repeated random subsampling
Every instance is a test instance for some partition
Minimises bias and variance of estimates
What is the inductive learning hypothesis?
Any hypothesis found to approximate the target function well over a training data set will also approximate the target function well over unseen data
What is error rate?
Fraction of incorrect predictions
Why is error rate not ideal?
Some problems require us to penalise false negatives more heavily, e.g. medical diagnosis. Others require us to penalise false positives more heavily, e.g. spam classification
What is accuracy?
Accuracy is 1 - error rate
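A minimal sketch of both quantities on hypothetical predictions:

```python
# Error rate = fraction of incorrect predictions; accuracy = 1 - error rate
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]  # hypothetical predictions: 2 of 6 are wrong

error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
accuracy = 1 - error_rate
print(error_rate, accuracy)  # 0.333..., 0.666...
```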
What is precision?
Precision is calculated for a specific label as TP / (TP + FP)
Intuitively it is “how often are we correct when we predict an instance is interesting?”
What is recall?
Recall is calculated for a specific label as TP / (TP + FN)
Intuitively it is “how often do we correctly classify an interesting instance as interesting?”
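Both metrics computed on hypothetical counts for the interesting (positive) label:

```python
# Hypothetical counts for the interesting (positive) label
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)  # 0.8: when we predict "interesting", how often are we right?
recall = TP / (TP + FN)     # 0.667: how many interesting instances do we find?
print(precision, recall)
```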
What relationship holds for precision and recall?
Usually precision and recall are in an inverse relationship
What is the F score?
The F score is used when we want both precision and recall to be high.
It is calculated as F_b = (1 + b^2)PR / (b^2 * P + R), where P is precision and R is recall. With b = 1 this gives the F1 score, the harmonic mean of precision and recall: F1 = 2PR / (P + R)
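A one-function sketch of the formula above:

```python
def f_score(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_score(0.8, 2 / 3))          # F1: harmonic mean of P and R
print(f_score(0.8, 2 / 3, beta=2))  # F2 weights recall more heavily than precision
```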
What is ROC (Receiver Operating Characteristics) curve?
A graph plotting the true positive rate (y-axis) against the false positive rate (x-axis) as the classification threshold varies. We want the curve to lie in the top left triangle of the chart, above the diagonal of a random classifier.
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
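A sketch that traces ROC points from hypothetical classifier scores; each threshold yields one (FPR, TPR) pair:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])                  # hypothetical labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])  # hypothetical scores

# One (FPR, TPR) point per threshold; plotting these points traces the ROC curve
for threshold in np.unique(scores):
    y_pred = scores >= threshold
    tpr = (y_pred & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr = (y_pred & (y_true == 0)).sum() / (y_true == 0).sum()
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```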
What is a contingency table?
A 2 by 2 table capturing the counts of TP, FP, FN and TN for a class of interest
What is a confusion matrix?
An n by n table capturing how often instances of each actual class are predicted as each class. To score a single class of interest, only that class is deemed TP; all other classes count as TN, even when one non-interesting class is misclassified as another.
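A minimal sketch that fills an n by n confusion matrix from hypothetical labels:

```python
from collections import Counter

labels = ["cat", "dog", "bird"]                        # hypothetical classes
y_true = ["cat", "dog", "dog", "bird", "cat", "bird"]
y_pred = ["cat", "dog", "bird", "bird", "dog", "bird"]

# counts[(actual, predicted)] fills one cell of the n-by-n matrix
counts = Counter(zip(y_true, y_pred))
for actual in labels:
    print(actual, [counts[(actual, predicted)] for predicted in labels])
```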
What is macro-averaging?
The sum of the per-class precision (recall) values divided by the number of classes, i.e. the unweighted mean across classes
What is micro-averaging?
The sum of TP across classes divided by the sum of TP and FP across classes gives micro-precision. Analogous for recall, with FN in place of FP
What is weighted averaging?
The average of the per-class precision (recall) values, each weighted by the proportion of its class in the test data
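All three averages computed side by side on hypothetical per-class counts (n is the class's share of the test data):

```python
# Hypothetical per-class counts: TP, FP, and class size n in the test data
classes = {"A": {"TP": 50, "FP": 10, "n": 60},
           "B": {"TP": 20, "FP": 20, "n": 30},
           "C": {"TP": 5,  "FP": 5,  "n": 10}}
total = sum(c["n"] for c in classes.values())

per_class_p = {k: c["TP"] / (c["TP"] + c["FP"]) for k, c in classes.items()}

macro = sum(per_class_p.values()) / len(classes)              # unweighted mean
micro = (sum(c["TP"] for c in classes.values())
         / sum(c["TP"] + c["FP"] for c in classes.values()))  # pool counts first
weighted = sum(per_class_p[k] * classes[k]["n"] / total for k in classes)
print(macro, micro, weighted)
```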
What is a baseline?
The baseline model is a naive method or model
What is a benchmark?
A benchmark is an established rival technique that our model is pitched against
What are types of baselines?
Random baseline: Assign a random class
Weighted random: Assign a random class based on proportions of classes in training data set
Zero-R: Assign the most likely class in the training dataset
One-R: Select one attribute and use it to predict an instance’s class. Test each attribute and select the one with the lowest error rate on the training data set
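Minimal sketches of the Zero-R and One-R rules; the data structures and helper names are illustrative, and One-R proper repeats the rule-building step for every attribute, keeping the rule with the lowest error rate on the training data:

```python
from collections import Counter

# Zero-R: always predict the most frequent class in the training data
def zero_r(train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority

# One-R rule for a single attribute: predict the majority class of each
# attribute value; One-R builds this for each attribute and keeps the best
def one_r_rule(attribute_values, labels):
    table = {}
    for value, label in zip(attribute_values, labels):
        table.setdefault(value, Counter())[label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in table.items()}

print(one_r_rule(["sunny", "rain", "sunny", "rain"], ["yes", "no", "yes", "yes"]))
```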
What are the advantages of One-R?
Simple to understand
Simple to comprehend results
What are the disadvantages of One-R?
Unable to capture attribute interactions
Biased towards high-arity attributes