Evaluation Flashcards
Training dataset
= attributes + labels
Input
set of annotated training instances
Output
an accurate estimate of the target function underlying the training instances, applicable to unseen/test instances
Inductive Learning Hypothesis
any hypothesis found to approximate/generalize the target function well over a sufficiently large training data set will also approximate the target function well over held-out/unseen test examples
Three points of interest when evaluating a classifier
- Overfitting
- Consistency
- Generalization
Overfitting
the model fits the training data set too well and does a poor job of generalizing the concept
Consistency
how well the model/classifier performs on the training data; does it predict all the training labels correctly?
Generalization
the opposite of overfitting; how well the classifier generalizes from the training instances to predict the target function
Classification Evaluation aims to?
find evidence of consistency and non-overfitting, and evidence that supports the inductive learning hypothesis (generalization)
Generalization
the proportion of test instances for which the class label is predicted correctly
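A minimal sketch of this measure in plain Python (the label lists below are made-up illustrations, not from any real dataset):

    # Accuracy = correctly labelled test instances / all test instances.
    # The label lists are made-up illustrations.
    true_labels      = ["spam", "ham", "spam", "ham", "spam"]
    predicted_labels = ["spam", "ham", "ham",  "ham", "spam"]

    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    accuracy = correct / len(true_labels)
    print(accuracy)  # 0.8 -> 4 of the 5 test instances are labelled correctly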
A good model should?
fit the training data well and generalise well to unseen data
An overfitted model can have poorer generalisation than a model with higher training error
Learning Curves
% of training data versus test accuracy
Represents the performance of a fixed learning strategy over different sizes of training data, for a fixed evaluation metric; it can also show how much data needs to be used in order to achieve a certain degree of accuracy
Allows us to visualise the trade-off between the amount of training data and performance
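One way such a curve could be produced is sketched below, assuming scikit-learn's learning_curve helper; the dataset (load_digits) and classifier (DecisionTreeClassifier) are illustrative choices, not prescribed by these notes:

    # Sketch: accuracy of one fixed learning strategy over growing amounts of training data.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import learning_curve

    X, y = load_digits(return_X_y=True)
    sizes, train_scores, test_scores = learning_curve(
        DecisionTreeClassifier(random_state=0), X, y,
        train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

    plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
    plt.plot(sizes, test_scores.mean(axis=1), label="test (cross-validated) accuracy")
    plt.xlabel("number of training instances")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()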
Learning Curves Trend Summary
- Too few training instances = poor performance on both the training and test data
- The peak of training accuracy = overfitting, because the model fits the training data too closely but performs poorly on the test data
- As more training data is added, training accuracy starts to drop while the model generalizes the concept better, so test accuracy starts to rise (the better the model generalizes, the narrower the gap between training and test accuracy)
Relationship between the size of training data and accuracy
Generally, the more training instances are used, the better the accuracy on the test data set, because there are more examples from which the learner can generalize the concept and thus predict better
Apparent error rate
the error rate obtained by evaluating the model on the training data set
True error rate
the error rate the model makes on real unseen instances; in practice it is estimated by evaluating on the test data set
With unlimited samples used as training instances, the apparent error rate eventually converges to the true error rate
The true error rate is almost always higher than the apparent (training) error rate because of overfitting
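A small sketch of this gap, assuming scikit-learn; the dataset and the unpruned decision tree are illustrative choices:

    # Apparent (training) error vs an estimate of the true error on held-out data.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    apparent_error = 1 - model.score(X_train, y_train)  # ~0 for an unpruned tree
    test_error = 1 - model.score(X_test, y_test)        # estimate of the true error rate
    print(apparent_error, test_error)  # expect test_error > apparent_error (overfitting)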
difference between the true error rate & the error rate on the test dataset?
The test-set error rate is only an estimate of the true error rate, and that estimate itself carries a risk of overfitting:
• the model may be fitted too closely to the development data, or
• it may achieve good accuracy only on that one test dataset and not on others
Why do we want to know the true error?
It tells us how well the model does at generalization
Possible evidence of overfitting?
• large gap between training and test accuracy in the learning curve
• Complex decision boundary (which has been distorted by noisy data)
• Lack of coverage of the population in the sample data, due to either
o small number of samples or
o non-randomness in the sample dataset (sampling bias)
Bias and Variance
model bias
evaluation bias
sampling bias
model variance
evaluation variance
It’s rather hard to tell evaluation and model bias & variance apart
These are informal definitions; they cannot be measured quantitatively.
A biased classifier is guaranteed to be making errors; an unbiased classifier might/might not be making errors
Although high bias and high variance are often bad, that does not mean low bias and low variance are automatically good; they are simply generally desirable, all else being equal.
Bias is generally binary (black-or-white) whilst variance is generally relative (to other classifiers)
model bias
wrong predictions due to the propensity of the classifier; relates to accuracy
o In terms of regression problems:
bias = the average of the (signed) errors
a model is biased if the predictions are systematically higher or lower than the true values
a model is not biased if the predictions are correct, OR if some are higher and some are lower than the true values (so the signed errors cancel out)
o In terms of classification problems:
a model is biased if the class distribution of the predictions is not the same as that of the test dataset
a model is not biased if the class distribution of the predictions is the same as that of the test dataset
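A rough sketch of both informal checks in plain Python (all numbers and labels below are made-up illustrations):

    from collections import Counter

    # Regression: bias as the average of the signed errors.
    y_true = [3.0, 5.0, 2.0, 4.0]
    y_pred = [3.5, 5.5, 2.5, 4.5]  # systematically too high
    bias = sum(p - t for p, t in zip(y_pred, y_true)) / len(y_true)
    print(bias)  # 0.5 -> positive average error suggests a biased model

    # Classification: compare the predicted class distribution with the test set's.
    test_labels = ["yes", "yes", "no", "no", "no"]
    predicted_labels = ["yes", "no", "no", "no", "no"]
    print(Counter(test_labels), Counter(predicted_labels))
    # distributions differ (2 vs 1 "yes"), which this informal definition calls biased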
evaluation bias
over- or under-estimates the effectiveness of the classifier due to the propensity of the evaluation strategy
o the estimate of effectiveness of a model is systematically too low or too high
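A hedged sketch of one obviously biased evaluation strategy: scoring the model on its own training data, which systematically over-estimates effectiveness compared with a held-out estimate (the dataset and classifier are illustrative assumptions):

    # A biased evaluation strategy (score on the training data) vs a held-out estimate.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    model = DecisionTreeClassifier(random_state=0)

    optimistic = model.fit(X, y).score(X, y)                  # evaluate on the training data
    cv_estimate = cross_val_score(model, X, y, cv=10).mean()  # cross-validated estimate
    print(optimistic, cv_estimate)  # the first is systematically too high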