Classification - Part 2 Flashcards
What aspects are important for model evaluation?
- Central question is: How good is a model at classifying unseen records?
- There are Metrics for Model Evaluation (How to measure the performance of a model?)
- There are Methods for Model Evaluation (How to obtain reliable estimates?)
What is the focus of metrics for model evaluation?
- They focus on the predictive capability of a model (rather than on how fast it classifies records)
What is the confusion matrix?
- It counts the correct and false classifications
- Counts are the basis for calculating different performance metrics

                Predicted Y    Predicted N
Actual Y        TP             FN
Actual N        FP             TN
In the case of credit card fraud, both FN (missed fraud cases) and FP (false alarms) would be unsatisfactory.
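A minimal Python sketch of how the four counts can be computed, assuming binary labels encoded as 1 = positive and 0 = negative (the example values are hypothetical):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# Hypothetical toy predictions
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```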
What is the formula for accuracy?
(TP + TN) / (TP + TN + FP + FN)
correct predictions / all predictions
What is the formula for error rate?
1 - accuracy
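Both formulas as a small Python sketch, reusing the hypothetical confusion-matrix counts from above:

```python
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions

def error_rate(tp, fn, fp, tn):
    return 1 - accuracy(tp, fn, fp, tn)

print(accuracy(2, 1, 1, 2))    # 4 of 6 predictions correct -> 0.666...
print(error_rate(2, 1, 1, 2))  # 0.333...
```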
Describe the class imbalance problem
- Sometimes, classes have very unequal frequency (fraud detection: 98% of transactions OK, 2% fraud)
- The class of interest is commonly called the positive class, the remaining ones the negative class(es)
- Example: 9,990 negative examples and 10 positive examples
-> If the model predicts every record as negative, its accuracy is 99.9%
--> Accuracy is misleading here because the model does not detect a single positive example
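The 9,990/10 example from above, checked in a few lines of Python:

```python
# Hypothetical imbalanced data set: 9,990 negatives, 10 positives
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * 10000                  # a "model" that predicts everything as negative

acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(acc)  # 0.999 -> looks excellent, yet not a single positive example is detected
```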
How can you mitigate the class imbalance problem?
- Use performance metrics that are biased towards the positive class by ignoring TN
- Precision
- Recall
What is the precision performance metric?
- Number of correctly classified positive examples divided by number of predicted positive examples
p = TP / (TP + FP)
Question: How many examples that are classified positive are actually positive?
-> Related to the false alarm rate: the lower the precision, the more false alarms
What is the recall performance metric?
- Number of correctly classified positive examples divided by the actual positive examples
r = TP / (TP + FN)
Question: What fraction of all positive examples is classified correctly?
-> Detection rate
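Both metrics as a Python sketch; the counts used below are taken from the problematic example on the next card (TP = 1, FN = 99, FP = 0):

```python
def precision(tp, fp):
    return tp / (tp + fp)   # how many predicted positives are actually positive

def recall(tp, fn):
    return tp / (tp + fn)   # which fraction of all positives is found

print(precision(tp=1, fp=0))   # 1.0  -> 100%
print(recall(tp=1, fn=99))     # 0.01 -> 1%
```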
In which cases are precision and recall problematic?
- Cases where the count of FP or FN is 0
-> p = 100%, r = 1% for the following confusion matrix:

                Predicted Y    Predicted N
Actual Y        1              99
Actual N        0              1000

-> No negative example is classified incorrectly, but only one positive example is classified correctly
Consequence:
We need a measure that
1. combines precision and recall and
2. is large if both values are large
Explain the F1-Measure
- Combines precision and recall into one measure
- It is the harmonic mean of precision and recall
- Tends to be closer to the smaller of p and r
- Thus, p and r must be large for a large F1
Formula:
(2pr) / (p + r)
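A short Python sketch showing that F1 stays close to the smaller of the two values (input values hypothetical):

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(f1_score(1.0, 0.01))  # ~0.02 -> dominated by the tiny recall
print(f1_score(0.9, 0.8))   # ~0.85 -> large only if both p and r are large
```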
What does a low threshold mean in the F1-measure graph?
- Low precision, high recall
What does a restrictive (high) threshold mean in the F1-measure graph?
- High precision, low recall
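A sketch of this trade-off, assuming the classifier outputs confidence scores and predicts positive whenever the score reaches the threshold (scores and labels are hypothetical):

```python
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]   # model confidence
y_true = [1,    1,    0,    1,    1,    0,    0,    0]      # actual classes

for threshold in (0.25, 0.85):                              # low vs. restrictive
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    print(threshold, tp / (tp + fp), tp / (tp + fn))
# 0.25 -> precision 0.67, recall 1.0  (low threshold: low precision, high recall)
# 0.85 -> precision 1.0,  recall 0.5  (restrictive: high precision, low recall)
```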
What alternative performance metric can be used if you have domain knowledge?
- Cost-Sensitive Model Evaluation
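A minimal sketch of cost-sensitive evaluation with an assumed cost matrix (the cost values below are purely illustrative):

```python
COST_FN = 50   # assumption: a missed fraud case is 50x as expensive ...
COST_FP = 1    # ... as a false alarm

def total_cost(fn, fp):
    return fn * COST_FN + fp * COST_FP

# Two hypothetical models evaluated on the same test set
print(total_cost(fn=10, fp=100))  # model A: 600
print(total_cost(fn=2,  fp=300))  # model B: 400 -> cheaper despite more false alarms
```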
What is a ROC curve?
- A graphical approach that displays the trade-off between detection rate (true positive rate) and false alarm rate (false positive rate)
- ROC curves visualize the true positive rate and the false positive rate in relation to the algorithm's confidence scores
How is a ROC curve drawn?
- Sort classifications according to confidence scores
- Scan over all classifications:
- right prediction: draw one step up
- wrong prediction: draw one step to the right
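The drawing procedure above as a small Python sketch, assuming each classification is given as a (confidence, correct?) pair (values hypothetical):

```python
classifications = [(0.95, True), (0.90, True), (0.80, False),
                   (0.70, True), (0.60, False), (0.40, True)]

x, y = 0, 0
points = [(x, y)]
for confidence, correct in sorted(classifications, reverse=True):
    if correct:
        y += 1   # right prediction: one step up
    else:
        x += 1   # wrong prediction: one step to the right
    points.append((x, y))

print(points)  # the ROC "staircase"; axes can be rescaled to rates in [0, 1]
```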
How do you interpret a ROC curve?
- The steeper the better
- Random guessing results in the diagonal
- Decent classification model should result in a curve above the diagonal
What must be considered to obtain a reliable estimate of the generalization performance (methods for model evaluation)?
- Never test a model on data that was used for training
- That would not result in a reliable estimate of the performance on unseen data
- Keep training and test set strictly separate
- Which labeled records should be used for training and which for testing?
What data set splitting approaches do you know?
- Holdout Method
- Random Subsampling
- Cross Validation
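A minimal Python sketch of the holdout method and of partitioning data into folds for cross validation (standard library only, toy records are hypothetical):

```python
import random

def holdout_split(records, test_fraction=0.3, seed=42):
    """Holdout method: randomly reserve a fraction of the labeled records for testing."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]          # training set, test set

def cross_validation_folds(records, k=5, seed=42):
    """Cross validation: partition the records into k disjoint folds;
    each fold is used once as test set, the rest for training."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

data = list(range(20))                             # hypothetical labeled records
train, test = holdout_split(data)
print(len(train), len(test))                       # 14 6
print([len(f) for f in cross_validation_folds(data)])  # [4, 4, 4, 4, 4]
```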
What does the learning curve describe?
- How accuracy changes with growing training set size
-> If model performance is low, get more training data (rather use labeled data for training than for testing)
Problem: Labeling additional data is often expensive due to manual effort
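A sketch of computing a learning curve, assuming scikit-learn is available (the data set and the decision tree classifier are just placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5, scoring="accuracy")

for n, scores in zip(sizes, test_scores):
    print(n, scores.mean())   # cross-validated accuracy for each training set size
```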