Chapter 4 Tour of Model Evaluation Metrics Flashcards
Choosing model evaluation metrics is challenging. This challenge is made even more difficult when there is a skew in the class distribution. Why? P 54
The reason is that many of the standard metrics become unreliable or even misleading when classes are imbalanced, or severely imbalanced, such as a 1:100 or 1:1000 ratio between the minority and majority classes.
We can divide evaluation metrics into three useful groups. What are they? P 55
- Threshold Metrics (e.g., accuracy and F-measure)
- Ranking Metrics (e.g., receiver operating characteristics (ROC) analysis and AUC)
- Probability Metrics (e.g., mean-squared error)
When are threshold metrics used? P 55
Threshold metrics are those that quantify classification prediction errors. That is, they are designed to summarize the fraction, ratio, or rate at which a predicted class does not match the expected class in a holdout dataset. These measures are used when we want a model to minimize the number of errors.
Perhaps the most widely used threshold metric is …, and the complement of classification accuracy is called …. P 56
Classification accuracy, classification error
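As a minimal sketch (not from the book), classification accuracy and its complement can be computed with scikit-learn; the labels below are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score

# toy ground-truth and predicted labels (invented for this example)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
error = 1.0 - accuracy                     # classification error is the complement
print(accuracy, error)                     # 0.8 0.2
```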
There are two groups of metrics that may be useful for imbalanced classification because they focus on one class; they are … and … P 56
Sensitivity (TPR) and Specificity (TNR); Precision and Recall
Sensitivity and Specificity can be combined into a single score that balances both concerns, called the G-mean. How is G-mean calculated? P 57
G-mean = √(Sensitivity × Specificity)
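A minimal sketch of the G-mean calculation, assuming binary labels where 1 is the minority (positive) class; sensitivity and specificity are obtained by scoring recall on each class in turn, and the labels are invented for illustration.

```python
from math import sqrt
from sklearn.metrics import recall_score

# toy labels, made up for illustration (1 = minority/positive class)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # TPR: recall of the positive class
specificity = recall_score(y_true, y_pred, pos_label=0)  # TNR: recall of the negative class
g_mean = sqrt(sensitivity * specificity)
print(sensitivity, specificity, g_mean)
```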
Precision and recall can be combined into a single score that seeks to balance both concerns, called the F-score or the F-measure. How is it calculated? P 57
F-measure = (2 × Precision × Recall) / (Precision + Recall)
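A brief sketch, again with made-up labels, showing the F-measure computed both from the formula above and with scikit-learn's f1_score; the two values agree.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# toy labels, made up for illustration (1 = minority/positive class)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f_measure = 2 * precision * recall / (precision + recall)  # same result as f1_score
print(f_measure, f1_score(y_true, y_pred))
```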
The F-measure is a popular metric for imbalanced classification. True/False P 57
True
What is the Fbeta-measure? How is it calculated? P 57
The Fbeta-measure (or Fβ-measure) is an abstraction of the F-measure where the balance of precision and recall in the calculation of the harmonic mean is controlled by a coefficient called beta. Fbeta-measure = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)
The choice of the β parameter is reflected in the name of the Fbeta-measure. For example, a β value of 2 is referred to as the F2-measure or F2-score, and a β value of 1 is referred to as the F1-measure or F1-score. Three common values for the beta parameter, illustrated in the sketch below, are as follows:
- F0.5-measure (β = 0.5): More weight on precision, less weight on recall.
- F1-measure (β = 1): Balance the weight on precision and recall.
- F2-measure (β = 2): Less weight on precision, more weight on recall.
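A minimal sketch using scikit-learn's fbeta_score to show the effect of the three common beta values; the labels are invented for illustration.

```python
from sklearn.metrics import fbeta_score

# toy labels, made up for illustration (1 = minority/positive class)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# beta < 1 favors precision, beta > 1 favors recall, beta = 1 is the F1-measure
for beta in (0.5, 1.0, 2.0):
    score = fbeta_score(y_true, y_pred, beta=beta)
    print(f"F{beta}-measure: {score:.3f}")
```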
What is one limitation of the threshold metrics (besides not specifically focusing on the minority class)? P 57
One limitation of threshold metrics is that they assume the class distribution observed in the training dataset will match the distribution in the test set and in real data when the model is used to make predictions. This is often the case, but when it is not, performance can be quite misleading. In particular, they assume that the class imbalance present in the training set is the one that will be encountered throughout the operating life of the classifier.
Ranking metrics don’t make any assumptions about class distributions. True/False? P 57
True
Unlike threshold metrics, ranking metrics do not assume that the class distribution observed in the training dataset will match the distribution in the test set and in real data when the model is used to make predictions.
Rank metrics are more concerned with evaluating classifiers based on how effective they are at … P 58
Separating classes
How do rank metrics evaluate a model’s performance? P 58
These metrics require that a classifier predicts a score or a probability of class membership. From this score, different thresholds can be applied to test the effectiveness of classifiers. Those models that maintain a good score across a range of thresholds will have good class separation and will be ranked higher.
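As a rough sketch of the idea (not from the book), predicted probabilities can be cut at several thresholds and the resulting labels scored each time; the probabilities below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score

# invented ground truth and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.4, 0.35, 0.8, 0.7, 0.6, 0.9])

# sweep a few thresholds and score the resulting crisp predictions
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)       # apply the threshold to the scores
    tpr = recall_score(y_true, y_pred)                # true positive rate
    tnr = recall_score(y_true, y_pred, pos_label=0)   # true negative rate
    print(f"threshold={threshold}: TPR={tpr:.2f}, TNR={tnr:.2f}")
```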
What is the most commonly used ranking metric? P 58
The ROC Curve or ROC Analysis
How is a no-skill model represented on a ROC curve?
A classifier that has no skill (a no-skill model is one that can't discriminate between the classes and would predict a random class or a constant class in all cases) will be represented by a diagonal line from the bottom left to the top right.
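As a hedged illustration, a no-skill predictor that assigns the same probability to every example traces only the (0, 0) and (1, 1) corners of the ROC curve, i.e. the diagonal, and has an area under the curve of 0.5; the data below is invented, and the second set of scores happens to separate the classes perfectly in this toy example.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# invented ground truth and two sets of predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
no_skill_prob = np.full(len(y_true), 0.5)  # constant score: cannot separate the classes
skilled_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, _ = roc_curve(y_true, no_skill_prob)
print(list(zip(fpr, tpr)))                    # just the diagonal endpoints (0, 0) and (1, 1)
print(roc_auc_score(y_true, no_skill_prob))   # 0.5 for the no-skill model
print(roc_auc_score(y_true, skilled_prob))    # 1.0 for perfectly separating scores
```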