Chapter 4 Tour of Model Evaluation Metrics Flashcards

1
Q

Choosing model evaluation metrics is challenging. This challenge is made even more difficult when there is a skew in the class distribution. Why? P 54

A

The reason is that many of the standard metrics become unreliable or even misleading when classes are imbalanced, or severely imbalanced, such as a 1:100 or 1:1000 ratio between the minority and majority classes.

2
Q

We can divide evaluation metrics into three useful groups. What are they? P 55

A
  1. Threshold Metrics (e.g., accuracy and F-measure)
  2. Ranking Metrics (e.g., receiver operating characteristics (ROC) analysis and AUC)
  3. Probability Metrics (e.g., mean-squared error)
3
Q

When are threshold metrics used? P 55

A

Threshold metrics are those that quantify the classification prediction errors. That is, they are designed to summarize the fraction, ratio, or rate of when a predicted class does not match the expected class in a holdout dataset. These measures are used when we want a model to minimize the number of errors.

4
Q

Perhaps the most widely used threshold metric is …, and its complement is called …. P 56

A

Classification accuracy, classification error
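
A minimal sketch of both metrics, assuming scikit-learn and small hypothetical toy label arrays:

```python
from sklearn.metrics import accuracy_score

# hypothetical true and predicted labels for a binary problem
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
error = 1.0 - accuracy                     # classification error is the complement
print(accuracy, error)
```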

5
Q

There are two groups of metrics that may be useful for imbalanced classification because they focus on one class; they are … and … P 56

A

Sensitivity-specificity (TPR and TNR), and precision-recall
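
A minimal sketch of all four single-class metrics, assuming scikit-learn and hypothetical toy binary labels with class 1 as the positive/minority class:

```python
from sklearn.metrics import precision_score, recall_score

# hypothetical true and predicted labels (1 = positive/minority class)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 0]

sensitivity = recall_score(y_true, y_pred)               # TPR: recall of the positive class
specificity = recall_score(y_true, y_pred, pos_label=0)  # TNR: recall of the negative class
precision = precision_score(y_true, y_pred)              # fraction of positive predictions that are correct
recall = sensitivity                                     # recall is the same quantity as sensitivity
print(sensitivity, specificity, precision, recall)
```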

6
Q

Sensitivity and Specificity can be combined into a single score that balances both concerns, called the G-mean. How is G-mean calculated? P 57

A

G-mean = √(Sensitivity × Specificity)
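
A minimal sketch of the calculation, assuming scikit-learn and hypothetical toy binary labels (computing it explicitly from sensitivity and specificity keeps the definition visible):

```python
from math import sqrt
from sklearn.metrics import recall_score

# hypothetical labels; class 1 is the positive/minority class
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 0]

sensitivity = recall_score(y_true, y_pred)               # TPR
specificity = recall_score(y_true, y_pred, pos_label=0)  # TNR
g_mean = sqrt(sensitivity * specificity)                 # geometric mean of the two rates
print(g_mean)
```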

7
Q

Precision and recall can be combined into a single score that seeks to balance both concerns, called the F-score or the F-measure. How is it calculated? P 57

A

F-measure = (2 × Precision × Recall) / (Precision + Recall)
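
A minimal sketch, assuming scikit-learn and hypothetical toy labels, computing the F-measure both manually and via f1_score:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f_manual = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
f_sklearn = f1_score(y_true, y_pred)                      # same value via scikit-learn
print(f_manual, f_sklearn)
```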

8
Q

The F-measure is a popular metric for imbalanced classification. True/False P 57

A

True

9
Q

What is the Fbeta-measure? How is it calculated? P 57

A

The Fbeta-measure (or Fβ-measure) is an abstraction of the F-measure in which the balance of precision and recall in the harmonic mean is controlled by a coefficient called beta.

Fbeta-measure = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)

The choice of the β parameter is used in the name of the Fbeta-measure. For example, a β value of 2 is referred to as the F2-measure or F2-score, and a β value of 1 is referred to as the F1-measure or F1-score. Three common values for the beta parameter are as follows:
  1. F0.5-measure (β = 0.5): More weight on precision, less weight on recall.
  2. F1-measure (β = 1): Equal weight on precision and recall.
  3. F2-measure (β = 2): Less weight on precision, more weight on recall.
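
A minimal sketch of the three common variants, assuming scikit-learn's fbeta_score and hypothetical toy labels:

```python
from sklearn.metrics import fbeta_score

# hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 0]

f05 = fbeta_score(y_true, y_pred, beta=0.5)  # favours precision
f1 = fbeta_score(y_true, y_pred, beta=1.0)   # equivalent to the F1-measure
f2 = fbeta_score(y_true, y_pred, beta=2.0)   # favours recall
print(f05, f1, f2)
```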

10
Q

What is one limitation of the threshold metrics (besides not specifically focusing on the minority class)? P 57

A

One limitation of threshold metrics is that they assume the class distribution observed in the training dataset will match the distribution in the test set and in real data when the model is used to make predictions. This is often the case, but when it is not, performance can be quite misleading. In particular, they assume that the class imbalance present in the training set is the one that will be encountered throughout the operating life of the classifier.

11
Q

Ranking metrics don’t make any assumptions about class distributions. True/False? P 57

A

True

This is unlike threshold metrics, which assume that the class distribution observed in the training dataset will match the distribution in the test set and in real data when the model is used to make predictions.

12
Q

Rank metrics are more concerned with evaluating classifiers based on how effective they are at … P 58

A

Separating classes

13
Q

How do rank metrics evaluate a model’s performance? P 58

A

These metrics require that a classifier predicts a score or a probability of class membership. From this score, different thresholds can be applied to test the effectiveness of classifiers. Those models that maintain a good score across a range of thresholds will have good class separation and will be ranked higher.
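
A minimal sketch of the idea, assuming NumPy and hypothetical toy true labels and predicted probabilities: each threshold turns the scores into crisp labels, and the true-positive and false-positive rates are recomputed at each cut-off.

```python
import numpy as np

# hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.05])

for threshold in [0.2, 0.4, 0.6, 0.8]:
    y_pred = (y_prob >= threshold).astype(int)   # apply the threshold to the scores
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)               # true-positive rate at this threshold
    fpr = fp / np.sum(y_true == 0)               # false-positive rate at this threshold
    print(threshold, tpr, fpr)
```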

14
Q

What is the most commonly used ranking metric? P 58

A

The ROC Curve or ROC Analysis
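
A minimal sketch, assuming scikit-learn and hypothetical toy labels and probabilities, of computing the ROC curve points and the area under the curve (ROC AUC):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points on the ROC curve
auc_score = roc_auc_score(y_true, y_prob)         # area under the ROC curve
print(auc_score)
```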

15
Q

How is a no-skill model represented on a ROC curve?

A

A classifier that has no skill (one that cannot discriminate between the classes and would predict a random class or a constant class in all cases) is represented by a diagonal line from the bottom left to the top right of the plot.

16
Q

Although generally effective, the ROC Curve and ROC AUC can be optimistic under a severe class imbalance, especially when the number of examples in the minority class is small. True/False? P 59

A

True

17
Q

An alternative to the ROC Curve is the … curve, which can be used in a similar way, although it focuses on the performance of the classifier on the minority class. P 59

A

Precision-recall

Again, different thresholds are applied to a set of predictions made by the model, and in this case precision and recall are calculated at each threshold.

18
Q

How is a no-skill classifier represented on a precision-recall curve? P 59

A

A no-skill classifier is represented by a horizontal line on the plot, with a precision equal to the proportion of positive examples in the dataset. For a balanced dataset this is 0.5.
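
A minimal sketch, assuming scikit-learn and hypothetical toy labels and probabilities, of computing the precision-recall curve, its area, and the no-skill baseline (the proportion of positives):

```python
from sklearn.metrics import precision_recall_curve, auc

# hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.05]

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
pr_auc = auc(recall, precision)        # area under the precision-recall curve
no_skill = sum(y_true) / len(y_true)   # precision of a no-skill classifier (horizontal line)
print(pr_auc, no_skill)
```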

19
Q

What are probabilistic metrics designed for specifically? P 60

A

Probabilistic metrics are designed specifically to quantify the uncertainty in a classifier’s predictions.

20
Q

When are probabilistic metrics used? P 60

A

These are useful for problems where we are less interested in incorrect vs. correct class predictions and more interested in the uncertainty the model has in predictions and penalizing those predictions that are wrong but highly confident.

These measures are especially useful when we want an assessment of the reliability of the classifiers, not only measuring when they fail but whether they have selected the wrong class with a high or low probability.

21
Q

Perhaps the most common metric for evaluating predicted probabilities is … for binary classification (also called the …), known more generally as …. Another popular score for predicted probabilities is the …. P 61

A

log loss, negative log likelihood, cross-entropy, Brier score

22
Q

What’s the benefit of the Brier score compared to log loss when dealing with imbalanced datasets? P 61

A

The Brier score is focused on the positive class, which for imbalanced classification is the minority class. This makes it preferable to log loss, which is focused on the entire probability distribution.

23
Q

What are the log loss and Brier score formulas? P 61

A
LogLoss = −((1 − y) × log(1 − yhat) + y × log(yhat))
BrierScore = (1/N) × Σ(yhat_i − y_i)²
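
A minimal sketch of both scores, assuming scikit-learn and hypothetical toy labels and predicted probabilities:

```python
from sklearn.metrics import log_loss, brier_score_loss

# hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.05]

ll = log_loss(y_true, y_prob)          # cross-entropy between true labels and predicted probabilities
bs = brier_score_loss(y_true, y_prob)  # mean squared error of the predicted probabilities
print(ll, bs)
```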
24
Q

Can Brier score be used for multi-class problems? P 61

A

Although typically described in terms of binary classification tasks, the Brier score can also be calculated for multiclass classification problems.

25
Q

The differences in Brier score for different classifiers can be very small. How is this problem addressed? P 61

A

In order to address this problem, the score can be scaled against a reference score, such as the score from a no skill classifier (e.g. predicting the probability distribution of the positive class in the training dataset). Using the reference score, a Brier Skill Score, or BSS, can be calculated where 0.0 represents no skill, worse than no skill results are negative, and the perfect skill is represented by a value of 1.0.

26
Q

What’s BrierSkillScore formula? P 61

A

BrierSkillScore = 1 − (BrierScore / BrierScore_ref)
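
A minimal sketch, assuming scikit-learn and hypothetical toy data, where the reference is a no-skill classifier that predicts the positive-class proportion for every example:

```python
from sklearn.metrics import brier_score_loss

# hypothetical true labels and model probabilities
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.05]

# no-skill reference: predict the positive-class proportion for every example
pos_rate = sum(y_true) / len(y_true)
ref_prob = [pos_rate] * len(y_true)

bs_model = brier_score_loss(y_true, y_prob)
bs_ref = brier_score_loss(y_true, ref_prob)
bss = 1.0 - (bs_model / bs_ref)  # 0.0 = no skill, 1.0 = perfect, negative = worse than no skill
print(bss)
```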

27
Q

Although popular for balanced classification problems, probability scoring methods are less widely used for classification problems with a skewed class distribution. True/False? P 61

A

True

28
Q

How can we choose metrics for our problem? P 62

A
  1. Perhaps the best approach is to talk to project stakeholders and figure out what is important about a model or set of predictions. Then select a few metrics that seem to capture what is important and test them with different scenarios.
  2. Another approach might be to perform a literature review and discover what metrics are most commonly used by other practitioners or academics working on the same general type of problem.

The second approach can often be insightful, but be warned that some fields of study may fall into groupthink and adopt a metric that might be excellent for comparing large numbers of models at scale, but terrible for model selection in practice.

29
Q

In a multi-class classification setup, ____ (micro avg/macro avg) is preferable if you suspect there might be a class imbalance.

External

A

macro-average

If your goal is for your classifier simply to maximize its hits and minimize its misses, the micro-average would be the way to go.

However, if you value the minority class the most, you should switch to the macro-average. This metric is insensitive to class imbalance and treats all classes as equal.
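
A minimal sketch of the difference, assuming scikit-learn and hypothetical toy multiclass labels where class 2 is under-represented:

```python
from sklearn.metrics import f1_score

# hypothetical multiclass labels with an imbalanced class 2
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 2]

micro = f1_score(y_true, y_pred, average='micro')  # pools all predictions, dominated by the majority class
macro = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class scores, treats classes equally
print(micro, macro)
```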

30
Q

How are TP, TN, FP, and FN calculated for multiclass datasets?

External

A

In a 3×3 confusion matrix for multiclass data (rows are actual classes, columns are predicted classes, cells numbered 1-9 row by row, with setosa as the first class):

FN: The false-negative value for a class is the sum of the values in that class's row, excluding the TP value (cell 2 + cell 3 for setosa).

FP: The false-positive value for a class is the sum of the values in that class's column, excluding the TP value (cell 4 + cell 7 for setosa).

TN: The true-negative value for a class is the sum of all values outside that class's row and column (cell 5 + cell 6 + cell 8 + cell 9 for setosa).

TP: The true-positive value is the cell where the actual and predicted values are the same (cell 1 for setosa).
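
A minimal sketch of the same per-class counts, assuming scikit-learn, NumPy, and hypothetical toy 3-class labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical 3-class labels (0 = setosa, 1 = versicolor, 2 = virginica)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual classes, columns = predicted classes

TP = np.diag(cm)                # diagonal cells: actual == predicted
FN = cm.sum(axis=1) - TP        # rest of each class's row
FP = cm.sum(axis=0) - TP        # rest of each class's column
TN = cm.sum() - (TP + FP + FN)  # everything outside the class's row and column
print(TP, FN, FP, TN)
```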