Chapter 27 When to Use ROC Curves and Precision-Recall Curves Flashcards

1
Q

What’s the reason for predicting probability in a classification problem?

P 275

A

Predicting a probability rather than a crisp class label provides the flexibility to choose or calibrate the threshold used to interpret the predicted probabilities. Moving this threshold lets you trade one type of error against the other (false positives vs. false negatives).
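A minimal sketch of this idea, using made-up probabilities and labels: counting false positives and false negatives at different thresholds shows how moving the threshold trades one error type for the other.

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class, with true labels
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.4, 0.6, 0.45, 0.7, 0.9])

def errors_at(threshold):
    """Count false positives and false negatives at a given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return fp, fn

# A lower threshold yields more false positives but fewer false negatives
print(errors_at(0.3))  # -> (2, 0)
print(errors_at(0.5))  # -> (1, 1)
print(errors_at(0.8))  # -> (0, 2)
```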

2
Q

What’s another name for Precision? Why?

P 279

A

Precision describes how good a model is at predicting the positive class, so it is also referred to as the positive predictive value.

3
Q

How is a no-skill model represented on a Precision-Recall plot?

P 279

A

It is a horizontal line at the value of the ratio of positive cases in the dataset.

A no-skill classifier is one that cannot discriminate between the classes and would predict a random class or a constant class in all cases.
The no-skill line changes based on the distribution of the positive to negative classes. For a balanced dataset, this is 0.5.
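A small sketch of this baseline, using hypothetical labels: the no-skill precision line sits at the proportion of positive cases.

```python
import numpy as np

# Hypothetical imbalanced labels: 10 positives out of 100 examples
y_true = np.array([1] * 10 + [0] * 90)

# The no-skill precision line is the ratio of positive cases in the dataset
no_skill = y_true.sum() / len(y_true)
print(no_skill)  # -> 0.1; for a balanced dataset this would be 0.5
```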

4
Q

There are also composite scores that attempt to summarize the precision and recall; give 2 examples.

P 279

A
  • F-Measure or F1 score: the harmonic mean of precision and recall, F1 = 2PR/(P + R) (harmonic, because precision and recall are ratios).
  • Area Under Curve of Precision-Recall (PR AUC): like the ROC AUC, summarizes the integral or an approximation of the area under the precision-recall curve.

The F1 score can be calculated by calling the f1_score() function.

The area under the precision-recall curve can be approximated by calling the auc() function
and passing it the recall (x) and precision (y) values calculated for each threshold (i.e. the outputs of the precision_recall_curve() function).
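A sketch of both calculations with scikit-learn, using the illustrative labels and probabilities from the precision_recall_curve() documentation example:

```python
import numpy as np
from sklearn.metrics import f1_score, auc, precision_recall_curve

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# F1 summarizes precision and recall at a fixed 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)
f1 = f1_score(y_true, y_pred)

# PR AUC summarizes skill across thresholds: pass recall (x) and
# precision (y) from precision_recall_curve() to auc()
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
pr_auc = auc(recall, precision)

print(f1, pr_auc)
```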

5
Q

In terms of model selection, F1 summarizes model skill for a specific probability threshold (0.5), whereas the Precision-Recall area under curve summarizes the skill of a model across thresholds, like ROC AUC. True/False

P 280

A

True

6
Q

Precision and recall can be calculated in scikit-learn via the ____ and ____ functions. The precision and recall can be calculated for thresholds using the ____ function that takes the true output values and the probabilities for the positive class as input and returns the precision, recall and threshold values.

P 280

A

precision_score(), recall_score(), precision_recall_curve()
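A minimal sketch of all three functions, with made-up labels and probabilities; note that precision_score() and recall_score() take crisp labels, while precision_recall_curve() takes probabilities.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.3, 0.7, 0.55, 0.6, 0.9])

# precision_score() and recall_score() need crisp class labels
y_pred = (y_prob >= 0.5).astype(int)
p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)

# precision_recall_curve() takes the true labels and the probabilities
# and returns precision, recall and threshold values
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
print(p, r)  # -> 0.75 1.0
```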

7
Q

  • ROC curves should be used when there are roughly equal numbers of observations for each class.
  • Precision-Recall curves should be used when there is a moderate to large class imbalance.
What’s the reason for this recommendation?

P 282

A

The reason for this recommendation is that ROC curves present an optimistic picture of the model on datasets with a class imbalance. The main reason for this optimistic picture is because of the use of true negatives in the False Positive Rate in the ROC Curve and the careful avoidance of this rate in the Precision-Recall curve.

If the proportion of positive to negative instances changes in a test set, the ROC curves will not change.

(Me: To calculate TPR and FPR we only use one class each: positives for TPR and negatives for FPR (refer to the confusion matrix). So the class distribution is not considered when calculating TPR and FPR, whereas metrics such as precision use FP and TP, which come from both classes; therefore the relation/ratio between the classes, i.e. how the data is distributed, matters.)

Metrics such as accuracy, precision, lift and F scores use values from both columns of the confusion matrix. As a class distribution changes these measures will change as well, even if the fundamental classifier performance does not. ROC graphs are based upon TP rate and FP rate, in which each dimension is a strict columnar ratio, so do not depend on class distributions.
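A sketch of the optimism on a hypothetical imbalanced dataset (2 positives, 8 negatives, made-up scores), comparing ROC AUC with average precision, a standard summary of the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical imbalanced dataset: 2 positives among 8 negatives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.65, 0.9])

# ROC AUC uses strict columnar ratios (TPR, FPR) and looks optimistic;
# average precision uses both columns and is pulled down by false positives
roc = roc_auc_score(y_true, y_prob)           # -> 0.875
ap = average_precision_score(y_true, y_prob)  # -> 0.75
print(roc, ap)
```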
