Classification Flashcards
Explain the Precision metric to a five year old.
Confusion matrix (rows = ACTUAL class, columns = PREDICTED class):

             predicted T         predicted F
actual T     TP                  FN (type II error)
actual F     FP (type I error)   TN
Precision definition:
Proportion of every observation PREDICTED to be POSITIVE that truly is positive (the left column of the matrix above), i.e. how likely are we to be correct when we predict 1?
precision = TP / (TP + FP)
= TP / (yhat=1)
Notice that precision refers to the first column of the CM
Models with high precision are pessimistic in that they only predict an observation to be in the positive class if they are very certain about it.
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X, y, scoring="precision")
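For a single train/test split, precision can also be computed directly from hard predictions; a minimal sketch, assuming a fitted clf plus Xtest and ytest from an earlier split:
from sklearn.metrics import precision_score
yhat = clf.predict(Xtest)
# precision = TP / (TP + FP): of everything we predicted to be 1, how much really is 1?
precision_score(ytest, yhat)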
Explain the Recall metric to a five year old.
Confusion matrix (rows = ACTUAL class, columns = PREDICTED class):

             predicted T         predicted F
actual T     TP                  FN (type II error)   <- RECALL uses this row
actual F     FP (type I error)   TN
Recall definition:
Recall is the proportion of all truly positive observations that the model correctly predicts as positive, i.e. recall measures the model’s ability to find the positive class.
Recall = TP / (TP + FN)
= TP / (y=1)
a.k.a. True Positive Rate
Models with high recall are optimistic in that they have a low bar for predicting that an observation is in the positive class (notice that low FN in denominator will increase recall).
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X, y, scoring="recall")
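As with precision, recall can be computed on a single split; a minimal sketch, again assuming a fitted clf, Xtest, and ytest:
from sklearn.metrics import recall_score
yhat = clf.predict(Xtest)
# recall = TP / (TP + FN): of all true 1s, how many did we catch?
recall_score(ytest, yhat)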
Explain the Specificity metric to a five year old.
Confusion matrix (rows = ACTUAL class, columns = PREDICTED class):

             predicted T         predicted F
actual T     TP                  FN (type II error)
actual F     FP (type I error)   TN                   <- SPECIFICITY uses this row
Specificity definition:
Specificity is the proportion of all truly negative observations that the model correctly predicts as negative, i.e. specificity measures the model’s ability to identify the negative class.
Specificity = TN / (FP + TN)
= TN / (y=0)
a.k.a. True Negative Rate (note: the False Positive Rate is 1 − specificity, not specificity itself)
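sklearn has no dedicated specificity scorer, but recall computed on the negative class gives the same number; a sketch, assuming binary 0/1 labels and a fitted clf:
from sklearn.metrics import recall_score
yhat = clf.predict(Xtest)
# recall of the 0 class = TN / (TN + FP) = specificity
recall_score(ytest, yhat, pos_label=0)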
Explain why accuracy is often a poor metric in classification.
Rare class problem
In many cases, there is an imbalance in the classes to be predicted, with one class MUCH more prevalent than the other–e.g., legitimate insurance claims vs. fraudulent (rare) ones, or browsers vs. purchasers (rare) at a website.
The rare class is usually the class of more interest, and is typically designated the positive class 1 in contrast to the more prevalent 0s.
In the typical scenario, the 1s are the more IMPORTANT case, in the sense that misclassifying them as 0s is costlier than misclassifying 0s as 1s.
e.g. correctly identifying a cancer case, class 1, may save a life. On the other hand, correctly identifying a non-cancer case (class 0) merely saves the cost of a more careful follow-up review, which is what you would do if the screen were classified “cancer”.
In such cases, unless the classes are easily separable, the most accurate classification model may be one that simply classifies every obs as a 0. e.g. if only 0.1% of patient screens are positive for cancer, a model that simply predicts the negative class 0 for every obs will be 99.9% accurate! However, this model is useless.
Instead we would save more lives with a model that is less accurate overall, but is good at picking out the cancer cases, even if it misclassifies some non-cancer cases along the way.
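A small sketch of the trap, using sklearn's DummyClassifier on fabricated data with 0.1% positives (the sizes and features here are made up purely for illustration):
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# fabricated labels: 100 positives out of 100,000 obs (0.1% positive)
y = np.zeros(100_000, dtype=int)
y[:100] = 1
X = np.random.rand(len(y), 3)  # features are irrelevant to the point

clf_dummy = DummyClassifier(strategy="most_frequent").fit(X, y)  # always predicts the majority class 0
yhat = clf_dummy.predict(X)
print(accuracy_score(y, yhat))  # ~0.999 -- looks impressive
print(recall_score(y, yhat))    # 0.0   -- misses every cancer case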
Explain what an ROC Curve is to a five year old
We can see that there is a tradeoff between recall (TPR) and specificity (TNR): capturing more 1s generally means misclassifying more 0s as 1s. The ideal classifier would do an excellent job of classifying the 1s without misclassifying many 0s as 1s.
The curve that captures the recall vs. specificity tradeoff is the “Receiver Operating Characteristic” (ROC) curve. The ROC curve plots recall (sensitivity, TPR) on the y-axis against the false positive rate (FPR = 1 − specificity) on the x-axis.
The ROC curve SHOWS THE TRADEOFF between recall and specificity as we change a classifier’s DECISION THRESHOLD.
The diagonal line, from the origin to the top right corner corresponds to a clf no better than random chance.
An extremely effective clf will have an ROC curve that hugs the upper left corner: it identifies lots of 1s without also misclassifying many 0s as 1s.
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# get predicted probabilities for the positive class from a clf with a .predict_proba method
target_probabilities = clf.predict_proba(Xtest)[:, 1]

# compute FPR, TPR pairs at every probability threshold
fpr, tpr, threshold = roc_curve(ytest, target_probabilities)

# plot ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], ls="--")  # diagonal line (random classifier)
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
Explain AUC to a five year old.
The ROC curve is a valuable graphical tool but, by itself, doesn’t constitute a single measure for the performance of a clf.
The ROC curve can be used, however, to produce a single summary number: the area UNDERNEATH the ROC curve, the AUC.
AUC is simply the total area under the ROC curve (computed by integration). The LARGER the value of the AUC, the more EFFECTIVE the classifier.
The ROC curve is a method which evaluates the quality of a binary clf by comparing the presence of true positives (TP) and false positives (FP) at EVERY PROBABILITY DECISION THRESHOLD.
An AUC of 1 indicates a perfect clf: it gets all the 1s correctly classified and also doesn’t misclassify any 0s as 1s.
A clf that predicts at random will be on the diagonal line where AUC = 0.5.
By default, sklearn applies a 0.5 decision threshold to the class probabilities a clf generates, which are returned by .predict_proba(). e.g. the predicted probabilities for one obs may be array([0.86, 0.14]) for classes [0, 1], indicating the clf assigns an 86% probability to the negative class (label 0).
We may not want to use a 50% proba decision threshold. Instead of a middle ground, we may want to bias our model to use a different threshold for substantive reasons. e.g. if a false positive is very costly to the company, we might prefer a model with a higher proba decision threshold. The tradeoff is we might FAIL to predict some positives, but when an obs is predicted to be positive, we can be very CONFIDENT that the prediction is CORRECT. This trade-off in TPR vs. FPR is exactly what is reflected in the ROC curve.
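A minimal sketch of moving the threshold by hand, assuming a fitted clf with a .predict_proba method and a hypothetical stricter cutoff of 0.7:
probas = clf.predict_proba(Xtest)[:, 1]      # P(class = 1) for each obs
yhat_default = (probas >= 0.5).astype(int)   # roughly what .predict() does at the default threshold
yhat_strict = (probas >= 0.7).astype(int)    # fewer positives: higher precision, lower recall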
The true positive rate (TPR) is the number of positive obs correctly predicted divided by all actual positive obs:
TPR = TP /(TP+FN)
The false positive rate is the number of INCORRECTLY predicted positives divided by all actual negative obs:
FPR = FP /(FP + TN)
The ROC curve represents respective TPR vs. FPR for every PROBABILITY THRESHOLD.
Since larger AUC is always better, we can use AUC as a way to SELECT the best model among a handful of trained candidate classifiers.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# get predicted probabilities for the positive class
target_probabilities = logit.predict_proba(Xtest)[:, 1]

# create false and true positive rates at every probability threshold
false_positive_rate, true_positive_rate, threshold = roc_curve(ytest, target_probabilities)

# plot ROC curve
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")  # diagonal line (random classifier)
plt.xlabel("FPR")
plt.ylabel("TPR")

# compute AUC
roc_auc_score(ytest, target_probabilities)
Explain F1 score metric to a five year old
Precision, TP / (TP+FP) = TP / (yhat=1) = TP among all positive predictions
and Recall, TP / (TP+FN) = TP / (y=1) = TP among all true positive obs,
…are less intuitive than accuracy.
Almost always we want some balance between precision and recall, and this role is filled by the F1 score.
The F1 score is the harmonic mean (a kind of average suited to rates and ratios) of precision and recall, where F1 reaches its best value at 1 (perfect precision and recall) and worst at 0:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1 score summarizes performance on the positive class in a single number: it is high only when both precision (how many predicted positives are correct) and recall (how many true positives are found) are high.
from sklearn.model_selection import cross_val_score
# cross-validate the model using the F1 score
cross_val_score(clf, X, y, scoring="f1")
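A sketch relating f1_score to precision and recall on a single split, assuming a fitted clf, Xtest, and ytest:
from sklearn.metrics import precision_score, recall_score, f1_score

yhat = clf.predict(Xtest)
p = precision_score(ytest, yhat)
r = recall_score(ytest, yhat)
f1_score(ytest, yhat)  # equals 2 * p * r / (p + r)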
How do you compute a confusion matrix for a classifier?
A confusion matrix compares predicted classes vs. true classes and is an effective, interpretable visualization of clf performance. Each column of the matrix (often visualized as a heatmap) represents predicted classes, while each row shows true classes. The end result is that every cell is one possible combination of predicted and true classes.
step 1: train a clf on Xtrain and make predictions with Xtest
clf.fit(Xtrain,ytrain)
tgt_pred = clf.predict(Xtest)
step 2: compute and plot the confusion matrix
from sklearn.metrics import confusion_matrix
import pandas as pd
import seaborn as sns

matrix = confusion_matrix(ytest, tgt_pred)  # rows = true classes, columns = predicted classes
confusion_m = pd.DataFrame(matrix, index=clf.classes_, columns=clf.classes_)
sns.heatmap(confusion_m, annot=True)
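Newer sklearn versions (1.0+) also ship a display helper that does the predict-and-plot steps in one call; a sketch:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf, Xtest, ytest)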
Explain undersampling and oversampling to a five year old
If we have enough data, one solution to the imbalanced data problem is to UNDERSAMPLE (downsample) the prevalent class, so that the data is more balanced between 0s and 1s.
The basic idea with undersampling is that the data for the dominant class has many REDUNDANT records. Dealing with a smaller, more balanced data set yields benefits in model performance and makes it easier to prepare the data.
How much data is enough? In general, having tens of thousands of obs for the less dominant class is enough. The more easily distinguishable the 1s are from the 0s, the less data is needed.
One criticism of undersampling is that it throws away data and is not using all the available info. This is especially true if you have a relatively small data set, and the rarer class contains a few hundred or a few thousand obs; then undersampling the dominant class has the risk of throwing out useful info.
In the case of a small dataset, instead of downsampling the dominant class, you should OVERSAMPLE (upsample) the rarer class by drawing additional rows with replacement by BOOTSTRAPPING.
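A sketch of both strategies using sklearn.utils.resample, assuming the data lives in a hypothetical DataFrame df with a 0/1 'label' column:
import pandas as pd
from sklearn.utils import resample

majority = df[df["label"] == 0]  # df and "label" are placeholders for your own data
minority = df[df["label"] == 1]

# undersample: shrink the dominant class down to the size of the rare class
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

# oversample: bootstrap the rare class up to the size of the dominant class
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])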
Explain how to report a summarization of classification metrics
We can use sklearn classification_report as a quick solution to view common clf evaluation metrics precision, recall, F1 score, and support (number of obs for each class).
from sklearn.metrics import classification_report
model = clf.fit(Xtrain, ytrain)
yhat_test = model.predict(Xtest)
class_names = [str(c) for c in model.classes_]  # readable label names
# create classification report
print(classification_report(ytest, yhat_test, target_names=class_names))