CH3 Classification Flashcards

1
Q

What is a binary classifier?

A

Let’s simplify the problem for now and only try to identify one digit, for example the number 5. This “5-detector” is an example of a binary classifier: it is capable of distinguishing between just two classes, 5 and not-5.

2
Q

What is the advantage of the SGD classifier?

A

The Stochastic Gradient Descent (SGD) classifier, available through Scikit-Learn’s SGDClassifier class, has the advantage of being capable of handling very large datasets efficiently. This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning).
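A rough sketch of how such a classifier can be trained, assuming the chapter’s MNIST setup (the variable names below are illustrative; later sketches in these cards reuse sgd_clf, X_train, y_train, and y_train_5 from this snippet):

    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import SGDClassifier

    # Load MNIST and build the binary "5 vs not-5" target (as in the chapter)
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X, y = mnist["data"], mnist["target"]
    X_train, y_train = X[:60000], y[:60000]
    y_train_5 = (y_train == '5')          # True for images of 5s, False otherwise

    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train_5)
    print(sgd_clf.predict([X_train[0]]))  # is this particular image a 5?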

3
Q

What performance measures are available for classifiers?

A
  1. measuring accuracy with cross-validation (see the sketch below)
  2. confusion matrix
  3. precision
  4. recall
  5. ROC curve
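A sketch of the first measure, cross-validated accuracy, reusing sgd_clf, X_train, and y_train_5 from the earlier SGDClassifier sketch; the remaining measures are covered in the following cards:

    from sklearn.model_selection import cross_val_score

    # 3-fold cross-validation, reporting accuracy on each fold
    print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))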
4
Q

Why is accuracy generally not the preferred performance measure?

A

Accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others). For example, a classifier that always predicts not-5 is right over 90% of the time on MNIST, simply because only about 10% of the images are 5s.

5
Q

What is the general idea of the confusion matrix?

A

The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the fifth row and third column of the confusion matrix.

6
Q

How to compute the confusion matrix?

A

To compute the confusion matrix, you first need to have a set of predictions so they can be compared to the actual targets.
Just like the cross_val_score() function, cross_val_predict() performs K-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means that you get a clean prediction for each instance in the training set (“clean” meaning that the prediction is made by a model that never saw the data during training).
Now you are ready to get the confusion matrix using the confusion_matrix() function.
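A sketch of these two steps, reusing sgd_clf, X_train, and y_train_5 from the earlier sketch:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    # Out-of-fold ("clean") predictions for every training instance
    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_train_5, y_train_pred))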

7
Q

What does the confusion matrix tell?

A

Each row in a confusion matrix represents an actual class, while each column represents a predicted class.
A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right).

8
Q

How is precision calculated?

A

precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives.

9
Q

What is the formula for recall / sensitivity / true positive rate (TPR)?

A

recall = TP / (TP + FN), where FN is the number of false negatives.
Recall is the ratio of positive instances that are correctly detected by the classifier.
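Scikit-Learn provides precision_score() and recall_score() for both formulas; a sketch using y_train_5 and the y_train_pred from the confusion-matrix sketch:

    from sklearn.metrics import precision_score, recall_score

    print(precision_score(y_train_5, y_train_pred))  # TP / (TP + FP)
    print(recall_score(y_train_5, y_train_pred))     # TP / (TP + FN)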

10
Q

What is the F1-score?

A

It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.

11
Q

What is the formula of F1-score?

A

F1 = TP / (TP + (FN + FP)/2), which is equivalent to 2 × precision × recall / (precision + recall).
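Scikit-Learn’s f1_score() computes this directly; a sketch with the same y_train_5 and y_train_pred as above:

    from sklearn.metrics import f1_score

    print(f1_score(y_train_5, y_train_pred))  # harmonic mean of precision and recall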

12
Q

What is the precision/recall tradeoff?

A

Increasing precision reduces recall, and vice versa.

13
Q

What are the decision function and decision threshold?

A

For each instance, the classifier computes a score based on a decision function. If that score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class.

14
Q

How to set different thresholds to compute the precision and recall?

A

Instead of calling the classifier’s predict() method, you can call its decision_function() method, which returns a score for each instance, and then make predictions based on those scores using any threshold you want.
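A sketch of thresholding the raw scores yourself (sgd_clf and X_train as before; the threshold values are illustrative):

    # Decision score for a single instance (here, the first training image)
    some_digit_score = sgd_clf.decision_function([X_train[0]])

    threshold = 0                 # the default threshold used by predict()
    print(some_digit_score > threshold)

    threshold = 8000              # raising the threshold trades recall for precision
    print(some_digit_score > threshold)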

15
Q

How do you decide which threshold to use?

A

For this you will first need to get the scores of all instances in the training set using the cross_val_predict() function again, but this time specifying that you want it to return decision scores instead of predictions.

Now with these scores you can compute precision and recall for all possible thresholds using the precision_recall_curve() function.

Finally, you can plot precision and recall as functions of the threshold value using Matplotlib.
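A sketch of all three steps (same sgd_clf, X_train, and y_train_5 as before):

    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # 1. Decision scores (not predictions) for every training instance
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")

    # 2. Precision and recall for all possible thresholds
    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    # 3. Plot both as functions of the threshold
    plt.plot(thresholds, precisions[:-1], label="Precision")
    plt.plot(thresholds, recalls[:-1], label="Recall")
    plt.xlabel("Threshold")
    plt.legend()
    plt.show()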

16
Q

Why is the precision curve bumpier than the recall curve?

A

The reason is that precision may sometimes go down when you raise the threshold (although in general it will go up). To understand why, look back at Figure 3-3 and notice what happens when you start from the central threshold and move it just one digit to the right: precision goes from 4/5 (80%) down to 3/4 (75%). On the other hand, recall can only go down when the threshold is increased, which explains why its curve looks smooth.

17
Q

What is another way to select a good precision/ recall tradeoff?

A

Another way to select a good precision/recall tradeoff is to plot precision directly against recall

You can see that precision really starts to fall sharply around 80% recall. You will probably want to select a precision/recall tradeoff just before that drop—for example,
at around 60% recall. But of course the choice depends on your project.

18
Q

If someone says “let’s reach 99% precision,” you should ask,

A

At what recall?

19
Q

What is the ROC?

A

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate.

20
Q

What is FPR?

A

The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity.

21
Q

What does the ROC curve plot?

A

Hence the ROC curve plots sensitivity (recall) versus 1 – specificity.

22
Q

What is the tradeoff in the ROC curve?

A

The higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

23
Q

What is a way to compare classifiers?

A

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.
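A sketch of plotting the ROC curve and computing its AUC, reusing the cross-validated y_scores from the precision/recall sketch:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

    plt.plot(fpr, tpr, label="SGD")
    plt.plot([0, 1], [0, 1], 'k--')   # a purely random classifier
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
    plt.legend()
    plt.show()

    print(roc_auc_score(y_train_5, y_scores))  # 1.0 is perfect, 0.5 is random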

24
Q

How to decide between the ROC curve and the precision/recall curve?

A

Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. For example, looking at the previous ROC curve (and the ROC AUC score), you may think that the classifier is really good. But this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement (the curve could be closer to the top-right corner).

25
Q

What do you need to plot a ROC curve if a classifier only provides class probabilities instead of decision scores?

A

To plot a ROC curve, you need scores, not probabilities. A simple solution is to use the positive class’s probability as the score.
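A sketch with a classifier that only exposes predict_proba(), such as a Random Forest (reusing X_train and y_train_5):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import roc_auc_score

    forest_clf = RandomForestClassifier(random_state=42)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                        method="predict_proba")

    # Use the probability of the positive class as the score
    y_scores_forest = y_probas_forest[:, 1]
    print(roc_auc_score(y_train_5, y_scores_forest))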

26
Q

How do Random Forest and naive Bayes classifiers differ from SVM and Linear classifiers in handling multiple classes?

A

Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly. Others (such as Support Vector Machine
classifiers or Linear classifiers) are strictly binary classifiers.

27
Q

What are the two strategies for performing multiclass classification with binary classifiers?

A
  1. OvA (one-versus-all, also called one-versus-the-rest)
  2. OvO (one-versus-one)
28
Q

What is OvA?

A

One way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then, when you want to classify an image, you get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.

This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest).

29
Q

What is OvO?

A

Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N – 1) / 2 classifiers.
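A sketch of forcing the OvO strategy with Scikit-Learn’s OneVsOneClassifier wrapper (X_train and the full multiclass y_train as in the first sketch):

    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.linear_model import SGDClassifier

    # Wrap a binary classifier to apply the one-versus-one strategy
    ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
    ovo_clf.fit(X_train, y_train)
    print(len(ovo_clf.estimators_))   # 10 × 9 / 2 = 45 underlying binary classifiers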

30
Q

What is the main advantage of OvO?

A

The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

31
Q

When is OvO preferred and when OvA?

A

Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size of the training set, so for these algorithms OvO is preferred since it is faster to train many classifiers on small training sets than to train a few classifiers on large training sets. For most binary classification algorithms, however, OvA is preferred.

32
Q

What does analyzing the confusion matrix help do?

A

Analyzing the confusion matrix can often give you insights on ways to improve your classifier.

33
Q

Why would you analyze individual errors?

A

Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing, but it is more difficult and time-consuming.

34
Q

What is a multilabel classification system?

A

A classification system that outputs multiple binary tags for each instance is called a multilabel classification system.

35
Q

What are ways to evaluate a multilabel classifier?

A

There are many ways to evaluate a multilabel classifier, and selecting the right metric really depends on your project. For example, one approach is to measure the F1 score for each individual label (or any other binary classifier metric discussed earlier), then simply compute the average score.

This assumes that all labels are equally important, which may not be the case. One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label).
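A sketch of a small multilabel setup and its F1-based evaluation (two illustrative binary labels built from the multiclass y_train of the first sketch):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import f1_score

    # Two binary labels per digit: "is it large (>= 7)?" and "is it odd?"
    y_train_int = y_train.astype(np.uint8)
    y_multilabel = np.c_[y_train_int >= 7, y_train_int % 2 == 1]

    knn_clf = KNeighborsClassifier()       # KNN supports multilabel targets
    knn_clf.fit(X_train, y_multilabel)

    y_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
    print(f1_score(y_multilabel, y_pred, average="macro"))     # unweighted average over labels
    print(f1_score(y_multilabel, y_pred, average="weighted"))  # each label weighted by its support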

36
Q

What is multioutput multiclass classification?

A

Multioutput-multiclass classification (or simply multioutput classification) is a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).
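A sketch of a multioutput task: removing noise from digit images, where each output is one pixel whose label can take many intensity values (the noise range is illustrative; reuses X_train from the first sketch):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Input: noisy images; target: the original clean images (784 labels per instance)
    noise = np.random.randint(0, 100, (len(X_train), 784))
    X_train_mod = X_train + noise
    y_train_mod = X_train

    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_train_mod, y_train_mod)
    clean_digit = knn_clf.predict([X_train_mod[0]])  # a denoised version of one image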

37
Q

How is the line between classification and regression sometimes blurry?

A

The line between classification and regression is sometimes blurry, such as in the noise-removal example: arguably, predicting pixel intensity is more akin to regression than to classification. Moreover, multioutput systems are not limited to classification tasks; you could even have a system that outputs multiple labels per instance, including both class labels and value labels.
