Chapter 3: Classification Flashcards
When scoring a classification model, what does accuracy refer to?
The ratio of correct predictions to the total number of predictions made.
Why is accuracy not always the best way of quantifying a classification model's performance?
When a dataset has a class imbalance (some classes are much more frequent than others), accuracy can be misleadingly high and won't show how well the model actually discriminates between classes.
For example, if a model trained to classify images of cats guesses "not cat" every time and the dataset contains 95% images of dogs, the model's accuracy will be 95%, yet it never produces a useful prediction.
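A minimal sketch of this effect, assuming scikit-learn and NumPy are available; the 95/5 "not cat"/"cat" split and the classifier that always answers "not cat" are made-up stand-ins for the example above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: ~95% "not cat", ~5% "cat".
rng = np.random.default_rng(42)
y_true = rng.choice(["cat", "not cat"], size=1000, p=[0.05, 0.95])

# A "classifier" that ignores the input and always predicts "not cat".
y_pred = np.array(["not cat"] * 1000)

print(accuracy_score(y_true, y_pred))  # ~0.95, yet the model never finds a cat
```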
What is preferred over accuracy?
The confusion matrix.
What is a confusion matrix?
A confusion matrix counts the number of times instances of each class have been classed as a given class. The rows of the matrix are the true labels and the columns of the matrix are the predicted labels.
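A small illustration of this layout, assuming scikit-learn; the labels below are invented purely to show the row/column convention.

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "cat", "cat", "not cat", "not cat", "not cat"]
y_pred = ["cat", "not cat", "cat", "not cat", "not cat", "cat"]

# labels fixes the row/column order: rows are true labels, columns are predictions
print(confusion_matrix(y_true, y_pred, labels=["not cat", "cat"]))
# [[2 1]   true "not cat": 2 correctly classified, 1 wrongly called "cat"
#  [1 2]]  true "cat":     1 wrongly called "not cat", 2 correctly classified
```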
What is a Type I error?
Type I error refers to false positive classifications. False positives are instances of the negative class that are classified as positive.
What is a Type II error?
Type II error refers to false negative classifications. False negatives are instances of the positive class that are classified as negative.
What would the confusion matrix of a perfect classifier look like?
A perfect classifier would only have true positives and true negatives, therefore the confusion matrix would have non-zero values only on its main diagonal (top left to bottom right), with zeros everywhere else.
What is a harmonic mean?
A type of average that is calculated by dividing the number of values in a data series by the sum of the reciprocals (1/x_i) of each value in the data series.
A harmonic mean gives much more weight to low values.
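A quick sketch in plain Python contrasting the arithmetic and harmonic means of the same pair of values, to show how a single low value drags the harmonic mean down; the numbers are arbitrary.

```python
values = [0.9, 0.1]

arithmetic_mean = sum(values) / len(values)               # 0.5
harmonic_mean = len(values) / sum(1 / v for v in values)  # ~0.18

print(arithmetic_mean, harmonic_mean)
```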
What is the F1 score?
The F1 score of a classifier is the harmonic mean of the precision and recall of that classifier. As it is a harmonic mean, it is only possible to get a high F1 score if both precision and recall are high.
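A short sketch of the F1 formula, F1 = 2 * precision * recall / (precision + recall); the precision and recall values are arbitrary, and in practice sklearn.metrics.f1_score would normally be computed directly from labels and predictions.

```python
precision, recall = 0.9, 0.1  # hypothetical values: high precision, low recall

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.18 -- dragged down by the low recall despite the high precision
```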
Is it always preferable to have similar values for precision and recall?
No, in some instances you may favour a higher value of precision or recall and not be too concerned about the other.
What is an example where a high precision and low recall would be acceptable?
Instances where a false positive is far more costly than a false negative.
Consider a content classifier that decides whether videos are suitable for children. High precision means that nearly every video it accepts is actually safe, while low recall means that many safe videos are rejected.
You would rather have a classifier that rejects many safe videos (low recall) but keeps only safe ones (high precision) than one that accepts more safe videos (higher recall) but lets more harmful content through (lower precision).
What is an example where a low precision and high recall would be acceptable?
Instances where a false negative is far more costly than a false positive.
Consider a classifier that detects whether someone is carrying a weapon by scanning an X-ray image. Low precision may produce false positives that need to be checked manually, but high recall would ensure that nearly all weapon-carrying individuals are stopped.
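One common way to push a classifier towards either regime is to move its decision threshold. The sketch below is a rough illustration with synthetic data, a LogisticRegression stand-in model, and arbitrary thresholds, not anything prescribed by the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic, imbalanced binary dataset (placeholder for a real problem)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]

for threshold in (0.2, 0.5, 0.8):
    y_pred = (scores >= threshold).astype(int)
    print(threshold, precision_score(y, y_pred), recall_score(y, y_pred))
# Raising the threshold generally trades recall away for precision, and vice versa.
```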
What is the receiver operating characteristic (ROC) curve?
A plot of true positive rate (recall) vs false positive rate (fall-out).
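A minimal sketch of computing the points on an ROC curve with scikit-learn; the labels and scores below are invented for illustration.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]             # hypothetical true labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted scores/probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)                         # the (fall-out, recall) points of the curve
print(roc_auc_score(y_true, y_scores))  # area under the ROC curve (0.75 here)
```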
What does specificity refer to?
Specificity is the true negative rate: the proportion of negative instances that are correctly classified as negative.
What does sensitivity refer to?
Sensitivity is the true positive rate: the proportion of positive instances that are correctly classified as positive. Recall is another term for sensitivity.
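A small sketch computing both quantities from confusion-matrix counts, assuming scikit-learn; the labels are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # 3 positives, 5 negatives (hypothetical)
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # true positive rate (recall): 2/3
specificity = tn / (tn + fp)  # true negative rate: 4/5
print(sensitivity, specificity)
```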