Chapter 5-11 Flashcards
Sometimes we want excellent predictions of the positive class: both high precision and high recall. Why can this be challenging?
P 80
Because increases in recall often come at the expense of decreases in precision. In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, are often conflicting, since in order to increase the TP for the minority class, the number of FP is also often increased, resulting in reduced precision.
Precision: Appropriate when minimizing ____ is the focus.
Recall: Appropriate when minimizing ____ is the focus.
P 80
false positives
false negatives
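A minimal sketch of the two metrics with scikit-learn (the toy labels below are made up for illustration):

# Precision penalizes false positives; recall penalizes false negatives.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]   # 2 false positives, 1 false negative

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 5 = 0.60
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75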
What is the definition of “sampling methods” for imbalanced datasets?
P 122
The most popular solution to an imbalanced classification problem is to change the composition of the training dataset. Techniques designed to change the class distribution in the training dataset are generally referred to as sampling methods as we are sampling an existing data sample.
Sampling is only performed on the training dataset, the dataset used by an algorithm to learn a model. It is not performed on the holdout test or validation dataset. True/False, why?
P 123
True. The reason is that the intent is not to remove the class bias from the model fit, but to continue to evaluate the resulting model on data that is both real and representative of the target problem domain.
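As an illustration, a sketch assuming scikit-learn and the imbalanced-learn library: the sampler is applied only to the training split, and the test split keeps the original class distribution.

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# synthetic imbalanced problem: roughly 99% majority, 1% minority
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1)

# resample the training data only; the test data stays untouched and representative
X_train_res, y_train_res = RandomOverSampler(random_state=1).fit_resample(X_train, y_train)
print('train after sampling:', Counter(y_train_res))
print('test (unchanged):    ', Counter(y_test))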
What is oversampling and what are some popular methods of doing it?
P 124
Oversampling methods duplicate examples in the minority class or synthesize new examples from the examples in the minority class. Some of the more widely used and implemented oversampling methods include:
Random Oversampling
Synthetic Minority Oversampling Technique (SMOTE)
Borderline-SMOTE
Borderline Oversampling with SVM
Adaptive Synthetic Sampling (ADASYN)
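For reference, each of the oversampling methods listed above has an implementation in the imbalanced-learn library (class names are that library's, not the book's):

from imblearn.over_sampling import (
    RandomOverSampler,   # Random Oversampling
    SMOTE,               # Synthetic Minority Oversampling Technique
    BorderlineSMOTE,     # Borderline-SMOTE
    SVMSMOTE,            # Borderline Oversampling with SVM
    ADASYN,              # Adaptive Synthetic Sampling
)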
What is the simplest oversampling method called, and how does it work?
P 125
Random Oversampling. It involves randomly duplicating examples from the minority class in the training dataset.
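A minimal sketch using imbalanced-learn's RandomOverSampler on a synthetic dataset (parameter values are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# duplicate randomly chosen minority examples until the two classes are balanced
oversample = RandomOverSampler(sampling_strategy='minority', random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))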
What is the most popular and perhaps most successful oversampling method? How does it work?
P 125
SMOTE, an acronym for Synthetic Minority Oversampling Technique. SMOTE works by selecting minority-class examples that are close in the feature space, drawing a line between them in the feature space, and generating a new synthetic example at a point along that line.
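A sketch of the idea with imbalanced-learn's SMOTE class (k_neighbors controls how many nearby minority examples are candidates for the interpolation; the value shown is the library default):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# each synthetic point is interpolated between a minority example and one of
# its k nearest minority-class neighbours in the feature space
oversample = SMOTE(k_neighbors=5, random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))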
How does Borderline-SMOTE work?
P 125
Borderline-SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model or SVM, and only generating synthetic samples that are difficult to classify.
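Both variants are available in imbalanced-learn; a usage sketch (BorderlineSMOTE for the k-nearest-neighbour version, SVMSMOTE for the SVM-based one):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# synthesize new points only around minority examples near the class boundary
oversample = BorderlineSMOTE(kind='borderline-1', random_state=1)
# alternative: let an SVM locate the borderline region
# oversample = SVMSMOTE(random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))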
How does ADASYN work?
P 125
Adaptive Synthetic Sampling (ADASYN) is another extension to SMOTE that generates synthetic samples inversely proportional to the density of the examples in the minority class. It is designed to create synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.
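A usage sketch with imbalanced-learn's ADASYN class (n_neighbors, shown at its default, controls the neighbourhood used to estimate local minority density):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# more synthetic points are generated for minority examples in sparse,
# hard-to-learn regions; fewer (or none) where minority density is high
oversample = ADASYN(n_neighbors=5, random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))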
What does undersampling do? What are some methods of undersampling?
P 126
Undersampling methods delete or select a subset of examples from the majority class. Some of the more widely used and implemented undersampling methods include:
Random Undersampling
Condensed Nearest Neighbor Rule (CNN)
Near Miss Undersampling
Tomek Links Undersampling
Edited Nearest Neighbors Rule (ENN)
One-Sided Selection (OSS)
Neighborhood Cleaning Rule (NCR)
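As with oversampling, each of the undersampling methods listed above has an implementation in imbalanced-learn (class names are that library's):

from imblearn.under_sampling import (
    RandomUnderSampler,          # Random Undersampling
    CondensedNearestNeighbour,   # Condensed Nearest Neighbor Rule (CNN)
    NearMiss,                    # Near Miss Undersampling
    TomekLinks,                  # Tomek Links Undersampling
    EditedNearestNeighbours,     # Edited Nearest Neighbors Rule (ENN)
    OneSidedSelection,           # One-Sided Selection (OSS)
    NeighbourhoodCleaningRule,   # Neighborhood Cleaning Rule (NCR)
)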
Although an oversampling or undersampling method can be effective when used alone on a training dataset, experiments have shown that applying both types of techniques together can often result in better overall performance of a model fit on the resulting transformed dataset. Some of the more widely used and implemented combinations of data sampling methods include: ____ (name 3)
P 126
SMOTE and Random Undersampling
SMOTE and Tomek Links
SMOTE and Edited Nearest Neighbors Rule
It is common to pair SMOTE with an undersampling method that selects examples from the dataset to delete; the undersampling procedure is applied after SMOTE, allowing the editing step to be applied to both the minority and majority classes.
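A sketch of the three combinations with imbalanced-learn: SMOTE plus random undersampling can be chained in an imblearn Pipeline, while SMOTE+Tomek Links and SMOTE+ENN are provided as ready-made combined samplers (the sampling_strategy values are illustrative):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek, SMOTEENN

# 1) SMOTE followed by random undersampling of the majority class
combo = Pipeline(steps=[
    ('over', SMOTE(sampling_strategy=0.1, random_state=1)),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=1)),
])

# 2) SMOTE followed by Tomek Links editing
smote_tomek = SMOTETomek(random_state=1)

# 3) SMOTE followed by Edited Nearest Neighbours editing
smote_enn = SMOTEENN(random_state=1)

# the combined samplers expose fit_resample(X, y); the pipeline is typically
# extended with a classifier as the final step and evaluated with cross-validation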
Random oversampling involves randomly selecting examples from the minority class, without replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset. True/False
P 130
False
Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.
These methods are referred to as naive sampling methods because they assume nothing about the data and no heuristics are used.
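A bare-bones NumPy sketch of the two naive methods in terms of index selection (array names are made up); note replace=True for oversampling, which is what produces duplicated minority examples:

import numpy as np

rng = np.random.default_rng(1)
y = np.array([0] * 95 + [1] * 5)      # 0 = majority class, 1 = minority class
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# random oversampling: draw minority indices WITH replacement and append the copies
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
oversampled_idx = np.concatenate([np.arange(len(y)), extra])

# random undersampling: keep a without-replacement subset of the majority class
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
undersampled_idx = np.concatenate([kept_majority, minority_idx])

print(np.bincount(y[oversampled_idx]))    # [95 95]
print(np.bincount(y[undersampled_idx]))   # [5 5]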
Random over/under sampling is simple to implement and fast to execute, which is desirable for ____ datasets.
P 130
very large and complex
Random oversampling may increase the likelihood of overfitting. Why?
P 131
Because it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.
How can we know if random oversampling has caused overfitting?
P 131
To gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.
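One way to do that in practice, sketched with scikit-learn and imbalanced-learn (the model and metric are illustrative choices, not the book's prescription):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)

def fit_and_report(X_train, y_train, label):
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(label,
          'train F1 =', round(f1_score(y_train, model.predict(X_train)), 3),
          'test F1 =', round(f1_score(y_te, model.predict(X_te)), 3))

fit_and_report(X_tr, y_tr, 'original:   ')   # baseline on the untouched training set

# oversampled training set; a large train/test gap here points to overfitting
X_os, y_os = RandomOverSampler(random_state=1).fit_resample(X_tr, y_tr)
fit_and_report(X_os, y_os, 'oversampled:')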