Chapter 5-11 Flashcards

1
Q

Sometimes, we want excellent predictions of the positive class. We want high precision and high recall. Why can this be challenging?

P 80

A

Because increases in recall often come at the expense of decreases in precision. In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, are often conflicting, since in order to increase the TP for the minority class, the number of FP is also often increased, resulting in reduced precision.

2
Q

- Precision: Appropriate when minimizing ____ is the focus.
- Recall: Appropriate when minimizing ____ is the focus.

P 80

A

false positives
false negatives
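
A minimal sketch of how the two metrics penalize different errors, using scikit-learn's metric functions on a made-up set of labels and predictions (the numbers are illustrative only, not from the book):

```python
# Precision is hurt by false positives; recall is hurt by false negatives.
# The labels/predictions below are made up purely for illustration.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 actual positives, 6 actual negatives
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # 3 TP, 1 FN, 2 FP

# precision = TP / (TP + FP) = 3 / 5 = 0.60  -> drops as false positives grow
# recall    = TP / (TP + FN) = 3 / 4 = 0.75  -> drops as false negatives grow
print(precision_score(y_true, y_pred))  # 0.6
print(recall_score(y_true, y_pred))     # 0.75
```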

3
Q

What is the definition of “sampling methods” for imbalanced datasets?

P 122

A

The most popular solution to an imbalanced classification problem is to change the composition of the training dataset. Techniques designed to change the class distribution in the training dataset are generally referred to as sampling methods as we are sampling an existing data sample.
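
As a rough sketch of the "composition" being changed, the snippet below builds an assumed synthetic 1:99 dataset with scikit-learn and prints its class distribution; the sampling methods in the following cards all work by altering these counts:

```python
# The skewed class composition that sampling methods are designed to change.
# The synthetic 1:99 dataset is an assumption used only for illustration.
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
print(Counter(y))  # roughly Counter({0: 9900, 1: 100})
```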

4
Q

Sampling is only performed on the training dataset, the dataset used by an algorithm to learn a model. It is not performed on the holdout test or validation dataset. True/False, why?

P 123

A

True, the reason is that the intent is not to remove the class bias from the model fit but to continue to evaluate the resulting model on data that is both real and representative of the target problem domain.
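
A minimal sketch of that workflow, assuming the imbalanced-learn library and a synthetic dataset: the resampling (here SMOTE) touches only the training split, while the test split keeps its original, representative distribution:

```python
# Resample only the training split; evaluate on the untouched test split.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

X_train_res, y_train_res = SMOTE(random_state=1).fit_resample(X_train, y_train)
print(Counter(y_train_res))  # training data is now balanced
print(Counter(y_test))       # test data keeps the real-world imbalance
```

When using cross-validation, imbalanced-learn's Pipeline gives the same guarantee, since samplers placed in the pipeline are applied only when fitting on the training folds.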

5
Q

What is oversampling and what are some popular methods of doing it?

P 124

A

Oversampling methods duplicate examples in the minority class or synthesize new examples from the examples in the minority class. Some of the more widely used and implemented oversampling methods include:
- Random Oversampling
- Synthetic Minority Oversampling Technique (SMOTE)
- Borderline-SMOTE
- Borderline Oversampling with SVM
- Adaptive Synthetic Sampling (ADASYN)
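
All of these have implementations in the imbalanced-learn library; the sketch below (with an assumed synthetic dataset) simply runs each one and prints the resulting class counts:

```python
# One pass over the listed oversamplers, as implemented in imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import (RandomOverSampler, SMOTE, BorderlineSMOTE,
                                    SVMSMOTE, ADASYN)  # SVMSMOTE = borderline oversampling with SVM

X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0, random_state=1)

for sampler in (RandomOverSampler(), SMOTE(), BorderlineSMOTE(), SVMSMOTE(), ADASYN()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```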

6
Q

What is the simplest oversampling method called, and how does it work?

P 125

A

Random Oversampling. It involves randomly duplicating examples from the minority class in the training dataset.
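
A minimal sketch with imbalanced-learn's RandomOverSampler and an assumed synthetic dataset; sampling_strategy controls how far the minority class is inflated:

```python
# Random oversampling: duplicate randomly chosen minority examples.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
print('before:', Counter(y))

# sampling_strategy=1.0 duplicates minority examples until the classes are equal;
# e.g. 0.5 would stop at a 1:2 minority-to-majority ratio instead.
X_res, y_res = RandomOverSampler(sampling_strategy=1.0, random_state=1).fit_resample(X, y)
print('after: ', Counter(y_res))
```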

7
Q

What’s the most popular and perhaps most successful oversampling method? How does it work?

P 125

A

SMOTE, which is an acronym for Synthetic Minority Oversampling Technique. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space, and drawing a new sample at a point along that line.
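
A sketch of SMOTE via its imbalanced-learn implementation (the synthetic dataset and parameter values are assumptions); k_neighbors sets how many nearby minority examples a line can be drawn to:

```python
# SMOTE: interpolate new minority points between a minority example and
# one of its k nearest minority-class neighbours.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)

X_res, y_res = SMOTE(k_neighbors=5, random_state=1).fit_resample(X, y)
print(Counter(y_res))  # balanced, and the new minority rows are synthetic, not copies
```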

8
Q

How does Borderline-SMOTE work?

P 125

A

Borderline-SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model or SVM, and only generating synthetic samples that are difficult to classify.
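
A sketch using imbalanced-learn (synthetic data assumed): BorderlineSMOTE uses a k-nearest-neighbor test to find borderline minority examples, while SVMSMOTE is the SVM-based variant:

```python
# Borderline-SMOTE: only synthesize around minority examples near the decision border.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

X_res, y_res = BorderlineSMOTE(random_state=1).fit_resample(X, y)   # k-NN based
print(Counter(y_res))
X_res, y_res = SVMSMOTE(random_state=1).fit_resample(X, y)          # SVM based
print(Counter(y_res))
```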

9
Q

How does ADASYN work?

P 125

A

Adaptive Synthetic Sampling (ADASYN) is another extension to SMOTE that generates synthetic samples inversely proportional to the density of the examples in the minority class. It is designed to create synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.
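
A sketch via imbalanced-learn's ADASYN (synthetic data assumed); note the resampled counts are only approximately balanced because generation is driven by local density:

```python
# ADASYN: more synthetic points where minority examples are surrounded by the majority class.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

X_res, y_res = ADASYN(n_neighbors=5, random_state=1).fit_resample(X, y)
print(Counter(y_res))  # roughly balanced; exact counts depend on the density estimate
```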

10
Q

What does undersampling do? What are some methods of undersampling?

P 126

A

Undersampling methods delete or select a subset of examples from the majority class. Some of the more widely used and implemented undersampling methods include:
- Random Undersampling
- Condensed Nearest Neighbor Rule (CNN)
- Near Miss Undersampling
- Tomek Links Undersampling
- Edited Nearest Neighbors Rule (ENN)
- One-Sided Selection (OSS)
- Neighborhood Cleaning Rule (NCR)
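
Most of these also have imbalanced-learn implementations; a sketch that runs several of them on an assumed synthetic dataset (CondensedNearestNeighbour is available too but omitted here because it is slow):

```python
# A pass over several of the listed undersamplers, as implemented in imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import (RandomUnderSampler, NearMiss, TomekLinks,
                                     EditedNearestNeighbours, OneSidedSelection,
                                     NeighbourhoodCleaningRule)

X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0, random_state=1)

for sampler in (RandomUnderSampler(), NearMiss(), TomekLinks(),
                EditedNearestNeighbours(), OneSidedSelection(),
                NeighbourhoodCleaningRule()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```

Note that the cleaning-style methods (Tomek Links, ENN, OSS, NCR) do not balance the classes; they only remove selected majority examples, which the printed counts make visible.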

11
Q

Although an oversampling or undersampling method when used alone on a training dataset can be effective, experiments have shown that applying both types of techniques together can often result in better overall performance of a model fit on the resulting transformed dataset. Some of the more widely used and implemented combinations of data sampling methods include: ____ (name 3)

P 126

A

- SMOTE and Random Undersampling
- SMOTE and Tomek Links
- SMOTE and Edited Nearest Neighbors Rule

It is common to pair SMOTE with an undersampling method that selects examples from the dataset to delete, and the procedure is applied to the dataset after SMOTE, allowing the editing step to be applied to both the minority and majority class.
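
A sketch of both styles with imbalanced-learn (the synthetic data and ratios are assumptions): ready-made combined classes, or SMOTE chained with random undersampling in a Pipeline:

```python
# Combining oversampling and undersampling.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# SMOTE followed by Tomek-link removal or ENN editing (applied to both classes).
print(Counter(SMOTETomek(random_state=1).fit_resample(X, y)[1]))
print(Counter(SMOTEENN(random_state=1).fit_resample(X, y)[1]))

# SMOTE up to a 1:10 ratio, then randomly undersample the majority to a 1:2 ratio.
steps = [('over', SMOTE(sampling_strategy=0.1, random_state=1)),
         ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=1))]
X_res, y_res = Pipeline(steps=steps).fit_resample(X, y)
print(Counter(y_res))
```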

12
Q

Random oversampling involves randomly selecting examples from the minority class, without replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset. True/False

P 130

A

False
Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.

These methods are referred to as naive sampling methods because they assume nothing about the data and no heuristics are used.

13
Q

Random over/under sampling is simple to implement and fast to execute, which is desirable for ____ datasets.

P 130

A

very large and complex

14
Q

Random oversampling may increase the likelihood of overfitting. Why?

P 131

A

Because it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.

15
Q

How can we know if random oversampling has caused overfitting?

P 131

A

To gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.
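
A sketch of that check, assuming imbalanced-learn, a decision tree, and a synthetic dataset: fit the same model on the original and on the oversampled training data, then compare train vs. test F1; a train score that jumps while the test score does not is the overfitting warning sign:

```python
# Compare train vs. test performance with and without random oversampling.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0.01, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

def fit_and_report(X_train, y_train, label):
    model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    print(label,
          'train F1:', round(f1_score(y_train, model.predict(X_train)), 3),
          'test F1:',  round(f1_score(y_te, model.predict(X_te)), 3))

fit_and_report(X_tr, y_tr, 'original   ')

X_os, y_os = RandomOverSampler(random_state=1).fit_resample(X_tr, y_tr)
fit_and_report(X_os, y_os, 'oversampled')
```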

16
Q

In random over-sampling, a random set of copies of minority class examples is added to the data. This may increase the likelihood of ____, especially for higher over-sampling rates. Moreover, it may ____ the classifier performance and ____ the computational effort.

P 131

A

Overfitting, decrease, increase

17
Q

What’s a limitation of random undersampling?

P 134

A

A limitation of undersampling is that examples from the majority class are deleted that may be useful, important, or perhaps critical to fitting a robust decision boundary. Given that examples are deleted randomly, there is no way to detect or preserve good or more information-rich examples from the majority class.