Chapter 13 Undersampling Methods Flashcards

1
Q

What’s the simplest undersampling technique?

P 157

A

The simplest undersampling technique involves randomly selecting examples from the majority class and deleting them from the training dataset.
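
As a minimal sketch (assuming the imbalanced-learn library and a synthetic dataset from scikit-learn; parameter values are illustrative only):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic, roughly 99:1 imbalanced dataset for illustration.
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Randomly delete majority-class examples until the classes are balanced.
undersample = RandomUnderSampler(sampling_strategy='majority', random_state=1)
X_res, y_res = undersample.fit_resample(X, y)

print('Before:', Counter(y), 'After:', Counter(y_res))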

2
Q

What’s the major drawback of random undersampling?

P 157

A

The major drawback of random undersampling is that this method can discard potentially useful data that could be important for the induction process.

3
Q

How can we be more discerning when undersampling data?

P 157

A

By using heuristics or learning models that attempt to identify redundant examples for deletion, or useful examples that should be kept.

4
Q

What are the 3 types of undersampling methods in Python?

P 157

A
  • methods that select which examples from the majority class to keep (the Near Miss family of methods, the Condensed Nearest Neighbor Rule),
  • methods that select which examples to delete (Tomek Links, the Edited Nearest Neighbors Rule),
  • combinations of both approaches (the Neighborhood Cleaning Rule (NCR), One-Sided Selection (OSS)); see the sketch below.
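
As a rough sketch of where these three families live in the imbalanced-learn API (assuming that library; every sampler shares the same fit_resample interface):

from imblearn.under_sampling import (
    # Select which majority-class examples to keep.
    NearMiss,
    CondensedNearestNeighbour,
    # Select which examples to delete.
    TomekLinks,
    EditedNearestNeighbours,
    # Combinations of both approaches.
    OneSidedSelection,
    NeighbourhoodCleaningRule,
)

# Every sampler is used the same way, e.g.:
# X_res, y_res = NearMiss().fit_resample(X, y)
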
5
Q

A criticism of the Condensed Nearest Neighbor Rule is that ____

P 167

A

examples are selected randomly, especially initially.

The condensed nearest-neighbor (CNN) method chooses samples randomly. This results in a) retention of unnecessary samples and b) occasional retention of internal rather than boundary samples.
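
Because the selection is random, the retained subset can change from run to run; a small sketch with imbalanced-learn's CondensedNearestNeighbour, where random_state controls that selection (dataset and parameters are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Small synthetic imbalanced dataset (CNN is slow on large datasets).
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0, random_state=1)

# Different seeds may retain different (and differently sized) majority subsets.
for seed in (1, 2):
    cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=seed)
    X_res, y_res = cnn.fit_resample(X, y)
    print(seed, Counter(y_res))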

6
Q

How does ENN (Edited Nearest Neighbor) work?

P 170

A

For each instance a in the dataset, its three nearest neighbors are computed. If a is a majority class instance and is misclassified by its three nearest neighbors, then a is removed from the dataset. Alternatively, if a is a minority class instance and is misclassified by its three nearest neighbors, then the majority class instances among a's neighbors are removed.
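
A minimal sketch of this procedure using imbalanced-learn's EditedNearestNeighbours, where n_neighbors=3 matches the three-nearest-neighbor rule described above (the dataset is synthetic and illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Remove majority-class examples that disagree with their 3 nearest neighbors.
enn = EditedNearestNeighbours(n_neighbors=3)
X_res, y_res = enn.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))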

7
Q

In practice, the Tomek Links and ENN procedures are often used on their own. True/False

P 169,171

A

False. In practice, the Tomek Links procedure is often combined with other methods, such as the Condensed Nearest Neighbor Rule.

Also, like Tomek Links, the Edited Nearest Neighbor Rule gives best results when combined with another undersampling method.

The choice to combine Tomek Links and CNN is natural, as Tomek Links can be said to remove borderline and noisy instances, while CNN removes redundant instances.
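
One way to chain the two steps by hand, sketched with imbalanced-learn (Tomek Links first, then the Condensed Nearest Neighbor Rule; dataset and parameters are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, CondensedNearestNeighbour

X, y = make_classification(n_samples=5000, weights=[0.98], flip_y=0, random_state=1)

# Step 1: remove borderline/noisy majority examples that form Tomek Links.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Step 2: remove redundant majority examples far from the boundary with CNN.
cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=1)
X_res, y_res = cnn.fit_resample(X_tl, y_tl)

print(Counter(y), Counter(y_tl), Counter(y_res))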

8
Q

One-Sided Selection, or OSS for short, is an undersampling technique that combines ____ and the ____.

P 172

A

Tomek Links, Condensed Nearest Neighbor (CNN) Rule

Specifically, Tomek Links, which are ambiguous points on the class boundary, are identified, and their majority-class members are removed. The CNN method is then used to remove redundant examples from the majority class that are far from the decision boundary.
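
As a sketch, imbalanced-learn exposes this combination directly as OneSidedSelection (the n_neighbors and n_seeds_S values here are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Remove borderline majority examples (Tomek Links) and redundant ones (CNN step).
oss = OneSidedSelection(n_neighbors=1, n_seeds_S=200, random_state=1)
X_res, y_res = oss.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))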

9
Q

The Neighborhood Cleaning Rule, or NCR for short, is an undersampling technique that combines both the ____ to remove redundant examples and the ____ to remove noisy or ambiguous examples.

P 175

A

Condensed Nearest Neighbor (CNN) Rule, Edited Nearest Neighbors (ENN) Rule
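
A minimal sketch with imbalanced-learn's NeighbourhoodCleaningRule (n_neighbors=3 is an illustrative value; the dataset is synthetic):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Clean noisy/ambiguous majority examples near the minority class.
ncr = NeighbourhoodCleaningRule(n_neighbors=3)
X_res, y_res = ncr.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))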

10
Q

How does Tomek Links work?

External

A

Tomek Links are pairs of instances from opposite classes that are each other's nearest neighbors. In other words, they are pairs of opposing instances that are very close together. Tomek's algorithm looks for such pairs and removes the majority-class instance of each pair.
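
A minimal sketch of this behavior with imbalanced-learn's TomekLinks sampler, which here removes only the majority-class member of each link (the dataset is synthetic and illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Find cross-class nearest-neighbor pairs and drop the majority-class instance.
tl = TomekLinks(sampling_strategy='majority')
X_res, y_res = tl.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))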

11
Q

In NCR (Neighborhood Cleaning Rule), unlike OSS (One-Sided Selection), fewer redundant examples are removed and more attention is placed on cleaning the examples that are retained. Why?

P 175

A

The reason for this is to focus less on improving the balance of the class distribution and more on the quality (unambiguity) of the examples that are retained in the majority class.
