Chapter 13 Undersampling Methods Flashcards

1
Q

What’s the simplest undersampling technique?

P 157

A

The simplest undersampling technique involves randomly selecting examples from the majority class and deleting them from the training dataset.
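
As a minimal sketch (assuming the imbalanced-learn library and a synthetic dataset from scikit-learn; parameter values are illustrative only):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic, roughly 99:1 imbalanced dataset for illustration.
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Randomly delete majority-class examples until the classes are balanced.
undersample = RandomUnderSampler(sampling_strategy='majority', random_state=1)
X_res, y_res = undersample.fit_resample(X, y)

print('Before:', Counter(y), 'After:', Counter(y_res))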

2
Q

What’s the major drawback of random undersampling?

P 157

A

The major drawback of random undersampling is that this method can discard potentially useful data that could be important for the induction process.

3
Q

How can we be more discerning when undersampling data?

P 157

A

By using heuristics or learning models that attempt to identify redundant examples for deletion, or useful examples that should be kept.

4
Q

What are the 3 types of undersampling methods in Python?

P 157

A
  • methods that select which examples from the majority class to keep (the Near Miss family of methods, the Condensed Nearest Neighbor Rule),
  • methods that select which examples to delete (Tomek Links, the Edited Nearest Neighbors Rule),
  • combinations of both approaches (the Neighborhood Cleaning Rule (NCR), One-Sided Selection (OSS)); see the sketch below.
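
As a rough sketch of where these three families live in the imbalanced-learn API (assuming that library; every sampler shares the same fit_resample interface):

from imblearn.under_sampling import (
    # Select which majority-class examples to keep.
    NearMiss,
    CondensedNearestNeighbour,
    # Select which examples to delete.
    TomekLinks,
    EditedNearestNeighbours,
    # Combinations of both approaches.
    OneSidedSelection,
    NeighbourhoodCleaningRule,
)

# Every sampler is used the same way, e.g.:
# X_res, y_res = NearMiss().fit_resample(X, y)
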
5
Q

A criticism of the Condensed Nearest Neighbor Rule is that ____

P 167

A

examples are selected randomly, especially initially.

The condensed nearest-neighbor (CNN) method chooses samples randomly. This results in a) retention of unnecessary samples and b) occasional retention of internal rather than boundary samples.
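
Because the selection is random, the retained subset can change from run to run; a small sketch with imbalanced-learn's CondensedNearestNeighbour, where random_state controls that selection (dataset and parameters are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

# Small synthetic imbalanced dataset (CNN is slow on large datasets).
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0, random_state=1)

# Different seeds may retain different (and differently sized) majority subsets.
for seed in (1, 2):
    cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=seed)
    X_res, y_res = cnn.fit_resample(X, y)
    print(seed, Counter(y_res))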

6
Q

How does ENN (Edited Nearest Neighbor) work?

P 170

A

For each instance a in the dataset, its three nearest neighbors are computed. If a is a majority class instance and is misclassified by its three nearest neighbors, then a is removed from the dataset. Alternatively, if a is a minority class instance and is misclassified by its three nearest neighbors, then the majority class instances among a's neighbors are removed.
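
A minimal sketch of this procedure using imbalanced-learn's EditedNearestNeighbours, where n_neighbors=3 matches the three-nearest-neighbor rule described above (the dataset is synthetic and illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Remove majority-class examples that disagree with their 3 nearest neighbors.
enn = EditedNearestNeighbours(n_neighbors=3)
X_res, y_res = enn.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))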

7
Q

In practice, the Tomek Links and ENN procedures are often used on their own. True/False

P 169,171

A

False. In practice, the Tomek Links procedure is often combined with other methods, such as the Condensed Nearest Neighbor Rule.

Also, like Tomek Links, the Edited Nearest Neighbor Rule gives best results when combined with another undersampling method.

The choice to combine Tomek Links and CNN is natural, as Tomek Links can be said to remove borderline and noisy instances, while CNN removes redundant instances.
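
One way to chain the two steps by hand, sketched with imbalanced-learn (Tomek Links first, then the Condensed Nearest Neighbor Rule; dataset and parameters are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, CondensedNearestNeighbour

X, y = make_classification(n_samples=5000, weights=[0.98], flip_y=0, random_state=1)

# Step 1: remove borderline/noisy majority examples that form Tomek Links.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Step 2: remove redundant majority examples far from the boundary with CNN.
cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=1)
X_res, y_res = cnn.fit_resample(X_tl, y_tl)

print(Counter(y), Counter(y_tl), Counter(y_res))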

8
Q

One-Sided Selection, or OSS for short, is an undersampling technique that combines ____ and the ____.

P 172

A

Tomek Links, Condensed Nearest Neighbor (CNN) Rule

Specifically, Tomek Links, which are ambiguous points on the class boundary, are identified, and their majority-class members are removed. The CNN method is then used to remove redundant examples from the majority class that are far from the decision boundary.
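
As a sketch, imbalanced-learn exposes this combination directly as OneSidedSelection (the n_neighbors and n_seeds_S values here are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Remove borderline majority examples (Tomek Links) and redundant ones (CNN step).
oss = OneSidedSelection(n_neighbors=1, n_seeds_S=200, random_state=1)
X_res, y_res = oss.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))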

9
Q

The Neighborhood Cleaning Rule, or NCR for short, is an undersampling technique that combines both the ____ to remove redundant examples and the ____ to remove noisy or ambiguous examples.

P 175

A

Condensed Nearest Neighbor (CNN) Rule, Edited Nearest Neighbors (ENN) Rule
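
A minimal sketch with imbalanced-learn's NeighbourhoodCleaningRule (n_neighbors=3 is an illustrative value; the dataset is synthetic):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Clean noisy/ambiguous majority examples near the minority class.
ncr = NeighbourhoodCleaningRule(n_neighbors=3)
X_res, y_res = ncr.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))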

10
Q

How does Tomek Links work?

External

A

Tomek Links are pairs of instances from opposite classes that are each other's nearest neighbors. In other words, they are pairs of opposing instances that are very close together. Tomek's algorithm looks for such pairs and removes the majority-class instance of each pair.
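
A minimal sketch of this behavior with imbalanced-learn's TomekLinks sampler, which here removes only the majority-class member of each link (the dataset is synthetic and illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Find cross-class nearest-neighbor pairs and drop the majority-class instance.
tl = TomekLinks(sampling_strategy='majority')
X_res, y_res = tl.fit_resample(X, y)
print('Before:', Counter(y), 'After:', Counter(y_res))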

11
Q

In NCR (Neighborhood Cleaning Rule), unlike OSS (One-Sided Selection), fewer redundant examples are removed and more attention is placed on cleaning the examples that are retained. Why?

P 175

A

The reason for this is to focus less on improving the balance of the class distribution and more on the quality (unambiguity) of the examples that are retained in the majority class.
