Chapter 13 Undersampling Methods Flashcards
What’s the simplest undersampling technique?
P 157
The simplest undersampling technique involves randomly selecting examples from the majority class and deleting them from the training dataset.
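A minimal sketch of random undersampling in plain Python, assuming the dataset is held in two parallel lists. The function name `random_undersample` and the parameters `majority_label`, `n_keep`, and `seed` are illustrative choices, not names from the book or from any library.

```python
import random

def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep every non-majority row plus a random subset of n_keep majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    # Majority rows are chosen purely at random, with no regard to usefulness.
    keep_majority = set(rng.sample(majority_idx, n_keep))
    kept = [i for i, label in enumerate(y)
            if label != majority_label or i in keep_majority]
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy data: 10 majority examples (label 0) and 2 minority examples (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2
X_res, y_res = random_undersample(X, y, majority_label=0, n_keep=2)
```

Because the kept majority rows are chosen blindly, a different `seed` can retain a very different subset, which is the drawback the next card describes.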
What’s the major drawback of random undersampling?
P 157
The major drawback of random undersampling is that this method can discard potentially useful data that could be important for the induction process.
How can we be more discerning when undersampling data?
P 157
By using heuristics or learning models that attempt to identify redundant examples to delete or useful examples to keep.
What are the 3 types of undersampling methods in Python?
P 157
- methods that select which examples from the majority class to keep (Near Miss family of methods, Condensed Nearest Neighbor Rule),
- methods that select examples to delete (Tomek links, ENN),
- combinations of both approaches (NCR, OSS)
A criticism of the Condensed Nearest Neighbor Rule is that ____
P 167
examples are selected randomly, especially initially.
The condensed nearest-neighbor (CNN) method chooses samples randomly. This results in a) retention of unnecessary samples and b) occasional retention of internal rather than boundary samples.
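The role of randomness in CNN can be seen in a small sketch. This is a simplified reading of the rule, assuming a binary problem, Euclidean distance, and a store that starts with all minority examples plus one randomly chosen majority example; the names `nearest_label` and `cnn_undersample` are hypothetical.

```python
import math
import random

def nearest_label(point, store_X, store_y):
    """Label of the stored example closest to `point` (1-NN, Euclidean)."""
    j = min(range(len(store_X)), key=lambda j: math.dist(point, store_X[j]))
    return store_y[j]

def cnn_undersample(X, y, majority_label, seed=0):
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    rng.shuffle(majority)  # random visiting order: the source of the criticism
    # Start the store with every minority example and one random majority example.
    store = [i for i, label in enumerate(y) if label != majority_label]
    store.append(majority[0])
    changed = True
    while changed:  # repeat passes until a full pass adds nothing
        changed = False
        for i in majority:
            if i in store:
                continue
            store_X = [X[s] for s in store]
            store_y = [y[s] for s in store]
            # A majority example is kept only if the current store misclassifies it.
            if nearest_label(X[i], store_X, store_y) != y[i]:
                store.append(i)
                changed = True
    keep = sorted(store)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: majority (label 0) at 0..7, minority (label 1) at 9 and 10.
X = [[float(v)] for v in range(8)] + [[9.0], [10.0]]
y = [0] * 8 + [1] * 2
X_res, y_res = cnn_undersample(X, y, majority_label=0)
```

Which majority examples survive depends on the random starting example and visiting order, so an interior point such as 0.0 can be retained even though it is far from the class boundary.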
How does ENN (Edited Nearest Neighbor) work?
P 170
For each instance a in the dataset, its three nearest neighbors are computed. If a is a majority class instance and is misclassified by its three nearest neighbors, then a is removed from the dataset. Alternatively, if a is a minority class instance and is misclassified by its three nearest neighbors, then the majority class instances among a’s neighbors are removed.
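The rule above can be sketched in plain Python. This is an illustration only, assuming Euclidean distance, a majority vote among the three neighbors, and neighbors computed once on the original dataset; `k_nearest` and `enn_undersample` are hypothetical names.

```python
import math
from collections import Counter

def k_nearest(i, X, k=3):
    """Indices of the k points closest to X[i], excluding i itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

def enn_undersample(X, y, majority_label, k=3):
    drop = set()
    for i in range(len(X)):
        neighbors = k_nearest(i, X, k)
        predicted = Counter(y[j] for j in neighbors).most_common(1)[0][0]
        if predicted == y[i]:
            continue  # correctly classified by its neighbors: nothing to do
        if y[i] == majority_label:
            drop.add(i)  # misclassified majority instance: remove it
        else:
            # Misclassified minority instance: remove its majority neighbors.
            drop.update(j for j in neighbors if y[j] == majority_label)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: one majority point (10.5) sits inside the minority cluster.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [10.5], [11.0], [12.0]]
y = [0, 0, 0, 0, 1, 0, 1, 1]
X_res, y_res = enn_undersample(X, y, majority_label=0)
```

In this toy data only the majority point at 10.5 is surrounded by minority neighbors, so it is the only example removed.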
In practice, the Tomek Links and ENN procedures are often used on their own. True/False
P 169, 171
False. In practice, the Tomek Links procedure is often combined with other methods, such as the Condensed Nearest Neighbor Rule.
Also, like Tomek Links, the Edited Nearest Neighbor Rule gives best results when combined with another undersampling method.
The choice to combine Tomek Links and CNN is natural, as Tomek Links can be said to remove borderline and noisy instances, while CNN removes redundant instances.
One-Sided Selection, or OSS for short, is an undersampling technique that combines ____ and the ____.
P 172
Tomek Links, Condensed Nearest Neighbor (CNN) Rule
Specifically, Tomek Links, which are ambiguous points on the class boundary, are identified and removed from the majority class. The CNN method is then used to remove redundant examples from the majority class that are far from the decision boundary.
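A rough sketch of the two steps in the order this card gives them, assuming a binary problem and Euclidean distance. The names `nearest` and `one_sided_selection` are hypothetical, and other descriptions or implementations of OSS may order or vary the steps differently.

```python
import math
import random

def nearest(point, candidates, X, exclude=None):
    """Index in `candidates` closest to `point`, skipping `exclude`."""
    return min((j for j in candidates if j != exclude),
               key=lambda j: math.dist(point, X[j]))

def one_sided_selection(X, y, majority_label, seed=0):
    idx = list(range(len(X)))
    # Step 1: find Tomek links and drop their majority members.
    nn = {i: nearest(X[i], idx, X, exclude=i) for i in idx}
    links = [(i, nn[i]) for i in idx if nn[nn[i]] == i and y[i] != y[nn[i]]]
    tomek_drop = {k for pair in links for k in pair if y[k] == majority_label}
    remaining = [i for i in idx if i not in tomek_drop]
    # Step 2: CNN-style pass that keeps only majority examples the store misclassifies.
    rng = random.Random(seed)
    majority = [i for i in remaining if y[i] == majority_label]
    rng.shuffle(majority)
    store = [i for i in remaining if y[i] != majority_label] + majority[:1]
    changed = True
    while changed:
        changed = False
        for i in majority:
            if i not in store and y[nearest(X[i], store, X)] != y[i]:
                store.append(i)
                changed = True
    keep = sorted(store)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: the majority point 4.9 forms a Tomek link with the minority point 5.0.
X = [[0.0], [1.0], [2.0], [3.0], [4.9], [5.0], [6.0]]
y = [0, 0, 0, 0, 0, 1, 1]
X_res, y_res = one_sided_selection(X, y, majority_label=0)
```

Only majority examples are ever removed, which is where the "one-sided" in the name comes from.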
The Neighborhood Cleaning Rule, or NCR for short, is an undersampling technique that combines both the ____ to remove redundant examples and the ____ to remove noisy or ambiguous examples.
P 175
Condensed Nearest Neighbor (CNN) Rule, Edited Nearest Neighbors (ENN) Rule
How does Tomek Links work?
External
Tomek links are pairs of instances of opposite classes that are each other’s nearest neighbors. In other words, they are pairs of opposing instances that are very close together. Tomek’s algorithm looks for such pairs and removes the majority instance of the pair.
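A short sketch of the pair search, assuming Euclidean distance and ties broken by index order. The function names `nearest_neighbor`, `tomek_links`, and `remove_tomek_majority` are illustrative, not from any library.

```python
import math

def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def remove_tomek_majority(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 4.9 (majority) and 5.0 (minority) are each other's nearest neighbors.
X = [[0.0], [1.0], [2.0], [4.9], [5.0], [8.0]]
y = [0, 0, 0, 0, 1, 1]
links = tomek_links(X, y)
X_res, y_res = remove_tomek_majority(X, y, majority_label=0)
```

Here the only link is the pair at 4.9 and 5.0, so the majority point at 4.9 is removed and the minority point at 5.0 is kept.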
In NCR (Neighborhood Cleaning Rule), unlike OSS (One-Sided Selection), fewer of the redundant examples are removed and more attention is placed on cleaning the examples that are retained. Why?
P 175
The reason for this is to focus less on improving the balance of the class distribution and more on the quality (unambiguity) of the examples that are retained in the majority class.