Chapter 5-11 Flashcards
Sometimes we want excellent predictions of the positive class: both high precision and high recall. Why can this be challenging?
P 80
Because increases in recall often come at the expense of decreases in precision. In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, are often conflicting, since in order to increase the TP for the minority class, the number of FP is also often increased, resulting in reduced precision.
Precision: Appropriate when minimizing ____ is the focus.
Recall: Appropriate when minimizing ____ is the focus.
P 80
false positives
false negatives
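A minimal sketch of the two metrics with scikit-learn (the toy labels below are made up for illustration):

# Precision penalizes false positives; recall penalizes false negatives.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]   # 2 false positives, 1 false negative

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 5 = 0.60
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75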
What is the definition of “sampling methods” for imbalanced datasets?
P 122
The most popular solution to an imbalanced classification problem is to change the composition of the training dataset. Techniques designed to change the class distribution in the training dataset are generally referred to as sampling methods as we are sampling an existing data sample.
Sampling is only performed on the training dataset, the dataset used by an algorithm to learn a model. It is not performed on the holdout test or validation dataset. True/False, why?
P 123
True. The reason is that the intent is not to remove the class bias from the model fit, but to continue to evaluate the resulting model on data that is both real and representative of the target problem domain.
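As an illustration, a sketch assuming scikit-learn and the imbalanced-learn library: the sampler is applied only to the training split, and the test split keeps the original class distribution.

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# synthetic imbalanced problem: roughly 99% majority, 1% minority
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1)

# resample the training data only; the test data stays untouched and representative
X_train_res, y_train_res = RandomOverSampler(random_state=1).fit_resample(X_train, y_train)
print('train after sampling:', Counter(y_train_res))
print('test (unchanged):    ', Counter(y_test))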
What is oversampling and what are some popular methods of doing it?
P 124
Oversampling methods duplicate examples in the minority class or synthesize new examples from the examples in the minority class. Some of the more widely used and implemented oversampling methods include:
Random Oversampling
Synthetic Minority Oversampling Technique (SMOTE)
Borderline-SMOTE
Borderline Oversampling with SVM
Adaptive Synthetic Sampling (ADASYN)
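For reference, each of the oversampling methods listed above has an implementation in the imbalanced-learn library (class names are that library's, not the book's):

from imblearn.over_sampling import (
    RandomOverSampler,   # Random Oversampling
    SMOTE,               # Synthetic Minority Oversampling Technique
    BorderlineSMOTE,     # Borderline-SMOTE
    SVMSMOTE,            # Borderline Oversampling with SVM
    ADASYN,              # Adaptive Synthetic Sampling
)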
What is the simplest oversampling method called, and how does it work?
P 125
Random Oversampling. It involves randomly duplicating examples from the minority class in the training dataset.
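A minimal sketch using imbalanced-learn's RandomOverSampler on a synthetic dataset (parameter values are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# duplicate randomly chosen minority examples until the two classes are balanced
oversample = RandomOverSampler(sampling_strategy='minority', random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))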
What is the most popular and perhaps most successful oversampling method? How does it work?
P 125
SMOTE, an acronym for Synthetic Minority Oversampling Technique. SMOTE works by selecting minority-class examples that are close in the feature space, drawing a line between them in the feature space, and generating a new synthetic example at a point along that line.
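A sketch of the idea with imbalanced-learn's SMOTE class (k_neighbors controls how many nearby minority examples are candidates for the interpolation; the value shown is the library default):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# each synthetic point is interpolated between a minority example and one of
# its k nearest minority-class neighbours in the feature space
oversample = SMOTE(k_neighbors=5, random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))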
How does Borderline-SMOTE work?
P 125
Borderline-SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model or SVM, and only generating synthetic samples that are difficult to classify.
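Both variants are available in imbalanced-learn; a usage sketch (BorderlineSMOTE for the k-nearest-neighbour version, SVMSMOTE for the SVM-based one):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# synthesize new points only around minority examples near the class boundary
oversample = BorderlineSMOTE(kind='borderline-1', random_state=1)
# alternative: let an SVM locate the borderline region
# oversample = SVMSMOTE(random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))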
How does ADASYN work?
P 125
Adaptive Synthetic Sampling (ADASYN) is another extension to SMOTE that generates synthetic samples inversely proportional to the density of the examples in the minority class. It is designed to create synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.
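A usage sketch with imbalanced-learn's ADASYN class (n_neighbors, shown at its default, controls the neighbourhood used to estimate local minority density):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# more synthetic points are generated for minority examples in sparse,
# hard-to-learn regions; fewer (or none) where minority density is high
oversample = ADASYN(n_neighbors=5, random_state=1)
X_res, y_res = oversample.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))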
What does undersampling do? What are some methods of undersampling?
P 126
Undersampling methods delete or select a subset of examples from the majority class. Some of the more widely used and implemented undersampling methods include:
Random Undersampling
Condensed Nearest Neighbor Rule (CNN)
Near Miss Undersampling
Tomek Links Undersampling
Edited Nearest Neighbors Rule (ENN)
One-Sided Selection (OSS)
Neighborhood Cleaning Rule (NCR)
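As with oversampling, each of the undersampling methods listed above has an implementation in imbalanced-learn (class names are that library's):

from imblearn.under_sampling import (
    RandomUnderSampler,          # Random Undersampling
    CondensedNearestNeighbour,   # Condensed Nearest Neighbor Rule (CNN)
    NearMiss,                    # Near Miss Undersampling
    TomekLinks,                  # Tomek Links Undersampling
    EditedNearestNeighbours,     # Edited Nearest Neighbors Rule (ENN)
    OneSidedSelection,           # One-Sided Selection (OSS)
    NeighbourhoodCleaningRule,   # Neighborhood Cleaning Rule (NCR)
)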
Although an oversampling or undersampling method can be effective when used alone on a training dataset, experiments have shown that applying both types of techniques together can often result in better overall performance of a model fit on the resulting transformed dataset. Some of the more widely used and implemented combinations of data sampling methods include: ____ (name 3)
P 126
SMOTE and Random Undersampling
SMOTE and Tomek Links
SMOTE and Edited Nearest Neighbors Rule
It is common to pair SMOTE with an undersampling method that selects examples from the dataset to delete; the undersampling procedure is applied after SMOTE, allowing the editing step to be applied to both the minority and majority classes.
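A sketch of the three combinations with imbalanced-learn: SMOTE plus random undersampling can be chained in an imblearn Pipeline, while SMOTE+Tomek Links and SMOTE+ENN are provided as ready-made combined samplers (the sampling_strategy values are illustrative):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek, SMOTEENN

# 1) SMOTE followed by random undersampling of the majority class
combo = Pipeline(steps=[
    ('over', SMOTE(sampling_strategy=0.1, random_state=1)),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=1)),
])

# 2) SMOTE followed by Tomek Links editing
smote_tomek = SMOTETomek(random_state=1)

# 3) SMOTE followed by Edited Nearest Neighbours editing
smote_enn = SMOTEENN(random_state=1)

# the combined samplers expose fit_resample(X, y); the pipeline is typically
# extended with a classifier as the final step and evaluated with cross-validation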
Random oversampling involves randomly selecting examples from the minority class, without replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset. True/False
P 130
False
Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.
These methods are referred to as naive sampling methods because they assume nothing about the data and no heuristics are used.
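A bare-bones NumPy sketch of the two naive methods in terms of index selection (array names are made up); note replace=True for oversampling, which is what produces duplicated minority examples:

import numpy as np

rng = np.random.default_rng(1)
y = np.array([0] * 95 + [1] * 5)      # 0 = majority class, 1 = minority class
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# random oversampling: draw minority indices WITH replacement and append the copies
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
oversampled_idx = np.concatenate([np.arange(len(y)), extra])

# random undersampling: keep a without-replacement subset of the majority class
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
undersampled_idx = np.concatenate([kept_majority, minority_idx])

print(np.bincount(y[oversampled_idx]))    # [95 95]
print(np.bincount(y[undersampled_idx]))   # [5 5]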
Random over/under sampling is simple to implement and fast to execute, which is desirable for ____ datasets.
P 130
very large and complex
Random oversampling may increase the likelihood of overfitting. Why?
P 131
Because it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.
How can we know if random oversampling has caused overfitting?
P 131
To gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.
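One way to do that in practice, sketched with scikit-learn and imbalanced-learn (the model and metric are illustrative choices, not the book's prescription):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)

def fit_and_report(X_train, y_train, label):
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(label,
          'train F1 =', round(f1_score(y_train, model.predict(X_train)), 3),
          'test F1 =', round(f1_score(y_te, model.predict(X_te)), 3))

fit_and_report(X_tr, y_tr, 'original:   ')   # baseline on the untouched training set

# oversampled training set; a large train/test gap here points to overfitting
X_os, y_os = RandomOverSampler(random_state=1).fit_resample(X_tr, y_tr)
fit_and_report(X_os, y_os, 'oversampled:')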