Chapter 23 Ensemble Algorithms Flashcards
What’s a bootstrap sample?
P 295
A bootstrap sample is a random sample of the training dataset drawn with replacement, meaning a given sample may contain zero, one, or more than one copy of any example from the training dataset.
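A minimal sketch of drawing a bootstrap sample with scikit-learn's resample utility (the toy data and random_state are made up for illustration):

```python
import numpy as np
from sklearn.utils import resample

# toy dataset: 10 examples, one feature each
X = np.arange(10).reshape(-1, 1)

# bootstrap sample: same size as the original, drawn WITH replacement,
# so some examples appear more than once and others not at all
boot = resample(X, replace=True, n_samples=len(X), random_state=1)
print(boot.ravel())
```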
The process of creating new bootstrap samples and fitting and adding trees to the ensemble can continue until no further improvement is seen in the ensemble’s performance on a validation dataset. This simple procedure often results in better performance than a single well-configured decision tree algorithm. True/False
P 295
True
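A hedged sketch of this procedure using scikit-learn's BaggingClassifier, which fits a decision tree on a different bootstrap sample for each ensemble member (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# synthetic binary classification problem
X, y = make_classification(n_samples=1000, random_state=1)

# bagged decision trees: each tree is fit on its own bootstrap sample
model = BaggingClassifier(n_estimators=50, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)
print('Mean accuracy: %.3f' % scores.mean())
```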
An easy way to overcome the class imbalance problem at the resampling stage in bagging is: ____
P 296
to take the classes of the instances into account when they are randomly drawn from the original dataset.
Essentially, in these data-sampling ensembles the classes are balanced within each bootstrap sample before a model is fit on it; this acts like a wrapper around the ensemble model, making it cost-sensitive.
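One way to take the classes into account is to bootstrap each class separately and concatenate the results; this is an illustrative sketch with arbitrary sizes, not a library API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# imbalanced toy data: roughly 99:1 class distribution
X, y = make_classification(n_samples=1000, weights=[0.99], random_state=1)

# draw a bootstrap sample of equal size from each class
n_per_class = 500
Xs, ys = [], []
for label in np.unique(y):
    Xc, yc = resample(X[y == label], y[y == label],
                      replace=True, n_samples=n_per_class, random_state=1)
    Xs.append(Xc)
    ys.append(yc)
X_boot, y_boot = np.vstack(Xs), np.hstack(ys)
print(np.bincount(y_boot))  # balanced: [500 500]
```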
Oversampling the minority class in the bootstrap is referred to as ____; likewise, undersampling the majority class in the bootstrap is referred to as ____, and combining both approaches is referred to as ____.
P 297
OverBagging
UnderBagging
OverUnderBagging
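In recent versions of imbalanced-learn, these variants can be obtained by passing different samplers to BalancedBaggingClassifier via its sampler argument; a hedged sketch (parameters otherwise left at their defaults):

```python
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# OverBagging: oversample the minority class within each bootstrap sample
over_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())

# UnderBagging: undersample the majority class within each bootstrap sample
under_bagging = BalancedBaggingClassifier(sampler=RandomUnderSampler())
```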
The imbalanced-learn library provides an implementation of UnderBagging. Specifically, it provides a version of bagging that uses a random undersampling strategy on the majority class within a bootstrap sample in order to balance the two classes. This is provided in the ____ class.
P 297
BalancedBaggingClassifier
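A minimal usage sketch; the synthetic dataset and settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedBaggingClassifier

# imbalanced binary problem, roughly 99:1
X, y = make_classification(n_samples=10000, weights=[0.99], random_state=1)

# bagging with random undersampling of the majority class in each bootstrap
model = BalancedBaggingClassifier(n_estimators=10, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('Mean ROC AUC: %.3f' % scores.mean())
```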
What are the differences and similarities between bagged decision trees and random forest?
P 298
Random forest is another ensemble of decision tree models and may be considered an improvement upon bagging. Like bagging, random forest involves selecting bootstrap samples from the training dataset and fitting a decision tree on each.
The main difference is that not all features (variables or columns) are used; instead, a small, randomly selected subset of features is considered at each split point when constructing the trees. This has the effect of de-correlating the decision trees (making them more independent) and, in turn, improving the ensemble prediction.
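In scikit-learn, the size of the random feature subset considered at each split is controlled by the max_features argument of RandomForestClassifier; a brief sketch with made-up values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# each split considers only sqrt(20) ~ 4 randomly chosen features,
# which de-correlates the trees in the ensemble
model = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=1)
model.fit(X, y)
```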
How is the class_weight set in RandomForestClassifier, when this argument’s value is ‘balanced_subsample’?
P 300
Given that each decision tree is constructed from a bootstrap sample (i.e., random selection with replacement), the class distribution in the data sample will be different for each tree. As such, it can be useful to change the class weighting based on the class distribution in each bootstrap sample, instead of the entire training dataset. This is achieved by setting the class_weight argument to the value 'balanced_subsample'.
What’s the difference between using class_weight=”balanced” and class_weight=”balanced_subsample” in RandomForestClassifier?
P 300
class_weight='balanced_subsample' computes the class weighting from the class distribution of each tree's bootstrap sample, whereas class_weight='balanced' computes it once from the class distribution of the entire training dataset.
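A side-by-side sketch of the two settings (parameters otherwise left at their defaults):

```python
from sklearn.ensemble import RandomForestClassifier

# weights computed once, from the class distribution of the full training set
rf_balanced = RandomForestClassifier(class_weight='balanced')

# weights recomputed from the class distribution of each tree's bootstrap sample
rf_subsample = RandomForestClassifier(class_weight='balanced_subsample')
```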
What does Balanced Random Forest model from imblearn library do to fix the data imbalance problem?
P 301
Another useful modification to random forest is to perform data sampling on the bootstrap sample in order to explicitly change the class distribution. The BalancedRandomForestClassifier class from the imbalanced-learn library implements this and performs random undersampling of the majority class in each bootstrap sample. This is generally referred to as Balanced Random Forest.
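A minimal usage sketch, with an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedRandomForestClassifier

X, y = make_classification(n_samples=10000, weights=[0.99], random_state=1)

# random forest in which each bootstrap sample is balanced by
# randomly undersampling the majority class
model = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('Mean ROC AUC: %.3f' % scores.mean())
```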
How does Easy Ensemble work?
P 303
The Easy Ensemble involves creating balanced samples of the training dataset by selecting all examples from the minority class and a subset from the majority class. Rather than using pruned decision trees, boosted decision trees are used on each subset, specifically the AdaBoost algorithm.
The process can be repeated multiple times and the average prediction across the ensemble of models can be used to make predictions.
Although an AdaBoost classifier is used on each subsample by default, alternate classifier models can be used by setting the model's base estimator argument (base_estimator in older imbalanced-learn releases, renamed estimator in recent ones).
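A hedged sketch using imbalanced-learn's EasyEnsembleClassifier (dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_samples=10000, weights=[0.99], random_state=1)

# 10 balanced subsets (all minority examples plus an equal-sized random
# draw from the majority), each fit with AdaBoost by default
model = EasyEnsembleClassifier(n_estimators=10, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('Mean ROC AUC: %.3f' % scores.mean())
```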
Under-sampling is an efficient strategy to deal with class-imbalance. However, the drawback of under-sampling is that it throws away many potentially useful data. How is this problem avoided in Easy Ensemble (or Balanced Random Forest and BalancedBaggingClassifier) in which the subsamples are under-sampled?
P 302
The generation of multiple subsamples allows the ensemble to overcome the downside of undersampling in which valuable information is discarded from the training process.
Easy Ensemble combines bagging and boosting for imbalanced classification. True/False
P 305
True
It uses the combined results of multiple AdaBoost base estimators to produce the final prediction.
Is balancing the training data done before bootstrapping in BalancedBaggingClassifier?
External
No. Each bootstrap sample is drawn from the full training dataset first, and random undersampling is then applied within that bootstrap sample to balance the classes.