7.1 Classifier Combination Flashcards
What is boosting?
- Intuition: tune base classifiers to focus on the hard-to-classify instances
- Method: iteratively change the distribution and weights of the training instances to reflect the classifier's performance in the previous iteration
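As a concrete illustration (my own sketch, not part of the flashcards), the AdaBoost.M1-style loop below re-weights the training instances after each round so that misclassified instances get more attention; the decision-stump base learners and binary labels are assumed choices.

```python
# Illustrative AdaBoost.M1-style boosting loop (assumed details: decision
# stumps as base learners); not the exact algorithm from the notes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform instance weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        err = np.sum(w * wrong) / np.sum(w)
        if err == 0 or err >= 0.5:               # perfect or no better than chance: stop
            break
        alpha = np.log((1 - err) / err)          # better learners get larger votes
        w *= np.exp(alpha * wrong)               # up-weight misclassified instances
        w /= w.sum()                             # renormalise the weight distribution
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas                      # final prediction: weighted vote
```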
What is Bagging?
Bagging = bootstrap aggregating
- Intuition: more data generally means better performance (lower variance), so how can we get more data out of a fixed training dataset?
- Method: construct new datasets by randomly sampling instances from the original training set with replacement (bootstrap samples), train one base classifier per dataset, and aggregate their predictions
What are the techniques that use instance manipulation approach to combine classifiers?
- Boosting
- Bagging
- Random Forests
Bagging constructs multiple new datasets by randomly sampling instances with replacement and trains one classifier per dataset. Random Forest adopts the same bagging technique to generate different datasets for different random trees. Boosting also iteratively draws on the training instances to train multiple classifiers, but it assigns higher weights to the instances that were misclassified in the previous iteration.
In contrast, stacking introduces a meta-classifier that learns which base classifiers to rely on.
Which of the following statement(s) are TRUE about ensemble learning?
- An ensemble of classifiers may not be able to outperform any of its individual base learners.
- Combining meaningful base learners improves the generalizability of the model.
Ensembling diverse, meaningful base learners typically yields better results and more generalizable models. However, ensembling does not always guarantee improved performance.
Which of the following statement(s) are TRUE about Random Forest?
- Random Forest adopts both feature manipulation and instance manipulation approaches.
Random forest adopts instance manipulation to train multiple random trees using different bagged datasets. For each random tree, feature manipulation is used to consider different feature combinations at different nodes. By training multiple random trees with different bagged datasets, random forest reduces the variance (not the bias). The predictions made by a random tree can be explained by following the decisions made along the tree. However, combining multiple random trees using a voting mechanism (i.e., random forest) degrades the interpretability of the overall logic.
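A hedged scikit-learn illustration of the two manipulation approaches (the synthetic dataset and hyperparameter values are placeholders, not from the notes): `bootstrap=True` performs the instance manipulation (bagged datasets), while `max_features` limits the random feature subset considered at each split.

```python
# Random forest = instance manipulation (bootstrap) + feature manipulation (max_features).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=100,     # number of random trees
                            max_features="sqrt",  # feature manipulation at each split
                            bootstrap=True,       # instance manipulation (bagged datasets)
                            random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```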
Which of the following statement(s) are TRUE about Boosting?
Boosting assigns higher weights to better-performing base learners
Boosting adopts a weighted voting strategy to combine base learners based on the importance of each base learner. Boosting is an instance manipulation technique, where the wrongly predicted samples (i.e., difficult samples) are iteratively emphasized.
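As a hedged numerical example (using the AdaBoost.M1 learner weight α = ln((1 − ε)/ε), a formula the notes do not spell out): a base learner with error ε = 0.1 receives α = ln(9) ≈ 2.20, while one with ε = 0.4 receives only α = ln(1.5) ≈ 0.41, so the better-performing learner contributes far more to the weighted vote.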
Suppose there are 3 independent binary classifiers C1, C2, and C3, with error rates 0.3, 0.2, and 0.2 respectively. If the classifiers are combined by majority voting, what is the error rate of the combined classifier?
0.136
For the combined (majority-vote) classifier to make an error, at least two of the three base classifiers must be wrong. There are four such scenarios, with probabilities:
- {C1, C2} wrong: 0.3 * 0.2 * (1 - 0.2) = 0.048
- {C1, C3} wrong: 0.3 * (1 - 0.2) * 0.2 = 0.048
- {C2, C3} wrong: (1 - 0.3) * 0.2 * 0.2 = 0.028
- {C1, C2, C3} wrong: 0.3 * 0.2 * 0.2 = 0.012
Thus, the error rate of the combined classifier is 0.048+0.048+0.028+0.012=0.136
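The arithmetic can be double-checked by enumerating all eight correct/incorrect patterns (a small verification script, not part of the original answer):

```python
# Verify the 0.136 result by enumerating every correctness pattern of C1, C2, C3.
from itertools import product

errs = [0.3, 0.2, 0.2]
p_wrong = 0.0
for outcome in product([0, 1], repeat=3):      # 1 = that classifier is wrong
    p = 1.0
    for e, wrong in zip(errs, outcome):
        p *= e if wrong else (1 - e)
    if sum(outcome) >= 2:                      # majority vote is wrong
        p_wrong += p
print(p_wrong)                                 # ~0.136
```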
What is a random forest?
An ensemble of Random Trees, many trees = forest
- Each tree is built using a different Bagged training dataset
- The combined classification is via voting
What is stacking?
- Intuition: smooth errors over a range of algorithms with different biases
- Method 1: simple voting; but which classifier should we trust?
- Method 2: train a meta-classifier (level-1 model) over the outputs of the base classifiers (level-0 models)
  - learn which base classifiers are the reliable ones, and combine their outputs
  - train using nested cross-validation to reduce bias
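A hedged sketch with scikit-learn's StackingClassifier (the choice of base learners and dataset is mine, not prescribed by the notes): level-0 models with different biases, a logistic-regression level-1 model, and internal cross-validation (cv=5) to generate out-of-fold meta-features, which plays the role of the cross-validation step described above.

```python
# Stacking: level-0 base classifiers feed a level-1 meta-classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),   # level-0 models
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),   # level-1 (meta) classifier
    cv=5,                                   # out-of-fold predictions as meta-features
)
print(stack.fit(X_tr, y_tr).score(X_te, y_te))
```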