Ensemble Classifier Flashcards
Why combine classifiers?
If we only use one algorithm, we are stuck with the bias inherent in that algorithm.
ensemble learning
constructs a set of base classifiers from a given set of training data and aggregates their outputs into a single meta-classifier, so that:
• the combination of lots of weak classifiers can be at least as good as one strong classifier
• the combination of a selection of strong classifiers is (usually) at least as good as the best of the base classifiers
voting
• for a nominal class set, run multiple base classifiers over the test data and select the class predicted by the most base classifiers
• for a continuous class set, average over the numeric predictions of our base classifiers
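A minimal sketch of these two voting rules in Python (NumPy assumed; the function names and example inputs are illustrative, not from the notes):
```python
import numpy as np

def vote_nominal(predictions):
    """Majority vote over the class labels predicted by the base classifiers."""
    labels, counts = np.unique(predictions, return_counts=True)
    return labels[np.argmax(counts)]

def vote_continuous(predictions):
    """Average over the numeric predictions of the base classifiers."""
    return float(np.mean(predictions))

print(vote_nominal(["spam", "ham", "spam"]))  # -> "spam"
print(vote_continuous([2.0, 3.0, 4.0]))       # -> 3.0
```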
Approaches to Classifier Combination
- Instance manipulation (most common)
- Feature manipulation (most common)
- Class label manipulation
- Algorithm manipulation
Instance manipulation
generate multiple training datasets through sampling, and train a base classifier over each
Feature manipulation
generate multiple training datasets through different feature subsets, and train a base classifier over each
Class label manipulation
generate multiple training datasets by manipulating the class labels in an irreversible manner
Algorithm manipulation
semi-randomly “tweak” internal parameters within a given algorithm to generate multiple base classifiers over a given dataset
4 popular ensemble methods
- stacking
- bagging
- random forest
- Boosting
Stacking
Basic intuition: “smooth” errors over a range of algorithms with different biases
• Method 1: simple voting, which presupposes the base classifiers have equal performance
• Method 2: train a classifier over the outputs of the base classifiers (meta-classification), using nested cross-validation to reduce bias; the meta-classifier is typically Logistic Regression
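A sketch of Method 2, assuming scikit-learn; the dataset and base learners are placeholders, and plain (rather than nested) cross-validation is used for the meta-features to keep the example short:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Toy dataset and heterogeneous base classifiers (placeholders)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
base = [GaussianNB(), DecisionTreeClassifier(max_depth=3, random_state=0)]

# Out-of-fold predictions of each base classifier become the meta-features,
# so the meta-classifier never sees predictions made on an instance's own fold
meta_train = np.column_stack(
    [cross_val_predict(clf, X_tr, y_tr, cv=5) for clf in base])
meta_clf = LogisticRegression().fit(meta_train, y_tr)

# At test time: refit the base classifiers on all training data, then stack
meta_test = np.column_stack(
    [clf.fit(X_tr, y_tr).predict(X_te) for clf in base])
print("stacked accuracy:", meta_clf.score(meta_test, y_te))
```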
Pros of stacking
• Mathematically simple but computationally expensive method
• Able to combine heterogeneous classifiers with varying performance
• Generally, stacking results in as good or better results than the best of the base classifiers
• Widely seen in applied research; less interest within theoretical circles (esp. statistical learning)
bagging/bootstrap aggregating
Basic intuition: the more data, the better the performance (the lower the variance), so how can we get ever more data out of a fixed training dataset?
Construct “novel” datasets through a combination of random sampling and replacement:
• Randomly sample the original dataset N times, with replacement (the same instance can be selected over and over again)
• Thus, we get a new dataset of the same size, where any individual instance is absent with probability (1 − 1/N)^N ≈ 0.37 for large N
• Construct k random datasets for k base classifiers, and arrive at a prediction via voting
• The same classification algorithm is used throughout
• As bagging is aimed at minimising variance through sampling, the base algorithm should be unstable (= high-variance)
• High variance: DT (if a few instances are excluded, the whole model might be different)
• Low variance: SVM (hard margin; a soft margin wouldn’t help much), LR (the overall result wouldn’t change much anyway)
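A sketch of the sampling-and-voting loop, assuming scikit-learn decision trees as the unstable base learner; the dataset and k are placeholders. It also checks the (1 − 1/N)^N absence probability empirically:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
N, k = len(X), 25
rng = np.random.default_rng(0)

trees, absent = [], []
for _ in range(k):
    idx = rng.integers(0, N, size=N)             # sample N times, with replacement
    absent.append(1 - len(np.unique(idx)) / N)   # fraction of instances missing
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Any individual instance is absent with probability (1 - 1/N)^N ~ e^-1 ~ 0.37
print("mean absent fraction:", np.mean(absent), "vs", (1 - 1 / N) ** N)

# Arrive at the prediction via simple voting over the k trees
votes = np.stack([t.predict(X) for t in trees])              # shape (k, N)
bagged = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy of the bagged vote:", (bagged == y).mean())
```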
Pros of Bagging
Pros:
• Simple method based on sampling and voting
• Possibility to parallelise computation of individual base classifiers
• Highly effective over noisy datasets (outliers may vanish)
• Performance is generally significantly better than the base classifiers (esp. DT) and only occasionally substantially worse
Random Tree
A “Random Tree” is a Decision Tree where:
• At each node, only a randomly selected subset of the possible attributes is considered
• Attempts to control for unhelpful attributes in the feature set (a standard DT does not do that)
• Much faster to build than a “deterministic” Decision Tree, but increases model variance (which is our goal, because high variance is good for bagging)
Random Forests
An ensemble of Random Trees (many trees = a forest)
• Each tree is built using a different bagged training dataset
• As with bagging, the combined classification is via voting
• The idea behind them is to minimise overall model variance, without introducing (combined) model bias
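A sketch using scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are placeholders:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # number of Random Trees in the forest
    max_features="sqrt",  # random attribute subset considered at each node
    bootstrap=True,       # each tree is built on a bagged (bootstrap) sample
    n_jobs=-1,            # trees are independent, so build them in parallel
    random_state=0,
)
print("cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```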
Pros & Cons of RF
Pros:
• Generally a very strong performer
• parallelisable & efficient
• Robust to overfitting
Cons:
• Interpretability sacrificed
Boosting
Basic intuition: tune base classifiers to focus on the “hard to classify” instances
Iteratively change the distribution and weights of training instances to reflect the performance of the classifier on the previous iteration:
• Start with each training instance having a 1/N probability of being included in the sample
• Over T iterations, train a classifier and update the weight of each instance according to whether it is correctly classified
• Combine the base classifiers via weighted voting
AdaBoost
- α_i = importance of base classifier C_i = the weight associated with its vote, computed from its weighted error rate ε_i as α_i = ½ ln((1 − ε_i)/ε_i)
- If the error rate is low, α_i is large and positive; if the error rate is high (above 0.5), α_i becomes negative
- Base classification algorithm: decision stumps (1-R) or decision trees
- Reinitialise the instance weights whenever ε_i > 0.5
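A minimal from-scratch sketch of the boosting loop (AdaBoost-style, binary labels in {−1, +1}), assuming scikit-learn decision stumps as the base learner; T and the dataset are placeholders, and library implementations differ in the details:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)                  # binary labels in {-1, +1}
N, T = len(X), 20
w = np.full(N, 1 / N)                        # start with uniform instance weights
stumps, alphas = [], []

for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1)          # decision stump
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                 # weighted error rate epsilon_i
    if err >= 0.5:                           # worse than chance: reset the weights
        w = np.full(N, 1 / N)
        continue
    err = max(err, 1e-10)                    # guard against a perfect stump
    alpha = 0.5 * np.log((1 - err) / err)    # classifier importance (vote weight)
    w *= np.exp(-alpha * y * pred)           # up-weight misclassified instances
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Combine the base classifiers via weighted voting
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(score) == y).mean())
```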
Pros & Cons of Boosting
• Mathematically complicated but computationally cheap method based on iterative sampling and weighted voting
• More computationally expensive than bagging, since the iterations are sequential and cannot be parallelised
• The method has guaranteed performance in the form of error bounds over the training data
• Interesting effect with the convergence of the error rate over the training vs. test data
• In practical applications, boosting has a tendency to overfit
Bagging/RF vs. Boosting
Bagging/RF:
• Parallel sampling
• Simple voting
• Single classification algorithm
• Minimise variance
• Not prone to overfitting
Boosting:
• Iterative sampling
• Weighted voting
• Single classification algorithm
• Minimise (instance) bias
• Prone to overfitting