Ensemble Classifier Flashcards
Why combine classifiers?
If we only use one algorithm, we are stuck with the bias inherent in that algorithm.
ensemble learning
constructs a set of base classifiers from a given set of training data and aggregates their outputs into a single meta-classifier so that:
• the combination of lots of weak classifiers can be at least as good as one strong classifier
• the combination of a selection of strong classifiers is (usually) at least as good as the best of the base classifiers
voting
• for a nominal class set, run multiple base classifiers over the test data and select the class predicted by the most base classifiers
• for a continuous class set, average over the numeric predictions of our base classifiers
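A minimal sketch of both voting schemes, assuming already-fitted scikit-learn-style base classifiers (the function names are illustrative):

```python
import numpy as np
from collections import Counter

def majority_vote(classifiers, X):
    """Nominal class: each base classifier votes; the most frequent label wins."""
    # shape (n_classifiers, n_instances)
    predictions = np.array([clf.predict(X) for clf in classifiers])
    return np.array([Counter(col).most_common(1)[0][0] for col in predictions.T])

def average_vote(regressors, X):
    """Continuous class: average the numeric predictions of the base regressors."""
    predictions = np.array([reg.predict(X) for reg in regressors])
    return predictions.mean(axis=0)
```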
Approaches to Classifier Combination
- Instance manipulation (most common)
- Feature manipulation (most common)
- Class label manipulation
- Algorithm manipulation
Instance manipulation
generate multiple training
datasets through sampling, and train a base classifier over each
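A minimal sketch of instance manipulation via sampling with replacement, assuming numpy arrays; the function name and the number of datasets k are illustrative:

```python
import numpy as np

def resampled_datasets(X, y, k=5, seed=0):
    """Generate k training datasets by sampling instances with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    datasets = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # N draws with replacement
        datasets.append((X[idx], y[idx]))
    return datasets  # train one base classifier per dataset, then vote
```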
Feature manipulation
generate multiple training
datasets through different feature subsets, and train a base classifier over each
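A minimal sketch of feature manipulation (random feature subsets), assuming scikit-learn and numpy arrays; the function name and the fraction of features kept are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_ensemble(X, y, k=5, frac=0.5, seed=0):
    """Train k base classifiers, each on a different random subset of the features."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    ensemble = []
    for _ in range(k):
        cols = rng.choice(n_features, size=max(1, int(frac * n_features)), replace=False)
        ensemble.append((DecisionTreeClassifier().fit(X[:, cols], y), cols))
    return ensemble  # at test time: clf.predict(X_test[:, cols]) per member, then vote
```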
Class label manipulation
generate multiple training datasets by manipulating the class labels in an irreversible manner
Algorithm manipulation
semi-randomly “tweak” internal parameters within a given algorithm to generate multiple base classifiers over a given dataset
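A minimal sketch of algorithm manipulation, assuming scikit-learn decision trees whose internal settings (seed, split strategy) are “tweaked” to produce different base classifiers over the same dataset; the parameter choices are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

def algorithm_manipulation_ensemble(X, y, seeds=(0, 1, 2, 3, 4)):
    """Train one base classifier per tweaked configuration on the same dataset."""
    ensemble = []
    for seed in seeds:
        # semi-randomly vary internal parameters: here, the seed and the splitter
        clf = DecisionTreeClassifier(random_state=seed, splitter="random")
        ensemble.append(clf.fit(X, y))
    return ensemble  # combine the outputs by voting
```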
4 popular ensemble methods
- stacking
- bagging
- random forest
- Boosting
Stacking
Basic intuition: “smooth” errors over a range of algorithms with different biases
• Method 1: simple voting
presupposes the base classifiers have roughly equal performance
• Method 2: train a classifier over the outputs of the base classifiers (meta-classification)
train the meta-classifier (usually Logistic Regression) using nested cross-validation to reduce bias (see the sketch below)
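A minimal sketch of Method 2 using scikit-learn’s StackingClassifier, which fits the meta-classifier on cross-validated predictions of the base classifiers; the choice of base classifiers and cv=5 are illustrative:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# heterogeneous base classifiers with different biases
base_classifiers = [
    ("dt", DecisionTreeClassifier()),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
]

# the Logistic Regression meta-classifier is trained on out-of-fold predictions
# of the base classifiers, which is how the bias reduction is achieved
stacker = StackingClassifier(estimators=base_classifiers,
                             final_estimator=LogisticRegression(),
                             cv=5)
# stacker.fit(X_train, y_train); stacker.predict(X_test)
```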
Pros of stacking
Mathematically simple but computationally expensive method
• Able to combine heterogeneous classifiers with varying performance
• Generally, stacking gives results as good as or better than the best of the base classifiers
• Widely seen in applied research; less interest within theoretical circles (esp. statistical learning)
bagging/bootstrap aggregating
Basic intuition: the more data, the better the performance (the lower the variance), so how can we get ever more data out of a fixed training dataset?
Construct “novel” datasets through a combination of random sampling and replacement:
• Randomly sample N instances from the original dataset of size N, with replacement (the same instance can be selected over and over again)
• Thus, we get a new dataset of the same size, where any individual instance is absent with probability (1 − 1/N)^N ≈ e^(−1) ≈ 0.37 for large N
• construct k such random datasets for k base classifiers, and arrive at a prediction via voting (see the bagging sketch after this card)
• The same classification algorithm is used throughout
• As bagging is aimed at minimising variance through sampling, the algorithm should be unstable (i.e. high-variance)
• high variance: DT (if a few instances are excluded, the whole model might be different)
• low variance: SVM (hard margin; soft margin wouldn’t help much), LR (the overall result wouldn’t change much anyway)
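A minimal sketch of bagging with unstable decision-tree base classifiers, using scikit-learn’s BaggingClassifier; n_estimators is an illustrative choice, and the last lines just sanity-check the absence probability quoted above:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# k bootstrap samples, one high-variance tree per sample, combined by voting
bagger = BaggingClassifier(DecisionTreeClassifier(),  # unstable base classifier
                           n_estimators=50,           # k
                           max_samples=1.0,           # each bootstrap sample has size N
                           bootstrap=True)            # sample with replacement
# bagger.fit(X_train, y_train); bagger.predict(X_test)

# probability an instance is absent from a single bootstrap sample
N = 1000
print((1 - 1 / N) ** N, np.exp(-1))  # both ≈ 0.368
```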
Pros of Bagging
Pros:
• Simple method based on sampling and voting
• Possibility to parallelise computation of individual base classifiers
• Highly effective over noisy datasets (outliers may vanish)
• Performance is generally significantly better than the base classifiers (esp. DT) and only occasionally substantially worse
Random Tree
A “Random Tree” is a Decision Tree where:
• At each node, only a random subset of the possible attributes is considered
• Attempts to control for unhelpful attributes in the feature set (a standard DT does not do this)
• Much faster to build than a “deterministic” Decision Tree, but increases model variance (which is desirable here, since bagging works best with high-variance base classifiers)
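A minimal sketch of a Random Tree, under the assumption that scikit-learn’s max_features option (each split considers only a random subset of attributes) matches the idea described here; bagging such trees is essentially a Random Forest:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# a "random tree": at each node, only sqrt(n_features) randomly chosen
# attributes are considered as split candidates
random_tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)

# bagging many such high-variance random trees approximates a Random Forest
forest_like = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                                n_estimators=100, bootstrap=True)
# forest_like.fit(X_train, y_train); forest_like.predict(X_test)
```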