Chapter 6.10: Ensemble Methods Flashcards
Ensemble method
An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier.
2 Necessary conditions for an ensemble classifier to perform better than a single classifier
- The base classifiers should be independent of each other
- The base classifiers should do better than a classifier that performs random guessing.
4 Methods for Constructing an Ensemble Classifier
By manipulating:
- the training set
- the input features
- the class labels
- the learning algorithm
4 Methods for Constructing an Ensemble Classifier
Manipulating the training set
to construct an ensemble classifier
In this approach, multiple training sets are created by resampling the original data according to some sampling distribution and constructing a classifier from each training set.
E.g. bagging & boosting
4 Methods for Constructing an Ensemble Classifier
Manipulating the input features
to construct an ensemble classifier
In this approach, a subset of input features is chosen to form each training set.
The subset can be either chosen randomly or based on the recommendation of domain experts.
Random forests manipulate the input features and use decision trees as their base classifiers.
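A minimal sketch of this idea in Python, assuming scikit-learn decision trees as the base classifiers; the function names and dataset handling are illustrative, not from the text.
~~~python
# Sketch: each base classifier is trained on a randomly chosen subset of the
# input features (assumes non-negative integer class labels in y and that
# subset_size does not exceed the number of columns of X).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_feature_subset_ensemble(X, y, n_classifiers=10, subset_size=5, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        # Randomly choose subset_size of the available input features.
        features = rng.choice(X.shape[1], size=subset_size, replace=False)
        clf = DecisionTreeClassifier().fit(X[:, features], y)
        ensemble.append((features, clf))
    return ensemble

def predict_majority(ensemble, X):
    # Each base classifier votes using only its own feature subset;
    # the class with the most votes wins.
    votes = np.array([clf.predict(X[:, features]) for features, clf in ensemble])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
~~~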
4 Methods for Constructing an Ensemble Classifier
Manipulating the class labels
to construct an ensemble classifier
This method can be used when the number of classes is sufficiently large.
The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A₀ and A₁.
The relabeled examples are then used to train a base classifier. By repeating this process multiple times, an ensemble of base classifiers is obtained.
When a test example is presented, each base classifier Cᵢ is used to predict its class label. If the test example is predicted as class 0, then all the classes that belong to A₀ will receive a vote. The votes are tallied and the class that receives the highest vote is assigned to the test example.
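A minimal sketch of this relabeling scheme, assuming scikit-learn decision trees as the base classifiers and integer class labels; the function names are illustrative, not from the text.
~~~python
# Sketch: each base classifier solves a binary problem obtained by randomly
# splitting the class labels into two disjoint subsets A0 and A1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_label_partition_ensemble(X, y, n_classifiers=10, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    ensemble = []
    for _ in range(n_classifiers):
        # Randomly partition the class labels into A0 and A1.
        mask = rng.integers(0, 2, size=classes.size).astype(bool)
        if mask.all() or not mask.any():   # make sure neither subset is empty
            mask[0] = not mask[0]
        A0, A1 = classes[~mask], classes[mask]
        # Relabel: 1 if the example's class belongs to A1, 0 if it belongs to A0.
        y_binary = np.isin(y, A1).astype(int)
        ensemble.append((A0, A1, DecisionTreeClassifier().fit(X, y_binary)))
    return ensemble

def predict_by_class_votes(ensemble, x):
    # Each binary prediction gives one vote to every class in the predicted
    # subset; the class with the highest vote is assigned to the test example.
    votes = {}
    for A0, A1, clf in ensemble:
        subset = A1 if clf.predict(x.reshape(1, -1))[0] == 1 else A0
        for c in subset:
            votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)
~~~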
4 Methods for Constructing an Ensemble Classifier
Manipulating the learning algorithm
to construct an ensemble classifier
Many learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data will result in the construction of different classifiers.
E.g. the topology of an artificial neural network, or the initial weights of the links between its neurons, can be changed.
Similarly, an ensemble of decision trees can be constructed by injecting randomness into the tree-growing procedure.
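A minimal sketch of the neural-network example, assuming scikit-learn's MLPClassifier; the toy dataset is illustrative, not from the text.
~~~python
# Sketch: training the same network architecture on the same data with
# different random initial weights yields different base classifiers.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # toy binary target

# Each base classifier differs only in its random weight initialization.
ensemble = [
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=i).fit(X, y)
    for i in range(5)
]

# Majority vote of the ensemble on the training data.
votes = np.array([clf.predict(X) for clf in ensemble])
prediction = (votes.mean(axis=0) >= 0.5).astype(int)
~~~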
When do ensemble methods show the most improvement?
When used with unstable classifiers.
I.e. base classifiers that are sensitive to minor perturbations in the training set because of high model complexity.
Although unstable classifiers may have a low bias in finding the optimal decision boundary, their predictions have a high variance for minor changes in the training set or model selection.
Bagging
a.k.a Bootstrap Aggregation
Bagging is a technique that repeatedly samples (with replacement) from a dataset according to a uniform probability distribution.
Each bootstrap sample has the same size as the original data.
Because the sampling is done with replacement, some instances may appear several times in the training set, while others may be omitted.
On average, how much of the original training data is included in a bootstrap sample?
Approximately 63%.
Each training instance has a probability
~~~
1 - (1 - 1 / n)ⁿ
~~~
of being selected at least once for a given bootstrap sample.
If n is sufficiently large, this probability converges to 1 - 1/e ≈ 0.632.
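A quick numerical check of this limit in Python (illustrative, not from the text):
~~~python
# Probability that a given instance appears at least once in a bootstrap
# sample of size n drawn with replacement: 1 - (1 - 1/n)^n.
import math

for n in (10, 100, 1000, 1_000_000):
    p = 1 - (1 - 1 / n) ** n
    print(f"n = {n:>9}: probability = {p:.4f}")

print(f"limit 1 - 1/e = {1 - 1 / math.e:.4f}")   # ≈ 0.632
~~~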
Algorithm 6.5
Bagging Algorithm
- Let k be the number of bootstrap samples.
- for i = 1 to k do
- - Create a bootstrap sample Dᵢ of size N.
- - Train a base classifier Cᵢ on the bootstrap sample Dᵢ.
- end for
C*(x) = argmax_y 𝚺ᵢ δ(Cᵢ(x) = y)
δ(·) = 1 if its argument is true and 0 otherwise.
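A minimal sketch of this algorithm, assuming scikit-learn decision trees as the base classifiers and non-negative integer class labels; the function names are illustrative, not from the text.
~~~python
# Sketch of bagging: k bootstrap samples, one base classifier per sample,
# final prediction by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, k=25, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    classifiers = []
    for _ in range(k):
        # Create a bootstrap sample Dᵢ of size N (sampling with replacement).
        idx = rng.integers(0, N, size=N)
        # Train a base classifier Cᵢ on Dᵢ.
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    # C*(x) = argmax_y Σᵢ δ(Cᵢ(x) = y): the class receiving the most votes.
    votes = np.array([clf.predict(X) for clf in classifiers])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
~~~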
Boosting
Boosting is an iterative procedure to adaptively change the distribution of training examples for learning base classifiers so that they increasingly focus on examples that are hard to classify.
Unlike bagging, boosting assigns a weight to each training example and may adaptively change the weight at the end of each boosting round.
Boosting
2 Ways in which the weights assigned to the training examples can be used.
- They can be used to inform the sampling distribution used to draw a set of bootstrap samples from the original data.
- They can be used to learn a model that is biased toward examples with higher weight.
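A minimal sketch of these two uses of the weights, assuming scikit-learn decision stumps; this is not a complete boosting algorithm (no weight-update rule is shown), and the toy data and names are illustrative.
~~~python
# Sketch: two ways boosting can use per-example weights.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary target
weights = np.full(len(y), 1 / len(y))     # boosting starts with uniform weights

# Use 1: draw a bootstrap sample according to the weight distribution, so
# heavily weighted (hard) examples are more likely to be sampled.
idx = rng.choice(len(y), size=len(y), replace=True, p=weights)
clf_sampled = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])

# Use 2: pass the weights to a learner that supports weighted training,
# biasing the model toward examples with higher weight.
clf_weighted = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
~~~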
Random forests
Random forests attempt to improve the generalization performance by constructing an ensemble of decorrelated decision trees.
Random forests build on the idea of bagging, using a different bootstrap sample of the training data for learning each decision tree.
However, a key distinguishing feature of random forests from bagging is that at every internal node of a tree, the best splitting criterion is chosen among a small set of randomly selected attributes.
In this way, random forests construct ensembles of decision trees by not only manipulating training instances (by using bootstrap samples similar to bagging), but also the input attributes (by using different subsets of attributes at every internal node).
Basic Random Forest training procedure
Given a set D consisting of n instances and d attributes.
- Construct a bootstrap sample Dᵢ of the training set by randomly sampling n instances (with replacement) from D.
- Use Dᵢ to learn a decision tree Tᵢ as follows:
- - At every internal node of Tᵢ, randomly sample a set of p attributes and choose the attribute from this subset that gives the maximum reduction in the impurity measure for splitting. Repeat this procedure until every leaf is pure.
Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest.
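A minimal sketch of this procedure using scikit-learn's RandomForestClassifier, which combines bootstrap sampling with a random subset of attributes at each split; the toy dataset and parameter values are illustrative, not from the text.
~~~python
# Sketch: random forest = bagged, unpruned decision trees with a random
# subset of p attributes considered at every internal node.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                       # n = 300 instances, d = 10 attributes
y = (X[:, 0] - X[:, 3] + X[:, 7] > 0).astype(int)    # toy binary target

forest = RandomForestClassifier(
    n_estimators=100,   # number of bootstrap samples / trees
    max_features=3,     # p randomly selected attributes considered at each node
    bootstrap=True,     # each tree sees a bootstrap sample of size n
    random_state=0,
).fit(X, y)

# The final prediction is the majority vote of the individual trees.
print(forest.predict(X[:5]))
~~~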
Key attributes of a random forest
The decision trees involved in a random forest are unpruned trees, as they are allowed to grow to their largest possible size until every leaf is pure.
Hence, the base classifiers of a random forest represent unstable classifiers that have low bias but high variance, because of their large size.
There is a lack of correlation among the model parameters and test predictions of the base classifiers. This can be attributed to the use of an independently sampled dataset Dᵢ for learning every decision tree Tᵢ, similar to the bagging approach. However, random forests have the additional advantage of choosing a splitting criterion at every internal node using a different (and randomly selected) subset of attributes. This property significantly helps in breaking the correlation structure, if any, among the decision trees.