Ensembles Flashcards
Two characteristics of good ensembles
1. Individual models should be strong
2. The correlation between the models in the ensemble should be weak (diversity)
What is bagging?
- Trains N models in parallel using bootstrapped data samples, from an overall training set
- Aggregates using majority voting
- Bootstrap aggregating = bagging
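A minimal sketch of the procedure, assuming NumPy arrays and scikit-learn decision trees as the base learners (the helper names and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=50, random_state=0):
    """Train n_models trees, each on a bootstrap sample (same size, drawn with replacement)."""
    rng = np.random.default_rng(random_state)
    models = []
    n = len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)               # bootstrap sample of indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate the individual predictions by majority vote (assumes integer class labels)."""
    preds = np.array([m.predict(X) for m in models])   # shape: (n_models, n_samples)
    vote = lambda col: np.bincount(col.astype(int)).argmax()
    return np.apply_along_axis(vote, 0, preds)
```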
Bayes Optimal Ensemble
- An ensemble of all hypotheses in the hypothesis space
- On average, no other ensemble can outperform it
- not possible to practically implement
- Tom Mitchell book (1997)
What are some practical approaches to ensembling?
- Bagging
- Random Forests
- Boosting
- Gradient Boosting
- Stacking
Subagging
Bagging, but where we sample without replacement
- Used in large datasets where we want to create bootstrap samples that are smaller than the original dataset
What type of learning algorithms are suited to bagging?
- Decision Trees
- DTs are very sensitive to changes in the training data: a small change can result in a different feature being selected to split the dataset at the root (or high up in the tree), and this has a ripple effect throughout the subtrees under that node
What is subspace sampling?
- A bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset
- Encourages further diversity of the trees within the ensemble
- Has the advantage of reducing training time for each tree
Random Forests
= Bagging + Subspace sampling
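In scikit-learn terms (a usage sketch with placeholder data), bagging and subspace sampling map onto the bootstrap and max_features parameters; note that scikit-learn samples the feature subset at each split rather than once per tree:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# bootstrap=True      -> each tree is trained on a bootstrap sample (bagging)
# max_features="sqrt" -> each split considers only a random subset of features (subspace sampling)
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_features="sqrt", random_state=0)
rf.fit(X, y)
```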
Advantages of Bagging over Boosting
- Simpler to implement + parallelize
- Ease of use
- Reduced training time
However, Caruana et al. (2008) showed that boosted DT ensembles performed best for datasets with fewer than 4,000 descriptive features.
- For more than 4,000 descriptive features, random forests (based on bagging) performed better
- Boosted ensembles may be prone to overfitting, and in domains with large numbers of features, overfitting becomes a serious problem
Costs of using ensembles
1. Increased model complexity
2. Increased learning time
What types of algorithms does Bagging work well for?
Unstable algos - algos whose output classifier undergoes major changes in response to small changes in the training data
Examples of unstable classifiers
DTs, NNs and rule-learning algos are all unstable
Examples of stable classifiers
- Linear regression
- Nearest neighbour
- Linear threshold algorithm
Bootstrap replicate
Training set created from bagging procedure
Contains on average 63.2% of the original training set, with several training examples appearing multiple times (Dietterich paper)
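A quick check of the 63.2% figure (a sketch; n is arbitrary): the chance that a particular example is never drawn in n draws with replacement is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368, so roughly 63.2% of the distinct examples appear in each bootstrap replicate.

```python
import numpy as np

n = 1000
rng = np.random.default_rng(0)
bootstrap_idx = rng.integers(0, n, size=n)   # one bootstrap replicate (indices drawn with replacement)
print(len(np.unique(bootstrap_idx)) / n)     # fraction of distinct originals present, ~0.632
print(1 - (1 - 1 / n) ** n)                  # analytic value, ~0.632
```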
Methods for manipulating training datasets
- Bagging
- K-fold cross-validation (ensembles constructed in this way are called ‘cross-validation committees’)
- Boosting
AdaBoost
Freund & Schapire (1995-8)
- manipulates the training examples to generate multiple hypotheses
- maintains a set of weights over the training examples
- Effect of the change in weights is to place more weight on training examples that were misclassified by h(l) and less weight on examples that were correctly classified
- In subsequent iterations, therefore, Adaboost constructs progressively more difficult learning problems
- final classifier is constructed by a weighted vote of the individual classifiers. Each classifier is weighted (by w(l)) according to its accuracy on the weighted training set that it was trained on
- can be viewed as trying to maximise the margin (confidence of accuracy) on the training data
- constructs each new DT to eliminate ‘residual errors’ that have not been properly handled by the weighted vote of the previously-constructed trees.
- thus, it is directly trying to optimise the weighted vote, and is therefore making a direct attack on the representational problem; directly optimising an ensemble can increase the risk of overfitting
- in high-noise cases, Adaboost puts a large amount of weight on the mislabelled examples and this leads to it overfitting badly
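A usage sketch with scikit-learn's AdaBoostClassifier (hyperparameters are illustrative; the weak-learner argument is named estimator in recent scikit-learn versions, base_estimator in older ones):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=100,
    learning_rate=1.0,
    random_state=0,
)
ada.fit(X, y)
```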
Bagging vs AdaBoost
Dietterich paper: AdaBoost typically outperforms Bagging, but when 20% artificial classification noise was added, AdaBoost overfit the data badly while Bagging did very well in the presence of noise.
Why doesn’t Adaboost overfit more often?
- Stage-wise nature of AdaBoost
- In each iteration, it reweights the training examples, constructs a new hypothesis, and chooses a weight for that hypothesis.
- HOWEVER, it never backs up and modifies the previous choices of hypotheses or weights that it has made to compensate for this new hypothesis
What is an ensemble?
A prediction model that is composed of a set of models
What is the motivation behind ensembles?
- A committee of experts working together on a problem are more likely to solve it successfully than a single expert working alone
- however, should still avoid GROUPTHINK (i.e. each model should make predictions independently of the other models in the ensemble)
Two defining characteristics of ensembles
1. Uses a modified version of the dataset
2. Aggregates predictions from many models
How can ensembles lead to good predictions from base learners that are only marginally better than random guessing?
- Given a large population of independent models, an ensemble can be very accurate even if the individual models in the ensemble perform only marginally better than random guessing
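A worked illustration of this claim, assuming the models are fully independent and each is correct with probability p = 0.55; the majority vote of n such models is correct with binomial probability:

```python
from math import comb

def majority_vote_accuracy(p, n_models):
    """P(a strict majority of n independent models is correct) when each is correct with prob p."""
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(n_models // 2 + 1, n_models + 1))

for n in (11, 101, 1001):
    print(n, round(majority_vote_accuracy(0.55, n), 3))
# rises from ~0.63 with 11 models to ~0.84 with 101 and ~0.999 with 1001
```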
What are the 2 standard approaches to ensembling?
1. Bagging
2. Boosting
Bagging
Bootstrap aggregation
Bagging
Each model in the ensemble is trained on a random sample of the dataset
- each random sample is the same size as the original dataset (unless subagging is used)
- sampling with replacement is used
Boosting
Each new model added to an ensemble is biased to pay more attention to instances that previous models misclassified
How does boosting work?
By incrementally adapting the dataset used to train the models
- uses a weighted dataset
- each instance has an associated weight (w(i) >= 0)
- weights are initially set to 1/n (n=number of examples)
- these weights are used as a distribution over which the dataset is sampled to create a replicated training set
- number of times that an instance is replicated is proportional to its weight
How does boosting work?
Iteratively creating models and adding them to the ensemble
When do the iterations stop in boosting?
- predefined number of models have been added
- model’s accuracy dips below 0.5
Assumptions of boosting algorithm
1. Accuracy of models > 0.5
2. It's a binary classification problem
What are the boosting algorithm steps?
- Induces a model using the weighted dataset and calculates the total error, E, in the set of predictions made by the model for the instances in the training set. The E value is calculated by summing the weights of the training instances for which the predictions by the model are incorrect.
- Increases weights for instances misclassified by the model: w[i] = w[i] * (1 / (2*E))
- Decreases the weights for the instances classified correctly by the model: w[i] = w[i] * (1 / (2*(1-E)))
- Calculates a confidence factor, such that alpha increases as E decreases. alpha = (0.5) * log((1-E)/E)
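A minimal sketch of that reweighting step, assuming NumPy arrays (the function name is illustrative); note the update renormalises the weights so they still sum to 1:

```python
import numpy as np

def boosting_weight_update(weights, correct, epsilon=1e-10):
    """One boosting iteration's reweighting.

    weights: current instance weights (summing to 1)
    correct: boolean array, True where the model classified the instance correctly
    """
    E = weights[~correct].sum()                 # total error: summed weights of misclassified instances
    E = np.clip(E, epsilon, 1 - epsilon)        # guard against division by zero
    new_w = np.where(correct,
                     weights * (1 / (2 * (1 - E))),   # correctly classified: weight decreases
                     weights * (1 / (2 * E)))         # misclassified: weight increases
    alpha = 0.5 * np.log((1 - E) / E)           # confidence factor for this model
    return new_w, alpha
```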
What are the bagging algorithm steps?
- Train N models in parallel using bootstrapped data samples from an overall training set
- Aggregates using majority voting
Why do we sample with replacement in bagging?
It results in duplicates within each of the bootstrap samples and consequently every bootstrap sample will be missing some instances too
- every bootstrap sample will be different, and so every model will be different
Disadvantages of ensembles
1. Increased learning time
2. Increased model complexity
Pros of Bagging
- Simpler to implement and parallelize
=> Ease of use and reduced training time
Evidence of ensembles
Caruana and Niculescu-Mizil (2006) found that bagging and boosting were the best performers out of 7 predictor types
Caruana et al. (2008) - boosted DTs were best performing model of those tested for datasets <4k descriptive features. For more than 4k features, RFs were better as boosting led to overfitting
Summary of Leo Breiman (1996) paper on ‘Bagging Predictors’
Bagging Predictors (Leo Breiman, 1996):
- Bagging can push a good but unstable procedure towards optimality.
- It can degrade the performance of stable procedures.
Gradient Boosting
- Extension of Boosting
- Iteratively train up ensemble
- Train models to predict residuals
e.g. XGBoost
How is Gradient Boosting similar to AdaBoost?
Creates an ensemble model by iteratively adding learners (similar to AdaBoost)
How is Gradient Boosting different to Adaboost?
More aggressive: each new model is fitted directly to the errors of the ensemble built so far, rather than to a reweighted dataset (which is a more subtle adjustment)
How does Gradient Boosting work?
The best model to add next is one that predicts the difference between the true target and the current ensemble's prediction:
t[i] - M_(n-1)(d[i])
(where t[i] is the target for instance d[i] and M_(n-1) is the ensemble after n-1 iterations)
- Uses gradient descent to reduce J (error in predictions)
Where does the gradient in Gradient Boosting come from?
Because we treat the residuals as the negative gradients of the loss function
Under the hood, we’re doing gradient descent on the error surface
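A minimal sketch of gradient boosting for regression with squared loss, where the residuals t[i] - M_(n-1)(d[i]) are exactly the negative gradients; the base learner, depth and learning rate are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1):
    """Fit an ensemble in which each new tree predicts the current residuals."""
    baseline = y.mean()
    prediction = np.full(len(y), baseline)      # M_0: constant baseline model
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction              # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees

def gradient_boost_predict(baseline, trees, X, learning_rate=0.1):
    """Sum the baseline and the (shrunken) contributions of every tree."""
    return baseline + learning_rate * sum(t.predict(X) for t in trees)
```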
Key idea of stacking
Stacking ensembles use a machine learning model to combine the outputs of the base models in an ensemble
Using the predictions of the base models as features at the stacked layer
Can be more effective than simple majority voting or weighted voting
Very common for heterogeneous ensembles
Common to use k-folds to generate the stacked level training set
Requires new datasets to be generated for the stack layer
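A usage sketch of a heterogeneous stacked ensemble using scikit-learn's StackingClassifier (the base models and dataset are placeholders); the cv argument controls the k-fold generation of the stack-layer training set:

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[                                   # heterogeneous base models
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),          # meta-model trained on base-model predictions
    cv=5,                                          # k-fold predictions form the stack-layer training set
)
stack.fit(X, y)
```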
Cons of Stacking
Produces small gains for lots of extra complexity