Ensembles Flashcards

1
Q

Two characteristics of good ensembles

A
  1. Individual models should be strong

2. The correlation between the models in the ensemble should be weak (diversity)

2
Q

What is bagging?

A
  • Trains N models in parallel using bootstrapped data samples from an overall training set
  • Aggregates using majority voting
  • Bootstrapped aggregating = bagging
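
A minimal sketch of this procedure (assuming numpy arrays, scikit-learn decision trees as the base learners, and non-negative integer class labels; function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=50, seed=0):
    """Train n_models trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)               # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate the individual predictions by majority vote."""
    votes = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_instances)
    # assumes non-negative integer class labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```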
3
Q

Bayes Optimal Ensemble

A
  • An ensemble of all hypotheses in the hypothesis space
  • On average, no other ensemble can outperform it
  • not possible to practically implement
  • Tom Mitchell book (1997)
4
Q

What are some practical approaches to ensembling?

A
  1. Bagging
  2. Random Forests
  3. Boosting
  4. Gradient Boosting
  5. Stacking
5
Q

Subagging

A

Bagging, but where we sample without replacement

  • Used with large datasets, where we want to create bootstrap samples that are smaller than the original dataset
6
Q

What types of learning algorithms are suited to bagging?

A
  • Decision Trees
  • DTs are very sensitive to changes in the training data: a small change can result in a different feature being selected to split the dataset at the root, or high up in the tree, and this can have a ripple effect throughout the subtrees under that node
7
Q

What is subspace sampling?

A
  • A bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset
  • Encourages further diversity of the trees within the ensemble
  • Has the advantage of reducing training time for each tree
8
Q

Random Forests

A

= Bagging + Subspace sampling
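
One way to see both ingredients together is scikit-learn's implementation; note that scikit-learn re-samples the feature subset at every split rather than once per bootstrap sample, but the idea is the same (parameter values here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# bootstrap=True -> bagging: each tree is trained on a bootstrap sample
# max_features   -> subspace sampling: only a random subset of features is
#                   considered (scikit-learn re-samples this at every split)
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_features="sqrt", random_state=0)
rf.fit(X, y)
```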

9
Q

Advantages of Bagging over Boosting

A
  1. Simpler to implement + parallelize
  2. Ease of use
  3. Reduced training time

However, Caruana et al. (2008) showed that boosted DT ensembles performed best for datasets with < 4000 descriptive features.
For > 4000 descriptive features, random forests (based on bagging) performed better.
Boosted ensembles may be prone to overfitting, and in domains with large numbers of features overfitting becomes a serious problem.

10
Q

Costs of using ensembles

A
  1. Increased model complexity
  2. Increased learning time

11
Q

What types of algorithms does Bagging work well for?

A

Unstable algos - algos whose output classifier undergoes major changes in response to small changes in the training data

12
Q

Examples of unstable classifiers

A

DTs, NNs and rule-learning algos are all unstable

13
Q

Examples of stable classifiers

A
  • Linear regression
  • Nearest neighbour
  • Linear threshold algorithm
14
Q

Bootstrap replicate

A

Training set created by the bagging procedure

Contains on average 63.2% of the original training set, with several training examples appearing multiple times (Dietterich paper)
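
The 63.2% figure follows from a short calculation: with n training instances, the probability that a particular instance is never drawn in n samples with replacement is (1 - 1/n)^n, which approaches e^(-1) ≈ 0.368 as n grows, so each bootstrap replicate contains roughly 1 - 0.368 ≈ 63.2% of the distinct original instances.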

15
Q

Methods for manipulating training datasets

A
  1. Bagging
  2. K-fold cross-validation (ensembles constructed in this way are called ‘cross-validation committees’)
  3. Boosting
16
Q

AdaBoost

A

Freund & Schapire (1995-8)

  • manipulates the training examples to generate multiple hypotheses
  • maintains a set of weights over the training examples
  • Effect of the change in weights is to place more weight on training examples that were misclassified by h(l) and less weight on examples that were correctly classified
  • In subsequent iterations, therefore, AdaBoost constructs progressively more difficult learning problems
  • final classifier is constructed by a weighted vote of the individual classifiers. Each classifier is weighted (by w(l)) according to its accuracy on the weighted training set that it was trained on
  • can be viewed as trying to maximise the margin (confidence of the classification) on the training data
  • constructs each new DT to eliminate ‘residual errors’ that have not been properly handled by the weighted vote of the previously-constructed trees.
  • thus, it is directly trying to optimise the weighted vote and is therefore making a direct attack on the representational problem
  • directly optimising an ensemble can increase the risk of overfitting
  • in high-noise cases, AdaBoost puts a large amount of weight on the mislabelled examples and this leads to it overfitting badly
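
For reference, a minimal usage sketch with scikit-learn's implementation of this algorithm (its default base learner is a depth-1 decision tree; parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default base learner is a depth-1 decision tree (a "decision stump");
# n_estimators is the number of boosting iterations.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```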
17
Q

Bagging vs AdaBoost

A

Dietterich paper: AdaBoost typically outperforms Bagging, but when 20% artificial classification noise was added, AdaBoost overfit the data badly while Bagging did very well in the presence of noise.

18
Q

Why doesn’t AdaBoost overfit more often?

A
  • Stage-wise nature of AdaBoost
  • In each iteration, it reweights the training examples, constructs a new hypothesis, and chooses a weight for that hypothesis.
  • HOWEVER, it never backs up and modifies the previous choices of hypotheses or weights that it has made to compensate for this new hypothesis
19
Q

What is an ensemble?

A

A prediction model that is composed of a set of models

20
Q

What is the motivation behind ensembles?

A
  • A committee of experts working together on a problem are more likely to solve it successfully than a single expert working alone
  • however, should still avoid GROUPTHINK (i.e. each model should make predictions independently of the other models in the ensemble)
21
Q

Two defining characteristics of ensembles

A
  1. Uses a modified version of the dataset
  2. Aggregates predictions from many models

22
Q

How can ensembles lead to good predictions from base learners that are only marginally better than random guessing?

A
  • Given a large population of independent models, an ensemble can be very accurate even if the individual models in the ensemble perform only marginally better than random guessing
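
A quick worked example of this effect (the numbers are purely illustrative): with 101 independent models that are each correct 55% of the time, the majority vote is correct whenever at least 51 of them are right.

```python
from scipy.stats import binom

n, p = 101, 0.55                    # 101 independent models, each 55% accurate
# P(majority vote correct) = P(at least 51 of the 101 models are correct)
majority_accuracy = binom.sf(50, n, p)
print(round(majority_accuracy, 3))  # ≈ 0.84, far better than any single model
```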
23
Q

What are the 2 standard approaches to ensembling?

A
  1. Bagging

2. Boosting

24
Q

Bagging

A

Bootstrap aggregation

25
Q

Bagging

A

Each model in the ensemble is trained on a random sample of the dataset

  • each random sample is the same size as the original dataset (unless subagging is used)
  • sampling with replacement is used
26
Q

Boosting

A

Each new model added to an ensemble is biased to pay more attention to instances that previous models misclassified

27
Q

How does boosting work?

A

By incrementally adapting the dataset used to train the models

  • uses a weighted dataset
  • each instance has an associated weight (w(i) >= 0)
  • weights are initially set to 1/n (n=number of examples)
  • these weights are used as a distribution over which the dataset is sampled to create a replicated training set
  • number of times that an instance is replicated is proportional to its weight
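
A minimal sketch of this weighted resampling step (hypothetical helper, assumes numpy arrays):

```python
import numpy as np

def weighted_resample(X, y, weights, seed=0):
    """Create a replicated training set in which each instance appears
    with expected frequency proportional to its weight."""
    rng = np.random.default_rng(seed)
    n = len(X)
    probs = weights / weights.sum()        # treat the weights as a distribution
    idx = rng.choice(n, size=n, p=probs)   # weighted sampling with replacement
    return X[idx], y[idx]

# Initially every weight is 1/n, so the first sample is a plain bootstrap sample.
```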
28
Q

How does boosting work?

A

Iteratively creating models and adding them to the ensemble

29
Q

When do the iterations stop in boosting?

A
  • a predefined number of models has been added
  • the model’s accuracy dips below 0.5

30
Q

Assumptions of boosting algorithm

A
  1. Accuracy of models > 0.5

2. Assumes it’s a binary classification problem

31
Q

What are the boosting algorithm steps?

A
  1. Induces a model using the weighted dataset and calculates the total error, E, in the set of predictions made by the model for the instances in the training set. The E value is calculated by summing the weights of the training instances for which the predictions by the model are incorrect.
  2. Increases the weights for instances misclassified by the model: w[i] = w[i] * (1 / (2 * E))
  3. Decreases the weights for the instances classified correctly by the model: w[i] = w[i] * (1 / (2 * (1 - E)))
  4. Calculates a confidence factor, alpha, for the model such that alpha increases as E decreases: alpha = 0.5 * log((1 - E) / E)
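
A sketch of a single iteration following these steps (assumes numpy, an already-trained classifier `model`, and instance weights `w` that sum to 1; purely illustrative):

```python
import numpy as np

def boosting_iteration(model, X, y, w):
    """One boosting iteration: compute the weighted error E, update the
    instance weights, and return the model's confidence factor alpha.
    Assumes the weights w sum to 1 and that 0 < E < 0.5."""
    wrong = model.predict(X) != y
    E = w[wrong].sum()                       # total weighted error

    w = w.copy()
    w[wrong]  *= 1.0 / (2.0 * E)             # misclassified instances gain weight
    w[~wrong] *= 1.0 / (2.0 * (1.0 - E))     # correctly classified ones lose weight
    # after this update the weights again sum to 1 (0.5 + 0.5)

    alpha = 0.5 * np.log((1.0 - E) / E)      # confidence factor for this model
    return w, alpha
```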
32
Q

What are the bagging algorithm steps?

A
  1. Train N models in parallel using bootstrapped data samples from an overall training set
  2. Aggregates using majority voting
33
Q

Why do we sample with replacement in bagging?

A

It results in duplicates within each of the bootstrap samples and consequently every bootstrap sample will be missing some instances too

  • every bootstrap sample will be different, and so every model will be different
34
Q

Disadvantages of ensembles

A
  1. Increased learning time
  2. Increased model complexity

35
Q

Pros of Bagging

A
  1. Simpler to implement and parallelize

=> Ease of use and reduced training time

36
Q

Evidence of ensembles

A

Caruana and Niculescu-Mizil (2006) found that bagging and boosting were the best performers out of 7 predictor types

Caruana et al. (2008) - boosted DTs were best performing model of those tested for datasets <4k descriptive features. For more than 4k features, RFs were better as boosting led to overfitting

37
Q

Summary of Leo Breiman (1996) paper on ‘Bagging Predictors’

A

Bagging Predictors,
1996, Leo Breiman

bagging can push a good but unstable procedure towards optimality.
It can degrade the performance of stable procedures.

38
Q

Gradient Boosting

A
  • Extension of Boosting
  • Iteratively train up ensemble
  • Train models to predict residuals

e.g. XGBoost

39
Q

How is Gradient Boosting similar to AdaBoost?

A

Creates an ensemble model by iteratively adding learners (similar to Ada)

40
Q

How is Gradient Boosting different to AdaBoost?

A

More aggressive: it fits each new model directly to the errors of the ensemble built up to the current iteration, rather than to a reweighted dataset, which is a more subtle adjustment

41
Q

How does Gradient Boosting work?

A

The best model to add is one that predicts the difference between the true target and the current ensemble’s prediction, i.e. the residual

t[i] - M_(n-1)(d[i])

where M_(n-1) is the ensemble built from the first n-1 models

  • Uses gradient descent to reduce J (error in predictions)
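
A minimal regression sketch of this residual-fitting loop (assumes squared error, scikit-learn regression trees, and a fixed learning rate; names like `n_rounds` are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, t, n_rounds=100, lr=0.1):
    """Start from the mean prediction, then repeatedly fit a small tree
    to the current residuals t[i] - M_(n-1)(d[i]) and add it to the ensemble."""
    pred = np.full(len(t), t.mean())
    trees = []
    for _ in range(n_rounds):
        residuals = t - pred                  # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += lr * tree.predict(X)          # small step in the residual direction
        trees.append(tree)
    return t.mean(), trees

def gradient_boost_predict(base, trees, X, lr=0.1):
    return base + lr * sum(tree.predict(X) for tree in trees)
```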
42
Q

Where does the gradient in Gradient Boosting come from?

A

Because we treat the residuals as the negative gradients of the loss function

Under the hood, we’re doing gradient descent on the error surface
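
For squared error this is easy to check: with L = ½ (t[i] - M(d[i]))², the derivative of L with respect to the prediction M(d[i]) is -(t[i] - M(d[i])), so the negative gradient is exactly the residual t[i] - M(d[i]). Fitting the next model to the residuals is therefore a gradient descent step on the ensemble’s error.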

43
Q

Key idea of stacking

A

Stacking ensembles use a machine learning model to combine the outputs of the base models in an ensemble

Using the predictions of the base models as features at the stacked layer

Can be more effective than simple majority voting or weighted voting

Very common with heterogeneous ensembles

Common to use k-folds to generate the stacked level training set

Requires new datasets to be generated for the stack layer
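
scikit-learn packages this pattern as StackingClassifier; a minimal sketch (the choice of base models and the cv value are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Heterogeneous base models; cv=5 builds the stacked-level training set from
# out-of-fold predictions (the k-fold trick mentioned above).
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X, y)
```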

44
Q

Cons of Stacking

A

Produces small gains for lots of extra complexity