Topic 5: Ensemble Methods Flashcards
What does ensemble mean
Having multiple predictive models and combining their predictions
-> treating them as a “committee”
What is a Committee
A group of predictive models
What is the underlying principle of ensemble methods
That in real-world situations, every model has limitations and will make some errors
What models can the ensemble algorithm be applied to
Most
from simple decision trees to deep CNNs
What is Jensen’s Inequality
If X is a random variable and f(x) is a convex function, then:
E [ f(X) ] ≥ f (E [X])
Essentially, taking f as the squared distance from the truth: the squared error of the mean guess x̄ is less than (or equal to) the average of the squared errors of the individual guesses xᵢ
So the error of the combined value is at most the average error of the individual values
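A minimal numeric sketch of the inequality with f(x) = (x − t)², using hypothetical guesses of a true value t = 10:

```python
# Hypothetical guesses x_i of a true value t = 10; f(x) = (x - t)^2 is convex.
guesses = [8.0, 9.5, 12.0, 11.5]
t = 10.0

mean_guess = sum(guesses) / len(guesses)                          # E[X]
err_of_mean = (mean_guess - t) ** 2                               # f(E[X])
mean_of_errs = sum((x - t) ** 2 for x in guesses) / len(guesses)  # E[f(X)]

# Jensen: E[f(X)] >= f(E[X]) -- the mean guess beats the average guess
print(err_of_mean, mean_of_errs)
```

Here the mean guess (10.25) has squared error 0.0625, while the average of the individual squared errors is 2.625.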
What is Jensen’s inequality applied to ensemble methods
The squared error of the ensemble is guaranteed to be less than or equal to the average squared error of the individual estimators
What is the relationship between ensemble size and probability of ensemble error
If we fix the individual error to ε = 0.3 (Probability that one model makes an error)
And then increase the ensemble size
The probability of ensemble error decreases
If ε decreases, the whole graph (ensemble size vs. error) is compressed along the y-axis (shifted down)
What is the assumption we are making in ensemble probability models
Each model is independent
What is the issue with increasing the number of models in an ensemble on the same training data
In order for each model to be independent (a necessary requirement), the training data has to be split n ways (for n models), using the assumption that all data is IID
However, shrinking training set sizes mean each model becomes less and less accurate on test data
Why do we need parallel algorithms
To try and create “diversity” between classifiers in Ensembles, whilst maintaining a reasonable degree of accuracy
E.g. bagging and random forests
What is the bagging algorithm
“Bootstrap AGGregating”
Given a dataset of size n, randomly select n examples WITH replacement
E.g. from {1, 2, 3, 4, 5, 6} (n = 6): 1 1 2 4 5 5
This is a ‘bootstrap’
What is a data bootstrap
a random sample of examples from our original dataset
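A minimal sketch of drawing one bootstrap from a toy dataset (values chosen for illustration):

```python
import random

random.seed(0)                      # fixed seed so the draw is repeatable
data = [1, 2, 3, 4, 5, 6]           # toy dataset, n = 6

# One bootstrap: n draws WITH replacement, so duplicates are expected
bootstrap = [random.choice(data) for _ in range(len(data))]
print(bootstrap)
```

In bagging, each model in the ensemble is trained on its own bootstrap drawn this way.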
For a classification problem, how do you combine model results
Majority vote
For a regression problem, how do you combine model results
Take the average
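Both combination rules in one minimal sketch, with hypothetical per-model outputs:

```python
from collections import Counter

class_preds = ["cat", "dog", "cat"]   # classification: majority vote
reg_preds = [2.0, 2.5, 3.0]           # regression: average

vote = Counter(class_preds).most_common(1)[0][0]
avg = sum(reg_preds) / len(reg_preds)
print(vote, avg)   # cat 2.5
```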
How can you use non-uniform weights in the bagging algorithm
Requires an extra holdout dataset separate from train and test data
Instead of simply averaging the predictions from each model, different weights are assigned to each model’s prediction based on their performance
This data helps assess the performance of the model with different assigned weights
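One simple weighting scheme (an assumption for illustration, not the only option): weight each model by its accuracy on the holdout set, normalised to sum to 1:

```python
# Hypothetical holdout accuracies for three models
holdout_acc = [0.9, 0.6, 0.75]
total = sum(holdout_acc)
weights = [a / total for a in holdout_acc]   # normalise so weights sum to 1

# Hypothetical regression predictions from each model
preds = [3.0, 5.0, 4.0]
weighted = sum(w * p for w, p in zip(weights, preds))
print(weighted)
```

The better-performing model pulls the combined prediction toward its own output, instead of all models counting equally.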
What is the risk of using non-uniform weights in the bagging algorithm
Overfitting
The weights may become overly tuned to the specific characteristics of the holdout data
Bagging: are we actually discarding a lot of training data in our random sampling?
No, our classifiers are statistically dependent on each other
Almost every example is virtually certain to appear in at least one bootstrap
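The claim can be checked directly: an example is missing from one bootstrap with probability (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368, and from all of m independent bootstraps with probability ≈ 0.368ᵐ, which vanishes quickly:

```python
# Probability an example is missing from ONE bootstrap of size n: (1 - 1/n)^n
n = 1000
p_missing_one = (1 - 1 / n) ** n     # approaches e^-1 ~ 0.368 for large n

# Probability it is missing from ALL m bootstraps shrinks exponentially
for m in (1, 5, 10):
    print(m, p_missing_one ** m)
```

So each individual model sees only ~63% of the distinct examples, but with even ten models almost every example trains something.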