08 Ensemble Methods Flashcards

1
Q

Ensemble methods aim to combine multiple experts and have them vote, thereby improving predictive performance.

What are drawbacks of such ensemble methods?

A

Disadvantage:
* usually produces a combined model that is very hard to analyze
* but: there are approaches that aim to produce a single comprehensible structure

2
Q

Explain bagging conceptually, going over its idealized version, the condition under which it improves performance, and a main issue associated with it, along with a common remedy.

Further explain the advantages of bagging.

A

Concept: Combine models with equal weights.

Ideally: sample several training sets of size n from the population, building a classifier for each and combining their predictions

Improvement condition: the base learner is unstable, i.e., a small change in the training data can cause a big change in the model (e.g., decision trees)

Main issue: we only have one training set. Remedy: bootstrapping, i.e., sampling with replacement; this still reduces variance even though the bootstrap samples are dependent.

Advantages:
* can help a lot when the data is noisy
* can be applied to both numeric prediction and classification tasks
* can be parallelized, in contrast to sequential ensemble methods such as boosting
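
Below is a minimal sketch of the bootstrap-and-vote idea (assuming scikit-learn decision trees as the unstable base learner and small non-negative integer class labels; the function names are illustrative, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # an unstable base learner

def bagging_fit(X, y, n_models=25, random_state=0):
    """Train n_models trees, each on a bootstrap sample of the training set."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.randint(0, n, size=n)  # bootstrap: draw n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the predictions with equal weights (majority vote)."""
    votes = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Because every tree is trained independently on its own bootstrap sample, the loop parallelizes trivially, in contrast to the sequential boosting methods covered later.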

3
Q

Differentiate between bias and variance.

  • Total expected error = bias + variance

Combining multiple classifiers generally …
* We assume …
The bias-variance decomposition to understand both components:
* Training error reflects …
* For example, bootstrap training data sets B from a data set D and evaluate against samples in D − B to estimate bias and variance empirically.

A
  • Bias = expected error of the ensemble classifier on new data (high bias = underfitting)
  • Variance = component of the expected error due to the particular training set being used to
    build our classifier (high variance = overfitting)

decreases the expected error by reducing variance.

noise inherent in the data is part of the bias component as it cannot normally be measured.

bias but not variance; test error reflects both.
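
A rough sketch of the empirical recipe from the question (bootstrap B from D, evaluate on the out-of-bag samples D − B). The split of the out-of-bag error into a bias-like and a variance-like part follows the common majority-vote ("main prediction") convention for 0-1 loss; other decompositions exist, so treat the numbers as indicative only. Integer class labels and numpy arrays are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bias_variance_estimate(X, y, n_boot=50, random_state=0):
    """Collect out-of-bag predictions over many bootstrap models, then
    split the out-of-bag error into bias-like and variance-like parts."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    preds = [[] for _ in range(n)]                 # out-of-bag predictions per instance
    for _ in range(n_boot):
        idx = rng.randint(0, n, size=n)            # bootstrap sample B drawn from D
        oob = np.setdiff1d(np.arange(n), idx)      # D - B: instances not in this bootstrap
        if len(oob) == 0:
            continue
        model = DecisionTreeClassifier().fit(X[idx], y[idx])
        for i, p in zip(oob, model.predict(X[oob])):
            preds[i].append(p)
    bias_terms, variance_terms = [], []
    for i, p_i in enumerate(preds):
        if not p_i:                                  # instance never left out of the bag
            continue
        p_i = np.asarray(p_i)
        main = np.bincount(p_i).argmax()             # majority ("main") prediction for instance i
        bias_terms.append(main != y[i])              # bias: the main prediction is wrong
        variance_terms.append(np.mean(p_i != main))  # variance: disagreement with the main prediction
    return np.mean(bias_terms), np.mean(variance_terms)
```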

4
Q

Random Forests
We can also randomize the learning algorithm instead of the input data

Can be combined with bagging
* when using decision trees, this yields…

Random forests randomize data and features!
* Random decision forests correct for …
* Initial proposal by Tin Kam Ho (1995)
* “Random Forests – Random Features” (Leo Breiman, 1997)

A
  • pick 𝑚 options at random from the full set of options, then choose the best of
    those 𝑚 choices
  • e.g., randomize the attribute selection in decision trees

the random forest method for building ensemble classifiers

decision trees’ habit of overfitting
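
One standard library realization of this idea is scikit-learn's RandomForestClassifier, where max_features plays the role of m and bagging is switched on via bootstrap; X_train and y_train below are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier

# Randomize both the data (bootstrap = bagging) and the features: at every
# split only m = max_features randomly chosen attributes are considered,
# and the best of those m is used.
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    max_features="sqrt",  # m = sqrt(#features); an integer would fix m explicitly
    bootstrap=True,       # each tree is trained on a bootstrap sample
    random_state=0,
)
# forest.fit(X_train, y_train); forest.predict(X_test)
```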

5
Q

Boosting
Boosting combines …
* Also uses voting/averaging, but …
New models are influenced by performance of previously built ones
* New model to become an “expert” for…

  • Intuitive justification: models should be experts that complement each other

Bias vs. variance for boosting, bagging, and random forests.

A

several weak learners into a strong learner; weights models according to performance

instances misclassified by earlier models

  • Boosting tries to minimize bias, i.e., the training error of the simple learners
  • Random forests aggregate fully grown trees, each of which has low bias but high variance; aggregating such trees reduces the variance of the final predictor
  • Bagging likewise aims primarily to reduce variance (error on test data)
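
A compact AdaBoost-style sketch of this mechanism (a sketch under the assumption that the labels are in {-1, +1} and that numpy arrays are used; function names are illustrative): misclassified instances get larger weights for the next round, and each model's vote is weighted by its performance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_fit(X, y, n_rounds=20):
    """Sequential boosting of decision stumps; y is assumed to be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                       # start with uniform instance weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)    # weighted training error
        if err <= 0 or err >= 0.5:                # perfect, or no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)     # model weight grows as its error shrinks
        w *= np.exp(-alpha * y * pred)            # raise weights of misclassified instances
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, X):
    """Performance-weighted vote of all models."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```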
6
Q

More on Boosting
The training error of AdaBoost …
AdaBoost.M1 works well with so-called weak learners; …
* example of weak learner: decision stump (1-level decision trees)

Boosting needs instance weights, but it can also be applied when the learner cannot handle weights: …

In practice, boosting can …
Boosting is related to a more general idea that has been rediscovered in multiple fields under different names, such as the multiplicative weights update method or additive models.

A

drops exponentially fast.

only condition: the weighted error stays below 0.5

resample data with probability determined by weights

overfit if too many iterations are performed (in contrast to bagging), resulting in poor performance on test data.
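
For reference, a usage sketch with scikit-learn's AdaBoostClassifier and decision stumps as the weak learners (the keyword for the base learner is `estimator` in recent scikit-learn versions and `base_estimator` in older ones; X_train and y_train are placeholders):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=100,    # boosting iterations; too many can overfit
    learning_rate=1.0,
    random_state=0,
)
# ada.fit(X_train, y_train); ada.predict(X_test)
```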

7
Q

Gradient Boosted Trees (GBT)

Gradient Boosted Trees (GBT) are related to FSAM (Forward Stagewise Additive Modeling) and use the same pseudocode.
The fundamental difference is in how residuals are computed and used. FSAM uses direct residuals (actual minus predicted values), while GBT …
The cross-entropy loss L(y, p) = −∑ᵢ [yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ)] between the distribution of labels y and the predictions p can be used as a loss function to …
Pseudo-residuals r for binary classification are the negative derivative of the loss function with respect to the predicted probability p:

A

uses pseudo-residuals derived from the gradient of some loss function.

quantify the difference between these distributions in binary classification.

dL(y, p)/dpᵢ = (pᵢ − yᵢ) / (pᵢ (1 − pᵢ)), hence rᵢ = −dL/dpᵢ = (yᵢ − pᵢ) / (pᵢ (1 − pᵢ))
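
A small numpy sketch of these pseudo-residuals. Note that many practical GBT implementations take the gradient with respect to the raw score (log-odds) instead, which simplifies the residual to y − p; the sketch below follows the formulation above, with respect to the predicted probability p:

```python
import numpy as np

def pseudo_residuals(y, p, eps=1e-12):
    """r = -dL/dp = (y - p) / (p * (1 - p)) for the cross-entropy loss."""
    p = np.clip(p, eps, 1 - eps)      # avoid division by zero at p = 0 or p = 1
    return (y - p) / (p * (1 - p))

# The next tree in the ensemble would then be fitted to these residuals:
y = np.array([1, 0, 1, 1])
p = np.array([0.7, 0.4, 0.2, 0.9])
r = pseudo_residuals(y, p)
```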

8
Q

FSAM vs. GBT
FSAM focuses on fitting new components to the residuals to directly correct errors, whereas GBT aims to …
FSAM may retain more interpretability due to its simpler components and approach, while GBT often …
MART and XGBoost are specific implementations or variations of the basic GBT framework. Hyperparameters such as …
The sequential process can be slow to train in both cases.

A

minimize the overall loss function by moving in the direction indicated by the negative gradient of the loss.

achieves higher predictive performance but can be less interpretable.

the learning rate or the number of trees combat overfitting.
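
As an illustration with scikit-learn's GradientBoostingClassifier as one concrete GBT implementation (XGBoost exposes analogous knobs, e.g. the learning rate eta and the number of boosting rounds); X_train and y_train are placeholders:

```python
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(
    n_estimators=200,    # number of trees: more trees mean more capacity
    learning_rate=0.05,  # shrinks each tree's contribution; smaller values need more trees
    max_depth=3,         # depth of the individual trees
    subsample=0.8,       # stochastic gradient boosting: each tree sees a random subset of the data
    random_state=0,
)
# gbt.fit(X_train, y_train); gbt.predict(X_test)
```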
