Chapter 6.10: Ensemble Methods Flashcards

1
Q

Ensemble method

A

An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier.

2
Q

2 Necessary conditions for an ensemble classifier to perform better than a single classifier

A
  • The base classifiers should be independent of each other.
  • The base classifiers should do better than a classifier that performs random guessing.
3
Q

4 Methods for Constructing an Ensemble Classifier

A

By manipulating:
- the training set
- the input features
- the class labels
- the learning algorithm

4
Q

4 Methods for Constructing an Ensemble Classifier

Manipulating the training set

to construct an ensemble classifier

A

In this approach, multiple training sets are created by resampling the original data according to some sampling distribution, and a classifier is constructed from each training set.

E.g. bagging & boosting

5
Q

4 Methods for Constructing an Ensemble Classifier

Manipulating the input features

to construct an ensemble classifier

A

In this approach, a subset of input features is chosen to form each training set.

The subset can be either chosen randomly or based on the recommendation of domain experts.

Random forests manipulate their input features and use decision trees as their base classifiers.

6
Q

4 Methods for Constructing an Ensemble Classifier

Manipulating the class labels

to construct an ensemble classifier

A

This method can be used when the number of classes is sufficiently large.

The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A₀ and A₁.

The relabeled examples are then used to train a base classifier. By repeating this process multiple times, an ensemble of base classifiers is obtained.

When a test example is presented, each base classifier Cᵢ is used to predict its class label. If the test example is predicted as class 0, then all the classes that belong to A₀ will receive a vote. The votes are tallied, and the class that receives the highest number of votes is assigned to the test example.
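
A minimal sketch of this relabeling idea, assuming integer-like class labels and using a scikit-learn decision tree as the binary base learner (function names are illustrative):

~~~
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_relabel_ensemble(X, y, n_rounds=10, seed=0):
    # Randomly partition the class labels into A0 and A1, relabel the data,
    # and train a binary base classifier; repeat n_rounds times.
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    ensemble = []
    for _ in range(n_rounds):
        mask = rng.random(len(classes)) < 0.5
        if mask.all() or not mask.any():           # avoid an empty subset
            continue
        A0 = set(classes[mask])
        y_bin = np.where(np.isin(y, list(A0)), 0, 1)
        ensemble.append((DecisionTreeClassifier().fit(X, y_bin), A0))
    return ensemble, classes

def predict_relabel_ensemble(ensemble, classes, X):
    # Every class in A0 gets a vote when C_i predicts 0, and vice versa.
    votes = np.zeros((len(X), len(classes)))
    for clf, A0 in ensemble:
        pred = clf.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += (pred == 0) if c in A0 else (pred == 1)
    return classes[votes.argmax(axis=1)]
~~~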

7
Q

4 Methods for Constructing an Ensemble Classifier

Manipulating the learning algorithm

to construct an ensemble classifier

A

Many learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data will result in the construction of different classifiers.

E.g. an artificial neural network can change its topology or the initial weights of the links between neurons.

Similarly, an ensemble of decision trees can be constructed by injecting randomness into the tree-growing procedure.

8
Q

When do ensemble methods show the most improvement?

A

When used with unstable classifiers.

I.e. base classifiers that are sensitive to minor perturbations in the training set, because of high model complexity.

Although unstable classifiers may have a low bias in finding the optimal decision boundary, their predictions have a high variance for minor changes in the training set or model selection.

9
Q

Bagging

A

a.k.a. Bootstrap Aggregation

Bagging is a technique that repeatedly samples (with replacement) from a dataset according to a uniform probability distribution.

Each bootstrap sample has the same size as the original data.

Because the sampling is done with replacement, some instances may appear several times in the training set, while others may be omitted.

10
Q

On average, how much of the original training data is included in a bootstrap sample?

A

Approximately 63%.

Each instance has a probability
~~~
1 - (1 - 1 / n)ⁿ
~~~
of being selected at least once in a bootstrap sample of size n.

If n is sufficiently large, this probability converges to 1 - 1/e ≈ 0.632.
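
A quick numerical check of this limit (illustrative only):

~~~
# Probability that a given instance appears at least once in a bootstrap
# sample of size n, for increasing n.
for n in (10, 100, 1000, 1_000_000):
    print(n, round(1 - (1 - 1 / n) ** n, 4))   # approaches 1 - 1/e ≈ 0.6321
~~~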

11
Q

Algorithm 6.5

Bagging Algorithm

A
  1. Let k be the number of bootstrap samples.
  2. for i = 1 to k do
  3.   Create a bootstrap sample Dᵢ of size N.
  4.   Train a base classifier Cᵢ on the bootstrap sample Dᵢ.
  5. end for
  6. C*(x) = argmax_y Σᵢ δ(Cᵢ(x) = y)

δ(.) = 1 if its argument is true and 0 otherwise.
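
A minimal sketch of Algorithm 6.5, assuming scikit-learn decision trees as the base classifiers and integer class labels (function names are illustrative):

~~~
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    # Steps 1-5: draw k bootstrap samples of size N (with replacement) and
    # train one base classifier C_i on each sample D_i.
    rng = np.random.default_rng(seed)
    N = len(X)
    return [DecisionTreeClassifier().fit(X[idx], y[idx])
            for idx in (rng.integers(0, N, size=N) for _ in range(k))]

def bagging_predict(classifiers, X):
    # Step 6: C*(x) = argmax_y sum_i delta(C_i(x) = y), i.e. a majority vote.
    votes = np.stack([clf.predict(X) for clf in classifiers])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
~~~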

12
Q

Boosting

A

Boosting is an iterative procedure to adaptively change the distribution of training examples for learning base classifiers so that they increasingly focus on examples that are hard to classify.

Unlike bagging, boosting assigns a weight to each training example and may adaptively change the weight at the end of each boosting round.

13
Q

Boosting

2 Ways in which the weights assigned to the training examples can be used.

A
  1. They can be used to inform the sampling distribution used to draw a set of bootstrap samples from the original data.
  2. They can be used to learn a model that is biased toward examples with higher weight.
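
A minimal sketch of the first option using NumPy with illustrative weights (the exact weight-update rule depends on the boosting variant, e.g. AdaBoost):

~~~
import numpy as np

rng = np.random.default_rng(0)
n = 10
weights = np.full(n, 1.0 / n)            # start with uniform weights

# Suppose examples 3 and 7 were misclassified in the previous round:
# boost their weights and renormalize.
weights[[3, 7]] *= 3.0
weights /= weights.sum()

# Use the weights as the sampling distribution for the next bootstrap sample,
# so hard-to-classify examples are drawn more often.
sample_idx = rng.choice(n, size=n, replace=True, p=weights)
print(sample_idx)
~~~
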
14
Q

Random forests

A

Random forests attempt to improve the generalization performance by constructing an ensemble of decorrelated decision trees.

Random forests build on the idea of bagging to use a different bootstrap sample of the training data for learning decision trees.

However, a key feature distinguishing random forests from bagging is that at every internal node of a tree, the best splitting criterion is chosen among a small set of randomly selected attributes.

In this way, random forests construct ensembles of decision trees by not only manipulating training instances (by using bootstrap samples similar to bagging), but also the input attributes (by using different subsets of attributes at every internal node).

15
Q

Basic Random Forest training procedure

A

Given a set D consisting of n instances and d attributes.

  1. Construct a bootstrap sample Dᵢ of the training set by randomly sampling n instances (with replacement) from D.
  2. Use Dᵢ to learn a decision tree Tᵢ as follows:
    - At every internal node of Tᵢ, randomly sample a set of p attributes and choose the attribute from this subset that gives the maximum reduction in the impurity measure for splitting. Repeat this procedure until every leaf is pure.

Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest.
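
A minimal sketch with scikit-learn (parameter names are sklearn's, not the text's): n_estimators is the number of trees and max_features plays the role of p, the number of attributes sampled at each internal node.

~~~
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: n = 200 instances, d = 20 attributes.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 100 trees, each split considering sqrt(d) randomly sampled attributes.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))   # final prediction = majority vote over the trees
~~~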

16
Q

Key attributes of a random forest

A

The decision trees involved in a random forest are unpruned trees, as they are allowed to grow to their largest possible size until every leaf is pure.
Hence, the base classifiers of a random forest represent unstable classifiers that have low bias but high variance, because of their large size.

There is a lack of correlation among the model parameters and test predictions of the base classifiers. This can be attributed to the use of an independently sampled dataset Dᵢ for learning every decision tree Tᵢ, similar to the bagging approach. However, random forests have the additional advantage of choosing a splitting criterion at every internal node using a different (and randomly selected) subset of attributes. This property significantly helps in breaking the correlation structure, if any, among the decision trees.

17
Q

Class Imbalance

A

When there are a disproportionate number of instances that belong to different classes.

aka skew

18
Q

2 Challenges of class imbalance for classification

A
  • Training. It can be difficult to find sufficiently many labeled samples for a rare class. Classifiers are all impacted if the minority class is not well-represented in the training set. In general, a classifier trained over an imbalanced dataset will show a bias toward improving its performance over the majority class, which is often not the desired behaviour.
  • Accuracy, which is the traditional measure for evaluating classification performance, is not well-suited for evaluating models in the presence of class imbalance in the test data.
19
Q

2 Types of sampling methods used to enhance the representation of the minority class

A

undersampling, where the frequency of the majority class is reduced to match the frequency of the minority class.

oversampling, where artificial examples of the minority class are created to make them equal in proportion to the number of negative instances.
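
A minimal sketch of both samplers with NumPy, assuming 0 is the majority class and 1 the minority class (illustrative data):

~~~
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)        # 90 majority vs. 10 minority instances

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Undersampling: keep only as many majority instances as there are minority ones.
under_idx = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False),
                            min_idx])

# Oversampling: duplicate minority instances until the classes are balanced.
over_idx = np.concatenate([maj_idx,
                           rng.choice(min_idx, size=len(maj_idx), replace=True)])

print(len(under_idx), len(over_idx))     # 20 and 180 instances respectively
~~~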

20
Q

True Positive Count (TP)

A

f₊₊, corresponds to the number of positive examples correctly predicted by the classifier.

21
Q

False Positive Count (FP)

A

f₋₊, corresponds to the number of negative examples wrongly predicted as positive by the classifier.

Also known as Type I error

22
Q

False Negative Count (FN)

A

f₊₋, corresponds to the number of positive examples wrongly predicted to be negative by the classifier.

Also known as Type II error

23
Q

True Negative Count (TN)

A

f₋₋, corresponds to the number of negative examples correctly predicted by the classifier.

24
Q

True positive rate

A

The fraction of positive test instances correctly predicted by the classifier:
TPR = TP / (TP + FN)

In the medical community, TPR is known as sensitivity.
In the information retrieval literature, it is called recall (r).

25
Q

True negative rate

A

The fraction of negative test instances correctly predicted by the classifier:
TNR = TN / (FP + TN)

Also known as specificity

26
Q

False positive rate

A

1 - TNR
FPR = FP / (FP + TN)

27
Q

False negative rate

A

1 - TPR
FNR = FN / (FN + TP)

28
Q

skew among the classes

A

α = P / (P+N)

Where P and N denote the actual number of positives and actual negatives.

29
Q

Precision

A

The fraction of correct predictions of the positive class over the total number of positive predictions, i.e.

p = TP / (TP + FP)

Precision is sensitive to skew.

Also referred to as the positive predicted value (PPV)

30
Q

False discovery rate

A

1-p

FDR = FP / (TP + FP)

31
Q

Positive Likelihood Ratio

A

TPR / FPR

32
Q

F1 Measure

A

2rp / (r + p)

p ~ precision & r ~ recall

33
Q

G measure

A

sqrt(rp)

p ~ precision & r ~ recall
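
A minimal sketch tying together the formulas on the preceding cards, using illustrative confusion-matrix counts:

~~~
# Illustrative confusion-matrix counts.
TP, FP, FN, TN = 40, 10, 20, 130

TPR = TP / (TP + FN)          # sensitivity / recall
TNR = TN / (FP + TN)          # specificity
FPR = FP / (FP + TN)          # = 1 - TNR
FNR = FN / (FN + TP)          # = 1 - TPR
alpha = (TP + FN) / (TP + FN + FP + TN)   # skew, P / (P + N)
precision = TP / (TP + FP)
F1 = 2 * TPR * precision / (TPR + precision)
G = (TPR * precision) ** 0.5
print(TPR, TNR, FPR, FNR, alpha, precision, F1, G)
~~~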

34
Q

Aggregate evaluation of performance

A

Assessing the performance of a classifier over a range of score thresholds is called aggregate evaluation of performance.

In this style of analysis, the emphasis is not on evaluating the performance of a single classifier corresponding to the optimal score threshold, but on assessing the general quality of the ranking produced by the classification scores on the test set.

35
Q

Aggregate evaluation of performance

ROC Curve

A

The receiver operating characteristic (ROC) curve is a graphical approach for displaying the trade-off between the TPR and FPR of a classifier over varying score thresholds.

In an ROC curve, the TPR is plotted along the y-axis and the FPR is shown on the x-axis.

Each point along the curve corresponds to a classification model generated by placing a threshold on the test scores produced by the classifier.
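
A minimal sketch using scikit-learn's roc_curve and roc_auc_score with illustrative labels and scores; each threshold on y_score yields one (FPR, TPR) point:

~~~
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # actual labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))                 # one operating point per threshold
print("AUC =", roc_auc_score(y_true, y_score))
~~~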

36
Q

ROC curve

Operating points

A

Since every point on the ROC curve represents the performance of a classifier generated using a particular score threshold, they can be viewed as different operating points of the classifier.

One may choose one of these operating points depending on the requirements of the application.

37
Q

Aggregate evaluation of performance

AUC

A

The area under the ROC curve (AUC) summarises the aggregate behaviour across all operating points.

If the classifier is perfect, then its area under the ROC curve will be equal to 1.

If the classifier simply performs random guessing, then the area under the ROC curve will be equal to 0.5.

38
Q

Aggregate evaluation of performance

Precision-Recall Curve

A

An alternate tool for aggregate evaluation is the precision-recall curve.

The PR curve plots the precision and recall values of a classifier on the y and x axes respectively, by varying the threshold on the test scores.
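
A minimal sketch using scikit-learn's precision_recall_curve with the same illustrative labels and scores as the ROC example:

~~~
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))     # one (recall, precision) point per threshold
~~~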

39
Q

Key distinguishing features in the PR curve.

A
  • PR curves are sensitive to the skew factor α = P/(P + N), and different PR curves are generated for different values of α.
  • When the score threshold is lowest (every instance is labeled as positive), the precision is equal to α while recall is 1. As the score threshold is increased, the number of predicted positives can stay the same or decrease.
40
Q

Key characteristic of ROC curves

A

They are agnostic to the skew in the test set, because both the evaluation measures used in constructing ROC curves (TPR and FPR) are invariant to class imbalance.