Unit 2: Classification with Decision Trees Flashcards
Classification
Definition
Given a collection of records (dataset)
- Each record is characterised by a tuple (x, y), where x is the attribute set and y is the class label.
Task:
- Learn a model that maps each attribute set x into one of the predefined class labels y.
Inference
The process of using a model to infer the class of a given record.
Gini impurity
A metric describing how pure or mixed the data labels in a node are.
Pure nodes have a Gini impurity close to 0, whereas impure nodes have a Gini impurity closer to 0.5 (the maximum for a binary problem).
Information Gain of a Split
Refers to the amount of information gained by choosing one of the possible splits.
Information Gain = Impurity of the Parent - Weighted Average Impurity of the Children
CART Algorithm
Grow (Data, Attributes):
if (number of rows |Data| < h):
* Create (a leaf that has the dominant label of Data)
* Return the leaf with its label.
Else:
* AttributesGain = InfoGain(Data, Attributes)
* attribute = Max(AttributesGain)
* bsNode = Create (attribute)
* For each value of the attribute,
* - Subset = Split(Data, attribute, value)
* - Child = Grow (Subset, Attributes)
* - bsNode = Add (Child to bsNode)
* Return bsNode
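A minimal Python sketch of the Grow procedure above, assuming categorical attributes, records stored as dicts with a "label" key, and Gini-based information gain (the helper names are illustrative, not part of CART itself):

from collections import Counter

H = 5  # the h threshold: minimum number of records before forcing a leaf

def dominant_label(data):
    # The most frequent class label among the records
    return Counter(row["label"] for row in data).most_common(1)[0][0]

def gini(data):
    counts = Counter(row["label"] for row in data)
    n = len(data)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def info_gain(data, attribute):
    # Gain = impurity of the parent - weighted average impurity of the children
    n = len(data)
    children = 0.0
    for value in {row[attribute] for row in data}:
        subset = [row for row in data if row[attribute] == value]
        children += len(subset) / n * gini(subset)
    return gini(data) - children

def grow(data, attributes):
    if len(data) < H or not attributes:
        return {"leaf": True, "label": dominant_label(data)}
    best = max(attributes, key=lambda a: info_gain(data, a))
    node = {"leaf": False, "attribute": best, "children": {}}
    for value in {row[best] for row in data}:
        subset = [row for row in data if row[best] == value]
        node["children"][value] = grow(subset, [a for a in attributes if a != best])
    return node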
Entropy Formula
-Σᵢ₌₁ⁿ pᵢ log(pᵢ)
Gini Formula
Σᵢ₌₁ⁿ pᵢ (1 - pᵢ)
Classification error
1 - max(pᵢ)
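The three impurity measures above, as small numpy functions over a vector p of class proportions (assumed to sum to 1); log base 2 is used here, the usual convention for entropy:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))         # equivalently 1 - sum(p**2)

def classification_error(p):
    return 1 - np.max(p)

print(entropy([1.0]), gini([1.0]), classification_error([1.0]))                 # pure node: 0.0 0.0 0.0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]), classification_error([0.5, 0.5])) # 1.0 0.5 0.5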
Accuracy of a classifier
The proportion of correctly classified instances
(TP + TN) / N
Error rate
Proportion of incorrectly classified instances.
(FP + FN) / N
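A quick worked check, with hypothetical confusion-matrix counts:

TP, TN, FP, FN = 40, 45, 5, 10       # hypothetical counts
N = TP + TN + FP + FN                # 100 instances in total
accuracy = (TP + TN) / N             # 0.85
error_rate = (FP + FN) / N           # 0.15
assert abs(accuracy + error_rate - 1.0) < 1e-12  # the two measures always sum to 1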
General Structure of Hunt’s Algorithm
Let Dₜ be the set of training records that reach a node t
General Procedure:
- If Dₜ contains records that belong to the same class yₜ, then t is a leaf node labelled as yₜ
- If Dₜ contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
Recursively apply the procedure to each subset.
2 Design Issues of Decision Tree Induction
How should training records be split?
- Method for specifying the test condition (which depends on attribute types).
- Measure for evaluating the goodness of a test condition.
How should the splitting procedure stop?
- Stop splitting if all the records belong to the same class or have identical attribute values.
- Early termination.
4 Attribute types in a DT test condition
- Binary
- Nominal
- Ordinal
- Continuous
2 ways of splitting based on continuous attributes
- Discretization to form an ordinal categorical attribute. Ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering (see the bucketing sketch below).
- Binary decision: (A < v) or (A ≥ v). Consider all possible splits and find the best cut; this can be more compute-intensive (a gain-based search over candidate cuts is sketched under "Finding the best split" below).
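A short sketch of the discretization route, assuming numpy and a hypothetical "age" attribute; equal-interval buckets split the range evenly, while equal-frequency buckets split at percentiles:

import numpy as np

ages = np.array([22, 25, 31, 35, 38, 42, 47, 55, 61, 70])  # hypothetical continuous attribute

# Equal-interval bucketing: 4 equally wide ranges over [min, max]
equal_width_edges = np.linspace(ages.min(), ages.max(), num=5)

# Equal-frequency bucketing: boundaries at the 25th, 50th and 75th percentiles
equal_freq_edges = np.percentile(ages, [25, 50, 75])

# Map each value to an ordinal bucket index (0..3)
print(np.digitize(ages, equal_freq_edges))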
Decision trees
Finding the best split
- Compute the impurity measure (P) before splitting
- Compute the impurity measure (M) after splitting. M is the weighted impurity of child nodes.
- Choose the attribute test condition that produces the highest gain.
Gain = P - M
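A sketch of this gain computation for a binary split (A < v) vs. (A ≥ v) on a continuous attribute, using Gini as the impurity measure; candidate cuts are the midpoints between consecutive sorted values (this is also the "consider all possible splits" route from the continuous-attribute card above):

import numpy as np

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_cut(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    P = gini_impurity(y)                     # impurity before splitting
    best_gain, best_v = 0.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                         # no cut between equal values
        v = (x[i] + x[i - 1]) / 2            # candidate cut point
        left, right = y[:i], y[i:]
        # M: weighted average impurity of the two children
        M = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / len(y)
        if P - M > best_gain:                # Gain = P - M
            best_gain, best_v = P - M, v
    return best_v, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_cut(x, y))                        # (6.5, 0.5): a perfect split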
5 Advantages of Decision-Tree based classification
- Relatively inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Robust to noise (especially when methods to avoid overfitting are employed)
- Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)
2 Disadvantages of Decision Tree-based classification
- Due to the greedy nature of the splitting criterion, interacting attributes may be passed over in favour of other attributes that are less discriminating.
- Each decision boundary involves only a single attribute.
Type I error
False Positive
Type II error
False Negative
Holistic Metrics
Metrics that look at both horizontal and vertical views of the confusion matrix and involve both positive and negative classes.
E.g. F1 score & Matthews Correlation Coefficient
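Both are available in scikit-learn; a quick sketch on made-up predictions:

from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(matthews_corrcoef(y_true, y_pred))  # in [-1, 1]; +1 is a perfect prediction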
Precision
The fraction of records predicted as positive that are actually positive.
A.k.a. positive predictive value
Recall
true positive rate / hit rate
Measures how good the classifier is at detecting the actual positive cases.
positive detection rate
p(detect₊)
TP / (TP + FN)
How many of the positives were caught by the model?
NOT positive detection rate
¬ p(detect₊)
FN / (TP + FN)
How many of the positives were missed by the model?
negative detection rate
p(detect₋)
TN / (TN + FP)
How many of the negatives were caught by the model?
NOT negative detection rate
¬p(detect₋)
FP / (TN + FP)
How many of the negatives were missed by the model?
positive prediction rate
p(predict₊)
TP / (TP + FP)
How many of those predicted to be positive, were actually positive?
NOT positive prediction rate
¬p(predict₊)
FP / (TP + FP)
How many of those predicted to be positive, were actually negative?
negative prediction rate
p(predict₋)
TN / (TN + FN)
How many of those predicted to be negative, were actually negative?
NOT negative prediction rate
¬p(predict₋)
FN / (TN + FN)
How many of those predicted to be negative, were actually positive?
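All eight rates come in complementary pairs over the four confusion-matrix counts; a compact check with hypothetical counts:

TP, FN, TN, FP = 40, 10, 45, 5   # hypothetical counts

p_detect_pos = TP / (TP + FN)    # recall / true positive rate: 0.8
p_detect_neg = TN / (TN + FP)    # specificity: 0.9
p_predict_pos = TP / (TP + FP)   # precision / positive predictive value: ~0.889
p_predict_neg = TN / (TN + FN)   # negative predictive value: ~0.818

# Each NOT rate is 1 minus its counterpart, e.g. the missed-positive rate:
assert abs(FN / (TP + FN) - (1 - p_detect_pos)) < 1e-12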
Accuracy for a multi-class problem
(∑ᵢ TCᵢ) / N
where TCᵢ is the count of correctly classified instances of class Cᵢ
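Equivalently, with scikit-learn on a hypothetical 3-class problem:

from sklearn.metrics import accuracy_score

y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

# 4 of the 6 instances are correct, so sum(TC_i) / N = 4/6
print(accuracy_score(y_true, y_pred))  # 0.666...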
Holistic metric for multiclass p(detect) or p(predict)
recall score / balanced accuracy score / precision score
- Take the arithmetic mean of the per-class detection (or prediction) rates.
- Take a weighted average of the detection rates. This allows assigning different weights for the different classes.
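In scikit-learn these choices map onto the average parameter of recall_score (and balanced_accuracy_score is the arithmetic mean of the per-class recalls); a sketch:

from sklearn.metrics import recall_score, balanced_accuracy_score

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

print(recall_score(y_true, y_pred, average="macro"))     # arithmetic mean of per-class recalls
print(recall_score(y_true, y_pred, average="weighted"))  # weighted by class support
print(balanced_accuracy_score(y_true, y_pred))           # equals the macro-averaged recall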
Holistic metric for multiclass: F1 Score
To generalise the F1 score to multi-class problems, we can:
1. Use the overall (micro-averaged) recall and precision and then apply the harmonic mean, as in the binary-class problem.
2. Treat the classes as one vs. the rest, obtain an F1 score for each class separately, and then take the average of the per-class F1 scores (macro-averaging).
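The two routes correspond to micro- and macro-averaging in scikit-learn's f1_score; a sketch:

from sklearn.metrics import f1_score

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

# Route 1: pool TP/FP/FN over all classes, then take the harmonic mean (micro)
print(f1_score(y_true, y_pred, average="micro"))

# Route 2: one-vs-rest F1 per class, then average the per-class scores (macro)
print(f1_score(y_true, y_pred, average="macro"))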
Model generalisation
Using the training set to learn a general relationship between the features and the class that can later be used to classify unseen examples.