Unit 2: Classification with Decision Trees Flashcards
Classification
Definition
Given a collection of records (dataset)
- Each record is characterised by a tuple (x, y), where x is the attribute set and y is the class label.
Task:
- Learn a model that maps each attribute set x into one of the predefined class labels y.
Inference
The process of using a model to infer the class of a given record.
Gini impurity
A metric describing how pure or mixed the data labels in a node are.
Pure nodes have a Gini impurity close to 0, whereas impure nodes have a Gini impurity closer to 0.5 (the maximum for a binary problem).
Information Gain of a Split
Refers to the amount of information gained by choosing one of the possible splits.
Information Gain = Impurity of the Parent - Weighted Average Impurity of the Children
CART Algorithm
Grow (Data, Attributes):
if (number of rows |Data| < h):
* Create (a leaf that has the dominant label of Data)
* Return the leaf with its label.
Else:
* AttributesGain = InfoGain(Data, Attributes)
* attribute = Max(AttributesGain)
* bsNode = Create (attribute)
* For each value of the attribute,
* - Subset = Split(Data, attribute, value)
* - Child = Grow (Subset, Attributes)
* - bsNode = Add (Child to bsNode)
* Return bsNode
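A minimal Python sketch of the Grow procedure above, assuming categorical attributes, records stored as dicts with a "label" key, and Gini-based information gain (the helper names are illustrative, not part of CART itself):

from collections import Counter

H = 5  # the h threshold: minimum number of records before forcing a leaf

def dominant_label(data):
    # The most frequent class label among the records
    return Counter(row["label"] for row in data).most_common(1)[0][0]

def gini(data):
    counts = Counter(row["label"] for row in data)
    n = len(data)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def info_gain(data, attribute):
    # Gain = impurity of the parent - weighted average impurity of the children
    n = len(data)
    children = 0.0
    for value in {row[attribute] for row in data}:
        subset = [row for row in data if row[attribute] == value]
        children += len(subset) / n * gini(subset)
    return gini(data) - children

def grow(data, attributes):
    if len(data) < H or not attributes:
        return {"leaf": True, "label": dominant_label(data)}
    best = max(attributes, key=lambda a: info_gain(data, a))
    node = {"leaf": False, "attribute": best, "children": {}}
    for value in {row[best] for row in data}:
        subset = [row for row in data if row[best] == value]
        node["children"][value] = grow(subset, [a for a in attributes if a != best])
    return node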
Entropy Formula
-Σᵢ₌₁ⁿ pᵢ log(pᵢ)
Gini Formula
Σᵢ₌₁ⁿ pᵢ (1 - pᵢ)
Classification error
1 - max(pᵢ)
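The three impurity measures above, as small numpy functions over a vector p of class proportions (assumed to sum to 1); log base 2 is used here, the usual convention for entropy:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return np.sum(p * (1 - p))         # equivalently 1 - sum(p**2)

def classification_error(p):
    return 1 - np.max(p)

print(entropy([1.0]), gini([1.0]), classification_error([1.0]))                 # pure node: 0.0 0.0 0.0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]), classification_error([0.5, 0.5])) # 1.0 0.5 0.5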
Accuracy of a classifier
The proportion of correctly classified instances
(TP + TN) / N
Error rate
Proportion of incorrectly classified instances.
(FP + FN) / N
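A quick worked check, with hypothetical confusion-matrix counts:

TP, TN, FP, FN = 40, 45, 5, 10       # hypothetical counts
N = TP + TN + FP + FN                # 100 instances in total
accuracy = (TP + TN) / N             # 0.85
error_rate = (FP + FN) / N           # 0.15
assert abs(accuracy + error_rate - 1.0) < 1e-12  # the two measures always sum to 1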
General Structure of Hunt’s Algorithm
Let Dₜ be the set of training records that reach a node t
General Procedure:
- If Dₜ contains records that belong to the same class yₜ, then t is a leaf node labelled as yₜ
- If Dₜ contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
Recursively apply the procedure to each subset.
2 Design Issues of Decision Tree Induction
How should training records be split?
- Method for specifying the test condition (which depends on attribute types).
- Measure for evaluating the goodness of a test condition.
How should the splitting procedure stop?
- Stop splitting if all the records belong to the same class or have identical attribute values.
- Early termination.
4 Attribute types in a DT test condition
- Binary
- Nominal
- Ordinal
- Continuous
2 ways of splitting based on continuous attributes
- Discretization to form an ordinal categorical attribute. Ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering (see the bucketing sketch below).
- Binary decision: (A < v) or (A ≥ v). Consider all possible splits and find the best cut; this can be more compute-intensive (a gain-based search over candidate cuts is sketched under "Finding the best split" below).
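A short sketch of the discretization route, assuming numpy and a hypothetical "age" attribute; equal-interval buckets split the range evenly, while equal-frequency buckets split at percentiles:

import numpy as np

ages = np.array([22, 25, 31, 35, 38, 42, 47, 55, 61, 70])  # hypothetical continuous attribute

# Equal-interval bucketing: 4 equally wide ranges over [min, max]
equal_width_edges = np.linspace(ages.min(), ages.max(), num=5)

# Equal-frequency bucketing: boundaries at the 25th, 50th and 75th percentiles
equal_freq_edges = np.percentile(ages, [25, 50, 75])

# Map each value to an ordinal bucket index (0..3)
print(np.digitize(ages, equal_freq_edges))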
Decision trees
Finding the best split
- Compute the impurity measure (P) before splitting
- Compute the impurity measure (M) after splitting. M is the weighted impurity of child nodes.
- Choose the attribute test condition that produces the highest gain.
Gain = P - M
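A sketch of this gain computation for a binary split (A < v) vs. (A ≥ v) on a continuous attribute, using Gini as the impurity measure; candidate cuts are the midpoints between consecutive sorted values (this is also the "consider all possible splits" route from the continuous-attribute card above):

import numpy as np

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_cut(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    P = gini_impurity(y)                     # impurity before splitting
    best_gain, best_v = 0.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                         # no cut between equal values
        v = (x[i] + x[i - 1]) / 2            # candidate cut point
        left, right = y[:i], y[i:]
        # M: weighted average impurity of the two children
        M = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / len(y)
        if P - M > best_gain:                # Gain = P - M
            best_gain, best_v = P - M, v
    return best_v, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_cut(x, y))                        # (6.5, 0.5): a perfect split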
5 Advantages of Decision-Tree based classification
- Relatively inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Robust to noise (especially when methods to avoid overfitting are employed)
- Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)
2 Disadvantages of Decision Tree-based classification
- Due to the greedy nature of the splitting criterion, interacting attributes may be passed over in favour of other attributes that are less discriminating.
- Each decision boundary involves only a single attribute.
Type I error
False Positive
Type II error
False Negative
Holistic Metrics
Metrics that look at both horizontal and vertical views of the confusion matrix and involve both positive and negative classes.
E.g. F1 score & Matthews Correlation Coefficient
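Both are available in scikit-learn; a quick sketch on made-up predictions:

from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(matthews_corrcoef(y_true, y_pred))  # in [-1, 1]; +1 is a perfect prediction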
Precision
The fraction of records predicted as positive that are actually positive.
A.k.a. positive predictive value
Recall
true positive rate / hit rate
Measures how good the classifier is at detecting the actual positive cases.
positive detection rate
p(detect₊)
TP / (TP + FN)
How many of the positives were caught by the model?
NOT positive detection rate
¬ p(detect₊)
FN / (TP + FN)
How many of the positives were missed by the model?
negative detection rate
p(detect₋)
TN / (TN + FP)
How many of the negatives were caught by the model?
NOT negative detection rate
¬p(detect₋)
FP / (TN + FP)
How many of the negatives were missed by the model?
positive prediction rate
p(predict₊)
TP / (TP + FP)
How many of those predicted to be positive, were actually positive?
NOT positive prediction rate
¬p(predict₊)
FP / (TP + FP)
How many of those predicted to be positive, were actually negative?
negative prediction rate
p(predict₋)
TN / (TN + FN)
How many of those predicted to be negative, were actually negative?
NOT negative prediction rate
¬p(predict₋)
FN / (TN + FN)
How many of those predicted to be negative, were actually positive?
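All eight rates come in complementary pairs over the four confusion-matrix counts; a compact check with hypothetical counts:

TP, FN, TN, FP = 40, 10, 45, 5   # hypothetical counts

p_detect_pos = TP / (TP + FN)    # recall / true positive rate: 0.8
p_detect_neg = TN / (TN + FP)    # specificity: 0.9
p_predict_pos = TP / (TP + FP)   # precision / positive predictive value: ~0.889
p_predict_neg = TN / (TN + FN)   # negative predictive value: ~0.818

# Each NOT rate is 1 minus its counterpart, e.g. the missed-positive rate:
assert abs(FN / (TP + FN) - (1 - p_detect_pos)) < 1e-12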
Accuracy for a multi-class problem
(∑ᵢ TCᵢ) / N
where TCᵢ is the count of correctly classified instances of class Cᵢ
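Equivalently, with scikit-learn on a hypothetical 3-class problem:

from sklearn.metrics import accuracy_score

y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

# 4 of the 6 instances are correct, so sum(TC_i) / N = 4/6
print(accuracy_score(y_true, y_pred))  # 0.666...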
Holistic metric for multiclass p(detect) or p(predict)
recall score / balanced accuracy score / precision score
- Take the arithmetic mean of the per-class detection (or prediction) rates.
- Take a weighted average of the detection rates. This allows assigning different weights for the different classes.
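In scikit-learn these choices map onto the average parameter of recall_score (and balanced_accuracy_score is the arithmetic mean of the per-class recalls); a sketch:

from sklearn.metrics import recall_score, balanced_accuracy_score

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

print(recall_score(y_true, y_pred, average="macro"))     # arithmetic mean of per-class recalls
print(recall_score(y_true, y_pred, average="weighted"))  # weighted by class support
print(balanced_accuracy_score(y_true, y_pred))           # equals the macro-averaged recall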
Holistic metric for multiclass: F1 Score
To generalise the F1 score to multi-class problems, we can:
1. Use the overall (micro-averaged) recall and precision and then apply the harmonic mean, as in the binary-class problem.
2. Treat the classes as one vs. the rest, obtain an F1 score for each class separately, and then take the average of the per-class F1 scores (macro-averaging).
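The two routes correspond to micro- and macro-averaging in scikit-learn's f1_score; a sketch:

from sklearn.metrics import f1_score

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

# Route 1: pool TP/FP/FN over all classes, then take the harmonic mean (micro)
print(f1_score(y_true, y_pred, average="micro"))

# Route 2: one-vs-rest F1 per class, then average the per-class scores (macro)
print(f1_score(y_true, y_pred, average="macro"))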
Model generalisation
Using the training set to learn a general relationship between the features and the class that can later be used to classify unseen examples.