Unit 2: Classification with Decision Trees Flashcards
Classification
Definition
Given a collection of records (dataset)
- Each record is characterised by a tuple (x, y), where x is the attribute set and y is the class label.
Task:
- Learn a model that maps each attribute set x into one of the predefined class labels y.
Inference
The process of using a model to infer the class of a given record.
Gini impurity
A metric describing how pure or mixed the data labels in a node are.
Pure nodes have a Gini impurity close to 0, while impure nodes have a Gini impurity closer to 0.5 (the maximum for a binary classification problem).
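A minimal Python sketch of this measure, assuming the labels arrive as a plain list (the name gini_impurity is illustrative):

from collections import Counter

def gini_impurity(labels):
    # Gini = sum over classes of p * (1 - p)
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini_impurity(["A", "A", "A", "A"]))  # 0.0  (pure node)
print(gini_impurity(["A", "A", "B", "B"]))  # 0.5  (maximally mixed binary node)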
Information Gain of a Split
The reduction in impurity achieved by choosing one of the possible splits.
Information Gain
EQUALS
Impurity of the Parent
MINUS
Weighted Average Impurity of the Children
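A short Python sketch of this definition, here using entropy as the impurity measure (function names are illustrative):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # Gain = impurity(parent) - weighted average impurity of the children
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["no"] * 4 + ["yes"]]
print(information_gain(parent, children))  # about 0.278: the split reduces impurity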
CART Algorithm
Grow (Data, Attributes):
If (number of rows |Data| < h, a stopping threshold, or all records share one label):
* Create a leaf that has the dominant label of Data.
* Return the leaf with its label.
Else:
* AttributesGain = InfoGain(Data, Attributes)
* attribute = Max(AttributesGain)
* bsNode = Create(attribute)
* For each value of the attribute:
* - Subset = Split(Data, attribute, value)
* - Child = Grow(Subset, Attributes)
* - bsNode = Add(Child to bsNode)
* Return bsNode
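The pseudocode above can be sketched as runnable Python, assuming categorical attributes and rows given as (feature-dict, label) pairs; the names grow, info_gain, and min_rows are illustrative, not part of the original algorithm:

from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def info_gain(data, attribute):
    labels = [y for _, y in data]
    groups = {}
    for x, y in data:
        groups.setdefault(x[attribute], []).append(y)
    weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - weighted

def grow(data, attributes, min_rows=2):
    labels = [y for _, y in data]
    # Stop: node too small, already pure, or no attributes left -> leaf with dominant label.
    if len(data) < min_rows or len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(data, a))
    node = {"attribute": best, "children": {}}
    for value in {x[best] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[best] == value]
        node["children"][value] = grow(subset, [a for a in attributes if a != best], min_rows)
    return node

data = [({"outlook": "sunny"}, "no"), ({"outlook": "sunny"}, "no"),
        ({"outlook": "rain"}, "yes"), ({"outlook": "rain"}, "yes")]
print(grow(data, ["outlook"]))  # a one-level tree splitting on "outlook"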
Entropy Formula
-Σᵢ₌₁ⁿ pᵢ log(pᵢ)
Gini Formula
Σᵢ₌₁ⁿ pᵢ (1 - pᵢ)
Classification error
1 - max(pᵢ)
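The three impurity measures can be compared directly on class proportions pᵢ; a small Python sketch (the tuples below are assumed binary class distributions):

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def gini(ps):
    return sum(p * (1 - p) for p in ps)

def classification_error(ps):
    return 1 - max(ps)

# All three are 0 for a pure node and peak at the uniform distribution.
for ps in [(1.0, 0.0), (0.75, 0.25), (0.5, 0.5)]:
    print(ps, round(entropy(ps), 3), round(gini(ps), 3), round(classification_error(ps), 3))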
Accuracy of a classifier
The proportion of correctly classified instances
(TP + TN) / N
Error rate
Proportion of incorrectly classified instances.
(FP + FN) / N
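Both metrics follow directly from the binary confusion-matrix counts, where N is the total number of instances; a tiny sketch with made-up counts:

TP, TN, FP, FN = 40, 45, 5, 10        # hypothetical confusion-matrix counts
N = TP + TN + FP + FN

accuracy = (TP + TN) / N              # 0.85
error_rate = (FP + FN) / N            # 0.15
print(accuracy + error_rate)          # 1.0: the two always sum to one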
General Structure of Hunt’s Algorithm
Let Dₜ be the set of training records that reach a node t.
General Procedure:
- If Dₜ contains records that belong to the same class yₜ, then t is a leaf node labeled as yₜ
- If Dₜ contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
- Recursively apply the procedure to each subset.
2 Design Issues of Decision Tree Induction
How should training records be split?
- Method for specifying the test condition (which depends on attribute types).
- Measure for evaluating the goodness of a test condition.
How should the splitting procedure stop?
- Stop splitting if all the records belong to the same class or have identical attribute values.
- Early termination.
4 Attribute types in a DT test condition
- Binary
- Nominal
- Ordinal
- Continuous
2 ways of splitting based on continuous attributes
- Discretization to form an ordinal categorical attribute. Ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
- Binary decision: (A < v) or (A ≥ v). Consider all possible splits and find the best cut; this can be more compute-intensive.
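The binary-decision approach amounts to an exhaustive search over candidate cut points, typically the midpoints between adjacent sorted values; a Python sketch (best_cut and the toy data are illustrative):

from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    # Evaluate the weighted Gini of (A < v) vs (A >= v) at every candidate midpoint.
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values admit no cut between them
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v < cut]
        right = [y for v, y in pairs if v >= cut]
        m = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if m < best[1]:
            best = (cut, m)
    return best

print(best_cut([60, 70, 75, 85, 90], ["no", "no", "yes", "yes", "yes"]))
# (72.5, 0.0): this cut separates the two classes perfectly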
Decision Trees: Finding the Best Split
- Compute the impurity measure (P) before splitting
- Compute the impurity measure (M) after splitting. M is the weighted impurity of child nodes.
- Choose the attribute test condition that produces the highest gain.
Gain = P - M
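A tiny worked example of Gain = P - M using Gini impurity (the counts are made up):

P = 0.5                                      # parent: 6 "yes" / 6 "no" -> Gini 0.5
gini_right = 2 * (2 / 8) * (6 / 8)           # child with 2 "yes" / 6 "no" -> 0.375
M = (4 / 12) * 0.0 + (8 / 12) * gini_right   # pure left child + weighted right -> 0.25
gain = P - M
print(gain)                                  # 0.25: this split removes half the impurity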