Chapter 3: Classification Flashcards
Classification Task
The data for a classification task consists of a collection of instances (records).
Each such instance is characterized by a tuple (x, y), where x is the set of attribute values that describe the instance and y is the class label of the instance.
The attribute set x can contain attributes of any type, while the class label y must be categorical.
A classification model is an abstract representation of the relationship between the attribute set and the class label.
2 important roles of a classification model in data mining
- Used as a predictive model to classify unlabeled instances.
- Serves as a descriptive model to identify the characteristics that distinguish instances from different classes.
Accuracy
Number of correct predictions / Total number of predictions
Error rate
Number of wrong predictions / Total number of predictions
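A minimal Python sketch of both formulas, using made-up label lists (y_true, y_pred) purely for illustration:

# Hypothetical ground-truth and predicted class labels.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
total = len(y_true)

accuracy = correct / total               # correct predictions / total predictions
error_rate = (total - correct) / total   # wrong predictions / total predictions
print(accuracy, error_rate)              # 0.8 0.2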
3 Types of DT nodes
- A root node with no incoming links and zero or more outgoing links.
- Internal nodes, each of which has exactly one incoming link and two or more outgoing links.
- Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing links.
Hunt’s Algorithm
A decision tree is grown in a recursive fashion.
The tree initially contains a single root node that is associated with all the training instances.
If a node is associated with instances from more than one class, it is expanded using an attribute test condition that is determined using a splitting criterion.
A child leaf node is created for each outcome of the attribute test condition, and the instances associated with the parent node are distributed to the children based on the test outcomes.
This node expansion step can then be recursively applied to each child node, as long as it has labels of more than one class.
If all the instances associated with a leaf node have identical class labels, then the node is not expanded any further.
Each leaf node is assigned a class label that occurs most frequently in the training instances associated with the node.
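A simplified sketch of Hunt's algorithm in Python, assuming categorical attributes, dictionary-based instances and tree nodes, and a Gini-based splitting criterion; the helper names (gini, best_split, hunt) are illustrative, not part of the original algorithm description.

from collections import Counter

def gini(labels):
    # Gini index of the class labels at a node.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(instances, labels, attributes):
    # Splitting criterion: choose the attribute with the largest gain in purity.
    parent_impurity = gini(labels)
    best, best_gain = None, 0.0
    for attr in attributes:
        partitions = {}
        for x, y in zip(instances, labels):
            partitions.setdefault(x[attr], []).append(y)
        weighted = sum(len(p) / len(labels) * gini(p) for p in partitions.values())
        if parent_impurity - weighted > best_gain:
            best, best_gain = attr, parent_impurity - weighted
    return best

def hunt(instances, labels, attributes):
    # Stopping criterion: all instances share one class label, or no split helps.
    attr = best_split(instances, labels, attributes) if attributes else None
    if len(set(labels)) == 1 or attr is None:
        return {"label": Counter(labels).most_common(1)[0][0]}
    # Node expansion: one child per outcome of the attribute test condition.
    children = {}
    for value in set(x[attr] for x in instances):
        subset = [(x, y) for x, y in zip(instances, labels) if x[attr] == value]
        xs, ys = zip(*subset)
        children[value] = hunt(list(xs), list(ys), [a for a in attributes if a != attr])
    return {"attribute": attr, "children": children}

# Hypothetical training set: each instance is a dict of attribute values.
X = [{"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"},
     {"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"}]
y = ["no", "yes", "yes", "no"]
print(hunt(X, y, ["outlook", "windy"]))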
2 Key design issues to be addressed with Hunt’s algorithm
1) What is the splitting criterion?
At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes.
The splitting criterion determines which attribute is chosen.
2) What is the stopping criterion?
The basic algorithm stops expanding a node only when all the training instances have the same class labels or identical attribute values. However, early termination can have advantages.
Impurity of a node
The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node.
E.g.
- Entropy
- Gini index
- Classification error
Gain in purity
The difference between the degree of impurity of the parent node and the weighted degree of impurity of the child nodes:
Δ = I(parent) − I(children), where I(children) denotes the weighted average impurity of the child nodes (each child weighted by its fraction of the parent's instances).
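A numerical sketch of the three impurity measures listed above and of the gain Δ, assuming a binary-class parent node and a hypothetical two-way split (the counts are made up for illustration):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def classification_error(labels):
    n = len(labels)
    return 1.0 - max(Counter(labels).values()) / n

def purity_gain(parent, children, impurity=gini):
    # Delta = I(parent) - sum over children of N(child)/N * I(child)
    n = len(parent)
    return impurity(parent) - sum(len(c) / n * impurity(c) for c in children)

# Hypothetical parent node with 6 positive and 6 negative instances,
# split into two child nodes of 5/1 and 1/5.
parent = ["+"] * 6 + ["-"] * 6
children = [["+"] * 5 + ["-"], ["+"] + ["-"] * 5]

print(entropy(parent), gini(parent), classification_error(parent))  # 1.0 0.5 0.5
print(purity_gain(parent, children))                                # about 0.222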
Characteristics of Decision Tree Classifiers
Applicability
Decision trees are a nonparametric approach for building classification models.
It does not require any prior assumptions about the probability distribution governing the class and attributes of the data.
It is applicable to both categorical and continuous data without requiring attributes to be transformed into a common representation via binarization, normalization or standardization.
Unlike some binary classifiers, it can also deal with multiclass problems without the need to decompose them into multiple binary classification tasks.
Trees are relatively easy to interpret.
The accuracy of trees is quite comparable to that of other classification techniques for many simple datasets.
Characteristics of Decision Tree Classifiers
Expressiveness
A decision tree provides a universal representation for discrete-valued functions. I.e. it can encode any function of discrete-valued attributes. This is because every discrete-valued function can be represented as an assignment table, where every unique combination of discrete attributes is assigned a class label.
Characteristics of Decision Tree Classifiers
Computational Efficiency
Since the number of possible decision trees can be very large, many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space.
For many datasets, such techniques quickly construct a reasonably good decision tree even when the training set size is very large.
Classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree.
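A minimal sketch of the O(w) classification step, assuming the same dictionary-based node representation as in the Hunt's algorithm sketch above (leaf nodes carry a "label", internal nodes carry an "attribute" and "children"):

# Hypothetical tree of depth 1.
tree = {"attribute": "windy",
        "children": {"yes": {"label": "no"},
                     "no": {"label": "yes"}}}

def classify(node, record):
    # One attribute test per level: worst case O(w) for maximum depth w.
    while "label" not in node:
        node = node["children"][record[node["attribute"]]]
    return node["label"]

print(classify(tree, {"outlook": "sunny", "windy": "yes"}))  # no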
Characteristics of Decision Tree Classifiers
3 Approaches to Handling missing values
- Probabilistic split method
- Surrogate split method
- Separate class method
DT handling missing values
Probabilistic split method
Distributes the data instance to every child of the splitting node according to the probability that the missing attribute has a particular value.
DT handling missing values
Surrogate split method
The instance whose splitting attribute value is missing is assigned to one of the child nodes based on the value of another, non-missing surrogate attribute whose splits most closely resemble the partitions made by the attribute with the missing value.
DT handling missing values
Separate class method
The missing value is treated as a separate categorical value distinct from other values of the splitting attribute.
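A small sketch of the separate class method using pandas, with a hypothetical home_owner attribute; the sentinel string "missing" becomes its own categorical value for the splitting attribute:

import pandas as pd

# Hypothetical attribute with missing values.
df = pd.DataFrame({"home_owner": ["yes", None, "no", None, "yes"]})

# Separate class method: treat missing as a distinct categorical value.
df["home_owner"] = df["home_owner"].fillna("missing")
print(df["home_owner"].value_counts())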
Characteristics of Decision Tree Classifiers
Handling Interactions among Attributes
Attributes are considered interacting if they are able to distinguish between classes when used together, but individually they provide little or no information.
Due to the greedy nature of the splitting criteria in decision trees, such attributes could be passed over in favor of other attributes that are not as useful.
This could result in more complex decision trees than necessary.
Hence, decision trees can perform poorly when there are interactions among attributes.
Characteristics of Decision Tree Classifiers
Handling irrelevant attributes
An attribute is irrelevant if it is not useful for the classification task. Since irrelevant attributes are poorly associated with the target class labels, they will provide little or no gain in purity and thus will be passed over by other more relevant features.
Hence, the presence of a small number of irrelevant attributes will not impact the decision tree construction process.
However, an attribute that provides little to no gain at a given node is not necessarily irrelevant. If the classification problem is complex and there are a large number of irrelevant attributes, some of them may be accidentally chosen during the tree-growing process, since by random chance they may provide a better gain than a relevant attribute.
Feature selection techniques can help to improve the accuracy of decision trees by eliminating the irrelevant attributes during preprocessing.
Characteristics of Decision Tree Classifiers
Handling Redundant Attributes
An attribute is redundant if it is strongly correlated with another attribute in the data.
Since redundant attributes show similar gains in purity if they are selected for splitting, only one of them will be selected as an attribute test condition in the decision tree algorithm.
Decision trees can thus handle the presence of redundant attributes.
Characteristics of Decision Tree Classifiers
Using Rectilinear Splits
The tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions until each region contains records of the same class.
The border between any two neighbouring regions of different classes is known as a decision boundary.
Since the test conditions only involve a single attribute, the decision boundaries are rectilinear, i.e., parallel to the coordinate axes.
This limits the expressiveness of decision trees in representing decision boundaries of datasets with continuous attributes.
An oblique decision tree may overcome this limitation by allowing the test condition to be specified using more than one attribute (e.g. x + y < 20).
Although an oblique decision tree is more expressive and can produce more compact trees, finding the optimal test condition is computationally more expensive.
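A small sketch of how an axis-parallel tree can emulate the oblique test x + y < 20 by adding a derived attribute; the feature names and threshold are illustrative assumptions:

import pandas as pd

# Hypothetical two-dimensional data.
df = pd.DataFrame({"x": [5, 12, 18, 25], "y": [3, 6, 14, 10]})

# A rectilinear split can only test one attribute at a time, but a derived
# attribute x + y turns the oblique boundary into a single-attribute test.
df["x_plus_y"] = df["x"] + df["y"]
print(df["x_plus_y"] < 20)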
Characteristics of Decision Tree Classifiers
Choice of Impurity Measure
The choice of impurity measure often has little effect on the performance of decision tree classifiers since many of the impurity measures are quite consistent with each other.
Instead, the strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure.
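A hedged comparison of the two impurity criteria exposed by scikit-learn's DecisionTreeClassifier on synthetic data; the cross-validated accuracies typically come out very close, consistent with the claim above:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    print(criterion, round(cross_val_score(tree, X, y, cv=5).mean(), 3))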
Underfitting
Underfitting occurs when the learned decision tree is too simplistic, and thus, incapable of fully representing the true relationship between the attributes and the class labels.
Model overfitting
The phenomenon where, in the pursuit of minimizing the training error rate, an overly complex model is selected that captures specific patterns in the training data but fails to learn the true nature of the relationships between attributes and class labels in the overall data.
2 Major factors that influence model overfitting
- Limited training size
- High model complexity
2 Major factors that influence model overfitting
Limited Training Size
A training set consisting of a finite number of instances can only provide a limited representation of the overall data.
Hence, it is possible that the patterns learned from a training set do not fully represent the true patterns in the overall data, leading to model overfitting.
2 Major factors that influence model overfitting
High Model Complexity
Generally, a more complex model has a better ability to represent complex patterns in the data.
However, an overly complex model also has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances.
Models with high complexity should thus be judiciously used to avoid overfitting.
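A sketch of both factors using scikit-learn and synthetic data (the sample sizes and depths are arbitrary): as max_depth grows, the training error keeps shrinking while the test error typically stops improving or worsens.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small, noisy training set (limited training size).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (2, 5, 10, None):  # None grows the tree until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(depth, round(train_err, 3), round(test_err, 3))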
Model selection
The process of selecting a model with the right level of complexity, which is able to generalize well over unseen test instances.
Occam’s razor
The principle of parsimony
Given two models with the same errors, the simpler model is preferred over the more complex model.
Prepruning
(Early Stopping Rule)
In this approach, the tree-growing algorithm is halted before generating a fully grown tree that perfectly fits the entire training data.
To do this, a more restrictive stopping condition must be used, e.g. stop expanding a leaf node when the observed gain in the generalization error estimate falls below a certain threshold.
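A sketch of prepruning with scikit-learn's DecisionTreeClassifier, whose max_depth, min_samples_leaf, and min_impurity_decrease parameters act as more restrictive stopping conditions; the threshold values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Early stopping rules: cap the depth, require a minimum leaf size, and
# refuse splits whose impurity decrease falls below a threshold.
prepruned = DecisionTreeClassifier(max_depth=4,
                                   min_samples_leaf=10,
                                   min_impurity_decrease=0.01,
                                   random_state=0).fit(X, y)
print(prepruned.get_depth(), prepruned.get_n_leaves())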
Advantage of prepruning
It avoids the computations associated with generating overly complex subtrees that overfit the training data.
Major drawback of prepruning
Even if no significant gain is obtained using any of the existing splitting criteria, subsequent splitting may result in better subtrees.
Such subtrees would not be reached if prepruning is used because of the greedy nature of decision tree induction.
Post-pruning
In this approach, the decision tree is initially grown to its maximum size.
This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a bottom-up fashion.
Trimming can be done by subtree replacement or subtree raising.
Post-pruning
Subtree replacement
A subtree is replaced with a new leaf node whose class label is determined from the majority class of instances affiliated with the subtree.
Post-pruning
Subtree Raising
The subtree is replaced with the most frequently used branch of the subtree.
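Scikit-learn implements post-pruning as minimal cost-complexity pruning, which trims the fully grown tree by subtree replacement; a sketch with an arbitrarily chosen ccp_alpha for illustration:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Fully grown tree versus the same tree pruned bottom-up: larger ccp_alpha
# values replace more subtrees with leaf nodes.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())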