Part 2: Decision Trees Flashcards
Decision tree
A decision tree is used to predict the class (outcome) of an instance. Every attribute gives rise to possible splits, and from these the best splits have to be selected so that the tree ends up with pure nodes, i.e., nodes containing instances of a single class. To decide which split is best, entropy and information gain are used.
Entropy (E)
Measures the uncertainty in a node: E(t) = −Σ_i p(i|t) log2 p(i|t), where p(i|t) is the fraction of class i in node t. For a pure node E = 0; when uncertainty is maximal (classes equally represented) E = 1 (for two classes).
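A minimal sketch in plain Python (the class counts are made up) of how entropy is computed from the class distribution in a node:

```python
import math

def entropy(counts):
    """Entropy of a node, given the number of instances of each class in it."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([5, 5]))   # equally distributed over two classes -> 1.0 (maximal uncertainty)
print(entropy([10, 0]))  # pure node -> 0.0
```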
Information gain (I)
Measures the decrease in uncertainty (entropy) obtained by a split: I = E(before split) − weighted average entropy of the resulting nodes.
Theory regarding entropy
The weighted entropy after a split is never larger than the original entropy, so the information gain is always ≥ 0 (it is 0 when the split gives no information).
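A small follow-up sketch (reusing the entropy helper from the sketch above; the counts are made up) showing that the weighted entropy of the children is not larger than the parent's entropy, so the gain is non-negative:

```python
def information_gain(parent_counts, children_counts):
    """Gain = parent entropy minus the weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Parent node with 9 "yes" and 5 "no" instances, split into two children.
print(information_gain([9, 5], [[6, 1], [3, 4]]))   # > 0: uncertainty decreased
print(information_gain([9, 5], [[9, 5]]))           # uninformative "split" -> 0.0
```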
Split on attributes
- Number of variables the split is based on:
+ Univariate - only one variable is tested.
+ Multivariate - more than one variable is tested at once.
- Type of splitting variable(s):
+ Nominal - e.g., temp.=cool, temp.=mild, temp.=hot.
+ Numerical - e.g., temp.<30, temp.>30.
- Number of outcome splits:
+ Two (binary) - trees with only binary splits are called binary trees.
+ Multiway split.
Decision boundary
Border between two neighbouring regions of different classes. With univariate splits the decision boundary is parallel to the axes, because each test condition involves a single attribute at a time.
Types of decision trees
Type of predicted variable (end node):
- Nominal (class) - label is the majority class of the instances in the node: classification trees.
- Numerical value - label is the average value of the instances in the node: regression trees.
- Equation - label is a regression equation fitted with least squares: model trees.
Tree induction
- Strategy
+ Split the records based on an attribute test that optimizes a certain criterion (e.g., information gain); a sketch of this greedy strategy follows after this card.
- Important issues
+ Determine how to split the records:
+ How to specify the attribute test condition?
+ How to determine the best split?
+ Determine when to stop splitting. This is especially important when there is noise in the data: noisy data doesn't have pure nodes, so if you keep splitting until the nodes are pure, you are modelling the noise.
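A rough sketch of this greedy induction strategy, written for these notes (the toy weather records and helper names are made up, not from the course material):

```python
from collections import Counter
import math

def node_entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def induce_tree(rows, labels, attributes):
    """Top-down induction: split on the attribute with the highest information
    gain and recurse, stopping when the node is pure or no attributes remain."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority label

    def gain(attr):                                        # information gain of splitting on attr
        groups = {}
        for row, lab in zip(rows, labels):
            groups.setdefault(row[attr], []).append(lab)
        rest = sum(len(g) / len(labels) * node_entropy(g) for g in groups.values())
        return node_entropy(labels) - rest

    best = max(attributes, key=gain)
    children = {}
    for value in set(row[best] for row in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = map(list, zip(*sub))
        children[value] = induce_tree(sub_rows, sub_labels,
                                      [a for a in attributes if a != best])
    return {best: children}

rows = [{"outlook": "sunny", "windy": False}, {"outlook": "rain", "windy": True},
        {"outlook": "sunny", "windy": True}, {"outlook": "rain", "windy": False}]
labels = ["no", "no", "no", "yes"]
print(induce_tree(rows, labels, ["outlook", "windy"]))
```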
Measures for node impurity (used to judge the quality of a split and to decide when to stop splitting)
- Misclassification error.
- Entropy (this is the preferred option because in general this provides the smallest trees).
- Gini index
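A small sketch in plain Python (the class proportions are made up) comparing the three impurity measures on a single node:

```python
import math

def misclassification_error(probs):
    return 1 - max(probs)

def node_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    return 1 - sum(p * p for p in probs)

# From maximally impure (equally distributed) to pure.
for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, round(misclassification_error(probs), 2),
          round(node_entropy(probs), 3), round(gini(probs), 2))
```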
Quality of a split
Objective: obtaining pure nodes, i.e., nodes that contain objects from a single class (if a node is impure, take the majority class as its label).
- Measures for impurity
+ Misclassification error - the fraction of objects that are classified incorrectly if we assign every object in the node to the majority class.
Splitting based on classification error
Classification error at a node t:
- Error(t) = 1 - max_i P(i|t), where P(i|t) is the fraction of records of class i at node t.
Measures misclassification error made by a node
- Maximum (1 - 1/nc), with nc the number of classes, when records are equally distributed among all classes.
- Minimum (0.0) when all records belong to one class (pure node).
For efficient computation
For each attribute:
- Sort the records on the attribute values.
- Linearly scan these values, each time updating the count matrix and computing the entropy or Gini index.
- Choose the split position that has the lowest entropy or Gini index (see the sketch below).
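A minimal sketch of the sort-and-scan idea (plain Python, made-up temperatures). For clarity it recomputes the impurity of the two sides at every candidate position instead of incrementally updating a count matrix, and it uses the Gini index; entropy would work the same way:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """Sort once, then scan the candidate split positions between consecutive values."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, (weighted, threshold))
    return best   # (weighted impurity, split threshold)

temps = [15, 18, 22, 27, 30, 33]
play  = ["yes", "yes", "yes", "no", "no", "no"]
print(best_numeric_split(temps, play))   # -> (0.0, 24.5)
```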
Regression tree overfitting
Divide the x-axis into parts in such a way that within each part the predicted value of the node equals the average of the training points (the red points in the figure). By continuing to split you can fit the red points arbitrarily well:
- 2 splits
- 4 splits
- 6 splits -> very sensitive to noise.
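A sketch of this effect, assuming scikit-learn is available (the noisy sine data is synthetic, not the figure from the slides): the more leaves the tree is allowed, the closer it fits the training points, eventually fitting the noise:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 30)).reshape(-1, 1)    # 30 points on the x-axis
y = np.sin(X).ravel() + rng.normal(0, 0.3, 30)        # signal + noise

for leaves in (3, 5, 7):                               # roughly 2, 4 and 6 splits
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves).fit(X, y)
    train_mse = np.mean((tree.predict(X) - y) ** 2)
    print(leaves, "leaves -> training MSE", round(train_mse, 3))
```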
Growing the tree
Using the training data:
- Split nodes based on an impurity criterion.
- Until all leaf nodes are pure (impurity = 0).
Problem: if the tree is too large, the model also captures the noise in the training data.
Underfitting
When the model is too simple, both training and test errors are large.
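A sketch of both effects, assuming scikit-learn (the data is synthetic with deliberately noisy labels): a depth-1 stump underfits (both errors large), an unrestricted tree overfits (training error near zero, test error larger), and a moderate depth sits in between:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 4, None):   # too simple, moderate, grown until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, "train acc:", round(tree.score(X_tr, y_tr), 2),
          "test acc:", round(tree.score(X_te, y_te), 2))
```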