Part 2: Decision Trees Flashcards

1
Q

Decision tree

A

A decision tree is a model used to predict an outcome by repeatedly splitting the data on attribute values; there are as many candidate splits as there are attributes. After determining the possible splits, the best splits have to be chosen so that the tree ends up with pure nodes, i.e., nodes containing objects of a single class. Entropy and information gain are used to select these splits.
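For concreteness, a minimal sketch (not part of the original card) of fitting a decision tree with scikit-learn, assuming the library and its bundled iris data are available; criterion="entropy" makes splits be chosen by entropy/information gain, as the card describes.

```python
# Minimal sketch, assuming scikit-learn is installed: fit a tree on the iris
# data and predict the class of one instance.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" selects splits by entropy / information gain.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

print(tree.predict(X[:1]))                    # predicted class for the first instance
print(tree.get_depth(), tree.get_n_leaves())  # size of the grown tree
```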

2
Q

Entropy (E)

A

Measures the uncertainty (impurity) of a node. For a pure node, E = 0; when uncertainty is maximal (e.g., two equally likely classes), E = 1.
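A small helper (names assumed, not from the card) that computes the entropy of a node's class labels and confirms the two extremes:

```python
import math
from collections import Counter

def entropy(labels):
    """E = sum over classes of -p_i * log2(p_i), with p_i the class proportions in the node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 6))               # pure node -> 0.0
print(entropy(["yes"] * 3 + ["no"] * 3))  # two equally likely classes -> 1.0
```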

3
Q

Information gain (I)

A

Measures the decrease in uncertainty obtained when certain information is used to split a node: I = E(parent) - weighted average of E(children).
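A minimal sketch of that computation (function name, entropy helper and example counts are illustrative, not from the card):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """I = E(parent) - weighted average entropy of the child nodes."""
    n = len(parent_labels)
    remainder = sum(len(c) / n * entropy(c) for c in child_label_lists)
    return entropy(parent_labels) - remainder

parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]
print(information_gain(parent, children))  # positive: the split reduces uncertainty
```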

4
Q

Theory regarding entropy

A

The (weighted average) entropy after a split is never greater than the original entropy, so splitting always yields a non-negative information gain.

5
Q

Split on attributes

A
  • Number of variables the split is based on:
    + Univariate - only one variable is tested.
    + Multivariate - more than one variable is tested at once.
  • Type of splitting variable(s):
    + Nominal - e.g., temp. = cool, temp. = mild, temp. = hot.
    + Numerical - e.g., temp. < 30, temp. > 30.
  • Number of outcome splits:
    + Two (binary) - trees with only binary splits are called binary trees.
    + Multiway split.
6
Q

Decision boundary

A

The border between two neighbouring regions of different classes. For (univariate) decision trees, the decision boundary is parallel to the axes because each test condition involves a single attribute at a time.

7
Q

Types of decision trees

A

Type of predicted variable (end node):

  1. Nominal (class) - majority label of the instances in the node: classification trees.
  2. Numerical value - average value of the instances in the node: regression trees.
  3. Equation - regression equation fitted with least squares: model trees.
8
Q

Tree induction

A
  • Strategy
    + Split the records based on an attribute test that optimizes a certain criterion (information gain); a minimal recursive sketch follows this list.
  • Important issues
    + Determine how to split the records:
    + How to specify the attribute test condition?
    + How to determine the best split?
    + Determine when to stop splitting. This is also important to keep in mind when there is noise in the data: noisy data does not have pure nodes, so if you insist on reaching pure nodes, you are modelling the noise.
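The sketch below shows one way this strategy can look as code; induce_tree, its nested-dict tree representation and the toy data are illustrative, not from the card, and the stopping rule is simply "node is pure, or no attribute gives positive gain":

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def induce_tree(rows, labels, attributes):
    """rows: dicts of nominal attribute values; returns a nested dict, leaves are labels."""
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    # Greedily pick the attribute whose split gives the highest information gain.
    best_attr, best_gain = None, 0.0
    for attr in attributes:
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        gain = entropy(labels) - remainder
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    if best_attr is None:                          # no useful split -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Recurse on each value of the chosen attribute.
    node = {}
    remaining = [a for a in attributes if a != best_attr]
    for value in set(row[best_attr] for row in rows):
        sub_rows = [row for row in rows if row[best_attr] == value]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[best_attr] == value]
        node[(best_attr, value)] = induce_tree(sub_rows, sub_labels, remaining)
    return node

rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
print(induce_tree(rows, ["no", "yes", "no", "yes"], ["outlook", "windy"]))
# -> {('outlook', 'sunny'): 'no', ('outlook', 'rain'): 'yes'} (key order may vary)
```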
9
Q

Measures for node impurity (a.k.a. when to stop splitting)

A
  1. Misclassification error.
  2. Entropy (this is the preferred option because in general it provides the smallest trees).
  3. Gini index (all three measures are compared in the sketch below).
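A small comparison sketch (helper names are illustrative) showing how the three measures behave on the same class counts, from a pure node to a maximally mixed one:

```python
import math

def misclassification_error(counts):
    n = sum(counts)
    return 1 - max(counts) / n

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Pure node -> all three are 0; 50/50 split -> error 0.5, entropy 1.0, gini 0.5.
for counts in ([0, 6], [1, 5], [3, 3]):
    print(counts,
          round(misclassification_error(counts), 3),
          round(entropy(counts), 3),
          round(gini(counts), 3))
```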
10
Q

Quality of a split

A
Objective: obtaining pure nodes, i.e., nodes that contain objects from a single class (if a node is impure, take the majority class as its label).
- Measures for impurity
  + Misclassification error - fraction of objects that are classified incorrectly if we assign every object in the node to the majority class.
11
Q

Splitting based on classification error

A

Classification error at a node t:
- Error(t) = 1 - max_i P(i|t)
Measures the misclassification error made by a node:
- Maximum: 1 - 1/n_c (n_c = number of classes), when records are equally distributed among all classes.
- Minimum: 0.0, when all records belong to one class (a pure node).

12
Q

For efficient computation

A

For each attribute:

  • Sort the attribute values.
  • Linearly scan these values, each time updating the count matrix and computing the entropy or Gini index.
  • Choose the split position that has the lowest entropy or Gini index (a sketch of this scan follows below).
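A minimal sketch of this sorted linear scan for one numeric attribute, using the Gini index as the impurity measure; best_numeric_split and the toy data are illustrative, not from the card:

```python
from collections import Counter

def best_numeric_split(values, labels):
    """Return (weighted impurity, threshold) of the best binary split on a numeric attribute."""
    def gini(counts, n):
        return 1 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    pairs = sorted(zip(values, labels))
    left, right = Counter(), Counter(lab for _, lab in pairs)
    n = len(pairs)
    best = (float("inf"), None)                    # stays None if all values are identical
    for i in range(n - 1):
        _, lab = pairs[i]
        left[lab] += 1                             # move one record from right to left
        right[lab] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                               # can only split between distinct values
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        weighted = ((i + 1) / n * gini(left, i + 1)
                    + (n - i - 1) / n * gini(right, n - i - 1))
        best = min(best, (weighted, threshold))
    return best

print(best_numeric_split([30, 10, 40, 20], ["hot", "cool", "hot", "cool"]))  # -> (0.0, 25.0)
```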
13
Q

Regression tree overfitting

A

Divide the x-axis into parts in such a way that in each part the predicted value of the node equals the average of the (red) data points in that part. By continuing to split you can fit the points arbitrarily well (sketched below):

  • 2 splits
  • 4 splits
  • 6 splits -> very sensitive to noise.
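A hedged illustration with scikit-learn, assuming noisy points around a sine curve stand in for the card's red points; each leaf of a regression tree predicts the average of its training points, and allowing one leaf per point drives the training error towards zero:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 40)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 40)    # noisy "red points": signal + noise

# Few leaves give a coarse piecewise-average fit; one leaf per point fits the noise.
for leaves in (3, 5, 40):
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves).fit(x, y)
    print(leaves, "leaves, training MSE:", round(np.mean((tree.predict(x) - y) ** 2), 3))
```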
14
Q

Growing the tree

A

Using the training data:
- Split nodes based on an impurity criterion.
- Continue until all leaf nodes are pure (impurity = 0).
Problem: if the tree is too large, the model also captures the noise in the training data.

15
Q

Underfitting

A

When the model is too simple, both the training and test errors are large.

16
Q

Overfitting

A

When the model is too complex, training errors are small but test errors are large:
- Overfitting results in decision trees that are more complex than necessary.
- The training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- Solutions follow Occam's Razor principle:
  + Given two models with similar generalization errors, prefer the simpler model over the more complex one.
  + For a complex model, there is a greater chance that it was fitted accidentally to errors in the data.
- Possible solution: include model complexity when evaluating a model.

17
Q

Stopping criteria

A
  • Stop splitting if the information gain is too low (below some threshold); see the example below.
  • Stop splitting if the error on the validation set goes up.
    Problem: at a later stage you might still get a higher gain or a lower error.

2 systematic methods:

  • Quinlan in C4.5: reduced error pruning
  • Breiman in CART: cost complexity pruning
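A small example of the first criterion, assuming scikit-learn: min_impurity_decrease stops any split whose weighted impurity decrease falls below the threshold, which roughly plays the role of the information-gain threshold above (the dataset and threshold value are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
stopped = DecisionTreeClassifier(criterion="entropy", random_state=0,
                                 min_impurity_decrease=0.02).fit(X, y)

# Early stopping usually yields a noticeably smaller tree than growing to purity.
print(full.get_n_leaves(), stopped.get_n_leaves())
```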
18
Q

Tree pruning

A
  1. Reduced error pruning: make an inner node into a leaf node if the error on the validation set decreases; continue until no further decrease is possible.
  2. Cost complexity pruning: first grow the tree until the impurity is zero, then prune back using a cost-complexity measure.
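A sketch of cost complexity pruning with scikit-learn, assuming a labelled dataset and a held-out validation split for choosing the pruning strength (dataset and parameter choices are illustrative; selecting the alpha by validation accuracy is one common way to use the pruning path):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Compute the sequence of effective alphas for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Pick the alpha whose pruned tree scores best on the validation set.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_tr, y_tr)
    .score(X_val, y_val),
)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(round(best_alpha, 5), pruned.get_n_leaves(), round(pruned.score(X_val, y_val), 3))
```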