Chapter 6: Decision Trees Flashcards
1
Q
Decision Trees
A
- An internal node is a test on an attribute.
- A branch represents an outcome of the test, e.g., Color=red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
- A new case is classified by following the matching path down to a leaf node (see the sketch below).
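A minimal sketch of this structure in Python (the Node/Leaf classes and the classify helper are illustrative names, not from the chapter): internal nodes test an attribute, branches are the test outcomes, and leaves carry the class label.

```python
class Leaf:
    def __init__(self, label):
        self.label = label            # class label (or a class distribution)

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute    # attribute tested at this internal node
        self.children = children      # dict: test outcome (e.g. "red") -> subtree

def classify(tree, example):
    """Classify a new case by following the matching path down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.children[example[tree.attribute]]
    return tree.label

# e.g. a one-level tree that tests Color, as in the "Color=red" branch above
toy = Node("Color", {"red": Leaf("yes"), "blue": Leaf("no")})
print(classify(toy, {"Color": "red"}))   # -> "yes"
```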
2
Q
Building Decision Tree
A
- Top-down tree construction (a recursive algorithm; without pruning it can end up with a huge tree that is overfitted, i.e. it does not generalize well to new cases)
- At start, all training examples are at the root.
- Partition the examples recursively by choosing one attribute each time.
- Bottom-up tree pruning
- Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases
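A minimal sketch of the top-down construction step, reusing the hypothetical Node/Leaf classes from card 1; the goodness function that scores candidate splits (e.g. information gain, card 12) is passed in as a parameter. Bottom-up pruning would run afterwards on the finished tree.

```python
from collections import Counter

def build_tree(examples, attributes, target, goodness):
    """Recursively partition the training examples, choosing one attribute per node."""
    labels = [ex[target] for ex in examples]
    # stop when the node is pure or no attributes are left; predict the majority class
    if len(set(labels)) == 1 or not attributes:
        return Leaf(Counter(labels).most_common(1)[0][0])
    # choose the attribute that best separates the classes at this node
    best = max(attributes, key=lambda a: goodness(examples, a, target))
    remaining = [a for a in attributes if a != best]
    children = {value: build_tree([ex for ex in examples if ex[best] == value],
                                  remaining, target, goodness)
                for value in {ex[best] for ex in examples}}
    return Node(best, children)
```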
3
Q
Choosing the Splitting Attribute
A
- At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
- A “goodness” function is used for this purpose. Typical goodness functions:
- information gain (ID3/C4.5)
- information gain ratio
- gini index (CART)
- A perfect split criterion separates the data into clean (pure) subnodes, e.g. one side all "yes" and the other all "no"
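As one concrete goodness function, here is a small sketch of the Gini index used by CART (the counts-based interface is an assumption for illustration); information gain and gain ratio are sketched under the later cards.

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions (0 for a pure node)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))   # mixed node (9 of one class, 5 of the other) -> ~0.459
print(gini([4, 0]))   # pure node -> 0.0
```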
4
Q
A Criterion for Attribute Selection
A
- Which is the best attribute?
- The one which will result in the smallest tree
- Heuristic: choose the attribute that produces the “purest” nodes
- Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets that an attribute produces
- Strategy: choose attribute that results in greatest information gain
5
Q
Computing Information
A
- Information is measured in bits
- Given a probability distribution, the information required to predict an event (e.g. whether play is yes or no) is the distribution’s entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)
- Formula for computing the information entropy: entropy(p1, ..., pn) = -p1·log2(p1) - p2·log2(p2) - ... - pn·log2(pn)
- It describes how ordered a system is
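A direct translation of the formula above into Python, working from class counts; the example counts (9 "yes" / 5 "no") follow the two-class play setting used in this chapter, and the helper name entropy is just illustrative.

```python
from math import log2

def entropy(counts):
    """Information (in bits) required to predict the class, given the class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]   # convention: 0 * log(0) = 0
    return -sum(p * log2(p) for p in probs)

print(entropy([9, 5]))   # e.g. 9 "yes" and 5 "no" -> ~0.940 bits
print(entropy([7, 7]))   # maximally impure two-class node -> 1.0 bit
print(entropy([4, 0]))   # pure node -> 0.0 bits
```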
6
Q
Example: attribute “Outlook”
A
7
Q
Computing the Information Gain
A
8
Q
Continuing to Split
A
9
Q
The Final Decision Tree
A
10
Q
Wish List for a Purity Measure
A
- Properties we require from a purity measure:
- When node is pure, measure should be zero (=0)
- When impurity is maximal (i.e. all classes equally likely), measure should be maximal (e.g., 1 for boolean values)
- Multistage property: Info([2,3,4]) = Info([2,7]) + (7/9) × Info([3,4])
- Entropy is a function that satisfies these properties
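A quick numeric check of the multistage property listed above, reusing the entropy() sketch from card 5:

```python
lhs = entropy([2, 3, 4])
rhs = entropy([2, 7]) + (7 / 9) * entropy([3, 4])
print(round(lhs, 4), round(rhs, 4))   # both ~1.5305, so the property holds
```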
11
Q
Entropy
A
- Entropy, in general, describes the randomness of a system
- Entropy = 0 describes a perfectly ordered system
- The term was originally used in thermodynamics and statistical mechanics; "A Mathematical Theory of Communication" by Claude Shannon (1948) defined entropy for information theory
- "uncertainty about the contents of a message"
12
Q
Expected Information Gain
A
Gain(S, a) is the information gained by splitting the set S on attribute a, i.e. the reduction in the number of bits needed to classify an instance: Gain(S, a) = Entropy(S) - Σ_{v ∈ Values(a)} (|S_v| / |S|) · Entropy(S_v)
Problems?
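A minimal sketch of Gain(S, a) on a list of attribute-value dictionaries, reusing the entropy() helper from card 5 (the function and parameter names are illustrative):

```python
from collections import Counter

def gain(examples, attribute, target):
    """Gain(S, a): entropy before the split minus the weighted entropy after it."""
    def class_counts(rows):
        return list(Counter(row[target] for row in rows).values())
    total = len(examples)
    before = entropy(class_counts(examples))
    after = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        after += len(subset) / total * entropy(class_counts(subset))
    return before - after
```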
13
Q
Highly-Branching Attributes
A
- Problematic: attributes with a large number of values (extreme case: ID code)
- Subsets are more likely to be pure if there is a large number of values
- Information gain is biased towards choosing attributes with a large number of values
- This may result in overfitting (selection of an attribute that is non-optimal for prediction)
14
Q
Split for ID Code Attribute
A
Entropy of the split = 0, since each leaf node is “pure” (it contains only one case). Information gain is therefore maximal for the ID code attribute
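A small numeric illustration of this card, assuming 14 training cases with a 9/5 class split as in the entropy example of card 5: every ID-code leaf holds a single case, so the expected entropy after the split is 0 and the gain equals the full dataset entropy.

```python
n = 14
split_entropy = sum((1 / n) * entropy([1]) for _ in range(n))   # each leaf is pure
print(split_entropy)                      # 0.0
print(entropy([9, 5]) - split_entropy)    # gain ~0.940 bits, the maximum possible
```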
15
Q
Gain Ratio
A
- Gain ratio: a modification of the information gain that reduces its bias towards highly-branching attributes (this corrects the ID code problem)
- Gain ratio takes number and size of branches into account when choosing an attribute
- It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to)
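A minimal sketch of the gain ratio, reusing entropy() (card 5) and gain() (card 12); the intrinsic information is simply the entropy of the branch sizes, i.e. the bits needed to tell which branch an instance goes to.

```python
from collections import Counter

def gain_ratio(examples, attribute, target):
    """Gain ratio = information gain / intrinsic information of the split."""
    branch_sizes = list(Counter(ex[attribute] for ex in examples).values())
    intrinsic_info = entropy(branch_sizes)
    if intrinsic_info == 0:
        return 0.0            # attribute has a single value, so the split is useless
    return gain(examples, attribute, target) / intrinsic_info
```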