Decision Tree Flashcards
Entropy
Measure of disorder that can be applied to a set.
Disorder corresponds to how mixed (impure) the segment is with respect to the properties of interest.
Entropy formula
entropy = −p1 × log(p1) − p2 × log(p2) − … − pn × log(pn), where each pi is the proportion of property (class) i within the set; log base 2 is the usual choice.
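A minimal Python sketch of this formula (the helper name and the sample labels are mine, not from the cards):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))   # 1.0 (maximally mixed)
print(entropy(["yes", "yes", "yes", "no"]))  # ~0.811 (less disorder)
```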
Information Gain
Measures the change in entropy due to any new information being added, e.g., the reduction in a set's entropy after splitting it on an attribute.
Information Gain formula
IG(parent, children) = entropy(parent) − [p(c1) × entropy(c1) + p(c2) × entropy(c2) + … + p(ck) × entropy(ck)], where p(ci) is the proportion of instances that fall into child ci.
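A sketch of the same computation (the entropy helper is repeated so the snippet runs standalone; names and numbers are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(information_gain(parent, children))  # ~0.278: the split reduced disorder
```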
Classification trees
Each interior node in the tree contains a test of an attribute, with each branch from the node representing a distinct value of the attribute; each leaf node holds a class label (the prediction).
Decision tree - Basic algorithm (a greedy algorithm)
The tree is built top-down: test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain), the data are partitioned on the chosen attribute, and the process repeats on each partition (see the sketch after the stopping-criteria card below).
When does decision tree construction stop?
There are no remaining attributes for further partitioning.
All samples for a given node belong to the same class.
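The following is a minimal ID3-style sketch of the greedy algorithm and both stopping rules, assuming categorical attributes stored as dicts; the toy data and all names are mine:

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    n = len(rows)
    gain = entropy(rows, target)
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        gain -= len(subset) / n * entropy(subset, target)
    return gain

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:   # stop: all samples at this node share one class
        return labels[0]
    if not attrs:               # stop: no attributes left, use the majority class
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    return {(best, v): build_tree([r for r in rows if r[best] == v], remaining, target)
            for v in {r[best] for r in rows}}

rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(build_tree(rows, ["outlook", "windy"], "play"))
# splits on 'windy' (higher gain); both children are pure, so they become leaves
```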
Information Gain drawback
biased towards multivalued attributes
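To see the bias concretely: an ID-like attribute with a unique value per record makes every child pure, so it maximizes gain while generalizing terribly. A small sketch (names and data are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels    = ["yes", "no", "yes", "no", "yes", "no"]
record_id = [1, 2, 3, 4, 5, 6]           # unique value per record
windy     = ["n", "n", "n", "y", "y", "y"]
print(info_gain(record_id, labels))  # 1.0: maximal (every child is pure)
print(info_gain(windy, labels))      # ~0.082: modest, but it can generalize
```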
Gain ratio drawback
tends to prefer unbalanced splits in which one partition is much smaller than the others
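The reason, in C4.5's terms: gain ratio divides information gain by SplitInfo, the entropy of the partition sizes, and SplitInfo shrinks as a split becomes more unbalanced, inflating the ratio. A minimal sketch (the helper name is mine):

```python
from math import log2

def split_info(sizes):
    """Entropy of the partition proportions (C4.5's SplitInfo)."""
    n = sum(sizes)
    return -sum((s / n) * log2(s / n) for s in sizes)

# gain ratio = information gain / split info
print(split_info([50, 50]))  # 1.0: balanced split, large denominator
print(split_info([99, 1]))   # ~0.081: unbalanced split, small denominator,
                             # so even a modest gain yields a big ratio
```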
Gini Index
- biased towards multivalued attributes
- has difficulty when the number of classes is large
- tends to favor tests that result in equal-sized partitions and purity in both partitions
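The Gini formula itself is not stated on the card; a common form is Gini = 1 − Σ pi², which is 0 for a pure set. A minimal sketch (names and labels are mine):

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0: pure
print(gini(["yes", "yes", "no", "no"]))    # 0.5: maximally mixed (2 classes)
```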