Decision Trees Flashcards
What is the ID3 algorithm?
Split(node, {examples}):
- A ← the attribute with the highest information gain on {examples}
- For each value of A, create a child node
- Partition the training {examples} among the child nodes by their value of A
- For each subset, Split(child_node, {subset}) until the subset is pure (see the sketch below)
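A minimal Python sketch of this recursion. The dict-based example format, the "label" key, and the helper names are assumptions for illustration, not part of the ID3 definition:

```python
import math
from collections import Counter

def entropy(examples):
    # H(S) over the labels in this subset (see the entropy card below)
    counts = Counter(ex["label"] for ex in examples).values()
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts)

def information_gain(examples, attr):
    # Entropy before the split minus the weighted entropy after it
    total = len(examples)
    after = 0.0
    for v in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == v]
        after += (len(subset) / total) * entropy(subset)
    return entropy(examples) - after

def id3(examples, attributes):
    labels = {ex["label"] for ex in examples}
    if len(labels) == 1:
        return labels.pop()            # pure subset: stop and make a leaf
    if not attributes:                 # no attributes left: majority label
        return Counter(ex["label"] for ex in examples).most_common(1)[0][0]
    # A <- the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {"attribute": best, "children": {}}
    for v in {ex[best] for ex in examples}:   # one child per value of A
        subset = [ex for ex in examples if ex[best] == v]
        tree["children"][v] = id3(subset, [a for a in attributes if a != best])
    return tree
```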
What is the definition of Entropy?
H(S) = -p(+) log2 p(+) - p(-) log2 p(-)
Where p(+) and p(-) are the fractions of positive/negative examples in S
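A quick check of this formula in Python (the helper name is mine; the 0 log2 0 = 0 convention handles pure sets):

```python
import math

def binary_entropy(p_pos):
    # 0 * log2(0) is taken as 0, so a pure set scores exactly 0 bits
    return sum(-p * math.log2(p) for p in (p_pos, 1 - p_pos) if 0 < p < 1)

print(binary_entropy(0.5))  # 1.0 bit: a 50/50 set is maximally uncertain
print(binary_entropy(1.0))  # 0 bits: a pure set has no uncertainty
print(binary_entropy(0.9))  # ~0.469 bits: mostly positive, little uncertainty
```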
How can we interpret entropy?
The expected number of bits needed to tell whether an example X is positive or negative
How do we compute the expected drop in entropy (gain)?
Gain(S, A) = H(S) - Σ_v (|S_v| / |S|) H(S_v), where S_v is the subset of S with value v for attribute A
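A worked example with made-up numbers: 10 examples (5+/5-) split into subsets of 6 (4+/2-) and 4 (1+/3-):

```python
import math

def H(p):
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

before = H(5 / 10)                                 # 1.0 bit before the split
after = (6 / 10) * H(4 / 6) + (4 / 10) * H(1 / 4)  # weighted by subset size
print(round(before - after, 3))                    # gain ≈ 0.125 bits
```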
What does gain tell us?
How much the entropy drops: H(S) minus the average entropy after the split (a weighted average, biased toward larger subsets)
What is information gain?
The difference between the entropy before the split and the entropy after the split
What does information gain tell you?
How much more certain you are after a split
What happens if you run ID3 to completion?
All subsets will be pure
How do decision trees overfit?
Running ID3 to completion ends up with lots of singleton subsets, and there is not a lot of confidence in estimates made from only 1 training example.
How can we avoid overfitting (decision trees)?
- Create full DT
- For each node
- Remove
- Measure performance on validation set
- Remove node that results in greatest improvement
- Repeat until accuracy drops
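A self-contained sketch of this pruning loop (reduced-error pruning). The Node class, the field names, and the "y" label key are assumptions; each node is assumed to store the majority label of the training examples that reached it:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    majority: bool                 # prediction if this node acts as a leaf
    attribute: str = ""            # attribute tested here, if internal
    children: dict = field(default_factory=dict)

def predict(node, example):
    while node.children:
        child = node.children.get(example[node.attribute])
        if child is None:          # unseen value: fall back to majority label
            break
        node = child
    return node.majority

def accuracy(tree, data):
    return sum(predict(tree, ex) == ex["y"] for ex in data) / len(data)

def internal_nodes(node):
    if node.children:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def prune(tree, validation):
    while True:
        baseline = accuracy(tree, validation)
        best, best_acc = None, baseline
        for node in list(internal_nodes(tree)):
            saved, node.children = node.children, {}  # temporarily remove
            acc = accuracy(tree, validation)          # measure on validation
            node.children = saved                     # put it back
            if acc > best_acc:
                best, best_acc = node, acc
        if best is None:           # no removal improves accuracy: stop
            return tree
        best.children = {}         # keep the most helpful removal, repeat
```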
How is split entropy defined?
SplitEntropy(S, A) = -Σ_v (|S_v| / |S|) log2 (|S_v| / |S|), the entropy of the partition of S induced by attribute A (the labels are ignored)
What do we use split entropy for?
To normalize information gain by how fine-grained the split is
Definition of gain ratio?
GainRatio(S, A) = Gain(S, A) / SplitEntropy(S, A)
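A minimal sketch of these two formulas together (dict examples with a "y" label key are an assumed format; the helper names are mine):

```python
import math
from collections import Counter

def _entropy(counts, total):
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(examples, attr):
    total = len(examples)
    before = _entropy(Counter(ex["y"] for ex in examples).values(), total)
    after = 0.0
    for v in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == v]
        after += (len(subset) / total) * _entropy(
            Counter(ex["y"] for ex in subset).values(), len(subset))
    return before - after

def split_entropy(examples, attr):
    # Entropy of the partition itself; the labels play no role here
    total = len(examples)
    return _entropy(Counter(ex[attr] for ex in examples).values(), total)

def gain_ratio(examples, attr):
    # Undefined when the attribute has one value (split entropy is 0)
    return gain(examples, attr) / split_entropy(examples, attr)
```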
What does GainRatio penalize?
Attributes with many values
What is the problem with information gain?
Biased towards attributes with many values (they create lots of small pure subsets)
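A back-of-the-envelope illustration with assumed numbers: six examples (3+/3-) and an ID-like attribute with a unique value per example. The ID attribute gets the maximum possible gain, but gain ratio knocks it down:

```python
import math

# 3+/3- examples, so H(S) = 1 bit. A unique-per-example attribute yields six
# pure singleton subsets, so entropy after the split is 0 and gain is maximal.
gain_id = 1.0

# But the split is as fine-grained as possible: six subsets of 1/6 each
split_entropy_id = -6 * (1 / 6) * math.log2(1 / 6)   # = log2(6) ≈ 2.585

print(gain_id / split_entropy_id)   # GainRatio ≈ 0.387, heavily penalized
```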