Decision Trees Flashcards
What is a decision tree?
A decision support tool that uses a tree-like graph to model decisions and their possible consequences
What does an internal node represent?
A feature. For example, ‘Outlook’
What does a branch correspond to?
A feature value. For example, ‘rain’
What does a leaf represent?
A classification
What is the algorithm for creating a decision tree?
- Split the objects into subsets
- Are they pure?
- Yes: STOP
- No: repeat the split on the impure subsets
What do we mean by a subset being pure?
All of the items in the subset have the same output.
e.g. all examples are labelled ‘no’
After a split, what do we want to be more certain about?
More certain about the yes/no decision,
i.e. we want the subsets to be purer than the original set
What does entropy measure?
The uncertainty within a dataset
What is the formula for entropy?
Entropy(S) = ∑ − P(i) log2( P(i) )
summed over each class i that appears in the set,
where P(i) is the proportion of examples in the set belonging to class i
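A minimal Python sketch of this formula, assuming the set is given as a list of class labels (the function name and representation are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes i of -P(i) * log2(P(i))."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

print(entropy(["yes", "no", "yes", "no"]))  # even binary split -> 1.0
print(entropy(["yes", "yes", "yes"]))       # pure set -> 0.0
```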
What does an entropy of 1 tell us about a set?
The classes are evenly split within the set (for a binary problem, a 50/50 split: maximum uncertainty)
What does an entropy of 0 tell us about a set?
The set contains only one class, so there is no uncertainty
What does information gain measure?
The effectiveness of a feature in classifying training data
What is the formula for information gain?
Gain(S, A) = Entropy(S) − ∑ ( |S(v)| / |S| ) Entropy( S(v) )
summed over each value v of feature A, where S(v) is the subset of S for which A = v
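A short Python sketch of this formula, assuming each example is a dict of feature values plus a "label" key (the representation and helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(examples, feature, label="label"):
    """Gain(S, A) = Entropy(S) - sum_v |S(v)|/|S| * Entropy(S(v))."""
    total = len(examples)
    subsets = {}                      # S(v) for each value v of feature A
    for x in examples:
        subsets.setdefault(x[feature], []).append(x)
    remainder = sum(len(s) / total * entropy([x[label] for x in s])
                    for s in subsets.values())
    return entropy([x[label] for x in examples]) - remainder
```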
What does the ID3 Algorithm do?
It recursively decides which feature is best to split on, using the information gain of each of the remaining features, until there are no more features
What is the outline of the ID3 Algorithm?
Create a root node
A = the feature with the highest information gain
Set A as the root
For each possible value v of A:
- Add a new branch corresponding to A = v
- Let S(v) be the subset of examples in S with A = v
- If S(v) is empty: add a leaf node with the most common value of the label in S
- Else: below this branch add a subtree (recurse on S(v) with A removed)
Repeat
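A compact Python sketch of this outline, assuming examples are dicts of feature values plus a "label" key; the entropy and information gain helpers are repeated so the block runs on its own, and the representation is illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(examples, feature):
    total = len(examples)
    subsets = {}
    for x in examples:
        subsets.setdefault(x[feature], []).append(x)
    return entropy([x["label"] for x in examples]) - sum(
        len(s) / total * entropy([x["label"] for x in s])
        for s in subsets.values())

def id3(examples, features):
    labels = [x["label"] for x in examples]
    if len(set(labels)) == 1:                 # pure subset -> leaf
        return labels[0]
    if not features:                          # no features left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # A = the feature with the highest information gain; set A as the root.
    best = max(features, key=lambda f: information_gain(examples, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    # For each value v of A seen in S, add a branch A -> v and grow a subtree
    # on S(v).  (With a fixed value domain, an empty S(v) would instead become
    # a leaf holding the most common label in S, as in the outline above.)
    for value in {x[best] for x in examples}:
        subset = [x for x in examples if x[best] == value]
        tree[best][value] = id3(subset, remaining)
    return tree

# e.g. id3(weather_examples, ["Outlook", "Temperature", "Humidity", "Wind"])
```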
What is the formal definition of overfitting?
A given hypothesis h overfits training data if there is an alternative hypothesis h’ such that
h has a lower error than h’ on the training data
and
h has a higher error than h’ on the real (unseen) data
When we train a decision tree, what is there the danger of?
The danger that the tree memorises the training set, as opposed to learning the general concepts in the data
Overfitting
How can we avoid overfitting?
- Do not split when the split is not statistically significant
- Validation methods
- Grow a full tree, then prune branches that overfit
- Reduced Error Pruning (REPTree)
- Rule Post Pruning (RPP)
What is the algorithm for Reduced Error Pruning?
Split data into training and validation set
Do until further pruning is harmful:
- Evaluate impact on validation set of pruning each possible node (and those below)
- Greedily remove the one that improves validation set accuracy
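A simplified Python sketch of this idea, assuming a tree stored as nested dicts of the form {"feature": ..., "branches": {value: subtree}, "majority": label} with bare labels as leaves; it makes a single bottom-up pass rather than the fully greedy best-first loop, and all names are illustrative assumptions:

```python
def classify(tree, example):
    while isinstance(tree, dict):
        # Follow the branch for the example's feature value; unseen values
        # fall back to the node's majority label.
        tree = tree["branches"].get(example.get(tree["feature"]), tree["majority"])
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, x) == x["label"] for x in examples) / len(examples)

def prune_children(root, node, validation):
    """Try collapsing each internal child into its majority-label leaf and
    keep the change only if whole-tree validation accuracy does not drop."""
    for value, child in list(node["branches"].items()):
        if isinstance(child, dict):
            prune_children(root, child, validation)          # prune bottom-up
            before = accuracy(root, validation)
            node["branches"][value] = child["majority"]      # tentative prune
            if accuracy(root, validation) < before:
                node["branches"][value] = child              # pruning hurt: revert

def reduced_error_prune(root, validation):
    if isinstance(root, dict):
        prune_children(root, root, validation)
        # Finally consider collapsing the whole tree to a single leaf.
        if accuracy(root["majority"], validation) >= accuracy(root, validation):
            return root["majority"]
    return root
```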
What is the algorithm for Rule Post Pruning?
Grow the tree to overfit to the training data
- Convert tree to equivalent set of rules: One for each root to leaf path
- Prune each rule independently by removing preconditions whose removal does not worsen the rule’s accuracy
- Sort final rules by accuracy and employ them in this sequence
Estimate accuracy using the validation or training set
What are two examples of alternate splitting criteria?
- Gain Ratio
- Gini Index
What are the advantages of Gain Ratio as a splitting criterion?
- More sensitive to how a feature splits the data
- Discourages the selection of features that have many values with a uniform distribution
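A short Python sketch of gain ratio (information gain divided by the split information, i.e. the entropy of the feature’s own value distribution), using the same illustrative example representation as the earlier sketches:

```python
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(values).values())

def gain_ratio(examples, feature, label="label"):
    total = len(examples)
    subsets = {}
    for x in examples:
        subsets.setdefault(x[feature], []).append(x)
    gain = entropy([x[label] for x in examples]) - sum(
        len(s) / total * entropy([x[label] for x in s])
        for s in subsets.values())
    # Split information penalises features with many, evenly spread values.
    split_info = entropy([x[feature] for x in examples])
    return gain / split_info if split_info > 0 else 0.0
```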
What are the advantages of the Gini Index?
- Favours larger partitions
- Uses squared proportion of classes: Less sensitive to noise
What is the optimal Gini Index number?
0.0 - Low Gini index implies this is a good feature to split on
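A minimal Python sketch of the Gini index for a set and for a candidate split, using the same illustrative example representation as above:

```python
from collections import Counter

def gini(examples, label="label"):
    """Gini(S) = 1 - sum over classes i of P(i)^2 (0.0 means a pure set)."""
    total = len(examples)
    counts = Counter(x[label] for x in examples)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_of_split(examples, feature, label="label"):
    """Weighted Gini of the partitions produced by splitting on a feature;
    a lower value suggests a better feature to split on."""
    total = len(examples)
    parts = {}
    for x in examples:
        parts.setdefault(x[feature], []).append(x)
    return sum(len(p) / total * gini(p, label) for p in parts.values())
```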
How are Random Forests constructed?
Grow k different decision trees
- Pick a random subset of training objects
- Grow a full tree from these objects (NO PRUNING)
- When splitting, choose from a random subset of the features
- Compute the gain based upon this random subset rather than the whole feature set
Repeat for all k trees
How do we test a data object X with Random Forests?
Classify X using each of the trees
Use majority voting to get the class of X
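A minimal Python sketch of growing and querying a forest, using scikit-learn’s DecisionTreeClassifier as the base tree learner purely for illustration (the flashcards do not name a library); X and y are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(k):
        # Pick a random subset of training objects for this tree.
        idx = rng.choice(len(X), size=len(X), replace=True)
        # Full tree (no pruning); max_features="sqrt" makes each split
        # consider a random subset of the features.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Classify X with every tree, then take the majority vote per object.
    votes = np.stack([t.predict(X) for t in trees])   # shape: (k, n_objects)
    predictions = []
    for column in votes.T:                             # votes for one object
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)

# e.g. forest = grow_forest(X_train, y_train, k=25)
#      y_pred = predict_forest(forest, X_test)
```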
What are some instances where we should use decision trees?
- Instances that can be described by feature value pairs
- Targets are discrete valued
- Problems that can be described by disjunctive hypotheses, e.g. ‘this AND that’ gives y
- Data which may be noisy
- Data which may contain missing feature values
What are some instances where we shouldn’t use decision trees?
- Not ideal for real-valued decisions
- Problems that are not easy to express with splits on single features, e.g. XOR
- “Sparse” data as overfitting can be a problem