Lecture 12 - Decision Tree Induction Part 2 Flashcards
What is the pseudocode for decision tree induction?
FUNCTION buildDecTree(examples, atts)
    Create node N if necessary;                      // starting as a node, ending as a tree
    IF examples are all in same class THEN RETURN N labelled with that class;
    IF atts is empty THEN RETURN N labelled with modal example class;
    bestAtt = chooseBestAtt(examples, atts);
    label N with bestAtt;
    FOR each value a_i of bestAtt                    // each branch from node N
        s_i = subset of examples with bestAtt = a_i;
        IF s_i is not empty THEN
            newAtts = atts - bestAtt;
            subtree = buildDecTree(s_i, newAtts);    // recursive call
            attach subtree as child of N;
        ELSE
            Create leaf node L;
            Label L with modal example class;
            attach L as child of N;
    RETURN N;
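A minimal Python sketch of the same procedure, assuming categorical attributes, training examples represented as dicts, and a named target attribute; names such as build_dec_tree and choose_best_att are illustrative, not taken from any particular library.

from collections import Counter
import math

def entropy(examples, target):
    # H(S) = -sum over classes c of p(c) * log2 p(c)
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def modal_class(examples, target):
    # Most frequently occurring value of the target attribute
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def choose_best_att(examples, atts, target):
    # Best attribute = highest information gain against the target attribute
    def gain(att):
        remainder = 0.0
        for value in {ex[att] for ex in examples}:
            subset = [ex for ex in examples if ex[att] == value]
            remainder += len(subset) / len(examples) * entropy(subset, target)
        return entropy(examples, target) - remainder
    return max(atts, key=gain)

def build_dec_tree(examples, atts, target):
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:               # all examples in the same class -> leaf
        return classes.pop()
    if not atts:                        # no attributes left -> leaf with modal class
        return modal_class(examples, target)
    best = choose_best_att(examples, atts, target)
    node = {best: {}}                   # internal node labelled with the best attribute
    for value in {ex[best] for ex in examples}:      # one branch per observed value
        subset = [ex for ex in examples if ex[best] == value]
        node[best][value] = build_dec_tree(
            subset, [a for a in atts if a != best], target)
    # attribute values not seen in examples (the empty-subset ELSE branch in the
    # pseudocode) would instead get a leaf labelled with the modal class
    return node

A call like build_dec_tree(training_examples, ["outlook", "windy"], "play") would then return a nested dict representing the tree.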
How is the best attribute usually chosen for decision tree induction?
Information Gain
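As a quick worked example of the calculation (the counts below are invented purely for illustration):

import math

def entropy(pos, neg):
    # H = -p*log2(p) - n*log2(n), taking 0*log2(0) as 0
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

# 8 examples (4 yes / 4 no) split by an attribute into a subset of 6 (4 yes / 2 no)
# and a subset of 2 (0 yes / 2 no)
before = entropy(4, 4)                                       # 1.0
after = (6 / 8) * entropy(4, 2) + (2 / 8) * entropy(0, 2)    # about 0.689
print(round(before - after, 3))                              # gain of about 0.311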
Explain the pseudocode for decision tree induction
Take the training set.
Work out the information gain of each attribute against the target attribute.
The attribute with the highest information gain is the best attribute to split on.
Examine the subsets produced by splitting on that attribute.
If all the values of the target attribute are the same in a subset, that subset is replaced by a leaf labelled with that value.
Otherwise run the whole procedure again on the subset, unless there are no more attributes to split on, in which case create a leaf labelled with the most frequently occurring value of the target attribute.
What are 3 issues with decision tree induction?
1) Inconsistent data
2) Numeric Attributes
3) Overfitting
Why is inconsistent data a problem in decision tree induction?
How can it be solved?
Inconsistent data means two or more examples have identical attribute values but different classes, so the algorithm may run out of attributes to generate subtrees with while the examples at a node still disagree.
The easiest solution is to label such a leaf with the modal (most frequent) class value.
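A tiny sketch of the situation and the fallback (the attribute names here are made up):

from collections import Counter

# Inconsistent data: identical attribute values, different classes -
# no remaining attribute can ever separate these examples
examples = [
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": False, "play": "no"},
]
# Fallback: label the leaf with the modal (most frequent) class
print(Counter(ex["play"] for ex in examples).most_common(1)[0][0])   # "yes"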
Why are numeric attribute values a problem in decision tree induction?
How can the issue be alleviated?
A numeric attribute can have a very large number of distinct values, so splitting on it creates one branch per value and produces massive trees.
The easy solution is to divide the values into ranges, e.g. 1-5, 6-10, 11-15, so reducing the number of possible values.
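For instance, a small helper (the bin width is chosen arbitrarily here) could map each numeric value onto a range label before the tree is built:

def to_range(value, width=5):
    # Map e.g. 3 -> "1-5", 7 -> "6-10", 12 -> "11-15"
    low = ((value - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

print([to_range(v) for v in (3, 7, 12)])   # ['1-5', '6-10', '11-15']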
What is overfitting?
When a model starts to fit the random noise in the data as well as the real underlying pattern.
What is overfitting specifically in relation to decision tree induction, and why does it occur?
How can it be alleviated?
The decision tree models the training sample rather than the whole population, so it captures peculiarities of the training set that might not be true of the population as a whole.
Can be alleviated by:
Pre-pruning (stop growing a branch early, before the tree fully fits the training data)
Post-pruning (grow the full tree, then cut back branches that add little; see the example below)
Increasing the number of high-quality samples
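As an illustration, assuming scikit-learn is available (the dataset and parameter values are arbitrary), pre-pruning can be done by limiting tree growth and post-pruning by cost-complexity pruning of a fully grown tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing early by limiting depth and leaf size
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the full tree, then prune it back (minimal cost-complexity pruning)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())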