Lesson 5 - Decision Trees Flashcards
1
Q
What are decision trees?
A
- tree representation of rules and decision functions
- can be reduced to a series of if-then rules (comprehensible to humans)
- medical, biological and financial applications
- Composed of (see the sketch below):
- root, the starting point
- inner nodes, each specifying some attribute
- branches, a possible value (discrete) or range of values (continuous) that the parent node's attribute can assume
- leaf nodes, the classification
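A minimal sketch of this structure (a hypothetical Python representation; the class and field names are illustrative, not from the lesson):

```python
# Minimal tree structure: inner nodes test an attribute and have one branch
# per value; leaves carry the classification. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Node:
    attribute: str = None                          # inner node: attribute tested here
    branches: dict = field(default_factory=dict)   # value / range -> child Node
    label: str = None                              # leaf node: the classification

# The root is simply the first Node; a root-to-leaf path follows one branch
# per tested attribute value.
```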
2
Q
How can decision trees be transformed into Boolean functions?
A
- paths from the root to a leaf are conjunctions of constraints on attribute values
- paths that lead to the same classification encode a disjunction
- together they define a formula in Disjunctive Normal Form (DNF), one for each class (see the sketch below)
- (A=1 and B=3) or (C=2) or (C=4 and D=1)
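A small sketch of reading the DNF off a tree, assuming a nested-dict representation {attribute: {value: subtree}} with class labels as leaves (the representation and names are illustrative assumptions):

```python
# Each root-to-leaf path is a conjunction of (attribute, value) constraints;
# paths ending in the same class are joined by disjunction (one DNF per class).

def paths_to_rules(tree, conditions=()):
    """Collect (conjunction_of_conditions, class) pairs, one per path."""
    if not isinstance(tree, dict):                  # leaf: a class label
        return [(conditions, tree)]
    (attribute, branches), = tree.items()           # single attribute per node
    rules = []
    for value, subtree in branches.items():
        rules += paths_to_rules(subtree, conditions + ((attribute, value),))
    return rules

def dnf_per_class(tree):
    """Group the conjunctions by class: each class gets a disjunction of paths."""
    dnf = {}
    for conditions, cls in paths_to_rules(tree):
        dnf.setdefault(cls, []).append(conditions)
    return dnf
```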
3
Q
When are decision trees particularly useful?
A
- instances are attribute-value pairs
- fixed set of attributes and values
- few possible attribute values
- discrete attribute values [can also be continuous]
- discrete target output values (2 or more classes)
- the target can be described by disjunctions of Boolean functions
- training data may contain noise and missing values
- decision tree learning algorithms are efficient and still relevant
4
Q
ID3 learning algorithm
A
- 1986 (Quinlan), the most popular decision tree learning algorithm
- top-down, greedy search through the space of possible decision trees
- ID3(S, A), where S is the set of training examples and A the set of attributes (see the sketch below):
- Create a root node T
- If the examples in S are all of the same class c, return the leaf node T with label c
- If A is empty, return the leaf node T labelled with the majority class in S
- Select a ∈ A, the optimal attribute in A (details later)
- Partition S according to the values that a can take: S_(a=v1), ..., S_(a=vm), where m is the number of distinct values of attribute a
- Return the tree T having as sub-trees the trees obtained by recursively calling ID3(S_(a=vj), A − {a}), for each j
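A minimal runnable sketch of the ID3 recursion, under assumed representations: examples are dicts of attribute -> value, labels is a parallel list of class labels, and leaves are plain class labels (all names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, labels, a):
    """Expected entropy reduction when partitioning on attribute a."""
    total = len(labels)
    remainder = 0.0
    for v in set(x[a] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[a] == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    # Base case 1: all examples share the same class -> leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left -> leaf with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute with maximal information gain
    a = max(attributes, key=lambda attr: gain(examples, labels, attr))
    tree = {a: {}}
    # Partition S on the values of a and recurse on A - {a}
    for v in set(x[a] for x in examples):
        sub_x = [x for x in examples if x[a] == v]
        sub_y = [y for x, y in zip(examples, labels) if x[a] == v]
        tree[a][v] = id3(sub_x, sub_y, [b for b in attributes if b != a])
    return tree
```

Calling id3(examples, labels, list(examples[0])) returns a nested tree such as {'Outlook': {'Sunny': ..., 'Rain': ...}}, with class labels at the leaves.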
5
Q
How does ID3 choose the optimal attribute?
A
- uses entropy and information gain
- entropy measures the degree of impurity of a sample
- information gain is the expected reduction of the entropy obtained when partitioning the samples according to the values of attribute a
- alternative criteria are (sketched below):
- Cross-Entropy
- Gini Index
- Misclassification
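A small sketch of these impurity criteria, assuming p is the list of class proportions at a node (a hypothetical input format):

```python
import math

def cross_entropy(p):
    """Entropy/deviance of the class proportions at a node."""
    return -sum(pc * math.log2(pc) for pc in p if pc > 0)

def gini(p):
    """Gini index: probability of misclassifying a randomly labelled sample."""
    return sum(pc * (1 - pc) for pc in p)

def misclassification(p):
    """Misclassification error of predicting the majority class."""
    return 1 - max(p)
```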
6
Q
What are the problems of decision trees?
A
- information gain overly favors attributes that can assume many possible values
- overfitting can occur
- mitigations: require a minimal number of examples in leaf nodes, limit the depth of the tree, or prune
7
Q
Entropy
A
- E(S) = -∑(c=1 -> C) p_c log2(p_c)
- p_c = |S_c| / |S|
- S: set of samples
- S_c: subset of S with class c
- C: number of classes
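- worked example (hypothetical counts): a sample S with 9 positive and 5 negative examples has E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940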
8
Q
Information Gain
A
- G(S, a) = E(S) − ∑(v ∈ V(a)) (|S_(a=v)| / |S|) E(S_(a=v))
- S: set of samples
- a: attribute; V(a): set of values a can take; S_(a=v): subset of S where a = v
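- worked example (continuing the hypothetical sample above, 9 positive / 5 negative): assume an attribute Wind with values Weak (6 positive, 2 negative) and Strong (3 positive, 3 negative); then G(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.000 ≈ 0.048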
9
Q
What is the inductive bias in decision trees?
A
- learning can be seen as a search in a hypothesis space
- shorter trees are preferred to longer trees
- attributes with high information gain are placed closer to the root
10
Q
How is the threshold for real-valued attributes selected?
A
- select the threshold that provides the maximal information gain for the attribute
- the optimal threshold always lies between two adjacent values with different classifications
- so the gain only needs to be computed for a finite set of candidate thresholds
- the resulting gain is compared with the gains of the other attributes to choose the optimal attribute
- real-valued attributes can be used multiple times along the same path (discrete attributes cannot); see the sketch below
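A minimal sketch of this threshold search, assuming values and labels are parallel lists for one real-valued attribute (the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    """Try midpoints between adjacent sorted values whose classes differ,
    and keep the threshold with maximal information gain."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(labels)
    best_t, best_gain = None, -1.0
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue                      # candidate thresholds only lie between
        t = (v1 + v2) / 2                 # values with different classifications
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        g = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain
```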
11
Q
How are missing values treated?
A
- Different solutions, based on the remaining samples (see the sketch below):
- use the most frequent value for the attribute over all samples
- use the most frequent value for the attribute among samples of the same class
- consider all possible values and their probabilities of occurrence, and replace the sample with fractional copies, one per value, each weighted by that value's probability (the fractions multiply along the path)
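A small sketch of the first two strategies, assuming examples are dicts that may contain None for missing values and labels is a parallel list of classes (all names are illustrative):

```python
from collections import Counter

def most_frequent_value(examples, attribute):
    """Most frequent non-missing value of the attribute over all samples."""
    values = [x[attribute] for x in examples if x[attribute] is not None]
    return Counter(values).most_common(1)[0][0]

def most_frequent_value_in_class(examples, labels, attribute, cls):
    """Same, but restricted to samples of the same class."""
    values = [x[attribute] for x, y in zip(examples, labels)
              if y == cls and x[attribute] is not None]
    return Counter(values).most_common(1)[0][0]
```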
12
Q
Reduced Error Pruning
A
- Pruning -> replace a subtree with a leaf labelled with the most frequent class among its examples
- split the available data into a training set and a validation set
- evaluate each inner node on the validation set, simulating that its subtree has been pruned
- prune the node whose removal gives the best validation performance (see the sketch below)
- repeat until pruning would make performance worse
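A minimal sketch of Reduced Error Pruning, assuming a hypothetical Node representation in which each node stores the majority class of the training examples that reached it (all names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str = None                               # class label if leaf
    attribute: str = None                           # attribute tested if inner node
    branches: dict = field(default_factory=dict)    # value -> child Node
    majority: str = None                            # most frequent class at this node

def classify(node, x):
    """Follow branches until a leaf (or an unseen value) is reached."""
    while node.label is None:
        child = node.branches.get(x.get(node.attribute))
        if child is None:
            return node.majority
        node = child
    return node.label

def accuracy(tree, validation):
    return sum(classify(tree, x) == y for x, y in validation) / len(validation)

def inner_nodes(node):
    if node.label is not None:
        return []
    nodes = [node]
    for child in node.branches.values():
        nodes += inner_nodes(child)
    return nodes

def reduced_error_pruning(tree, validation):
    best = accuracy(tree, validation)
    while True:
        candidate, candidate_acc = None, best
        for node in inner_nodes(tree):
            # Simulate pruning: temporarily turn the node into a leaf
            saved = (node.attribute, node.branches, node.label)
            node.attribute, node.branches, node.label = None, {}, node.majority
            acc = accuracy(tree, validation)
            node.attribute, node.branches, node.label = saved
            if acc >= candidate_acc:
                candidate, candidate_acc = node, acc
        if candidate is None:            # any further pruning would hurt performance
            return tree
        candidate.attribute, candidate.branches, candidate.label = None, {}, candidate.majority
        best = candidate_acc
```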
13
Q
Rule Post-Pruning
A
- Transform decision tree into a set of rules and perform rule pruning
- one if-then rule is generated for each root-to-leaf path [this changes the hypothesis space]
- for each rule a performance is estimated (treating the rule as a classifier)
- preconditions whose removal improves the estimated performance (on a validation set or via a statistical test) are greedily removed
- rules are sorted in descending order of estimated performance [classification follows this order: the first rule whose preconditions match is used, otherwise a default rule predicts the most frequent class]
- generally improves performance and works better than Reduced Error Pruning (see the sketch below)
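A small sketch of rule post-pruning, assuming a rule is a (preconditions, class) pair with preconditions as a dict {attribute: value}, and validation is a list of (example, label) pairs (the representation and names are illustrative):

```python
def rule_matches(preconditions, x):
    return all(x.get(a) == v for a, v in preconditions.items())

def rule_accuracy(preconditions, cls, validation):
    """Estimated performance: accuracy of the rule on the examples it covers."""
    covered = [(x, y) for x, y in validation if rule_matches(preconditions, x)]
    if not covered:
        return 0.0
    return sum(y == cls for _, y in covered) / len(covered)

def prune_rule(preconditions, cls, validation):
    """Greedily drop any precondition whose removal improves estimated accuracy."""
    preconditions = dict(preconditions)
    improved = True
    while improved and preconditions:
        improved = False
        base = rule_accuracy(preconditions, cls, validation)
        for attr in list(preconditions):
            trial = {a: v for a, v in preconditions.items() if a != attr}
            if rule_accuracy(trial, cls, validation) > base:
                preconditions, improved = trial, True
                break
    return preconditions

def rule_post_pruning(rules, validation, default_class):
    pruned = [(prune_rule(pre, cls, validation), cls) for pre, cls in rules]
    # Sort by estimated performance, best first; classification tries rules in
    # this order and falls back to the default (most frequent) class.
    pruned.sort(key=lambda r: rule_accuracy(r[0], r[1], validation), reverse=True)
    return pruned, default_class

def classify(rules, default_class, x):
    for preconditions, cls in rules:
        if rule_matches(preconditions, x):
            return cls
    return default_class
```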