Lesson 5 - Decision Trees Flashcards

1
Q

What are decision trees?

A
  • tree representation of rules and decision functions
    • can be reduced to a series of if-then rules (comprehensible to humans)
  • medical, biological and financial applications
  • Composed of (see the sketch below):
    • root, the starting point
    • inner nodes, each testing some attribute
    • branches, a possible value (discrete) or range of values (continuous) that the parent node's attribute can assume
    • leaf nodes, the classification
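As a rough illustration (my addition, not the lesson's notation), this structure could be represented in Python roughly as follows; the class and field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """Illustrative decision tree node; names are assumptions, not from the lesson."""
    attribute: Optional[str] = None                             # attribute tested by an inner node (None for leaves)
    branches: Dict[str, "Node"] = field(default_factory=dict)   # attribute value -> child node
    label: Optional[str] = None                                 # classification, set only on leaf nodes

# root testing "Outlook"; each branch carries one possible value, leaves carry the class
tree = Node(attribute="Outlook", branches={
    "Sunny":    Node(label="No"),
    "Overcast": Node(label="Yes"),
    "Rain":     Node(label="Yes"),
})
```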
2
Q

How to transform decision trees into Boolean functions?

A
  • paths from the root to a leaf are conjunctions of constraints on attribute values
  • paths that lead to the same classification encode a disjunction
  • together they define a formula in Disjunctive Normal Form (DNF), one for each class (sketched in Python below)
    • (A=1 and B=3) or (C=2) or (C=4 and D=1)
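A hedged sketch of this reading in Python, using an illustrative nested-dict tree (attributes and values are made up): each root-to-leaf path yields one conjunction, and the paths sharing a label form that class's disjunction.

```python
# Tree as nested dicts: {attribute: {value: subtree_or_label}}; a plain string is a leaf label.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def paths_to_rules(tree, conditions=()):
    """Collect (conjunction_of_conditions, label) pairs, one per root-to-leaf path."""
    if isinstance(tree, str):                      # leaf: the accumulated conditions form one conjunction
        return [(conditions, tree)]
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += paths_to_rules(subtree, conditions + ((attribute, value),))
    return rules

# The disjunction of all conjunctions leading to the same class is that class's DNF formula.
for conds, label in paths_to_rules(tree):
    print(" and ".join(f"{a}={v}" for a, v in conds), "->", label)
```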
3
Q

When are decision trees particularly useful?

A
  • instances are attribute-value pairs
    • fixed set of attributes and values
    • few possible attribute values
    • discrete attribute values [can also be continuous]
  • discrete target output values (2 or more classes)
  • target can be described by disjunctions of boolean functions
  • training data may contain noise and missing values
  • decision tree learning algorithms are efficient and still relevant
4
Q

ID3 learning algorithm

A
  • 1986, most popular decision tree learning algorithm
    • top-down, greedy search through the space of possible DTs
  • ID3(S, A), with S the training samples and A the available attributes (sketched in Python below):
  • Create a root node T
  • If the examples in S are all of the same class c, return the leaf node T with label c
  • If A is empty, return the leaf node T with label the majority class in S
  • Select a ∈ A, the optimal attribute in A (details later)
  • Partition S according to the values that a can take: S_{a=v1}, …, S_{a=vm}, where m is the number of distinct values of attribute a
  • Return the tree T having as sub-trees the trees obtained by recursively calling ID3(S_{a=vj}, A − {a}), for each j
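A minimal, self-contained Python sketch of this recursion, assuming each sample is a dict of attribute values plus a "class" key; the entropy-based gain used to pick the optimal attribute anticipates the next cards, and all names are illustrative.

```python
import math
from collections import Counter

def entropy(samples):
    counts = Counter(s["class"] for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, attr):
    total = len(samples)
    remainder = 0.0
    for value in {s[attr] for s in samples}:
        subset = [s for s in samples if s[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(samples) - remainder

def id3(samples, attributes):
    classes = {s["class"] for s in samples}
    if len(classes) == 1:                        # all samples share one class -> leaf with that label
        return classes.pop()
    if not attributes:                           # no attribute left -> leaf with the majority class
        return Counter(s["class"] for s in samples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(samples, a))
    tree = {best: {}}
    for value in {s[best] for s in samples}:     # one branch per observed value of the chosen attribute
        subset = [s for s in samples if s[best] == value]
        tree[best][value] = id3(subset, attributes - {best})
    return tree

# e.g. id3(samples, {"Outlook", "Wind"}) returns a nested dict {attribute: {value: subtree-or-label}}
```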
5
Q

How does ID3 choose the optimal attribute?

A
  • uses entropy and information gain
    • entropy measures the degree of impurity of a sample
    • information gain is the expected reduction of the entropy obtained when
      partitioning the samples according to the value of attribute a
  • alternative criteria are (sketched below):
    • Cross-Entropy
    • Gini Index
    • Misclassification
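For concreteness, a sketch of these impurity measures on a list of class proportions (function names are mine; cross-entropy is not shown separately).

```python
import math

def entropy(p):              # −Σ p_c · log2(p_c); 0 for a pure node, maximal for a uniform one
    return -sum(pc * math.log2(pc) for pc in p if pc > 0)

def gini_index(p):           # 1 − Σ p_c²
    return 1 - sum(pc ** 2 for pc in p)

def misclassification(p):    # 1 − max_c p_c, error of always predicting the majority class
    return 1 - max(p)

proportions = [0.5, 0.5]     # maximally impure binary node
print(entropy(proportions), gini_index(proportions), misclassification(proportions))  # 1.0 0.5 0.5
```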
6
Q

What are the problems of decision trees?

A
  • information gain overly favors attributes that can assume many possible values
  • overfitting can occur
    • mitigated by setting a minimal number of examples to accept in leaf nodes or by limiting the depth of the tree (pruning); see the scikit-learn example below
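As a point of reference (scikit-learn, not the lesson's ID3), these mitigations map onto standard hyperparameters; the values below are arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

# pre-pruning controls against overfitting
clf = DecisionTreeClassifier(
    criterion="entropy",    # information-gain-style splitting
    max_depth=4,            # limit the depth of the tree
    min_samples_leaf=5,     # minimal number of examples accepted in a leaf node
)
```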
7
Q

Entropy

A
  • E(S) = −∑_{c=1..C} p_c · log₂(p_c)
  • p_c = |S_c| / |S|
    • S samples
    • S_c subset of S with class c
    • C number of classes
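A standard worked example (my addition): for a binary sample with 9 positive and 5 negative examples, E(S) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940, while a pure sample has entropy 0 and a 50/50 sample has entropy 1.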
8
Q

Information Gain

A
  • G(S, a) = E(S) − ∑_{v∈V(a)} (|S_{a=v}| / |S|) · E(S_{a=v})
    • S samples
    • a attribute
    • V(a) set of values that a can assume
    • S_{a=v} subset of S where attribute a has value v
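Continuing the worked example from the entropy card (my addition): splitting the 9+/5− sample on a binary attribute whose first value covers 6+/2− and whose second value covers 3+/3− gives G = 0.940 − (8/14)·0.811 − (6/14)·1.0 ≈ 0.048.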
9
Q

What is the inductive bias of decision trees?

A
  • decision tree learning can be seen as a search in a hypothesis space
  • shorter trees are preferred over larger trees
  • attributes with high information gain are placed closer to the root
10
Q

How is the value for real-valued attributes selected?

A
  • select the threshold that provides the maximal information gain for the attribute (sketched below)
    • the optimal threshold is always located between two adjacent values with a different classification
    • so the gain is computed only for this small set of candidate values
  • the resulting gain is compared with the gains of the other attributes to choose the optimal attribute
  • real-valued attributes can be used multiple times on the same path (discrete ones cannot)
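A hedged sketch of this procedure for a single continuous attribute, with illustrative data: candidate thresholds are the midpoints between adjacent, differently-classified values, and the one with maximal information gain is kept.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximising information gain for a continuous attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 == c2:                       # candidates lie only between differently-classified values
            continue
        t = (v1 + v2) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        gain = base - len(left) / len(pairs) * entropy(left) - len(right) / len(pairs) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# the class changes around 54 and 85, so only those two midpoints are evaluated
print(best_threshold([40, 48, 60, 72, 80, 90], ["No", "No", "Yes", "Yes", "Yes", "No"]))
```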
11
Q

How are missing values treated?

A
  • Different solutions, based on the other samples (the first two are sketched below):
    • use the most frequent value for the attribute
    • use the most frequent value for the attribute among samples of the same class
    • consider all possible values with their probability of occurrence, replacing the sample with one fractional sample per value, weighted by that probability (fractions multiply along a path)
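A sketch of the first two strategies (helper names and data are mine; the fractional-sample strategy is not shown).

```python
from collections import Counter

def most_frequent(samples, attr):
    """Most frequent non-missing value of attr in samples."""
    values = [s[attr] for s in samples if s[attr] is not None]
    return Counter(values).most_common(1)[0][0]

def impute(samples, attr, by_class=False, target="class"):
    for s in samples:
        if s[attr] is None:
            # pool: all samples, or only those of the same class as the incomplete sample
            pool = [t for t in samples if t[target] == s[target]] if by_class else samples
            s[attr] = most_frequent(pool, attr)

data = [{"Wind": "Weak",   "class": "Yes"},
        {"Wind": "Strong", "class": "No"},
        {"Wind": None,     "class": "No"}]
impute(data, "Wind", by_class=True)   # the missing value becomes "Strong"
print(data)
```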
12
Q

Reduced Error Pruning

A
  • Pruning -> replace a subtree with a leaf labelled with its most frequent label (sketched below)
    • split the training set into a training set and a validation set
    • evaluate each inner node on the validation set while simulating that its subtree is pruned
    • prune the node whose removal gives the best performance
    • repeat until performance gets worse
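A self-contained sketch of this loop on a nested-dict tree ({attribute: {value: subtree-or-label}}); the representation and helper names are my assumptions.

```python
import copy
from collections import Counter

def predict(tree, sample):
    while isinstance(tree, dict):
        (attr, branches), = tree.items()
        tree = branches[sample[attr]]
    return tree

def accuracy(tree, data):
    return sum(predict(tree, s) == s["class"] for s in data) / len(data)

def inner_node_paths(tree, path=()):
    """Paths (tuples of (attribute, value) steps) leading to every inner node."""
    if not isinstance(tree, dict):
        return []
    (attr, branches), = tree.items()
    paths = [path]
    for value, subtree in branches.items():
        paths += inner_node_paths(subtree, path + ((attr, value),))
    return paths

def leaf_labels(tree):
    if not isinstance(tree, dict):
        return [tree]
    (attr, branches), = tree.items()
    return [label for subtree in branches.values() for label in leaf_labels(subtree)]

def pruned_copy(tree, path):
    """Copy of tree where the inner node at `path` is replaced by its most frequent leaf label."""
    new = copy.deepcopy(tree)
    node, parent, key = new, None, None
    for attr, value in path:
        parent, key = node, value
        (_, branches), = node.items()
        node = branches[value]
    label = Counter(leaf_labels(node)).most_common(1)[0][0]
    if parent is None:                      # pruning the root collapses the tree to a single leaf
        return label
    (_, branches), = parent.items()
    branches[key] = label
    return new

def reduced_error_pruning(tree, validation):
    while isinstance(tree, dict):
        base = accuracy(tree, validation)
        best_acc, best_path = max(
            (accuracy(pruned_copy(tree, path), validation), path)
            for path in inner_node_paths(tree)
        )
        if best_acc < base:                 # stop once the best possible prune hurts validation accuracy
            break
        tree = pruned_copy(tree, best_path)
    return tree
```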
13
Q

Post-Rule Pruning

A
  • Transform the decision tree into a set of rules and perform rule pruning (sketched below)
    • one rule is generated for each root-to-leaf path (if-then) [this changes the hypothesis space]
    • for each rule a performance is estimated (using the rule as a classifier)
    • preconditions whose removal increases performance (on a validation set or by a statistical test) are removed, with a greedy approach
    • rules are sorted in descending order of performance [classification follows this order: the first rule whose preconditions match is used, otherwise the default -> most frequent class]
  • Generally improves performance and works better than Reduced-Error Pruning
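A hedged sketch of pruning the preconditions of a single rule against a validation set (the rule representation, data, and stopping rule are my assumptions; sorting the pruned rules by performance is not shown).

```python
def rule_accuracy(conditions, label, data):
    """Accuracy of the rule, seen as a classifier, on the samples its preconditions cover."""
    covered = [s for s in data if all(s[a] == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(s["class"] == label for s in covered) / len(covered)

def prune_rule(conditions, label, validation):
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        best = rule_accuracy(conditions, label, validation)
        for cond in list(conditions):
            trial = [c for c in conditions if c != cond]         # greedily try dropping one precondition
            if rule_accuracy(trial, label, validation) > best:   # keep the drop if performance increases
                conditions, best, improved = trial, rule_accuracy(trial, label, validation), True
    return conditions

# rule: IF Outlook=Sunny AND Humidity=High THEN No
validation = [{"Outlook": "Sunny", "Humidity": "High",   "class": "Yes"},
              {"Outlook": "Sunny", "Humidity": "Normal", "class": "No"},
              {"Outlook": "Sunny", "Humidity": "Normal", "class": "No"}]
print(prune_rule([("Outlook", "Sunny"), ("Humidity", "High")], "No", validation))
# the Humidity=High precondition is dropped because removing it improves validation accuracy
```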