Chapter 7: Pruning in Decision Trees and Data Preparation Flashcards
1
Q
Decision Tree Algorithms
A
- At each node, the available attributes are evaluated by how well they separate the classes of the training examples.
- A goodness function is used for this purpose.
- Typical goodness functions:
- information gain (ID3/C4.5)
- information gain ratio
- gini index (CART)
2
Q
Computing Information
A
- Information is measured in bits
- Given a probability distribution, the information required to predict an event (e.g. whether play is yes or no) is the distribution's entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)
- Formula for computing the entropy (logarithms to base 2, so the result is in bits; see the sketch below):
- entropy(p1, …, pn) = -p1 log p1 - … - pn log pn
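A minimal Python sketch of this computation; the 9 yes / 5 no counts are only an illustration in the spirit of the play/weather example:

```python
import math

def entropy(class_counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

# Example: 9 "yes" and 5 "no" instances for the play attribute
print(round(entropy([9, 5]), 3))   # ~0.940 bits
```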
3
Q
Industrial-Strength Algorithms
A
- For an algorithm to be useful in a wide range of real-world applications it must:
- Permit numeric attributes
- Allow missing values
- Be robust in the presence of noise
- Basic schemes need to be extended to fulfill these requirements
4
Q
Pruning
A
- Prepruning tries to decide a priori when to stop creating subtrees
- Halt construction of decision tree early
- Use the same measure as for selecting split attributes, e.g., halt if InfoGain < threshold (see the sketch below)
- Most frequent class becomes the leaf node
- This turns out to be fairly difficult to do well in practice
- due to the “combination-lock effect”: two attributes that individually contribute nothing can be powerful predictors when combined
- Postpruning simplifies an existing decision tree
- Construct complete decision tree
- Then prune it back
- Used in C4.5, CART
- Needs more runtime than prepruning
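A rough sketch of the prepruning check mentioned above; the threshold value and the helper functions are illustrative assumptions, not part of any specific C4.5 implementation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, branch_labels):
    """Parent entropy minus the weighted entropy of the branches of a candidate split."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(b) / n * entropy(b) for b in branch_labels)

def preprune(parent_labels, best_branch_labels, threshold=0.05):
    """Halt tree construction if even the best split gains too little information;
    the most frequent class then becomes the leaf."""
    if info_gain(parent_labels, best_branch_labels) < threshold:
        return Counter(parent_labels).most_common(1)[0][0]   # label of the new leaf
    return None                                              # keep splitting

labels = ["yes", "yes", "yes", "no", "no"]
best_split = [["yes", "yes", "no"], ["yes", "no"]]
print(preprune(labels, best_split))   # gain ~0.02 < 0.05, so a "yes" leaf is returned
```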
5
Q
Postpruning
A
- Subtree replacement replaces a subtree with a single leaf node (the main method).
- Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent.
6
Q
When to Prune a Tree?
A
- To determine if a node should be replaced, compare the error rate estimate for the node with the combined error rates of the children.
- Replace the node (turning it into a leaf) if its estimated error rate is less than the combined rate of its children.
- Prune only if it reduces the estimated error
- Error on the training data is NOT a useful estimator
- Use a hold-out set for pruning (“reduced-error pruning”; see the sketch at the end of this card)
- Limits data you can use for training
- C4.5’s method
- Derive confidence interval from training data
- Use as a heuristic the upper limit of the error rate derived from this confidence interval for pruning
- Shaky statistical assumptions (because it is based on training data), but works well in practice
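A minimal sketch of the hold-out comparison behind reduced-error pruning for a single node; the error counts are hypothetical, and in practice the test is applied bottom-up over the whole tree:

```python
def should_replace(leaf_errors, subtree_errors):
    """Replace the subtree with a majority-class leaf if the leaf makes
    fewer mistakes on the hold-out set than the subtree does."""
    return leaf_errors < subtree_errors

# Hypothetical hold-out counts at one node: the subtree's leaves together
# misclassify 7 of the instances reaching the node; a single leaf would misclassify 5.
print(should_replace(leaf_errors=5, subtree_errors=7))   # True -> prune
```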
7
Q
Bernoulli Process
A
- The Bernoulli process is a discrete-time stochastic process consisting of a sequence of independent random variables (i.e., a sequence of Bernoulli trials)
- It applies to situations where an event either occurs (success) or does not occur (failure)
- Let the probability of occurrence be p and the probability of non-occurrence be q = 1 − p.
- Let the total number of independent trials be n, and the number of successes be k.
- The probability of k successes in n trials is given by the probability mass function (pmf) of the binomial distribution B(n, p): P(X = k) = (n choose k) · p^k · q^(n−k) (computed in the sketch below)
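For concreteness, a tiny computation of this pmf (the numbers are only an example):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# e.g. the probability of 2 successes in 10 trials with p = 0.3
print(round(binomial_pmf(2, 10, 0.3), 4))   # ~0.2335
```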
8
Q
Central Limit Theorem Revisited
A
- The central limit theorem states that the standardized average of i.i.d. random variables Xi with mean μ_X and variance σ²_X is asymptotically standard normal: Z = (X̄ − μ_X) / (σ_X / √n) ~ N(0, 1)
- Asymptotic normality implies that P(Z < z) → Φ(z) as n → ∞, i.e. P(Z < z) ≈ Φ(z) for large n (see the simulation sketch below)
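A quick simulation sketch (p, n, and the number of runs are arbitrary choices): the standardized mean of Bernoulli trials, i.e. an observed rate like the error rate f used in the later cards, behaves approximately like N(0, 1) for large n.

```python
import random
import statistics

p, n, runs = 0.3, 1000, 2000            # assumed illustrative parameters
mu, sigma = p, (p * (1 - p)) ** 0.5     # mean and std. dev. of a single Bernoulli trial

z_values = []
for _ in range(runs):
    f = sum(random.random() < p for _ in range(n)) / n   # observed success rate
    z_values.append((f - mu) / (sigma / n ** 0.5))       # standardized average

# Mean near 0 and standard deviation near 1, as the central limit theorem predicts
print(round(statistics.mean(z_values), 2), round(statistics.stdev(z_values), 2))
```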
9
Q
Using the Confidence Interval of a Normal Distribution
A
- C4.5 uses a heuristic limit for the error rate, derived from the confidence interval of the error rate for pruning
- The x% confidence interval [−z ≤ X ≤ z] for a random variable X with 0 mean is given by: Pr[−z ≤ X ≤ z] = x%
- With a symmetric distribution:
- Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z]
10
Q
Confidence Limits
A
- Confidence limits c for the standard normal distribution with 0 mean and a variance of 1 (a sketch for computing them follows this card):
- There is a 25% probability of X being greater than 0.69, i.e. Pr[X ≥ 0.69] = 25%, and hence Pr[−0.69 ≤ X ≤ 0.69] = 50%
- To use this we have to transform our random variable f so that it has 0 mean and unit variance
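One way to obtain such limits in practice, sketched with the standard library's inverse normal CDF; note that the exact one-tailed value for c = 25% is about 0.67, close to the 0.69 quoted in the C4.5 card below:

```python
from statistics import NormalDist

def z_for_one_tailed_confidence(c):
    """z such that Pr[X >= z] = c for a standard normal X, as used in C4.5's pruning."""
    return NormalDist().inv_cdf(1 - c)

for c in (0.40, 0.25, 0.10, 0.05, 0.01):
    print(f"c = {c:.2f} -> z = {z_for_one_tailed_confidence(c):.2f}")
# c = 0.25 gives z ~ 0.67 here; these flashcards use the commonly quoted 0.69
```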
11
Q
Transforming f
A
- The tree is pruned more aggressively when p, the estimate of the error rate, goes up:
- If c goes down, then z goes up and therefore p goes up
- If n goes down, then p goes up
12
Q
C4.5’s Method
A
- Error estimate e for a node (:= upper bound of the confidence interval); see the sketch at the end of this card:
- e = (f + z²/(2n) + z·sqrt(f/n − f²/n + z²/(4n²))) / (1 + z²/n)
- If c = 25% then z = 0.69 (from normal distribution)
- f is the error on the training data
- n is the number of instances covered by the node
- Even if a split increases the information gain, the error estimate e may increase as well
- Error estimate for subtree is weighted sum of error estimates for all its leaves
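A sketch of this estimate and of the subtree comparison, assuming the upper-bound formula given above; the (error rate, instance count) values below are hypothetical:

```python
import math

def pessimistic_error(f, n, z=0.69):
    """Upper confidence bound e on the error rate, from observed error rate f
    over n instances and the z value for the chosen confidence (0.69 for c = 25%)."""
    return (f + z * z / (2 * n)
            + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

def subtree_estimate(leaves):
    """Error estimate for a subtree: instance-weighted sum of its leaves' estimates."""
    total = sum(n for _, n in leaves)
    return sum((n / total) * pessimistic_error(f, n) for f, n in leaves)

# Hypothetical node covering 14 instances, 5 of them misclassified as a single leaf;
# its three leaves have (error rate, instance count) of (2/6, 6), (1/2, 2), (2/6, 6).
as_leaf = pessimistic_error(5 / 14, 14)
as_subtree = subtree_estimate([(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)])
print(round(as_leaf, 2), round(as_subtree, 2))   # ~0.45 vs ~0.51 -> replace with a leaf
```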
13
Q
C4.5 Example
A
14
Q
From Trees to Rules
A
- Simple way: one rule for each leaf (see the sketch at the end of this card)
- Often these rules are more complex than necessary
- C4.5rules:
- greedily prune conditions from each rule if this reduces its estimated error (as discussed above)
- Then
- look at each class in turn
- consider the rules for that class
- find a “good” subset (guided by Occam’s Razor)
- Finally, remove rules (greedily) if this decreases error on the training data
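A small sketch of the "one rule per leaf" step, using a hypothetical nested-tuple encoding of a weather tree; the pruning and subset-selection steps described above are not shown:

```python
def tree_to_rules(tree, conditions=()):
    """One rule per leaf: the conjunction of attribute tests on the path from the root."""
    if isinstance(tree, str):                      # a leaf is just a class label
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

# Hypothetical tree: (attribute, {value: subtree}); leaves are class labels
weather_tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})
for conds, label in tree_to_rules(weather_tree):
    print(" AND ".join(f"{a} = {v}" for a, v in conds), "->", label)
```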
15
Q
C4.5 and C4.5rules: Summary
A
- C4.5 decision tree algorithm has two important parameters
- Confidence value (default 25%): lower values incur heavier pruning
- Minimum number of instances in the two most popular branches (default 2)
- Classification rules
- C4.5rules is slow for large and noisy datasets
- Commercial version C5.0rules uses a different pruning technique
- Much faster and a bit more accurate
16
Q
Knowledge Discovery Process
A