Chapter 7: Pruning in Decision Trees and Data Preparation (Flashcards)
Decision Tree Algorithms
- At each node, the available attributes are evaluated on how well they separate the classes of the training examples.
- A goodness function is used for this purpose.
- Typical goodness functions:
- information gain (ID3/C4.5)
- information gain ratio
- gini index (CART)
Computing Information
- Information is measured in bits
- Given a probability distribution, the information required to predict an event, e.g. whether play is yes or no, is the distribution’s entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)
- Formula for computing the entropy:
- entropy(p1, …, pn) = − p1 log2 p1 − … − pn log2 pn
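As a concrete illustration, here is a minimal Python sketch of the entropy computation (the function name and the example distribution are illustrative additions, not part of the original card):

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Example: a "play yes/no" distribution with 9 yes and 5 no out of 14 instances
print(entropy([9 / 14, 5 / 14]))  # ~0.940 bits
```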
Industrial-Strength Algorithms
- For an algorithm to be useful in a wide range of real-world applications it must:
- Permit numeric attributes
- Allow missing values
- Be robust in the presence of noise
- Basic schemes need to be extended to fulfill these requirements
Pruning
- Prepruning tries to decide a priori when to stop creating subtrees
- Halt construction of decision tree early
- Use same measure as in determining attributes, e.g., halt if InfoGain < threshold
- Most frequent class becomes the leaf node
- This turns out to be fairly difficult to do well in practice
- due to “combination-lock effect”, i.e. two attributes individually have nothing to contribute but are powerful predictors when combined
- Postpruning simplifies an existing decision tree
- Construct complete decision tree
- Then prune it back
- Used in C4.5, CART
- Needs more runtime than prepruning
Postpruning
- Subtree replacement replaces a subtree with a single leaf node (main method)
- Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent

When to Prune a Tree?
- To determine if a node should be replaced, compare the error rate estimate for the node with the combined error rates of the children.
- Replace the node if its error rate estimate is less than the combined rates of its children (see the sketch at the end of this card)
- Prune only if it reduces the estimated error
- Error on the training data is NOT a useful estimator
- Use a hold-out set for pruning (“reduced-error pruning”)
- Limits data you can use for training
- C4.5’s method
- Derive a confidence interval for the error rate from the training data
- Use the interval’s upper limit as a heuristic (pessimistic) error estimate for pruning
- Shaky statistical assumptions (because it is based on training data), but works well in practice
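The compare-and-replace logic above can be sketched in a few lines of Python; the Node class and the precomputed per-node error estimates are simplifying assumptions for illustration, not C4.5’s actual data structures:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    # Hypothetical, simplified tree node: a node with no children is a leaf.
    majority_class: str
    n: int                    # number of training instances covered by this node
    error_estimate: float     # estimated error rate if this node were a leaf
    children: List["Node"] = field(default_factory=list)

def subtree_error(node: Node) -> float:
    """Weighted error estimate of a (sub)tree: leaf estimate, or weighted sum over children."""
    if not node.children:
        return node.error_estimate
    return sum(subtree_error(c) * c.n for c in node.children) / node.n

def prune(node: Node) -> Node:
    """Bottom-up subtree replacement: turn a node into a leaf if its own error
    estimate is no worse than the combined error estimate of its children."""
    if not node.children:
        return node
    node.children = [prune(child) for child in node.children]
    if node.error_estimate <= subtree_error(node):
        node.children = []    # replace the subtree with a single leaf
    return node
```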

Bernoulli Process
- The Bernoulli process is a discrete-time stochastic process consisting of a sequence of independent random variables (i.e., a sequence of Bernoulli trials)
- Applied to situations where an event either occurs (success) or does not occur (failure)
- Let the probability of occurrence be p and the probability of no occurrence be q = 1 − p.
- Let the total number of independent trials be n, and the number of successes be k.
- The probability of k successes in n trials is given by the probability mass function (pmf) of the binomial distribution B(n, p):
- P(X = k) = (n choose k) · p^k · q^(n−k)
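A short Python check of the binomial pmf (the function name and the example values are illustrative):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# Example: probability of 3 successes in 10 trials with p = 0.25
print(binomial_pmf(3, 10, 0.25))  # ~0.250
```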

Central Limit Theorem Revisited
- The central limit theorem states that the standardized average of i.i.d. random variables X_i with mean μ_X and variance σ²_X is asymptotically N(0,1): Z_n = (X̄_n − μ_X) / (σ_X / √n) → N(0,1)
- Asymptotic normality implies that P(Z_n ≤ z) → Φ(z) as n → ∞, i.e. P(Z_n ≤ z) ≈ Φ(z) for large n

Using the Confidence Interval of a Normal Distribution
- C4.5 uses a heuristic limit for the error rate, derived from the confidence interval of the error rate for pruning
- The x% confidence interval [−z ≤ X ≤ z] for a random variable X with 0 mean is given by: Pr[−z ≤ X ≤ z] = x%
- With a symmetric distribution:
- Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z]
Confidence Limits
- Confidence limits c for the standard normal distribution with 0 mean and a variance of 1:
- e.g. there is a 25% probability of X being ≥ 0.69, i.e. Pr[X ≥ 0.69] = 25%, and hence Pr[−0.69 ≤ X ≤ 0.69] = 50%
- To use this we have to transform our random variable f (the observed error rate) so that it has 0 mean and unit variance
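Such confidence limits can be reproduced numerically; the sketch below assumes SciPy is available and simply inverts the standard normal CDF (the slides round 0.6745 to 0.69):

```python
from scipy.stats import norm

def z_for_upper_tail(c: float) -> float:
    """Return z such that Pr[X >= z] = c for the standard normal distribution."""
    return norm.ppf(1 - c)

print(z_for_upper_tail(0.25))  # ~0.6745, quoted as 0.69 above
```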

Transforming f
- The observed error rate f is transformed to have 0 mean and unit variance; p, the upper limit of the resulting confidence interval, is used as the estimator for the error rate
- The tree is pruned more strongly when:
- c goes down ⇒ z goes up ⇒ p goes up
- n goes down ⇒ p goes up

C4.5’s Method
- Error estimate e for a node (:= upper bound of confidence interval):
- e = ( f + z²/(2n) + z · √( f/n − f²/n + z²/(4n²) ) ) / ( 1 + z²/n )
- If c = 25% then z = 0.69 (from normal distribution)
- f is the error on the training data
- n is the number of instances covered by the node
- Even if a split increases the information gain, the error estimate e might increase as well
- Error estimate for subtree is weighted sum of error estimates for all its leaves
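The error estimate above can be written directly in code; this sketch follows the upper-bound formula given in Witten & Frank (2005) for C4.5’s pessimistic estimate, with f, n and z as defined on this card:

```python
from math import sqrt

def pessimistic_error(f: float, n: int, z: float = 0.69) -> float:
    """Upper bound e of the confidence interval for the error rate, given
    training error f on n instances and confidence limit z (z = 0.69 for c = 25%)."""
    return (f + z**2 / (2 * n)
            + z * sqrt(f / n - f**2 / n + z**2 / (4 * n**2))) / (1 + z**2 / n)

# Example: 2 errors on 6 instances (f = 0.33) yields e of roughly 0.47
print(pessimistic_error(2 / 6, 6))
```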
C4.5 Example


From Trees to Rules
- Simple way: one rule for each leaf
- Often these rules are more complex than necessary
- C4.5rules:
- greedily prune conditions from each rule if this reduces its estimated error (as discussed above)
- Then
- look at each class in turn
- consider the rules for that class
- find a “good” subset (guided by Occam’s Razor)
- Finally, remove rules (greedily) if this decreases error on the training data
C4.5 and C4.5rules: Summary
- C4.5 decision tree algorithm has two important parameters
- Confidence value (default 25%): lower values incur heavier pruning
- Minimum number of instances in the two most popular branches (default 2)
- Classification rules
- C4.5rules slow for large and noisy datasets
- Commercial version C5.0rules uses a different pruning technique
- Much faster and a bit more accurate
Knowledge Discovery Process

Data Understanding: Quantity
- Number of instances (records)
- Rule of thumb: 5,000 or more desired (nice to have)
- if less, results are less reliable; use special methods (boosting, …)
- Number of attributes
- Rule of thumb: Start out with less than 50 attributes
- If many attributes, use attribute selection
- Number of instances per target class
- Rule of thumb: >100 for each class
- if very unbalanced, use stratified sampling
Data Understanding
- Visualization
- Data summaries
- Attribute means
- Attribute variation
- Attribute relationships
Data Cleaning: Outline
- Convert data to a standard format (e.g., arff or csv)
- Unified date format
- Missing values
- Fix errors and outliers
- Convert nominal fields whose values have order to numeric
- Binning (i.e. discretization) of numeric data
Data Cleaning: Missing Values
- Missing data can appear in several forms:
- “0” “.” “999” “NA” …
- Standardize missing value code(s)
- Dealing with missing values:
- Ignore records with missing values
- Treat missing value as a separate value
- Imputation: fill in with mean or median values
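A minimal pandas sketch of these strategies; the column names and the 999 sentinel are made-up examples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 999], "income": [50, 60, np.nan, 70]})

# Standardize missing value codes: treat the sentinel 999 as missing
df = df.replace(999, np.nan)

# Option 1: ignore records with missing values
complete_rows = df.dropna()

# Option 2/3: imputation, filling in the mean (or median) of each column
imputed = df.fillna(df.mean())
print(imputed)
```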
Conversion: Ordered to Numeric
- Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
- A 4.0
- A- 3.7
- B+ 3.3
- B 3.0
- Why is it important to preserve natural order?
- To allow meaningful comparisons, e.g. Grade > 3.5
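A pandas sketch of the Grade conversion (the mapping mirrors the card above):

```python
import pandas as pd

grades = pd.Series(["A", "B+", "A-", "B"])

# Map ordered grade labels to numbers that preserve their natural order
grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}
numeric = grades.map(grade_points)

print(numeric > 3.5)  # meaningful comparison, e.g. Grade > 3.5
```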
Conversion: Nominal, Few Values
- Multi-valued, unordered attributes with small no. of values
- e.g. Color=Red, Orange, Yellow, …
- for each value v create a binary “flag” variable C_v, which is 1 if Color = v and 0 otherwise
Conversion: Nominal, Many Values
- Ignore ID-like fields whose values are unique for each record
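A sketch of both conversions in pandas; the customer_id and color columns are made-up examples:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["c001", "c002", "c003"],   # ID-like: unique for each record
    "color": ["Red", "Orange", "Red"],
})

# Nominal with few values: one binary "flag" column per value
df = pd.get_dummies(df, columns=["color"], prefix="color")

# Nominal with many values: ignore ID-like fields whose values are all unique
id_like = [c for c in df.columns if df[c].nunique() == len(df)]
df = df.drop(columns=id_like)
print(df)
```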

Data Cleaning: Discretization
- Discretization reduces the number of values for a continuous attribute
- Why?
- Some methods can only use nominal data
- E.g., in ID3, Apriori, most versions of Naïve Bayes, CHAID
- Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)
- Some methods that handle numeric attributes assume a normal distribution, which is not always appropriate
- Discretization is useful for generating a summary of data
- Also called “binning”
Discretization: Equal-Width

Discretization: Equal-Width May Produce Clumping

Discretization: Equal-Frequency
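The figures for these cards are not reproduced here; as a hedged illustration, pandas offers both binning schemes directly (bin counts and example values are arbitrary):

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 4, 5, 8, 9, 20, 45])

# Equal-width: split the value range into bins of equal width (may produce clumping)
equal_width = pd.cut(values, bins=4)

# Equal-frequency ("equal-height"): each bin receives roughly the same number of values
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```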

Discretization: Class Dependent (Supervised Discretization)
- Treating numerical attributes as nominal discards the potentially valuable ordering information
- Alternative: Transform the k nominal values into k − 1 binary attributes. The (i − 1)-th binary attribute indicates whether the discretized attribute is less than i
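A small sketch of the k − 1 binary encoding described above (the function name is illustrative):

```python
def ordered_to_binary(value: int, k: int) -> list:
    """Encode a discretized value from {1, ..., k} as k - 1 binary attributes.
    The (i - 1)-th attribute is 1 if the value is less than i (for i = 2, ..., k)."""
    return [1 if value < i else 0 for i in range(2, k + 1)]

# Example with k = 4 bins: the encoding preserves the ordering information
for v in range(1, 5):
    print(v, ordered_to_binary(v, 4))
```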

Discretization Considerations
- Equal Width is simplest, good for many classes
- can fail miserably for unequal distributions
- Equal Frequency gives better results
- In practice, “almost-equal” height binning is used which avoids clumping and gives more intuitive breakpoints
- Also used for clustering (when there is no “class”)
- Class-dependent can be better for classification
- Decision trees build discretization on the fly
- Naïve Bayes requires initial discretization
- Other methods exist …
Unbalanced Target Distribution
- Sometimes, classes have very unequal frequency
- Churn prediction: 97% stay, 3% churn (in a month)
- Medical diagnosis: 90% healthy, 10% disease
- eCommerce: 99% don’t buy, 1% buy
- Security: >99.99% of Germans are not terrorists
- Similar situation with multiple classes
- Majority class classifier can be 97% correct, but useless
Building Balanced Train Sets
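The original figure is not reproduced here; as one common approach (an assumption, not necessarily the slide’s exact method), the majority class can be randomly downsampled to the size of the rarest class:

```python
import pandas as pd

def balanced_train_set(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Keep all records of the rarest class and downsample every other class to match."""
    smallest = df[target].value_counts().min()
    return df.groupby(target).sample(n=smallest, random_state=seed)

# Example: a 97% "stay" / 3% "churn" set becomes 50/50 after downsampling
```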

Attribute Selection
- Aka feature / variable / field selection
- If there are too many attributes, select a subset that is most relevant.
- Remove redundant and/or irrelevant attributes
- Rule of thumb – keep top 50 attributes
Reasons for Attribute Selection
- Simpler model
- More transparent
- Easier to interpret
- Faster model induction
- What about overall time?
- What about the accuracy?
- The addition of irrelevant attributes can negatively impact the performance of kNN, DT, regressions, clustering, etc.
- Experiments: Adding a random attribute to a DT learner can decrease the accuracy by 5-10% (Witten & Frank, 2005)
- Instance-based methods are particularly weak in the presence of irrelevant attributes, especially when there are only a few training instances
Attribute Selection Heuristics
- Stepwise forward selection
- Start with empty attribute set
- Add “best” of attributes (e.g. using entropy)
- Add “best” of remaining attributes
- Repeat; stop once the top n attributes (a chosen threshold) have been selected (see the sketch after this card)
- Stepwise backward selection
- Start with entire attribute set
- Remove “worst” of attributes
- Repeat until n are left.
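A sketch of stepwise forward selection where “best” is judged by information gain, as suggested above; the data layout (a list of attribute dicts plus class labels) and all names are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting the data set on attribute attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def forward_selection(rows, labels, n_attrs):
    """Greedily add the attribute with the highest information gain until n_attrs are chosen."""
    remaining, selected = set(rows[0]), []
    while remaining and len(selected) < n_attrs:
        best = max(remaining, key=lambda a: info_gain(rows, labels, a))
        selected.append(best)
        remaining.remove(best)
    return selected
```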
Attribute Selection
- Using entropy for attribute selection
- Calculate information gain of each attribute
- Select the n attributes with the highest information gain
- Experiences
- Lead to local, not necessarily global optima
- Nevertheless, they perform reasonably well
- Backward selection performed better than forward selection
- Forward selection leads to smaller attribute sets and easier models
- Attribute selection is actually a search problem
- Want to select subset of attributes giving most accurate model

Basic Approaches to Attribute Selection
- Remove attributes with no or little variability
- Examine the number of distinct attribute values
- Rule of thumb: remove a field where almost all values are the same (e.g. null), except possibly in minp % or less of all records.
- Remove false predictors (“leakers”)
- False predictors are fields correlated to target behavior, which describe events that happen at the same time or after
- E.g. student final grade, for the task of predicting whether the student passed the course
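A pandas sketch of the low-variability rule; minp is the threshold from the card above, and false predictors still have to be identified from domain knowledge:

```python
import pandas as pd

def drop_low_variability(df: pd.DataFrame, minp: float = 0.01) -> pd.DataFrame:
    """Drop columns where a single value covers more than (1 - minp) of all records."""
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True, dropna=False).iloc[0] <= 1 - minp]
    return df[keep]

# False predictors ("leakers"), e.g. a final-grade column when predicting pass/fail,
# cannot be detected this way and must be dropped based on domain knowledge.
```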
Text Mining
- Many text databases exist in practice
- News articles
- Research papers
- Books
- Digital libraries
- E-mail messages
- Web pages
Text Mining Process
- Text preprocessing
- Syntactic/Semantic text analysis
- Feature generation
- Bag of words
- Feature selection
- Simple counting
- Statistics
- Text/Data Mining
- Classification of documents
- Clustering of documents
- Analyzing result

Text Attributes
- For each word, generate a boolean attribute indicating whether the text of the instance contains the word
- Alternatively, the attribute may indicate the number of the word’s occurrences
- If “too many” words, keep only the more frequent ones
- Stemming algorithms reduce words to their root, e.g. guarantee, guaranteeing, and guaranteed are all reduced to the same root
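A small sketch of boolean bag-of-words attributes, using two of the example titles from the next card; the frequency threshold is arbitrary and no stemming is applied:

```python
import re
from collections import Counter

docs = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
}

# Tokenize each document into lowercase words
tokens = {d: re.findall(r"[a-z]+", text.lower()) for d, text in docs.items()}

# Keep only the more frequent words across all documents
counts = Counter(w for ws in tokens.values() for w in ws)
vocabulary = [w for w, c in counts.items() if c >= 2]

# Boolean attribute per word: does the document contain the word?
bag_of_words = {d: {w: w in ws for w in vocabulary} for d, ws in tokens.items()}
print(bag_of_words)
```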
Data Mining: Technical Memo Example: Titles
- c1 Human machine interface for Lab ABC computer applications
- c2 A survey of user opinion of computer system response time
- c3 The EPS user interface management system
- c4 System and human system engineering testing of EPS
- c5 Relation of user-perceived response time to error measurement
- m1 The generation of random, binary, unordered trees
- m2 The intersection graph of paths in trees
- m3 Graph minors IV: Widths of trees and well-quasi-ordering
- m4 Graph minors: A survey

Data mining: “Search” versus “Discover”
