Chapter 8: Evaluation of Classifiers Flashcards
The CRISP Data Mining Process

Precision of Classifier Performance
- Holdout procedure - split data into two sets
- E.g. about 2/3 (66%) for training, 1/3 (33%) for testing
- The more training instances, the better (after a certain size the performance might stagnate)
- The more test data, the better the estimation of the error rate
- Use k-fold cross validation
- Split into k sets
- Generate k models with k-1 sets (leave one set out)
- Test each model on the set that was left out
- Use leave-one-out (jackknife)
- Generate n models, each trained on (n-1) instances
- Apply each model to the single omitted instance
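A minimal sketch of the k-fold idea (NumPy only); `fit` and `error_rate` are hypothetical placeholders for whatever learner and error measure are used, and `X`, `y` are assumed to be NumPy arrays:

```python
import numpy as np

def k_fold_error(X, y, fit, error_rate, k=10, seed=0):
    """Average error over k folds; `fit` and `error_rate` are placeholders."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                 # k (nearly) equal partitions
    errors = []
    for i in range(k):
        test_idx = folds[i]                        # leave this fold out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])    # train on the other k-1 folds
        errors.append(error_rate(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))                  # averaged error estimate
```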
Holdout Procedures
- Holdout procedure:
- Reserve some data for testing (usually ~1/3, selected at random)
- Use remaining data for training
- Plentiful data - no problem!
- Common case: data set is limited
- Problems:
- Want both sets as large as possible
- Want both sets to be representative
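A minimal random holdout split, expressed over instance indices (the ~1/3 test fraction follows the rule of thumb above):

```python
import numpy as np

def holdout_split(n, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx) for a random holdout over `n` instances."""
    idx = np.random.default_rng(seed).permutation(n)   # random order
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]
```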
“Smart” Holdout
- Simple check: Are the proportions of classes about the same in each data set?
- Stratified holdout
- Guarantee that classes are (approximately) proportionally represented in the test and training set
- Repeated holdout (in addition): repeat with different randomly selected holdout sets to obtain several independent error estimates
- Randomly select holdout set several times and average the error rate estimates
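A sketch of stratified, repeated holdout, assuming scikit-learn is available; the decision tree is only an example classifier and `X`, `y` are NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

def repeated_stratified_holdout_error(X, y, clf=None, repeats=10, test_size=1/3):
    clf = clf if clf is not None else DecisionTreeClassifier(random_state=0)
    splitter = StratifiedShuffleSplit(n_splits=repeats, test_size=test_size,
                                      random_state=0)
    errors = []
    for train_idx, test_idx in splitter.split(X, y):   # class proportions preserved
        clf.fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))   # error rate
    return float(np.mean(errors))                       # averaged over the repeats
```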
Cross Validation

Holdout w/ Cross-Validation
- Fixed number of k partitions of the data (folds)
- In turn: each partition is used for testing and the remaining instances for training
- Finally each instance is used for testing once
- May use stratification
- Standard practice: stratified tenfold cross-validation
- The error rate is estimated as the average of the error rates over the folds
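A stratified tenfold cross-validation sketch, again assuming scikit-learn; the classifier choice is arbitrary:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def stratified_tenfold_error(X, y, clf=None):
    clf = clf if clf is not None else DecisionTreeClassifier(random_state=0)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    accuracies = cross_val_score(clf, X, y, cv=cv)   # one accuracy per fold
    return 1.0 - float(np.mean(accuracies))          # averaged error rate
```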

Leave-One-Out Holdout
- n-Fold Cross-Validation
- n instances are in the data set
- Use all but one instance for training
- Each iteration is evaluated by predicting the omitted instance
- Advantages / Disadvantages
- Maximum use of the data for training
- Deterministic (no random sampling of test sets)
- High computational cost
- Non-stratified sample! (stratification is impossible with a single test instance)
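A leave-one-out sketch using scikit-learn's LeaveOneOut splitter (assumed available); the k-NN classifier is only an illustrative choice:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def leave_one_out_error(X, y, clf=None):
    clf = clf if clf is not None else KNeighborsClassifier(n_neighbors=3)
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):   # n splits, one instance each
        clf.fit(X[train_idx], y[train_idx])
        errors.append(clf.predict(X[test_idx])[0] != y[test_idx][0])
    return float(np.mean(errors))    # deterministic, but requires n model fits
```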
Comparing Mining Algorithms
- Suppose we have two algorithms
- Obtain two different models
- Estimate the error rates for the two models
- Compare estimates
- e(1) < e(2)?
- Select the better one
- Problem?
- Are there “significant” differences in the error rates?
Comparing Random Estimates
- Estimated error rate is just an estimate (random)
- Student’s paired t-test tells us whether the means of two samples are significantly different
- Construct a t-test statistic
- Need variance as well as point estimates
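A paired t-test sketch on per-fold error rates, assuming SciPy; both classifiers must have been evaluated on the same folds, and the numbers in the usage comment are made up:

```python
from scipy import stats

def significantly_different(errors_1, errors_2, alpha=0.05):
    """Paired (dependent) t-test on per-fold error rates of two classifiers."""
    t_stat, p_value = stats.ttest_rel(errors_1, errors_2)
    return p_value < alpha, t_stat, p_value

# Usage with invented per-fold error rates:
# significantly_different([0.12, 0.10, 0.15, 0.11], [0.14, 0.13, 0.16, 0.15])
```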

Counting the Cost of Classification Errors
- In practice, different types of classification errors often incur different costs
- Examples:
- Predicting when customers leave a company for competitors (attrition/churn prognosis)
- Much more costly to lose a valuable customer
- Much less costly to act on a customer who was falsely predicted to leave
- Loan decisions
- Oil-slick detection
- Fault diagnosis
- Promotional mailing
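A cost-sensitive evaluation sketch: instead of counting all errors equally, the confusion-matrix counts are weighted by a cost matrix (the churn costs below are invented for illustration):

```python
import numpy as np

# rows = actual class, columns = predicted class; 0 = stays, 1 = churns
COST = np.array([[0.0,  1.0],    # false alarm: contacting a loyal customer is cheap
                 [20.0, 0.0]])   # miss: losing a valuable customer is expensive

def total_cost(y_true, y_pred, cost=COST):
    """Sum of costs over all instances; classes are coded as 0/1 integers."""
    counts = np.zeros_like(cost)
    for actual, predicted in zip(y_true, y_pred):
        counts[actual, predicted] += 1
    return float((counts * cost).sum())
```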

Direct Marketing Paradigm
- Find most likely prospects to contact
- Not everybody needs to be contacted
- Number of targets is usually much smaller than number of prospects
- Typical applications (only a small part of all prospects is of interest)
- retailers, catalogues, direct mail (and e-mail)
- customer acquisition, cross-sell, attrition prediction
- …
Direct Marketing Evaluation
- Accuracy on the entire dataset is not the right measure (in a decision tree, the leaf node provides the class probability used as a score)
- Approach
- develop a target model
- score all prospects and rank them by decreasing score
- select top P% of prospects for action
- How to decide what is the best selection?
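A sketch of the score-and-rank step; `model` is assumed to expose `predict_proba` in the scikit-learn style:

```python
import numpy as np

def select_top_p(model, X_prospects, p=0.10):
    """Return the indices of the top P% prospects, ranked by predicted response."""
    scores = model.predict_proba(X_prospects)[:, 1]   # P(responder) per prospect
    order = np.argsort(-scores)                        # decreasing score
    n_select = int(np.ceil(p * len(scores)))
    return order[:n_select]
```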
Generating a Gain Curve
- Instances are sorted according to their predicted probability:
- In a gain curve
- x axis is sample size
- y axis is number of positives
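A sketch of how the gain-curve points can be computed from true labels and predicted scores:

```python
import numpy as np

def gain_curve(y_true, scores):
    """x: sample size, y: cumulative positives, with instances sorted by score."""
    y = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))    # highest predicted probability first
    sample_size = np.arange(1, len(y) + 1)
    cum_positives = np.cumsum(y[order])
    return sample_size, cum_positives
```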

Gain Curve: Random List vs. Model-ranked List

Lift Curve

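A common reading: the lift at sample size k is the number of positives found in the top-k model-ranked instances divided by the number a random list of size k would be expected to find. A small self-contained sketch:

```python
import numpy as np

def lift_curve(y_true, scores):
    y = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))
    sample_size = np.arange(1, len(y) + 1)
    cum_positives = np.cumsum(y[order])
    expected_random = sample_size * (y.sum() / len(y))    # random-list baseline
    return sample_size, cum_positives / expected_random   # lift = 1.0 for random ranking
```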
ROC Curves
- Differences to gain chart:
- y axis shows percentage of true positives in sample (rather than absolute number):
- TP=(#T predicted T)/#T
- x axis shows percentage of false positives in sample (rather than sample size):
- FP=(#F predicted T)/#F
- i.e. the fraction of actual negatives that are wrongly predicted as positive
- ROC curves are similar to gain curves
- “ROC” stands for “receiver operating characteristic”
- Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
- Go through all sizes of the ranked sample and plot the resulting FP rate vs. TP rate
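A sketch that sweeps over all sample sizes of the score-ranked list and computes the (FP rate, TP rate) points:

```python
import numpy as np

def roc_points(y_true, scores):
    y = np.asarray(y_true).astype(bool)
    order = np.argsort(-np.asarray(scores))   # highest scores first
    hits = y[order]
    tp = np.cumsum(hits)                      # true positives in the top-k sample
    fp = np.cumsum(~hits)                     # false positives in the top-k sample
    tpr = tp / y.sum()                        # TP = (#T predicted T) / #T
    fpr = fp / (~y).sum()                     # FP = (#F predicted T) / #F
    return fpr, tpr                           # x axis: FP rate, y axis: TP rate
```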

ROC Curves for Two Classifiers

Comparison Studies of Classification Methods
- Large number of classification techniques from the fields of machine learning (ML), neural networks (NN) and statistics
- Many empirical comparisons in the 80s and 90s with contradictory results
- StatLog Project (mid-1990s)
- More than 20 methods and 20 datasets
- Performance of methods depends very much on the data set
- No general guidelines
- Comparison study is advisable in special cases
- State-of-the-practice
- Comparisons of several methods on a particular data set
- Sometimes even automated “mass modeling”
- Method ranking
“Prime Candidates” – Recommendations from StatLog
- The following methods should be considered in real-world comparison studies
- Logistic Regression (Discriminant Analysis)
- Decision trees
- K-Nearest Neighbour
- Nonparametric statistical methods
- Neural Networks
- Decision trees and logistic regression are widely used in practice
- High performance / low error rate
- Speed of model building and classification
- Easy to explain
- as compared to NN or kNN
A Real World Example of Classification: Campaign Management
- Problem definition:
- Marketing campaign for online billing
- Selection of a few thousand potential responders to a campaign from a few million customers
- Questions
- Probability of customer response?
- Which customer attributes are relevant for characterizing a responder?
Model Selection Criteria - best model
- Can we find a single ‘best’ model / classifier?
- Model selection criteria attempt to find a good compromise between:
- The complexity of a model
- Its prediction accuracy on the training data
- Reasoning: a good model is a simple model that achieves high accuracy on the given data
- Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts
Elegance vs. Errors
- Theory 1: very simple, elegant theory that explains the data almost perfectly
- Theory 2: significantly more complex theory that reproduces the data without mistakes
- Theory 1 is probably preferable
- Classical example: Kepler’s three laws on planetary motion
- Less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles
The MDL Principle
- MDL principle is a model selection criterion
- MDL stands for minimum description length
- The description length is defined as:
- space required to describe a theory
- +
- space required to describe the theory’s mistakes
- In our case the theory is the classifier and the mistakes are the errors on the training data
- Aim: we want a classifier with minimal DL
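Written as a formula (a standard two-part MDL formulation, not taken verbatim from the slides):

```latex
\mathrm{DL}(H, D) = \underbrace{L(H)}_{\text{bits for the theory}}
                  + \underbrace{L(D \mid H)}_{\text{bits for its mistakes on the data}},
\qquad H_{\mathrm{MDL}} = \arg\min_{H} \mathrm{DL}(H, D)
```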
Inductive Learning Systems

Randomness vs. Regularity
- 0110001101…
- random string = incompressible = maximal information
- 010101010101010101
- regularity of repetition allows compression
- If the training set is ‘representative’ then regularities in the training set will be representative for regularities in an unseen test set.
- Many forms of learning can be described as data compression in the following sense:

Kolmogorov Complexity
- The Kolmogorov complexity (K) of a binary object is the length of the shortest program that generates this object on a universal Turing machine
- Random strings are not compressible
- A message with low Kolmogorov complexity is compressible
- K as complexity measure is incomputable
- So in practical applications it always needs to be approximated, e.g. by Lempel-Ziv (used in zip and unzip) compression or others
- Ray Solomonoff founded the theory of universal inductive inference, which draws on Kolmogorov complexity. Kolmogorov and Solomonoff invented the concept in parallel.
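A rough illustration of compressibility as a practical stand-in for K, using Python's standard-library zlib (DEFLATE, which builds on LZ77):

```python
import os
import zlib

regular = b"01" * 500              # highly regular 1000-byte string
random_like = os.urandom(1000)     # essentially incompressible

print(len(zlib.compress(regular)))      # compresses to a few dozen bytes
print(len(zlib.compress(random_like)))  # stays around 1000 bytes (or slightly more)
```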
Kolmogorov Complexity
- Let x be a finite length binary string and U a universal computer.
- Let U(p) be the output of U when presented with program p.
- The Kolmogorov complexity of string x is the minimal description length of x.
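Written out, with |p| denoting the length of program p (notation assumed, consistent with the bullets above):

```latex
K_U(x) = \min \{\, |p| : U(p) = x \,\}
```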

Overview of LZ Algorithms
- The Lempel-Ziv algorithm can be used as an approximation of the Kolmogorov complexity.
- Demonstration
- Take a simple alphabet containing only two letters, a and b
- Create a sample stream of text
- aaababbbaaabaaaaaaabaabb
LZ Algorithms
- Rule: separate the stream of characters into pieces of text, where each piece is the shortest string of characters not seen so far
- aaababbbaaabaaaaaaabaabb
- a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
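A sketch of this breaking-down rule; running it reproduces the phrase list above:

```python
def lz_parse(text):
    """Split `text` into the shortest phrases not seen before (LZ78-style parse)."""
    phrases, seen, current = [], set(), ""
    for ch in text:
        current += ch
        if current not in seen:      # shortest new phrase found
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:                      # trailing piece that was already seen
        phrases.append(current)
    return phrases

print(lz_parse("aaababbbaaabaaaaaaabaabb"))
# ['a', 'aa', 'b', 'ab', 'bb', 'aaa', 'ba', 'aaaa', 'aab', 'aabb']
```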
Indexing
Before compression, the pieces of text from the breaking-down process are indexed from 1 to n:

LZ
- Indices are used to number the pieces of data.
- The empty string (start of text) has index 0.
- The piece indexed by 1 is a. Thus a, together with the initial (empty) string, is numbered 0a.
- String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a.
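A sketch of the full encoding step, emitting (index, new character) pairs; index 0 stands for the empty string, as above:

```python
def lz_encode(text):
    """LZ78-style encoding: each phrase = (index of known prefix, new character)."""
    index = {"": 0}                  # phrase -> index; the empty string has index 0
    codes, current = [], ""
    for ch in text:
        if current + ch in index:    # keep extending a phrase we have already seen
            current += ch
        else:
            codes.append((index[current], ch))   # e.g. phrase 'aa' -> (1, 'a')
            index[current + ch] = len(index)     # register the new phrase
            current = ""
    if current:                      # flush a trailing, already-known phrase
        codes.append((index[current], ""))
    return codes

print(lz_encode("aaababbbaaabaaaaaaabaabb"))
# [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'), (2,'a'), (3,'a'), (6,'a'), (2,'b'), (9,'b')]
```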

Good Compression?
- In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.
- Example: 16 bits to transmit the code 2b instead of 24 bits (8 + 8 + 8) needed for the actual text aab.
MDL-approach to Learning
- Occam’s Razor
- “Among equivalent models choose the simplest one.”
- Minimum Description Length (MDL)
- “Select model that describes data with minimal #bits.”
- model = shortest program that outputs data
- length of program = Kolmogorov Complexity
- Learning = finding regularities = compression