Chapter 8: Evaluation of Classifiers Flashcards
The CRISP Data Mining Process

Precision of Classifier Performance
- Holdout procedure - split data into two sets
- E.g. about 2/3 (66%) for training, 1/3 (33%) for testing
- The more training instances, the better (after a certain size the performance might stagnate)
- The more test data, the better the estimation of the error rate
- Use k-fold cross validation
- Split into k sets
- Generate k models with k-1 sets (leave one set out)
- Test each model on the set that was left out
- Use leave-one-out (jackknife)
- Generate n models, each trained on (n-1) instances
- Apply each model to the single omitted instance
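A minimal sketch of the k-fold idea (NumPy only); `fit` and `error_rate` are hypothetical placeholders for whatever learner and error measure are used, and `X`, `y` are assumed to be NumPy arrays:

```python
import numpy as np

def k_fold_error(X, y, fit, error_rate, k=10, seed=0):
    """Average error over k folds; `fit` and `error_rate` are placeholders."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                 # k (nearly) equal partitions
    errors = []
    for i in range(k):
        test_idx = folds[i]                        # leave this fold out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])    # train on the other k-1 folds
        errors.append(error_rate(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))                  # averaged error estimate
```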
Holdout Procedures
- Holdout procedure:
- Reserve some data for testing (usually ~1/3, selected at random)
- Use remaining data for training
- Plentiful data - no problem!
- Common case: data set is limited
- Problems:
- Want both sets as large as possible
- Want both sets to be representative
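A minimal random holdout split, expressed over instance indices (the ~1/3 test fraction follows the rule of thumb above):

```python
import numpy as np

def holdout_split(n, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx) for a random holdout over `n` instances."""
    idx = np.random.default_rng(seed).permutation(n)   # random order
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]
```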
“Smart” Holdout
- Simple check: Are the proportions of classes about the same in each data set?
- Stratified holdout
- Guarantee that classes are (approximately) proportionally represented in the test and training set
- Repeated holdout (in addition): repeat with different randomly selected holdout sets to obtain several independent error estimates
- Randomly select holdout set several times and average the error rate estimates
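A sketch of stratified, repeated holdout, assuming scikit-learn is available; the decision tree is only an example classifier and `X`, `y` are NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

def repeated_stratified_holdout_error(X, y, clf=None, repeats=10, test_size=1/3):
    clf = clf if clf is not None else DecisionTreeClassifier(random_state=0)
    splitter = StratifiedShuffleSplit(n_splits=repeats, test_size=test_size,
                                      random_state=0)
    errors = []
    for train_idx, test_idx in splitter.split(X, y):   # class proportions preserved
        clf.fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))   # error rate
    return float(np.mean(errors))                       # averaged over the repeats
```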
Cross Validation

Holdout w/ Cross-Validation
- Fixed number of k partitions of the data (folds)
- In turn: each partition is used for testing and the remaining instances for training
- Finally each instance is used for testing once
- May use stratification
- Standard practice: stratified tenfold cross-validation
- The error rate is estimated as the average of the error rates over the folds
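A stratified tenfold cross-validation sketch, again assuming scikit-learn; the classifier choice is arbitrary:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def stratified_tenfold_error(X, y, clf=None):
    clf = clf if clf is not None else DecisionTreeClassifier(random_state=0)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    accuracies = cross_val_score(clf, X, y, cv=cv)   # one accuracy per fold
    return 1.0 - float(np.mean(accuracies))          # averaged error rate
```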

Leave-One-Out Holdout
- n-Fold Cross-Validation
- n instances are in the data set
- Use all but one instance for training
- Each iteration is evaluated by predicting the omitted instance
- Advantages / Disadvantages
- Maximum use of the data for training
- Deterministic (no random sampling of test sets)
- High computational cost
- Non-stratified sample! (stratification is impossible with a single test instance)
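A leave-one-out sketch using scikit-learn's LeaveOneOut splitter (assumed available); the k-NN classifier is only an illustrative choice:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def leave_one_out_error(X, y, clf=None):
    clf = clf if clf is not None else KNeighborsClassifier(n_neighbors=3)
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):   # n splits, one instance each
        clf.fit(X[train_idx], y[train_idx])
        errors.append(clf.predict(X[test_idx])[0] != y[test_idx][0])
    return float(np.mean(errors))    # deterministic, but requires n model fits
```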
Comparing Mining Algorithms
- Suppose we have two algorithms
- Obtain two different models
- Estimate the error rates for the two models
- Compare estimates
- e(1) < e(2)?
- Select the better one
- Problem?
- Are there “significant” differences in the error rates?
Comparing Random Estimates
- Estimated error rate is just an estimate (random)
- Student’s paired t-test tells us whether the means of two samples are significantly different
- Construct a t-test statistic
- Need variance as well as point estimates
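A paired t-test sketch on per-fold error rates, assuming SciPy; both classifiers must have been evaluated on the same folds, and the numbers in the usage comment are made up:

```python
from scipy import stats

def significantly_different(errors_1, errors_2, alpha=0.05):
    """Paired (dependent) t-test on per-fold error rates of two classifiers."""
    t_stat, p_value = stats.ttest_rel(errors_1, errors_2)
    return p_value < alpha, t_stat, p_value

# Usage with invented per-fold error rates:
# significantly_different([0.12, 0.10, 0.15, 0.11], [0.14, 0.13, 0.16, 0.15])
```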

Counting the Cost of Classification Errors
- In practice, different types of classification errors often incur different costs
- Examples:
- Predicting when customers leave a company for competitors (attrition/churn prognosis)
- Much more costly to lose a valuable customer
- Much less costly to act on a customer who was falsely predicted to leave
- Loan decisions
- Oil-slick detection
- Fault diagnosis
- Promotional mailing
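A cost-sensitive evaluation sketch: instead of counting all errors equally, the confusion-matrix counts are weighted by a cost matrix (the churn costs below are invented for illustration):

```python
import numpy as np

# rows = actual class, columns = predicted class; 0 = stays, 1 = churns
COST = np.array([[0.0,  1.0],    # false alarm: contacting a loyal customer is cheap
                 [20.0, 0.0]])   # miss: losing a valuable customer is expensive

def total_cost(y_true, y_pred, cost=COST):
    """Sum of costs over all instances; classes are coded as 0/1 integers."""
    counts = np.zeros_like(cost)
    for actual, predicted in zip(y_true, y_pred):
        counts[actual, predicted] += 1
    return float((counts * cost).sum())
```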

Direct Marketing Paradigm
- Find most likely prospects to contact
- Not everybody needs to be contacted
- Number of targets is usually much smaller than number of prospects
- Typical applications (only a small part of all prospects is of interest)
- retailers, catalogues, direct mail (and e-mail)
- customer acquisition, cross-sell, attrition prediction
- …
Direct Marketing Evaluation
- Accuracy on the entire dataset is not the right measure (in a decision tree, the leaf node provides the class probability used as a score)
- Approach
- develop a target model
- score all prospects and rank them by decreasing score
- select top P% of prospects for action
- How to decide what is the best selection?
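A sketch of the score-and-rank step; `model` is assumed to expose `predict_proba` in the scikit-learn style:

```python
import numpy as np

def select_top_p(model, X_prospects, p=0.10):
    """Return the indices of the top P% prospects, ranked by predicted response."""
    scores = model.predict_proba(X_prospects)[:, 1]   # P(responder) per prospect
    order = np.argsort(-scores)                        # decreasing score
    n_select = int(np.ceil(p * len(scores)))
    return order[:n_select]
```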
Generating a Gain Curve
- Instances are sorted according to their predicted probability:
- In a gain curve
- x axis is sample size
- y axis is number of positives
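A sketch of how the gain-curve points can be computed from true labels and predicted scores:

```python
import numpy as np

def gain_curve(y_true, scores):
    """x: sample size, y: cumulative positives, with instances sorted by score."""
    y = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))    # highest predicted probability first
    sample_size = np.arange(1, len(y) + 1)
    cum_positives = np.cumsum(y[order])
    return sample_size, cum_positives
```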

Gain Curve: Random List vs. Model-ranked List

Lift Curve

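A common reading: the lift at sample size k is the number of positives found in the top-k model-ranked instances divided by the number a random list of size k would be expected to find. A small self-contained sketch:

```python
import numpy as np

def lift_curve(y_true, scores):
    y = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores))
    sample_size = np.arange(1, len(y) + 1)
    cum_positives = np.cumsum(y[order])
    expected_random = sample_size * (y.sum() / len(y))    # random-list baseline
    return sample_size, cum_positives / expected_random   # lift = 1.0 for random ranking
```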
ROC Curves
- Differences to gain chart:
- y axis shows percentage of true positives in sample (rather than absolute number):
- TP=(#T predicted T)/#T
- x axis shows percentage of false positives in sample (rather than sample size):
- FP=(#F predicted T)/#F
- i.e. the fraction of actual negatives that are wrongly predicted as positive
- ROC curves are similar to gain curves
- “ROC” stands for “receiver operating characteristic”
- Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
- Go through all sizes of the ranked sample and plot the resulting FP rate vs. TP rate
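A sketch that sweeps over all sample sizes of the score-ranked list and computes the (FP rate, TP rate) points:

```python
import numpy as np

def roc_points(y_true, scores):
    y = np.asarray(y_true).astype(bool)
    order = np.argsort(-np.asarray(scores))   # highest scores first
    hits = y[order]
    tp = np.cumsum(hits)                      # true positives in the top-k sample
    fp = np.cumsum(~hits)                     # false positives in the top-k sample
    tpr = tp / y.sum()                        # TP = (#T predicted T) / #T
    fpr = fp / (~y).sum()                     # FP = (#F predicted T) / #F
    return fpr, tpr                           # x axis: FP rate, y axis: TP rate
```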

ROC Curves for Two Classifiers

Comparison Studies of Classification Methods
- Large number of classification techniques from the fields of machine learning (ML), neural networks (NN) and statistics
- Many empirical comparisons in the 80s and 90s with contradictory results
- StatLog Project (mid-1990s)
- More than 20 methods and 20 datasets
- Performance of methods depends very much on the data set
- No general guidelines
- Comparison study is advisable in special cases
- State-of-the-practice
- Comparisons of several methods on a particular data set
- Sometimes even automated “mass modeling”
- Method ranking
“Prime Candidates” – Recommendations from StatLog
- The following methods should be considered in real-world comparison studies
- Logistic Regression (Discriminant Analysis)
- Decision trees
- K-Nearest Neighbour
- Nonparametric statistical methods
- Neural Networks
- Decision trees and logistic regression are widely used in practice
- High performance / low error rate
- Speed of model building and classification
- Easy to explain
- as compared to NN or kNN
A Real World Example of Classification: Campaign Management
- Problem definition:
- Marketing campaign for online billing
- Selection of a few thousand potential responders to a campaign from a few million customers
- Questions
- Probability of customer response?
- Which customer attributes are relevant for characterizing a responder?
Model Selection Criteria - best model
- Can we find a single ‘best’ model / classifier?
- Model selection criteria attempt to find a good compromise between:
- The complexity of a model
- Its prediction accuracy on the training data
- Reasoning: a good model is a simple model that achieves high accuracy on the given data
- Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts
Elegance vs. Errors
- Theory 1: very simple, elegant theory that explains the data almost perfectly
- Theory 2: significantly more complex theory that reproduces the data without mistakes
- Theory 1 is probably preferable
- Classical example: Kepler’s three laws on planetary motion
- Less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles
The MDL Principle
- MDL principle is a model selection criterion
- MDL stands for minimum description length
- The description length is defined as:
- space required to describe a theory
- +
- space required to describe the theory’s mistakes
- In our case the theory is the classifier and the mistakes are the errors on the training data
- Aim: we want a classifier with minimal DL
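Written as a formula (a standard two-part MDL formulation, not taken verbatim from the slides):

```latex
\mathrm{DL}(H, D) = \underbrace{L(H)}_{\text{bits for the theory}}
                  + \underbrace{L(D \mid H)}_{\text{bits for its mistakes on the data}},
\qquad H_{\mathrm{MDL}} = \arg\min_{H} \mathrm{DL}(H, D)
```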
Inductive Learning Systems

Randomness vs. Regularity
- 0110001101…
- random string = incompressible = maximal information
- 010101010101010101
- regularity of repetition allows compression
- If the training set is ‘representative’ then regularities in the training set will be representative for regularities in an unseen test set.
- Many forms of learning can be described as data compression in the following sense:

Kolmogorov Complexity
- The Kolmogorov complexity (K) of a binary object is the length of the shortest program that generates this object on a universal Turing machine
- Random strings are not compressible
- A message with low Kolmogorov complexity is compressible
- K as complexity measure is incomputable
- So in practical applications it always needs to be approximated, e.g. by Lempel-Ziv (used in zip and unzip) compression or others
- Ray Solomonoff founded the theory of universal inductive inference, which draws on Kolmogorov complexity. Kolmogorov and Solomonoff invented the concept in parallel.
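A rough illustration of compressibility as a practical stand-in for K, using Python's standard-library zlib (DEFLATE, which builds on LZ77):

```python
import os
import zlib

regular = b"01" * 500              # highly regular 1000-byte string
random_like = os.urandom(1000)     # essentially incompressible

print(len(zlib.compress(regular)))      # compresses to a few dozen bytes
print(len(zlib.compress(random_like)))  # stays around 1000 bytes (or slightly more)
```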
Kolmogorov Complexity
- Let x be a finite length binary string and U a universal computer.
- Let U(p) be the output of U when presented with program p.
- The Kolmogorov complexity of string x is the minimal description length of x.
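Written out, with |p| denoting the length of program p (notation assumed, consistent with the bullets above):

```latex
K_U(x) = \min \{\, |p| : U(p) = x \,\}
```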

Overview of LZ Algorithms
- The Lempel-Ziv algorithm can be used as an approximation of the Kolmogorov complexity.
- Demonstration
- Take a simple alphabet containing only two letters, a and b
- Create a sample stream of text
- aaababbbaaabaaaaaaabaabb
LZ Algorithms
- Rule: separate the stream of characters into pieces of text, where each piece is the shortest string of characters not seen so far
- aaababbbaaabaaaaaaabaabb
- a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
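A sketch of this breaking-down rule; running it reproduces the phrase list above:

```python
def lz_parse(text):
    """Split `text` into the shortest phrases not seen before (LZ78-style parse)."""
    phrases, seen, current = [], set(), ""
    for ch in text:
        current += ch
        if current not in seen:      # shortest new phrase found
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:                      # trailing piece that was already seen
        phrases.append(current)
    return phrases

print(lz_parse("aaababbbaaabaaaaaaabaabb"))
# ['a', 'aa', 'b', 'ab', 'bb', 'aaa', 'ba', 'aaaa', 'aab', 'aabb']
```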
Indexing
Before compression, the pieces of text from the breaking-down process are indexed from 1 to n:

LZ
- Indices are used to number the pieces of data.
- The empty string (start of text) has index 0.
- The piece indexed by 1 is a. Thus a, together with the initial (empty) string, is numbered 0a.
- String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a.
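A sketch of the full encoding step, emitting (index, new character) pairs; index 0 stands for the empty string, as above:

```python
def lz_encode(text):
    """LZ78-style encoding: each phrase = (index of known prefix, new character)."""
    index = {"": 0}                  # phrase -> index; the empty string has index 0
    codes, current = [], ""
    for ch in text:
        if current + ch in index:    # keep extending a phrase we have already seen
            current += ch
        else:
            codes.append((index[current], ch))   # e.g. phrase 'aa' -> (1, 'a')
            index[current + ch] = len(index)     # register the new phrase
            current = ""
    if current:                      # flush a trailing, already-known phrase
        codes.append((index[current], ""))
    return codes

print(lz_encode("aaababbbaaabaaaaaaabaabb"))
# [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'), (2,'a'), (3,'a'), (6,'a'), (2,'b'), (9,'b')]
```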

Good Compression?
- In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.
- Example: 16 bits to transmit the code 2b instead of 24 bits (8 + 8 + 8) needed for the actual text aab.
MDL-approach to Learning
- Occam’s Razor
- “Among equivalent models choose the simplest one.”
- Minimum Description Length (MDL)
- “Select model that describes data with minimal #bits.”
- model = shortest program that outputs data
- length of program = Kolmogorov Complexity
- Learning = finding regularities = compression