Chapter 8: Evaluation of Classifiers Flashcards

1
Q

The CRISP Data Mining Process

A
2
Q

Precision of Classifier Performance

A
  • Holdout procedure - split the data into two sets
    • E.g. 66% for training, 33% for testing
    • The more training instances, the better (beyond a certain size the performance may stagnate)
    • The more test data, the better the estimate of the error rate
  • Use k-fold cross-validation (see the sketch after this list)
    • Split the data into k sets
    • Generate k models, each trained on k-1 sets (leave one set out)
    • Test each model on the held-out set
  • Use leave-one-out (jackknife)
    • Generate n models, each trained on n-1 instances
    • Apply each model to the single omitted instance
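A minimal sketch of k-fold cross-validation in NumPy; train and error_rate are hypothetical callables standing in for any classifier and its evaluation:

    import numpy as np

    def k_fold_error(X, y, k, train, error_rate, seed=0):
        """Estimate a classifier's error rate with k-fold cross-validation.

        train(X_tr, y_tr) must return a fitted model;
        error_rate(model, X_te, y_te) returns its error on held-out data.
        """
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))        # shuffle the instances once
        folds = np.array_split(idx, k)       # k (nearly) equal partitions
        errors = []
        for i in range(k):
            test_idx = folds[i]                                    # i-th fold is the test set
            train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
            model = train(X[train_idx], y[train_idx])
            errors.append(error_rate(model, X[test_idx], y[test_idx]))
        return np.mean(errors)               # average error over the k folds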
3
Q

Holdout Procedures

A
  • Holdout procedure:
    • Reserve some data for testing (usually ~1/3), selected at random
    • Use the remaining data for training
  • Plentiful data - no problem!
  • Common case: the data set is limited
  • Problems:
    • Want both sets to be as large as possible
    • Want both sets to be representative
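A minimal holdout-split sketch in NumPy (random selection of roughly one third of the instances for testing); the data here is dummy data standing in for a real set:

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 5))            # dummy features: 100 instances, 5 attributes
    y = rng.integers(0, 2, size=100)         # dummy binary class labels

    idx = rng.permutation(len(y))            # random order -> random holdout
    n_test = len(y) // 3                     # reserve ~1/3 of the data for testing
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]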
4
Q

“Smart” Holdout

A
  • Simple check: are the proportions of the classes about the same in each data set?
  • Stratified holdout
    • Guarantees that the classes are (approximately) proportionally represented in the training and test sets
  • Repeated holdout (in addition)
    • Randomly select the holdout set several times (with different random splits) and average the error-rate estimates (see the sketch below)
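A sketch of stratified, repeated holdout with scikit-learn; the decision tree, the 33% test fraction and the 10 repetitions are illustrative choices, not part of the original card:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def repeated_stratified_holdout(X, y, n_repeats=10, test_size=0.33):
        errors = []
        for seed in range(n_repeats):
            # stratify=y keeps the class proportions (approximately) equal
            # in the training and the test set
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, stratify=y, random_state=seed)
            clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
            errors.append(np.mean(clf.predict(X_te) != y_te))
        return np.mean(errors)               # average over the repeated holdouts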
5
Q

Cross Validation

A
6
Q

Holdout w/ Cross-Validation

A
  • Fixed number of k partitions of the data (folds)
  • In turn, each partition is used for testing and the remaining instances for training
    • In the end, each instance has been used for testing exactly once
  • May use stratification
  • Standard practice: stratified tenfold cross-validation (see the sketch below)
    • The error rate is estimated as the average of the per-fold error rates
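The standard practice above, sketched with scikit-learn's StratifiedKFold; the decision tree is only a placeholder classifier:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    def stratified_tenfold_error(X, y):
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        errors = []
        for train_idx, test_idx in skf.split(X, y):   # each fold is the test set exactly once
            clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
            errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))
        return np.mean(errors)                        # error rate = average over the ten folds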
7
Q

Leave-One-Out Holdout

A
  • n-Fold Cross-Validation
    • n instances are in the data set
    • Use all but one instance for training
    • Each iteration is evaluated by predicting the omitted instance
  • Advantages / Disadvantages
    • Maximum use of the data for training
    • Deterministic (no random sampling of test sets)
    • High computational cost
    • Non-stratified sample!
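Leave-one-out sketched with scikit-learn's LeaveOneOut; the k-NN classifier is just an illustrative choice, and the loop makes the high computational cost obvious (n models are trained):

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    def leave_one_out_error(X, y):
        mistakes = 0
        for train_idx, test_idx in LeaveOneOut().split(X):   # n iterations, one per instance
            clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
            mistakes += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
        return mistakes / len(y)             # fraction of omitted instances misclassified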
8
Q

Comparing Mining Algorithms

A
  • Suppose we have two algorithms
    • Obtain two different models
    • Estimate the error rates for the two models
    • Compare estimates
      • e(1) < e(2)?
    • Select the better one
  • Problem?
    • Are there “significant” differences in the error rates?
9
Q

Comparing Random Estimates

A
  • Estimated error rate is just an estimate (random)
  • Student’s paired t-test tells us whether the means of two samples are significantly different
  • Construct a t-test statistic
    • Need variance as well as point estimates
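A sketch of the paired t-test on per-fold error rates using scipy; the numbers are made up for illustration and would in practice come from running both classifiers on the same cross-validation folds:

    import numpy as np
    from scipy.stats import ttest_rel

    # Per-fold error rates of two classifiers on the *same* ten folds (illustrative values)
    errors_1 = np.array([0.21, 0.19, 0.24, 0.20, 0.22, 0.18, 0.23, 0.20, 0.21, 0.19])
    errors_2 = np.array([0.25, 0.22, 0.27, 0.23, 0.26, 0.21, 0.28, 0.24, 0.25, 0.23])

    t_stat, p_value = ttest_rel(errors_1, errors_2)   # paired: same folds for both models
    if p_value < 0.05:
        print(f"Significant difference (t={t_stat:.2f}, p={p_value:.3f})")
    else:
        print(f"No significant difference (t={t_stat:.2f}, p={p_value:.3f})")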
10
Q

Counting the Costs of comparing classification errors

A
  • In practice, different types of classification errors often incur different costs. Examples:
    • Predicting when customers leave a company for a competitor (attrition/churn prognosis)
      • Much more costly to lose a valuable customer
      • Much less costly to act on a customer who was wrongly predicted to churn
  • Loan decisions
  • Oil-slick detection
  • Fault diagnosis
  • Promotional mailing
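A minimal sketch of cost-sensitive evaluation for the churn example: weight the confusion-matrix counts with a cost matrix instead of simply counting errors. All numbers are made up for illustration:

    import numpy as np

    # Confusion matrix: rows = actual class, columns = predicted class
    #                     pred stay  pred churn
    conf = np.array([[900,        50],    # actual stay
                     [ 30,        20]])   # actual churn

    # Cost matrix with the same layout: losing a churner we failed to flag
    # is far more costly than contacting a customer who would have stayed.
    cost = np.array([[ 0.0,  1.0],
                     [50.0,  0.0]])

    total_cost = np.sum(conf * cost)       # element-wise product, then sum
    avg_cost = total_cost / conf.sum()     # expected cost per customer
    print(total_cost, avg_cost)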
11
Q

Direct Marketing Paradigm

A
  • Find most likely prospects to contact
  • Not everybody needs to be contacted
  • Number of targets is usually much smaller than number of prospects
  • Typical applications (you are looking for only a small part of the population)
    • retailers, catalogues, direct mail (and e-mail)
    • customer acquisition, cross-sell, attrition prediction
12
Q

Direct Marketing Evaluation

A
  • Accuracy on the entire dataset is not the right measure (in a decision tree, each leaf node provides a class probability that can be used as a score)
  • Approach
    • develop a target model
    • score all prospects and rank them by decreasing score
    • select top P% of prospects for action
  • How to decide what is the best selection?
13
Q

Generating a Gain Curve

A
  • Instances are sorted in decreasing order of their predicted probability of being positive
  • In a gain curve
    • the x axis is the sample size
    • the y axis is the number of positives in the sample (see the sketch below)
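A sketch of how the gain-curve points can be computed: rank instances by predicted probability and accumulate the positives (function and variable names are placeholders):

    import numpy as np

    def gain_curve(y_true, y_score):
        """Return (sample sizes, cumulative number of positives) for a gain curve."""
        order = np.argsort(-np.asarray(y_score))       # decreasing predicted probability
        y_sorted = np.asarray(y_true)[order]
        sample_size = np.arange(1, len(y_sorted) + 1)  # x axis: size of the selected sample
        cum_positives = np.cumsum(y_sorted)            # y axis: positives found so far
        return sample_size, cum_positives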
14
Q

Gain Curve: Random List vs. Model-ranked List

A
15
Q

Lift Curve

A
16
Q

ROC Curves

A
  • Differences to the gain chart:
    • y axis shows the percentage of true positives in the sample (rather than the absolute number):
      • TP rate = (#T predicted T) / #T
    • x axis shows the percentage of false positives in the sample (rather than the sample size):
      • FP rate = (#F predicted T) / #F
      • i.e. the fraction of actual negatives wrongly predicted as positive
  • ROC curves are similar to gain curves
    • “ROC” stands for “receiver operating characteristic”
  • Used in signal detection to show the tradeoff between hit rate and false-alarm rate over a noisy channel
  • Go through all sample sizes and plot the FP rate vs. the TP rate (see the sketch below)
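TP rate and FP rate computed directly from the definitions above by sweeping through all sample sizes of the ranked list; a hand-rolled sketch (scikit-learn's roc_curve gives equivalent points):

    import numpy as np

    def roc_points(y_true, y_score):
        """Return (FP rates, TP rates) obtained by sweeping through the ranked list."""
        y_true = np.asarray(y_true)
        order = np.argsort(-np.asarray(y_score))      # best-scored instances first
        y_sorted = y_true[order]
        tp = np.cumsum(y_sorted)                      # true positives in the top-k sample
        fp = np.cumsum(1 - y_sorted)                  # false positives in the top-k sample
        tp_rate = tp / y_true.sum()                   # TP = (#T predicted T) / #T
        fp_rate = fp / (len(y_true) - y_true.sum())   # FP = (#F predicted T) / #F
        return fp_rate, tp_rate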
17
Q

ROC Curves for Two Classifiers

A
18
Q

Comparison Studies of Classification Methods

A
  • Large number of classification techniques from the fields of machine learning, neural networks and statistics
  • Many empirical comparisons in the ’80s and ’90s, with contradictory results
  • StatLog project (mid ’90s)
    • More than 20 methods and 20 datasets
    • Performance of the methods depends very much on the data set
    • No general guidelines
    • A comparison study is advisable for the specific case at hand
  • State of the practice
    • Comparison of several methods on a particular data set
    • Sometimes even automated “mass modeling”
    • Method ranking
19
Q

“Prime Candidates” – Recommendations from StatLog

A
  • The following methods should be considered in real-world comparison studies:
    • Logistic regression (discriminant analysis)
    • Decision trees
    • k-nearest neighbour
    • Nonparametric statistical methods
    • Neural networks
  • Decision trees and logistic regression are widely used in practice
    • High performance / low error rate
    • Fast model building and classification
    • Easy to explain
    • (all of the above compared to neural networks or k-NN)
20
Q

A Real World Example of Classification: Campaign Management

A
  • Problem definition:
    • Marketing campaign for online billing
    • Selection of a few thousand customers, who are potential responders to the campaign, from a few million
  • Questions
    • Probability of customer response?
    • Which customer attributes are relevant for characterizing a responder?
21
Q

Model Selection Criteria - best model

A
  • Can we find a single ‘best’ model / classifier?
  • Model selection criteria attempt to find a good compromise between:
    • The complexity of a model
    • Its prediction accuracy on the training data
  • Reasoning: a good model is a simple model that achieves high accuracy on the given data
  • Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts
22
Q

Elegance vs. Errors

A
  • Theory 1: very simple, elegant theory that explains the data almost perfectly
  • Theory 2: significantly more complex theory that reproduces the data without mistakes
  • Theory 1 is probably preferable
  • Classical example: Kepler’s three laws of planetary motion
    • Slightly less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles, but far simpler
23
Q

The MDL Principle

A
  • MDL principle is a model selection criterion
    • MDL stands for minimum description length
  • The description length is defined as:
    • space required to describe a theory
    • +
    • space required to describe the theory’s mistakes
  • In our case the theory is the classifier and the mistakes are the errors on the training data
  • Aim: we want a classifier with minimal DL
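In symbols (a compact restatement of the card, with H the theory/classifier and D the training data):

    \mathrm{DL}(H, D) = L(H) + L(D \mid H), \qquad
    H_{\mathrm{MDL}} = \arg\min_{H} \bigl[ L(H) + L(D \mid H) \bigr]

where L(H) is the number of bits needed to describe the theory and L(D | H) the number of bits needed to describe its mistakes on the data.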
24
Q

Inductive Learning Systems

A
25
Q

Randomness vs. Regularity

A
  • 0110001101…
    • random string = incompressible = maximal information
  • 010101010101010101
    • regularity of repetition allows compression
    • If the training set is ‘representative’, then regularities in the training set will also be representative of regularities in an unseen test set
  • Many forms of learning can be described as data compression in this sense (see the following cards)
26
Q

Kolmogorov Complexity

A
  • The Kolmogorov complexity (K) of a binary object is the length of the shortest program that generates this object on a universal Turing machine
    • Random strings are not compressible
    • A message with low Kolmogorov complexity is compressible
  • K as complexity measure is incomputable
    • So in practical applications it always needs to be approximated, e.g. by Lempel-Ziv (used in zip and unzip) compression or others
  • Ray Solomonoff founded the theory of universal inductive inference, which draws on Kolmogorov complexity; Kolmogorov and Solomonoff developed the concept independently of each other.
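A sketch of the practical approximation mentioned above, using Python's zlib (a DEFLATE/LZ77-family compressor) as a crude stand-in for Kolmogorov complexity; the absolute byte counts are implementation details, only the comparison matters:

    import random
    import zlib

    def approx_complexity(s: str) -> int:
        """Length in bytes of the compressed string: a rough upper bound on K(s)."""
        return len(zlib.compress(s.encode("ascii"), level=9))

    regular = "01" * 500                     # 1000 characters of pure repetition
    random.seed(0)
    pseudo_random = "".join(random.choice("01") for _ in range(1000))

    print(approx_complexity(regular))        # small: the repetition is compressed away
    print(approx_complexity(pseudo_random))  # much larger: little regularity to exploit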
27
Q

Kolmogorov Complexity

A
  • Let x be a finite length binary string and U a universal computer.
  • Let U(p) be the output of U when presented with program p.
  • The Kolmogorov complexity of string x is the minimal description length of x.
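Written out as a formula (with \ell(p) denoting the length of program p):

    K_U(x) = \min \{\, \ell(p) : U(p) = x \,\}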
28
Q

Overview of LZ Algorithms

A
  • The Lempel-Ziv algorithm can be used as an approximation of the Kolmogorov complexity.
  • Demonstration
    • Take a simple alphabet containing only two letters, a and b
    • Create a sample stream of text
  • aaababbbaaabaaaaaaabaabb
29
Q

LZ Algorithms

A
  • Rule: split the stream of characters into pieces, such that each new piece is the shortest string of characters that has not been seen so far.
    • aaababbbaaabaaaaaaabaabb
    • a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb
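A short Python sketch of this parsing rule (an LZ78-style phrase split); on the sample stream it reproduces the split shown above:

    def lz_parse(stream: str) -> list[str]:
        """Split the stream into pieces, each the shortest string not seen before."""
        seen, pieces, piece = set(), [], ""
        for ch in stream:
            piece += ch
            if piece not in seen:            # shortest piece we have not seen so far
                seen.add(piece)
                pieces.append(piece)
                piece = ""
        if piece:                            # trailing, already-seen fragment (if any)
            pieces.append(piece)
        return pieces

    print("|".join(lz_parse("aaababbbaaabaaaaaaabaabb")))
    # a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb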
30
Q

Indexing

A

Before compression, the pieces of text from the breaking-down process are indexed from 1 to n.

31
Q

LZ

A
  • Indices are used to number the pieces of data.
    • The empty string (start of text) has index 0.
    • The piece with index 1 is a. Thus a, together with the initial (empty) string, is coded as 0a.
    • String 2, aa, is coded as 1a, because it consists of a, whose index is 1, plus the new character a.
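Extending the parse into (index, new character) codes as described above; a hedged sketch where index 0 stands for the empty string:

    def lz_encode(stream: str) -> list[tuple[int, str]]:
        """Encode each piece as (index of the longest known prefix, new character)."""
        index = {"": 0}                      # the empty string (start of text) has index 0
        codes, piece = [], ""
        for ch in stream:
            if piece + ch in index:          # keep extending while the piece is known
                piece += ch
            else:
                codes.append((index[piece], ch))   # e.g. 0a, 1a, 0b, 1b, ...
                index[piece + ch] = len(index)     # the new piece gets the next index
                piece = ""
        if piece:
            codes.append((index[piece], ""))       # trailing known fragment, if any
        return codes

    print(lz_encode("aaababbbaaabaaaaaaabaabb"))
    # [(0, 'a'), (1, 'a'), (0, 'b'), (1, 'b'), (3, 'b'),
    #  (2, 'a'), (3, 'a'), (6, 'a'), (2, 'b'), (9, 'b')]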
32
Q

Good Compression?

A
  • In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.
  • Example: 16 bits (8 for the index, 8 for the character) to transmit the code 2b, instead of the 24 bits (8 + 8 + 8) needed for the actual text aab.
33
Q

MDL-approach to Learning

A
  • Occam’s Razor
    • “Among equivalent models choose the simplest one.”
  • Minimum Description Length (MDL)
    • “Select model that describes data with minimal #bits.”
    • model = shortest program that outputs data
    • length of program = Kolmogorov Complexity
  • Learning = finding regularities = compression