Lecture 8 - Performance measures Flashcards
Classify e-mails to spam vs inbox
- Given (labeled) examples of both document types
- Train a classifier to discriminate between these two
- During operation, use classifier to select destination folder for new email: Inbox or Spam folder?
Steps in classification
Step 1: Split data into train and test sets
Step 2: Build a model on a training set
Step 3: Evaluate on test set
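The three steps can be sketched in plain Python for the spam example above. This is a minimal illustration, not a real spam filter: the toy emails, the helper names (`train_test_split`, `train_keyword_classifier`, `predict`) and the keyword-based "model" are all assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Step 1: shuffle a copy of the records and split into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_keyword_classifier(train):
    """Step 2: toy model that remembers words seen only in spam training emails."""
    spam_words, inbox_words = set(), set()
    for text, label in train:
        target = spam_words if label == "spam" else inbox_words
        target.update(text.lower().split())
    return spam_words - inbox_words

def predict(model, text):
    """Classify as spam if the email contains any spam-only word."""
    return "spam" if any(w in model for w in text.lower().split()) else "inbox"

# Hypothetical labeled examples of both document types
emails = [
    ("win money now", "spam"),
    ("cheap pills offer", "spam"),
    ("claim your prize money", "spam"),
    ("free offer win big", "spam"),
    ("meeting at noon", "inbox"),
    ("lecture notes attached", "inbox"),
    ("lunch at noon tomorrow", "inbox"),
    ("project meeting notes", "inbox"),
]

train, test = train_test_split(emails, test_fraction=0.25, seed=42)
model = train_keyword_classifier(train)           # built from training data only
correct = sum(predict(model, text) == label for text, label in test)
test_accuracy = correct / len(test)               # Step 3: evaluate on the test set
print(f"train={len(train)} test={len(test)} accuracy={test_accuracy:.2f}")
```

The test emails never reach `train_keyword_classifier`, which is what makes the resulting accuracy an honest estimate for new email.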
Note on parameter tuning in classification steps
- It is important that the test data is not used in any way to create the classifier
- Some training schemes operate in two stages:
- Stage 1: build the basic structure
- Stage 2: optimise parameter setting
- The test data cannot be used for parameter tuning
- Proper procedure uses three sets: training data, validation data, and test data
- Validation data is used to optimise parameters
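A minimal sketch of the three-set procedure, assuming a toy model whose only tunable parameter is a score threshold. The data, the candidate thresholds and the helper names (`three_way_split`, `accuracy_at`) are hypothetical; Stage 1 (fitting a structure on the training set) is omitted because this toy model has nothing else to fit.

```python
import random

def three_way_split(records, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split into train, validation and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def accuracy_at(threshold, data):
    """Accuracy of the rule 'predict positive when score >= threshold'."""
    return sum((score >= threshold) == label for score, label in data) / len(data)

# Hypothetical records: (score, is_positive)
data = [(i / 20, i >= 10) for i in range(20)]
train_set, val_set, test_set = three_way_split(data, seed=1)

# Stage 2: optimise the parameter on the validation data only
candidates = [0.3, 0.5, 0.7]
best_threshold = max(candidates, key=lambda t: accuracy_at(t, val_set))

# The test data is touched once, after tuning is finished
test_accuracy = accuracy_at(best_threshold, test_set)
print(best_threshold, test_accuracy)
```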
Making the most of the data
- Once evaluation is completed, all the data can be used to build the final classifier
- Generally, the larger the training data the better the classifier (but returns diminish)
- The larger the test data the more accurate the error estimate
Types of outcomes
- Building models using training data is called Supervised Learning
- Interested in predicting the outcome variable for new records
- Three main types of outcomes:
- a. Predicted numerical value, e.g., house price
- b. Predicted class membership, e.g., cancer or not
- c. Probability of class membership (for categorical outcome variable), e.g., Naive Bayes
Evaluating Predictive Performance - Generating numeric predictions
- Interested in models that have high predictive accuracy when applied to new records
- Models are trained on the training data
- The trained models are then applied to the validation data
- Measures of accuracy are computed from the prediction errors on that validation set
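As a rough illustration of measures built from validation-set prediction errors, the sketch below computes two commonly used ones, mean absolute error and root mean squared error. Which measures the lecture lists is not shown in these notes, so the choice of MAE and RMSE is an assumption, and the house prices are made-up values.

```python
import math

def prediction_errors(actual, predicted):
    """Error for each validation record: actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]

def mean_absolute_error(errors):
    return sum(abs(e) for e in errors) / len(errors)

def root_mean_squared_error(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical validation set: house prices in thousands
actual = [200.0, 150.0, 320.0, 275.0]
predicted = [210.0, 140.0, 300.0, 275.0]

errors = prediction_errors(actual, predicted)
mae = mean_absolute_error(errors)
rmse = root_mean_squared_error(errors)
print(f"MAE={mae:.2f} RMSE={rmse:.2f}")
```

RMSE penalises large individual errors more heavily than MAE, so the two can rank models differently.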
Prediction Accuracy measures
Prediction Accuracy Measures pt2
Prediction Accuracy Measures pt3
Lift Chart pt1
- Graphical way to assess predictive performance
- In some applications, we are not interested in predicting the outcome value of each new record
- Instead, the goal is to find the subset of records that gives the highest cumulative predicted values
- Compares the model’s predictive performance to a baseline model that has no predictors
Lift Chart pt2
- In practice, costs are rarely known
- Decisions are made by comparing possible scenarios
- Example: promotional mailout to 1,000,000 households
- Mail to all, with a response rate of 0.1% (1,000 responses)
- Consider a data mining tool that can identify a subset of the 100,000 most promising households, with a response rate of 0.4% (400 responses)
- Responses are 40% of those from the full mailout, but the cost is only 10%
- → It might pay off to restrict the mailout to these 100,000 households
- The increase in response rate is called the lift factor (here 0.4% / 0.1% = 4)
- A lift chart allows a visual comparison
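The numbers in the mailout example can be checked with a few lines of Python. The function name `lift_factor` is hypothetical, and the cost share assumes that cost is proportional to the number of households mailed.

```python
def lift_factor(subset_responses, subset_size, total_responses, total_size):
    """Response rate in the targeted subset divided by the overall response rate."""
    subset_rate = subset_responses / subset_size
    overall_rate = total_responses / total_size
    return subset_rate / overall_rate

# Figures from the example: mail everyone vs. mail the most promising subset
total_households, total_responses = 1_000_000, 1_000
subset_households, subset_responses = 100_000, 400

lift = lift_factor(subset_responses, subset_households,
                   total_responses, total_households)
response_share = subset_responses / total_responses   # share of responses kept
cost_share = subset_households / total_households     # assumes cost scales with mail volume

print(f"lift={lift:.1f} responses kept={response_share:.0%} cost={cost_share:.0%}")
```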
Judging Classifier Performance - i.e., categorical variables
- Misclassification is when a record belongs to one class but the model classifies it as a member of a different class
- A natural criterion for judging the performance of a classifier is the probability of making a misclassification error
- A perfect classifier makes no errors, but real-world data contains "noise", so a perfect classifier cannot be constructed
Confusion / Classification matrix
- A matrix that summarises the correct and incorrect classifications that a classifier produced for a given dataset
- Rows and columns correspond to the predicted and true (actual) classes
- In practice, most accuracy measures are derived from this matrix
- Correct classifications: True Positive and True Negative
- Incorrect classifications: False Positive (outcome incorrectly predicted as yes / positive) and False Negative (outcome incorrectly predicted as no / negative)
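The four cells of a two-class confusion matrix can be counted directly from paired actual and predicted labels. A minimal sketch, with made-up labels and a hypothetical helper name `confusion_counts`:

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, FP, FN and TN for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical labels for six emails
actual    = ["spam", "spam",  "inbox", "inbox", "spam", "inbox"]
predicted = ["spam", "inbox", "inbox", "spam",  "spam", "inbox"]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```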
Type I and II errors
Type I Error → False Positive
- Predicted positive but that is incorrect
- Example: predicted that a man is pregnant, but actually he is not pregnant
Type II Error → False Negative
- Predicted negative but that is incorrect
- Example: predicted a woman is not pregnant when actually she is pregnant
Overall success rate / Accuracy
Number of correct classifications divided by the total number of classifications
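In terms of confusion-matrix counts, this is (TP + TN) divided by the total. A small sketch with made-up counts:

```python
def accuracy(tp, tn, fp, fn):
    """Correct classifications (TP + TN) divided by all classifications."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for 100 classified records
acc = accuracy(tp=50, tn=35, fp=5, fn=10)
error_rate = 1 - acc   # the misclassification rate is the complement
print(f"accuracy={acc:.2f} error rate={error_rate:.2f}")
```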
ROC curves
- Developed in the 1950s for signal detection theory to analyse noisy signals
- Characterise the trade-off between positive hits and false alarms
- The ROC curve plots the true positive (TP) rate on the y-axis against the false positive (FP) rate on the x-axis
- Each point on the curve corresponds to one possible threshold of a diagnostic test
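One way to see where the points of a ROC curve come from is to sweep a threshold over the classifier's scores and record the false positive rate and true positive rate at each step. The sketch below does this for a handful of made-up scores; `roc_points` is a hypothetical helper, not a library function.

```python
def roc_points(labels, scores):
    """Return (FP rate, TP rate) pairs, from the strictest threshold to the loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical true labels (1 = positive) and classifier scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates rise monotonically from (0, 0) to (1, 1).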
Example ROC curve
Using ROC for Model Comparison
Limitation of Accuracy and Cost Matrix
Consider a 2-class problem: number of class 0 examples = 9990 and number of class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any class 1 example
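The example can be reproduced in a few lines. The variable names are illustrative only.

```python
# 9990 examples of class 0 and 10 examples of class 1
actual = [0] * 9990 + [1] * 10

# A "model" that predicts class 0 for every record
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
detected_class1 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"accuracy={accuracy:.1%}, class 1 examples detected={detected_class1}")
```

High accuracy and zero detected class 1 examples appear side by side, which is why accuracy alone is a poor measure on heavily imbalanced data.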
Computing Cost of Classification
Multiclass prediction
Multiclass prediction - Kappa statistic
Precision and Recall
Precision and Recall formulas
Determining Recall is difficult
- The total number of items / records that belong to a particular class is sometimes not available
- Option 1: sample across the dataset and perform relevance judgment on the sampled items
- Option 2: apply different models to the same dataset and use the aggregate of the relevant items they find as the total relevant set
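The second approach, pooling the relevant items found by several models, might look like the sketch below. The item IDs, the model names and the helper `pooled_recall` are all hypothetical, and `judged_relevant` stands in for the outcome of manual relevance judgment.

```python
def pooled_recall(retrieved_by_model, judged_relevant):
    """Estimate each model's recall against the pool of relevant items found by any model."""
    pool = set()
    for items in retrieved_by_model.values():
        pool |= set(items) & judged_relevant
    if not pool:
        return {name: 0.0 for name in retrieved_by_model}
    return {name: len(set(items) & pool) / len(pool)
            for name, items in retrieved_by_model.items()}

# Hypothetical item IDs returned by two models
retrieved_by_model = {
    "model_a": [1, 2, 3, 4],
    "model_b": [3, 4, 5, 6],
}
# Items judged relevant; item 9 is relevant but no model returned it
judged_relevant = {1, 3, 5, 6, 9}

recall_estimates = pooled_recall(retrieved_by_model, judged_relevant)
print(recall_estimates)  # {'model_a': 0.5, 'model_b': 0.75}
```

Because item 9 never enters the pool, these estimates are optimistic: the pooled set is only a lower bound on the true number of relevant items.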
F-measure (F1-measure)
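A short sketch using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two. The counts are made up, and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.3f}")
```

The harmonic mean pulls F1 toward the lower of the two values, so a model cannot score well by maximising only precision or only recall.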
Python code