Lecture 8 - Performance measures Flashcards

1
Q

Classification

A

Classify e-mails as spam vs. inbox

  • Given (labeled) examples of both document types
  • Train a classifier to discriminate between these two
  • During operation, use the classifier to select the destination folder for each new e-mail: Inbox or Spam
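
A minimal sketch of this workflow, assuming scikit-learn; the toy corpus, labels, and model choice below are illustrative, not from the lecture:

# Hypothetical toy example: route messages to Spam or Inbox
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Labelled examples of both document types (illustrative data)
texts = ["win a free prize now", "meeting moved to 3pm",
         "cheap loans click here", "lunch tomorrow?"]
labels = ["spam", "inbox", "spam", "inbox"]

# Train a classifier to discriminate between the two classes
vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# During operation: select the destination folder for a new e-mail
print(clf.predict(vectorizer.transform(["free prize waiting for you"])))
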
2
Q

Steps in classification

A

Step 1: Split data into train and test sets
Step 2: Build a model on a training set
Step 3: Evaluate on test set
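
A minimal sketch of these three steps, assuming scikit-learn; the dataset and model are placeholders:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # any labelled dataset

# Step 1: split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: build a model on the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate on the test set
print(accuracy_score(y_test, model.predict(X_test)))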

3
Q

Note on parameter tuning in classification steps

A
  • It is important that the test data is not used in any way to create the classifier
  • Some training schemes operate in two stages:
    • Stage 1: build the basic structure
    • Stage 2: optimise parameter setting
  • The test data cannot be used for parameter tuning
  • Proper procedure uses three sets: training data, validation data, and test data
    • Validation data is used to optimise parameters
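
A minimal sketch of the three-way split, assuming scikit-learn; the split sizes and the tuned parameter (tree depth) are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Training / validation / test split (e.g. 60% / 20% / 20%)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Stage 2: optimise a parameter using the validation data only
def val_accuracy(depth):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))

best_depth = max(range(1, 11), key=val_accuracy)

# The untouched test data gives the final, unbiased error estimate
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(best_depth, accuracy_score(y_test, final_model.predict(X_test)))
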
4
Q

Making the most of the data

A
  • Once evaluation is completed, all the data can be used to build the final classifier
  • Generally, the larger the training data, the better the classifier (but returns diminish)
  • The larger the test data, the more accurate the error estimate
5
Q

Types of outcomes

A
  • Building models using training data is called Supervised Learning
  • Interested in predicting the outcome variable for new records
  • Three main types of outcomes:
    • a. Predicted numerical value, e.g., house price
    • b. Predicted class membership, e.g., cancer or not
    • c. Probability of class membership (for categorical outcome variable), e.g., Naive Bayes
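
A brief illustration of the three outcome types, assuming scikit-learn; models and data are placeholders:

from sklearn.datasets import load_diabetes, load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

# a. Predicted numerical value (regression)
Xr, yr = load_diabetes(return_X_y=True)
print(LinearRegression().fit(Xr, yr).predict(Xr[:1]))     # a number

# b. Predicted class membership (classification)
Xc, yc = load_breast_cancer(return_X_y=True)
clf = GaussianNB().fit(Xc, yc)
print(clf.predict(Xc[:1]))                                # a class label

# c. Probability of class membership (e.g. Naive Bayes)
print(clf.predict_proba(Xc[:1]))                          # one probability per class
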
6
Q

Evaluating Predictive Performance - Generating numeric predictions

A
  • Interested in models that have high predictive accuracy when applied to new records
  • Models are trained on the training data
  • They are then applied to the validation data
  • Measures of accuracy are based on the prediction errors on that validation set
7
Q

Prediction Accuracy measures

A
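
Standard accuracy measures for numeric predictions (a common set, assumed here), writing e_i = y_i − ŷ_i for the prediction error on validation record i:

  • Mean absolute error: MAE = (1/n) Σ |e_i|
  • Average error: (1/n) Σ e_i (signed; reveals systematic over- or under-prediction)
  • Mean absolute percentage error: MAPE = (100% / n) Σ |e_i / y_i|
  • Root mean squared error: RMSE = sqrt((1/n) Σ e_i²)
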
8
Q

Prediction Accuracy Measures pt2

A
9
Q

Prediction Accuracy Measures pt3

A
10
Q

Lift Chart pt1

A
  • Graphical way to assess predictive performance
  • In some applications, we are not interested in predicting the outcome value of each new record
  • But the goal is to search for a subset of records that gives the highest cumulative predicted values
  • Compares the model’s predictive performance to a baseline model that has no predictors
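
A minimal sketch of building a cumulative lift (gains) chart, assuming numpy and matplotlib; the scores and outcomes are synthetic placeholders:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.1, size=1000)      # actual responses (synthetic)
scores = y_true * 0.5 + rng.random(1000)      # model's predicted propensity

# Sort records by predicted value, best first, and accumulate actual outcomes
order = np.argsort(-scores)
cumulative = np.cumsum(y_true[order])
x = np.arange(1, len(y_true) + 1)

plt.plot(x, cumulative, label="model")                             # lift curve
plt.plot(x, x * y_true.mean(), "--", label="baseline (no model)")  # random targeting
plt.xlabel("# records targeted")
plt.ylabel("cumulative responses")
plt.legend()
plt.show()
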
11
Q

Lift Chart pt2

A
  • In practice, costs are rarely known
  • Decisions are made by comparing possible scenarios
  • Example: promotional mailout to 1,000,000 households
    • Mail to all: response rate is 0.1% (1,000 responses)
    • A data mining tool can identify a subset of the 100,000 most promising households with a response rate of 0.4% (400 responses)
    • That captures 40% of the responses at only 10% of the mailing cost
    • → It might pay off to restrict the mailout to these 100,000 households
  • The increase in response rate is called the lift factor
  • A lift chart allows a visual comparison
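
Worked out, the example gives:

  • Lift factor = 0.4% / 0.1% = 4
  • Responses captured = 400 / 1,000 = 40% of all responses
  • Mailing cost = 100,000 / 1,000,000 = 10% of the full mailout
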
12
Q

Judging Classifier Performance - i.e., categorical variables

A
  • Misclassification occurs when a record belongs to one class but the model classifies it as a member of a different class
  • A natural criterion for judging the performance of a classifier is the probability of making a misclassification error

A perfect classifier makes no errors, but real-world data contains "noise", so perfect classifiers cannot be constructed in practice.

13
Q

Confusion / Classification matrix

A
  • A matrix that summarises the correct and incorrect classifications that a classifier produced for a given dataset
  • Rows and columns correspond to the predicted and true (actual) classes
  • In practice, most accuracy measures are derived from this matrix
  • Correct classifications: True Positive and True Negative
  • Incorrect classifications: False Positive (outcome incorrectly predicted as positive) and False Negative (outcome incorrectly predicted as negative)
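
A minimal sketch, assuming scikit-learn and illustrative label vectors:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classifier output

# scikit-learn convention: rows = actual class, columns = predicted class
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
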
14
Q

Type I and II errors

A

Type I Error → False Positive

  • Predicted positive but that is incorrect
  • Example: predicted that a man is pregnant, but actually he is not pregnant

Type II Error → False Negative

  • Predicted negative but that is incorrect
  • Example: predicted a woman is not pregnant when actually she is pregnant
15
Q

Overall success rate / Accuracy

A

Number of correct classifications divided by the total number of classifications
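
In terms of the confusion-matrix counts:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Error rate = 1 − Accuracy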

16
Q

ROC curves

A
  • Developed in the 1950s in signal detection theory to analyse noisy signals
  • Characterise the trade-off between positive hits and false alarms
  • An ROC curve plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis)
  • Each point on the curve corresponds to a different possible threshold of the diagnostic test
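
A minimal sketch of plotting an ROC curve, assuming scikit-learn and matplotlib; dataset and model are placeholders:

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]          # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)    # one point per threshold
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, probs):.2f}")
plt.plot([0, 1], [0, 1], "--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
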
17
Q

Example ROC curve

A
18
Q

Using ROC for Model Comparison

A
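
A common way to compare models with ROC (assumed here) is to compare the area under their curves: the curve closer to the top-left corner, i.e. the one with the larger AUC, indicates the better classifier across thresholds. A minimal sketch, assuming scikit-learn and placeholder models:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare two candidate models by AUC on the same test set
for model in (GaussianNB(), DecisionTreeClassifier(random_state=0)):
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    print(type(model).__name__, roc_auc_score(y_test, scores))
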
19
Q

Limitation of Accuracy and Cost Matrix

A

Consider a 2-class problem: number of class 0 examples = 9990 and number of class 1 examples = 10

If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%. Accuracy is misleading because the model does not detect any class 1 example.
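
A quick check of this arithmetic, assuming scikit-learn:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 9990 + [1] * 10)
y_pred = np.zeros(10000, dtype=int)        # model predicts everything as class 0

print(accuracy_score(y_true, y_pred))      # 0.999 -> looks excellent
print(recall_score(y_true, y_pred))        # 0.0   -> not a single class 1 detected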

20
Q

Computing Cost of Classification

A
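
A standard way to compute classification cost (assumed here) is to weight each confusion-matrix cell by a cost matrix and sum; the counts and costs below are illustrative:

import numpy as np

confusion = np.array([[9600, 390],    # rows: actual class, cols: predicted class
                      [   4,    6]])
cost = np.array([[ 0, 1],             # e.g. a false positive costs 1,
                 [10, 0]])            # a false negative costs 10, correct = 0

print(np.sum(confusion * cost))       # total cost = 390*1 + 4*10 = 430
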
21
Q

Multiclass prediction

A
22
Q

Multiclass prediction - Kappa statistic

A
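
The Kappa statistic (standard definition, assumed here) measures agreement with the true labels beyond what chance alone would give:

  Kappa = (p_o − p_e) / (1 − p_e)

where p_o is the classifier's observed accuracy and p_e is the accuracy expected from a random predictor with the same class distribution; Kappa = 1 means perfect agreement and Kappa = 0 means no better than chance.
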
23
Q

Precision and Recall

A
24
Q

Precision and Recall formulas

A
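
The standard formulas, in terms of confusion-matrix counts:

  • Precision = TP / (TP + FP)   (fraction of records predicted positive that really are positive)
  • Recall = TP / (TP + FN)   (fraction of actual positives that the model finds)
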
25
Q

Determining Recall is difficult

A

  • The total number of items/records that belong to a particular class is sometimes not available

Solutions:

  • Sample across the dataset and perform relevance judgment on these items
  • Apply different models to the same dataset and then use the aggregate of relevant items as the total relevant set
26
Q

F-measure (F1-measure)

A
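
The F1-measure is the harmonic mean of precision and recall:

  F1 = 2 × Precision × Recall / (Precision + Recall)
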
27
Q

Python code

A
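
A sketch of how the measures from this lecture can be computed in Python with scikit-learn; the dataset, model, and exact metric list are assumptions, not the lecture's original code:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, cohen_kappa_score, roc_auc_score)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("kappa    :", cohen_kappa_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))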