Classification - Part 2 Flashcards

1
Q

What aspects are important for model evaluation?

A
  • Central question is: How good is a model at classifying unseen records?
  • There are Metrics for Model Evaluation (How do we measure the performance of a model?)
  • There are Methods for Model Evaluation (How do we obtain reliable performance estimates?)
2
Q

What is the focus of metrics for model evaluation?

A
  • They focus on the predictive capability of a model (rather than, e.g., how fast it classifies records)
3
Q

What is the confusion matrix?

A
  • It counts the correct and false classifications
  • Counts are the basis for calculating different performance metrics

                     Predicted Y   Predicted N
    Actual Class Y:      TP            FN
    Actual Class N:      FP            TN

In the case of credit card fraud, both FN and FP would be unsatisfactory.
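A minimal Python sketch of how the four counts can be computed from known labels and predictions (the function name count_confusion and the 1/0 class encoding are illustrative assumptions, not from the card):

# Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative).
def count_confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# Example: count_confusion([1, 1, 0, 0], [1, 0, 1, 0]) returns (1, 1, 1, 1)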

4
Q

What is the formula for accuracy?

A

accuracy = (TP + TN) / (TP + TN + FP + FN)

= correct predictions / all predictions

5
Q

What is the formula for error rate?

A

error rate = 1 - accuracy

6
Q

Describe the class imbalance problem

A
  • Sometimes, classes have very unequal frequencies (fraud detection: 98% of transactions are OK, 2% are fraud)
  • The class of interest is commonly called the positive class, the remaining classes the negative class(es)

# negative examples = 9990
# positive examples = 10
-> If the model predicts all records as negative, its accuracy is 99.9%
--> Accuracy is misleading, because the model does not detect a single positive example
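A tiny sketch with the toy numbers from this card, showing why accuracy is misleading:

# 9990 negative and 10 positive examples; the model predicts everything negative
tp, fn, fp, tn = 0, 10, 0, 9990
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.999 -> looks excellent, yet not a single positive example is detected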
7
Q

How can you mitigate the class imbalance problem?

A
  • Use performance metrics that are biased towards the positive class by ignoring TN
  • Precision
  • Recall
8
Q

What is the precision performance metric?

A
  • Number of correctly classified positive examples divided by number of predicted positive examples

p = TP / (TP + FP)

Question: How many of the examples classified as positive are actually positive?
-> False alarm rate (high precision = few false alarms)

9
Q

What is the recall performance metric?

A
  • Number of correctly classified positive examples divided by the number of actual positive examples

r = TP / (TP + FN)

Question: Which fraction of all positive examples is classified correctly?

-> Detection rate

10
Q

In which cases are precision and recall problematic?

A
  • Cases where the count of FP or FN is 0
    -> e.g. p = 100%, r = 1% for:

                     Predicted Y   Predicted N
    Actual Class Y:       1            99
    Actual Class N:       0           1000

    -> no negative example is classified wrong, but only one positive example is classified correctly

Consequence:
We need a measure that
1. combines precision and recall and
2. is large if both values are large

11
Q

Explain the F1-Measure

A
  • Combines precision and recall into one measure
  • It is the harmonic mean of precision and recall
    • Tends to be closer to the smaller of p and r
    • Thus, p and r must be large for a large F1

Formula:

F1 = 2pr / (p + r)
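A short sketch (function names are illustrative) that computes precision, recall, and F1 from the confusion-matrix counts, using the problematic example from the previous card (TP = 1, FN = 99, FP = 0, TN = 1000):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

p = precision(tp=1, fp=0)  # 1.0  (100 %)
r = recall(tp=1, fn=99)    # 0.01 (1 %)
print(f1(p, r))            # ~0.0198 -> close to the smaller of p and r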

12
Q

What does a low threshold correspond to in the F1-measure graph?

A
  • Low precision, high recall
13
Q

What does a restrictive threshold correspond to in the F1-measure graph?

A
  • High precision, low recall
14
Q

What alternative performance metric can be used if you have domain knowledge?

A
  • Cost-Sensitive Model Evaluation
15
Q

What is a ROC curve?

A
  • A graphical approach that displays the trade-off between the detection rate (recall / true positive rate) and the false alarm rate (false positive rate)
  • ROC curves visualize the true positive rate and the false positive rate in relation to the algorithm's confidence scores
16
Q

How is a ROC curve drawn?

A
  • Sort the classifications by confidence score (highest first)
  • Scan over all classifications:
    • right prediction: draw one step up
    • wrong prediction: draw one step to the right
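A minimal sketch of this drawing procedure, assuming each classification is given as a (confidence, is_right) pair; names and data layout are illustrative:

def roc_points(classifications):
    # sort by confidence score, highest first
    ranked = sorted(classifications, key=lambda c: c[0], reverse=True)
    x, y = 0, 0
    points = [(x, y)]
    for _, is_right in ranked:
        if is_right:
            y += 1   # right prediction: one step up
        else:
            x += 1   # wrong prediction: one step to the right
        points.append((x, y))
    return points    # scale x and y to [0, 1] to obtain the FPR/TPR axes

# Example: roc_points([(0.9, True), (0.8, True), (0.6, False), (0.4, True)])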
17
Q

How do you interpret a ROC curve?

A
  • The steeper, the better
  • Random guessing results in the diagonal
  • A decent classification model should result in a curve above the diagonal
18
Q

What should be considered to obtain a reliable estimate of the generalization performance (methods for model evaluation)?

A
  • Never test a model on data that was used for training
    • That would not result in a reliable estimate of the performance on unseen data
  • Keep training and test set strictly separate
  • Which labeled records should be used for training and which for testing?
19
Q

What data set splitting approaches do you know?

A
  • Holdout Method
  • Random Subsampling
  • Cross Validation
20
Q

What does the learning curve describe?

A
  • How accuracy changes with growing training set size

-> If model performance is low, get more training data (use labeled data for training rather than for testing)
Problem: Labeling additional data is often expensive due to the manual effort involved

21
Q

Describe the Holdout method.

A
  • Reserves a certain amount of the labeled data for testing and uses the remainder for training (e.g. 20% / 80%)
  • For unbalanced datasets, a random split is not representative (few or no records of the minority class in the training or test set)

-> Use stratified sampling instead of plain random sampling

22
Q

What is stratified sampling?

A
  • Sample each class independently so that records of the minority class are present in each sample (training and test set)
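A minimal sketch of a stratified 80% / 20% holdout split that samples each class independently (function and parameter names are illustrative):

import random

def stratified_holdout(records, labels, test_fraction=0.2, seed=42):
    random.seed(seed)
    train, test = [], []
    for cls in set(labels):  # sample each class independently
        idx = [i for i, label in enumerate(labels) if label == cls]
        random.shuffle(idx)
        n_test = int(len(idx) * test_fraction)
        test += [(records[i], labels[i]) for i in idx[:n_test]]
        train += [(records[i], labels[i]) for i in idx[n_test:]]
    return train, test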
23
Q

Describe the random subsampling method

A
  • Makes holdout more reliable by repeating the process with different subsamples (both training and test set)
  • In each iteration, records are randomly selected for training
  • The performance of the iterations is averaged

Problem:

  • Some outliers might always end up in the test sets
  • Records that are important for learning might always end up in the test sets, so the model can never learn from them
24
Q

Explain cross validation

A
  • Avoids overlapping test sets
    Approach:
    1. Split data into k subsets of equal size
    2. Each subset is used for testing once and the remainder for training (every record is used for testing once)
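A minimal sketch of k-fold cross-validation in plain Python; train_and_evaluate stands in for an arbitrary training-plus-scoring routine and is an assumption:

def cross_validation(records, k, train_and_evaluate):
    folds = [records[i::k] for i in range(k)]  # k subsets of roughly equal size
    scores = []
    for i in range(k):
        test = folds[i]  # each subset is used for testing exactly once
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(train_and_evaluate(train, test))
    return sum(scores) / k  # average performance over all folds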
25
Q

What are some common practical approaches for cross validation?

A
  • Use stratified sampling to generate the subsets
  • k = 10, because experience shows this delivers accurate estimates while still using as much data as possible for training
  • Very computationally intensive, especially in combination with random forests
26
Q

When should you prefer the holdout method over cross-validation?

A
  • the labeled dataset is large (>5000 examples) and
  • long computation time or exact replicability of results matters (e.g. for data science competitions)

27
Q

Which performance metric should be used by default?

A
  • Accuracy
28
Q

Which performance metrics should be used if the class of interest is infrequent?

A
  • Precision
  • Recall
  • F1
29
Q

How can you increase the model's performance?

A
  1. If the dataset is imbalanced, balance it, e.g. by oversampling the positive examples
  2. Optimize the hyperparameters of the learning algorithm
  3. Avoid overfitting
30
Q

What is rule-based classification?

A
  • An eager learning approach that delivers explainable results
  • Classifies records by using a collection of “if… then…” rules

Rule-based classifier = set of classification rules

31
Q

What is a classification rule?

A

Classification rule: Condition -> y

condition: conjunction of attribute tests
y: class label
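A minimal sketch of one possible representation of such a rule; the attribute names and values are made up for illustration:

# The condition is a conjunction of attribute tests, the label is the class y.
rule = {
    "condition": [("Refund", "=", "No"), ("MaritalStatus", "=", "Married")],  # hypothetical attributes
    "label": "No",
    "accuracy": 0.9,  # rule quality, used later for ordering/voting
}

def matches(rule, record):
    # record is a dict of attribute -> value; only '=' tests are sketched here
    return all(record.get(attr) == value for attr, _, value in rule["condition"])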

32
Q

What is rule coverage?

A
  • Fraction of all records that satisfy the condition of a rule
  • Fraction of all records that are covered by the rule
33
Q

What is accuracy of a rule?

A
  • Fraction of covered records that are classified correctly
34
Q

What are the characteristics of rule-based classifiers?

A

Mutually Exclusive Rule Set:

  • the rules contained in the classifier are independent of each other
  • every record is covered by at most one rule

Exhaustive Rule Set:

  • classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  • each record is covered by at least one rule
35
Q

How can you fix a rule set that is not mutually exclusive?

A

Solution 1: Ordered Rules

  • order the rules, e.g. by accuracy
  • classify each record according to the highest-ranked matching rule

Solution 2: Voting

  • all matching rules vote and the majority class label is assigned
  • votes may be weighted by rule quality (accuracy)
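A minimal sketch of both solutions, reusing the rule representation and the matches helper sketched a few cards earlier (each rule carries an accuracy value):

def classify_ordered(rules, record):
    # Solution 1: order rules by accuracy, highest first; the first matching rule wins
    for rule in sorted(rules, key=lambda r: r["accuracy"], reverse=True):
        if matches(rule, record):
            return rule["label"]
    return None  # no rule matches -> handled by a default rule

def classify_voting(rules, record):
    # Solution 2: all matching rules vote, weighted by their accuracy
    votes = {}
    for rule in (r for r in rules if matches(r, record)):
        votes[rule["label"]] = votes.get(rule["label"], 0) + rule["accuracy"]
    return max(votes, key=votes.get) if votes else None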
36
Q

How can you fix a rule set that is not exhaustive?

A
  • Add a default rule () -> y_d that covers all remaining records and assigns a default class y_d
37
Q

What methods for rule-based classifiers exist?

A
  1. Direct Method
    - Extract rules directly from data
    - e.g. RIPPER
  2. Indirect Method
    - Extract rules from other classification models
    - e.g. C4.5rules
38
Q

Explain the indirect method to derive rules from a decision tree

A
  • Generate a rule for every path from the root to one of the leaf nodes
  • The rule set contains as much information as the tree

-> Generated rules are mutually exclusive and exhaustive!

39
Q

Explain the indirect method: C4.5 rules

A

It applies rule simplification to the rule set. As a result, the rule set is no longer mutually exclusive, so an ordered rule set or a voting scheme has to be applied.

Approach:

  1. extract rules from an unpruned decision tree
  2. for each rule r:
    a. consider alternative rules r' obtained by removing one of the conjuncts
    b. compare the pessimistic error rate of r against all r'
    c. prune r if one of the r' has a lower pessimistic error rate
    d. repeat until the generalization error can no longer be improved
40
Q

Explain the direct method: RIPPER

A
  • learns an ordered rule set from the training data
  • the approach depends on whether it is a 2-class or a multi-class problem

41
Q

Explain RIPPER for 2-class problem

A
  • Choose the less frequent class as the positive class and the other as the negative class
  • Learn rules for the positive class
  • The negative class is the default class
42
Q

Explain RIPPER for multi-class problem

A
  • Order the classes according to increasing frequency
  • Learn the rule set for the smallest class first, treating the rest as the negative class
  • Repeat with the next smallest class as the positive class
43
Q

How does RIPPER use sequential covering?

A
  1. Start from an empty rule list
  2. Grow a rule that covers as many positive examples as possible
  3. Remove the training records covered by the rule
  4. Repeat steps 2 and 3 until the stopping criterion is met
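A minimal sketch of this loop; grow_rule, covers, and stopping_criterion stand in for the RIPPER subroutines described on the following cards and are assumptions:

def sequential_covering(positives, negatives, grow_rule, covers, stopping_criterion):
    rule_list = []  # 1. start from an empty rule list
    while positives and not stopping_criterion(rule_list):
        rule = grow_rule(positives, negatives)  # 2. cover as many positives as possible
        rule_list.append(rule)
        # 3. remove the training records covered by the rule
        positives = [r for r in positives if not covers(rule, r)]
        negatives = [r for r in negatives if not covers(rule, r)]
    return rule_list  # 4. repeated until the stopping criterion is met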
44
Q

What are the aspects of sequential covering?

A
  1. Rule Growing
  2. Rule Pruning
  3. Instance Elimination
  4. Stopping Criterion
45
Q

Explain rule growing within the ripper algorithm

A
  1. Start with the empty rule {} -> class
  2. Step by step, add conjuncts, chosen using FOIL's information gain measure, so that
    a. the accuracy of the rule improves and
    b. the rule still covers many examples

Goal: Prefer rules with high accuracy and a high support count
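The card does not spell out FOIL's information gain; a sketch using its commonly used definition (p0/n0 = positive/negative examples covered before adding the conjunct, p1/n1 = after; all counts assumed > 0):

from math import log2

def foil_gain(p0, n0, p1, n1):
    # large when the extended rule is accurate (p1 / (p1 + n1) high)
    # and still covers many positive examples (p1 high)
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))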

46
Q

Explain what rule pruning is within the RIPPER algorithm

A
  • Because of the stopping criterion, the learned rule is likely to overfit the data
    -> The rule is therefore pruned afterwards using a validation dataset (similar to post-pruning of decision trees)
47
Q

How does the rule pruning procedure in RIPPER work?

A
  1. Remove one of the conjuncts in the rule
  2. compare error rate on validation dataset before and after pruning
  3. if error improves, prune the conjunct

Measure for pruning:
v = (p - n) / (p + n)

p: # positive examples covered by the rule in the validation set
n: # negative examples covered by the rule in the validation set
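A minimal sketch of this pruning loop for a single rule, reusing the rule representation sketched earlier and assuming a covers(rule, record) helper (not defined on the card):

def prune_rule(rule, validation_pos, validation_neg, covers):
    def v(r):
        # larger v = lower error on the validation set
        p = sum(1 for x in validation_pos if covers(r, x))  # positives covered by r
        n = sum(1 for x in validation_neg if covers(r, x))  # negatives covered by r
        return (p - n) / (p + n) if p + n > 0 else 0.0

    improved = True
    while improved and rule["condition"]:
        improved = False
        for conjunct in list(rule["condition"]):
            candidate = dict(rule, condition=[c for c in rule["condition"] if c != conjunct])
            if v(candidate) > v(rule):  # error improves -> prune the conjunct
                rule, improved = candidate, True
                break
    return rule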

48
Q

What is the goal of Rule Pruning in RIPPER?

A

Goal: Decrease generalization error of the rule

49
Q

Why do we need to remove positive instances in the RIPPER algorithm?

A
  • Otherwise the next rule is identical to the previous rule
50
Q

Why do we remove negative instances in the RIPPER algorithm?

A
  • To prevent underestimating the accuracy of a rule
51
Q

What is the stopping criterion to add new rules to the rule set for RIPPER?

A
  • The error rate of the new rule on the validation set must not exceed 50%
  • The minimum description length must not increase by more than d bits
52
Q

What are the advantages of rule-based classifiers?

A
  • Easy to interpret for humans (eager learning)
  • Performance comparable to decision trees
  • Fast classification of unseen records
  • Well suited to handle imbalanced data sets, because they learn rules for minority class first