Classification - Part 2 Flashcards

1
Q

What aspects are important for model evaluation?

A
  • Central question: How good is a model at classifying unseen records?
  • There are Metrics for Model Evaluation (How to measure the performance of a model?)
  • There are Methods for Model Evaluation (How to obtain reliable performance estimates?)
2
Q

What is the focus of metrics for model evaluation?

A
  • They focus on the predictive capability of a model (rather than how much time it takes to classify records)
3
Q

What is the confusion matrix?

A
  • It counts the correct and incorrect classifications
  • These counts are the basis for calculating different performance metrics

                  Predicted Y   Predicted N
    Actual Y      TP            FN
    Actual N      FP            TN

In the case of credit card fraud, both FN (missed fraud) and FP (false alarms) would be unsatisfactory (see the counting sketch below).
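Not from the lecture: a minimal counting sketch showing how the four cells could be tallied for a binary problem; the class labels and example data are made up.

```python
# Minimal sketch: tally confusion-matrix counts, with "Y" assumed to be the positive class.
actual    = ["Y", "Y", "N", "N", "Y", "N", "N", "Y"]
predicted = ["Y", "N", "N", "Y", "Y", "N", "N", "N"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "Y" and p == "Y")
fn = sum(1 for a, p in zip(actual, predicted) if a == "Y" and p == "N")
fp = sum(1 for a, p in zip(actual, predicted) if a == "N" and p == "Y")
tn = sum(1 for a, p in zip(actual, predicted) if a == "N" and p == "N")

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=2 FN=2 FP=1 TN=3
```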

4
Q

What is the formula for accuracy?

A

(TP + TN) / (TP + TN + FP + FN)

correct predictions / all predictions
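A small sketch applying the formula (and the error rate from the next card) to arbitrary example counts:

```python
# Accuracy and error rate from confusion-matrix counts (arbitrary example values).
tp, tn, fp, fn = 50, 40, 5, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
error_rate = 1 - accuracy

print(accuracy, error_rate)  # 0.9 and ~0.1
```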

5
Q

What is the formula for error rate?

A

1 - accuracy

6
Q

Describe the class imbalance problem

A
  • Sometimes, classes have very unequal frequency (fraud detection: 98% of transactions OK, 2% fraud)
  • The class of interest is commonly called the positive class, the remaining classes the negative class(es)

Example (worked out in the sketch below):
# negative examples = 9,990
# positive examples = 10
-> If the model predicts all records as negative, its accuracy is 99.9%
-> Accuracy is misleading here, because the model does not detect a single positive example
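The arithmetic of the example, worked out for the hypothetical "predict everything as negative" classifier:

```python
# Class-imbalance example: a classifier that predicts every record as negative.
n_negative = 9990
n_positive = 10

# All negatives come out "correct", every positive is missed.
tn, fp = n_negative, 0
fn, tp = n_positive, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.999 -> 99.9 %, although no positive example is detected
```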
7
Q

How can you mitigate the class imbalance problem?

A
  • Use performance metrics that are biased towards the positive class by ignoring TN
  • Precision
  • Recall
8
Q

What is the precision performance metric?

A
  • Number of correctly classified positive examples divided by number of predicted positive examples

p = TP / ( TP + FP)

Question: How many of the examples classified as positive are actually positive?
-> Related to the false alarm rate (low precision means many false alarms)

9
Q

What is the recall performance metric?

A
  • Number of correctly classified positive examples divided by the actual positive examples

r = TP / (TP + FN)

Question: Which fraction of all positive examples is classified correctly?

-> Detection rate (precision and recall are computed together in the sketch below)
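A minimal sketch computing precision (previous card) and recall from example counts; the numbers are arbitrary:

```python
# Precision and recall from confusion-matrix counts (arbitrary example values).
tp, fp, fn = 8, 2, 32

precision = tp / (tp + fp)   # how many predicted positives are actually positive?
recall    = tp / (tp + fn)   # which fraction of all positives was found?

print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.80 recall=0.20
```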

10
Q

In which cases are precision and recall problematic?

A
  • Cases where the count of FP or FN is 0
    -> p = 100%, r = 1% for:

                  Predicted Y   Predicted N
    Actual Y      1             99
    Actual N      0             1000

    -> No negative example is classified wrong, but only one positive example is classified correctly

Consequence:
We need a measure that
1. combines precision and recall and
2. is large if both values are large

11
Q

Explain the F1-Measure

A
  • Combines precision and recall into one measure
  • It is the harmonic mean of precision and recall
    • Tends to be closer to the smaller of p and r
    • Thus, p and r must be large for a large F1

Formula:

(2pr) / (p + r)
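A small sketch illustrating that the harmonic mean is dominated by the smaller of p and r (values chosen for illustration):

```python
# F1 as the harmonic mean of precision and recall.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(f1(1.00, 0.01))  # ~0.0198 -> dragged down by the tiny recall
print(f1(0.80, 0.80))  # 0.80    -> large only if both values are large
```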

12
Q

What does a low threshold mean in the F1-measure graph?

A
  • Low precision, high recall
13
Q

What does a restrictive (high) threshold mean in the F1-measure graph?

A
  • High precision, low recall
14
Q

What alternative performance metric can be used if you have domain knowledge?

A
  • Cost-Sensitive Model Evaluation
15
Q

What is a ROC curve?

A
  • A graphical approach that displays the trade-off between detection rate (recall / true positive rate) and false alarm rate (false positive rate)
  • ROC curves visualize the true positive rate and the false positive rate in relation to the algorithm's confidence scores
16
Q

How is a ROC curve drawn?

A
  • Sort the classifications according to their confidence scores (highest first)
  • Scan over all classifications (see the sketch below):
    • correct prediction: draw one step up
    • wrong prediction: draw one step to the right
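A rough sketch of this drawing procedure, assuming we step up for each truly positive record and right for each negative one while scanning in order of decreasing confidence; the scores and labels are invented:

```python
# Sketch: trace the ROC steps from confidence-sorted predictions.
records = [(0.95, "Y"), (0.90, "Y"), (0.80, "N"), (0.70, "Y"), (0.55, "N"), (0.40, "N")]

records.sort(key=lambda rc: rc[0], reverse=True)  # highest confidence first

x, y, points = 0, 0, [(0, 0)]
for _confidence, actual in records:
    if actual == "Y":
        y += 1    # step up
    else:
        x += 1    # step to the right
    points.append((x, y))

print(points)  # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 3)]
```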
17
Q

How do you interpret a ROC curve?

A
  • The steeper the better
  • Random guessing results in the diagonal
  • A decent classification model should result in a curve above the diagonal
18
Q

What has to be considered to obtain a reliable estimate of the generalization performance (methods for model evaluation)?

A
  • Never test a model on data that was used for training
    • That would not result in a reliable estimate of the performance on unseen data
  • Keep training and test set strictly separate
  • Which labeled records should be used for training and which for testing?
19
Q

What data set splitting approaches do you know?

A
  • Holdout Method
  • Random Subsampling
  • Cross Validation
20
Q

What does the learning curve describe?

A
  • How accuracy changes with growing training set size

-> If model performance is low, get more training data (rather use labeled data for training than for testing)
Problem: Labeling additional data is often expensive because of the manual effort involved

21
Q

Describe the Holdout method.

A
  • Reserves a certain amount of the labeled data for testing and uses the remainder for training (e.g. 80% training / 20% testing)
  • For unbalanced datasets it is not representative (few or no records of the minority class end up in the training or test set)

-> Use stratified sampling instead of plain random sampling

22
Q

What is stratified sampling?

A
  • Sample each class independently so that records of the minority class are present in each sample (training and test set)
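One possible way to do a stratified holdout split, assuming scikit-learn is available; the toy dataset (5 positives, 95 negatives) is made up:

```python
# Stratified 80/20 holdout split with scikit-learn on toy, imbalanced data.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]   # 5 positives, 95 negatives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(sum(y_train), sum(y_test))  # 4 1 -> the minority class is present in both sets
```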
23
Q

Describe the random subsampling method

A
  • Makes holdout more reliable by repeating the process with different subsamples (for both training and test set)
  • In each iteration, the records used for training are selected at random
  • The performance of the individual iterations is averaged

Problem:

  • Some outliers might always end up in the test sets
  • Records that are important for learning might always end up in the test sets, so the model can never learn from them
24
Q

Explain cross validation

A
  • Avoids overlapping test sets
    Approach (a minimal sketch follows below):
    1. Split the data into k subsets of equal size
    2. Each subset is used once for testing and the remainder for training (every record is used for testing exactly once)
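A possible cross-validation setup with scikit-learn, combining this card with the practical choices on the next card (stratified sampling, k = 10); classifier, dataset, and scoring metric are placeholder assumptions:

```python
# Stratified k-fold cross-validation with scikit-learn (k = 10).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="f1")

print(scores.mean())  # average F1 over the 10 test folds
```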
25
Q

What are some common practical approaches for cross-validation?

A
  • Use stratified sampling to generate the subsets
  • k = 10, because experience shows that it delivers an accurate estimate while still using as much data as possible for training
  • Very computationally intensive, especially in combination with random forests
26
Q

When should you prefer the holdout method over cross-validation?

A
  • The labeled dataset is large (> 5,000 examples), and
  • long computation time or exact replicability of the results matters (e.g. for data science competitions)
27
Q

Which performance metrics should be used by default?

A
  • Accuracy
28
Q

Which performance metrics should be used if the interesting class is infrequent?

A
  • Precision
  • Recall
  • F1-measure
29
Q

How can you increase the model's performance?

A
1. If the dataset is imbalanced, balance it, e.g. by oversampling the positive examples
2. Optimize the hyperparameters of the learning algorithm
3. Avoid overfitting
30
Q

What is rule-based classification?

A
  • An eager learning approach that delivers explainable results
  • Classifies records by using a collection of "if ... then ..." rules
  • Rule-based classifier = set of classification rules
31
Q

What is a classification rule?

A

Classification rule: Condition -> y

  • Condition: conjunction of attribute tests
  • y: class label
32
Q

What is rule coverage?

A
  • Fraction of all records that satisfy the condition of a rule
  • I.e. the fraction of all records that are covered by the rule
33
Q

What is the accuracy of a rule?

A
  • Fraction of the covered records that are classified correctly (coverage and accuracy are computed in the sketch below)
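A small sketch computing coverage and accuracy of a single rule on toy records; the attribute names and the rule itself are illustrative assumptions:

```python
# Coverage and accuracy of one classification rule on toy records.
records = [
    {"refund": "No",  "status": "Married", "class": "No"},
    {"refund": "No",  "status": "Single",  "class": "Yes"},
    {"refund": "Yes", "status": "Single",  "class": "No"},
    {"refund": "No",  "status": "Married", "class": "No"},
]

# Rule: (refund = No) and (status = Married) -> No
covered = [r for r in records if r["refund"] == "No" and r["status"] == "Married"]

coverage = len(covered) / len(records)                               # fraction of all records covered
accuracy = sum(r["class"] == "No" for r in covered) / len(covered)   # correct among covered

print(coverage, accuracy)  # 0.5 1.0
```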
34
Q

What are the characteristics of rule-based classifiers?

A

Mutually Exclusive Rule Set:
  • the rules contained in the classifier are independent of each other
  • every record is covered by at most one rule

Exhaustive Rule Set:
  • the classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  • each record is covered by at least one rule
35
Q

How can you fix a rule set that is not mutually exclusive?

A

Solution 1: Ordered Rules
  • order the rules, e.g. by accuracy
  • classify according to the highest-ranked matching rule

Solution 2: Voting
  • all matching rules vote and assign the majority class label
  • votes may be weighted by rule quality (accuracy)
36
Q

How can you fix a rule set that is not exhaustive?

A
  • Add a default rule: () -> Y
37
Q

What methods for rule-based classifiers exist?

A

1. Direct Method
  • extract rules directly from the data
  • e.g. RIPPER

2. Indirect Method
  • extract rules from other classification models
  • e.g. C4.5rules
38
Q

Explain the indirect method to derive rules from a decision tree

A
  • Generate a rule for every path from the root to one of the leaf nodes
  • The rule set contains as much information as the tree
-> The generated rules are mutually exclusive and exhaustive!
39
Q

Explain the indirect method: C4.5rules

A
  • Applies rule simplification to the rule set
  • This makes the rule set no longer mutually exclusive
-> Thus, an ordered rule set or a voting scheme has to be applied

Approach:
1. Extract the rules from an unpruned decision tree
2. For each rule r:
   1. consider an alternative rule r' by removing one of the conjuncts
   2. compare the pessimistic error rate of r against all alternatives r'
   3. prune r if one of the r' has a lower pessimistic error rate
   4. repeat until the generalization error can no longer be improved
40
Q

Explain the direct method: RIPPER

A
  • Learns an ordered rule set from the training data
  • The approach depends on whether it is a 2-class or a multi-class problem
41
Q

Explain RIPPER for a 2-class problem

A
  • Choose the less frequent class as the positive class and the other as the negative class
  • Learn the rules for the positive class
  • The negative class becomes the default class
42
Q

Explain RIPPER for a multi-class problem

A
  • Order the classes according to increasing frequency
  • Learn the rule set for the smallest class first, treating the rest as the negative class
  • Repeat with the next smallest class as the positive class
43
Q

How does RIPPER use sequential covering?

A
1. Start from an empty rule list
2. Grow a rule that covers as many positive examples as possible
3. Remove the training records that are covered by the rule
4. Repeat steps 2 and 3 until the stopping criterion is met
(a simplified sketch of this loop follows below)
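A heavily simplified sketch of this loop, not the full RIPPER algorithm; learn_one_rule and the covers method are hypothetical stand-ins for rule growing and pruning, and the stopping criterion is reduced to "no positive examples left":

```python
# Simplified sequential covering (illustration only, not the complete RIPPER algorithm).
def sequential_covering(records, learn_one_rule):
    rules = []                                   # 1. start from an empty rule list
    remaining = list(records)
    # simplified stopping criterion: stop when no positive examples remain
    while any(r["class"] == "positive" for r in remaining):
        rule = learn_one_rule(remaining)         # 2. grow (and prune) one rule covering many positives
        if rule is None:
            break
        rules.append(rule)
        remaining = [r for r in remaining if not rule.covers(r)]  # 3. remove covered records
    return rules                                 # 4. repeat steps 2 and 3 until the criterion is met
```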
44
Q

What are the aspects of sequential covering?

A
1. Rule Growing
2. Rule Pruning
3. Instance Elimination
4. Stopping Criterion
45
Q

Explain rule growing within the RIPPER algorithm

A
1. Start with an empty rule: {} -> class
2. Step by step, add conjuncts that (based on FOIL's information gain measure)
   a. improve the accuracy of the rule
   b. let the rule still cover many examples

Goal: Prefer rules with high accuracy and a high support count (see the FOIL gain sketch below)
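A sketch of FOIL's information gain as it is commonly defined, i.e. gain = p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0))), where p/n are the positives/negatives covered before and after adding the conjunct; the exact form used in the lecture may differ, and the counts below are made up:

```python
# FOIL's information gain for extending rule R0 with one more conjunct (-> R1).
# p0, n0: positives / negatives covered before; p1, n1: after adding the conjunct.
from math import log2

def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# A conjunct that keeps most positives but drops many negatives scores high:
print(foil_gain(p0=100, n0=400, p1=90, n1=10))  # ~195.3
```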
46
Q

Explain what rule pruning is within the RIPPER algorithm

A
  • Because of the stopping criterion, the learned rule is likely to overfit the data
-> The rule is pruned afterwards using a validation dataset (similar to post-pruning of decision trees)
47
Q

How does the rule pruning procedure in RIPPER work?

A
1. Remove one of the conjuncts in the rule
2. Compare the error rate on the validation dataset before and after pruning
3. If the error improves, prune the conjunct

Measure for pruning (see the sketch below):

v = (p - n) / (p + n)

p: # positive examples covered by the rule in the validation set
n: # negative examples covered by the rule in the validation set
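A small sketch of the pruning decision based on v; the validation-set counts are made up:

```python
# RIPPER-style pruning check on the validation set (made-up counts).
def v(p: int, n: int) -> float:
    """p / n: positive / negative validation examples covered by the rule."""
    return (p - n) / (p + n)

v_full   = v(p=20, n=10)   # rule with all conjuncts:       0.33
v_pruned = v(p=28, n=12)   # same rule, one conjunct less:  0.40

if v_pruned > v_full:
    print("prune the conjunct")   # higher v -> pruning is expected to reduce generalization error
```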
48
Q

What is the goal of rule pruning in RIPPER?

A
  • Goal: Decrease the generalization error of the rule
49
Q

Why do we need to remove positive instances in the RIPPER algorithm?

A
  • Otherwise the next rule learned would be identical to the previous rule
50
Q

Why do we remove negative instances in the RIPPER algorithm?

A
  • To prevent underestimating the accuracy of a rule
51
Q

What is the stopping criterion for adding new rules to the rule set in RIPPER?

A
  • The error rate of the new rule on the validation set must not exceed 50%
  • The minimum description length must not increase by more than d bits
52
Q

What are the advantages of rule-based classifiers?

A
  • Easy for humans to interpret (eager learning)
  • Performance comparable to decision trees
  • Fast classification of unseen records
  • Well suited to handle imbalanced datasets, because they learn the rules for the minority class first