Classification - Part 2 Flashcards
What aspects are important for model evaluation?
- Central question is: How good is a model at classifying unseen records?
- There are Metrics for Model Evaluation (How to measure the performance of a model?)
- There are Methods for Model Evaluation (How to obtain reliable estimates?)
What is the focus of metrics for model evaluation?
- They focus on the predictive capability of a model (rather than, e.g., how fast it classifies records)
What is the confusion matrix?
- It counts the correct and false classifications
- Counts are the basis for calculating different performance metrics

| Actual \ Predicted | Y | N |
|---|---|---|
| Y | TP | FN |
| N | FP | TN |
In the case of credit card fraud, both FN (missed fraud) and FP (legitimate transactions flagged as fraud) would be unsatisfactory.
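A minimal sketch in Python of how the four counts can be obtained from predictions, assuming binary labels encoded as 1 = positive and 0 = negative (the accuracy and error rate from the next cards are included as a usage example):

```python
# Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct / all predictions = 3/5
error_rate = 1 - accuracy                    # 2/5
```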
What is the formula for accuracy?
(TP + TN) / (TP + TN + FP + FN)
correct predictions / all predictions
What is the formula for error rate?
1 - accuracy
Describe the class imbalance problem
- Sometimes, classes have very unequal frequency (fraud detection: 98% of transactions OK, 2% fraud)
- The class of interest is commonly called the positive class, the remaining class(es) the negative class
Example: # negative examples = 9990, # positive examples = 10
-> If the model predicts all records as negative, its accuracy is 99.9%
--> Accuracy is misleading because the model does not detect a single positive example
How can you mitigate the class imbalance problem?
- Use performance metrics that are biased towards the positive class by ignoring TN
- Precision
- Recall
What is the precision performance metric?
- Number of correctly classified positive examples divided by number of predicted positive examples
p = TP / ( TP + FP)
Question: How many examples that are classified positive are actually positive?
-> Low precision means many false alarms (many FP)
What is the recall performance metric?
- Number of correctly classified positive examples divided by the actual positive examples
r = TP / (TP + FN)
Question: Which fraction of all positive examples is classified correctly?
-> Detection rate
In which cases are precision and recall problematic on their own?
- Cases where the count of FP or FN is 0, so one of the two metrics looks perfect in isolation
-> p = 100%, r = 1% for the following confusion matrix:

| Actual \ Predicted | Y | N |
|---|---|---|
| Y | 1 | 99 |
| N | 0 | 1000 |

-> No negative example is classified wrongly (FP = 0), but only 1 of 100 positive examples is classified correctly
Consequence:
We need a measure that
1. combines precision and recall and
2. is large if both values are large
Explain the F1-Measure
- Combines precision and recall into one measure
- It is the harmonic mean of precision and recall
- Tends to be closer to the smaller of p and r
- Thus, p and r must be large for a large F1
Formula:
(2pr) / (p + r)
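A minimal sketch of the three metrics computed directly from the confusion-matrix counts; the example numbers are the ones from the imbalanced matrix shown a few cards above:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# TP=1, FN=99, FP=0 (the p = 100%, r = 1% example):
p, r, f1 = precision_recall_f1(tp=1, fp=0, fn=99)
# p = 1.0, r = 0.01, f1 ≈ 0.02 -> F1 stays close to the smaller of the two
```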
What does a low (permissive) threshold mean in the F1-measure graph?
- Low precision, high recall
What does a restrictive (high) threshold mean in the F1-measure graph?
- High precision, low recall
What alternative performance metric can be used if you have domain knowledge?
- Cost-Sensitive Model Evaluation
What is a ROC curve?
- A graphical approach that displays the trade-off between the detection rate (true positive rate, i.e. recall) and the false alarm rate (false positive rate)
- ROC curves visualize the true positive rate and false positive rate in relation to the algorithm's confidence scores
How is a ROC curve drawn?
- Sort classifications according to confidence scores
- Scan over all classifications:
- right prediction: draw one step up
- wrong prediction: draw one step to the right
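A minimal sketch of this drawing procedure, assuming each classification is given as a (confidence, correct) pair; scaling the step sizes so both axes end at 1 is omitted:

```python
def roc_steps(classifications):
    """classifications: list of (confidence, correct) pairs, correct is a bool."""
    ranked = sorted(classifications, key=lambda c: c[0], reverse=True)
    x, y = 0, 0
    points = [(x, y)]
    for _, correct in ranked:
        if correct:
            y += 1   # right prediction: one step up
        else:
            x += 1   # wrong prediction: one step to the right
        points.append((x, y))
    return points

points = roc_steps([(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)])
# [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3), (2, 3)] -- the steeper, the better
```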
How do you interpret a ROC curve?
- The steeper the better
- Random guessing results in the diagonal
- A decent classification model should result in a curve above the diagonal
What must be considered to obtain a reliable estimate of the generalization performance (methods for model evaluation)?
- Never test a model on data that was used for training
- That would not result in a reliable estimate of the performance on unseen data
- Keep training and test set strictly separate
- Which labeled records should be used for training and which for testing?
What data set splitting approaches do you know?
- Holdout Method
- Random Subsampling
- Cross Validation
What does the learning curve describe?
- How accuracy changes with growing training set size
-> If model performance is low, get more training data (rather use labeled data for training than for testing)
Problem: Labeling additional data is often expensive due to manual effort
Describe the Holdout method.
- Reserves a certain amount of the labeled data for testing and uses the remainder for training (e.g. 20% for testing, 80% for training)
- For unbalanced datasets it is not representative (few or no records of minority class in training or test set)
-> Use stratified sampling instead of simple random sampling
What is stratified sampling?
- Sample each class independently so that records of the minority class are present in each sample (training and test set)
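A minimal sketch of a stratified holdout split with scikit-learn; the toy dataset is only an illustration, and the stratify argument performs the per-class sampling:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: roughly 5% positive examples.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

# 20% for testing, 80% for training; stratify=y keeps the class proportions
# in both partitions, so the minority class is present in each of them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```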
Describe the random subsampling method
- Makes holdout more reliable by repeating the process with different subsamples (both training and test)
- Each iteration random selection of records for training
- Averages performance of iterations
Problem:
- Some outliers might always end up in the test sets
- Records that are important for learning might always end up in the test sets, so the model can never learn from them
Explain cross validation
- Avoids overlapping test sets
Approach:
1. Split data into k subsets of equal size
2. Each subset is used for testing once and the remainder for training (every record is used for testing once)
What are some common practical approaches for cross validation?
- Use stratified sampling to generate subsets
- k = 10, because experience shows it delivers an accurate estimate while still using as much data as possible for training
- Very computationally intensive, especially in combination with random forests
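A minimal sketch of stratified 10-fold cross-validation with scikit-learn; the decision tree and the F1 scoring are just example choices, not prescribed by the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

# k = 10 stratified folds: every record is used for testing exactly once.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y,
                         cv=cv, scoring="f1")
print(scores.mean())  # average performance over the 10 folds
```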
When should you prefer the holdout method over cross-validation?
- Labeled dataset is large (>5000 examples) and
- long computation time is an issue or exact replicability of results matters (e.g. for data science competitions)
Which performance metrics should be used by default?
- Accuracy
Which performance metrics should be used if the interesting class is infrequent?
- Precision
- Recall
- and F1
How can you increase the model's performance?
- If the dataset is imbalanced, balance it, e.g. by oversampling the positive examples (see the sketch after this list)
- Optimize hyperparameters of the learning algorithm
- avoid overfitting
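A minimal sketch of random oversampling with NumPy, assuming labels 0/1 where the positive class 1 is the minority; libraries such as imbalanced-learn provide more elaborate resampling methods:

```python
import numpy as np

def random_oversample(X, y, positive_label=1, random_state=42):
    """Duplicate random positive examples until both classes are equally frequent.
    X, y: NumPy arrays. Apply only to the training set, never to the test set."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.where(y == positive_label)[0]
    neg_idx = np.where(y != positive_label)[0]
    extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
    keep = np.concatenate([neg_idx, pos_idx, extra])
    return X[keep], y[keep]
```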
What is rule-based classification?
- It is an eager learning approach which delivers explainable results
- Classify records by using a collection of “if.. then…” rules
Rule-based classifier = set of classification rules
What is a classification rule?
Classification rule: Condition -> y
condition: conjunction of attribute tests
y: class label
What is rule coverage?
- Fraction of all records that satisfy the condition of a rule
- Fraction of all records that are covered by the rule
What is accuracy of a rule?
- Fraction of covered records that are classified correctly
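A minimal sketch of both measures, assuming a rule is given as a Python predicate (its condition) plus a predicted class, and records are dicts carrying their class under a hypothetical 'label' key:

```python
def coverage_and_accuracy(records, condition, predicted_class):
    """records: list of dicts with a 'label' key; condition: record -> bool."""
    covered = [r for r in records if condition(r)]
    coverage = len(covered) / len(records)
    correct = [r for r in covered if r["label"] == predicted_class]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

records = [
    {"status": "single", "income": 125, "label": "no"},
    {"status": "married", "income": 100, "label": "no"},
    {"status": "single", "income": 70, "label": "yes"},
    {"status": "married", "income": 120, "label": "no"},
]
# Rule: (status = single) -> yes
cov, acc = coverage_and_accuracy(records, lambda r: r["status"] == "single", "yes")
# cov = 0.5 (2 of 4 records satisfy the condition), acc = 0.5 (1 of the 2 is 'yes')
```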
What are the characteristics of rule-based classifiers?
Mutually Exclusive Rule Set:
- the rules contained in the classifier are independent of each other
- every record is covered by at most one rule
Exhaustive Rule Set:
- classifier has exhaustive coverage if it accounts for every possible combination of attribute values
- each record is covered by at least one rule
How can you fix a rule set that is not mutually exclusive?
Solution 1: Ordered Rules
- by accuracy
- classify according to highest-ranked rule
Solution 2: Voting
- all matching rules vote and assign majority class label
- votes may be weighted by rule quality (accuracy)
How can you fix a rule set that is not exhaustive?
- Add default rule: () -> Y
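A minimal sketch of classification with an ordered rule list plus a default rule; rules are hypothetical (condition, class label) pairs, ranked e.g. by accuracy:

```python
def classify(record, ordered_rules, default_class):
    """ordered_rules: list of (condition, class_label), highest-ranked first."""
    for condition, label in ordered_rules:
        if condition(record):
            return label      # the highest-ranked matching rule decides
    return default_class      # the default rule () -> Y fires if nothing matches

rules = [
    (lambda r: r["income"] > 110, "no"),
    (lambda r: r["status"] == "single", "yes"),
]
print(classify({"status": "married", "income": 90}, rules, default_class="no"))  # 'no'
```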
What methods for rule-based classifiers exist?
- Direct Method
- Extract rules directly from data
- e.g. RIPPER
- Indirect Method
- Extract rules from other classification models
- e.g. C4.5rules
Explain the indirect method to derive rules from a decision tree
- Generate a rule for every path from the root to one of the leaf nodes
- Rule set contains as much information as the tree
-> Generated rules are mutually exclusive and exhaustive!
Explain the indirect method: C4.5 rules
It applies rule simplification to the rule set. This makes the rule set no longer mutually exclusive. Thus, need to apply ordered rule set or voting schemes.
Approach:
- extract rules from an unpruned decision tree
- for each rule r:
- consider alternative rules r' obtained by removing one of the conjuncts
- compare the pessimistic error rate of r against all r'
- prune (replace r by the best r') if one of the r' has a lower pessimistic error rate
- repeat until the generalization error can no longer be improved
Explain the direct method: RIPPER
- learns an ordered rule set from the training data
- approach depends on 2-class or multi-class problem
Explain RIPPER for 2-class problem
- Choose the less frequent class as positive class and the other as negative class
- learn rules for the positive class
- negative class will be default class
Explain RIPPER for multi-class problem
- Order classes according to increasing frequency
- Learn the rule set for the smallest class first, treating the rest as the negative class
- repeat with the next smallest class as the positive class
How does RIPPER use sequential covering?
- Start from an empty rule list
- Grow a rule that covers as many positive examples as possible
- Remove the training records covered by the rule
- Repeat steps 2 and 3 until the stopping criterion is met
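A minimal pseudocode-style sketch of this loop; grow_rule and stopping_criterion are hypothetical placeholders for the steps explained in the following cards, and rules reuse the (condition, class label) representation from above:

```python
def sequential_covering(records, positive_class, grow_rule, stopping_criterion):
    rule_list = []                        # 1. start from an empty rule list
    remaining = list(records)
    while remaining and not stopping_criterion(rule_list, remaining):
        # 2. grow a rule (condition, label) that covers many positive examples
        condition, label = grow_rule(remaining, positive_class)
        rule_list.append((condition, label))
        # 3. remove all training records covered by the new rule
        remaining = [r for r in remaining if not condition(r)]
    return rule_list                      # 4. repeat until the stopping criterion is met
```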
What are the aspects of sequential covering?
- Rule Growing
- Rule Pruning
- Instance Elimination
- Stopping Criterion
Explain rule growing within the RIPPER algorithm
- Start with an empty rule: {} -> class
- Step by step, add the conjunct that maximizes FOIL's information gain measure, i.e. that
a. improves the accuracy of the rule and
b. keeps the rule covering many examples
Goal: Prefer rules with high accuracy and high support count
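A minimal sketch of FOIL's information gain for scoring a candidate conjunct; p0/n0 denote the positive/negative records covered by the rule before adding the conjunct, p1/n1 after (notation chosen here for illustration):

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain of extending a rule with one more conjunct."""
    if p1 == 0:
        return float("-inf")  # the extended rule covers no positive examples
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# The gain is high if the extended rule still covers many positives (high p1)
# and the fraction of positives among the covered records increases.
print(foil_gain(p0=100, n0=400, p1=30, n1=10))
```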
Explain what rule pruning is within the RIPPER algorithm
- Because of the stopping criterion, the learned rule is likely to overfit the data
-> The rule is pruned afterwards using a validation dataset (similar to post-pruning of decision trees)
How does the rule pruning procedure in RIPPER work?
- Remove one of the conjuncts in the rule
- compare error rate on validation dataset before and after pruning
- if error improves, prune the conjunct
Measure for pruning:
v = (p - n) / (p + n)
p: # positive examples covered by the rule in validation set
n: # negative examples covered by the rule in the validation set
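A minimal sketch of the pruning decision based on this metric; counting p and n on the validation set is assumed to happen elsewhere:

```python
def pruning_metric(p, n):
    """v = (p - n) / (p + n) on the validation set; larger is better."""
    return (p - n) / (p + n) if (p + n) > 0 else float("-inf")

# Prune the conjunct if the shortened rule scores at least as well on validation data.
def should_prune(p_before, n_before, p_after, n_after):
    return pruning_metric(p_after, n_after) >= pruning_metric(p_before, n_before)
```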
What is the goal of Rule Pruning in RIPPER?
Goal: Decrease generalization error of the rule
Why do we need to remove positive instances in the RIPPER algorithm?
- Otherwise the next rule is identical to the previous rule
Why do we remove negative instances in the RIPPER algorithm?
- To prevent underestimating the accuracy of a rule
What is the stopping criterion to add new rules to the rule set for RIPPER?
- error rate of new rule on validation set must not exceed 50%
- the minimum description length must not increase by more than d bits
What are the advantages of rule-based classifiers?
- Easy to interpret for humans (eager learning)
- Performance comparable to decision trees
- Fast classification of unseen records
- Well suited to handle imbalanced data sets, because they learn rules for minority class first