Evaluation Metrics Flashcards
When NOT to use accuracy?
- When there is a class imbalance in the data,
- because a model can achieve high accuracy simply by predicting the majority class every time.
In such a case, the model scores well, but it may not be a useful model.
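A rough sketch of this failure mode, assuming scikit-learn is available (the 95/5 class split and the always-negative "model" are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.95 -> looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -> useless on the minority class
```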
Which metrics to use for Imbalanced Classes?
precision, recall, F1 score
What does Accuracy measure and what’s the formula?
Accuracy is the proportion of data points that are correctly classified.
Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
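A minimal sketch of the formula in code (the labels below are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Accuracy = number of correct predictions / total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 3 of 4 predictions match the true labels -> 0.75
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```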
Precision
Precision measures the proportion of positive predictions that are true positives.
Precision = True positives/ (True positives + False positives)
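A small sketch of the same formula (the TP/FP counts are hypothetical):

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): the share of positive predictions that are actually positive."""
    return tp / (tp + fp)

# 8 correct positive predictions out of 10 total positive predictions -> 0.8
print(precision(tp=8, fp=2))
```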
When to use Precision?
when it’s important to avoid false positives.
Recall
Recall measures the proportion of actual positives that are correctly identified.
Recall = True positives / (True positives + False negatives)
When to use Recall?
Recall is a good metric to use when it's important to identify as many true positives as possible,
or when false negatives are more costly than false positives.
For example, if your model is identifying poisonous mushrooms, it’s better to identify all of the true occurrences of poisonous mushrooms, even if that means making a few more false positive predictions.
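A minimal sketch using hypothetical counts for the mushroom example (45 poisonous mushrooms correctly flagged, 5 missed):

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): the share of actual positives the model finds."""
    return tp / (tp + fn)

# 45 poisonous mushrooms caught, 5 missed (false negatives) -> 0.9
print(recall(tp=45, fn=5))
```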
Receiver operating characteristic (ROC) curves
visualize the performance of a classifier at different classification thresholds.
What do the X and Y axes of an ROC curve represent?
X = false positive rate and Y = true positive rate at the corresponding threshold.
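A minimal sketch using scikit-learn's `roc_curve` on made-up labels and scores; each (FPR, TPR) pair is one point on the curve at the corresponding threshold:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)         # x-axis: false positive rate at each threshold
print(tpr)         # y-axis: true positive rate at each threshold
print(thresholds)  # the thresholds themselves
```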
What is the ideal ROC model?
An ideal model perfectly separates all negatives from all positives, and gives all
- real positive cases a very high probability (~1) and
- all real negative cases a very low probability (~0)
What does an ideal ROC curve look like?
An ROC curve is used to assess the performance of a classification model.
A good model has an ROC curve that is close to the top-left corner of the plot, indicating high TPR and low FPR across the different threshold settings.
What does AUC (Area Under the Curve) measure?
AUC is the area under an ROC curve. It measures:
* Overall performance of a classification model.
* Ability of the model to distinguish between positive and negative classes.
Interpretation:
- Higher AUC: Better model performance.
- AUC of 1: Perfect model.
- AUC of 0.5: No better than random guessing.
Key Points:
- Independent of class imbalance: AUC is not affected by the proportion of positive and negative examples in the dataset.
- Considers all classification thresholds: It evaluates the model’s performance across different thresholds.
When to use AUC:
- When you want to evaluate the overall performance of a classification model.
- When you care about the ranking of positive and negative instances.
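A minimal sketch using scikit-learn's `roc_auc_score` on the same made-up labels and scores as above:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 1.0 = positives always ranked above negatives; 0.5 = random ranking
print(roc_auc_score(y_true, y_score))  # 0.75
```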
F1 score
is a measurement that combines both precision and recall into a single expression, giving each equal importance.
F1 = 2 × (Precision × Recall) / (Precision + Recall), i.e. the harmonic mean of precision and recall.
What is the difference between harmonic mean (F1 score) vs arithmetic mean?
Why is f1 score a more robust metric for classification model?
Harmonic mean: Gives more weight to lower values. This means that if either precision or recall is low, the F1 score will be significantly impacted.
Arithmetic mean: Gives equal weight to all values.
Why F1 Score is Always Less Than or Equal to the Mean
Because the harmonic mean is always less than or equal to the arithmetic mean for positive numbers, the F1 score is always less than or equal to the arithmetic mean of precision and recall, and it is pulled toward whichever of the two is lower.
To illustrate this:
* If precision and recall are equal, the F1 score will be equal to the mean.
* If precision and recall are different, the F1 score will be less than the mean.
In essence, the F1 score penalizes imbalances between precision and recall more severely than the arithmetic mean.
This makes it a more robust metric for evaluating classification models when both precision and recall are important
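A quick sketch with made-up precision and recall values showing how the harmonic mean (F1) is dragged toward the lower value while the arithmetic mean is not:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

p, r = 0.9, 0.1
print(f1(p, r))     # 0.18 -> dominated by the weak recall
print((p + r) / 2)  # 0.5  -> the arithmetic mean hides the weak recall
```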
What does a F1 Score of 0 or 1 mean?
The F1 score ranges from 0 to 1, with a higher score indicating better performance.
Here’s a breakdown:
* 0: The worst possible score. The model produces no true positives, so it fails to correctly identify any positive instances.
* 0.5: Average performance. The model can only weakly distinguish between the classes.
* 0.7-0.8: Generally considered good performance. The model identifies most instances correctly, though precision and recall may not be equally balanced.
* 0.9-1.0: Considered excellent performance. The model identifies nearly all instances correctly, with a strong balance between precision and recall.