ML Metrics Flashcards
Offline metrics for classification models
- Precision
- Recall
- F1 score
- Accuracy
- ROC-AUC
- PR-AUC
- Confusion matrix
Offline metrics for regression
- Mean squared error (MSE)
- Mean absolute error (MAE)
- Root mean squared error (RMSE)
Offline metrics for ranking systems
- MRR
- mAP
- nDCG
Online metrics for ad click prediction
- Click-through rate (CTR)
- Ad revenue
Online metrics for harmful content detection
- Number of reports
- Actioned reports
Online metrics for video recommendations
- Click-through rate
- Total watch time
- Number of completed videos
Types of Loss Functions
Mean squared error
Categorical cross-entropy loss
Binary cross-entropy loss
Mean squared error
- Measures the average squared difference between the predicted output and the true output
- Used as a loss function to optimize the model parameters during training
- What we're trying to minimize when we train a regression model
- MSE = (1/n) * Σ (y_i − ŷ_i)^2
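A minimal sketch of MSE alongside MAE and RMSE, using NumPy (the arrays are made-up example values):

```python
import numpy as np

# Made-up example values for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
rmse = np.sqrt(mse)                     # root mean squared error
print(mse, mae, rmse)  # 0.375 0.5 0.612...
```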
Precision
- Positive predictive value
- Probability that a sample classified as positive is actually positive
- TP / (TP + FP) (see the code sketch after the recall card)
Recall
- Same as the true positive rate
- True positives / all actual positives
- TP / (TP + FN)
- Also called the sensitivity of the classifier
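A quick sketch of both formulas, precision and recall, from raw confusion-matrix counts (the counts here are invented):

```python
# Invented confusion-matrix counts for illustration
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # TP / (TP + FP) = 0.8
recall = tp / (tp + fn)     # TP / (TP + FN) ≈ 0.667
print(precision, recall)
```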
What’s the best metric when you have a large number of negative samples
Precision and recall
Precision is not affected by a large number of negative samples because it measures the fraction of true positives out of all predicted positives (TP + FP).
Precision measures the probability of correct detection of positive values, while FPR, TPR, and ROC measure the ability to distinguish between the classes.
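A small numeric demonstration of this point (counts are invented): piling on true negatives moves accuracy and FPR, but precision and recall do not budge.

```python
# Invented counts; only the number of true negatives varies
tp, fp, fn = 10, 5, 5

for tn in (10, 10_000):
    precision = tp / (tp + fp)                  # unaffected by tn
    recall = tp / (tp + fn)                     # unaffected by tn
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # inflated by tn
    fpr = fp / (fp + tn)                        # shrinks as tn grows
    print(tn, precision, recall, round(accuracy, 3), round(fpr, 5))
```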
Highest value of F1
1.0, indicating perfect precision and recall
Lowest value of F1
0 if either precision or recall is 0
AUC range
0 to 1
ROC
- Receiver operating characteristic curve
- True positive rate (recall) on the y-axis, false positive rate on the x-axis
- Captures the performance of a classification model at all classification thresholds (probability thresholds)
- Does not depend on class distribution!
AUC
- Area under the ROC curve
- Used to evaluate a binary classification model
- Quantifies the ability of the model to distinguish between the classes (the probability that a random positive sample is scored higher than a random negative one)
- Ranges from 0 to 1
AUC of 0
A model that is 100% wrong
AUC of 1
A model that is 100% correct
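A minimal sketch computing ROC-AUC with scikit-learn (labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
print(roc_auc_score(y_true, y_score))
```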
What’s the best metric when you have a large number of positive samples
ROC is a better metric
What metric should you use when detection of both classes is equally important
ROC
F1
- Used to evaluate the performance of a binary classification model
- Combines precision and recall into a single measure
- Harmonic mean of precision and recall, which provides a balanced measure of the model's accuracy
- F1 = 2 * (precision * recall) / (precision + recall)
- F1 is 0 if either precision or recall is 0
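One way to sketch the harmonic-mean formula (the function name is mine, not a library call):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 if either is 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.5))  # ~0.615, pulled toward the weaker of the two
```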
True positive rate
- aka recall
- True positives / all actual positives
- TP / (TP + FN)
Offline metrics
- Score the model while you are building it, before it is put into production (on the train, eval, and test datasets)
- Examples: ROC, AUC, F1, R^2, MSE, intersection over union
Online metrics
- Scores from the model once it is running in production and serving traffic
- Domain specific: things like click-through rate or minutes spent watching a video
MRR
- mean reciprocal rank
- only considers the rank of the first relevant item
- not a good measure of the quality of the list as a whole
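A minimal MRR sketch (the helper name and the example ranks are mine):

```python
def mrr(first_relevant_ranks: list[int]) -> float:
    """Mean reciprocal rank: average of 1/rank of the first
    relevant item, taken over all queries (ranks are 1-based)."""
    return sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# First relevant item at ranks 1, 3, and 2 across three queries
print(mrr([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```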
mAP
- mean average precision
- good for ranking problems
- works well for binary relevance (relevant or irrelevant).
- For continuous relevance scores use nDCG
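A sketch of one common mAP variant for binary relevance (helper names and example lists are made up; here AP averages precision@i over the relevant positions that were retrieved):

```python
def average_precision(relevances: list[int]) -> float:
    """AP for one ranked list with binary relevance (1 = relevant)."""
    hits, score = 0, 0.0
    for i, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            score += hits / i  # precision@i at each relevant position
    return score / hits if hits else 0.0

# mAP = mean of AP over all queries
queries = [[1, 0, 1, 0], [0, 1, 1, 0]]
print(sum(average_precision(q) for q in queries) / len(queries))
```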
nDCG
- winner, winner for ranking problems
- continuous relevance score
- shows how good the ranking is compared to the ideal ranking
- takes into account the position of the relevant item in a ranked list
- Ranges from 0 to 1; higher values indicate better performance
nDCG acronym
normalized discounted cumulative gain
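A minimal nDCG sketch with the standard log2 position discount (the relevance scores are made up):

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: position i (0-based) is
    discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of a ranked list; ideal order would be [3, 2, 1, 0]
print(ndcg([3, 1, 2, 0]))  # ≈ 0.97
```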
Cross entropy
- how close the model’s predicted probabilities are to the ground truth label.
- CE is zero for an ideal system that predicts probability 0 for the negative class and 1 for the positive class.
- The lower the CE, the better the model's predictions.
- Good for ad click prediction
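A sketch of binary cross entropy as it would apply to click prediction (labels and probabilities are invented):

```python
import math

def binary_cross_entropy(y_true: list[int], y_prob: list[float]) -> float:
    """Average negative log-likelihood of the true labels."""
    eps = 1e-15  # clamp probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up click labels and predicted click probabilities
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))  # ≈ 0.28
```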
Normalized cross entropy (NCE)
- Ratio of our model's CE to the CE of the background CTR.
- Low NCE indicates the model outperforms the baseline.
- NCE ≥ 1 indicates that the model is not performing better than the baseline.
- Good for ad click prediction
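A sketch of NCE under the usual reading of "CE of the background CTR": a baseline that predicts the average click rate for every impression (all numbers are invented; reuses a binary CE helper):

```python
import math

def cross_entropy(y_true: list[int], y_prob: list[float]) -> float:
    eps = 1e-15
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

# Invented click labels and model probabilities
clicks = [1, 0, 0, 1, 0]
model_probs = [0.7, 0.2, 0.1, 0.6, 0.3]

# Baseline: predict the background CTR for every impression
ctr = sum(clicks) / len(clicks)
baseline = [ctr] * len(clicks)

nce = cross_entropy(clicks, model_probs) / cross_entropy(clicks, baseline)
print(nce)  # < 1 means the model beats the CTR baseline
```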