Evaluation Metrics for Classification Flashcards
Accuracy
It tells us the fraction of correct predictions among all predictions made by the model. It can be computed by dividing the number of predictions that match the original outcomes by the total number of records.
It's the same as calculating the mean of the boolean array (prediction == actual).
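A minimal sketch of this computation, using hypothetical arrays y_val (true 0/1 labels) and y_pred (predicted probabilities):

import numpy as np

y_val = np.array([1, 0, 1, 1, 0])                 # hypothetical true labels
y_pred = np.array([0.8, 0.3, 0.6, 0.4, 0.2])      # hypothetical predicted probabilities

decisions = (y_pred >= 0.5)                       # turn probabilities into binary decisions
accuracy = (decisions == y_val).mean()            # fraction of correct decisions = mean of matches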
How to decide the threshold for converting predicted probabilities into binary outcomes?
Normally, we say that if a probability is greater than or equal to 0.5 it's a positive outcome, and if it's smaller than 0.5 it's a negative outcome. But we can try a range of thresholds, e.g. 0.3 or 0.7, and check whether the accuracy improves.
In NumPy, you can generate the candidate thresholds with np.linspace(0, 1, 21), i.e. 21 evenly spaced values from 0 to 1.
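A minimal sketch of a threshold sweep, assuming y_val and y_pred are the NumPy arrays from the accuracy sketch above:

thresholds = np.linspace(0, 1, 21)                                  # 21 candidate thresholds
accuracies = [(y_val == (y_pred >= t)).mean() for t in thresholds]  # accuracy at each threshold
best_threshold = thresholds[int(np.argmax(accuracies))]             # threshold with the highest accuracy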
How to count the number of values in Python?
from collections import Counter
Counter(y_pred >= 1.0)
It counts how many times each distinct value appears, here the number of True and False predictions at threshold 1.0.
Why is accuracy not the right metric?
If we calculate accuracies for different thresholds from 0 to 1, the accuracy at threshold 1.0 is still pretty good, even though at that threshold the model predicts the same (negative) outcome for every record, which can't be a useful model in a real-world scenario. This happens with class imbalance: a dummy model that always predicts the majority class still gets a high accuracy.
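A minimal sketch of the problem, using a hypothetical dataset where only 20% of records are positive:

y_imbalanced = np.array([0] * 80 + [1] * 20)      # imbalanced labels: 80% negative
dummy_pred = np.zeros(100)                        # dummy model: always predicts negative
(dummy_pred == y_imbalanced).mean()               # 0.8 accuracy despite the model being useless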
Confusion Matrix
A way to evaluate the model that is not affected by class imbalance.
Given a threshold, there are two possible scenarios for each of the positive and negative classes. So we can have True Positive, False Positive, True Negative and False Negative. Both False Positive and False Negative are incorrect predictions.
True Positive:
g(xi) >= t & y = 1
True Negative:
g(xi) < t & y = 0
False Positive:
g(xi) >= t & y = 0
False Negative:
g(xi) < t & y = 1
We create a matrix out of these counts:
[[TN, FP],
 [FN, TP]]
If we divide confusion_matrix / confusion_matrix.sum(), we get the relative proportion of each of the four values instead of raw counts.
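A minimal sketch of building the matrix with NumPy, assuming y_val (true 0/1 labels), y_pred (predicted probabilities) and a threshold of 0.5:

t = 0.5
predict_positive = (y_pred >= t)
predict_negative = (y_pred < t)
actual_positive = (y_val == 1)
actual_negative = (y_val == 0)

tp = (predict_positive & actual_positive).sum()   # predicted positive, actually positive
tn = (predict_negative & actual_negative).sum()   # predicted negative, actually negative
fp = (predict_positive & actual_negative).sum()   # predicted positive, actually negative
fn = (predict_negative & actual_positive).sum()   # predicted negative, actually positive

confusion_matrix = np.array([[tn, fp], [fn, tp]])
confusion_matrix / confusion_matrix.sum()         # proportions instead of raw counts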
Precision and Recall
The values in the confusion matrix can be combined into different metrics, e.g.
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision tells us the fraction of positive predictions that turned out to be correct.
Precision: TP / (TP + FP) (computed over the predicted positive class)
Recall: the fraction of actual positive outcomes that were correctly identified.
Recall: TP / (TP + FN)
They are useful where class imbalance is present, because they focus on the positive class rather than on overall correctness.
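A minimal sketch, reusing the tp, fp and fn counts from the confusion-matrix sketch above:

precision = tp / (tp + fp)   # how many predicted positives are actually positive
recall = tp / (tp + fn)      # how many actual positives the model managed to find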
ROC Curves
ROC stands for Receiver Operating Characteristic.
It's a way to evaluate the performance of a binary classifier across all thresholds.
The idea comes from radar signal detection, where it was used to measure how well the strength of a signal could be used to detect planes.
We are interested in False Positive Rates and True Positive Rates.
FPR = FP/(TN+FP) #first row of confusion matrix
TPR = TP /(FN+TP) #second row of confusion matrix
TPR is equal to recall.
These two values are computed for all possible thresholds; plotting TPR against FPR forms the ROC curve.
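A minimal sketch that computes both rates for a grid of thresholds, assuming y_val and y_pred are as before and contain both classes:

scores = []
for t in np.linspace(0, 1, 101):
    tp = ((y_pred >= t) & (y_val == 1)).sum()
    fp = ((y_pred >= t) & (y_val == 0)).sum()
    fn = ((y_pred < t) & (y_val == 1)).sum()
    tn = ((y_pred < t) & (y_val == 0)).sum()
    scores.append((t, fp / (fp + tn), tp / (tp + fn)))   # (threshold, FPR, TPR)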
df[::10] slices with a step of 10, i.e. it keeps every 10th row.
We need to compare our ROC curve with that of a random model (a classifier that assigns random scores; its curve is the diagonal).
We also want to compare it with an ideal model, i.e. one with 100% accuracy, which ranks every positive example above every negative one.
Calculate ROC Curves using scikit-learn
from sklearn.metrics import roc_curve
FPR, TPR, thresholds = roc_curve(y_val, y_pred)
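A minimal plotting sketch, assuming matplotlib is installed; the dashed diagonal stands in for the random model and the top-left corner is the ideal point:

import matplotlib.pyplot as plt

plt.plot(FPR, TPR, label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='random')   # random baseline (diagonal)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()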
ROC AUC
The Area Under the ROC Curve (AUC) is a useful metric for binary classification models.
For the ROC curve, we want to be as close as possible to the ideal point, the top-left corner (FPR = 0, TPR = 1). These are the best models. If the curve is close to the random diagonal, it's a bad model, and anything below the diagonal means something is wrong (e.g. the scores are inverted).
The greater the AUC, the better the model.
The random diagonal cuts the square in half, so its AUC is 0.5.
The ideal model has an AUC of 1.0; a curve close to the ideal one has an AUC around 0.8 or 0.9, while a curve close to the random diagonal has an AUC around 0.6.
from sklearn.metrics import auc
auc(FPR, TPR)   # area under any curve, given its x and y coordinates
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, y_pred)   # computes the ROC curve and its AUC directly from labels and scores
AUC also has a probabilistic interpretation: it is the probability that a randomly selected positive example gets a higher score than a randomly selected negative example.
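A minimal sketch checking this interpretation by random sampling, assuming y_val and y_pred are the NumPy arrays used before:

import random

pos = y_pred[y_val == 1]   # scores of the positive examples
neg = y_pred[y_val == 0]   # scores of the negative examples

n = 10000
success = sum(random.choice(pos) > random.choice(neg) for _ in range(n))
success / n                # should be close to roc_auc_score(y_val, y_pred)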
K-Fold Cross Validation
We keep the test data separate.
We split the full training dataset into k parts, e.g. k = 3.
We use parts 1 & 2 as the training dataset and part 3 as the validation dataset.
Then we train on parts 1 & 3 and use part 2 as the validation dataset, and so on for every combination.
For each split we calculate AUC on the validation dataset, then report the mean AUC and its standard deviation.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
idx_train, idx_val = next(kfold.split(df_full_train))   # train/validation indices of the first split
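A fuller sketch of the cross-validation loop, assuming hypothetical helpers train() and predict() that fit a model and return predicted probabilities, and a binary target column named 'y':

import numpy as np
from sklearn.metrics import roc_auc_score

aucs = []
for idx_train, idx_val in kfold.split(df_full_train):
    df_train = df_full_train.iloc[idx_train]
    df_val = df_full_train.iloc[idx_val]

    model = train(df_train)                   # hypothetical training helper
    y_pred = predict(df_val, model)           # hypothetical prediction helper

    aucs.append(roc_auc_score(df_val.y, y_pred))

print('auc = %.3f +- %.3f' % (np.mean(aucs), np.std(aucs)))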
The usual hold-out split is okay when the dataset is big and training is expensive; for smaller datasets you can afford more splits (a larger k).
How to check how long each iteration is taking?
We can use the tqdm library, which wraps an iterable and shows a progress bar:
from tqdm.auto import tqdm
for i in tqdm(range(10)):   # tqdm displays a progress bar with the time per iteration
    pass                    # one unit of work per iteration, e.g. training one fold
ROC AUC Feature Importance
ROC AUC can also be used to evaluate the feature importance of numerical variables.
For each numerical variable, use its values as if they were prediction scores and compute the AUC against the target variable. An AUC close to 0.5 means the feature carries little signal on its own; an AUC far from 0.5 means it is informative (below 0.5 simply means it is negatively correlated, so invert it). A minimal sketch follows below.
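A minimal sketch, assuming df_train is a DataFrame with a binary target column named 'y' and a hypothetical list of numerical column names:

from sklearn.metrics import roc_auc_score

numerical = ['feature_a', 'feature_b']                       # hypothetical numerical columns
for col in numerical:
    score = roc_auc_score(df_train.y, df_train[col])         # treat the raw feature values as scores
    if score < 0.5:
        score = roc_auc_score(df_train.y, -df_train[col])    # invert negatively correlated features
    print(col, round(score, 3))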