Model Evaluation Flashcards
Confusion Matrix
Used for assessing model performance
- Can become large for multi-classification
- For binary (e.g. Logistic regression):
1. True Positive: Correctly classified as positive (good)
2. False Positive: Incorrectly classified as positive (bad)
3. True Negative: Correctly classified as negative (good)
4. False Negative: Incorrectly classified as negative (bad)
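A minimal sketch of pulling the four counts out of a binary confusion matrix with scikit-learn (the labels and predictions below are made up for illustration):

```python
# Minimal sketch: binary confusion matrix with scikit-learn (toy labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=3, FP=1, TN=3, FN=1
```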
Sensitivity
Also known as true positive rate, or recall:
- Number of correct positives out of the actual positive results
- I.e. the % of actual positive cases that the model correctly classified as positive
- TPs / (TPs + FNs)
- Closer to 1 is better
Specificity
- Also known as true negative rate
- The number of correct negatives out of the actual negative results
- TNs / (TNs + FPs)
- Closer to 1 is better
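A minimal sketch of both rates, computed from the toy confusion-matrix counts in the earlier example:

```python
# Sensitivity and specificity from confusion-matrix counts (toy values).
tp, fp, tn, fn = 3, 1, 3, 1

sensitivity = tp / (tp + fn)   # true positive rate / recall
specificity = tn / (tn + fp)   # true negative rate

print(f"Sensitivity = {sensitivity:.2f}")  # 0.75
print(f"Specificity = {specificity:.2f}")  # 0.75
```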
When is sensitivity more important?
When False Positives are acceptable but False Negatives are not. E.g. Detecting fraudulent transactions, medical diagnosis
When is specificity more important?
When False Negatives are acceptable but False Positives are not. E.g. model that ensures images are appropriate for children.
Accuracy
The proportion of all predictions that were correctly identified. I.e. how right is the model?
- (TPs + TNs) / Total
Precision
The proportion of predicted positives that were actually positive
- TPs / (TPs + FPs)
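A minimal sketch of accuracy and precision from the same toy counts:

```python
# Accuracy and precision from confusion-matrix counts (toy values).
tp, fp, tn, fn = 3, 1, 3, 1

accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions that were right
precision = tp / (tp + fp)                   # share of predicted positives that really are positive

print(f"Accuracy  = {accuracy:.2f}")   # 0.75
print(f"Precision = {precision:.2f}")  # 0.75
```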
ROC / AUC
- In binary classification, the decision threshold determines which side a value is classified to. We can adjust that threshold to trade off sensitivity against specificity
- As we adjust the line up and down between 0 and 1, we achieve different confusion matrices with different results
- We plot the TP rate against the FP rate for each of those confusion matrices to obtain the ROC curve.
- The “knee points” in the curve will give us the optimum points for sensitivity and specificity
- When comparing models, we plot a ROC for each, compare the AUC and choose the model with the largest
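A minimal sketch of sweeping the threshold and comparing by AUC with scikit-learn (the scores below are made-up predicted probabilities):

```python
# ROC curve and AUC with scikit-learn (toy scores).
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]   # predicted probabilities

# Each threshold gives a different confusion matrix; roc_curve sweeps them all
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
print(f"AUC = {auc:.2f}")   # larger AUC = model separates the classes better
```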
Gini Impurity
- A metric used when building decision trees to measure how mixed the classes are after a split
- Goal is to assess the impact of the question itself, and whether it should be at the root node or not
- We compare each of the features by weighted Gini Impurity and choose the one with the lowest value, which identifies the feature that best separates the samples into their classes
Gini Impurity (calculation)
1 - (probability of class 1)^2 - (probability of class 2)^2
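A minimal sketch of the calculation, generalised to any number of classes (the node counts are made up):

```python
# Gini impurity of a node: 1 minus the sum of squared class probabilities.
def gini_impurity(class_counts):
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# e.g. a node holding 3 samples of class 1 and 1 sample of class 2
print(gini_impurity([3, 1]))   # 1 - 0.75^2 - 0.25^2 = 0.375
```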
F1 Score
- Combination of Recall and Precision
- 2 / ((1/Recall) + (1/Precision))
- Can be good for separating models that have very similar accuracy scores
- Takes into account both the FPs and FNs
- Generally a better indicator of model quality than accuracy when you have an uneven class distribution
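A minimal sketch of the harmonic-mean formula, reusing the precision and recall values from the toy counts above:

```python
# F1 score as the harmonic mean of precision and recall (toy values).
precision, recall = 0.75, 0.75

f1 = 2 / ((1 / recall) + (1 / precision))
# equivalently: 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2f}")   # 0.75
```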
Linear regression metrics: SSE, Rsquared, Adj RSquared
- SSE: the Sum of Squared Errors between the model's predictions and the actual values
- R squared: 1 - (SSE / SST), where SST is the total sum of squares (the variance of the data around its mean)
- Value between 0 and 1
- Extent to which the variance in the data is explained by the model
- Closer to 1 better
- Adding more variables leads to higher R squared - doesn’t account for overfitting
- Adjusted R squared
- 1 - (1 - Rsquared) * ((no of data points - 1) / (no of data points - no of variables - 1))
- Takes into account the effect of adding more variables
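A minimal sketch of both metrics from residuals (the observed and predicted values are made up, with one predictor variable assumed):

```python
# R squared and adjusted R squared for a toy single-variable regression.
import numpy as np

y      = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # observed values
y_pred = np.array([2.8, 5.1, 7.2, 8.7, 11.3])   # model predictions
n, k = len(y), 1                                 # data points, predictor variables

sse = np.sum((y - y_pred) ** 2)        # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares (variance around the mean)
r_squared = 1 - sse / sst
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
```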
Linear regression metrics: Confidence intervals
- Normal distribution: Majority of density is contained within +/- 3 std devs of the mean
- Central Limit Theorem: no matter what the original distribution of X is, the sample mean of X (i.e. X bar) will follow a normal distribution (helps to give us Confidence Intervals)
- Confidence intervals quantify margin-of-error between sample metric and true metric due to sampling randomness
- 90% CI: if we repeatedly drew random samples from the population and built a confidence interval from each, about 90 out of every 100 of those intervals would contain the true value
- We can have CIs for proportions or means
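A minimal sketch of a 90% confidence interval for a sample mean, leaning on the Central Limit Theorem (the sample values are made up, and the 1.645 z-value assumes a reasonably large sample; for small samples a t-value would be the safer choice):

```python
# 90% CI for a mean: point estimate +/- z * standard error.
import numpy as np

sample = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.7, 5.3])
n = len(sample)

mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
z = 1.645                                    # two-sided z-value for 90% confidence

lower, upper = mean - z * std_err, mean + z * std_err
print(f"90% CI for the mean: ({lower:.2f}, {upper:.2f})")
```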