Supervised Learning: Classification Flashcards
What are the hyperparameters of logistic regression?
C: the inverse of regularization strength. Smaller values = stronger REGULARIZATION, i.e. the model is pushed to shrink coefficients toward zero, limiting the influence of individual predictors.
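A minimal sketch of setting C (assuming X and y are already defined; the values are illustrative):
from sklearn.linear_model import LogisticRegression
weak_reg = LogisticRegression(C=10.0)    # large C = weak regularization
strong_reg = LogisticRegression(C=0.01)  # small C = strong regularization
strong_reg.fit(X, y)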
What is a Random Forest model?
Collection of Decision Tree models, each trained on a bootstrap sample (random subset) of the training data and using a random subset of features at each split, combined in the following way:
Each tree outputs the probability of each observation being in each category; those probabilities are averaged across the trees, and the category with the highest averaged probability wins.
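A minimal sketch, assuming X and y are already defined:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)  # 100 trees, each fit on a bootstrap sample
rf.fit(X, y)
rf.predict_proba(X)  # per-class probabilities, averaged across the trees
rf.predict(X)        # the class with the highest averaged probability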
What are unbalanced samples and what are some common techniques of correcting them?
Unbalanced samples are when, say, 90% of the y’s fall into category 0 and only 10% fall into category 1.
Techniques:
- Intentionally collecting more data on the minority class
- Undersampling the majority class (by discarding a random subset of its data)
- Creating synthetic samples from the minority class (e.g., SMOTE method)
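A hedged sketch of the last two techniques; both use the separate imbalanced-learn package, and X_train / y_train are assumed to already exist:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
X_under, y_under = RandomUnderSampler().fit_resample(X_train, y_train)  # discard a random subset of the majority class
X_smote, y_smote = SMOTE().fit_resample(X_train, y_train)               # create synthetic minority-class samples
# Apply this to the TRAINING data only (see the next card).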
If you train a classification model with an altered TRAINING dataset that intentionally oversampled the minority class or undersampled the majority class to correct an imbalance: Should you do the same with the TEST dataset?
No. The test set should keep the original, real-world class balance, so that the evaluation metrics reflect how the model will actually perform on new data.
What underlies logit’s eventual classification of each data point into a category/class?
The model actually outputs an underlying PROBABILITY of the data point being in each category. It then classifies the point into the category with the highest probability.
What is the basic syntax for running a logit?
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()
logit.fit(X, y)
y_pred = logit.predict(X)
y_pred_probz = logit.predict_proba(X) # this is key to remember
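A quick check (assuming the fitted logit above) that predict() is just the highest-probability class:
import numpy as np
probs = logit.predict_proba(X)                      # one column per class
manual_pred = logit.classes_[probs.argmax(axis=1)]  # pick the highest-probability class
# (manual_pred == y_pred).all() should be True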
What’s the syntax for measuring the accuracy of a logit model? Is that a good measure of how good the model is?
from sklearn import metrics
logit.score(X, y) # from the trained model itself, OR
metrics.accuracy_score(y, y_pred) # doesn’t need the model, just the actual Ys vs predicted Ys
Accuracy is a decent measure when the classes are somewhat balanced. But if in the actual data class 0 is 90% of the cases, it becomes problematic: a model that always predicts class 0 is already 90% accurate. At that point, accuracy can be used as a first pass but nothing more - the model had better be >90% accurate.
Define Recall and Precision. Which kind of error does each of them minimize?
Note: Both recall and precision are defined for each category, not for all categories at once. The individual categories’ Recall and Precision can then be weighted-averaged.
Recall: Out of data points that really were Category 1, how many were labeled as Category 1? High Recall minimizes false negatives.
Precision: Out of data points that were labeled as Category 1, how many were really Category 1? High Precision minimizes false positives.
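A tiny worked example with made-up counts, to pin down the formulas:
# Suppose, for Category 1: 80 true positives, 20 false negatives, 40 false positives
TP, FN, FP = 80, 20, 40
recall = TP / (TP + FN)     # 80 / 100 = 0.80 -> of the real 1s, 80% were caught
precision = TP / (TP + FP)  # 80 / 120 ≈ 0.67 -> of the predicted 1s, 67% were really 1s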
What are the major measures that evaluate a classification model’s quality? What is their syntax? How do you see them all at once?
from sklearn import metrics
Accuracy: metrics.accuracy_score(y, y_pred)
Precision: metrics.precision_score(y, y_pred)
Recall: metrics.recall_score(y, y_pred)
F-score: metrics.f1_score(y, y_pred)
# All at once: metrics.classification_report(y, y_pred) # also prints "support", ie what the real N was in each category
What is a confusion matrix? How do you print it?
If there are N categories to classify into, the confusion matrix is an NxN matrix where rows are real labels and columns are predicted labels. Each cell counts how many data points had that particular (real, predicted) combination; a good model has high counts on the diagonal and low counts everywhere else.
# Syntax: metrics.confusion_matrix(y, y_pred)
The default threshold for a binary classification is p=.5. How would you change it to, say, .6? And why would you ever want to do that?
HOW:
There’s no built-in hyperparameter for this, so you have to do it semi-manually:
y_pred_probz = logit.predict_proba(X)
then classify each probability as 1 or 0 yourself, based on whether it’s above or below .6 (see the sketch after the WHY below).
WHY:
To manipulate the model’s Recall and/or Precision to be above a min acceptable level.
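A minimal sketch of the semi-manual approach, assuming a fitted binary logit and data X:
y_pred_probz = logit.predict_proba(X)[:, 1]        # probability of class 1
y_pred_custom = (y_pred_probz >= 0.6).astype(int)  # classify as 1 only if p >= .6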
Let’s say there’s a requirement that your logit model’s Precision is at least .8 and Recall is at least .6. Your first model (with standard p=.5) has higher Precision than that but lower Recall than that. How could you try to improve the model?
Simple way:
Vary the p-threshold away from 0.5 (semi-manually) and measure the Precision and Recall each time.
Fancier way:
Use metrics.precision_recall_curve() to auto-generate a (Precision, Recall) pair for many values of p.
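A sketch of the fancier way, using the .8 / .6 requirements from this card (assuming a fitted logit and the true labels y):
from sklearn import metrics
probs = logit.predict_proba(X)[:, 1]
precisions, recalls, thresholds = metrics.precision_recall_curve(y, probs)
# thresholds has one fewer element than precisions/recalls
ok = (precisions[:-1] >= 0.8) & (recalls[:-1] >= 0.6)
candidate_thresholds = thresholds[ok]  # any of these p-thresholds meets both requirements (may be empty)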
What does metrics.auc(recalls, precisions) do?
Where does each (recall, precision) pair come from?
What is a “good” AUC score?
AUC is a way to evaluate a classification model (e.g., logit or random forest) in a holistic way, taking into account many possible decision thresholds rather than a single one (e.g., the p-threshold for logit).
It stands for Area Under Curve, where the curve plots Precision on one axis and Recall on the other. The (recall, precision) pairs are generated by varying the decision threshold: each p-threshold value in a logit model produces its own pair.
AUC varies between 0 and 1; higher is better. 1 is when both precision and recall are perfect.
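Continuing in the same spirit (assuming a fitted logit and true labels y), the PR-AUC is computed from the curve arrays:
from sklearn import metrics
precisions, recalls, _ = metrics.precision_recall_curve(y, logit.predict_proba(X)[:, 1])
pr_auc = metrics.auc(recalls, precisions)  # area under the precision-recall curve, between 0 and 1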
What does Receiver Operating Characteristic (ROC) mean?
The ROC curve is a related but distinct curve: it plots the True Positive Rate (i.e., Recall) against the False Positive Rate, with one point per decision threshold of the same model (e.g., different p-thresholds for a logit model). “AUC-ROC” is the area under that curve; the previous card’s AUC was instead the area under the Precision-Recall curve.
In sklearn, how does multiclass logit work relative to standard binary logit?
One common approach is the “one-vs-rest” (one-vs-all) scheme: take one category at a time and run a binary logit that generates the probability of each data point being “it” vs “not it”. For example, with 4 categories, 4 binary logits are run behind the scenes, and each data point is assigned the category with the highest of its 4 “it” probabilities. (Depending on the version and solver, sklearn’s LogisticRegression may instead fit a single multinomial model by default; one-vs-rest can be made explicit, as in the sketch below.)
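A hedged sketch of making the one-vs-rest scheme explicit (assuming multiclass X and y; sklearn can also be left to pick its own multiclass strategy):
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
ovr_logit = OneVsRestClassifier(LogisticRegression())  # fits one binary logit per category
ovr_logit.fit(X, y)
ovr_logit.predict(X)  # each point gets the category whose binary logit scored it highest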