Supervised Learning: Classification Flashcards

1
Q

What are the hyperparameters of logistic regression?

A

C: the inverse regularization strength. Smaller values = stronger REGULARIZATION, i.e. incentivizing the model to shrink its coefficients toward zero so that as few predictors as possible have much influence. (Other tunable options include penalty — the type of regularization, e.g. 'l1' vs 'l2' — and solver / max_iter for the optimizer.)
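
A minimal sketch of how C behaves (toy data from make_classification; the specific values are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Smaller C = stronger regularization = coefficients pulled toward zero
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

# The heavily regularized model's coefficients are smaller in total magnitude
print(abs(strong.coef_).sum(), abs(weak.coef_).sum())
```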

2
Q

What is a Random Forest model?

A

Collection of Decision Tree models trained on random subsets of data, combined in the following way:
Their probabilities of each observation being in each category are averaged out.
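
A quick sanity-check sketch (toy data; relies on sklearn's documented behavior that the forest's predict_proba is the mean of its trees' predict_proba):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Forest probabilities = average of the individual trees' probabilities
tree_probs = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
print(np.allclose(rf.predict_proba(X), tree_probs))
```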

3
Q

What are unbalanced samples and what are some common techniques of correcting them?

A

Unbalanced samples are when, say, 90% of the y’s fit into category 0 and only 10% fit into category 1.

Techniques:

  • Intentionally collecting more data on the minority class
  • Undersampling the majority class (by discarding a random subset of its data)
  • Creating synthetic samples from the minority class (e.g., SMOTE method)
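
A minimal undersampling sketch using sklearn.utils.resample (SMOTE itself lives in the separate imbalanced-learn package, so only undersampling is shown here):

```python
import numpy as np
from sklearn.utils import resample

# Toy unbalanced data: 90 majority (class 0), 10 minority (class 1)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Undersample the majority class down to the minority class size
X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]
X_maj_down, y_maj_down = resample(X_maj, y_maj, replace=False,
                                  n_samples=len(y_min), random_state=0)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.concatenate([y_maj_down, y_min])
print(np.bincount(y_bal))  # balanced: 10 of each class
```
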
4
Q

If you train a classification model with an altered TRAINING dataset that intentionally oversampled the minority class or undersampled the majority class to correct an imbalance: Should you do the same with the TEST dataset?

A

No. The test set should keep the real-world class distribution, so that the evaluation metrics reflect performance on data as it actually occurs.

5
Q

What underlies logit’s eventual classification of each data point into a category/class?

A

The model actually outputs an underlying PROBABILITY of the data point being in each category. It then classifies the point into the category with the highest probability.

6
Q

What is the basic syntax for running a logit?

A

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression()
logit.fit(X, y)
y_pred = logit.predict(X)
y_pred_probz = logit.predict_proba(X) # this is key to remember

7
Q

What’s the syntax for measuring the accuracy of a logit model? Is that a good measure of how good the model is?

A

from sklearn import metrics

logit.score(X, y) # from the trained model itself, OR
metrics.accuracy_score(y, y_pred) # doesn’t need the model, just the actual Ys vs predicted Ys

Accuracy is a decent measure when the classes are somewhat balanced. But if in the actual data class 0 is 90% of the cases, it becomes problematic: a model that always predicts class 0 is already 90% accurate. At that point, accuracy could be used as a first pass but nothing more; the model had better beat that 90% baseline.
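
A tiny illustration of why accuracy misleads on a 90/10 split:

```python
import numpy as np
from sklearn import metrics

# 90/10 split: a model that always predicts class 0 already scores 0.9 accuracy
y = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)  # "always predict the majority" baseline

print(metrics.accuracy_score(y, y_pred))  # 0.9 — useless model, high accuracy
```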

8
Q

Define Recall and Precision. Which kind of error does each of them minimize?

A

Note: Both recall and precision are defined for each category, not for all categories at once. The individual categories’ Recall and Precision can then be weighted-averaged.

Recall: Out of data points that really were Category 1, how many were labeled as Category 1? High Recall minimizes false negatives.

Precision: Out of data points that were labeled as Category 1, how many were really Category 1? High Precision minimizes false positives.
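
A small worked example (class 1 as the positive class; the counts are hand-checkable):

```python
from sklearn import metrics

# Tiny worked example
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
# TP=2 (true 1, labeled 1), FN=2 (true 1, labeled 0), FP=1 (true 0, labeled 1)

# Recall = TP / (TP + FN) = 2/4 = 0.5
# Precision = TP / (TP + FP) = 2/3
print(metrics.recall_score(y_true, y_pred))     # 0.5
print(metrics.precision_score(y_true, y_pred))  # 0.666...
```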

9
Q

What are the major measures that evaluate a classification model’s quality? What is their syntax? How do you see them all at once?

A

from sklearn import metrics

Accuracy: metrics.accuracy_score(y, y_pred)
Precision: metrics.precision_score(y, y_pred)
Recall: metrics.recall_score(y, y_pred)
F-score: metrics.f1_score(y, y_pred)

# All at once:
print(metrics.classification_report(y, y_pred)) # also prints "support", i.e. what the real N was in each category
10
Q

What is a confusion matrix? How do you print it?

A

If there are N categories to classify into, the confusion matrix is an NxN matrix, where rows are real labels & columns are predicted labels. Each cell holds one number: how many observations fell into that (real, predicted) combination. A good model will have high numbers on the diagonal and low numbers in the rest of the matrix.

# Syntax:
metrics.confusion_matrix(y, y_pred)
11
Q

The default threshold for a binary classification is p=.5. How would you change it to, say, .6? And why would you ever want to do that?

A

HOW:
There’s no built-in hyperparameter to do this, so you have to do it semi-manually via:
y_pred_probz = logit.predict_proba(X)
, then manually classify each point as 0 or 1 based on whether its class-1 probability is above or below .6.

WHY:
To manipulate the model’s Recall and/or Precision to be above a min acceptable level.
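
A minimal sketch of the semi-manual thresholding (toy data; the 0.6 cutoff is the example from the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
logit = LogisticRegression().fit(X, y)

probs = logit.predict_proba(X)[:, 1]     # P(class 1) for each point
y_pred_06 = (probs >= 0.6).astype(int)   # custom threshold of 0.6

# Raising the threshold can only shrink (or keep) the set of predicted 1s
print(y_pred_06.sum(), logit.predict(X).sum())
```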

12
Q

Let’s say there’s a requirement that your logit model’s Precision is at least .8 and Recall is at least .6. Your first model (with standard p=.5) has higher Precision than that but lower Recall than that. How could you try to improve the model?

A

Simple way:
Vary the p-threshold from 0.5 (semi-manually) and measure the Precision and Recall each time.

Fancier way:
Use metrics.precision_recall_curve() to auto-generate a (Precision, Recall) pair for many values of p.
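
A sketch of the fancier way (toy data; the exact threshold values depend on the fitted model):

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# One (precision, recall) pair per candidate threshold on P(class 1)
precisions, recalls, thresholds = metrics.precision_recall_curve(y, probs)

# Scan for thresholds meeting, e.g., Precision >= .8 and Recall >= .6
ok = [(p, r, t) for p, r, t in zip(precisions, recalls, thresholds)
      if p >= 0.8 and r >= 0.6]
print(len(ok))
```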

13
Q

What does metrics.auc(recalls, precisions) do?
Where does each (recall, precision) pair come from?
What is a “good” AUC score?

A

AUC is a way to evaluate a classification model (e.g., logit or random forest) in a holistic way, taking into account many possible values of the model’s hyperparameters (e.g., the p-threshold for logit).

It stands for Area Under Curve, where the curve is Precision on one axis and Recall on the other. The (recall, precision) pairs are generated in the first place by varying the model’s hyperparameters. For example, each p-threshold value in a logit model will generate its own such pair.

AUC varies between 0 and 1; higher is better. 1 is when both precision and recall are perfect.
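
A sketch of computing that area with metrics.auc (toy data):

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

precisions, recalls, _ = metrics.precision_recall_curve(y, probs)
pr_auc = metrics.auc(recalls, precisions)  # x-values (recalls) are monotonic
print(pr_auc)  # between 0 and 1; higher is better
```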

14
Q

What does Receiver Operating Characteristic (ROC) mean?

A

Careful: it is not the same curve as the Precision-vs-Recall one. The ROC curve plots the True Positive Rate (i.e., Recall) against the False Positive Rate as the decision threshold varies across iterations of the same model (e.g., different p-thresholds for a logit model). The area under that curve is the ROC AUC (metrics.roc_auc_score); the Precision-vs-Recall curve from the previous card has its own, separate AUC.
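
A sketch of the ROC curve and its AUC in sklearn (toy data; note the ROC curve uses True Positive Rate vs False Positive Rate):

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# ROC: True Positive Rate vs False Positive Rate, one point per threshold
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
roc_auc = metrics.roc_auc_score(y, probs)  # area under that curve

print(roc_auc)  # ~0.5 = random guessing, 1.0 = perfect
```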

15
Q

In sklearn, how does multilevel logit work relative to standard binary logit?

A

With the “one vs rest” setting (multi_class='ovr'), it is “bootstrapped” from binary logit: take one category at a time and run a binary logit generating probabilities of the sample being “it” vs “not it”. For example, if there are 4 categories, 4 binary logits are run behind the scenes. Each data point is then assigned the category with the highest of the 4 “it” probabilities. (Note: newer sklearn versions default to fitting a single multinomial/softmax model instead.)
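
A hand-rolled sketch of the one-vs-rest idea on 3 classes (not sklearn’s internal code, just the same logic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 3-class toy data
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

# Manual "one vs rest": one binary logit per class ("it" vs "not it"),
# then each point gets the class with the highest "it" probability
it_probs = np.column_stack([
    LogisticRegression().fit(X, y == k).predict_proba(X)[:, 1]
    for k in [0, 1, 2]
])
y_pred = it_probs.argmax(axis=1)
print(y_pred[:10])
```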

16
Q

Is Decision Tree / Random Forest a regression or a classification model?

A

Trick question: it can be either!

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

17
Q

What are the two broad subtypes of supervised learning?

And what are a few examples of each subtype?

A

Regression and classification.
Regression predicts a continuous outcome variable. Classification predicts a discrete/categorical outcome variable.

Regression examples: linear regression (can be polynomial), Ridge regression, Lasso regression, decision tree, random forest.

Classification examples: logistic regression, decision tree, random forest.

18
Q

In a nutshell, how do computers minimize the cost function without finding the exact min (aka a “closed-form solution”) mathematically in a complex multi-dimensional space?

A

Gradient Descent:

  1. Calculate the value of the error/cost function.
  2. Move a very small step in the (multidimensional) direction across all the predictors in which the cost decreases fastest, i.e., opposite the gradient.
  3. Repeat from step 1.
  4. Do steps 1-3 a fixed # of times, or until the error effectively stops decreasing.
  5. Note: reaching the GLOBAL min of the cost function is likely but not guaranteed!
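
The loop above can be sketched in plain numpy for a one-feature linear model (a toy illustration, not sklearn’s actual optimizer):

```python
import numpy as np

# Fit y = a*x + b by gradient descent on the mean squared error
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)

a, b = 0.0, 0.0   # start somewhere
lr = 0.1          # the "very small step"
for _ in range(500):
    err = (a * x + b) - y
    grad_a = 2 * np.mean(err * x)  # partial derivatives of the cost
    grad_b = 2 * np.mean(err)
    a -= lr * grad_a               # step opposite the gradient
    b -= lr * grad_b

print(a, b)  # a ends up near 3, b near 1
```
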
19
Q

What are the 3 steps involved in training a supervised ML model?

A
  1. Choose a FAMILY of models (e.g., linear regression)
  2. Choose an ERROR METRIC / cost function (often implicitly chosen by default in sklearn)
  3. ITERATE to find the specific model in the family of models that minimizes the cost function (this is where the power of computing comes in!). For example, a linear regression model with specific Betas is a specific instance of the linear regression family of models.