Probability, Prediction, and Classification Flashcards
What is the difference between a probability prediction and a classification prediction?
Probability: predicting the probability of y = 1 for each observation
Classification: predicting whether yhat = 0 or yhat = 1 for each observation
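A minimal sketch of the distinction in Python, assuming a hypothetical vector of predicted probabilities p_hat and an illustrative threshold of 0.5:

```python
import numpy as np

# Probability prediction: a predicted probability of y = 1 for each observation
# (hypothetical values for illustration)
p_hat = np.array([0.10, 0.35, 0.62, 0.80, 0.45])

# Classification prediction: convert probabilities into 0/1 labels
# using a chosen threshold (0.5 here, but any threshold could be used)
threshold = 0.5
y_hat = (p_hat >= threshold).astype(int)

print(y_hat)  # [0 0 1 1 0]
```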
What are the two types of errors in the classification process?
False positives, false negatives
What is the confusion table?
It shows the number of observations by their predicted class and actual class. The quadrants are:
             | Actual N     | Actual P     |
Classified N | TN           | FN           | Total classified N
Classified P | FP           | TP           | Total classified P
             | Total true N | Total true P | All observations
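A sketch of computing the four cells of the confusion table, assuming illustrative arrays y (actual classes) and y_hat (predicted classes); sklearn.metrics.confusion_matrix could be used instead of the manual counts:

```python
import numpy as np

y     = np.array([0, 0, 1, 1, 0, 1])  # actual classes (illustrative)
y_hat = np.array([0, 1, 1, 0, 0, 1])  # predicted classes (illustrative)

TP = np.sum((y_hat == 1) & (y == 1))  # true positives
TN = np.sum((y_hat == 0) & (y == 0))  # true negatives
FP = np.sum((y_hat == 1) & (y == 0))  # false positives
FN = np.sum((y_hat == 0) & (y == 1))  # false negatives

print(f"{TN=} {FN=} {FP=} {TP=}")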
What are the three measures of classification?
1) Accuracy = (TP + TN) / N : The proportion of correctly classified observations (N is the total number of observations)
2) Sensitivity = TP / (TP + FN) : The proportion of true positives among all actual positives
3) Specificity = TN / (TN + FP) : The proportion of true negatives among all actual negatives
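A sketch of the three measures, using hypothetical confusion-table counts for illustration:

```python
# Illustrative counts from a confusion table
TP, TN, FP, FN = 40, 930, 20, 10
N = TP + TN + FP + FN            # total number of observations

accuracy    = (TP + TN) / N      # proportion of correctly classified observations
sensitivity = TP / (TP + FN)     # proportion of true positives among actual positives
specificity = TN / (TN + FP)     # proportion of true negatives among actual negatives

print(accuracy, sensitivity, specificity)
```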
T/F: There is a trade-off between making false positive and false negative errors.
True; the trade-off can be expressed in terms of sensitivity and specificity
What does the ROC curve show?
For each possible classification threshold, it plots the proportion of false positives among all y = 0 observations (1 - specificity, horizontal axis) against the proportion of true positives among all y = 1 observations (sensitivity, vertical axis)
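A sketch of tracing the ROC curve with scikit-learn, assuming illustrative actual labels y and predicted probabilities p_hat:

```python
import numpy as np
from sklearn.metrics import roc_curve

y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8,
                  0.2, 0.7, 0.55, 0.9])           # predicted probabilities (illustrative)

# fpr = 1 - specificity, tpr = sensitivity, one point per threshold
fpr, tpr, thresholds = roc_curve(y, p_hat)
print(np.column_stack([thresholds, fpr, tpr]))
```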
T/F: The ROC curve of a completely random probability prediction is the 45 degree line.
True
T/F: The lower the area under the ROC curve, the better our predictions are.
False; a higher area under the ROC curve is better. We want a value higher than the 0.5 of a completely random prediction
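A sketch of computing the area under the ROC curve (AUC) with scikit-learn, on the same illustrative data as above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8,
                  0.2, 0.7, 0.55, 0.9])           # predicted probabilities (illustrative)

auc = roc_auc_score(y, p_hat)
print(auc)  # 1.0 would be perfect; 0.5 corresponds to a completely random prediction
```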
What are the two ways of finding the optimal classification threshold?
1) Use the formula loss(FP) / (loss(FP) + loss(FN)), based on the relative losses of the two errors.
2) Use a search algorithm that selects the probability model and the optimal classification threshold together.
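A sketch of both approaches, assuming hypothetical loss values and a simple grid search over candidate thresholds (in practice the search would be run inside cross-validation, together with model selection):

```python
import numpy as np

# 1) Formula based on the relative losses of the two errors (hypothetical values)
loss_FP, loss_FN = 1.0, 4.0
threshold_formula = loss_FP / (loss_FP + loss_FN)   # = 0.2 here

# 2) Grid search: pick the threshold with the smallest total loss
y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])          # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8,
                  0.2, 0.7, 0.55, 0.9])             # predicted probabilities (illustrative)

best_threshold, best_loss = None, np.inf
for t in np.linspace(0.01, 0.99, 99):
    y_hat = (p_hat >= t).astype(int)
    FP = np.sum((y_hat == 1) & (y == 0))
    FN = np.sum((y_hat == 0) & (y == 1))
    total_loss = loss_FP * FP + loss_FN * FN
    if total_loss < best_loss:
        best_threshold, best_loss = t, total_loss

print(threshold_formula, best_threshold, best_loss)
```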
Fill in the blanks.
Having a higher threshold leads to _____ and vice versa
Fewer observations classified as positive (e.g., fewer predicted exits), so fewer FP but more FN.
Define class imbalance
The event being studied is very rare or very frequent
e.g., fraud or sports injuries
What are the consequences of having class imbalance?
Cross-validation can be less effective at avoiding overfitting.
The usual measures of fit can be less good at differentiating between models.
In other words, model performance can be poor, and the model fitting and selection setup is not ideal.
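A small illustration of why the usual measures can mislead under class imbalance: with a hypothetical 1% positive rate, a "model" that never predicts the rare class still reaches about 99% accuracy while its sensitivity is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)   # ~1% rare positives (illustrative)
y_hat = np.zeros_like(y)                      # always predicts the frequent class

accuracy = np.mean(y_hat == y)                                   # ~0.99, looks great
sensitivity = np.sum((y_hat == 1) & (y == 1)) / np.sum(y == 1)   # 0.0, useless for the rare class
print(accuracy, sensitivity)
```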
What do you do when you have a class imbalance?
1) Need to know when it's happening
2) May need to rebalance the sample (downsampling or over-sampling)
3) Or use algorithms designed to handle imbalance
What is the difference between downsampling and over-sampling?
Downsampling: randomly dropping observations from the frequent class
Over-sampling: adding more observations of the rare class (e.g., duplicating existing rare-class observations or collecting more data)
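A sketch of both rebalancing approaches with pandas, assuming an illustrative DataFrame df with a binary column y; libraries such as imbalanced-learn offer more refined versions:

```python
import pandas as pd

# Illustrative imbalanced data: 2 rare positives, 8 frequent negatives
df = pd.DataFrame({"y": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                   "x": range(10)})
rare     = df[df["y"] == 1]
frequent = df[df["y"] == 0]

# Downsampling: randomly drop observations from the frequent class
downsampled = pd.concat([rare,
                         frequent.sample(n=len(rare), random_state=1)])

# Over-sampling: resample the rare class with replacement (duplicates observations)
oversampled = pd.concat([frequent,
                         rare.sample(n=len(frequent), replace=True, random_state=1)])

print(downsampled["y"].value_counts(), oversampled["y"].value_counts())
```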
When you consider higher and higher thresholds of predicted probabilities for classification, the number of false positives and false negatives changes. How and why?
Higher thresholds make it harder for an observation to be classified as "positive", which increases the number of FN. Conversely, lower thresholds make it easier to be classified as "positive", which increases the number of FP. This is why finding the optimal threshold is important: we want the threshold that minimizes the loss from false classifications.
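A sketch of the trade-off, sweeping over a few thresholds and counting FP and FN at each one (y and p_hat are illustrative):

```python
import numpy as np

y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])                 # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # predicted probabilities (illustrative)

for t in [0.2, 0.4, 0.6, 0.8]:
    y_hat = (p_hat >= t).astype(int)
    FP = np.sum((y_hat == 1) & (y == 0))
    FN = np.sum((y_hat == 0) & (y == 1))
    print(f"threshold={t:.1f}  FP={FP}  FN={FN}")  # FP falls and FN rises as the threshold increases
```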