Probability, Prediction, and Classification Flashcards
What is the difference between a probability prediction and a classification prediction?
Probability: predicting the probability of y = 1 for each observation
Classification: predicting whether yhat = 0 or yhat = 1 for each observation
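A minimal sketch of the distinction in Python, assuming a hypothetical vector of predicted probabilities p_hat and an illustrative threshold of 0.5:

```python
import numpy as np

# Probability prediction: a predicted probability of y = 1 for each observation
# (hypothetical values for illustration)
p_hat = np.array([0.10, 0.35, 0.62, 0.80, 0.45])

# Classification prediction: convert probabilities into 0/1 labels
# using a chosen threshold (0.5 here, but any threshold could be used)
threshold = 0.5
y_hat = (p_hat >= threshold).astype(int)

print(y_hat)  # [0 0 1 1 0]
```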
What are the two types of errors in the classification process?
False positives, false negatives
What is the confusion table?
It shows the number of observations by their predicted class and actual class. The quadrants are:
             | Actual N     | Actual P     |
Classified N | TN           | FN           | Total classified N
Classified P | FP           | TP           | Total classified P
             | Total true N | Total true P | All observations
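A sketch of computing the four cells of the confusion table, assuming illustrative arrays y (actual classes) and y_hat (predicted classes); sklearn.metrics.confusion_matrix could be used instead of the manual counts:

```python
import numpy as np

y     = np.array([0, 0, 1, 1, 0, 1])  # actual classes (illustrative)
y_hat = np.array([0, 1, 1, 0, 0, 1])  # predicted classes (illustrative)

TP = np.sum((y_hat == 1) & (y == 1))  # true positives
TN = np.sum((y_hat == 0) & (y == 0))  # true negatives
FP = np.sum((y_hat == 1) & (y == 0))  # false positives
FN = np.sum((y_hat == 0) & (y == 1))  # false negatives

print(f"{TN=} {FN=} {FP=} {TP=}")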
What are the three measures of classification?
1) Accuracy = (TP + TN) / N : The proportion of correctly classified observations (N is the total number of observations)
2) Sensitivity = TP / (TP + FN) : The proportion of true positives among all actual positives
3) Specificity = TN / (TN + FP) : The proportion of true negatives among all actual negatives
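A sketch of the three measures, using hypothetical confusion-table counts for illustration:

```python
# Illustrative counts from a confusion table
TP, TN, FP, FN = 40, 930, 20, 10
N = TP + TN + FP + FN            # total number of observations

accuracy    = (TP + TN) / N      # proportion of correctly classified observations
sensitivity = TP / (TP + FN)     # proportion of true positives among actual positives
specificity = TN / (TN + FP)     # proportion of true negatives among actual negatives

print(accuracy, sensitivity, specificity)
```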
T/F: There is a trade-off between making false positive and false negative errors.
True; the trade-off can be expressed in terms of sensitivity and specificity
What does the ROC curve show?
For each possible classification threshold, it plots the proportion of false positives among all y = 0 observations (1 - specificity, horizontal axis) against the proportion of true positives among all y = 1 observations (sensitivity, vertical axis)
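A sketch of tracing the ROC curve with scikit-learn, assuming illustrative actual labels y and predicted probabilities p_hat:

```python
import numpy as np
from sklearn.metrics import roc_curve

y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8,
                  0.2, 0.7, 0.55, 0.9])           # predicted probabilities (illustrative)

# fpr = 1 - specificity, tpr = sensitivity, one point per threshold
fpr, tpr, thresholds = roc_curve(y, p_hat)
print(np.column_stack([thresholds, fpr, tpr]))
```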
T/F: The ROC curve of a completely random probability prediction is the 45 degree line.
True
T/F: The lower the area under the ROC curve, the better our predictions are.
False; a higher area under the ROC curve is better. We want a value higher than the 0.5 of a completely random prediction
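A sketch of computing the area under the ROC curve (AUC) with scikit-learn, on the same illustrative data as above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8,
                  0.2, 0.7, 0.55, 0.9])           # predicted probabilities (illustrative)

auc = roc_auc_score(y, p_hat)
print(auc)  # 1.0 would be perfect; 0.5 corresponds to a completely random prediction
```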
What are the two ways of finding the optimal classification threshold?
1) Use the formula loss(FP) / (loss(FP) + loss(FN)), based on the relative losses of the two errors.
2) Use a search algorithm that selects the probability model and the optimal classification threshold together.
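A sketch of both approaches, assuming hypothetical loss values and a simple grid search over candidate thresholds (in practice the search would be run inside cross-validation, together with model selection):

```python
import numpy as np

# 1) Formula based on the relative losses of the two errors (hypothetical values)
loss_FP, loss_FN = 1.0, 4.0
threshold_formula = loss_FP / (loss_FP + loss_FN)   # = 0.2 here

# 2) Grid search: pick the threshold with the smallest total loss
y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])          # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8,
                  0.2, 0.7, 0.55, 0.9])             # predicted probabilities (illustrative)

best_threshold, best_loss = None, np.inf
for t in np.linspace(0.01, 0.99, 99):
    y_hat = (p_hat >= t).astype(int)
    FP = np.sum((y_hat == 1) & (y == 0))
    FN = np.sum((y_hat == 0) & (y == 1))
    total_loss = loss_FP * FP + loss_FN * FN
    if total_loss < best_loss:
        best_threshold, best_loss = t, total_loss

print(threshold_formula, best_threshold, best_loss)
```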
Fill in the blanks.
Having a higher threshold leads to _____ and vice versa
Fewer observations classified as positive (e.g., fewer predicted exits), so fewer FP but more FN.
Define class imbalance
The event being studied is very rare or very frequent
e.g., fraud or sports injuries
What are the consequences of having class imbalance?
Cross-validation can be less effective at avoiding overfitting.
The usual measures of fit can be less good at differentiating between models.
In other words, model performance can be poor, and the model fitting and selection setup is not ideal.
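A small illustration of why the usual measures can mislead under class imbalance: with a hypothetical 1% positive rate, a "model" that never predicts the rare class still reaches about 99% accuracy while its sensitivity is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)   # ~1% rare positives (illustrative)
y_hat = np.zeros_like(y)                      # always predicts the frequent class

accuracy = np.mean(y_hat == y)                                   # ~0.99, looks great
sensitivity = np.sum((y_hat == 1) & (y == 1)) / np.sum(y == 1)   # 0.0, useless for the rare class
print(accuracy, sensitivity)
```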
What do you do when you have a class imbalance?
1) Need to know when it's happening
2) May need to rebalance the sample (downsampling or over-sampling)
3) Or use algorithms designed to handle imbalance
What is the difference between downsampling and over-sampling?
Downsampling: randomly dropping observations from the frequent class
Over-sampling: adding more observations of the rare class (e.g., duplicating existing rare-class observations or collecting more data)
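A sketch of both rebalancing approaches with pandas, assuming an illustrative DataFrame df with a binary column y; libraries such as imbalanced-learn offer more refined versions:

```python
import pandas as pd

# Illustrative imbalanced data: 2 rare positives, 8 frequent negatives
df = pd.DataFrame({"y": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                   "x": range(10)})
rare     = df[df["y"] == 1]
frequent = df[df["y"] == 0]

# Downsampling: randomly drop observations from the frequent class
downsampled = pd.concat([rare,
                         frequent.sample(n=len(rare), random_state=1)])

# Over-sampling: resample the rare class with replacement (duplicates observations)
oversampled = pd.concat([frequent,
                         rare.sample(n=len(frequent), replace=True, random_state=1)])

print(downsampled["y"].value_counts(), oversampled["y"].value_counts())
```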
When you consider higher and higher thresholds of predicted probabilities for classification, the number of false positives and false negatives changes. How and why?
Higher thresholds make it harder for an observation to be classified as "positive", which increases the number of FN. Conversely, lower thresholds make it easier to be classified as "positive", which increases the number of FP. This is why finding the optimal threshold is important: we want the threshold that minimizes the loss from false classifications.
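A sketch of the trade-off, sweeping over a few thresholds and counting FP and FN at each one (y and p_hat are illustrative):

```python
import numpy as np

y     = np.array([0, 0, 1, 1, 0, 1, 0, 1])                 # actual classes (illustrative)
p_hat = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # predicted probabilities (illustrative)

for t in [0.2, 0.4, 0.6, 0.8]:
    y_hat = (p_hat >= t).astype(int)
    FP = np.sum((y_hat == 1) & (y == 0))
    FN = np.sum((y_hat == 0) & (y == 1))
    print(f"threshold={t:.1f}  FP={FP}  FN={FN}")  # FP falls and FN rises as the threshold increases
```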