Prev Exam Questions Flashcards

1
Q

Question 1)
a. You are given a dataset on cancer detection. The dataset has 800,000 patient records, and 4% of the patients in the dataset are positive for cancer. You've built a classification model to detect cancer and achieved an accuracy of 96%. Why shouldn't you be happy with your model's performance?

b. What can you do about it?

A

a. Because the dataset is imbalanced, accuracy is a misleading measure here. If only 4% of the cases are actually positive, a model that predicts negative for every patient still reaches 96% accuracy, even though it misses 100% of the positive cases and has a recall of 0.
b. Use other metrics such as recall and precision. Recall tells us what percentage of the patients who are positive for cancer were correctly classified as positive, and precision tells us what percentage of the patients predicted positive actually have cancer.
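
The arithmetic behind this answer can be checked with a small sketch. It assumes a hypothetical model that predicts "negative" for every record; the labels are synthetic and built only from the 800,000 / 4% figures in the question, and the helper name `evaluate` is made up for illustration.

```python
# Synthetic labels matching the question: 800,000 records, 4% positive.
n_records = 800_000
n_positive = n_records * 4 // 100          # 32,000 positive patients
y_true = [1] * n_positive + [0] * (n_records - n_positive)

# Hypothetical "always negative" model (an assumption for illustration).
y_pred = [0] * n_records


def evaluate(y_true, y_pred):
    """Return (accuracy, recall, precision); a metric is None when undefined."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else None        # TP / actual positives
    precision = tp / (tp + fp) if (tp + fp) else None     # TP / predicted positives
    return accuracy, recall, precision


accuracy, recall, precision = evaluate(y_true, y_pred)
print("accuracy:", accuracy)
print("recall:", recall)
print("precision:", precision)
```

Accuracy comes out at 96% while recall is 0, and precision is undefined because the model never predicts a positive case. That is the situation the answer describes.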

2
Q

Question 4)
a. Pruning is one of the strategies implemented in the decision tree algorithm to reduce the risk of overfitting. Explain why pruning helps mitigate overfitting risk.

b. Describe any two methods used in the “post-pruning” strategy.

A

a. By default, a decision tree is allowed to grow to its full depth, at which point it can fit noise in the training data. Pruning keeps the final tree smaller so that it generalizes better. It is done either by stopping the growth of the tree early, typically through hyperparameters such as a maximum depth (pre-pruning), or by removing branches after the tree has been fully grown (post-pruning).

b. 1. Use a validation dataset to evaluate the effect of pruning nodes from the tree.
2. Build a tree using the training set, then apply a statistical test to estimate whether pruning or expanding a particular node is likely to produce an improvement beyond the training set.
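
The first method is often called reduced-error pruning. It can be sketched as below. This is only an illustration under stated assumptions: the `Node` class, the hand-built tree, and the tiny validation set are all hypothetical, and a real library would structure this differently. Each internal node is tentatively collapsed into a leaf, working bottom-up, and the collapse is kept only if validation accuracy does not drop.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prediction: int                  # majority class of training rows at this node
    feature: Optional[int] = None    # feature index tested here (None for a leaf)
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node, x):
    # Walk down the tree until a leaf is reached.
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


def accuracy(root, X, y):
    return sum(predict(root, x) == t for x, t in zip(X, y)) / len(y)


def reduced_error_prune(root, node, X_val, y_val):
    """Bottom-up: collapse a subtree into a leaf unless validation accuracy drops."""
    if node.is_leaf():
        return
    reduced_error_prune(root, node.left, X_val, y_val)
    reduced_error_prune(root, node.right, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = (node.feature, node.left, node.right)
    node.feature, node.left, node.right = None, None, None   # tentative prune
    if accuracy(root, X_val, y_val) < before:
        node.feature, node.left, node.right = saved           # undo the prune


# Hypothetical fully grown tree: the right subtree's extra split fits noise.
noisy_split = Node(prediction=1, feature=0, threshold=7.0,
                   left=Node(prediction=1), right=Node(prediction=0))
tree = Node(prediction=1, feature=0, threshold=5.0,
            left=Node(prediction=0), right=noisy_split)

# Hypothetical validation data (one feature per row).
X_val = [[1], [3], [6], [8], [9]]
y_val = [0, 0, 1, 1, 1]

acc_before = accuracy(tree, X_val, y_val)
reduced_error_prune(tree, tree, X_val, y_val)
acc_after = accuracy(tree, X_val, y_val)
print(acc_before, acc_after)
```

In this toy case the noisy split is removed because collapsing it improves validation accuracy, while the root split is kept because collapsing it would make validation accuracy worse. The second method swaps the validation check for a statistical test at each node; it is not shown here.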
