L6: Decision Trees Flashcards
What does it mean to create “purer” splits, i.e., lower entropy?
You strive to choose the splits that maximally reduce uncertainty about whether the outcome is positive or negative
Good splits create purer outcome partitions and reduce the number of splits necessary to determine the outcome
TRUE/FALSE
TRUE
Which measures are commonly used to measure overall impurity?
A) entropy
B) philanthropy
C) gini index
D) impurity
A) entropy
C) gini index
Entropy is a measure of impurity that quantifies the disorder or randomness in a set of data. The goal is to make the subsets resulting from node splits as pure as possible
TRUE/FALSE
TRUE
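For reference, the entropy of a node with class proportions p_i is H = -sum_i p_i * log2(p_i). A minimal NumPy sketch (an illustrative helper, not taken from any library):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# entropy is 0 for a pure node and 1 for a 50/50 binary node
entropy(np.array([1, 1, 1, 1]))
entropy(np.array([1, 1, 0, 0]))
```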
High entropy means that the dataset is pure
TRUE/FALSE
FALSE
High entropy implies high randomness and disorder in subsets resulting from the node splits - not good
Gini index measures impurity by calculating the probability of misclassifying an element in a dataset
TRUE/FALSE
TRUE
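For reference, the Gini index of a node is 1 - sum_i p_i^2, i.e., the chance of mislabelling a randomly drawn element if it were labelled according to the node's class distribution. A matching illustrative sketch:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# 0 for a pure node, 0.5 for a 50/50 binary node
```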
When building a decision tree, the algorithm aims to find the optimal feature and split point that MINIMIZES/ MAXIMIZES entropy and MINIMIZES/ MAXIMIZES gini. The split should organize the data in a way that makes the classes within each subset more HETEROGENEOUS/ HOMOGENEOUS
FILL IN THE BLANK
When building a decision tree, the algorithm aims to find the optimal feature and split point that MINIMIZES entropy and MINIMIZES gini. The split should organize the data in a way that makes the classes within each subset more HOMOGENEOUS
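A hypothetical sketch of that search for a single numeric feature: try each candidate threshold and keep the one with the lowest weighted child impurity (works with either the entropy or the gini helper sketched above; the function name is illustrative):

```python
import numpy as np

def best_split(x, y, impurity):
    """Return (threshold, score) minimizing the weighted impurity of the two children."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:                     # candidate thresholds
        left, right = y[x <= t], y[x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```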
What is the term for the final node in the tree?
Leaf node/ terminal node
What is the term for any node above the terminal/ leaf node?
Decision node
With many categorical variables, the number of possible splits becomes huge, and building the decision tree becomes very computationally intensive. How can this be solved?
It can be a good idea to group the “rare” categories into, e.g., an “other” category to reduce the number of splits
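For example, with pandas this could look roughly as follows (the column name and the 1% threshold are made up):

```python
import pandas as pd

def lump_rare(series, min_share=0.01, other_label="Other"):
    """Replace categories whose relative frequency is below min_share with one label."""
    share = series.value_counts(normalize=True)
    rare = share[share < min_share].index
    return series.where(~series.isin(rare), other_label)

# e.g. df["country"] = lump_rare(df["country"], min_share=0.01)
```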
Recursive partitioning is a step-by-step process of asking questions and making decisions to split a group into smaller and more homogeneous subsets, thus reducing entropy
TRUE/FALSE
TRUE
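A stripped-down sketch of that recursion, reusing the illustrative best_split and impurity helpers above (assumes a numeric feature matrix X and integer 0/1 labels y; stops when a node is pure or a maximum depth is reached):

```python
import numpy as np

def grow_tree(X, y, impurity, depth=0, max_depth=3):
    """Recursively split until the node is pure or max_depth is reached."""
    if impurity(y) == 0 or depth == max_depth:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}
    # pick the (feature, threshold) pair with the lowest weighted child impurity
    candidates = [(best_split(X[:, j], y, impurity), j) for j in range(X.shape[1])]
    (thr, score), feat = min(candidates, key=lambda c: c[0][1])
    if thr is None:                                  # no valid split left
        return {"leaf": True, "prediction": np.bincount(y).argmax()}
    mask = X[:, feat] <= thr
    return {
        "leaf": False, "feature": feat, "threshold": thr,
        "left":  grow_tree(X[mask], y[mask], impurity, depth + 1, max_depth),
        "right": grow_tree(X[~mask], y[~mask], impurity, depth + 1, max_depth),
    }
```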
What is behind the trade-off when deciding the tune-length (complexity) of a decision tree?
Positive: captures more complex relational effects between inputs and output
Negative: increases the computational intensity and makes the tree prone to overfitting
Overfitting produces poor predictive performance – past a certain point in tree complexity, the error rate on new data starts to increase
TRUE/FALSE
TRUE
In terms of pruning, what did you do in the project?
A) pre-set max depth
B) post-pruning strategy
C) CHAID: using chi-squared statistical test to limit tree growth: stop when improvement is not statistically significant
B) post-pruning strategy
We tried to set the tune-length to 10, 15 and 25, and chose 15, because it significantly improved the model’s predictive ability.
Meanwhile, 25 only improved it very slightly, and we determined that this improvement did not outweigh the increased complexity
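For illustration only (not the project's actual code), the same kind of complexity comparison can be sketched with scikit-learn's grid search over tree depth on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the booking data (illustration only)
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [10, 15, 25]},   # the complexity levels being compared
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```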
Why did you not use the complexity parameter and cross validation error to determine optimal tune length?
We tried to apply the same method as in exercise class, determining the tune length that minimizes the cross validation error, which has a corresponding complexity parameter.
But the function did not work for our model with interactions, and for model 3 the tune-length that minimized CV error was over 300, which didn't make sense.
Instead, we looked up the suitable tune-length based on our dataset size
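The complexity-parameter/CV-error approach mentioned here has a rough scikit-learn analogue in cost-complexity pruning. A generic sketch on synthetic data, assuming scikit-learn's DecisionTreeClassifier (not the workflow used in the project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# candidate complexity parameters (alphas) from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1][::5]      # drop the root-only alpha, thin the grid

cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]   # alpha minimizing CV error
print(best_alpha)
```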
What is the difference between classification goal and ranking goal in binary classification?
Classification goal: predicting whether a given observation is TRUE or FALSE
Ranking goal: which observations are among the top x% in terms of likelihood to be TRUE
There is a trade-off between recall and precision – it is not possible to maximize both metrics simultaneously. How so?
Maximizing one metric often comes at the expense of the other. There’s a trade-off between capturing more positive instances (high recall) and ensuring the identified positives are accurate (high precision).
The lower the threshold, the more false positives we will have, but false negatives will be scarcer.
In the cancellation prediction setting, the lower the threshold, the more true cancellations we will capture, but this is at the expense of catching more false positives - i.e., classifying more bookings as cancelled, which turn out not to be
TRUE/FALSE
TRUE
If a false negative prediction (predicted not to cancel but ends up cancelling) is more costly, the cut-off should be set lower (closer to 0)
TRUE/FALSE
TRUE
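A tiny sketch of that trade-off with made-up predicted probabilities (1 = cancellation): lowering the cut-off raises recall and lowers precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = cancelled
y_prob = np.array([0.9, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])  # made-up scores

for threshold in (0.5, 0.3):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          "recall:", recall_score(y_true, y_pred),
          "precision:", precision_score(y_true, y_pred))
# Lowering the cut-off catches more true cancellations (higher recall)
# at the cost of more false positives (lower precision).
```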
What does an AUC of 0.8 mean? Interpret this
AUC answers the question: what’s the probability that a randomly chosen positive case will receive a higher score from your classifier than a randomly chosen negative case?
With AUC = 0.8, a randomly chosen true cancellation observation will be ranked higher than a randomly chosen non-cancellation observation 80% of the time
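That pairwise-ranking interpretation can be checked numerically: on simulated scores, the fraction of (positive, negative) pairs where the positive scores higher matches roc_auc_score (illustrative sketch only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = y * 0.5 + rng.normal(size=500)            # noisy scores, higher for positives

pos, neg = scores[y == 1], scores[y == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])    # P(random positive > random negative)
print(round(pairwise, 3), round(roc_auc_score(y, scores), 3))  # the two values agree
```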
If we scan only 20% of records but our classifier finds 80% of all responders, we have a lift of?
A) 4
B) 8
C) 6
Interpret the result
Lift = % responders accumulated / % all records scanned
A) 4
0.8 / 0.2 = 4
Our classifier finds responders at 4x the rate of a random-guess model
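A small sketch of the same calculation from model scores (hypothetical helper; names are made up):

```python
import numpy as np

def lift_at(y_true, y_score, depth=0.2):
    """Lift = (share of all responders captured in the top `depth` of records) / depth."""
    order = np.argsort(y_score)[::-1]                  # highest scores first
    top = y_true[order][: int(len(y_true) * depth)]
    return (top.sum() / y_true.sum()) / depth

# e.g. 80% of responders concentrated in the top 20% of scores -> lift of 4
```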
Bagging and boosting are two ensemble learning techniques for decision trees. Which of the following statements are NOT true for bagging?
A) multiple instances of the base model (tree) are trained on different subsets of training data
B) subsets are created by random sampling
C) it helps reduce overfitting and variance of the model, increasing robustness and generalisability
D) combines multiple weak learners sequentially, with each tree focusing on correcting the errors of its predecessor
E) trees are trained independently
WRONG:
D) combines multiple weak learners sequentially, with each tree focusing on correcting the errors of its predecessor – THIS IS THE CASE FOR BOOSTING
TRUE ABOUT BAGGING:
A) multiple instances of the base model (tree) are trained on different subsets of training data
B) subsets are created by random sampling
C) it helps reduce overfitting and variance of the model, increasing robustness and generalisability
E) trees are trained independently
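In scikit-learn terms this corresponds to, e.g., BaggingClassifier, whose default base learner is a decision tree; a generic sketch on synthetic data (illustration, not project code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

# 100 trees trained independently on bootstrap samples; predictions are combined by voting
bagged = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=1)
print(cross_val_score(bagged, X, y, cv=5).mean())
```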
Bagging and boosting are two ensemble learning techniques for decision trees. Which of the following statements are NOT true for boosting?
A) combines multiple weak learners (shallow trees) sequentially, each tree focusing on correcting the error of its predecessor
B) trees are built sequentially, and each subsequent tree gives more weight to instances misclassified by the previous ones
C) the final prediction is a weighted sum of the individual tree predictions
D) it aims at improving accuracy by emphasising instances that are challenging to classify
E) The focus on errors helps the ensemble adapt and perform well on difficult-to-learn patterns in the data.
None of the statements is untrue; all of them hold for boosting:
A) combines multiple weak learners (shallow trees) sequentially, each tree focusing on correcting the error of its predecessor
B) trees are built sequentially, and each subsequent tree gives more weight to instances misclassified by the previous ones
C) the final prediction is a weighted sum of the individual tree predictions
D) it aims at improving accuracy by emphasising instances that are challenging to classify
E) The focus on errors helps the ensemble adapt and perform well on difficult-to-learn patterns in the data.
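A matching sketch with AdaBoost (scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 tree); again synthetic data and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

# Shallow trees added sequentially; misclassified instances get more weight,
# and the final prediction is a weighted vote of the individual trees.
boosted = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=1)
print(cross_val_score(boosted, X, y, cv=5).mean())
```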
Bagging aims to reduce overfitting and increase stability by training trees independently on different subsets.
Boosting focuses on improving accuracy by sequentially training trees, with each tree emphasizing the correction of errors made by the previous ones.
TRUE/FALSE
TRUE