Lecture 6: Decision Trees Flashcards
What does it mean to create “purer” splits, i.e., lower entropy?
You strive to choose the splits that maximally reduce uncertainty about whether the outcome is positive or negative
Good splits create purer outcome partitions and reduce the number of splits necessary to determine the outcome
TRUE/FALSE
TRUE
Which measures are commonly used to measure overall impurity?
A) entropy
B) philanthropy
C) gini index
D) impurity
A) entropy
C) gini index
Entropy is a measure of impurity that quantifies the disorder or randomness in a set of data. The goal is to make the subsets resulting from node splits as pure as possible
TRUE/FALSE
TRUE
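As a concrete illustration of the card above, here is a minimal Python sketch (an illustrative helper, not from the course material) of entropy as an impurity measure: a pure set scores 0, a 50/50 binary mix scores the maximum of 1.

```python
import math

def entropy(labels):
    """Shannon entropy of a label set: 0 for a pure set,
    1 (for a binary outcome) at a 50/50 mix."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([1, 1, 1, 1]))   # pure set: entropy 0
print(entropy([1, 1, 0, 0]))   # maximally mixed binary set: entropy 1
```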
High entropy means that the dataset is pure
TRUE/FALSE
FALSE
High entropy implies high randomness and disorder in subsets resulting from the node splits - not good
Gini index measures impurity by calculating the probability of misclassifying an element in a dataset
TRUE/FALSE
TRUE
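A matching sketch for the Gini index (again illustrative, not course code): it is the probability of misclassifying an element drawn at random and labeled according to the class distribution.

```python
def gini(labels):
    """Gini impurity: expected misclassification probability when an element
    is labeled at random according to the class distribution in the set."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([1, 1, 1, 1]))   # pure set -> 0.0
print(gini([1, 1, 0, 0]))   # 50/50 binary mix -> 0.5 (the binary maximum)
```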
When building a decision tree, the algorithm aims to find the optimal feature and split point that MINIMIZES/ MAXIMIZES entropy and MINIMIZES/ MAXIMIZES gini. The split should organize the data in a way that makes the classes within each subset more HETEROGENEOUS/ HOMOGENEOUS
FILL IN THE BLANK
When building a decision tree, the algorithm aims to find the optimal feature and split point that MINIMIZES entropy and MINIMIZES gini. The split should organize the data in a way that makes the classes within each subset more HOMOGENEOUS
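The split search described on this card can be sketched as follows (an illustrative one-feature example, assuming a numeric feature and Gini as the criterion): try each candidate threshold and keep the one minimizing the weighted impurity of the two subsets.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Try each value of one numeric feature as a threshold and return the
    one minimizing the weighted Gini impurity of the resulting subsets."""
    best_t, best_imp = None, float("inf")
    for t in sorted(set(xs))[:-1]:          # splitting above the max is pointless
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        n = len(ys)
        imp = len(left) / n * gini(left) + len(right) / n * gini(right)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

# A perfectly separable toy feature: x <= 2 -> class 0, x > 2 -> class 1
t, imp = best_split([1, 2, 3, 4], [0, 0, 1, 1])
print(t, imp)   # 2 0.0 -- both subsets become fully homogeneous
```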
What is the term for the final node in the tree?
Leaf node/ terminal node
What is the term for any node above the terminal/ leaf node?
Decision node
With many categorical variables, the number of possible splits becomes huge, and building the decision tree becomes very computationally intensive. How can this be solved?
It can be a good idea to group the “rare” categories into e.g. an “other” category to minimize the number of splits
Recursive partitioning is a step-by-step process of asking questions and making decisions to split a group into smaller and more homogeneous subsets, thus reducing entropy
TRUE/FALSE
TRUE
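Recursive partitioning can be sketched in a few lines (an illustrative toy, assuming one numeric feature and a Gini-based split search): split, recurse into each subset until it is pure or a depth limit is hit, and let each leaf predict its majority class.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    best_t, best_imp = None, float("inf")
    for t in sorted(set(xs))[:-1]:
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        imp = len(left) / len(ys) * gini(left) + len(right) / len(ys) * gini(right)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t

def grow(xs, ys, depth=0, max_depth=3):
    """Recursive partitioning: split until the subset is pure (impurity 0)
    or a depth limit is reached; leaves predict their majority class."""
    if depth == max_depth or len(set(ys)) == 1:
        return max(set(ys), key=list(ys).count)     # leaf / terminal node
    t = best_split(xs, ys)
    if t is None:
        return max(set(ys), key=list(ys).count)
    left  = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {"x <=": t,                               # decision node
            "yes": grow(*zip(*left), depth + 1, max_depth),
            "no":  grow(*zip(*right), depth + 1, max_depth)}

tree = grow([1, 2, 3, 4], [0, 0, 1, 1])
print(tree)   # {'x <=': 2, 'yes': 0, 'no': 1}
```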
What is behind the trade-off when deciding the tune-length (complexity) of a decision tree?
Positive: captures more complex relational effects between inputs and output
Negative: increases the computational intensity and makes the tree prone to overfitting
Overfitting produces poor predictive performance – past a certain point in tree complexity, the error rate on new data starts to increase
TRUE/FALSE
TRUE
In terms of pruning, what did you do in the project?
A) pre-set max depth
B) post-pruning strategy
C) CHAID: using chi-squared statistical test to limit tree growth: stop when improvement is not statistically significant
B) post-pruning strategy
We tried tune-lengths of 10, 15 and 25, and chose 15 because it significantly improved the model’s predictive ability.
Meanwhile, 25 only improved it very slightly, and we determined that this improvement did not outweigh the increased complexity
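The project itself used caret's tune-length in R; as an illustrative sketch only (not the project's code), the same post-pruning idea can be shown with scikit-learn's cost-complexity pruning, where a larger `ccp_alpha` prunes the fully grown tree harder.

```python
import random
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: a fully grown tree will overfit it badly
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [int(x[0] + x[1] + random.gauss(0, 0.3) > 1.0) for x in X]

full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# A mid-range alpha is picked here purely for illustration; in practice you
# would choose the alpha (or tune-length) that minimizes cross-validation error
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

print(full.tree_.node_count, pruned.tree_.node_count)  # pruned tree is much smaller
```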
Why did you not use the complexity parameter and cross validation error to determine optimal tune length?
We tried to apply the same method as in the exercise class: determining the tune-length that minimizes the cross-validation error, which has a corresponding complexity parameter.
But the function did not work for our model with interactions, and for model 3 the tune-length that minimized CV error was over 300, which didn’t make sense.
Instead, we looked up the suitable tune-length based on our dataset size
What is the difference between classification goal and ranking goal in binary classification?
Classification goal: whether a given observation is TRUE
Ranking goal: which observations are among the top x% in terms of likelihood to be TRUE
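The two goals on this card can be contrasted on the same predictions (hypothetical probabilities, invented for illustration): classification applies a fixed cut-off to each observation, while ranking only asks which observations sit at the top.

```python
# Hypothetical predicted cancellation probabilities for six bookings
probs = {"b1": 0.92, "b2": 0.75, "b3": 0.60, "b4": 0.40, "b5": 0.30, "b6": 0.10}

# Classification goal: is each observation predicted TRUE at a fixed cut-off?
cutoff = 0.5
classified = {b: p >= cutoff for b, p in probs.items()}

# Ranking goal: which observations are in the top x% (here the top 1/3)
# in terms of likelihood to be TRUE?
k = len(probs) // 3
top = sorted(probs, key=probs.get, reverse=True)[:k]

print(classified)   # b1-b3 True, b4-b6 False
print(top)          # ['b1', 'b2']
```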
There is a trade-off between recall and precision – it is not possible to maximize both metrics simultaneously. How so?
Maximizing one metric often comes at the expense of the other. There’s a trade-off between capturing more positive instances (high recall) and ensuring the identified positives are accurate (high precision).
The lower the threshold, the more false positives we will have, but false negatives will be scarcer.
In the cancellation prediction setting, the lower the threshold, the more true cancellations we will capture, but this is at the expense of catching more false positives - i.e., classifying more bookings as cancelled, which turn out not to be
TRUE/FALSE
TRUE
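The trade-off on this card can be made concrete with toy numbers (invented for illustration): lowering the threshold raises recall but lowers precision.

```python
def confusion(probs, actual, threshold):
    """Count TP/FP/FN when predicted probabilities are cut at `threshold`."""
    tp = sum(p >= threshold and a for p, a in zip(probs, actual))
    fp = sum(p >= threshold and not a for p, a in zip(probs, actual))
    fn = sum(p < threshold and a for p, a in zip(probs, actual))
    return tp, fp, fn

probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
actual = [1,   1,   0,   1,   0,   0]    # 1 = booking actually cancelled

for t in (0.7, 0.3):
    tp, fp, fn = confusion(probs, actual, t)
    recall    = tp / (tp + fn)
    precision = tp / (tp + fp)
    print(t, recall, precision)
# At the low threshold every true cancellation is caught (recall 1.0),
# but more non-cancellations are flagged too, so precision drops.
```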
If a false negative prediction (predicted not to cancel but ends up cancelling) is more costly, the cut-off should be set lower (closer to 0)
TRUE/FALSE
TRUE