L6: Decision Trees Flashcards
What does it mean to create “purer” splits, i.e., lower entropy?
You strive to choose the splits that maximally reduce uncertainty about whether the outcome is positive or negative
Good splits create purer outcome partitions and reduce the number of splits necessary to determine the outcome
TRUE/FALSE
TRUE
Which measures are commonly used to measure overall impurity?
A) entropy
B) philanthropy
C) gini index
D) impurity
A) entropy
C) gini index
Entropy is a measure of impurity that quantifies the disorder or randomness in a set of data. The goal is to make the subsets resulting from node splits as pure as possible
TRUE/FALSE
TRUE
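For reference, a minimal Python sketch (not from the course material) of how entropy is computed from class proportions, H = -sum(p_i * log2(p_i)):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a set of class labels: -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([1, 1, 1, 1]))  # 0.0 -> perfectly pure subset
print(entropy([1, 1, 0, 0]))  # 1.0 -> maximally impure (for two classes)
```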
High entropy means that the dataset is pure
TRUE/FALSE
FALSE
High entropy implies high randomness and disorder in subsets resulting from the node splits - not good
Gini index measures impurity by calculating the probability of misclassifying an element in a dataset
TRUE/FALSE
TRUE
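Likewise, a minimal illustrative sketch of the Gini index, 1 - sum(p_i^2), i.e., the probability of misclassifying a randomly drawn element if it were labelled according to the subset's class distribution:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([1, 1, 1, 1]))  # 0.0 -> pure subset
print(gini([1, 1, 0, 0]))  # 0.5 -> maximally impure for two classes
```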
When building a decision tree, the algorithm aims to find the optimal feature and split point that MINIMIZES/ MAXIMIZES entropy and MINIMIZES/ MAXIMIZES gini. The split should organize the data in a way that makes the classes within each subset more HETEROGENEOUS/ HOMOGENEOUS
FILL IN THE BLANK
When building a decision tree, the algorithm aims to find the optimal feature and split point that MINIMIZES entropy and MINIMIZES gini. The split should organize the data in a way that makes the classes within each subset more HOMOGENEOUS
What is the term for the final node in the tree?
Leaf node/ terminal node
What is the term for any node above the terminal/ leaf node?
Decision node
With many categorical variables, the number of possible splits becomes huge, and building the decision tree becomes very computationally intensive. How can this be solved?
It can be a good idea to group the “rare” categories into, e.g., an “other” category to minimize the number of splits
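A small pandas sketch of this grouping step (the column name and frequency cut-off are made up, not from the project):

```python
import pandas as pd

# Hypothetical example: collapse categories below an arbitrary frequency
# threshold into a single "other" level before building the tree.
df = pd.DataFrame({"country": ["PRT", "GBR", "FRA", "PRT", "ESP",
                               "PRT", "GBR", "DEU", "PRT", "FRA"]})
freq = df["country"].value_counts(normalize=True)
rare = freq[freq < 0.15].index                      # ESP and DEU here
df["country"] = df["country"].where(~df["country"].isin(rare), "other")
print(df["country"].value_counts())
```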
Recursive partitioning is a step-by-step process of asking questions and making decisions to split a group into smaller and more homogeneous subsets, thus reducing entropy
TRUE/FALSE
TRUE
What is behind the trade-off when deciding the tune-length (complexity) of a decision tree?
Positive: captures more complex relational effects between inputs and output
Negative: increases the computational intensity and makes the tree prone to overfitting
Overfitting produces poor predictive performance – past a certain point in tree complexity, the error rate on new data starts to increase
TRUE/FALSE
TRUE
In terms of pruning, what did you do in the project?
A) pre-set max depth
B) post-pruning strategy
C) CHAID: using a chi-squared statistical test to limit tree growth: stop splitting when the improvement is not statistically significant
B) post-pruning strategy
We tried setting the tune-length to 10, 15 and 25, and chose 15 because it significantly improved the model’s predictive ability.
Meanwhile, 25 only improved it very slightly, and we determined that this improvement did not outweigh the increased complexity
Why did you not use the complexity parameter and cross validation error to determine optimal tune length?
We tried to apply the same method as in the exercise class, determining the tune-length that minimizes the cross-validation error, which has a corresponding complexity parameter.
But the function did not work for our model with interactions, and for model 3, the tune-length that minimized the CV error was over 300, which didn’t make sense.
Instead, we looked up a suitable tune-length based on our dataset size
What is the difference between classification goal and ranking goal in binary classification?
Classification goal: deciding whether a given observation is TRUE
Ranking goal: identifying which observations are among the top x% in terms of likelihood of being TRUE
There is a trade-off between recall and precision – it is not possible to maximize both metrics simultaneously. How so?
Maximizing one metric often comes at the expense of the other. There’s a trade-off between capturing more positive instances (high recall) and ensuring the identified positives are accurate (high precision).
The lower the threshold, the more false positives we will have, but false negatives will be scarcer.
In the cancellation prediction setting, the lower the threshold, the more true cancellations we will capture, but this is at the expense of catching more false positives - i.e., classifying more bookings as cancelled, which turn out not to be
TRUE/FALSE
TRUE
If a false negative prediction (predicted not to cancel but ends up cancelling) is more costly, the cut-off should be set lower (closer to 0)
TRUE/FALSE
TRUE
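A hedged illustration of the recall/precision trade-off with made-up probabilities (scikit-learn assumed; the numbers are not from the project):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])                  # made-up labels
y_prob = np.array([0.9, 0.6, 0.55, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1])

for cutoff in (0.5, 0.3):
    y_pred = (y_prob >= cutoff).astype(int)
    print(cutoff,
          "precision:", round(precision_score(y_true, y_pred), 2),
          "recall:", round(recall_score(y_true, y_pred), 2))
# Lowering the cut-off catches more true positives (higher recall)
# at the cost of more false positives (lower precision).
```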
What does an AUC of 0.8 mean? Interpret this
AUC answers the question: what’s the probability that a randomly chosen positive case will receive a higher score from your classifier than a randomly chosen negative case?
With AUC = 0.8, a randomly chosen true cancellation observation will be ranked higher than a randomly chosen non-cancellation observation 80% of the time
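A small sketch of this pairwise interpretation with made-up scores (scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0])                  # made-up labels
score  = np.array([0.9, 0.8, 0.3, 0.6, 0.4, 0.2])      # classifier scores

# AUC from the ROC curve ...
auc = roc_auc_score(y_true, score)

# ... equals the share of (positive, negative) pairs where the positive
# case receives the higher score (ties counted as 0.5).
pos, neg = score[y_true == 1], score[y_true == 0]
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
print(auc, pairwise)  # both ~0.78
```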
If we scan only 20% of records but our classifier finds 80% of all responders, we have a lift of?
A) 4
B) 8
C) 6
Interpret the result
Lift = % responders accumulated / % all records scanned
0.8 / 0.2 = 4, i.e., answer A)
Our classifier finds responders at 4x the rate of a random-guess model
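Spelled out as a tiny calculation:

```python
# Lift = % of all responders accumulated / % of records scanned
responders_captured = 0.80   # classifier finds 80% of all responders
records_scanned = 0.20       # while scanning only the top 20% of records
print(responders_captured / records_scanned)  # 4.0 -> answer A)
```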
Bagging and boosting are two ensemble learning techniques for decision trees. Which of the following statements are NOT true for bagging?
A) multiple instances of the base model (tree) are trained on different subsets of training data
B) subsets are created by random sampling
C) it helps reduce overfitting and variance of the model, increasing robustness and generalisability
D) combines multiple weak learners sequentially, with each tree focusing on correcting the errors of its predecessor
E) trees are trained independently
WRONG:
D) combines multiple weak learners sequentially, with each tree focusing on correcting the errors of its predecessor – THIS IS THE CASE FOR BOOSTING
TRUE ABOUT BAGGING:
A) multiple instances of the base model (tree) are trained on different subsets of training data
B) subsets are created by random sampling
C) it helps reduce overfitting and variance of the model, increasing robustness and generalisability
E) trees are trained independently
Bagging and boosting are two ensemble learning techniques for decision trees. Which of the following statements are NOT true for boosting?
A) combines multiple weak learners (shallow trees) sequentially, each tree focusing on correcting the error of its predecessor
B) trees are built sequentially, and each subsequent tree gives more weight to instances misclassified by the previous ones
C) the final prediction is a weighted sum of the individual tree predictions
D) it aims at improving accuracy by emphasising instances that are challenging to classify
E) The focus on errors helps the ensemble adapt and perform well on difficult-to-learn patterns in the data.
All statements are correct
A) combines multiple weak learners (shallow trees) sequentially, each tree focusing on correcting the error of its predecessor
B) trees are built sequentially, and each subsequent tree gives more weight to instances misclassified by the previous ones
C) the final prediction is a weighted sum of the individual tree predictions
D) it aims at improving accuracy by emphasising instances that are challenging to classify
E) The focus on errors helps the ensemble adapt and perform well on difficult-to-learn patterns in the data.
Bagging aims to reduce overfitting and increase stability by training trees independently on different subsets.
Boosting focuses on improving accuracy by sequentially training trees, with each tree emphasizing the correction of errors made by the previous ones.
TRUE/FALSE
TRUE
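A minimal scikit-learn sketch contrasting the two ensemble styles (purely illustrative, not the project's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Bagging: independent trees (the default base estimator) trained on
# bootstrap samples; their predictions are aggregated.
bag = BaggingClassifier(n_estimators=100, random_state=42).fit(X, y)

# Boosting: shallow trees built sequentially, each one focusing on the
# errors of its predecessors; the final prediction is a weighted sum.
boost = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                   random_state=42).fit(X, y)
```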
In the context of building a decision tree, what is recursive partitioning? What is entropy and how is it used to build a decision tree? Select all correct statements:
A) entropy is used to measure the impurity/ disorder of a dataset with respect to the target variable (output)
B) entropy is used as a criterion to make decisions about how to split the data when constructing a decision tree
C) the goal in DT is to partition the dataset into subsets that are as pure as possible wrt. the target variable
D) the idea is to select the attributes that, when used for splitting, maximally reduce the entropy of the subsets
E) the goal is to get more homogeneous subsets wrt. the target variable
All options are TRUE
A) entropy is used to measure the impurity/ disorder of a dataset with respect to the target variable (output)
B) entropy is used as a criterion to make decisions about how to split the data when constructing a decision tree
C) the goal in DT is to partition the dataset into subsets that are as pure as possible wrt. the target variable
D) the idea is to select the attributes that, when used for splitting, maximally reduce the entropy of the subsets
E) the goal is to get more homogeneous subsets wrt. the target variable
Which of the following statements are true about recursive partitioning?
A) a method for DTs where data is split into subsets recursively
B) the process continues until each subset becomes sufficiently homogeneous
C) it is a method that targets to reduce entropy/ impurity in subsets
D) recursive refers to a process that repeats itself
All are true:
A) a method for DTs where data is split into subsets recursively
B) the process continues until each subset becomes sufficiently homogeneous
C) it is a method that targets to reduce entropy/ impurity in subsets
D) recursive refers to a process that repeats itself
Which of the following constitute drawbacks of decision trees?
A) overfitting
B) instability
C) underfitting
D) entropy
A) overfitting
B) instability
One of the drawbacks of decision trees is that they are prone to overfitting. What are the solutions to this?
Hint: this card mentions 4
Solution:
1. Pruning: removing branches from a fully grown tree to reduce complexity and prevent overfitting
2. Minimum leaf samples: set a minimum number of samples required to create a leaf node - prevents very small leaves from capturing noise rather than meaningful relations
3. Cross-validation in general: k-fold
4. Limiting the number of branches/ tune-length (see the combined sketch below)
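A minimal scikit-learn sketch of these controls (parameter values are arbitrary; the project itself tuned tune-length rather than these exact settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=5,          # 4. limit the number of branches / depth
    min_samples_leaf=20,  # 2. minimum samples per leaf node
    ccp_alpha=0.01,       # 1. cost-complexity pruning penalty
    random_state=0,
)
print(cross_val_score(tree, X, y, cv=5).mean())  # 3. k-fold cross-validation
```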
One of the drawbacks of decision trees is instability. What does this mean?
A) DTs can be unstable because small changes in the data can lead to significantly different tree structures
B) makes them sensitive to variations in the training data
C) makes it important to set a seed
TRUE:
A) DTs can be unstable because small changes in the data can lead to significantly different tree structures
B) makes them sensitive to variations in the training data
FALSE: C) - a seed is only useful for eliminating randomness of results when using the SAME data and the SAME model
One of the drawbacks of decision trees is instability. Which of the following are NOT solutions to this?
A) Maximise entropy and minimise precision
B) Ensembling: combining multiple models to create a stronger and more robust model: e.g., bagging and/or boosting
C) Pre-determining the depth of the tree/ number of splits to avoid instability
WRONG: A) Maximise entropy and minimise precision
CORRECT:
B) Ensembling: combining multiple models to create a stronger and more robust model: e.g., bagging and/or boosting
C) Pre-determining the depth of the tree/ number of splits to avoid instability
What are 3 basic steps for generating a binary classification for an observation in the test set?
Hint: you start with probability and end with ranking
- compute the probability of 1 relative to 0 for each observation
- define a cut-off threshold that determines how high the probability must be for an observation to be classified as 1 rather than 0
- rank the observations by their probabilities to determine the 1s and 0s (a sketch follows below)
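A minimal sketch of the three steps on synthetic data (scikit-learn assumed; the 0.5 cut-off is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)

p1 = tree.predict_proba(X_te)[:, 1]   # 1) probability of class 1
labels = (p1 >= 0.5).astype(int)      # 2) apply the cut-off threshold
ranking = np.argsort(-p1)             # 3) indices ordered from most to least likely 1
```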
In which situations would we care about a classifier’s gain/lift performance?
In applications where budget and resources are constrained to only a subset of the observations.
In such cases, you want to rank all observations by how likely they are to be positive, allowing you to act on the cases that are most likely to need action
If I tell you that my classifier has a “1st decile lift” of 1.2, what does that number mean?
You split the data into 10 deciles, ranking observations by likelihood of being TRUE; the 1st decile is the one where the likelihood is highest. In this decile, your classifier identifies positives at 1.2x the rate of random guessing, i.e., 20% better
How do you determine the best splits? Are the following steps described correctly?
A) For each node in the tree, the resulting class impurity is calculated using some measure (often GINI or entropy)
B) The split that maximizes the purity of classes in the resulting partitions is selected.
C) The total entropy is calculated as a weighted average of resulting entropy in partitions.
D) This testing-splitting process continues iteratively until a terminal node is reached.
YES:
A) For each node in the tree, the resulting class impurity is calculated using some measure (often GINI or entropy)
B) The split that maximizes the purity of classes in the resulting partitions is selected.
C) The total entropy is calculated as a weighted average of resulting entropy in partitions.
D) This testing-splitting process continues iteratively until a terminal node is reached.
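A small sketch of step C, the size-weighted average entropy of a candidate split (the label vectors are made up):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_entropy(left, right):
    """Total entropy after a split: size-weighted average over the partitions."""
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# A split that yields a pure right partition scores lower (better) ...
print(split_entropy([1, 1, 1, 0], [0, 0, 0, 0]))  # ~0.41
# ... than one that leaves both partitions mixed 50/50.
print(split_entropy([1, 0, 1, 0], [1, 0, 0, 1]))  # 1.0
```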
When a resulting partition or terminal node contains cases of both classes (1 and 0), a class probability is computed for an individual case by looking at the proportion of cases with that outcome class.
TRUE/FALSE
TRUE
A possible extension of the decision tree is random forest. Which of the following statements are TRUE about this method?
A) it leverages the concept of bagging (bootstrap aggregating) of individually “weak” learners (i.e., an ensemble) to create a more powerful prediction model
B) the resulting model tends to have better statistical properties such as lower variance
C) it is less complex than decision trees’ basic form
D) it serves as an alternative to linear regression
A) it leverages the concept of bagging (bootstrap aggregating) of individually “weak” learners (i.e., an ensemble) to create a more powerful prediction model
B) the resulting model tends to have better statistical properties such as lower variance
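A minimal scikit-learn sketch of the variance-reduction claim (illustrative synthetic data, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=7)

# A single deep tree vs. a forest of bagged, decorrelated trees.
tree_cv = cross_val_score(DecisionTreeClassifier(random_state=7), X, y, cv=5)
forest_cv = cross_val_score(RandomForestClassifier(n_estimators=200,
                                                   random_state=7), X, y, cv=5)
print(tree_cv.mean(), forest_cv.mean())  # the forest is typically more accurate
```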
Which of the following is NOT an advantage of decision trees?
A) it can handle missing data
B) it is easy to interpret and explain to stakeholders
C) it can be used to automatically find interesting interaction terms and informative predictors
D) it is computationally less demanding than logistic regression models of similar size
WRONG:
D) it is computationally less demanding than logistic regression models of similar size - on the contrary
CORRECT:
A) it can handle missing data
B) it is easy to interpret and explain to stakeholders
C) it can be used to automatically find interesting interaction terms and informative predictors
One of the advantages of decision trees is that they can handle missing data.
Explain this.
During tree construction, only examples without missing data are used.
But at the time of prediction, trees can use “SURROGATE SPLITS” (i.e., similar splits) to classify cases when specific feature values are missing
One of the advantages of decision trees is that they are fairly easy to interpret and explain to stakeholders.
Explain this.
In many cases, you can compare the learned rules to human-derived business rules and find interesting associations
One of the advantages of decision trees is that they can be used to automatically find interesting interaction terms and informative predictors.
Explain this
Due to the way trees create splits, they can be used to find interesting interaction terms and informative predictors. For example, variables appearing at the top of the tree will tend to carry the most information about the outcome class. You might use this information to build other predictive models with fewer predictors and/or interaction terms.
Which of the following comprise drawbacks to decision tree models?
A) they are prone to overfitting
B) tree construction is greedy
C) they are prone to imbalance
D) they are difficult to visualise meaningfully when number of branches is high
TRUE DRAWBACKS:
A) they are prone to overfitting
B) tree construction is greedy
D) they are difficult to visualise meaningfully when number of branches is high
WRONG:
C) they are prone to imbalance
One of the disadvantages of decision trees is that they are prone to overfitting.
Explain this
Tree algorithms can be very sensitive to features of your particular dataset, increasing the risk of overfitting to random noise present in your particular training set. Consider that in the extreme case, you could construct a deep tree by creating a unique terminal node for each example in the training set.
Pruning involves cutting back branches of the tree to improve its _____ to new samples.
Fill in the blank.
Pruning involves cutting back branches of the tree to improve its GENERALISABILITY to new samples.
I.e., it helps combat overfitting
Cross-validation can help you to make the appropriate pruning decisions, based on criteria like the cost-complexity measure that balances predictive accuracy and tree simplicity.
TRUE/FALSE
TRUE
One of the disadvantages of decision trees is that the tree construction is “greedy”.
Explain this.
Tree construction is greedy: once a split decision is made, it cannot be reversed. The result is that initial splits play a larger role in determining the structure of the tree, which may not be a good thing if your tree happens to be trained on unrepresentative data.
A hyperparameter is one that is not learned from the data itself, but is tuned by the programmer.
TRUE/ FALSE
TRUE
The cost complexity of a tree is equal to its misclassification error on the training data plus a penalty factor based on the size of the tree. It exhibits the trade-off between the tree’s ability to capture complex relations and its structural complexity, which can lead to overfitting.
TRUE/FALSE
TRUE
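A minimal sketch of cost-complexity pruning in scikit-learn (this may differ from the tooling used in the project): grow a full tree, enumerate candidate alpha penalties, and pick the one with the best cross-validated accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=3)

# Candidate complexity penalties (alphas) for the fully grown tree.
path = DecisionTreeClassifier(random_state=3).cost_complexity_pruning_path(X, y)

# Cross-validate a pruned tree for each alpha and keep the best one.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=3),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
print(path.ccp_alphas[int(np.argmax(scores))])
```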
Pruning uses holdout-data performance (e.g., in k-fold cross-validation, the fold used to test the learned tree) to remove the “weakest branches” of the overgrown tree. As a process, pruning reduces the size of trees. The basic idea is to trade off misclassification error in the holdout set against the number of decision nodes in the pruned tree to achieve a good bias-variance balance.
TRUE/FALSE
TRUE