SRM Chapter 5 - Decision Trees Flashcards
5.1 Regression Trees 5.2 Classification Trees 5.3 Multiple Trees
In a decision tree what does the first (top-most) split indicate?
The most important predictor (most influential on the response variable)
Terminal nodes aka tree leaves
- Regions that the tree is split into
- Think endpoints for each branch of decisions
- They end in the predicted values for each branch of decisions
Internal nodes
Intermediate splits along the tree
Stump
- A tree with only one internal node (a single split at the root)
- Picture a literal tree stump: no branches beyond that one split, just two leaves
Child nodes
- Nodes resulting from a split
- The node above two child nodes is the parent node
Branches
The lines connecting any two nodes
How is a tree partitioned?
Predictor space is partitioned into g regions such that the SSE is minimized.
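Written out, recursive binary splitting looks for the regions R_1, ..., R_g that minimize

\[
\text{SSE} = \sum_{m=1}^{g} \sum_{i:\, x_i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2
\]

where ȳ_{R_m} denotes the sample mean of the responses in region R_m.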
Recursive binary splitting
Top-down: start with everything in one region and select the next best split each time, working down the tree.
Binary: because two regions are created at each split.
Why is recursive binary splitting considered “greedy”?
- Because it's a top-down approach
- It only considers the best split available at the current step, rather than holistically evaluating all possible sequences of splits (see the sketch below)
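A minimal sketch in plain Python (toy data and function names are made up for illustration) of the single greedy step that recursive binary splitting repeats: try every predictor/cutpoint pair and keep whichever split gives the lowest SSE right now, with no look-ahead to later splits.

def sse(values):
    # Error sum of squares around the sample mean of a region
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y):
    # Return the (predictor index, cutpoint) pair that minimizes the
    # combined SSE of the two child regions it creates
    best, best_sse = None, float("inf")
    for j in range(len(X[0])):
        for cut in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < cut]
            right = [yi for row, yi in zip(X, y) if row[j] >= cut]
            if left and right and sse(left) + sse(right) < best_sse:
                best, best_sse = (j, cut), sse(left) + sse(right)
    return best, best_sse

# Toy data: one predictor; responses jump once the predictor passes 10
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
print(best_split(X, y))  # ((0, 10.0), ...) -- split between 3.0 and 10.0

Recursive binary splitting then repeats the same search within each of the two resulting regions; because each choice looks only one step ahead, the final tree is not guaranteed to be the best possible partition.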
Stopping criteria
Criteria for when to stop splitting the predictor space.
e.g. No region allowed to have fewer than some number of observations.
When the predictor space is split and observations are divided, how is the predicted value chosen?
By computing the sample mean of the responses in each region, ȳ_{R_m}.
Pruning
Reducing the number of leaves (terminal nodes) in a tree to reduce flexibility, variance, and the risk of overfitting. (Recursive binary splitting is prone to overfitting because it can produce a large, complex tree.)
We do this by creating a subtree, i.e., eliminating one or more internal nodes of the tree.
Cost complexity pruning
- Aka weakest link pruning
- Uses a tuning parameter to generate a sequence of subtrees; each value of the tuning parameter identifies which subtree to consider
- Similar to lasso/ridge regression in that the tuning parameter acts as a penalty term to reduce flexibility
What does cost complexity pruning result in?
A nested sequence of subtrees
How does cost complexity pruning work?
- Uses a tuning parameter (lambda) that acts as a penalty on tree size to punish excess flexibility in the model
- The penalty involves |T|, the number of terminal nodes, so the full penalty term is lambda * |T|
- Lambda and the size of the selected subtree are inversely related: the larger the lambda, the smaller the subtree (see the objective below)
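For each value of lambda, the subtree selected is the one that minimizes the penalized objective

\[
\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2 \;+\; \lambda \, |T|
\]

where the first term measures fit to the training data and the second penalizes tree size.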
What does the graph of # of terminal nodes against lambda look like?
- Step function
- Looks like stairs coming down from the left
- i.e. as the tuning parameter increases, the number of leaves decreases
How do we select the best subtree after pruning the tree?
Cross-validation
Choose lambda by applying k-fold cross-validation, and pick the lambda that results in the lowest CV error.
Then the best subtree is the one that has the selected lambda value.
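Outside the exam setting, here is one way this could look in scikit-learn (a sketch with simulated data; the ccp_alpha argument plays the role of lambda):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                          # simulated predictors
y = np.where(X[:, 0] > 5, 20.0, 5.0) + rng.normal(0, 1, 200)   # simulated response

# Candidate lambda values come from the cost complexity pruning path of the full tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# 5-fold CV error for the subtree associated with each candidate value
cv_mse = [
    -cross_val_score(
        DecisionTreeRegressor(ccp_alpha=a, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error",
    ).mean()
    for a in path.ccp_alphas
]

best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
print("pruning parameter with lowest CV error:", best_alpha)

Refitting a tree with that ccp_alpha value gives the chosen subtree.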
What does |T| denote?
The size of the tree, i.e., its number of terminal nodes (leaves)
What is the primary objective of tree pruning?
To find a subtree that minimizes the TEST error rate.
Effect of tree pruning on:
Variance
Flexibility
Bias
Residual sum of squares (SSE)
Number of terminal nodes |T|
Tuning parameter (lambda)
Penalty term (lambda * |T|)
Variance: DOWN
Flexibility: DOWN
Bias: UP
SSE: UP
|T|: DOWN
Lambda: UP
Penalty term: DOWN (with |T|)
Examples of greedy models
Think one-step-at-a-time, without holistic consideration.
- Recursive binary splitting
- Forward stepwise selection
- Backward stepwise selection
For a smaller subtree, what happens?
As the tuning parameter increases, the quantity SSE + (# terminal nodes * tuning parameter) is minimized by a smaller subtree.
aka SSE + |T| * lambda
Decision Tree Yes/No
Yes = left branch
No = right branch
Impurity function (classification)
Used in classification decision trees to decide the best split (instead of splitting to minimize the error sum of squares)
Node purity (classification)
If all observations in the node are of the same category
i.e. usually we pick the most frequent response for each split region, but here there is no ambiguity because all observations belong to the same category.
Impurity measures (classification) (3)
and what do we want them to be for the node to be pure?
- Classification error rate
- Gini index
- Cross entropy
(optional 4th: Deviance)
Want them to be small: if these measures are small, the node is relatively pure (formulas below)
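In symbols, with p̂_{m,c} denoting the proportion of training observations in region R_m that belong to category c:

\[
E_m = 1 - \max_c \hat{p}_{m,c}, \qquad
G_m = \sum_c \hat{p}_{m,c}\,(1 - \hat{p}_{m,c}), \qquad
D_m = -\sum_c \hat{p}_{m,c} \log \hat{p}_{m,c}
\]

(Deviance is the count-based analogue of cross entropy, -2 Σ_c n_{m,c} ln p̂_{m,c}, where n_{m,c} is the number of observations of category c in region R_m.)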
Classification error rate
% of training observations in the region that do not belong to the most frequent category
E_m = 1 − max_c p̂_{m,c}
i.e.
Classification error rate = 1 − (proportion of observations in the most frequent category)
Common log base for classification trees
log base 2, sometimes natural logarithm (see SOA sample problems)
How sensitive is classification error rate as compared to Gini index and cross entropy?
Not as sensitive
i.e. it does not capture improvements in node purity as well as the other two measures do
Focus of classification error rate vs. Gini index and cross entropy
Classification error rate focus: misclassified observations
Other two focus: maximizing node purity
This explains why classification error rate is not as sensitive to node purity
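A commonly used numerical illustration: a parent node has 400 observations of each of two categories. Split A produces child nodes with counts (300, 100) and (100, 300); split B produces (200, 400) and (200, 0). Both splits misclassify 200 of the 800 observations, so the classification error rate (0.25) cannot tell them apart. The Gini index, however, is 2(0.75)(0.25) = 0.375 in each child of split A (weighted average 0.375), versus 2(1/3)(2/3) ≈ 0.444 in the impure child of split B and 0 in the pure child (weighted average ≈ 0.333), so the Gini index (and cross entropy) prefer split B, the split that creates a pure node.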
If prediction accuracy is the goal, which node purity measure is the best to use for pruning?
Classification error rate
Because it:
- results in simpler trees
- lowers variance the most