SRM Chapter 5 - Decision Trees Flashcards
5.1 Regression Trees 5.2 Classification Trees 5.3 Multiple Trees
In a decision tree what does the first (top-most) split indicate?
The most important predictor (most influential on the response variable)
Terminal nodes aka tree leaves
- Regions that the tree is split into
- Think endpoints for each branch of decisions
- They end in the predicted values for each branch of decisions
Internal nodes
Intermediate splits along the tree
Stump
- A tree with only one internal node (the root)
- Picture a single split going straight into two terminal nodes, with no further branching
Child nodes
- Nodes resulting from a split
- The node above two child nodes is the parent node
Branches
The lines connecting any two nodes
How is a tree partitioned?
Predictor space is partitioned into g regions such that the SSE is minimized.
Recursive binary splitting
Top-down: start with everything in one region and select the next best split each time, working down the tree.
Binary: because two regions are created at each split.
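As a concrete illustration, here is a minimal sketch of one pass of the splitting search (the toy data, the function name best_split, and the use of NumPy are assumptions for illustration, not part of the flashcards): scan every predictor and every cutpoint, and keep the split whose two resulting regions have the smallest combined SSE.

```python
# Sketch: one step of recursive binary splitting for a regression tree.
import numpy as np

def best_split(X, y):
    """Return (predictor index, cutpoint, SSE) of the best binary split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # SSE of a region = sum of squared deviations from the region's mean
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] > 0.5, 3.0, 1.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))   # expect a split on predictor 0 near 0.5
```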
Why is recursive binary splitting considered “greedy”?
- Because it’s a top-down approach
- So, it only considers the best split at the current step, rather than holistically looking at all possible combinations of future branches/splits
Stopping criteria
Criteria for when to stop splitting the predictor space.
e.g. No region allowed to have fewer than some number of observations.
When the predictor space is split and observations are divided, how is the predicted value chosen?
By computing the sample mean of the responses in each region (y-bar for region R_m); see the formula below.
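In symbols (standard notation, assumed here, where region R_m contains n_m training observations):

```latex
\hat{y}_{R_m} = \frac{1}{n_m} \sum_{i:\, x_i \in R_m} y_i
```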
Pruning
Reducing the number of leaves (terminal nodes) in a tree to reduce flexibility, variance and the risk of overfitting (recursive binary splitting is prone to overfitting because it can result in a large and complex tree).
We can do this by creating a subtree (eliminating an internal node of the tree)
Cost complexity pruning
- Aka weakest link pruning
- Uses a tuning parameter to generate a sequence of subtrees; each value of the tuning parameter tells us which subtree to consider
- Similar to lasso/ridge regression in that the tuning parameter is a penalty term to reduce the flexibility
What does cost complexity pruning result in?
A nested sequence of subtrees (one subtree for each value of the tuning parameter)
How does cost complexity pruning work?
- Has a tuning parameter (lambda) that acts as a penalty term to punish excess flexibility in the model
- For each value of lambda, selects the subtree that minimizes SSE + lambda × |T| (the tree size |T| offsets lambda in the penalty)
- Lambda and the size |T| of the selected subtree are inversely related
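Written out in full (a sketch of the criterion in standard notation, with R_m and y-hat_{R_m} as defined above), for each lambda cost complexity pruning picks the subtree T minimizing:

```latex
\sum_{m=1}^{|T|} \; \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 \;+\; \lambda \, |T|
```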
What does the graph of # of terminal nodes against lambda look like?
- Step function
- Looks like stairs coming down from the left
- i.e. as the tuning parameter increases, the number of leaves decreases
How do we select the best subtree after pruning the tree?
Cross-validation
Choose lambda by applying k-fold cross-validation, and pick the lambda that results in the lowest CV error.
Then the best subtree is the one that has the selected lambda value.
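A minimal sketch of that workflow in scikit-learn (assuming scikit-learn is available; it calls the tuning parameter ccp_alpha and defines it on a slightly different scale than the lambda above, and the data here are synthetic):

```python
# Sketch: choose the cost-complexity tuning parameter by k-fold cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Candidate values of the tuning parameter (scikit-learn's ccp_alpha ~ lambda)
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

cv_mse = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

# Best subtree = the one grown with the tuning-parameter value that minimizes CV error
best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
best_subtree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, best_subtree.get_n_leaves())
```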
What does |T| denote?
The size of the tree
What is the primary objective of tree pruning?
To find a subtree that minimizes the TEST error rate.
Effect of tree pruning on:
Variance
Flexibility
Bias
Residual sum of squares (SSE)
Number of terminal nodes |T|
Tuning parameter (lambda)
Penalty term (lambda * |T|)
Variance: DOWN
Flexibility: DOWN
Bias: UP
SSE: UP
|T|: DOWN
Lambda: UP
Penalty term: DOWN (with |T|)
Examples of greedy models
Think one-step-at-a-time, without holistic consideration.
- Recursive binary splitting
- Forward stepwise selection
- Backward stepwise selection
For a smaller subtree, what happens?
As the tuning parameter increases, SSE + (# terminal nodes × tuning parameter) is minimized by a smaller subtree.
aka SSE (RSS) + |T| × lambda
Decision Tree Yes/No
Yes = left branch
No = right branch
Impurity function (classification)
Used in classification decision trees to decide the best split (instead of splitting to minimize the error sum of squares)
Node purity (classification)
If all observations in the node are of the same category
i.e. we predict the most frequent response in each split region; in a pure node this is unambiguous because all observations are of the same category.
Impurity measures (classification) (3)
and what do we want them to be for the node to be pure?
- Classification error rate
- Gini index
- Cross entropy
optional 4. Deviance
Want them to be small - if these measures are small then the node is relatively pure
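For reference, the usual formulas (assuming the standard notation where p-hat_{m,c} is the proportion of node-m training observations in category c):

```latex
E_m = 1 - \max_c \hat{p}_{m,c} \quad \text{(classification error rate)}
G_m = \sum_c \hat{p}_{m,c}\,(1 - \hat{p}_{m,c}) \quad \text{(Gini index)}
D_m = -\sum_c \hat{p}_{m,c} \log \hat{p}_{m,c} \quad \text{(cross entropy)}
```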
Classification error rate
% of training observations that do not belong to the most frequent category
E_m = 1 - max_c (p-hat_{m,c})
i.e.
Classification error rate = 1 - proportion of observations in the most frequent category
Common log base for classification trees
log base 2, sometimes natural logarithm (see SOA sample problems)
How sensitive is classification error rate as compared to Gini index and cross entropy?
Not as sensitive
i.e. it is unable to capture improvements in node purity as well as the other measures
Focus of classification error rate vs. Gini index and cross entropy
Classification error rate focus: misclassified observations
Other two focus: maximizing node purity
This explains why classification error rate is not as sensitive to node purity
If prediction accuracy is the goal, which node purity measure is the best to use for pruning?
Classification error rate
Because:
- results in simpler trees
- lowers variance (the most)
What shapes are the graphs of classification error rate, gini index and cross entropy? (When plotting Im against p-hatm,1)
Classification error rate: triangle/tent shape (one point up at p-hat_{m,1} = 0.5, two points down) -> shows why it is less sensitive to improvements in node purity
Cross entropy: inverted U
Gini index: inverted U
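A quick way to see these shapes is to plot the two-category case; this sketch assumes NumPy and matplotlib are available (the formulas are the standard two-class versions):

```python
# Sketch: plot the three impurity measures against p-hat_{m,1} for a
# two-category node, to see the "triangle" vs "inverted U" shapes.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 400)          # proportion of category 1 in the node
error = 1 - np.maximum(p, 1 - p)            # classification error rate
gini = 2 * p * (1 - p)                      # Gini index (two classes)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # cross entropy, log base 2

plt.plot(p, error, label="classification error rate")
plt.plot(p, gini, label="Gini index")
plt.plot(p, entropy, label="cross entropy")
plt.xlabel(r"$\hat{p}_{m,1}$")
plt.ylabel(r"$I_m$")
plt.legend()
plt.show()
```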
What does a relatively low residual mean deviance indicate?
The decision tree fits the training data well.
Misclassification error rate
Classification error rate for the whole tree (how many observations are misclassified across all branches)
Advantages of decision trees (5)
- Interpretability/explainability (easy)
- Can be presented visually
- Can handle categorical variables without using dummy variables
- Mimic human decision-making better than other models
- Can handle interactions between features without having to add special terms like in linear models
Disadvantages of decision trees (2)
- Not robust -> varying results between subsets
- Don’t have the same predictive accuracy that other statistical methods do
Principle used to choose best split at each node of a tree
Minimizing overall variability within each of the resulting nodes
(Fundamental goal is to produce the purest nodes possible)
What is cross-validation used for?
Evaluating model performance AFTER training (not during the split-decision process)
Gini index
Measures total variance across K classes
(sometimes referred to as a metric for total variance)
If a node is pure, what are the entropy and classification error rate values?
Both equal 0 (in a pure node the most frequent category has proportion 1, so the error rate is 1 - 1 = 0 and the entropy is -1 × log(1) = 0)
If there are only 2 possible categories what is the entropy relative to classification error rate?
Entropy > classification error rate
Effect of pruning on classification tree:
- Training error rate
- Test error rate
Training error rate: INCREASE
Test error rate: MIXED effects
- DECREASE if it successfully reduces overfitting
- INCREASE if an important split is removed
Preferred criterion for GROWING trees
NOT classification error because it’s insensitive to node purity
Rather pick Gini index or cross entropy, which are focused on maximizing node purity
Preferred criterion for PRUNING trees
Classification error rate, because it is focused on predictive accuracy; it is preferred over the Gini index or cross entropy for pruning.
Ensemble methods examples (3)
- Bagging
- Random Forests
- Boosting
Ensemble methods
Combine multiple ‘weak learner’ (basic) models into a stronger predictive model.
Bootstrapping
Create bootstrap samples (with replacement) -> artificial sets of observations from one set.
Make predictions on each bootstrap sample and combine the results across all samples.
Assumption of bootstrap sampling
Each bootstrap sample must have the same size as the original sample
How many bootstrap samples can be created from n observations?
n^n samples if order matters,
or
C(2n-1, n-1) distinct samples if order is ignored (an nCr count)
e.g. for n = 3: 3^3 = 27 ordered samples, but only C(5, 2) = 10 distinct samples
Bagging
Aka bootstrap aggregation
Approach used to reduce variance in f-hat.
Bagging steps
- Create b bootstrap samples.
- Construct a decision tree for each bootstrap sample (use recursive binary splitting).
- Predict response by:
- averaging predictions (regression)
- using most frequent category (classification)
(across all b bagged trees)
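A minimal sketch of those steps for a regression problem (synthetic data; the manual bootstrap loop is for illustration, and scikit-learn's BaggingRegressor would do the same job):

```python
# Sketch of the bagging steps above: b bootstrap samples, one deep tree per
# sample, predictions averaged (regression case).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=1)
rng = np.random.default_rng(1)
b = 100                                   # number of bagged trees (not a tuning parameter)

trees = []
for _ in range(b):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample: same size, with replacement
    tree = DecisionTreeRegressor()               # grown deep, not pruned
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Bagged prediction = average of the b individual tree predictions
y_hat = np.mean([tree.predict(X) for tree in trees], axis=0)
print(y_hat[:5])
```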
Bagged trees
Decision trees created for each bootstrap sample using recursive binary splitting
Is b (number of trees) a tuning parameter?
NO.
As such a really large b does not lead to overfitting
Advantage of bagging
Reduces variance
Disadvantage of bagging
Difficult to interpret the whole thing (can’t illustrate all b trees with just one tree)
Out-of-bag error
Estimate of the test error of a bagged model
Out-of-bag (OOB) observations
Observations not used to train a particular bagged tree.
About how many of the observations are used to train each bagged tree?
About two-thirds of the original observations (each observation appears in a given bootstrap sample with probability 1 - (1 - 1/n)^n ≈ 1 - e^(-1) ≈ 0.632)
Calculation steps for OOB error
- For each observation, predict the response using every bagged tree for which that observation was out of bag.
- Combine these into a single OOB prediction per observation:
- regression = average
- classification = most frequent category
- Compute the OOB error from the OOB predictions:
- regression = test MSE formula
- classification = test error rate formula
What does the test error and OOB error look like on a graph?
Y axis: error
X axis: b (# bagged trees)
- both errors follow a similar pattern
- both stabilize after a certain value of b (at some point increasing the number of bagged trees doesn’t help the model any more).
- both kinda look like a rounded L shape (error reduces gradually until it stabilizes)
- neither are U shaped because they have nothing to do with flexibility
Ways to measure variable importance in bagged model (2)
- Find the mean decrease in accuracy of OOB predictions when that variable is excluded from the model (a large decrease in accuracy means the variable is important and should be included)
- Find the mean decrease in node impurity from splits over that variable (a large decrease in impurity means the variable is important and should be included)
Random forests
Similar to bagging but not all p explanatory variables are considered.
When all p explanatory variables are considered the model tends to choose the most important variable as the top split, so the trees might end up looking the same (they will be correlated).
TLDR random forests de-correlate trees.
Important: random forests select a random subset of predictors AT EACH SPLIT
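A sketch of how this looks in scikit-learn (synthetic data; the parameter values are illustrative assumptions, with max_features playing the role of k):

```python
# Sketch: a random forest differs from bagging only in max_features, the
# number k of predictors considered at each split (k = p reproduces bagging).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=9, noise=5.0, random_state=2)

rf = RandomForestRegressor(
    n_estimators=500,      # b bagged-and-decorrelated trees
    max_features=3,        # k: random subset of predictors tried at each split (~sqrt(p) here)
    oob_score=True,        # also report the out-of-bag estimate of test performance
    random_state=2,
).fit(X, y)

print(rf.oob_score_)       # OOB R^2 (scikit-learn reports R^2 rather than MSE)
```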
Relationship between random forest and bagging
Bagged models are random forests for k (subset variables) = p (total predictors)
When should you use a small k value (# of subset variables)?
For datasets with a large number of correlated variables
Boosting
Grows decision trees sequentially by using info from other trees. New trees are grown by predicting the residuals of the previous tree.
Boosting steps
- Choose number of splits (small)
- Create multiple trees of that size (small) where each tree is dependent on the previous
Boosting steps generalized
- Start with residuals z_i = y_i and boosted fit f-hat(x) = 0
- For k = 1, 2, …, b:
(a) Use recursive binary splitting to fit a tree f-hat_k w/ d splits to the data, w/ z as the response.
(b) Update the residuals by subtracting lambda * f-hat_k(x) from z.
- Output the boosted model: f-hat(x) = sum over k of lambda * f-hat_k(x)
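A sketch of the algorithm above for regression (synthetic data; using max_leaf_nodes = d + 1 so that each tree has exactly d splits is an implementation choice, not from the flashcards):

```python
# Sketch of boosting for regression: start with residuals z = y, repeatedly fit
# a small tree to z, shrink it by lambda, and subtract its predictions from z.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=3)

b, d, lam = 500, 1, 0.01          # number of trees, splits per tree, shrinkage parameter
z = y.copy()                      # initialize residuals to the response
trees = []

for _ in range(b):
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # d + 1 leaves = exactly d splits (d = 1: stump)
    tree.fit(X, z)
    z -= lam * tree.predict(X)    # update residuals by the shrunken fit
    trees.append(tree)

# Boosted prediction = sum of the shrunken trees
f_hat = lam * np.sum([tree.predict(X) for tree in trees], axis=0)
print(np.mean((y - f_hat) ** 2))  # training MSE
```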
Tuning parameters for boosting (3)
- Number of trees (b)
- Number of splits in each tree (d)
- Shrinkage parameter (lambda) between 0 and 1
Interaction depth
d
(Number of splits in each tree)
- Controls complexity
- Often d = 1 works well (single split, stump tree)
What happens when d = 1?
- Boosted model is an additive model
- Stump tree (only one split)
Partial dependence plots
Show predicted response as a function of one variable when the others are averaged out.
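A sketch in scikit-learn, assuming a version recent enough to provide PartialDependenceDisplay (the boosted model and data are illustrative):

```python
# Sketch: partial dependence plots from a boosted regression model.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=4)
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=1).fit(X, y)

# Predicted response vs. features 0 and 1, averaging out the other features
PartialDependenceDisplay.from_estimator(gbm, X, features=[0, 1])
plt.show()
```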
Lambda in the context of boosting
- Shrinkage parameter (tuning)
- Controls the rate at which boosting learns (values near 0 = slow, values near 1 = fast)
Does bagging have pruning?
NO
Trees are grown deep but not pruned, variance is high for each tree.
Instead, averaging the trees is how variance is reduced in this model
When is OOB error virtually equivalent to LOOCV error?
When there is a sufficiently large number of bootstrap samples.
What does a graph of test errors look like for each of bagging, boosting, random forest models?
- Bagging and random forest will look similar (because bagging is a special case of a random forest, with k = p)
- Bagging and random forest look like a rounded L shape (decreasing and eventually levelling off)
- Random forest typically has a lower test error than bagging, because choosing a random subset of predictors at each split de-correlates the trees (the most important predictor no longer dominates every tree's top split)
Boosting: if we have a small lambda, what size B do we need?
- Small lambda indicates the model learns slowly
- Need large value of B to perform well
- Inverse relationship
Which of the three (bagging, random forest, boosting) can overfit?
- Boosting can overfit if B is too large (but occurs slowly if at all).
- Cross-validation is used to select B.
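A sketch of selecting B with a held-out validation set (a stand-in for full cross-validation; the data and parameter values are illustrative assumptions):

```python
# Sketch: guard against overfitting in boosting by choosing B (n_estimators)
# on a validation set, using staged_predict to get the error after each tree.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=5)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=5)

gbm = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.01, max_depth=1)
gbm.fit(X_tr, y_tr)

# Validation MSE after 1, 2, ..., 2000 trees
val_mse = [np.mean((y_val - pred) ** 2) for pred in gbm.staged_predict(X_val)]
best_b = int(np.argmin(val_mse)) + 1
print(best_b, val_mse[best_b - 1])
```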
What happens when bagging is applied to time series models? Why?
It can reduce the models’ effectiveness because time series observations have a dependent structure, while bootstrap resampling treats observations as independent and so destroys that serial dependence.