SRM Chapter 5 - Decision Trees Flashcards

5.1 Regression Trees 5.2 Classification Trees 5.3 Multiple Trees

1
Q

In a decision tree what does the first (top-most) split indicate?

A

The most important predictor (most influential on the response variable)

2
Q

Terminal nodes aka tree leaves

A
  • Regions that the tree is split into
  • Think endpoints for each branch of decisions
  • They end in the predicted values for each branch of decisions
3
Q

Internal nodes

A

Intermediate splits along the tree

4
Q

Stump

A
  • A tree with only one internal node (a single split at the root)
  • Picture a tree with no branching beyond that first split
5
Q

Child nodes

A
  • Nodes resulting from a split
  • The node above two child nodes is the parent node
6
Q

Branches

A

The lines connecting any two nodes

7
Q

How is a tree partitioned?

A

Predictor space is partitioned into g regions such that the SSE is minimized.
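As a worked equation (a standard statement of this objective; y-bar_Rm denotes the sample mean of region R_m):

```latex
\min_{R_1, \ldots, R_g} \; \sum_{m=1}^{g} \sum_{i:\, x_i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2
```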

8
Q

Recursive binary splitting

A

Top-down: start with everything in one region and select the next best split each time, working down the tree.

Binary: because two regions are created at each split.
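A minimal sketch of one greedy step, assuming a numeric feature matrix X and response y (the helper name best_split is hypothetical, not from the source):

```python
import numpy as np

def best_split(X, y):
    """One step of recursive binary splitting: scan every predictor and
    every cutpoint, and keep the binary split with the lowest total SSE."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):              # each predictor
        for s in np.unique(X[:, j]):         # each candidate cutpoint
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse

# The full algorithm applies best_split again within each resulting region,
# working down the tree until a stopping criterion is met.
```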

9
Q

Why is recursive binary splitting considered “greedy”?

A
  • Because it’s a top-down approach
  • So, it only considers the next-best decision starting from the top, rather than holistically looking at all branch/split possibilities
10
Q

Stopping criteria

A

Criteria for when to stop splitting the predictor space.

e.g. No region allowed to have fewer than some number of observations.

11
Q

When the predictor space is split and observations are divided, how is the predicted value chosen?

A

By computing the sample mean of each region (y-bar_Rm, the average response in region R_m).

12
Q

Pruning

A

Reducing the number of leaves (terminal nodes) in a tree to reduce flexibility, variance, and the risk of overfitting. (Recursive binary splitting is prone to overfitting because it can produce a large, complex tree.)

We do this by creating a subtree, i.e. eliminating internal nodes from the tree.

13
Q

Cost complexity pruning

A
  • Aka weakest link pruning
  • Uses a tuning parameter to produce a sequence of subtrees; each value of the tuning parameter identifies which subtree to consider
  • Similar to lasso/ridge regression in that the tuning parameter acts as a penalty term to reduce flexibility
14
Q

What does cost complexity pruning result in?

A

Nested sequences of subtrees

15
Q

How does cost complexity pruning work?

A
  • Has a tuning parameter (lambda) that acts as a penalty term to punish excess flexibility: the subtree is chosen to minimize SSE + lambda * |T|
  • |T| (the number of terminal nodes) measures the tree's size and offsets lambda
  • The two are inversely related: as lambda increases, the optimal |T| decreases (see the sketch below)
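A sketch using scikit-learn, which calls the tuning parameter ccp_alpha rather than lambda (synthetic data, for illustration only):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# The pruning path gives the sequence of alphas (the deck's lambda values)
# at which internal nodes drop out, yielding the nested subtree sequence.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# One fitted subtree per alpha: larger alpha -> smaller |T| (fewer leaves).
subtrees = [DecisionTreeRegressor(ccp_alpha=a, random_state=0).fit(X, y)
            for a in path.ccp_alphas]
print([t.get_n_leaves() for t in subtrees])
```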
16
Q

What does the graph of # of terminal nodes against lambda look like?

A
  • Step function
  • Looks like stairs coming down from the left
  • i.e. as the tuning parameter increases, the number of leaves decreases
17
Q

How do we select the best subtree after pruning the tree?

A

Cross-validation

Choose lambda by applying k-fold cross-validation, and pick the lambda that results in the lowest CV error.

Then the best subtree is the one that has the selected lambda value.
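A sketch of the selection step with scikit-learn's k-fold CV (the candidate alpha values are illustrative; in practice they come from the pruning path):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# 5-fold CV over candidate values of the tuning parameter; the winner
# is the alpha (lambda) with the lowest CV error.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)
print(grid.best_params_)  # the best subtree is the one fit with this alpha
```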

18
Q

What does |T| denote?

A

The size of the tree

19
Q

What is the primary objective of tree pruning?

A

To find a subtree that minimizes the TEST error rate.

20
Q

Effect of tree pruning on:
Variance
Flexibility
Bias
Residual sum of squares (SSE)
Number of terminal nodes |T|
Tuning parameter (lambda)
Penalty term (lambda * |T|)

A

Variance: DOWN
Flexibility: DOWN
Bias: UP
SSE: UP
|T|: DOWN
Lambda: UP
Penalty term: DOWN (with |T|)

21
Q

Examples of greedy models

A

Think one-step-at-a-time, without holistic consideration.

  • Recursive binary splitting
  • Forward stepwise selection
  • Backward stepwise selection
22
Q

For a smaller subtree, what happens?

A

As lambda grows, the penalized criterion

SSE + lambda * |T|

(i.e. RSS plus the tuning parameter times the number of terminal nodes) is minimized by a smaller subtree.

23
Q

Decision Tree Yes/No

A

Yes = left branch
No = right branch

24
Q

Impurity function (classification)

A

Used in classification decision trees to decide the best split (instead of splitting to minimize the error sum of squares)

25
Q

Node purity (classification)

A

A node is pure if all observations in it are of the same category. Usually we predict the most frequent response in each split region; in a pure node there is no ambiguity, because every observation belongs to the same category.
26
Q

Impurity measures (classification) (3), and what do we want them to be for the node to be pure?

A

1. Classification error rate
2. Gini index
3. Cross entropy
(Optional 4th: deviance)

Want them to be small: if these measures are small, the node is relatively pure.
27
Q

Classification error rate

A

The proportion of training observations in the node that do not belong to the most frequent category:

E_m = 1 - max_c (p-hat_m,c)

i.e. classification error rate = 1 - proportion of observations in the most frequent category.
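A small sketch computing all three impurity measures from a node's class proportions (the function name impurity is hypothetical):

```python
import numpy as np

def impurity(p):
    """Impurity measures for one node, given class proportions p."""
    p = np.asarray(p, dtype=float)
    error = 1 - p.max()                  # classification error rate
    gini = (p * (1 - p)).sum()           # Gini index
    nz = p[p > 0]                        # avoid log(0)
    entropy = -(nz * np.log2(nz)).sum()  # cross entropy, log base 2
    return error, gini, entropy

print(impurity([0.5, 0.5]))  # most impure 2-class node: (0.5, 0.5, 1.0)
print(impurity([1.0, 0.0]))  # pure node: all three measures are 0
```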
28
Q

Common log base for classification trees

A

Log base 2; sometimes the natural logarithm (see the SOA sample problems).
29
Q

How sensitive is the classification error rate compared to the Gini index and cross entropy?

A

Not as sensitive, i.e. it cannot capture improvements in node purity as well as the other measures.
30
Q

Focus of classification error rate vs. Gini index and cross entropy

A

Classification error rate: focuses on misclassified observations.
Gini index and cross entropy: focus on maximizing node purity.

This explains why the classification error rate is not as sensitive to node purity.
31
Q

If prediction accuracy is the goal, which node purity measure is the best to use for pruning?

A

Classification error rate, because it:
  • results in simpler trees
  • lowers variance (the most)
32
Q

What shapes are the graphs of the classification error rate, Gini index, and cross entropy? (When plotting I_m against p-hat_m,1)

A

Classification error rate: a triangle/tent shape (one point up, two down), which shows it is less sensitive to improvements in node purity.
Gini index: inverted U.
Cross entropy: inverted U.
33
Q

What does a relatively low residual mean deviance indicate?

A

The decision tree fits the training data well.
34
Q

Misclassification error rate

A

The classification error rate for the whole tree (how many observations are misclassified across all branches).
35
Q

Advantages of decision trees (5)

A

1. Easy to interpret and explain
2. Can be presented visually
3. Can handle categorical variables without using dummy variables
4. Mimic human decision-making better than other models
5. Can handle interactions between features without adding special terms, unlike linear models
36
Q

Disadvantages of decision trees (2)

A

1. Not robust: different subsets of the data can produce very different trees
2. Do not have the same predictive accuracy as other statistical methods
37
Q

Principle used to choose the best split at each node of a tree

A

Minimizing the overall variability within each of the resulting nodes. (The fundamental goal is to produce the purest nodes possible.)
38
Q

What is cross-validation used for?

A

Evaluating model performance AFTER training (not during the decision process).
39
Q

Gini index

A

Measures total variance across the K classes (sometimes referred to as a metric for total variance).
40
Q

If a node is pure, what are the entropy and classification error rate values?

A

Both equal 0. (In a pure node every class proportion p-hat is either 0 or 1, which drives both measures to 0.)
41
Q

If there are only 2 possible categories, what is the entropy relative to the classification error rate?

A

Entropy > classification error rate (they are equal only at a pure node, where both are 0).
42
Q

Effect of pruning on a classification tree:
- Training error rate
- Test error rate

A

Training error rate: INCREASES

Test error rate: MIXED effects
  • DECREASES if pruning successfully reduces overfitting
  • INCREASES if an important split is removed
43
Q

Preferred criterion for GROWING trees

A

NOT classification error rate, because it is insensitive to node purity. Rather, pick the Gini index or cross entropy, which focus on maximizing node purity.
44
Q

Preferred criterion for PRUNING trees

A

YES classification error rate, because it is focused on predictive accuracy, over the Gini or entropy criteria.
45
Q

Ensemble methods examples (3)

A

1. Bagging
2. Random forests
3. Boosting
46
Q

Ensemble methods

A

Combine multiple 'weak learner' (basic) models into a stronger predictive model.
47
Q

Bootstrapping

A

Create bootstrap samples (with replacement) -> artificial sets of observations from one original set. Make predictions on each bootstrap sample and combine the results across all samples.
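A minimal numpy sketch of the idea, using the sample mean as the statistic (the data values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([3.1, 2.4, 5.0, 4.2, 3.8])  # original sample

# Each bootstrap sample draws n observations WITH replacement,
# so it has the same size as the original sample.
boot_means = [rng.choice(y, size=len(y), replace=True).mean()
              for _ in range(1000)]
print(np.mean(boot_means), np.std(boot_means))  # combined across samples
```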
48
Q

Assumption of bootstrap sampling

A

Each bootstrap sample must have the same size as the original sample.
49
Q

How many bootstrap samples can be created from n observations?

A

n^n ordered samples, or C(2n-1, n-1) distinct samples when order is ignored (combinations with repetition, nCr notation).
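A quick check of the two counts for a small n (n^n counts ordered samples; the binomial coefficient counts samples when order is ignored):

```python
from math import comb

n = 5
print(n ** n)                  # 3125 ordered bootstrap samples
print(comb(2 * n - 1, n - 1))  # 126 distinct samples, order ignored
```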
50
Q

Bagging

A

Aka bootstrap aggregation. An approach used to reduce variance in f-hat.
51
Q

Bagging steps

A

1. Create b bootstrap samples.
2. Construct a decision tree for each bootstrap sample (using recursive binary splitting).
3. Predict the response across all b bagged trees by:
   • averaging predictions (regression)
   • using the most frequent category (classification)

(See the sketch below.)
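A sketch of the three steps for regression, assuming scikit-learn trees and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
rng = np.random.default_rng(0)
n, b = len(y), 100

trees = []
for _ in range(b):
    idx = rng.integers(0, n, size=n)  # step 1: one bootstrap sample
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # step 2

# Step 3 (regression): average the b predictions.
y_hat = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(y_hat)
```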
52
Q

Bagged trees

A

The decision trees created for each bootstrap sample using recursive binary splitting.
53
Q

Is b (the number of trees) a tuning parameter?

A

NO. As a result, a very large b does not lead to overfitting.
54
Q

Advantage of bagging

A

Reduces variance.
55
Q

Disadvantage of bagging

A

Difficult to interpret as a whole (the b trees cannot be illustrated with a single tree).
56
Q

Out-of-bag error

A

An estimate of the test error of a bagged model.
57
Q

Out-of-bag (OOB) observations

A

Observations not used to train a particular bagged tree.
58
Q

About how many of the observations are used to train each bagged tree?

A

About two-thirds of the original observations.
59
Q

Calculation steps for OOB error

A

1. Predict the response for each OOB observation, using each bagged tree for which it is OOB.
2. Combine these into a single prediction per observation:
   • regression = average
   • classification = most frequent category
3. Compute the OOB error:
   • regression = test MSE formula
   • classification = test error rate formula

(See the sketch below.)
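In scikit-learn the same estimate comes built in; a sketch assuming a BaggingRegressor on synthetic data (note oob_score_ is reported as R^2, not MSE):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Each observation is predicted using only the trees whose bootstrap
# sample did NOT include it (its OOB trees); the results are combined.
bag = BaggingRegressor(n_estimators=200, oob_score=True,
                       random_state=0).fit(X, y)
print(bag.oob_score_)  # OOB R^2: an estimate of test-set performance
```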
60
Q

What do the test error and OOB error look like on a graph?

A

Y-axis: error. X-axis: b (number of bagged trees).
  • Both errors follow a similar pattern
  • Both stabilize after a certain value of b (at some point, adding more bagged trees no longer helps the model)
  • Both look roughly like a rounded L shape (error decreases gradually until it stabilizes)
  • Neither is U-shaped, because b has nothing to do with flexibility
61
Q

Ways to measure variable importance in a bagged model (2)

A

1. The mean decrease in accuracy of OOB predictions when that variable is excluded from the model (a large decrease in accuracy means the variable is important)
2. The mean decrease in node impurity from splits over that variable (a large decrease in impurity means the variable is important)

(See the sketch below.)
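A sketch of both measures with scikit-learn; note that sklearn's permutation importance shuffles the variable rather than excluding it, which is a close analogue of measure 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Measure 2: mean decrease in node impurity from splits on each variable.
print(rf.feature_importances_)

# Analogue of measure 1: mean decrease in accuracy when a variable's
# values are permuted (a large decrease -> an important variable).
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```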
62
Q

Random forests

A

Similar to bagging, but not all p explanatory variables are considered. When all p variables are considered, the model tends to choose the most important variable for the top split, so the trees end up looking the same (they are correlated). TLDR: random forests de-correlate the trees.

Important: random forests select a new random subset of predictors AT EACH SPLIT. (See the sketch below.)
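In scikit-learn the per-split subset size is max_features (the deck's k); a sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# max_features is the deck's k: the size of the random subset of
# predictors drawn AT EACH SPLIT. "sqrt" (k = sqrt(p)) is a common choice.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
# Setting max_features=None would consider all p predictors, i.e. bagging.
```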
63
Q

Relationship between random forests and bagging

A

Bagged models are random forests with k (the number of subset variables) equal to p (the total number of predictors).
64
Q

When should you use a small k value (number of subset variables)?

A

For datasets with a large number of correlated variables.
65
Q

Boosting

A

Grows decision trees sequentially, using information from previously grown trees. Each new tree is grown by fitting the residuals of the current model.
66
Q

Boosting steps

A

1. Choose a (small) number of splits.
2. Create many trees of that (small) size, where each tree depends on the previous ones.
67
Q

Boosting steps generalized

A

Set z_1 = y. Then for k = 1, 2, ..., b:
(a) Use recursive binary splitting to fit a tree f-hat_k with d splits, using z_k as the response.
(b) Update the residuals: z_{k+1} = z_k - lambda * f-hat_k(X).

The boosted model is the sum of the shrunken trees: f-hat(x) = lambda * [f-hat_1(x) + ... + f-hat_b(x)]. (See the sketch below.)
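A minimal sketch of the loop with scikit-learn shallow trees (for a binary tree, max_leaf_nodes = d + 1 gives d splits; the parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
b, d, lam = 500, 1, 0.01  # number of trees, splits per tree, shrinkage

z = y.astype(float)       # residuals start as the raw response
trees = []
for _ in range(b):
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, z)  # step (a)
    z = z - lam * tree.predict(X)                                 # step (b)
    trees.append(tree)

# Final boosted prediction: the sum of the shrunken trees.
y_hat = lam * np.sum([t.predict(X) for t in trees], axis=0)
```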
68
Q

Tuning parameters for boosting (3)

A

1. Number of trees (b)
2. Number of splits in each tree (d)
3. Shrinkage parameter (lambda), between 0 and 1
69
Q

Interaction depth

A

d, the number of splits in each tree.
  • Controls complexity
  • Often d = 1 works well (a single split: a stump)
70
Q

What happens when d = 1?

A
  • The boosted model is an additive model
  • Each tree is a stump (only one split)
71
Q

Partial dependence plots

A

Show the predicted response as a function of one variable when the others are averaged out.
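A sketch with scikit-learn's built-in display (the column indices in features are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# One curve per listed feature: the predicted response as that feature
# varies, with the remaining features averaged out.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```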
72
Q

Lambda in the context of boosting

A
  • Shrinkage parameter (a tuning parameter)
  • Controls the rate at which boosting learns (near 0 = slow, near 1 = fast)
73
Q

Does bagging involve pruning?

A

NO. Trees are grown deep but not pruned, so each tree has high variance. Instead, variance is reduced by averaging across the trees.
74
Q

When is the OOB error virtually equivalent to the LOOCV error?

A

When there is a sufficiently large number of bootstrap samples.
75
Q

What does a graph of test errors look like for each of the bagging, boosting, and random forest models?

A
  • Bagging and random forest look similar (bagging is a special case of a random forest)
  • Both look like a rounded L shape: decreasing and eventually levelling off
  • Random forest has a lower test error than bagging, thanks to randomization: de-correlating the trees rather than letting the most important predictor dominate every top split
76
Q

Boosting: if we have a small lambda, what size B do we need?

A
  • A small lambda indicates the model learns slowly
  • A large value of B is needed to perform well
  • Inverse relationship
77
Q

Which of the three (bagging, random forest, boosting) can overfit?

A
  • Boosting can overfit if B is too large (though this occurs slowly, if at all)
  • Cross-validation is used to select B
78
Q

What happens when bagging is applied to time series models? Why?

A

It can reduce the models' effectiveness, because time series observations have a dependent structure (bootstrap resampling treats observations as interchangeable draws).