SRM Chapter 5 - Decision Trees Flashcards

5.1 Regression Trees 5.2 Classification Trees 5.3 Multiple Trees

1
Q

In a decision tree what does the first (top-most) split indicate?

A

The most important predictor (most influential on the response variable)

2
Q

Terminal nodes aka tree leaves

A
  • The regions that the predictor space is split into
  • Think of them as the endpoints of each branch of decisions
  • Each leaf holds the predicted value for its branch of decisions
3
Q

Internal nodes

A

Intermediate splits along the tree

4
Q

Stump

A
  • A tree with only one internal node (a single split)
  • Picture just the root split and two leaves
5
Q

Child nodes

A
  • Nodes resulting from a split
  • The node above two child nodes is the parent node
6
Q

Branches

A

The lines connecting any two nodes

7
Q

How is a tree partitioned?

A

The predictor space is partitioned into g non-overlapping regions such that the SSE (error sum of squares) is minimized.
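In symbols, a minimal LaTeX statement of this objective (with regions R_1, ..., R_g and each region's sample mean):

```latex
\mathrm{SSE} = \sum_{m=1}^{g} \sum_{i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2
```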

8
Q

Recursive binary splitting

A

Top-down: start with everything in one region and select the next best split each time, working down the tree.

Binary: because two regions are created at each split.
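A minimal sketch, assuming numeric predictors in a NumPy array, of how one greedy binary split is chosen by minimizing SSE (illustrative only; the function and variable names are hypothetical, not from the source):

```python
import numpy as np

def best_binary_split(X, y):
    """Return (feature index, cutpoint) of the single split that minimizes SSE."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):                 # try every predictor
        for s in np.unique(X[:, j]):            # try every candidate cutpoint
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            # SSE of each child region around its own sample mean
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s
```

Recursive binary splitting would apply this same search again within each resulting region.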

9
Q

Why is recursive binary splitting considered “greedy”?

A
  • Because it’s a top-down approach
  • At each step it only chooses the best available split at that point, rather than looking holistically at all possible sequences of splits
10
Q

Stopping criteria

A

Criteria for when to stop splitting the predictor space.

e.g. No region allowed to have fewer than some number of observations.

11
Q

When the predictor space is split and observations are divided, how is the predicted value chosen?

A

By computing the sample mean of each region (y-bar Rm).

12
Q

Pruning

A

Reducing the number of leaves (terminal nodes) in a tree to reduce flexibility, variance, and the risk of overfitting. (Recursive binary splitting is prone to overfitting because it can produce a large, complex tree.)

We do this by creating a subtree (collapsing/eliminating internal nodes of the tree).

13
Q

Cost complexity pruning

A
  • Aka weakest link pruning
  • Uses a tuning parameter to produce a sequence of subtrees; each value of the tuning parameter identifies which subtree to consider
  • Similar to lasso/ridge regression in that the tuning parameter acts as a penalty term that reduces flexibility
14
Q

What does cost complexity pruning result in?

A

Nested sequences of subtrees

15
Q

How does cost complexity pruning work?

A
  • Has a tuning parameter (lambda) that acts as a penalty term to punish excess flexibility in the model
  • The penalty is applied through the tree size |T|, which works against lambda (see the criterion below)
  • Lambda and |T| are inversely related: as lambda increases, the selected subtree shrinks
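For reference, the penalized criterion: for each value of lambda, the chosen subtree T minimizes

```latex
\sum_{m=1}^{|T|} \sum_{i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2 + \lambda \, |T|
```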
16
Q

What does the graph of # of terminal nodes against lambda look like?

A
  • Step function
  • Looks like stairs coming down from the left
  • i.e. as the tuning parameter increases, the number of leaves decreases
17
Q

How do we select the best subtree after pruning the tree?

A

Cross-validation

Choose lambda by applying k-fold cross-validation, and pick the lambda that results in the lowest CV error.

Then the best subtree is the one that has the selected lambda value.

18
Q

What does |T| denote?

A

The number of terminal nodes in the tree T (i.e. the tree's size)

19
Q

What is the primary objective of tree pruning?

A

To find a subtree that minimizes the TEST error rate.

20
Q

Effect of tree pruning on:
Variance
Flexibility
Bias
Residual sum of squares (SSE)
Number of terminal nodes |T|
Tuning parameter (lambda)
Penalty term (lambda * |T|)

A

Variance: DOWN
Flexibility: DOWN
Bias: UP
SSE: UP
|T|: DOWN
Lambda: UP
Penalty term: DOWN (with |T|)

21
Q

Examples of greedy models

A

Think one-step-at-a-time, without holistic consideration.

  • Recursive binary splitting
  • Forward stepwise selection
  • Backward stepwise selection
22
Q

For a smaller subtree, what happens?

A

As the tuning parameter grows, the quantity RSS + (# terminal nodes × tuning parameter) is minimized by a smaller subtree.

aka RSS + |T| × lambda

23
Q

Decision Tree Yes/No

A

Yes = left branch
No = right branch

24
Q

Impurity function (classification)

A

Used in classification decision trees to decide the best split (instead of splitting to minimize the error sum of squares)

25
Q

Node purity (classification)

A

A node is pure if all observations in it belong to the same category.

i.e. we usually predict the most frequent response for each split region; in a pure node there is no ambiguity because every observation shares the same category.

26
Q

Impurity measures (classification) (3)
and what do we want them to be for the node to be pure?

A
  1. Classification error rate
  2. Gini index
  3. Cross entropy

optional 4. Deviance

Want them to be small: if these measures are small, the node is relatively pure (formulas below)
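Standard formulas for these measures, with p-hat(m,c) denoting the proportion of category c in node m (log base 2 for cross entropy, per a later card):

```latex
E_m = 1 - \max_c \hat{p}_{m,c}, \qquad
G_m = \sum_{c} \hat{p}_{m,c}\,(1 - \hat{p}_{m,c}), \qquad
D_m = -\sum_{c} \hat{p}_{m,c} \log_2 \hat{p}_{m,c}
```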

27
Q

Classification error rate

A

% of training observations in the node that do not belong to the most frequent category

Em = 1 - max over c of (p-hat)m,c
i.e.
Classification error rate = 1 - proportion of observations in the most frequent category

28
Q

Common log base for classification trees

A

log base 2, sometimes natural logarithm (see SOA sample problems)

29
Q

How sensitive is classification error rate as compared to Gini index and cross entropy?

A

Not as sensitive
i.e. it does not capture improvements in node purity as well as the other two measures

30
Q

Focus of classification error rate vs. Gini index and cross entropy

A

Classification error rate focus: misclassified observations

Other two focus: maximizing node purity

This explains why classification error rate is not as sensitive to node purity

31
Q

If prediction accuracy is the goal, which node purity measure is the best to use for pruning?

A

Classification error rate
Because:
- results in simpler trees
- lowers variance (the most)

32
Q

What shapes are the graphs of classification error rate, gini index and cross entropy? (When plotting Im against p-hatm,1)

A

Classification error rate: a triangle/tent shape (peak at p-hat m,1 = 0.5), whose straight sides show why it is less sensitive to improvements in node purity
Cross entropy: smooth inverted-U curve
Gini index: smooth inverted-U curve

33
Q

What does a relatively low residual mean deviance indicate?

A

The decision tree fits the training data well.

34
Q

Misclassification error rate

A

Classification error rate for the whole tree (how many observations are misclassified across all branches)

35
Q

Advantages of decision trees (5)

A
  1. Easy to interpret and explain
  2. Can be presented visually
  3. Can handle categorical variables without using dummy variables
  4. Mimic human decision-making better than other models
  5. Can handle interactions between features without adding special interaction terms, as linear models require
36
Q

Disadvantages of decision trees (2)

A
  1. Not robust -> small changes in the data (different subsets) can produce very different trees
  2. Don’t have the same predictive accuracy as other statistical methods
37
Q

Principle used to choose best split at each node of a tree

A

Minimizing overall variability within each of the resulting nodes

(Fundamental goal is to produce the purest nodes possible)

38
Q

What is cross-validation used for?

A

Evaluating model performance AFTER training (not during the decision process)

39
Q

Gini index

A

Measures total variance across K classes

(sometimes referred to as a metric for total variance)

40
Q

If a node is pure, what are the entropy and classification error rate values?

A

Both are 0 (this occurs when the class proportions p-hat m,c are all either 0 or 1)

41
Q

If there are only 2 possible categories what is the entropy relative to classification error rate?

A

Entropy ≥ classification error rate (they are equal only at a pure node, where both are 0)

42
Q

Effect of pruning on classification tree:

  • Training error rate
  • Test error rate
A

Training error rate: INCREASE

Test error rate: MIXED effects
- DECREASE if it successfully reduces overfitting
- INCREASE if an important split is removed

43
Q

Preferred criterion for GROWING trees

A

NOT classification error because it’s insensitive to node purity

Rather pick Gini index or cross entropy, which are focused on maximizing node purity

44
Q

Preferred criterion for PRUNING trees

A

YES classification error rate because it is focused on predictive accuracy, over Gini or entropy criteria.

45
Q

Ensemble methods examples (3)

A
  1. Bagging
  2. Random Forests
  3. Boosting
46
Q

Ensemble methods

A

Combine multiple ‘weak learner’ (basic) models into a stronger predictive model.

47
Q

Bootstrapping

A

Create bootstrap samples by resampling with replacement -> artificial sets of observations drawn from the one original sample.

Fit the model / make predictions on each bootstrap sample and combine the results across all samples.
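A minimal NumPy sketch of drawing bootstrap samples (illustrative only; the statistic computed here, the sample mean, is just a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=100)          # original sample of n observations
n, b = len(data), 1000               # each bootstrap sample has the same size n

boot_means = np.empty(b)
for i in range(b):
    sample = rng.choice(data, size=n, replace=True)  # resample WITH replacement
    boot_means[i] = sample.mean()                    # statistic on each sample

# Combine results across samples, e.g. a bootstrap standard error of the mean
print(boot_means.std(ddof=1))
```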

48
Q

Assumption of bootstrap sampling

A

Each bootstrap sample must have the same size as the original sample

49
Q

How many bootstrap samples can be created from n observations?

A

n^n ordered samples

or

C(2n-1, n-1) distinct samples (a binomial coefficient: combinations with repetition)
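A quick worked check for n = 3:

```latex
3^3 = 27 \text{ ordered samples}, \qquad \binom{2(3)-1}{3-1} = \binom{5}{2} = 10 \text{ distinct samples}
```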

50
Q

Bagging

A

Aka bootstrap aggregation
Approach used to reduce variance in f-hat.

51
Q

Bagging steps

A
  1. Create b bootstrap samples.
  2. Construct a decision tree for each bootstrap sample (use recursive binary splitting).
  3. Predict response by:
    - averaging predictions (regression)
    - using most frequent category (classification)
    (across all b bagged trees)
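A minimal sketch of these steps for regression, assuming scikit-learn is available (the names X_train, y_train, X_test and the function are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, b=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(b):
        idx = rng.integers(0, n, size=n)        # 1. bootstrap sample (with replacement)
        tree = DecisionTreeRegressor()           # 2. grow a deep, unpruned tree
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)                # 3. average predictions across all b trees
```

For classification the last step would take the most frequent predicted category instead of the mean.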
52
Q

Bagged trees

A

Decision trees created for each bootstrap sample using recursive binary splitting

53
Q

Is b (number of trees) a tuning parameter?

A

NO.
As such a really large b does not lead to overfitting

54
Q

Advantage of bagging

A

Reduces variance

55
Q

Disadvantage of bagging

A

Difficult to interpret the whole thing (can’t illustrate all b trees with just one tree)

56
Q

Out-of-bag error

A

Estimate of the test error of a bagged model

57
Q

Out-of-bag (OOB) observations

A

Observations not used to train a particular bagged tree.

58
Q

About how many of the observations are used to train each bagged tree?

A

About 2/3 of the original observations (on average, a bootstrap sample contains roughly 1 - 1/e ≈ 63.2% of the distinct original observations)

59
Q

Calculation steps for OOB error

A
  1. Predict the response for each OOB observation, for each bagged tree.
  2. Obtain a single prediction from these
    - regression = average
    - classification = most frequent
  3. Compute the OOB error
    - regression = test MSE formula
    - classification = test error rate formula
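A minimal sketch of this OOB calculation for a bagged regression model (illustrative only; the function and variable names are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_mse(X, y, b=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_sum = np.zeros(n)    # running sum of OOB predictions per observation
    oob_cnt = np.zeros(n)    # number of trees for which each observation was OOB
    for _ in range(b):
        idx = rng.integers(0, n, size=n)          # bootstrap indices for this tree
        oob = np.setdiff1d(np.arange(n), idx)     # 1. observations NOT used to train it
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        oob_sum[oob] += tree.predict(X[oob])
        oob_cnt[oob] += 1
    mask = oob_cnt > 0                            # 2. average the OOB predictions
    return np.mean((y[mask] - oob_sum[mask] / oob_cnt[mask]) ** 2)   # 3. test-MSE form
```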
60
Q

What does the test error and OOB error look like on a graph?

A

Y axis: error
X axis: b (# bagged trees)

  • both errors follow a similar pattern
  • both stabilize after a certain value of b (at some point increasing the number of bagged trees doesn’t help the model any more).
  • both kinda look like a rounded L shape (error reduces gradually until it stabilizes)
  • neither are U shaped because they have nothing to do with flexibility
61
Q

Ways to measure variable importance in bagged model (2)

A
  1. Find the mean decrease in accuracy of OOB predictions when that variable is excluded from the model
    (a large decrease in accuracy means the variable is important)
  2. Find the mean decrease in node impurity from splits over that variable
    (a large decrease in impurity means the variable is important)
62
Q

Random forests

A

Similar to bagging but not all p explanatory variables are considered.

When all p explanatory variables are considered the model tends to choose the most important variable as the top split, so the trees might end up looking the same (they will be correlated).

TLDR random forests de-correlate trees.

Important: random forests select a random subset of predictors AT EACH SPLIT
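For illustration only (not from the source), a scikit-learn sketch on synthetic data, where max_features is the size k of the random subset of predictors considered at each split (setting it to p would reproduce bagging):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # p = 10 predictors
y = X[:, 0] + rng.normal(size=200)         # response driven mainly by one predictor

# max_features="sqrt" gives k ≈ sqrt(p) predictors considered AT EACH SPLIT,
# which de-correlates the trees relative to plain bagging.
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                           oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # OOB R^2, an estimate of out-of-sample performance
```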

63
Q

Relationship between random forest and bagging

A

Bagged models are random forests for k (subset variables) = p (total predictors)

64
Q

When should you use a small k value (# of subset variables)?

A

For datasets with a large number of correlated variables

65
Q

Boosting

A

Grows decision trees sequentially, using information from previously grown trees. Each new tree is grown by fitting to the residuals left by the trees that came before it.

66
Q

Boosting steps

A
  1. Choose number of splits (small)
  2. Create multiple trees of that size (small) where each tree is dependent on the previous
67
Q

Boosting steps generalized

A
  1. Set the working response z1 = y.
  2. For k = 1, 2, …, b:
    (a) Use recursive binary splitting to fit a tree f-hat_k with d splits to the data, with z_k as the response.
    (b) Update the working response: z_(k+1) = z_k - lambda * f-hat_k(X) (subtract the shrunken fitted values).
  3. The boosted model is the sum of the shrunken trees: f-hat(x) = sum over k of lambda * f-hat_k(x).
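A minimal sketch of these steps for regression, assuming scikit-learn stumps (d = 1 via max_depth=1) and synthetic data; lambda is the shrinkage parameter:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

b, d, lam = 500, 1, 0.1          # number of trees, splits per tree, shrinkage
z = y.copy()                     # working response: starts as y, becomes residuals
trees = []
for _ in range(b):
    tree = DecisionTreeRegressor(max_depth=d)   # (a) fit a small tree to z
    tree.fit(X, z)
    z -= lam * tree.predict(X)                  # (b) subtract lambda * f-hat_k(X) from z
    trees.append(tree)

# Boosted prediction: the sum of the shrunken trees
f_hat = sum(lam * t.predict(X) for t in trees)
```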

68
Q

Tuning parameters for boosting (3)

A
  1. Number of trees (b)
  2. Number of splits in each tree (d)
  3. Shrinkage parameter (lambda) between 0 and 1
69
Q

Interaction depth

A

d
(Number of splits in each tree)

  • Controls complexity
  • Often d = 1 works well (single split, stump tree)
70
Q

What happens when d = 1?

A
  • Boosted model is an additive model
  • Stump tree (only one split)
71
Q

Partial dependence plots

A

Show predicted response as a function of one variable when the others are averaged out.

72
Q

Lambda in the context of boosting

A
  • Shrinkage parameter (tuning)
  • Controls the rate at which boosting learns (values near 0 = slow learning; values near 1 = fast learning)
73
Q

Does bagging have pruning?

A

NO
Trees are grown deep but not pruned, variance is high for each tree.

Instead, averaging the trees is how variance is reduced in this model

74
Q

When is OOB error virtually equivalent to LOOCV error?

A

When there is a sufficiently large number of bootstrap samples.

75
Q

What does a graph of test errors look like for each of bagging, boosting, random forest models?

A
  • Bagging and random forest will look similar (bagging is a special case of a random forest)
  • Bagging and random forest look like a rounded L shape (decreasing and eventually levelling off)
  • Random forest typically has a lower test error than bagging, because randomly selecting predictors at each split de-correlates the trees instead of letting the most important predictor dominate every top split
76
Q

Boosting: if we have a small lambda, what size B do we need?

A
  • Small lambda indicates the model learns slowly
  • Need large value of B to perform well
  • Inverse relationship
77
Q

Which of the three (bagging, random forest, boosting) can overfit?

A
  • Boosting can overfit if B is too large (but occurs slowly if at all).
  • Cross-validation is used to select B.
78
Q

What happens when bagging is applied to time series models? Why?

A

It can reduce the models’ effectiveness because time series observations have a dependent structure