SRM Chapter 5 - Decision Trees Flashcards

5.1 Regression Trees 5.2 Classification Trees 5.3 Multiple Trees

1
Q

In a decision tree what does the first (top-most) split indicate?

A

The most important predictor (most influential on the response variable)

2
Q

Terminal nodes aka tree leaves

A
  • Regions that the tree is split into
  • Think endpoints for each branch of decisions
  • They end in the predicted values for each branch of decisions
3
Q

Internal nodes

A

Intermediate splits along the tree

4
Q

Stump

A
  • A tree with only one internal node (a single split at the root)
  • Picture a tree with no branching beyond that first split
5
Q

Child nodes

A
  • Nodes resulting from a split
  • The node above two child nodes is the parent node
6
Q

Branches

A

The lines connecting any two nodes

7
Q

How is a tree partitioned?

A

Predictor space is partitioned into g regions such that the SSE is minimized.
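As a worked equation (a standard statement of this objective; y-bar_Rm denotes the sample mean of region R_m):

```latex
\min_{R_1, \ldots, R_g} \; \sum_{m=1}^{g} \sum_{i:\, x_i \in R_m} \left( y_i - \bar{y}_{R_m} \right)^2
```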

8
Q

Recursive binary splitting

A

Top-down: start with everything in one region and select the next best split each time, working down the tree.

Binary: because two regions are created at each split.
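A minimal sketch of one greedy step, assuming a numeric feature matrix X and response y (the helper name best_split is hypothetical, not from the source):

```python
import numpy as np

def best_split(X, y):
    """One step of recursive binary splitting: scan every predictor and
    every cutpoint, and keep the binary split with the lowest total SSE."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):              # each predictor
        for s in np.unique(X[:, j]):         # each candidate cutpoint
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse

# The full algorithm applies best_split again within each resulting region,
# working down the tree until a stopping criterion is met.
```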

9
Q

Why is recursive binary splitting considered “greedy”?

A
  • Because it’s a top-down approach
  • So, it only considers the next-best decision starting from the top, rather than holistically looking at all branch/split possibilities
10
Q

Stopping criteria

A

Criteria for when to stop splitting the predictor space.

e.g. No region allowed to have fewer than some number of observations.

11
Q

When the predictor space is split and observations are divided, how is the predicted value chosen?

A

By computing the sample mean of each region (y-bar_Rm, the average response in region R_m).

12
Q

Pruning

A

Reducing the number of leaves (terminal nodes) in a tree to reduce flexibility, variance, and the risk of overfitting. (Recursive binary splitting is prone to overfitting because it can produce a large, complex tree.)

We do this by creating a subtree, i.e. eliminating internal nodes from the tree.

13
Q

Cost complexity pruning

A
  • Aka weakest link pruning
  • Uses a tuning parameter to produce a sequence of subtrees; each value of the tuning parameter identifies which subtree to consider
  • Similar to lasso/ridge regression in that the tuning parameter acts as a penalty term to reduce flexibility
14
Q

What does cost complexity pruning result in?

A

Nested sequences of subtrees

15
Q

How does cost complexity pruning work?

A
  • Has a tuning parameter (lambda) that acts as a penalty term to punish excess flexibility: the subtree is chosen to minimize SSE + lambda * |T|
  • |T| (the number of terminal nodes) measures the tree's size and offsets lambda
  • The two are inversely related: as lambda increases, the optimal |T| decreases (see the sketch below)
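A sketch using scikit-learn, which calls the tuning parameter ccp_alpha rather than lambda (synthetic data, for illustration only):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# The pruning path gives the sequence of alphas (the deck's lambda values)
# at which internal nodes drop out, yielding the nested subtree sequence.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# One fitted subtree per alpha: larger alpha -> smaller |T| (fewer leaves).
subtrees = [DecisionTreeRegressor(ccp_alpha=a, random_state=0).fit(X, y)
            for a in path.ccp_alphas]
print([t.get_n_leaves() for t in subtrees])
```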
16
Q

What does the graph of # of terminal nodes against lambda look like?

A
  • Step function
  • Looks like stairs coming down from the left
  • i.e. as the tuning parameter increases, the number of leaves decreases
17
Q

How do we select the best subtree after pruning the tree?

A

Cross-validation

Choose lambda by applying k-fold cross-validation, and pick the lambda that results in the lowest CV error.

Then the best subtree is the one that has the selected lambda value.
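A sketch of the selection step with scikit-learn's k-fold CV (the candidate alpha values are illustrative; in practice they come from the pruning path):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# 5-fold CV over candidate values of the tuning parameter; the winner
# is the alpha (lambda) with the lowest CV error.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)
print(grid.best_params_)  # the best subtree is the one fit with this alpha
```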

18
Q

What does |T| denote?

A

The size of the tree

19
Q

What is the primary objective of tree pruning?

A

To find a subtree that minimizes the TEST error rate.

20
Q

Effect of tree pruning on:
Variance
Flexibility
Bias
Residual sum of squares (SSE)
Number of terminal nodes |T|
Tuning parameter (lambda)
Penalty term (lambda * |T|)

A

Variance: DOWN
Flexibility: DOWN
Bias: UP
SSE: UP
|T|: DOWN
Lambda: UP
Penalty term: DOWN (with |T|)

21
Q

Examples of greedy models

A

Think one-step-at-a-time, without holistic consideration.

  • Recursive binary splitting
  • Forward stepwise selection
  • Backward stepwise selection
22
Q

For a smaller subtree, what happens?

A

As lambda grows, the penalized criterion

SSE + lambda * |T|

(i.e. RSS plus the tuning parameter times the number of terminal nodes) is minimized by a smaller subtree.

23
Q

Decision Tree Yes/No

A

Yes = left branch
No = right branch

24
Q

Impurity function (classification)

A

Used in classification decision trees to decide the best split (instead of splitting to minimize the error sum of squares)

25
Q

Node purity (classification)

A

A node is pure if all observations in it are of the same category. Usually we predict the most frequent response in each split region; in a pure node there is no ambiguity, because every observation belongs to the same category.
26
Q

Impurity measures (classification) (3), and what do we want them to be for the node to be pure?

A

1. Classification error rate
2. Gini index
3. Cross entropy
(Optional 4th: deviance)

Want them to be small: if these measures are small, the node is relatively pure.
27
Q

Classification error rate

A

The proportion of training observations in the node that do not belong to the most frequent category:

E_m = 1 - max_c (p-hat_m,c)

i.e. classification error rate = 1 - proportion of observations in the most frequent category.
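A small sketch computing all three impurity measures from a node's class proportions (the function name impurity is hypothetical):

```python
import numpy as np

def impurity(p):
    """Impurity measures for one node, given class proportions p."""
    p = np.asarray(p, dtype=float)
    error = 1 - p.max()                  # classification error rate
    gini = (p * (1 - p)).sum()           # Gini index
    nz = p[p > 0]                        # avoid log(0)
    entropy = -(nz * np.log2(nz)).sum()  # cross entropy, log base 2
    return error, gini, entropy

print(impurity([0.5, 0.5]))  # most impure 2-class node: (0.5, 0.5, 1.0)
print(impurity([1.0, 0.0]))  # pure node: all three measures are 0
```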
28
Q

Common log base for classification trees

A

Log base 2; sometimes the natural logarithm (see the SOA sample problems).
29
Q

How sensitive is the classification error rate compared to the Gini index and cross entropy?

A

Not as sensitive, i.e. it cannot capture improvements in node purity as well as the other measures.
30
Q

Focus of classification error rate vs. Gini index and cross entropy

A

Classification error rate: focuses on misclassified observations.
Gini index and cross entropy: focus on maximizing node purity.

This explains why the classification error rate is not as sensitive to node purity.
31
Q

If prediction accuracy is the goal, which node purity measure is the best to use for pruning?

A

Classification error rate, because it:
  • results in simpler trees
  • lowers variance (the most)
32
Q

What shapes are the graphs of the classification error rate, Gini index, and cross entropy? (When plotting I_m against p-hat_m,1)

A

Classification error rate: a triangle/tent shape (one point up, two down), which shows it is less sensitive to improvements in node purity.
Gini index: inverted U.
Cross entropy: inverted U.
33
Q

What does a relatively low residual mean deviance indicate?

A

The decision tree fits the training data well.
34
Q

Misclassification error rate

A

The classification error rate for the whole tree (how many observations are misclassified across all branches).
35
Q

Advantages of decision trees (5)

A

1. Easy to interpret and explain
2. Can be presented visually
3. Can handle categorical variables without using dummy variables
4. Mimic human decision-making better than other models
5. Can handle interactions between features without adding special terms, unlike linear models
36
Q

Disadvantages of decision trees (2)

A

1. Not robust: different subsets of the data can produce very different trees
2. Do not have the same predictive accuracy as other statistical methods
37
Q

Principle used to choose the best split at each node of a tree

A

Minimizing the overall variability within each of the resulting nodes. (The fundamental goal is to produce the purest nodes possible.)
38
Q

What is cross-validation used for?

A

Evaluating model performance AFTER training (not during the decision process).
39
Q

Gini index

A

Measures total variance across the K classes (sometimes referred to as a metric for total variance).
40
Q

If a node is pure, what are the entropy and classification error rate values?

A

Both equal 0. (In a pure node every class proportion p-hat is either 0 or 1, which drives both measures to 0.)
41
Q

If there are only 2 possible categories, what is the entropy relative to the classification error rate?

A

Entropy > classification error rate (they are equal only at a pure node, where both are 0).
42
Q

Effect of pruning on a classification tree:
- Training error rate
- Test error rate

A

Training error rate: INCREASES

Test error rate: MIXED effects
  • DECREASES if pruning successfully reduces overfitting
  • INCREASES if an important split is removed
43
Q

Preferred criterion for GROWING trees

A

NOT classification error rate, because it is insensitive to node purity. Rather, pick the Gini index or cross entropy, which focus on maximizing node purity.
44
Q

Preferred criterion for PRUNING trees

A

YES classification error rate, because it is focused on predictive accuracy, over the Gini or entropy criteria.
45
Q

Ensemble methods examples (3)

A

1. Bagging
2. Random forests
3. Boosting
46
Q

Ensemble methods

A

Combine multiple 'weak learner' (basic) models into a stronger predictive model.
47
Q

Bootstrapping

A

Create bootstrap samples (with replacement) -> artificial sets of observations from one original set. Make predictions on each bootstrap sample and combine the results across all samples.
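A minimal numpy sketch of the idea, using the sample mean as the statistic (the data values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([3.1, 2.4, 5.0, 4.2, 3.8])  # original sample

# Each bootstrap sample draws n observations WITH replacement,
# so it has the same size as the original sample.
boot_means = [rng.choice(y, size=len(y), replace=True).mean()
              for _ in range(1000)]
print(np.mean(boot_means), np.std(boot_means))  # combined across samples
```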
48
Q

Assumption of bootstrap sampling

A

Each bootstrap sample must have the same size as the original sample.
49
Q

How many bootstrap samples can be created from n observations?

A

n^n ordered samples, or C(2n-1, n-1) distinct samples when order is ignored (combinations with repetition, nCr notation).
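A quick check of the two counts for a small n (n^n counts ordered samples; the binomial coefficient counts samples when order is ignored):

```python
from math import comb

n = 5
print(n ** n)                  # 3125 ordered bootstrap samples
print(comb(2 * n - 1, n - 1))  # 126 distinct samples, order ignored
```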
50
Q

Bagging

A

Aka bootstrap aggregation. An approach used to reduce variance in f-hat.
51
Q

Bagging steps

A

1. Create b bootstrap samples.
2. Construct a decision tree for each bootstrap sample (using recursive binary splitting).
3. Predict the response across all b bagged trees by:
   • averaging predictions (regression)
   • using the most frequent category (classification)

(See the sketch below.)
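A sketch of the three steps for regression, assuming scikit-learn trees and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
rng = np.random.default_rng(0)
n, b = len(y), 100

trees = []
for _ in range(b):
    idx = rng.integers(0, n, size=n)  # step 1: one bootstrap sample
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # step 2

# Step 3 (regression): average the b predictions.
y_hat = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(y_hat)
```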
52
Q

Bagged trees

A

The decision trees created for each bootstrap sample using recursive binary splitting.
53
Q

Is b (the number of trees) a tuning parameter?

A

NO. As a result, a very large b does not lead to overfitting.
54
Q

Advantage of bagging

A

Reduces variance.
55
Q

Disadvantage of bagging

A

Difficult to interpret as a whole (the b trees cannot be illustrated with a single tree).
56
Q

Out-of-bag error

A

An estimate of the test error of a bagged model.
57
Q

Out-of-bag (OOB) observations

A

Observations not used to train a particular bagged tree.
58
Q

About how many of the observations are used to train each bagged tree?

A

About two-thirds of the original observations.
59
Q

Calculation steps for OOB error

A

1. Predict the response for each OOB observation, using each bagged tree for which it is OOB.
2. Combine these into a single prediction per observation:
   • regression = average
   • classification = most frequent category
3. Compute the OOB error:
   • regression = test MSE formula
   • classification = test error rate formula

(See the sketch below.)
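In scikit-learn the same estimate comes built in; a sketch assuming a BaggingRegressor on synthetic data (note oob_score_ is reported as R^2, not MSE):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Each observation is predicted using only the trees whose bootstrap
# sample did NOT include it (its OOB trees); the results are combined.
bag = BaggingRegressor(n_estimators=200, oob_score=True,
                       random_state=0).fit(X, y)
print(bag.oob_score_)  # OOB R^2: an estimate of test-set performance
```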
60
Q

What do the test error and OOB error look like on a graph?

A

Y-axis: error. X-axis: b (number of bagged trees).
  • Both errors follow a similar pattern
  • Both stabilize after a certain value of b (at some point, adding more bagged trees no longer helps the model)
  • Both look roughly like a rounded L shape (error decreases gradually until it stabilizes)
  • Neither is U-shaped, because b has nothing to do with flexibility
61
Q

Ways to measure variable importance in a bagged model (2)

A

1. The mean decrease in accuracy of OOB predictions when that variable is excluded from the model (a large decrease in accuracy means the variable is important)
2. The mean decrease in node impurity from splits over that variable (a large decrease in impurity means the variable is important)

(See the sketch below.)
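A sketch of both measures with scikit-learn; note that sklearn's permutation importance shuffles the variable rather than excluding it, which is a close analogue of measure 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Measure 2: mean decrease in node impurity from splits on each variable.
print(rf.feature_importances_)

# Analogue of measure 1: mean decrease in accuracy when a variable's
# values are permuted (a large decrease -> an important variable).
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```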
62
Q

Random forests

A

Similar to bagging, but not all p explanatory variables are considered. When all p variables are considered, the model tends to choose the most important variable for the top split, so the trees end up looking the same (they are correlated). TLDR: random forests de-correlate the trees.

Important: random forests select a new random subset of predictors AT EACH SPLIT. (See the sketch below.)
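In scikit-learn the per-split subset size is max_features (the deck's k); a sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# max_features is the deck's k: the size of the random subset of
# predictors drawn AT EACH SPLIT. "sqrt" (k = sqrt(p)) is a common choice.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
# Setting max_features=None would consider all p predictors, i.e. bagging.
```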
63
Q

Relationship between random forests and bagging

A

Bagged models are random forests with k (the number of subset variables) equal to p (the total number of predictors).
64
Q

When should you use a small k value (number of subset variables)?

A

For datasets with a large number of correlated variables.
65
Q

Boosting

A

Grows decision trees sequentially, using information from previously grown trees. Each new tree is grown by fitting the residuals of the current model.
66
Q

Boosting steps

A

1. Choose a (small) number of splits.
2. Create many trees of that (small) size, where each tree depends on the previous ones.
67
Q

Boosting steps generalized

A

Set z_1 = y. Then for k = 1, 2, ..., b:
(a) Use recursive binary splitting to fit a tree f-hat_k with d splits, using z_k as the response.
(b) Update the residuals: z_{k+1} = z_k - lambda * f-hat_k(X).

The boosted model is the sum of the shrunken trees: f-hat(x) = lambda * [f-hat_1(x) + ... + f-hat_b(x)]. (See the sketch below.)
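A minimal sketch of the loop with scikit-learn shallow trees (for a binary tree, max_leaf_nodes = d + 1 gives d splits; the parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
b, d, lam = 500, 1, 0.01  # number of trees, splits per tree, shrinkage

z = y.astype(float)       # residuals start as the raw response
trees = []
for _ in range(b):
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, z)  # step (a)
    z = z - lam * tree.predict(X)                                 # step (b)
    trees.append(tree)

# Final boosted prediction: the sum of the shrunken trees.
y_hat = lam * np.sum([t.predict(X) for t in trees], axis=0)
```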
68
Q

Tuning parameters for boosting (3)

A

1. Number of trees (b)
2. Number of splits in each tree (d)
3. Shrinkage parameter (lambda), between 0 and 1
69
Q

Interaction depth

A

d, the number of splits in each tree.
  • Controls complexity
  • Often d = 1 works well (a single split: a stump)
70
Q

What happens when d = 1?

A
  • The boosted model is an additive model
  • Each tree is a stump (only one split)
71
Q

Partial dependence plots

A

Show the predicted response as a function of one variable when the others are averaged out.
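A sketch with scikit-learn's built-in display (the column indices in features are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# One curve per listed feature: the predicted response as that feature
# varies, with the remaining features averaged out.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```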
72
Q

Lambda in the context of boosting

A
  • Shrinkage parameter (a tuning parameter)
  • Controls the rate at which boosting learns (near 0 = slow, near 1 = fast)
73
Q

Does bagging involve pruning?

A

NO. Trees are grown deep but not pruned, so each tree has high variance. Instead, variance is reduced by averaging across the trees.
74
Q

When is the OOB error virtually equivalent to the LOOCV error?

A

When there is a sufficiently large number of bootstrap samples.
75
Q

What does a graph of test errors look like for each of the bagging, boosting, and random forest models?

A
  • Bagging and random forest look similar (bagging is a special case of a random forest)
  • Both look like a rounded L shape: decreasing and eventually levelling off
  • Random forest has a lower test error than bagging, thanks to randomization: de-correlating the trees rather than letting the most important predictor dominate every top split
76
Q

Boosting: if we have a small lambda, what size B do we need?

A
  • A small lambda indicates the model learns slowly
  • A large value of B is needed to perform well
  • Inverse relationship
77
Q

Which of the three (bagging, random forest, boosting) can overfit?

A
  • Boosting can overfit if B is too large (though this occurs slowly, if at all)
  • Cross-validation is used to select B
78
Q

What happens when bagging is applied to time series models? Why?

A

It can reduce the models' effectiveness, because time series observations have a dependent structure (bootstrap resampling treats observations as interchangeable draws).