Chapter 5: Decision Trees Flashcards

1
Q

can decision trees be used for both regression and classification problems?

A

yes

2
Q

how to make a prediction within a decision tree?

A

locate the region (terminal node) to which the observation belongs, and use the average (regression) or the majority class (classification) of the target variable in that region as the prediction

3
Q

what is a node? what are the two types of nodes?

A

a point on a decision tree that corresponds to a subset of the training data.

  1. root node: node at the top of the tree, representing the full dataset
  2. terminal nodes (buckets): nodes at the bottom of the tree with no further splits; each is an endpoint of the tree

4
Q

what is a measure of complexity for a tree?

A

depth.
the number of tree splits needed to go from the tree’s root to the furthest terminal node.

larger depth = more complex

5
Q

T/F: When making a split in a tree, we separate the training observations into two subgroups according to the value or level of one and only one predictor at a time.

A

True

6
Q

every time we make a binary split, there are two related decisions that we have to make:

A
  1. which predictor to split on

  2. given the chosen split predictor, the corresponding cutoff value

7
Q

what is the goal for terminal nodes?

A

to create terminal nodes that contain similar observations (that are easy to predict and analyze)

8
Q

what is the measure of impurity for a regression tree?

A

RSS
- the lower the RSS, the more homogeneous the target observations are.

write out the equation for the RSS of a node m.
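for reference, one standard way to write it (notation assumed here: \bar{y}_{R_m} is the mean of the target variable over the training observations in node R_m):

RSS_m = \sum_{i \in R_m} ( y_i - \bar{y}_{R_m} )^2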

9
Q

what are the three measures of node impurity for classification trees? do we want a large or small value?

A
  1. classification error rate
  2. gini index
  3. entropy

we want a smaller value = more homogeneous node (think of the classification error rate)

10
Q

in terms of classification trees, what does p[m,k] mean? what is the formula?

A

it is the proportion of target observations in terminal node m (Rm) that belong to the kth class

p[m,k] = # of training observations of kth class in Rm / total # of observations in Rm
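the same formula in LaTeX form (n_m, the number of training observations in R_m, is notation introduced just for this card):

p_{m,k} = \frac{ \#\{ i \in R_m : y_i = k \} }{ n_m }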

11
Q

what is the classification error rate? what is the formula?

A

it is the proportion of misclassified observations in a terminal node m.

E[m] = 1 - max p[m,k]

recall, max p[m,k] is the proportion of observations in Rm belonging to the most common level of the target variable. these observations are correctly classified in that terminal node. then, the complement of that is the proportion of observations that are misclassified
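a quick worked illustration (class proportions are made up): if a terminal node has proportions p[m,1] = 0.7, p[m,2] = 0.2, p[m,3] = 0.1, then

E_m = 1 - \max(0.7, 0.2, 0.1) = 1 - 0.7 = 0.3

i.e., 30% of the training observations in that node are misclassified.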

12
Q

what does it mean when the classification error rate is close to 0?

A

this means that almost all of the training observations in Rm are correctly classified

13
Q

what is a common criticism against the classification error rate?

A

that it is not sensitive enough to node purity in the sense that it depends on the class proportions only through the maximum

14
Q

which two measures capture node impurity more effectively?

A

gini index and entropy

15
Q

what is the gini index? what is the formula for it?

A

the gini index is a measure of variance across the K classes of the categorical target variable.

high degree of purity = lower gini index (because more variability = less purity)

write out the formula
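for reference, one standard form of the gini index for node m (using p[m,k] as defined in card 10):

G_m = \sum_{k=1}^{K} p_{m,k} ( 1 - p_{m,k} )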

16
Q

when does gini index achieve its minimum/maximum value?

A

max value: if the observations are evenly spread across the K classes (p[m,k] = 1/K for every k)

min value: if p[m,k] = 1 for some k (all observations in the node belong to a single class)
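a quick check of both extremes for a binary target (K = 2), with illustrative proportions:

evenly spread, p = (0.5, 0.5): G_m = 0.5(0.5) + 0.5(0.5) = 0.5 (the maximum for K = 2)
pure node, p = (1, 0): G_m = 1(0) + 0(1) = 0 (the minimum)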

17
Q

what is the formula for entropy?

A

write it out

the entropy increases with the degree of impurity in the node
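for reference, one standard form (the log base varies by convention, commonly base 2 or the natural log, and 0 log 0 is treated as 0):

D_m = - \sum_{k=1}^{K} p_{m,k} \log p_{m,k}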

18
Q

draw the graph of all 3 impurity measures

A

draw

19
Q

all three impurity measures behave similarly, when do all three achieve their max/min value?

A

for a binary target, where p is the proportion of observations in one of the two classes:
max: p = 0.5 (when the node is most impure)
min: p = 0 or 1 (when the node is most pure)

20
Q

what is recursive binary splitting? what is the first step?

A
  1. decide on an impurity measure
  2. now we can construct the first binary split with the goal of minimizing this impurity measure.
  3. after the first split, there will be 2 regions in the predictor space. then, we look for the best predictor and corresponding cutoff point for splitting one of the two regions. (could be the same predictor used in the first split)
  4. this continues recursively….
21
Q

T/F: recursive binary splitting is a greedy algorithm.

A

true: in each step of the building process, we make the split that is currently the best, instead of looking ahead and selecting a split that results in a better tree in a future step

22
Q

how do we control the complexity of a tree?

A

cost complexity pruning

23
Q

what is cost complexity pruning?

A

decision trees can easily be made too complex, and therefore overfit quite easily in an attempt to capture not only the signal but also the noise in the training data

the idea behind cost complexity pruning is that we grow a very large and complex tree, and then prune it back to get a subtree. this algorithm is parameterized by a complexity parameter.

24
Q

in cost complexity pruning, what is the penalized objective function?

A

relative training error + complexity parameter * # of terminal nodes
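in symbols (|T| for the number of terminal nodes and c_p for the complexity parameter are notation chosen just for this card):

penalized objective = relative training error + c_p \times |T|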

25
Q

in cost complexity pruning, what is the relative training error in the penalized objective function?

A

it is the training error of the tree divided by (i.e., scaled by) the training error of the simplest tree (the tree with no splits, i.e., the root node only)

26
Q

in cost complexity pruning, what is the penalty term?

A

complexity parameter × # of terminal nodes

this is in place to prevent overfitting. The role played by the complexity parameter is the same as the role played by lambda in regularization

27
Q

cost complexity pruning: what happens if the complexity parameter (cp) =0?

A

there is no price we have to pay for making the tree complex and the fitted tree is identical to the large, overblown tree without pruning

28
Q

cost complexity pruning: what happens as cp increases from 0 to 1?

A

the penalty term becomes greater and tree branches are snipped off (from the bottom of the tree up) to form new, larger terminal nodes.

the squared bias of the tree predictions will increase, but the variance will drop

if cp = 1, the tree is pruned all the way back to the root node only.

29
Q

cost complexity pruning: what does nested family of trees mean?

A

if c1 and c2 are values of cp with c1 < c2, then the tree fitted with c2 (the larger value) is a subtree of (i.e., nested within) the tree fitted with c1.
by varying the cp parameter, we can trace out the whole family of trees, ranging from the largest (most complex) tree to the tree with the root node only

30
Q

cost complexity pruning: when minimizing the cp, the search is performed subject to other tree control parameters. what are they?

A

minimum observations in a node and maximum depth level

31
Q

cost complexity pruning: can we hyperparameter tune cp? explain how.

A

yes. using k-fold c-v.

  1. for a given cp value, we fit a tree on all but 1 of the k folds. generate predictions on the left-out fold and measure the prediction performance, reflected by the RSS (regression) or the classification error rate (classification)
  2. repeat this process for all of the k folds
  3. the cp value that gives rise to the best performance is chosen
32
Q

compare and contrast GLMs and decision trees. what are the 5 areas that we can compare the two?

A
output
numeric predictors
categorical predictors
interaction
collinearity
33
Q

decision trees vs. GLMs: output.

A

GLMs:
- produces a closed-form equation showing how the target mean depends on the feature values, with the coefficients indicating the magnitude and the direction of the effects of the features on the target variable

Trees:

  • a set of if-else splitting rules showing how the target mean varies with the feature values.
  • no analytic model equation
34
Q

decision trees vs. GLMs: numeric predictors

A

GLMs:

  • assumes that the effect of a numeric predictor on the target variable is monotonic, and assigns it a single regression coefficient
  • non-linear effects are captured by adding higher order terms

Trees:

  • splits are made separating the training observations into distinct groups according to the ordered values of a numeric predictor
  • the predicted means (reg) or probabilities (class) behave irregularly as a function of the numeric predictor
  • no monotonic structure. trees excel in handling non-linear relationships
35
Q

decision trees vs. GLMs: categorical predictors

A

GLMs:
- categorical predictor has to be binarized and converted to dummy variables. a coefficient is assigned to the dummy variable for each non-baseline level

Trees:

  • splits are made by directly separating the levels of a categorical predictor into two groups
  • no binarization needed
36
Q

decision trees vs. GLMs: interaction

A

GLMs:
- cannot take care of interactions unless they are manually inserted into the equation

Trees:

  • in the recursive binary splitting algorithm, each subsequent tree split affects only a subset of the feature space.
  • when the splits are based on different predictors, an interaction effect is automatically modeled.
37
Q

decision trees vs. GLMs: collinearity

A

GLMs:

  • vulnerable to the presence of highly correlated predictors
  • highly correlated predictors can inflate the variance of the coefficient estimates and model predictions, and make it hard to interpret the coefficients meaningfully

Trees:
- less of a problem for trees and will not derail the algorithm, even if perfect collinearity exists

38
Q

What are the pros of decision trees, in terms of interpretability?

A

as long as there aren’t too many terminal nodes, the trees are easy to interpret and explain to non-technical audiences because of the transparent nature of the if/then splits.

  • can be displayed graphically as well
  • we can easily see which are the most influential predictors (those that appear at the top splits)
39
Q

What are the pros of decision trees, in terms of non-linear relationships? what about skewed variables?

A

trees excel in accommodating non-linear relationships.
no polynomial terms need to be supplied; they would not alter the tree anyway, because splits based on a predictor and on a monotonic transformation of it (e.g., X < 10 vs. X^2 < 100 for a non-negative X) separate the observations identically

they can take care of skewed variables without the need for variable transformations. again, any monotonic transformation will not change the resulting tree

40
Q

What are the pros of decision trees, in terms of interactions?

A

trees automatically recognize interactions between variables. no need to identify them prior to fitting the tree.
- this is why feature generation is not an issue in trees

41
Q

What are the pros of decision trees, in terms of categorical variables?

A

automatically handled without need for binarization

42
Q

What are the pros of decision trees, in terms of variable selection?

A

variables are automatically selected by trees. most important variables show at the top of the trees

GLMs retain all variables unless we perform stepwise selection or regularization

43
Q

What are the cons of decision trees, in terms of overfitting?

A

trees are more prone to overfitting than GLMs and therefore tend to produce unstable predictions with high variance (even with pruning)

44
Q

What are the cons of decision trees, in terms of numeric variables?

A

to capture the effects of a numeric variable effectively, we need to make the tree splits based on this variable repeatedly. This leads to a complex tree

a GLM only needs a single coefficient to be estimated for a numeric variable

45
Q

What are the cons of decision trees, in terms of categorical predictors?

A

trees tend to favor categorical predictors with a large number of levels over those with few levels, because there are many more ways to split a categorical predictor that has a large number of levels.

combining levels prior to fitting a tree can help with this.

46
Q

What are the cons of decision trees, in terms of model diagnosis?

A

unlike GLMs, there is a lack of model diagnostic tools for decision trees

47
Q

what are ensemble methods?

A

they allow us to hedge against the instability of decision trees and substantially improve their predictive performance in many cases.

these methods produce multiple base models, instead of relying on a single model and combine the results of these base models to make predictions

48
Q

how are ensemble methods successful in terms of bias and variance?

A

bias: capturing complex relationships in the data due to the use of multiple base models each working on different parts of the complex relationships
variance: reducing the variability of the predictions produced because the variance of an average is smaller than that of one component.

49
Q

what is the drawback of ensemble methods?

A

computationally prohibitive to implement and difficult to interpret

50
Q

what is the basic idea behind a random forest?

A

generating multiple bootstrapped samples of the training set and fitting unpruned base trees in parallel, independently, on each of the bootstrapped training samples. the results from these base trees are then combined to form an overall prediction

bootstrapped = random sampling with replacement

51
Q

what is the trick to use when trying to remember what a random forest is?

A

remember that a forest has many many trees

52
Q

what is the distinguishing characteristic of a random forest?

A

randomization at each step.

at each split, a random sample of m predictors is chosen as the split candidates out of the p available features. among these m predictors, the one that contributes the greatest reduction in impurity is used to construct the split.

a new random sample of m predictors is considered at each split

53
Q

what is the default choice for m in random forest for classification? for regression?

A

class: m = sqrt(p)
reg: m = p / 3
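a minimal R sketch of these defaults, assuming the randomForest package (which these cards do not otherwise reference) and illustrative dataset/column names (train_c, train_r, target):

library(randomForest)  # assumption: the randomForest package is installed

# classification (train_c$target assumed to be a factor):
# the default mtry, which plays the role of m, is floor(sqrt(p))
rf_class <- randomForest(target ~ ., data = train_c, ntree = 500)

# regression (train_r$target assumed numeric): the default mtry is floor(p / 3),
# written out explicitly here just for illustration
p <- ncol(train_r) - 1
rf_reg <- randomForest(target ~ ., data = train_r, ntree = 500, mtry = floor(p / 3))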

54
Q

can we tune m in random forests?

A

yes using c-v

55
Q

pros (1) and cons (2) to a random forest?

A

pro:
- much more robust than a single tree. although the B base trees are unpruned and therefore have low bias and high variance, averaging the results of these trees results in a substantial amount of variance reduction. (but bias is never really reduced in random forests)

cons:

  1. interpretability: consist of potentially hundreds or thousands of decision trees and can't really be visualized like a single tree. not as easy to interpret as a coefficient in a GLM or a split in a single tree
  2. computational power: takes a considerably longer time to implement a random forest compared to a base decision tree
56
Q

what principle does boosting work on?

A

the principle of sequential learning

57
Q

what is the main idea behind boosting?

A

it creates a sequence of inter-dependent trees using information from previously grown trees.

in each iteration of the algorithm, we fit a tree to the residuals of the preceding tree and a scaled down version of the current tree’s predictions is subtracted from the preceding tree’s residuals to form new residuals (watch srm video on this)

58
Q

what are the two key differences between boosting and random forests?

A
  1. (in parallel vs. in series) while the base trees in a random forest are fitted in parallel, independently of one another, the fitting of the base trees in a boosted model is done in series
  2. (bias vs. variance) random forests address model variance, while boosting focuses on reducing model bias
59
Q

what are the two prominent features of the boosting algorithm?

A
  1. each base tree is fitted to the residuals of the previous base tree, rather than the original target variable. the construction of each base tree in a boosted model strongly depends on the trees that have been grown (unlike random forest)
  2. the prediction contributed in the bth round is (shrinkage parameter) × (the bth tree's prediction), so only a scaled-down version of each new tree is added to the model.
    the reason behind using a shrinkage parameter is to slow down the learning process (a hedged fitting sketch follows this card).
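a minimal, hedged R sketch of fitting a boosted tree model with a shrinkage parameter, assuming the gbm package (not referenced elsewhere in these cards) and illustrative data/variable names:

library(gbm)  # assumption: the gbm package is installed

boost_fit <- gbm(target ~ ., data = train,
                 distribution = "gaussian",  # squared-error loss for a numeric target
                 n.trees = 1000,             # number of boosting rounds (base trees)
                 interaction.depth = 2,      # depth of each base tree
                 shrinkage = 0.01)           # scales down each tree's contribution, slowing the learning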
60
Q

what are the pros and cons of boosted trees vs random forest

A

boosted trees perform better in terms of prediction accuracy than random forests due to the bias reduction. but they are more vulnerable to overfitting

61
Q

what are the pros and cons of boosted trees vs base trees?

A

gain in prediction performance as a result of the use of boosting comes at the cost of a loss of interpretability and computational efficiency. boosting requires that model fitting be performed many times, which entails significant computational cost

62
Q

what package needs to be installed when creating a tree?

A

rpart and rpart.plot (so we can get a visual representation)

63
Q

in the rpart function, what are the two choices for the “method” argument?

A
method = "anova" - regression trees
method = "class" - classification trees
64
Q

what arguments does the rpart() function take, omitting the control parameters?

A
  1. formula (same as glm)
  2. data (same as glm)
  3. method
  4. control
65
Q

when creating a tree, what arguments are meant to be used in the
< control = rpart.control() > function?

A

minsplit: min # of observations that must exist in a node in order for a split to be attempted. lower value = more complex tree
minbucket: min # of observations in any terminal node (bucket)

cp (complexity parameter): refers to the cp in cost complexity pruning.
- when a split is made, the tree size grows by 1 and the penalty term increases by cp. so, if a split does not decrease the relative training error by at least cp, it will not be performed.

maxdepth: max number of branches (splits) from the root node to any bucket

(an example call with these arguments is sketched below)
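a hedged example call putting these control parameters together (the formula and dataset names target and train are illustrative; the argument names are rpart.control()'s own):

library(rpart)

tree_fit <- rpart(target ~ ., data = train, method = "anova",
                  control = rpart.control(minsplit = 20,   # min obs in a node to attempt a split
                                          minbucket = 7,   # min obs in any terminal node (bucket)
                                          cp = 0.01,       # complexity parameter
                                          maxdepth = 4))   # max # of splits from root to any bucket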

66
Q

what are the other parameters of the rpart() function not related to complexity?

A
  1. xval = number of folds used when doing cross-validation
  2. parms
    - specific to classification trees (categorical target variable)
    - describes the impurity measure used for splitting
    parms = list(split = "gini")
    parms = list(split = "information") - entropy
67
Q

what happens if we type the name of an rpart object?

A

we will get a condensed printout of how all the splits are performed

68
Q

what happens when we extract the cptable component of an rpart object?

A

a matrix that shows, for each cutoff value of the complexity parameter, the number of splits of the tree constructed, the relative training error, the cv error (xerror column) and the standard error of the c-v error (xstd)

the "x" prefix in xerror and xstd refers to cross-validation (xval = number of folds used in the c-v)

69
Q

how do you choose the best cp based on the cptable?

A

the value that gives rise to the smallest c-v error

70
Q

what function do you use to prune a tree?

A

prune()

71
Q

how could you extract, from the cptable, the cp value corresponding to the smallest c-v error?

A

cp.min <- rpartobject$cptable[which.min(rpartobject$cptable[, "xerror"]), "CP"]

72
Q

how to use the prune() function to prune a tree?

A

prune( rpartobject, cp = cp.min)

cp.min extracted using the "which.min" method
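putting cards 68 through 72 together, a minimal end-to-end sketch (object and variable names are illustrative):

library(rpart)

# grow a large tree first (cp = 0 imposes no penalty while growing)
tree_full <- rpart(target ~ ., data = train, method = "anova",
                   control = rpart.control(cp = 0, xval = 10))

# cp value whose row of the cptable has the smallest cross-validation error
cp.min <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]

# prune back to the subtree corresponding to that cp
tree_pruned <- prune(tree_full, cp = cp.min)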