Chapter 5: Decision Trees Flashcards

1
Q

can decision trees be used for both regression and classification problems?

A

yes

2
Q

how to make a prediction within a decision tree?

A

locate the region (terminal node) to which the observation belongs, and use the average (regression) or the majority class (classification) of the target variable in that region as the prediction

3
Q

what is a node? what are the two types of nodes?

A

a point on a decision tree that corresponds to a subset of the training data.

  1. root node: node at the top of the tree, representing the full dataset
  2. terminal nodes (buckets): nodes at the bottom of the tree with no further splits; each is an endpoint of the tree

4
Q

what is a measure of complexity for a tree?

A

depth.
the number of tree splits needed to go from the tree’s root to the furthest terminal node.

larger depth = more complex

5
Q

T/F: When making a split in a tree, we separate the training observations into two subgroups according to the value or level of one and only one predictor at a time.

A

True

6
Q

every time we make a binary split, there are two related decisions that we have to make:

A
  1. which predictor to split on

  2. given the chosen split predictor, the corresponding cutoff value

7
Q

what is the goal for terminal nodes?

A

to create terminal nodes that contain similar observations (that are easy to predict and analyze)

8
Q

what is the measure of impurity for a regression tree?

A

RSS
- the lower the RSS, the more homogeneous the target observations are.

write out the equation for the RSS of a node m.
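for reference, one standard way to write it (notation assumed here: \bar{y}_{R_m} is the mean of the target variable over the training observations in node R_m):

RSS_m = \sum_{i \in R_m} ( y_i - \bar{y}_{R_m} )^2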

9
Q

what are the three measures of node impurity for classification trees? do we want a large or small value?

A
  1. classification error rate
  2. gini index
  3. entropy

we want a smaller value = more homogeneous node (think of the classification error rate)

10
Q

in terms of classification trees, what does p[m,k] mean? what is the formula?

A

it is the proportion of target observations in terminal node m (Rm) that belong to the kth class

p[m,k] = # of training observations of kth class in Rm / total # of observations in Rm
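the same formula in LaTeX form (n_m, the number of training observations in R_m, is notation introduced just for this card):

p_{m,k} = \frac{ \#\{ i \in R_m : y_i = k \} }{ n_m }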

11
Q

what is the classification error rate? what is the formula?

A

it is the proportion of misclassified observations in a terminal node m.

E[m] = 1 - max p[m,k]

recall, max p[m,k] is the proportion of observations in Rm belonging to the most common level of the target variable. these observations are correctly classified in that terminal node. then, the complement of that is the proportion of observations that are misclassified
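a quick worked illustration (class proportions are made up): if a terminal node has proportions p[m,1] = 0.7, p[m,2] = 0.2, p[m,3] = 0.1, then

E_m = 1 - \max(0.7, 0.2, 0.1) = 1 - 0.7 = 0.3

i.e., 30% of the training observations in that node are misclassified.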

12
Q

what does it mean when the classification error rate is close to 0?

A

this means that almost all of the training observations in Rm are correctly classified

13
Q

what is a common criticism against the classification error rate?

A

that it is not sensitive enough to node purity in the sense that it depends on the class proportions only through the maximum

14
Q

which two measures capture node impurity more effectively?

A

gini index and entropy

15
Q

what is the gini index? what is the formula for it?

A

the gini index is a measure of variance across the K classes of the categorical target variable.

high degree of purity = lower gini index (because more variability = less purity)

write out the formula
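for reference, one standard form of the gini index for node m (using p[m,k] as defined in card 10):

G_m = \sum_{k=1}^{K} p_{m,k} ( 1 - p_{m,k} )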

16
Q

when does gini index achieve its minimum/maximum value?

A

max value: if the observations are evenly spread across the K classes (p[m,k] = 1/K for every k)

min value: if p[m,k] = 1 for some k (all observations in the node belong to a single class)
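a quick check of both extremes for a binary target (K = 2), with illustrative proportions:

evenly spread, p = (0.5, 0.5): G_m = 0.5(0.5) + 0.5(0.5) = 0.5 (the maximum for K = 2)
pure node, p = (1, 0): G_m = 1(0) + 0(1) = 0 (the minimum)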

17
Q

what is the formula for entropy?

A

write it out

the entropy increases with the degree of impurity in the node
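for reference, one standard form (the log base varies by convention, commonly base 2 or the natural log, and 0 log 0 is treated as 0):

D_m = - \sum_{k=1}^{K} p_{m,k} \log p_{m,k}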

18
Q

draw the graph of all 3 impurity measures

A

draw

19
Q

all three impurity measures behave similarly, when do all three achieve their max/min value?

A

for a binary target, where p is the proportion of observations in one of the two classes:
max: p = 0.5 (when the node is most impure)
min: p = 0 or 1 (when the node is most pure)

20
Q

what is recursive binary splitting? what is the first step?

A
  1. decide on an impurity measure
  2. now we can construct the first binary split with the goal of minimizing this impurity measure.
  3. after the first split, there will be 2 regions in the predictor space. then, we look for the best predictor and corresponding cutoff point for splitting one of the two regions. (could be the same predictor used in the first split)
  4. this continues recursively….
21
Q

T/F: recursive binary splitting is a greedy algorithm.

A

true: in each step of the building process, we make the split that is currently the best, instead of looking ahead and selecting a split that results in a better tree in a future step

22
Q

how do we control the complexity of a tree?

A

cost complexity pruning

23
Q

what is cost complexity pruning?

A

decision trees can easily be made too complex, and therefore overfit quite easily in an attempt to capture not only the signal but also the noise in the training data

the idea behind cost complexity pruning is that we grow a very large and complex tree, and then prune it back to get a subtree. this algorithm is parameterized by a complexity parameter.

24
Q

in cost complexity pruning, what is the penalized objective function?

A

relative training error + complexity parameter * # of terminal nodes
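in symbols (|T| for the number of terminal nodes and c_p for the complexity parameter are notation chosen just for this card):

penalized objective = relative training error + c_p \times |T|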

25
Q

in cost complexity pruning, what is the relative training error in the penalized objective function?

A

it is the training error of the tree divided by (i.e., scaled by) the training error of the simplest tree (the tree with no splits, i.e., the root node only)

26
Q

in cost complexity pruning, what is the penalty term?

A

complexity parameter × # of terminal nodes

this is in place to prevent overfitting. The role played by the complexity parameter is the same as the role played by lambda in regularization

27
Q

cost complexity pruning: what happens if the complexity parameter (cp) =0?

A

there is no price we have to pay for making the tree complex and the fitted tree is identical to the large, overblown tree without pruning

28
Q

cost complexity pruning: what happens as cp increases from 0 to 1?

A

the penalty term becomes greater and tree branches are snipped off (from the bottom of the tree up) to form new, larger terminal nodes.

the squared bias of the tree predictions will increase, but the variance will drop

if cp = 1, the tree is pruned all the way back to the root node only.

29
Q

cost complexity pruning: what does nested family of trees mean?

A

if c1 and c2 are values of cp with c1 < c2, then the tree fitted with c2 (the larger value) is a subtree of (i.e., nested within) the tree fitted with c1.
by varying the cp parameter, we can trace out the whole family of trees, ranging from the largest (most complex) tree to the tree with the root node only

30
Q

cost complexity pruning: when minimizing the cp, the search is performed subject to other tree control parameters. what are they?

A

minimum observations in a node and maximum depth level

31
Q

cost complexity pruning: can we hyperparameter tune cp? explain how.

A

yes. using k-fold c-v.

  1. for a given cp value, we fit a tree on all but 1 of the k folds. generate predictions on the left-out fold and measure the prediction performance, reflected by the RSS (regression) or the classification error rate (classification)
  2. repeat this process for all of the k folds
  3. the cp value that gives rise to the best performance is chosen
32
Q

compare and contrast GLMs and decision trees. what are the 5 areas that we can compare the two?

A
output
numeric predictors
categorical predictors
interaction
collinearity
33
Q

decision trees vs. GLMs: output.

A

GLMs:
- produces a closed-form equation showing how the target mean depends on the feature values, with the coefficients indicating the magnitude and the direction of the effects of the features on the target variable

Trees:

  • a set of if-else splitting rules showing how the target mean varies with the feature values.
  • no analytic model equation
34
Q

decision trees vs. GLMs: numeric predictors

A

GLMs:

  • assumes that the effect of a numeric predictor on the target variable is monotonic, and assigns it a single regression coefficient
  • non-linear effects are captured by adding higher order terms

Trees:

  • splits are made separating the training observations into distinct groups according to the ordered values of a numeric predictor
  • the predicted means (reg) or probabilities (class) behave irregularly as a function of the numeric predictor
  • no monotonic structure. trees excel in handling non-linear relationships
35
Q

decision trees vs. GLMs: categorical predictors

A

GLMs:
- categorical predictor has to be binarized and converted to dummy variables. a coefficient is assigned to the dummy variable for each non-baseline level

Trees:

  • splits are made by directly separating the levels of a categorical predictor into two groups
  • no binarization needed
36
Q

decision trees vs. GLMs: interaction

A

GLMs:
- cannot take care of interactions unless they are manually inserted into the equation

Trees:

  • in the recursive binary splitting algorithm, each subsequent tree split affects only a subset of the feature space.
  • when the splits are based on different predictors, an interaction effect is automatically modeled.
37
Q

decision trees vs. GLMs: collinearity

A

GLMs:

  • vulnerable to the presence of highly correlated predictors
  • highly correlated predictors can inflate the variance of the coefficient estimates and model predictions, and make it hard to interpret the coefficients meaningfully

Trees:
- less of a problem for trees and will not derail the algorithm, even if perfect collinearity exists

38
Q

What are the pros of decision trees, in terms of interpretability?

A

as long as there aren’t too many terminal nodes, the trees are easy to interpret and explain to non-technical audiences because of the transparent nature of the if/then splits.

  • can be displayed graphically as well
  • we can easily see which are the most influential predictors (those that appear at the top splits)
39
Q

What are the pros of decision trees, in terms of non-linear relationships? what about skewed variables?

A

trees excel in accommodating non-linear relationships.
no polynomial terms need to be supplied; they would not alter the tree anyway, because splits based on a predictor and on a monotonic transformation of it (e.g., X < 10 vs. X^2 < 100 for a non-negative X) separate the observations identically

they can take care of skewed variables without the need for variable transformations. again, any monotonic transformation will not change the resulting tree

40
Q

What are the pros of decision trees, in terms of interactions?

A

trees automatically recognize interactions between variables. no need to identify them prior to fitting the tree.
- this is why feature generation is not an issue in trees

41
Q

What are the pros of decision trees, in terms of categorical variables?

A

automatically handled without need for binarization

42
Q

What are the pros of decision trees, in terms of variable selection?

A

variables are automatically selected by trees. most important variables show at the top of the trees

GLMs retain all variables unless we perform stepwise selection or regularization

43
Q

What are the cons of decision trees, in terms of overfitting?

A

trees are more prone to overfitting than GLMs and therefore tend to produce unstable predictions with high variance (even with pruning)

44
Q

What are the cons of decision trees, in terms of numeric variables?

A

to capture the effects of a numeric variable effectively, we need to make the tree splits based on this variable repeatedly. This leads to a complex tree

a GLM only needs a single coefficient to be estimated for a numeric variable

45
Q

What are the cons of decision trees, in terms of categorical predictors?

A

trees tend to favor categorical predictors with a large number of levels over those with few levels, because there are many more ways to split a categorical predictor that has a large number of levels.

combining levels prior to fitting a tree can help with this.

46
Q

What are the cons of decision trees, in terms of model diagnosis?

A

unlike GLMs, there is a lack of model diagnostic tools for decision trees

47
Q

what are ensemble methods?

A

they allow us to hedge against the instability of decision trees and substantially improve their predictive performance in many cases.

these methods produce multiple base models, instead of relying on a single model and combine the results of these base models to make predictions

48
Q

how are ensemble methods successful in terms of bias and variance?

A

bias: capturing complex relationships in the data due to the use of multiple base models each working on different parts of the complex relationships
variance: reducing the variability of the predictions produced because the variance of an average is smaller than that of one component.

49
Q

what is the drawback of ensemble methods?

A

computationally prohibitive to implement and difficult to interpret

50
Q

what is the basic idea behind a random forest?

A

generating multiple bootstrapped samples of the training set and fitting unpruned base trees in parallel, independently, on each of the bootstrapped training samples. the results from these base trees are then combined to form an overall prediction

bootstrapped = random sampling with replacement

51
Q

what is the trick to use when trying to remember what a random forest is?

A

remember that a forest has many many trees

52
Q

what is the distinguishing characteristic of a random forest?

A

randomization at each step.

at each split, a random sample of m predictors is chosen as the split candidates out of the p available features. among these m predictors, the one that contributes the greatest reduction in impurity is used to construct the split.

a new random sample of m predictors is considered at each split

53
Q

what is the default choice for m in random forest for classification? for regression?

A

class: m = sqrt(p)
reg: m = p / 3
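a minimal R sketch of these defaults, assuming the randomForest package (which these cards do not otherwise reference) and illustrative dataset/column names (train_c, train_r, target):

library(randomForest)  # assumption: the randomForest package is installed

# classification (train_c$target assumed to be a factor):
# the default mtry, which plays the role of m, is floor(sqrt(p))
rf_class <- randomForest(target ~ ., data = train_c, ntree = 500)

# regression (train_r$target assumed numeric): the default mtry is floor(p / 3),
# written out explicitly here just for illustration
p <- ncol(train_r) - 1
rf_reg <- randomForest(target ~ ., data = train_r, ntree = 500, mtry = floor(p / 3))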

54
Q

can we tune m in random forests?

A

yes using c-v

55
Q

pros (1) and cons (2) to a random forest?

A

pro:
- much more robust than a single tree. although the B base trees are unpruned and therefore have low bias and high variance, averaging the results of these trees results in a substantial amount of variance reduction. (but bias is never really reduced in random forests)

cons:

  1. interpretability: consist of potentially hundreds or thousands of decision trees and can't really be visualized like a single tree. not as easy to interpret as a coefficient in a GLM or a split in a single tree
  2. computational power: takes a considerably longer time to implement a random forest compared to a base decision tree
56
Q

what principle does boosting work on?

A

the principle of sequential learning

57
Q

what is the main idea behind boosting?

A

it creates a sequence of inter-dependent trees using information from previously grown trees.

in each iteration of the algorithm, we fit a tree to the residuals of the preceding tree and a scaled down version of the current tree’s predictions is subtracted from the preceding tree’s residuals to form new residuals (watch srm video on this)

58
Q

what are the two key differences between boosting and random forests?

A
  1. (in parallel vs. in series) while the base trees in a random forest are fitted in parallel, independently of one another, the fitting of the base trees in a boosted model is done in series
  2. (bias vs. variance) random forests address model variance, while boosting focuses on reducing model bias
59
Q

what are the two prominent features of the boosting algorithm?

A
  1. each base tree is fitted to the residuals of the previous base tree, rather than the original target variable. the construction of each base tree in a boosted model strongly depends on the trees that have been grown (unlike random forest)
  2. the prediction contributed in the bth round is (shrinkage parameter) × (the bth tree's prediction), so only a scaled-down version of each new tree is added to the model.
    the reason behind using a shrinkage parameter is to slow down the learning process (a hedged fitting sketch follows this card).
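a minimal, hedged R sketch of fitting a boosted tree model with a shrinkage parameter, assuming the gbm package (not referenced elsewhere in these cards) and illustrative data/variable names:

library(gbm)  # assumption: the gbm package is installed

boost_fit <- gbm(target ~ ., data = train,
                 distribution = "gaussian",  # squared-error loss for a numeric target
                 n.trees = 1000,             # number of boosting rounds (base trees)
                 interaction.depth = 2,      # depth of each base tree
                 shrinkage = 0.01)           # scales down each tree's contribution, slowing the learning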
60
Q

what are the pros and cons of boosted trees vs random forest

A

boosted trees perform better in terms of prediction accuracy than random forests due to the bias reduction. but they are more vulnerable to overfitting

61
Q

what are the pros and cons of boosted trees vs base trees?

A

gain in prediction performance as a result of the use of boosting comes at the cost of a loss of interpretability and computational efficiency. boosting requires that model fitting be performed many times, which entails significant computational cost

62
Q

what package needs to be installed when creating a tree?

A

rpart and rpart.plot (so we can get a visual representation)

63
Q

in the rpart function, what are the two choices for the “method” argument?

A
method = "anova" - regression trees
method = "class" - classification trees
64
Q

what arguments does the rpart() function take, omitting the control parameters?

A
  1. formula (same as glm)
  2. data (same as glm)
  3. method
  4. control
65
Q

when creating a tree, what arguments are meant to be used in the
< control = rpart.control() > function?

A

minsplit: min # of observations that must exist in a node in order for a split to be attempted. lower value = more complex tree
minbucket: min # of observations in any terminal node (bucket)

cp (complexity parameter): refers to the cp in cost complexity pruning.
- when a split is made, the tree size grows by 1 and the penalty term increases by cp. so, if a split does not decrease the relative training error by at least cp, it will not be performed.

maxdepth: max number of branches (splits) from the root node to any bucket

(an example call with these arguments is sketched below)
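a hedged example call putting these control parameters together (the formula and dataset names target and train are illustrative; the argument names are rpart.control()'s own):

library(rpart)

tree_fit <- rpart(target ~ ., data = train, method = "anova",
                  control = rpart.control(minsplit = 20,   # min obs in a node to attempt a split
                                          minbucket = 7,   # min obs in any terminal node (bucket)
                                          cp = 0.01,       # complexity parameter
                                          maxdepth = 4))   # max # of splits from root to any bucket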

66
Q

what are the other parameters of the rpart() function not related to complexity?

A
  1. xval = number of folds used when doing cross-validation
  2. parms
    - specific to classification trees (categorical target variable)
    - describes the impurity measure used for splitting
    parms = list(split = "gini")
    parms = list(split = "information") - entropy
67
Q

what happens if we type the name of an rpart object?

A

we will get a condensed printout of how all the splits are performed

68
Q

what happens when we extract the cptable component of an rpart object?

A

a matrix that shows, for each cutoff value of the complexity parameter, the number of splits of the tree constructed, the relative training error, the cv error (xerror column) and the standard error of the c-v error (xstd)

the "x" prefix in xerror and xstd refers to cross-validation (xval = number of folds used in the c-v)

69
Q

how do you choose the best cp based on the cptable?

A

the value that gives rise to the smallest c-v error

70
Q

what function do you use to prune a tree?

A

prune()

71
Q

how could you extract, from the cptable, the cp value corresponding to the smallest c-v error?

A

cp.min <- rpartobject$cptable[which.min(rpartobject$cptable[, "xerror"]), "CP"]

72
Q

how to use the prune() function to prune a tree?

A

prune( rpartobject, cp = cp.min)

cp.min extracted using the "which.min" method
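putting cards 68 through 72 together, a minimal end-to-end sketch (object and variable names are illustrative):

library(rpart)

# grow a large tree first (cp = 0 imposes no penalty while growing)
tree_full <- rpart(target ~ ., data = train, method = "anova",
                   control = rpart.control(cp = 0, xval = 10))

# cp value whose row of the cptable has the smallest cross-validation error
cp.min <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]

# prune back to the subtree corresponding to that cp
tree_pruned <- prune(tree_full, cp = cp.min)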