Chapter 5 - Decision Trees Flashcards

1
Q

Decision Trees (5)

A

A decision tree divides the feature space (i.e., the space of all possible combinations of feature values or levels) into a finite number of non-overlapping regions containing relatively homogeneous observations that are more amenable to analysis and prediction.

To predict a given observation, simply locate the region to which it belongs and use the average value (mean; for regression problems) or the most common class (mode; for classification problems) of the target variable in that region.
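
A minimal sketch of this idea (my own illustration, assuming scikit-learn and NumPy; the chapter does not prescribe a library): a fitted regression tree partitions the feature space, and predicting returns the mean of the training target in the region (leaf) an observation falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                          # two numeric features
y = np.where(X[:, 0] > 5, 30.0, 10.0) + rng.normal(0, 2, 200)  # step-like signal plus noise

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
# The prediction for a new observation is the mean of y in its region (leaf).
print(tree.predict([[7.0, 3.0]]))   # close to 30, the mean of that region
```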

2
Q

Decision Tree Terminology (3; 5.1.1)

A

1) Node - a point on a decision tree that corresponds to a subset of the feature space
a) Root node - node at the top of a decision tree representing the full feature space and containing the entire training set
b) Terminal nodes (leaves) - nodes at the bottom of the tree which are not split further and constitute an end point

2) Binary tree - decision tree in which each parent node is split into two child nodes

3) Depth - number of tree splits needed to go from the root node to the furthest terminal node; measure of complexity of a decision tree. A tree is balanced if the left and right subtrees coming out of any node differ in depth by at most one.

3
Q

Recursive binary splitting - definition (2; 5.1.1)

A

1) It is customary to split the feature space using binary splits on one and only one predictor at a time. Decisions for each split include:
a) Which predictor to use
b) What cutoff value to use (for numerical predictors) or what partition of possible levels (for categorical predictors)

2) Node impurity - how dissimilar the training observations in a given node are. Goal is to reduce impurity (or, equivalently, maximize information gain).
a) Regression trees - use the RSS of the target variable around the node's target mean (sum of squared differences between each observation and the node mean).
b) Classification trees - see the next card for impurity measures.

4
Q

Classification tree impurity measures (5.1.1)

A

1) Classification error rate - proportion of node observations not in the modal class
a) Not sensitive enough to node purity because it depends on the class proportions only through the maximum (i.e., two nodes with the same maximum class proportion have the same classification error rate, no matter how the remaining observations are distributed)

2) Gini index - G.m = SUM(k = 1 to K) [p.mk * (1 - p.mk)]
K = number of classes
p.mk = proportion of observations in node R.m belonging to class k
a) Lower the better; min 0 (all observations in one class), max 1 - 1/K (observations spread evenly across all K classes)
b) Default impurity measure

3) Entropy - D.m = - SUM(k = 1 to K) [p.mk * log2(p.mk)]
a) Lower the better, min 0 (all observations in one class), max log2(K) (all observations evenly spread amongst all classes)
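
The three measures can be computed directly from the class proportions in a node. A small illustrative helper (mine, not from the source text), assuming NumPy:

```python
import numpy as np

def node_impurities(p):
    """Impurity measures for a node with class proportions p (summing to 1)."""
    p = np.asarray(p, dtype=float)
    error = 1 - p.max()                                 # classification error rate
    gini = np.sum(p * (1 - p))                          # Gini index
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))     # entropy, with 0 * log2(0) := 0
    return error, gini, entropy

print(node_impurities([1.0, 0.0]))  # pure node: (0.0, 0.0, 0.0)
print(node_impurities([0.5, 0.5]))  # K = 2, even spread: (0.5, 0.5, 1.0) = (1 - 1/K, 1 - 1/K, log2 K)
```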

5
Q

Recursive binary splitting - process (5.1.1)

A

1) First split - from the root node, choose the predictor and cutoff value that minimize the overall RSS of the two sub-nodes (regression trees) or the average of the sub-node impurity measures, weighted by the number of observations in each sub-node (classification trees)
2) Next split - find the best predictor and cutoff value to split one of the resulting sub-regions so as to further reduce the impurity of the tree
3) Subsequent splits - recursively repeat the process (each split operates on the results of previous splits) until a stopping criterion is met (e.g., no split leads to a meaningful decrease in impurity, or the number of observations in each terminal node falls below some specified threshold)

4) “Top-down” algorithm - starting at top, work our way down tree
5) “Greedy” algorithm - technical term that means at each step, we select the split that leads to the greatest reduction in impurity at that point (not looking ahead to construct a better tree down the line)
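
To make the greedy search concrete, here is a bare-bones sketch (illustrative only, assuming NumPy; real implementations are far more efficient) of how a single split is chosen for a regression tree: every predictor/cutoff pair is scored by the total RSS of the two sub-nodes, and the best pair wins.

```python
import numpy as np

def best_split(X, y):
    """Return (feature index, cutoff) minimizing the total RSS of the two sub-nodes."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for cutoff in np.unique(X[:, j])[:-1]:           # candidate cutoffs for predictor j
            left, right = y[X[:, j] <= cutoff], y[X[:, j] > cutoff]
            rss = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
            if rss < best[2]:
                best = (j, cutoff, rss)
    return best[:2]

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = np.where(X[:, 0] > 4, 20.0, 5.0) + rng.normal(0, 1, 100)
print(best_split(X, y))   # should pick feature 0 with a cutoff near 4
```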

6
Q

Tree pruning methods (3; 5.1.1)

A

1) Pruning is a means to control complexity of the tree (recursive binary splitting can easily lead to an overfitted tree)

2) Set an impurity reduction threshold and only make a split if the threshold is met
a) Disadvantage: because the algorithm is greedy, a split offering little immediate improvement (and thus rejected) might have enabled an excellent split in a later iteration

3) Cost complexity pruning - first grow a large tree, then prune the tree using this method (see next card)

7
Q

Cost complexity pruning (5.1.1)

A

Trade off the goodness of fit of the tree to the training data against the size of the tree using the penalized objective function:

Relative training error + c.p * |T|

Relative training error (regression trees) = RSS/TSS
Relative training error (classification trees) = misclassifications made by the tree / misclassifications made by always predicting the majority class
c.p = complexity parameter (chosen beforehand, between 0 {no penalty, most elaborate tree} and 1 {maximum penalty, results in a tree with no splits, i.e., the root node only})
|T| = number of terminal nodes

Notes
a) As c.p increases, the resulting trees are nested: each is a subtree of the full, most elaborate tree
b) c.p can be tuned using CV
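
A quick numerical illustration of the penalized objective (the numbers are hypothetical, not from the source): with c.p = 0.02, a smaller 4-leaf tree beats a larger 8-leaf tree even though the larger tree fits the training data better.

```python
def penalized_objective(rel_train_error, cp, n_terminal_nodes):
    # relative training error + c.p * |T|
    return rel_train_error + cp * n_terminal_nodes

print(penalized_objective(0.30, cp=0.02, n_terminal_nodes=8))  # 0.30 + 0.16 = 0.46
print(penalized_objective(0.35, cp=0.02, n_terminal_nodes=4))  # 0.35 + 0.08 = 0.43  <- preferred
```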

8
Q

Handling of predictors in decision trees vs GLMs (5; 5.1.1)

A

1) Numerical
a) GLMs - The effect of a predictor on the target mean is monotonic, and can only be expressed as non-monotonic through the use of additional features (such as polynomial additions)
b) Decision trees - predicted means or probabilities are allowed to vary irregularly in different subsets of the feature space, allowing for more complex relationships between target variable and predictor to be captured

2) Categorical
a) GLMs - categorical variables must first be binarized, with each level receiving a coefficient
b) Decision trees - categorical variables should not be binarized, as a split can occur in a binary fashion (split a | b, c, d; i.e. not a) or by combining different levels into one subset (split a, b | c, d)

3) Interactions
a) GLMs - interactions must manually be added as additional features
b) Decision trees - each subsequent node splits a subset of the feature space. When splits are based on different predictors, an interaction effect is automatically modeled

4) Collinearity
a) GLMs - vulnerable to the presence of highly correlated variables, which can substantially increase variance
b) Decision trees - less of a problem and will not derail the algorithm, but when predictors are highly correlated, which one is selected for a split can be rather arbitrary, which can skew variable importance scores.

5) Variable transformations - can affect both a fitted GLM and a fitted decision tree, with one notable exception for decision trees: monotonic transformations of numeric PREDICTORS will not affect a decision tree. Therefore, decision trees can handle non-linearity and skewness without the need for transformations.
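
A quick empirical check of the last point (my own sketch, assuming scikit-learn; ties aside, a strictly monotonic transformation of a predictor changes only the cutoff values, not the fitted partitions or predictions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(300, 1))                          # one numeric predictor
y = np.where(X[:, 0] > 40, 8.0, 2.0) + rng.normal(0, 0.5, 300)

t_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log(X), y)

# Same predictions on the training data despite the log transform (expected: True)
print(np.allclose(t_raw.predict(X), t_log.predict(np.log(X))))
```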

9
Q

Pros of decision trees (5; 5.1.1)

A

1) Interpretability - so long as there are not too many buckets, trees are easy to interpret and explain due to their if/then logic. They can also be represented graphically, increasing their interpretability.
2) Handling of complex relationships between target variable and predictors - they excel in automatically recognizing non-linear and non-additive (i.e., interactions) relationships
3) Categorical variables do not need to be binarized or have a baseline level selected
4) Variable selection - trees automatically select variables. Most important variables show up at top of tree and unimportant ones are filtered out.
5) Can deal with missing data rather easily

10
Q

Cons of decision trees (4; 5.1.1)

A

1) Overfitting - decision trees are more prone to overfitting and tend to produce unstable predictions with high variance, even with pruning.
2) Numeric variables may need to be split multiple times, making the tree larger and less interpretable, whereas a GLM captures a numeric variable with a single coefficient
3) Decision trees favor categorical predictors with many levels. Because of the large number of possible splits, it is easy to choose a split with high information gain that is not part of the true signal. Combining factor levels may remedy this.
4) Model diagnostics - there is a lack of model diagnostic tools for decision trees

11
Q

Ensemble Model I: Random Forests (5.1.2)

A

Random forests entail generating multiple bootstrapped (randomly sampled with replacement) samples of the training data and fitting base trees in parallel independently on each of the bootstrapped training samples.
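
A stripped-down sketch of the bootstrapping step (my own, assuming scikit-learn and NumPy; the per-split feature randomization described on a later card is omitted here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_base_trees(X, y, B=100, seed=0):
    """Fit B base trees, each on a bootstrapped (with-replacement) resample of the training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample n row indices with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees
```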

12
Q

Calculating overall prediction from random forest (5.1.2)

A

1) Numerical target variables - overall prediction is the simple average of the predictions from the B base trees. The average can also be weighted by each tree's accuracy.

2) Categorical target variables - two options
a) Majority vote - for a given cutoff (usually 0.5), convert B base probabilities into B predictions, then select the class that is predicted most.
b) Average B probabilities, then apply averaged probability against a cutoff to select predicted class
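
A toy numerical example of the two aggregation rules for a binary target (the numbers are my own; note that the two rules can disagree, as they do here):

```python
import numpy as np

probs = np.array([0.55, 0.60, 0.10])       # P(class 1) from each of B = 3 base trees

# (a) Majority vote: convert each probability to a class, then take the most common class
votes = (probs >= 0.5).astype(int)          # -> [1, 1, 0]
print(np.bincount(votes).argmax())          # majority vote predicts class 1

# (b) Average the probabilities, then apply the cutoff
print(int(probs.mean() >= 0.5))             # mean = 0.4167 -> predicts class 0
```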

13
Q

Random forest randomization (5.1.2)

A

1) At each split, we only consider a subset (m) of all predictor variables (p) available. This subset is selected randomly and refreshed at each split. This randomization decorrelates base trees by making them use more diverse predictors, leading to greater variance reduction when the base trees are combined.

2) As m increases from 1 to p, variance increases while bias decreases.

3) Selecting m - commonly set to sqrt(p) for classification problems and p/3 for regression problems, but this hyperparameter can be tuned by CV. When m = p, the process is called 'bagging'.
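
In scikit-learn terms (an illustration, not the syllabus's workflow), m corresponds to the max_features argument; setting it to the full number of predictors gives bagging:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

bagging = RandomForestClassifier(n_estimators=200, max_features=None,    # m = p (bagging)
                                 random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",   # m = sqrt(p)
                                random_state=0).fit(X, y)
```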

14
Q

Pros and cons of random forests compared to a single decision tree (3; 5.1.2)

A

1) Pro - much more robust: even though the base trees are unpruned, combining a substantial number of them leads to significant variance reduction and much more precise predictions.

Cons
2) Interpretability - far less interpretable due to two inter-related reasons
a) Cannot display the statistical learning procedure graphically with a series of if/then gates as we can with a single tree
b) It is not simple to visualize the relationships between predictors and the target variable in a random forest

3) Takes considerably longer to implement a random forest

15
Q

Ensemble Model II: Boosting (5.1.3)

A

Definition - gradient boosting machines (boosting) build a sequence of inter-dependent trees, each fitted to the residuals left by the preceding fit. A scaled-down version of the current tree’s predictions is subtracted from the residuals to form new residuals. The process is repeated, with the effect that each tree focuses on predicting the observations that the previous trees predicted poorly, progressively moving in the direction of the negative “gradient” of the loss function we are trying to minimize.

Overall prediction is the sum of the scaled-down prediction of each base tree.

Using scaled-down predictions (via a shrinkage parameter) slows down the algorithm to prevent overfitting and improve predictive accuracy.
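
A stripped-down boosting loop for squared-error loss (illustrative only, assuming scikit-learn; for simplicity it starts from a baseline prediction of 0 rather than the target mean):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, shrinkage=0.1, max_depth=2):
    trees = []
    residuals = y.astype(float)                    # start with residuals = y (baseline prediction of 0)
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        residuals -= shrinkage * tree.predict(X)   # subtract scaled-down predictions to form new residuals
        trees.append(tree)
    return trees

def predict_boosted(trees, X, shrinkage=0.1):
    # overall prediction = sum of the scaled-down predictions of the base trees
    return shrinkage * sum(tree.predict(X) for tree in trees)
```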

16
Q

Differences between random forests and boosting (2; 5.1.3)

A

1) In parallel vs. in series - random forest trees are built in parallel (and thus independent), boosting trees are built in series (dependent on prior trees)

2) Bias vs. variance - random forests improve model prediction performance by addressing variance, where boosting improves performance by reducing bias (boosting therefore leads to models with better predictive accuracy)

17
Q

Pros and cons of boosting (5.1.3)

A

1) Compared to random forests
Pro) Boosting leads to models with better predictive accuracy (bias reduction)
Con) More susceptible to overfitting (the base trees all try to capture the signal in the same original data, which increases variance, whereas in a random forest adding more base trees reduces variance) and more sensitive to hyperparameter values

2) Compared to single decision trees
The gain in prediction comes at the cost of interpretability and computational efficiency

18
Q

Variable importance plots (5.1.3)

A

1) Variable importance score = the sum of the impurity reductions (drop in RSS for regression trees, drop in Gini index for classification trees) over all splits involving a predictor, divided by B (the number of base trees)

2) Higher the score, the more important the predictor

3) Variable importance plots plot the importance scores of all predictors from most important to least important
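
As an illustration (assuming scikit-learn and matplotlib; note scikit-learn normalizes its impurity-based scores to sum to 1, which differs slightly from the raw "total reduction / B" formula above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=6, n_informative=2, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

order = np.argsort(rf.feature_importances_)[::-1]        # most to least important
plt.bar([f"x{i}" for i in order], rf.feature_importances_[order])
plt.ylabel("Impurity-based importance")
plt.show()
```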

19
Q

Partial dependence plots (PDPs) (5.1.3)

A

1) Partial (or average) dependence plots visualize the marginal effect of a given predictor on the target variable, i.e., the association between the two variables after averaging out the values or levels of other predictors not of interest. Improves on variable importance scores by shedding light on the directionality of the relationship.
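
A short sketch (assuming a recent scikit-learn and matplotlib) of producing a PDP for one predictor of a fitted boosted model:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=400, n_features=4, random_state=0)
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)

# Average the model's predictions over the data while sweeping feature 0 over a grid
PartialDependenceDisplay.from_estimator(gbm, X, features=[0])
plt.show()
```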

20
Q

Limitations of partial dependence (2; 5.1.3)

A

1) Assumes the variable of interest is independent of other variables
2) May miss interactions between variables

By setting X1 to the same value for all observations, the PDP destroys the relationships between X1 and the other variables.