Chapter 5 - Decision Trees Flashcards
Decision Trees (5)
A decision tree divides the feature space (i.e., the space of all possible combinations of feature values or levels) into a finite number of non-overlapping regions, each containing relatively homogeneous observations that are more amenable to analysis and prediction.
To predict the target for a given observation, locate the region to which it belongs and use the average value (mean; for regression problems) or the most common class (mode; for classification problems) of the target variable in that region, as in the sketch below.
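A minimal sketch of this prediction rule, assuming scikit-learn and numpy are available (the data and variable names are illustrative only): the tree's prediction for a new point equals the mean of the training targets in the terminal node (region) that the point falls into.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))                          # two numeric features
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(0, 0.5, 200)  # toy target

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# apply() returns the terminal-node (leaf) index of each observation;
# the prediction for a new point is the mean of y within its leaf.
leaf_ids = tree.apply(X)
new_point = np.array([[3.0, 7.0]])
leaf_of_new = tree.apply(new_point)[0]
print(tree.predict(new_point)[0], y[leaf_ids == leaf_of_new].mean())  # essentially equal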
Decision Tree Terminology (3; 5.1.1)
1) Node - a point on a decision tree that corresponds to a subset of the feature space
a) Root node - node at the top of a decision tree representing the full feature space and containing the entire training set
b) Terminal nodes (leaves) - nodes at the bottom of the tree which are not split further and constitute an end point
2) Binary tree - decision tree in which each parent node is followed by two children sub-nodes
3) Depth - number of tree splits needed to go from the root node to the furthest terminal node; measure of complexity of a decision tree. A tree is balanced if the left and right subtrees coming out of any node differ in depth by at most one.
Recursive binary splitting - definition (2; 5.1.1)
1) It is customary to split the feature space using binary splits on one and only one predictor at a time. Decisions for each split include:
a) Which predictor to use
b) What cutoff value to use (for numerical predictors) or what partition of possible levels (for categorical predictors)
2) Node impurity - how dissimilar the training observations are in a given node. The goal of each split is to reduce impurity (or, equivalently, to maximize information gain).
a) Regression trees - use the RSS of the target variable around the node mean (i.e., the sum of squared differences between each observation and the mean of the target in that node).
b) Classification trees - see the next card for impurity measures.
Classification tree impurity measures (5.1.1)
1) Classification error rate - proportion of node observations not in the modal class
a) Not sensitive enough to node purity because it depends on the class proportions only through the maximum proportion (i.e., two nodes with the same maximum class proportion have the same classification error rate, even if their remaining class proportions differ)
2) Gini index - G.m = SUM(k = 1 to K) [p.mk * (1 - p.mk)]
K = number of classes
p.mk = proportion of observations in class k in node R.m
a) Lower the better; min 0 (all observations in one class), max 1 - 1/K (observations evenly spread amongst all K classes)
b) The default impurity measure in common implementations (e.g., rpart, scikit-learn)
3) Entropy - D.m = - SUM(k = 1 to K) [p.mk * log2(p.mk)]
a) Lower the better, min 0 (all observations in one class), max log2(K) (all observations evenly spread amongst all classes)
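A sketch of the three impurity measures on this card, computed from the class labels in a single node (pure numpy; the function name node_impurities is my own, not from any library):

import numpy as np

def node_impurities(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # class proportions p.mk (zero-count classes excluded)
    error_rate = 1.0 - p.max()                # classification error rate
    gini = np.sum(p * (1.0 - p))              # Gini index
    entropy = -np.sum(p * np.log2(p))         # entropy
    return float(error_rate), float(gini), float(entropy)

print(node_impurities(["a"] * 10))        # pure node: all three measures are 0
print(node_impurities(["a", "b"] * 5))    # evenly split, K = 2: (0.5, 0.5, 1.0)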
Recursive binary splitting - process (5.1.1)
1) The first split from the root node should, over all choices of predictor and cutoff value, minimize the overall RSS of the two resulting sub-nodes (regression trees) or minimize the average of the impurity measures of the two sub-nodes, weighted by the number of observations in each (classification trees)
2) Next split should seek the best predictor and cutoff value to split one of the sub-regions to further minimize the impurity of the tree
3) Subsequent splits - recursively repeat the process (each split operates on the results of previous splits) until a stopping criterion is met (e.g., no split leads to a meaningful decrease in impurity, or the number of observations in each terminal node falls below a specified threshold)
4) “Top-down” algorithm - starting at the top (root node), we work our way down the tree
5) “Greedy” algorithm - at each step, we select the split that gives the greatest reduction in impurity at that point, without looking ahead to whether a different split might allow a better tree later (one greedy step is sketched below)
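A minimal sketch of one greedy step for a regression tree, assuming squared-error impurity (pure numpy; rss and best_split are illustrative names). The full algorithm would call best_split recursively on each resulting sub-region until a stopping criterion is met; for a classification tree, the weighted average of node impurities would replace the total RSS.

import numpy as np

def rss(y):
    # RSS of a node: squared deviations of the target from the node mean
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    # Exhaustive search over every predictor and every candidate cutoff;
    # return the (predictor index, cutoff, RSS) triple minimizing combined RSS.
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for cutoff in np.unique(X[:, j]):
            left, right = y[X[:, j] < cutoff], y[X[:, j] >= cutoff]
            if len(left) == 0 or len(right) == 0:
                continue
            total = rss(left) + rss(right)
            if total < best[2]:
                best = (j, cutoff, total)
    return best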
Tree pruning methods (3; 5.1.1)
1) Pruning is a means to control complexity of the tree (recursive binary splitting can easily lead to an overfitted tree)
2) Set an impurity reduction threshold and only make a split if the reduction meets the threshold
a) Disadvantage: because the algorithm is greedy, a seemingly poor split that fails the threshold might have been followed by a very good split in a later iteration, so the threshold can stop the tree too early
3) Cost complexity pruning - first grow a large tree, then prune the tree using this method (see next card)
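For concreteness, assuming scikit-learn is used, the two pruning approaches above map onto two constructor arguments (a sketch, not the only way to do this):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: only make a split if the (weighted) impurity decrease is at least the threshold.
thresholded = DecisionTreeClassifier(min_impurity_decrease=0.01)

# Cost complexity pruning: grow a large tree, then prune it back using the penalty ccp_alpha
# (analogous to the complexity parameter c.p on the next card).
pruned = DecisionTreeClassifier(ccp_alpha=0.01)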
Cost complexity pruning (5.1.1)
Trade off the goodness of fit of the tree to the training data against the size of the tree using the penalized objective function:
Relative training error + c.p * |T|
Relative training error (regression trees) = RSS/TSS
Relative training error (classification trees) = misclassifications made by the tree / misclassifications made by always predicting the majority class
c.p = complexity parameter (chosen beforehand, between 0 {no penalty, most elaborate tree} and 1 {max penalty, resulting in a tree with no splits, i.e., the root node only})
|T| = number of terminal nodes
Notes
a) As c.p increases, each pruned tree is nested within the previous one (and within the full, most elaborate tree)
b) c.p can be tuned using cross-validation (CV), as sketched below
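A hedged sketch of cost complexity pruning tuned by cross-validation, assuming scikit-learn (its ccp_alpha plays the role of c.p, although scikit-learn penalizes total node impurity rather than the relative training error; the built-in dataset is used only for convenience):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a large tree first, then recover the sequence of penalties defining the nested subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the penalty whose pruned (nested) subtree cross-validates best.
cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
print(best_alpha)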
Handling of predictors in decision trees vs GLMs (5; 5.1.1)
1) Numerical
a) GLMs - the effect of a numeric predictor on the target mean is monotonic; a non-monotonic effect can only be captured through additional features (such as polynomial terms)
b) Decision trees - predicted means or probabilities are allowed to vary irregularly in different subsets of the feature space, allowing for more complex relationships between target variable and predictor to be captured
2) Categorical
a) GLMs - categorical variables must first be binarized, with each level receiving a coefficient
b) Decision trees - categorical variables do not need to be (and generally should not be) binarized; a split can isolate one level (split a | b, c, d; i.e. not a) or combine different levels into one subset (split a, b | c, d)
3) Interactions
a) GLMs - interactions must manually be added as additional features
b) Decision trees - each subsequent node splits a subset of the feature space. When splits are based on different predictors, an interaction effect is automatically modeled
4) Collinearity
a) GLMs - vulnerable to the presence of highly correlated variables, which can substantially increase variance
b) Decision trees - much less of a problem; collinearity will not derail the algorithm, but which of several highly correlated predictors gets selected for a split may be somewhat arbitrary, which can skew variable importance scores.
5) Variable transformations - transformations can affect both a fitted GLM and a fitted decision tree, with one notable exception for decision trees: monotonic transformations of a numeric PREDICTOR do not affect the fitted tree, since splits depend only on the ordering of the values. Decision trees can therefore handle non-linearity and skewness in predictors without the need for transformations (illustrated in the sketch below).
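A toy illustration of the monotonic-transformation point, assuming scikit-learn (data and names are illustrative): applying a log to a positive numeric predictor changes a GLM's fit but leaves the tree's partitions, and therefore its predictions, effectively unchanged.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=(300, 1))                       # positive numeric predictor
y = np.sin(x[:, 0] / 20) + rng.normal(0, 0.1, 300)

pred_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x, y).predict(x)
pred_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log(x), y).predict(np.log(x))

# Expected True: the log preserves the ordering of x, so the same partitions of the
# training data are available and chosen, and the leaf means are identical.
print(np.allclose(pred_raw, pred_log))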
Pros of decision trees (5; 5.1.1)
1) Interpretability - so long as there are not too many buckets, trees are easy to interpret and explain due to their if/then logic. They can also be represented graphically, increasing their interpretability.
2) Handling of complex relationships between target variable and predictors - they excel in automatically recognizing non-linear and non-additive (i.e., interactions) relationships
3) Categorical variables - do not need to be binarized or have a baseline level selected
4) Variable selection - trees automatically select variables. Most important variables show up at top of tree and unimportant ones are filtered out.
5) Can deal with missing data rather easily
Cons of decision trees (4; 5.1.1)
1) Overfitting - decision trees are more prone to overfitting and tend to produce unstable predictions with high variance, even with pruning.
2) Numeric variables may need to be split multiple times to capture their effect, making the tree larger and less interpretable, whereas a GLM applies a single coefficient to a numeric variable
3) Decision trees favor categorical predictors with many levels. Because such predictors offer a large number of possible splits, it is easy to find a split with high information gain that reflects noise rather than the true signal. Combining factor levels beforehand may remedy this.
4) Model diagnostics - there is a lack of model diagnostic tools for decision trees
Ensemble Model I: Random Forests (5.1.2)
Random forests entail generating multiple bootstrapped samples (random samples drawn with replacement) of the training data and fitting a base tree independently, in parallel, to each bootstrapped sample.
Calculating overall prediction from random forest (5.1.2)
1) Numerical target variables - the overall prediction is the simple average of the predictions from the B base trees (a weighted average, e.g., weighting each tree by its accuracy, is also possible).
2) Categorical target variables - two options
a) Majority vote - for a given cutoff (usually 0.5), convert B base probabilities into B predictions, then select the class that is predicted most.
b) Average the B probabilities, then apply the averaged probability against the cutoff to select the predicted class (both options are sketched below)
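A toy sketch of the two aggregation options above, given the positive-class probabilities produced by B = 5 base trees for a single observation (numbers are made up to show that the two options can disagree near the cutoff):

import numpy as np

base_probs = np.array([0.90, 0.45, 0.48, 0.47, 0.46])   # B = 5 base-tree probabilities
cutoff = 0.5

# Option (a): majority vote - convert each probability to a class, then take the mode.
votes = (base_probs >= cutoff).astype(int)
majority_vote = int(votes.sum() > len(votes) / 2)        # 0: only one tree votes for class 1

# Option (b): average the probabilities first, then apply the cutoff once.
averaged_class = int(base_probs.mean() >= cutoff)        # 1: the mean probability is 0.552

print(majority_vote, averaged_class)                     # the two options disagree here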
Random forest randomization (5.1.2)
1) At each split, we only consider a subset (m) of all predictor variables (p) available. This subset is selected randomly and refreshed at each split. This randomization decorrelates base trees by making them use more diverse predictors, leading to greater variance reduction when the base trees are combined.
2) As m increases from 1 to p, variance increases while bias decreases.
3) Selecting m - commonly set to sqrt(p) for classification problems and p/3 for regression problems, but it is a hyperparameter that can be tuned by CV. When m = p, the process is called ‘bagging’ (see the sketch below).
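In scikit-learn (assumed here purely as an illustration), m corresponds to the max_features argument; a sketch of the common choices described on this card:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rf_clf = RandomForestClassifier(n_estimators=500, max_features="sqrt")   # m ~ sqrt(p) (classification)
rf_reg = RandomForestRegressor(n_estimators=500, max_features=1/3)       # m ~ p/3 (regression)
bagged = RandomForestRegressor(n_estimators=500, max_features=None)      # m = p: plain bagging
# max_features can also be tuned with cross-validation, e.g. via GridSearchCV.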
Pros and cons of random forests compared to a single decision tree (3; 5.1.2)
1) Pro - much more robust; even though the base trees are unpruned, combining a substantial number of them leads to significant variance reduction and much more precise predictions
Cons
2) Interpretability - far less interpretable due to two inter-related reasons
a) Cannot display the statistical learning procedure graphically with a series of if/then gates as we can with a single tree
b) It is not simple to visualize the relationships between predictors and the target variable in a random forest
3) Computation - it takes considerably longer to build and implement a random forest than a single tree
Ensemble Model II: Boosting (5.1.3)
Definition - gradient boosting machines (“boosting”) build a sequence of inter-dependent trees, each fitted to the residuals of the preceding one. A scaled-down version of the current tree’s predictions is subtracted from the residuals to form new residuals, and the process repeats, so that each tree focuses on the observations that the previous trees predicted poorly, progressively moving in the direction of the negative “gradient” of the loss function being minimized.
The overall prediction is the sum of the scaled-down predictions of each base tree.
Using scaled-down predictions (via a shrinkage parameter) slows down the algorithm, which helps prevent overfitting and improves predictive accuracy; see the sketch below.
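A minimal numeric sketch of this idea for squared-error loss, assuming scikit-learn trees as the base learners (eta is my name for the shrinkage parameter; in practice a dedicated implementation such as gbm or XGBoost would be used):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, 300)

eta, n_trees = 0.1, 100              # shrinkage parameter and number of base trees
pred = np.zeros_like(y)              # running ensemble prediction
residuals = y.copy()
trees = []

for _ in range(n_trees):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # fit to current residuals
    step = eta * tree.predict(X)     # scaled-down predictions of the current tree
    residuals -= step                # new residuals for the next tree to target
    pred += step                     # overall prediction = sum of scaled-down predictions
    trees.append(tree)

print(float(np.mean((y - pred) ** 2)))   # training MSE shrinks as trees are added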