Chapter 8 - Decision Trees Flashcards
CART - (6)
Classification and Regression Trees
- find a partition in the space of predictors
- predict a constant in each set of the partition
- partition is defined by splitting the range, one predictor at a time
- not all partitions are possible: the splits only produce box-shaped (axis-aligned) regions
- the prediction for a point in region Ri is the average of the training responses in Ri
- the piecewise-constant fit has a lot of bias but can reduce variance
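In standard CART notation (regions R_1, ..., R_J), the fitted function and the criterion being minimized are:

```latex
% prediction: the average response in the region containing x
\hat{f}(x) = \sum_{j=1}^{J} \hat{y}_{R_j}\, \mathbf{1}\{x \in R_j\},
\qquad
\hat{y}_{R_j} = \operatorname{mean}\{\, y_i : x_i \in R_j \,\}

% objective: choose the partition R_1, \dots, R_J to minimize the total RSS
\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i:\, x_i \in R_j} \bigl(y_i - \hat{y}_{R_j}\bigr)^2
```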
How is the tree built?
start with a single region R1 and iterate.
- select a region Rk, a predictor Xj, and a splitting point s such that the criterion Xj < s produces the largest decrease in RSS (sketched below)
- replace Rk with the two new regions defined by this split
- stopping rule - terminate when there are 5 observations or fewer in each partition (growing further tends to overfit)
- this grows the tree from the root to the leaves
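A minimal numpy sketch of one greedy split search; the function name `best_split` and its interface are illustrative, not from the text:

```python
import numpy as np

def best_split(X, y):
    """One step of recursive binary splitting: search every predictor j and
    threshold s for the (j, s) pair that most reduces RSS."""
    n, p = X.shape
    base_rss = np.sum((y - y.mean()) ** 2)
    best = None  # (rss decrease, predictor index, threshold)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue  # skip degenerate splits
            rss = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[right] - y[right].mean()) ** 2))
            decrease = base_rss - rss
            if best is None or decrease > best[0]:
                best = (decrease, j, s)
    return best

# usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float) + 0.1 * rng.normal(size=100)
print(best_split(X, y))  # -> (rss decrease, predictor index, threshold)
```

In a full implementation, splits that would leave fewer than 5 observations in a child region would be skipped, per the stopping rule above.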
How do we control for overfitting?
wrong method 1: find the optimal subtree using CV (there are far too many possible subtrees to search over)
wrong method 2: use a rule to decide when we stop growing (doesn’t work because good cuts can live below bad ones)
solution: prune a large tree from leaves to root
Tree Pruning: (2)
weakest link pruning: starting with the full tree T0, substitute a subtree with a leaf to obtain T1, choosing the collapse that minimizes [RSS(T1) - RSS(T0)] / [|T0| - |T1|] (the smallest increase in RSS per leaf removed)
iterate this pruning to obtain a nested sequence T0, T1, ..., Tm, where Tm is the null tree (a single leaf)
cost complexity pruning: solve the problem
minimize RSS(T) + a*|T| over subtrees T, where a >= 0 is a penalty on the number of leaves |T|
if a is large then we select the null tree Tm
if a = 0 then we select the full tree T0
the solution for each a is among the sequence of trees chosen for weakest link pruning
choose optimal a by CV
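One way to reproduce this in Python is scikit-learn's minimal cost-complexity pruning; the data below is a synthetic placeholder. `cost_complexity_pruning_path` returns the nested sequence of alphas corresponding to the weakest-link trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.5 * rng.normal(size=200)

# Grow a large tree, then compute the weakest-link / cost-complexity path.
full_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)

# Each ccp_alpha corresponds to one tree in the nested pruning sequence.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=alpha,
                                   random_state=0).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```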
Incorrect CV vs Correct CV
incorrect CV: build the sequence of trees on the full data set first, then do the k-fold split only to evaluate them.
correct CV: split the data into k folds. for each training set, grow and prune its own sequence of trees (each minimizing RSS for its size). select the tree Ti (equivalently, the value of a) which minimizes the average test error across folds
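Continuing the sketch above, choosing a by cross-validation the "correct" way: `GridSearchCV` regrows and prunes the tree inside every training fold, then picks the a with the best average held-out error.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Uses X, y, and path from the previous sketch.
search = GridSearchCV(
    DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},  # candidate penalties a
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)
print(search.best_params_)  # the value of a chosen by CV
```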
Classification Trees - types of classification loss (4 total notes)
predict the class in each region Ri by majority vote of the training observations in that region.
types of classification loss: 0-1 (misclassification) loss, Gini index, cross-entropy (similar to Gini)
the Gini index is the expected misclassification rate if, instead of predicting the most likely class, we predict a class drawn at random from the estimated class proportions phat_mk (formulas below)
it is typical to use Gini or cross-entropy to grow the tree, then use the misclassification rate to prune
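For a region m with estimated class proportions phat_mk, the standard formulas for the three losses are:

```latex
% region m, classes k, estimated class proportions \hat{p}_{mk}
E_m = 1 - \max_k \hat{p}_{mk}                       % 0-1 / misclassification error
G_m = \sum_k \hat{p}_{mk}\,(1 - \hat{p}_{mk})       % Gini index
D_m = -\sum_k \hat{p}_{mk}\, \log \hat{p}_{mk}      % cross-entropy
```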
Advantages of Trees
easy to interpret, closer to human decision making, easy to visualize, easily handle qualitative predictors and missing data
How do we deal with Categorical variables
For 2 categories, the split is obvious. If there are more than 2 categories, order the categories by the average of the response within each category and split as if the predictor were ordered (this can be shown to be the optimal way to partition)
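A small illustration (hypothetical data and column names) of ordering categories by their mean response before splitting:

```python
import pandas as pd

# Hypothetical data: a 3-level categorical predictor and a numeric response.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green", "red"],
    "y":     [ 3.1,   1.2,    2.0,     0.9,    3.5,   2.4,     2.9 ],
})

# Order the categories by their mean response, then treat the predictor as
# ordinal: candidate splits are cuts in this ordering.
order = df.groupby("color")["y"].mean().sort_values().index.tolist()
df["color_rank"] = df["color"].map({c: i for i, c in enumerate(order)})
print(order)          # ['blue', 'green', 'red']
print(df[["color", "color_rank", "y"]])
```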
Missing Data
we want to be able to assign every sample to a leaf Ri despite missing data
when choosing a new split with variable Xj (growing the tree)
- only consider the samples for which Xj is observed
- in addition to the best split, also record the second-best and third-best splits on different variables (surrogate splits; see the sketch after this card)
- to send an observation down the tree, use the primary split when its variable is observed, otherwise the first surrogate split whose variable is observed
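scikit-learn does not implement surrogate splits (rpart in R does); the sketch below, with hypothetical names, shows the basic idea of picking a surrogate threshold on another variable that best agrees with the primary split:

```python
import numpy as np

def surrogate_split(X, primary_j, primary_s, candidate_j):
    """Pick a threshold on X[:, candidate_j] that best reproduces the
    primary split X[:, primary_j] < primary_s (a sketch; real implementations
    also consider the reversed direction and rank surrogates by agreement)."""
    goes_left = X[:, primary_j] < primary_s      # where the primary split sends each sample
    best = None                                  # (agreement, threshold)
    for s in np.unique(X[:, candidate_j]):
        agreement = np.mean((X[:, candidate_j] < s) == goes_left)
        if best is None or agreement > best[0]:
            best = (agreement, s)
    return best

# usage on synthetic data: find a surrogate for the split X0 < 0 using X1
X = np.random.default_rng(0).normal(size=(100, 2))
X[:, 1] = X[:, 0] + 0.3 * np.random.default_rng(1).normal(size=100)  # correlated surrogate
print(surrogate_split(X, primary_j=0, primary_s=0.0, candidate_j=1))
```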