Decision Trees Flashcards
What are decision trees?
Supervised models for regression and classification that stratify or segment the predictor space into a number of simple regions
Terminal Nodes
The leaves of the tree: the final regions into which the observations are grouped and from which predictions are made
Internal Nodes
Points along the tree where the predictor space is split
Branches
Segments of the tree that connect the internal nodes and terminal nodes
Top-Down, Greedy Approach (Recursive Binary Splitting)
Start at the top of the tree and successively split the predictor space. At each step, rather than looking ahead and picking a split that would lead to a better tree later on, we make the split that is best at that particular step
Why do we use the top-down greedy approach?
It is computationally infeasible to consider every possible partition of the predictor space. So, at each step, we choose the split that minimizes the residual sum of squares (RSS) at that step
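A minimal sketch of one greedy split chosen by RSS (illustrative only; the function name and setup are my own, not from any particular library):

```python
import numpy as np

def best_split(X, y):
    """Find the single (feature, threshold) split that minimizes total RSS.

    X: (n, p) array of predictors, y: (n,) numeric response.
    Returns (feature index, threshold, RSS) of the best greedy split.
    """
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):
        for t in np.unique(X[:, j]):
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # RSS of a region = squared deviations from that region's mean prediction
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, t, rss)
    return best
```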
What is tree pruning?
We grow a very large tree (T0) and then prune it back to find an optimal subtree. We select the subtree with the lowest estimated test error rate (e.g., via cross-validation)
What is cost complexity pruning?
Rather than considering every possible subtree, we consider a sequence of subtrees indexed by a nonnegative tuning parameter α
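For reference, the standard cost complexity (weakest link) criterion is shown below; for each value of α we seek the subtree T ⊆ T0 that minimizes it, where |T| is the number of terminal nodes and R_m is the region of the m-th terminal node:

```latex
\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha\,|T|
```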
What does a high tuning parameter (a) indicate?
A higher penalty on the number of terminal nodes, so branches are pruned back further and the resulting subtree is smaller; this reduces variance and overfitting, although too large an α can underfit
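As a hedged illustration, scikit-learn exposes this idea through the ccp_alpha parameter of its tree estimators (a sketch, assuming X_train and y_train already exist as arrays):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Candidate alphas from the cost complexity pruning path of the full tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha -> more pruning -> smaller tree
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                    X_train, y_train, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
```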
Which classification error is preferred if prediction accuracy of the pruned tree is the goal?
Classification Error Rate
What classification errors are used to evaluate quality of particular splits and why?
The Gini index and entropy, since they are more sensitive to node purity than the classification error rate.
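A small sketch of the three measures for a single node, written in terms of the node's class proportions (the helper function is my own, not from a library):

```python
import numpy as np

def node_impurities(class_counts):
    """Classification error rate, Gini index, and entropy for one node.

    class_counts: counts of each class among the node's observations.
    """
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                                  # class proportions
    error = 1.0 - p.max()                            # classification error rate
    gini = np.sum(p * (1.0 - p))                     # Gini index
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # entropy (natural log; base only changes scale)
    return error, gini, entropy
```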
Describe some advantages of decision trees?
- Easy to explain to people
- More closely mirror human decision-making than regression and classification approaches
- Can be displayed graphically, and easily interpreted even by a non-expert
- Easily handle qualitative predictors without the need to create dummy variables
Describe some disadvantages of decision trees?
- Generally, do not have the same level of predictive accuracy as some of the other regression and classification approaches
- Additionally, trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree
What do methods like bagging, random forest, and boosting do to trees?
Improve the predictive performance of the trees
What is the goal of bagging?
Reduce variance since decision trees tend to have high variance.
Steps of Bagging
- First, generate M different bootstrapped training datasets, each a random sample of the original training data drawn with replacement.
- Next, build an unpruned regression tree on each of the M bootstrapped training datasets
- Finally, combine the M trees' predictions: average them if the response is numeric, or take a majority vote (the mode of the predictions) if it is categorical (see the sketch below)
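A minimal sketch of bagging for a numeric response (illustrative only; scikit-learn's DecisionTreeRegressor is used as one possible base learner, and the helper name is my own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, M=100, seed=0):
    """Average the predictions of M unpruned trees, each fit to a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)      # bootstrap sample drawn with replacement
        tree = DecisionTreeRegressor()        # grown deep, no pruning
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)             # average over the M trees
```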
What is random forests method?
The same process as bagging, except each split in each tree is only allowed to consider a random subset of the predictors (typically about the square root of the total number of predictors), which decorrelates the trees and further reduces variance
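As a sketch of how this looks in practice, scikit-learn's RandomForestRegressor exposes the per-split predictor subset through max_features (X_train, y_train, and X_test are assumed to exist):

```python
from sklearn.ensemble import RandomForestRegressor

# max_features controls how many predictors each split may consider:
# using all predictors makes this essentially bagging, while a smaller
# value (e.g. "sqrt") decorrelates the trees.
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
y_hat = rf.predict(X_test)
```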
What is a downside of bagging?
If there is one very strong predictor in the dataset, most or all of the bagged trees will use it in their top split, so the trees look very similar to each other and their predictions are highly correlated. Averaging highly correlated trees does not produce a substantial reduction in variance.
What is boosting?
Creating many trees in a sequential manner, where each new small tree is fit to the residuals of the current model rather than to the response. A shrunken version of the new tree is added to the current model and the residuals are updated. This process is repeated many times to arrive at a final model
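A minimal sketch of this residual-fitting procedure for regression, assuming scikit-learn's DecisionTreeRegressor as the small base tree (the names B, lam, and d mirror the tuning parameters below and are my own; max_depth=d is used here as a rough proxy for "d splits"):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=1000, lam=0.01, d=1):
    """Fit B small trees sequentially to the residuals, with shrinkage lam."""
    f_hat = np.zeros(len(y))          # current model starts at 0
    residuals = y.astype(float)       # so residuals start equal to y
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=d)
        tree.fit(X, residuals)                 # fit a small tree to the residuals
        update = lam * tree.predict(X)         # shrunken version of the new tree
        f_hat += update                        # add it to the current model
        residuals -= update                    # update the residuals
        trees.append(tree)
    return trees

def boost_predict(trees, X, lam=0.01):
    """Final boosted model: sum of the shrunken trees."""
    return lam * sum(t.predict(X) for t in trees)
```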
Describe the 3 tuning parameters of boosting?
- The number of trees, B. Unlike bagging and random forests, boosting can overfit if B is too large. We use cross-validation to select B
- The shrinkage parameter, a small positive number, which controls the rate at which boosting learns. A very small shrinkage parameter typically requires a very large B to achieve good performance
- The number of splits d in each tree, which controls the complexity of the boosted ensemble; d is also the interaction depth, since d splits can involve at most d variables (see the sketch below)
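As an illustration of where these three parameters show up in one common implementation, here is a hedged sketch with scikit-learn's GradientBoostingRegressor (X_train and y_train are assumed to exist; gradient boosting generalizes the residual-fitting procedure sketched earlier):

```python
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(
    n_estimators=1000,   # B: the number of trees
    learning_rate=0.01,  # the shrinkage parameter
    max_depth=2,         # tree size, playing the role of d (interaction depth)
    random_state=0,
)
gbm.fit(X_train, y_train)
```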