Decision Trees Flashcards
Briefly explain how a decision tree works
Decision trees divide the feature space into a finite set of non-overlapping regions containing relatively homogeneous observations that are more amenable to analysis and prediction. A new observation is assigned the average target value (numeric target) or the most common class (categorical target) of the training observations in the region it falls into
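A minimal sketch with scikit-learn (the dataset, depth, and random_state are illustrative choices): each printed leaf corresponds to one region of the feature space, and every observation falling into that region receives the same prediction.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Each leaf in the printed rules is one rectangular region of the feature space
print(export_text(tree, feature_names=list(data.feature_names)))
```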
Explain why the classification error rate is generally less sensitive to node impurity than the Gini index and entropy
This is because the classification error rate depends on the class proportions only through the largest proportion. If two nodes have the same maximum class proportion, they will always share the same classification error rate, regardless of how the observations are distributed among the other classes, whereas the Gini index and entropy do register those differences
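A small numeric illustration (the class proportions below are made up for the example): two nodes share the same maximum proportion, hence the same classification error rate, while the Gini index and entropy still differ.

```python
import numpy as np

def classification_error(p):
    return 1 - np.max(p)              # depends only on the largest proportion

def gini(p):
    return 1 - np.sum(np.asarray(p) ** 2)

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                      # drop zero proportions to avoid log(0)
    return -np.sum(p * np.log2(p))

node_a = [0.5, 0.25, 0.25]            # same maximum proportion (0.5) ...
node_b = [0.5, 0.5, 0.0]              # ... but different remaining distribution

for name, p in [("A", node_a), ("B", node_b)]:
    print(name, classification_error(p), round(gini(p), 3), round(entropy(p), 3))
# Error rate is 0.5 for both nodes; Gini (0.625 vs 0.5) and entropy (1.5 vs 1.0)
# still distinguish them
```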
Recursive binary splitting is a top-down, greedy algorithm. Explain the meaning of the terms “top-down” and “greedy”
- “Top-down” means that we start from the top of the tree and go down, sequentially partitioning the feature space in a series of binary splits
- “Greedy” means that we adopt the binary split that leads to the greatest reduction in impurity at that point, rather than looking ahead and selecting a split that results in a better tree in a future step
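A minimal sketch of one greedy step, assuming a single numeric feature and the Gini index (the toy data and the helper `best_split` are made up for illustration): the split with the largest impurity reduction right now is chosen, with no lookahead.

```python
import numpy as np

def gini(y):
    # Gini impurity of a set of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(x, y):
    """Greedy search over candidate thresholds on one numeric feature."""
    parent = gini(y)
    best = (None, 0.0)                      # (threshold, impurity reduction)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        reduction = parent - (w_left * gini(left) + w_right * gini(right))
        if reduction > best[1]:
            best = (t, reduction)
    return best

# The split is chosen purely on the current impurity reduction,
# not on how later splits might turn out
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 1, 0, 1, 1])
print(best_split(x, y))
```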
Explain why it is not a good idea to grow a tree by specifying an impurity reduction threshold and making a split only when the reduction in impurity exceeds the threshold
This is because it is short-sighted: a seemingly poor split early in the tree-building process may be followed by a very good split later on. If the early split is rejected because its impurity reduction falls below the threshold, we never reach the good split that would have come after it
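A hedged sketch of the problem, using scikit-learn's min_impurity_decrease to mimic an impurity reduction threshold on XOR-style data (the simulated data and the 0.1 threshold are arbitrary choices): no single split reduces impurity much, but two splits classify the data almost perfectly, so the threshold-stopped tree never gets there.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # XOR-style pattern

full = DecisionTreeClassifier(random_state=0).fit(X, y)
thresholded = DecisionTreeClassifier(min_impurity_decrease=0.1, random_state=0).fit(X, y)

print("fully grown tree, training accuracy:", full.score(X, y))
print("threshold-stopped tree, node count:", thresholded.tree_.node_count)  # likely 1: no splits made
```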
Explain how cost-complexity pruning works
Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits whose improvement in fit does not justify the penalty set by the complexity parameter. We can use cross-validation to optimize the complexity parameter: for each candidate value, models are repeatedly trained on some folds of the data and validated on the held-out fold, and the value with the lowest cross-validation error is selected as the optimal choice. We then prune back the tree grown on the full training set using that optimal complexity parameter.
Cost-complexity pruning trades off the goodness of fit of the tree to the training data against the size/complexity of the tree
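A minimal sketch with scikit-learn, where cost-complexity pruning is controlled by ccp_alpha (analogous to, though scaled differently from, the cp parameter in R's rpart); the dataset and candidate grid are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate penalty values taken from the pruning path of a fully grown tree;
# clip at 0 in case floating-point error produces tiny negative alphas
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
grid = {"ccp_alpha": np.unique(np.clip(path.ccp_alphas, 0.0, None))}

# Cross-validation: repeatedly train on some folds and validate on the held-out fold
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

print("best penalty:", search.best_params_["ccp_alpha"])
print("pruned tree size:", search.best_estimator_.tree_.node_count)
```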
Explain how the complexity parameter affects the quality of a decision tree
The complexity parameter (cp) is a penalty term in place to prevent overfitting by penalizing tree size
* cp = 0: there is no penalty for a complex tree, and the fitted tree is identical to the large, overgrown tree obtained without pruning
* As cp increases from 0 to 1, the penalty grows and tree branches are progressively snipped off to form new, larger terminal nodes. The squared bias of the tree predictions increases, but the variance drops
* cp = 1: the tree is reduced to the root node only (the analogue of an intercept-only linear model). At this point the drop in relative training error can never compensate for the increase in the penalty, so no splits are made
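A short illustration of the same idea, again using scikit-learn's ccp_alpha as a stand-in for cp (dataset is illustrative): as the penalty grows along the pruning path, the fitted tree shrinks until only the root node remains.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in np.clip(path.ccp_alphas, 0.0, None):
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  nodes={tree.tree_.node_count}")
# The largest alpha yields a single-node tree: the drop in training error from
# any split can no longer offset the penalty, so no splits are made
```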
Explain how a monotone transformation on a numeric predictor and on the target variable affects a decision tree
Monotone transformation on a numeric predictor: produces the same decision tree (the same partition of the observations) because the splits depend only on the ranks of the feature values, not their actual values
Monotone transformation on the target variable: generally produces a different decision tree because the impurity/error reductions that drive the splits are computed on the transformed scale, so different split points become optimal
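A sketch of both cases with a regression tree (the simulated data and the log transform are illustrative choices).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 1))
y = np.exp(X[:, 0] / 25) * rng.uniform(0.8, 1.2, size=200)   # strictly positive target

def tree_rules(features, target):
    return export_text(DecisionTreeRegressor(max_depth=2, random_state=0).fit(features, target))

print(tree_rules(X, y))           # original tree
print(tree_rules(np.log(X), y))   # same partition of observations; thresholds shown on the log scale
print(tree_rules(X, np.log(y)))   # different splits: error reductions are computed on the new scale
```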
Explain why a decision tree tends to favor categorical predictors with many levels
This is because there are numerous ways to split its levels into two groups, making it easier to choose a seemingly good split based on a multi-level categorical predictor for the particular training data at hand, even though that predictor is not necessarily an important one. This will likely lead to overfitting.
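For reference, a categorical predictor with k levels admits 2^(k-1) - 1 distinct binary splits, which grows very quickly with k:

```python
# Number of ways to split k levels into two non-empty groups
for k in (2, 5, 10, 20):
    print(k, 2 ** (k - 1) - 1)
# 2 -> 1, 5 -> 15, 10 -> 511, 20 -> 524287
```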
Explain how a random forest works
- A random forest first uses the bootstrap to produce B training samples of the same size as the original training set (sampled independently and randomly with replacement)
- A separate decision tree is then fitted to each bootstrapped sample, producing B fitted trees; to reduce the correlation between the trees, only a random subset of the features (mtry of them) is considered at each split
- The results from the B fitted trees are combined to form an overall prediction
* Numeric target variables = average of the B predictions
* Categorical target variables = majority vote
Summary = make independent bootstrapped samples, fit a decision tree to each bootstrapped sample separately, and aggregate the predictions of these trees to reduce variance
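A minimal sketch of the bagging mechanics for a numeric target (the data are simulated; a real random forest would additionally restrict the features considered at each split, as discussed in the mtry card below).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
B = 100

trees = []
for _ in range(B):
    # Sample with replacement, same size as the original training set
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Averaging the B predictions reduces variance
ensemble_pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(ensemble_pred)
```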
Explain how a boosted tree works
Boosting builds a sequence of interdependent trees using information from previously grown trees. In each iteration, a tree is fit to the residuals of the preceding fit, and a scaled-down version of the current tree’s predictions is added to the previous predictions, focusing on the observations that the earlier trees predicted poorly.
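A minimal sketch of the idea for a numeric target with squared-error loss (the learning rate, tree depth, number of rounds, and simulated data are arbitrary choices).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
learning_rate = 0.1          # the shrinkage factor
n_rounds = 200

pred = np.zeros_like(y, dtype=float)
for _ in range(n_rounds):
    residuals = y - pred                      # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # add a scaled-down correction

print("training MSE:", np.mean((y - pred) ** 2))
```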
Explain the differences between a random forest and a boosted tree
- In parallel vs. in series: base trees in a random forest are fitted in parallel, independently of one another; base trees in a boosted model are fitted in series, each one depending on the trees grown before it
- Bias vs variance: random forests address model variance, and boosted trees focus on a reduction in model bias
Explain the considerations that go into the choice of the mtry parameter of a random forest
- The larger the value of mtry (the number of features considered at each split), the more correlated the base trees are and the smaller the variance reduction from averaging them
- If mtry is too small, each split has too little freedom in choosing the split variables, and important predictors may be left out of consideration
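A sketch comparing a few values, assuming scikit-learn's RandomForestRegressor, where mtry corresponds to the max_features parameter (the simulated data and candidate values are illustrative).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for m in (1, 3, 10):     # 10 = all features, i.e., plain bagging
    rf = RandomForestRegressor(
        n_estimators=300, max_features=m, oob_score=True, random_state=0
    ).fit(X, y)
    # Out-of-bag R^2 gives a quick estimate of out-of-sample performance
    print(f"mtry={m:2d}  OOB R^2={rf.oob_score_:.3f}")
```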
Explain what a partial dependence plot is and in what way it improves over a variable importance plot
Partial dependence plot = attempts to visualize the association between a given predictor and the model prediction after averaging out the effects of the other predictors. These plots give insight into how the model’s predictions (and hence the target variable) depend on each individual predictor
Variable importance plot = tells us which predictors are most influential, but does not shed light on the relationship between the predictors and the target variable (e.g., whether the relationship is positive, negative, or more complex)
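A sketch contrasting the two views with scikit-learn (requires matplotlib; the dataset and the feature indices are arbitrary choices).

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_friedman1(n_samples=500, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importance scores rank predictors but carry no sign or shape information
print(rf.feature_importances_)

# Partial dependence: average prediction as one feature varies, with the other
# features held at their observed values
PartialDependenceDisplay.from_estimator(rf, X, features=[0, 3])
```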
What is information gain?
Information gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most information gain. Therefore, the choice of impurity measure (e.g., Gini or entropy) directly impacts information gain calculations.
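A small worked example using entropy as the impurity measure (the class labels are made up): information gain = parent impurity minus the size-weighted impurity of the child nodes.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(round(gain, 4))   # ~0.1887; using Gini instead would give a different number
```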
Compare random forests and gradient boosting machines
Random forests tend to do well in reducing variance while having similar bias to that of a basic tree model.
Gradient boosting machines use the same underlying training data at each step. This is very effective in reducing bias but is very sensitive to the training data (high variance)
What is a node? Explain the different types of tree nodes
A node is a point on a decision tree that corresponds to a subset of the training data
* Root node: the node at the top of a decision tree representing the full dataset
* Terminal nodes (leaves): the nodes at the bottom of a tree which are not split further and constitute an end point