4.1 Regression and Classification Trees Flashcards
Explain the key differences between decision trees and parametric regression models. How do these differences influence the assumptions and flexibility of decision trees?
Decision trees do not assume a specific distribution for the target variable or a particular form for its relationship with the predictors, and they do not require coefficient estimation. Their focus is on making accurate predictions rather than on satisfying model assumptions. These differences make decision trees more flexible, allowing them to capture complex, nonlinear relationships without distributional assumptions.
Describe the main components of a decision tree and their roles in the tree’s structure and prediction process.
- The root is the topmost node containing all observations.
- Splits divide the observations into two nodes based on the predictors.
- The terminal nodes, or leaves, are the bottommost nodes that determine the predictions.
- Edges, or branches, are the lines connecting the nodes.
- Parent nodes are upper nodes connected to lower child nodes.
- Tree depth refers to the longest path from the root to a leaf.
What is the purpose of recursive binary splitting in the construction of decision trees? How does this algorithm determine the best split at each node?
The purpose of a decision tree is to partition the predictor space into distinct regions. Recursive binary splitting provides an algorithmic way to do that. For regression trees, the algorithm evaluates all possible splits at each node, calculates the total error sum of squares (SSE) for the resulting partitions, and selects the split that minimizes SSE. For classification trees, the algorithm examines all potential splits, computes the total impurity for the resulting groups, and selects the split that minimizes impurity or maximizes information gain.
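As a concrete illustration, the sketch below evaluates every candidate cutpoint for a single numeric predictor and keeps the one that minimizes total SSE across the two resulting nodes; the simulated data and the function name best_split are purely illustrative.

```r
# One step of recursive binary splitting for a regression tree (illustrative).
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between consecutive values
  sse  <- sapply(cuts, function(c) {
    left  <- y[x <= c]
    right <- y[x >  c]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutpoint = cuts[which.min(sse)], sse = min(sse))
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 1, 5) + rnorm(100, sd = 0.5)
best_split(x, y)   # should recover a cutpoint near 0.4
```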
How does the number of terminal nodes in a decision tree relate to its flexibility?
More terminal nodes indicate a more flexible model with higher variance but lower bias. Increased flexibility allows the tree to capture more complex patterns in the data, though it may also lead to overfitting if not controlled.
Explain the concept of pruning and its role in balancing the bias-variance tradeoff.
Pruning reduces the tree’s size by removing parts that have little predictive power. Its goal is to produce a moderately flexible tree that balances variance and bias, generalizing well to unseen data and avoiding overfitting. Pruning is typically performed after growing a large tree.
Differentiate between the splitting criteria used in regression trees and classification trees. How do these criteria affect the way splits are chosen in each type of tree?
Both regression and classification trees aim to maximize information gain, which is equivalent to minimizing node impurity. However, the splitting criteria differ between them. Regression trees minimize the error sum of squares, while classification trees minimize the Gini index or entropy. Each criterion selects the split that most improves node purity for its type of target variable, though each split is only locally optimal.
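For reference, both impurity measures can be computed directly from a node's class proportions; the proportions below are made up for illustration.

```r
# Node impurity for a vector of class proportions p (illustrative values)
p <- c(0.7, 0.2, 0.1)

gini    <- sum(p * (1 - p))    # Gini index
entropy <- -sum(p * log(p))    # entropy (natural log)
```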
When interpreting a decision tree plot in R, what information is typically displayed at each node? How can you use this information to make predictions for new observations?
In R, decision tree plots can be interpreted based on the type of tree. For regression trees, the predicted value and the proportion of observations are shown at each node.
For classification trees, each node displays the predicted class, the proportion of observations in the positive class, and the proportion of observations in the node.
To make predictions, one follows the splitting rules from the root to a terminal node.
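A minimal plotting sketch, assuming the rpart.plot package is available and `fit` is an rpart object fitted earlier:

```r
library(rpart.plot)

# By default each node shows the prediction (value or class), the fitted
# probability for classification trees, and the percentage of observations.
rpart.plot(fit)
```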
Explain the process of cost complexity pruning for decision trees. How does the complexity parameter (cp) influence the size and complexity of the subtree?
Cost complexity pruning selects the subtree T that minimizes the penalized objective:
SSE_T + cp × |T| × SSE_0
where SSE_T is the error sum of squares for subtree T, |T| is the number of leaves, and SSE_0 is the error sum of squares for the tree with no splits. Higher cp values penalize additional leaves more heavily and therefore produce smaller, less complex trees, balancing the tree's error against its complexity.
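A short sketch of growing a large tree and pruning it back with a larger cp value; the data frame `dat`, target `y`, and control settings are assumptions for illustration.

```r
library(rpart)

# Grow a deliberately large tree with a small cp, then prune it back.
full_tree   <- rpart(y ~ ., data = dat, method = "anova",
                     control = rpart.control(cp = 0.001, minbucket = 5))
pruned_tree <- prune(full_tree, cp = 0.01)   # larger cp => fewer leaves
```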
What is the purpose of cross-validation in the context of pruning decision trees? How can you use cross-validation results to select the optimal tree size?
The purpose of cross-validation is to estimate test error for different tree sizes. It is used to choose the cp value with the lowest cross-validation error or to apply the one-standard-error rule to select the optimal tree size.
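A sketch of working with the cross-validation results stored in the cp table, assuming `full_tree` was grown as above (rpart performs 10-fold cross-validation by default):

```r
printcp(full_tree)   # cp table with cross-validated relative error (xerror)
plotcp(full_tree)    # visual aid for the one-standard-error rule

cp_table <- full_tree$cptable

# cp value with the lowest cross-validation error
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# One-standard-error rule: largest cp whose xerror is within one xstd of the minimum
threshold <- min(cp_table[, "xerror"]) +
             cp_table[which.min(cp_table[, "xerror"]), "xstd"]
cp_1se <- max(cp_table[cp_table[, "xerror"] <= threshold, "CP"])

pruned_1se <- prune(full_tree, cp = cp_1se)
```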
What are competitor splits in decision trees? What do they inform?
Competitor splits are the next-best splits at a node that were not chosen as the optimal split, ranked by how close their improvement comes to that of the chosen split. They reveal how much competition there was for the top split and help in understanding the stability of the tree structure and the alternatives available at each node.
What are surrogate splits in decision trees, and how do they help address missing data? How can you control the use of surrogate splits in R?
Surrogate splitting rules provide backup splitting options for observations that are missing the variable used in the best split. In R, their use is controlled by the usesurrogate argument of the rpart.control() function.
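For example, the following sketch (with a hypothetical data frame `dat` and target `y`) retains up to five surrogate splits per node and applies them to observations missing the primary split variable:

```r
library(rpart)

fit <- rpart(y ~ ., data = dat,
             control = rpart.control(
               maxsurrogate = 5,  # number of surrogate splits retained per node
               usesurrogate = 2   # 2 = use surrogates; observations still missing
             ))                   #     all surrogates follow the majority direction
```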
Describe the key differences between Poisson trees and standard regression trees. How do these differences account for the unique characteristics of count data?
Poisson trees differ from standard regression trees in that they minimize the deviance statistic instead of SSE and calculate predictions differently, accounting for exposures. These adjustments are necessary to handle count data and a possible exposure variable.
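A hedged sketch of a Poisson tree in rpart, assuming a data frame `dat` with hypothetical columns `exposure` (observation time), `claims` (event counts), and predictors `age` and `region`:

```r
library(rpart)

# Two-column response: exposure (time) in column 1, event counts in column 2.
pois_tree <- rpart(cbind(exposure, claims) ~ age + region,
                   data = dat, method = "poisson")

# Predicted values are event rates per unit of exposure at each leaf.
head(predict(pois_tree, newdata = dat))
```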
When using the train() function in R to fit decision trees, what are some of the essential arguments to consider? How do these arguments affect the model training and selection process?
Essential arguments in the train() function in R include specifying the formula or predictors for the model, the method (e.g., “rpart”), cross-validation parameters through trControl, the model selection metric (e.g., RMSE for regression or accuracy for classification), and tuning parameters such as cp values in tuneGrid. These arguments control the model training, tuning, and selection process.
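An illustrative call, assuming a data frame `dat` with target `y`; the fold count and cp grid are arbitrary choices:

```r
library(caret)
library(rpart)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

fit <- train(y ~ ., data = dat,
             method    = "rpart",
             trControl = ctrl,
             metric    = "RMSE",   # use "Accuracy" for classification
             tuneGrid  = expand.grid(cp = 10^seq(-4, -1, by = 0.5)))

fit$bestTune   # cp value selected by cross-validation
```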
Explain the concepts of balanced and unbalanced decision trees. How might an unbalanced tree arise, and what are the potential consequences?
A balanced tree has left and right sides of every non-terminal node that differ in depth by at most one, while an unbalanced tree has non-terminal nodes with sides differing in depth by more than one. Unbalanced trees may arise from skewed data or specific splitting patterns, potentially leading to overfitting or biased predictions in certain regions.
How can a right-skewed target variable influence the splitting process in decision trees? What is a common strategy to mitigate this issue?
A right-skewed target variable can cause splits to focus on the right tail, resulting in an unbalanced tree that strongly fits the right tail. This occurs because the algorithm tries to minimize SSE, which is sensitive to large values. Mitigating this effect can be done by transforming the target (e.g., using the natural log) before building the tree to achieve a more balanced structure.
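A brief sketch, assuming a strictly positive, right-skewed target `y` in a data frame `dat`:

```r
library(rpart)

# Fit on the log scale so extreme right-tail values have less influence on SSE.
log_tree <- rpart(log(y) ~ ., data = dat, method = "anova")

# Predictions are on the log scale; exponentiate to return to the original scale.
head(exp(predict(log_tree, newdata = dat)))
```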
Describe how decision trees handle interactions between predictors. How does the depth of the tree relate to the complexity of the interactions it can capture?
Decision trees naturally capture interactions through their structure. An interaction occurs when the relationship between a predictor and the target changes depending on the value of another predictor. The depth of the tree relates to interaction complexity, with a depth of 2 capturing two-way interactions and a depth of 3 capturing up to three-way interactions.
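The simulated example below sketches this: the effect of `x1` on the target depends on `x2`, and a depth-2 tree recovers the interaction (all names and values are made up).

```r
library(rpart)

set.seed(42)
dat_int <- data.frame(x1 = runif(500), x2 = runif(500))
dat_int$y <- with(dat_int,
                  ifelse(x1 > 0.5, ifelse(x2 > 0.5, 10, 0), 5)) + rnorm(500)

int_tree <- rpart(y ~ x1 + x2, data = dat_int,
                  control = rpart.control(maxdepth = 2))
int_tree   # x2 appears only on one side of the x1 split, i.e. an interaction
```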
Discuss the main advantages and disadvantages of using decision trees compared to other modeling techniques. Consider factors such as interpretability, accuracy, stability, and overfitting.
Advantages of decision trees:
- They do not require distributional assumptions.
- They are easy to interpret.
- They handle qualitative predictors and missing data well.
- They automatically select significant predictors.
- They are robust to outliers and unaffected by monotone transformations of the predictors.

Disadvantages:
- They are prone to overfitting.
- They generally have lower predictive accuracy than more advanced models.
- They are unstable, meaning small changes in the data can produce very different trees.
- The greedy splitting algorithm may not find the globally optimal tree.
- Splits can be biased toward predictors with many levels.