Trees and Forests Flashcards
week five
alternative names of regression trees
- CART (classification and regression trees);
- Recursive partitioning methods.
Random forests are…
Random forests are collections of trees.
A random forest is a collection of decision trees (e.g. regression trees) generated by applying two separate
randomisation processes:
1. The observations (rows) are randomised through a bootstrap resample.
2. A random selection of predictors (columns) is considered for each split, rather than considering all variables.
Random forests are examples of ensemble methods.
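A minimal sketch of fitting a random forest in R with the randomForest package, assuming the wage.train data frame used in the rpart examples below; the ntree and mtry values here are arbitrary choices for illustration:
library(randomForest)
# ntree = number of bootstrap resamples (trees) grown;
# mtry  = number of predictors considered at each split.
wage.rf <- randomForest(WAGE ~ . , data = wage.train, ntree = 500, mtry = 3)
wage.rf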
binary splits
Most commonly, the groups are formed by a sequence of binary splits.
The basic idea of regression trees
The basic idea is to split the data into groups using the predictors, and then estimate the response within each group by a fixed value.
binary tree
The resulting partition of the data can be described by a binary tree. A binary tree is a tree data structure in which each node has at most two children, referred to as the left child and the right child.
the target value at each leaf
At each leaf the target is estimated by the mean value of the y-variable for all data at that leaf.
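A toy illustration (not rpart itself) of one binary split, with each leaf predicting the mean response of its group; the data are made up:
x <- c(1, 2, 3, 10, 11, 12)   # predictor
y <- c(5, 6, 7, 20, 21, 22)   # response
left  <- y[x <  5]            # observations sent to the left leaf
right <- y[x >= 5]            # observations sent to the right leaf
c(left_leaf = mean(left), right_leaf = mean(right))   # leaf predictions: 6 and 21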
At each stage in tree growth, how do we select the best split?
Which node to split at? The goal is to find the node that helps separate the data into more homogeneous groups with respect to the target variable.
Which variable to split with? At each node, the decision tree algorithm considers splitting the data based on different features or variables. It evaluates each variable to see which one best separates the data into distinct groups. The variable that results in the greatest improvement in predictive accuracy is chosen for splitting at that node.
What value of that variable to split at? Once the variable is chosen for splitting, the algorithm determines the optimal value to split the data. This value could be a specific threshold for continuous variables (e.g., height > 170 cm) or a categorical value for categorical variables (e.g., color = “red”). The algorithm searches for the value that maximizes the improvement in predictive accuracy.
The best split is the one which results in the smallest residual sum of squares (RSS) on the training data.
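A minimal sketch of this search for a single numeric predictor (toy data, not rpart's internal code): try each candidate threshold and keep the one with the smallest RSS.
rss_for_split <- function(x, y, cut) {
  left  <- y[x <= cut]
  right <- y[x >  cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 7, 20, 21, 22)
cuts <- sort(unique(x))
cuts <- cuts[-length(cuts)]                         # drop the largest value (no split there)
rss  <- sapply(cuts, function(cut) rss_for_split(x, y, cut))
cuts[which.min(rss)]                                # best split point (here x <= 3)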
for factors with many levels, how many possible splits to consider?
If there are k levels, then there are 2^(k−1) − 1 possible splits to consider (for example, a factor with 4 levels gives 2^3 − 1 = 7 possible splits).
function which fits a regression tree
The rpart() command fits a regression tree, using a similar syntax to lm().
library(rpart)   # rpart() is provided by the rpart package
wage.rp <- rpart(WAGE ~ . , data = wage.train)
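Once fitted, the tree can be used for prediction; a sketch assuming a held-out data frame wage.test with the same columns as wage.train:
wage.pred <- predict(wage.rp, newdata = wage.test)   # predicted wages (leaf means)
mean((wage.test$WAGE - wage.pred)^2)                 # mean squared prediction error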
how to visualise a regression tree with R code
plot(wage.rp, compress = TRUE, margin = 0.1)
text(wage.rp)
The plot command visualises the tree, whilst the text
command adorns it with labels.
Setting compress=TRUE tends to make the tree more visually appealing, while margin=0.1 adds a bit of whitespace around it.
surrogate splits.
Regression trees can handle missing
data using surrogate splits.
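A sketch of the relevant rpart controls, assuming the wage.rp fit above; maxsurrogate and usesurrogate govern how observations with missing predictor values are passed down the tree:
wage.rp.s <- rpart(WAGE ~ . , data = wage.train,
                   control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
summary(wage.rp.s)   # lists the surrogate splits chosen at each node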
Pruning trees
Pruning trees is a technique used to prevent overfitting and improve the generalization ability of decision trees. Overfitting occurs when a tree captures noise in the training data and performs poorly on unseen data. Pruning helps to simplify the tree by removing branches that do not provide significant improvement in predictive accuracy.
Consider the bias-variance trade-off:
One observation per leaf implies lots of flexibility in the model (so low bias) but high variability.
Many observations per leaf reduce flexibility (introducing bias) but reduce variability.
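A sketch of the two extremes via rpart.control, assuming wage.train as before; minsplit and minbucket set the minimum number of observations at which a node may be split and the minimum allowed in a leaf:
# Very flexible tree: leaves can shrink to a single observation (low bias, high variance).
rpart(WAGE ~ . , data = wage.train,
      control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))
# Much stiffer tree: every leaf must hold at least 50 observations (more bias, less variance).
rpart(WAGE ~ . , data = wage.train,
      control = rpart.control(minbucket = 50))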
complexity parameter
Model Complexity:
Low complexity - high bias - low variability.
High complexity - low bias - high variability.
Specified by the cp argument of rpart(); default cp=0.01.
cp=0.1 - simple tree.
cp=0.0001 - complex tree.
The default value of cp = 0.01 is only a rule-of-thumb.
Pick the value of cp that minimizes prediction error.
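A sketch of fitting at the two ends of the scale, assuming wage.train as before (cp can be passed straight to rpart(), which forwards it to rpart.control()):
wage.rp.simple  <- rpart(WAGE ~ . , data = wage.train, cp = 0.1)      # small, simple tree
wage.rp.complex <- rpart(WAGE ~ . , data = wage.train, cp = 0.0001)   # large, complex tree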
Cross-Validation
The idea is that we split the data into equally sized blocks (subgroups). Each block in turn is set aside as the validation data, with the remaining blocks combining to form the
training set.
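rpart() runs this cross-validation automatically when the tree is grown; a sketch of how an object like wage.rp.3 (printed below) might be produced, assuming a deliberately small cp so there are many candidate tree sizes, with xval setting the number of blocks:
wage.rp.3 <- rpart(WAGE ~ . , data = wage.train,
                   control = rpart.control(cp = 0.0001, xval = 10))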
printcp(wage.rp.3)
The xerror column contains cross validation estimates of the (relative) prediction error.
xstd is the standard error (i.e. an estimate of the uncertainty) for these cross-validation estimates.
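A sketch of using the cptable to pick the cp value with the smallest xerror and prune back to it, assuming the wage.rp.3 fit above:
cpt <- wage.rp.3$cptable
best.cp <- cpt[which.min(cpt[, "xerror"]), "CP"]   # cp with smallest cross-validated error
wage.rp.pruned <- prune(wage.rp.3, cp = best.cp)   # prune the tree back to that complexity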
The forest benefits from the instability of the trees
Each bootstrap resample is likely to generate a different tree as tree building is brittle.
Considering only a subset of the available predictor variables for each split in the tree helps ensure the trees are different.