Trees and Forests Flashcards
week five
alternative names of regression trees
- CART (classification and regression trees);
- Recursive partitioning methods.
Random forests are…
Random forests are collections of trees.
A random forest is a collection of decision trees (e.g. regression trees) generated by applying two separate
randomisation processes:
1. The observations (rows) are randomised through a bootstrap resample.
2. A random selection of predictors (columns) is considered for each split, rather than considering all variables.
Random forests are examples of ensemble methods.
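A minimal sketch of fitting a random forest in R with the randomForest package, assuming the wage.train data frame used in the rpart examples below; the ntree and mtry values here are arbitrary choices for illustration:
library(randomForest)
# ntree = number of bootstrap resamples (trees) grown;
# mtry  = number of predictors considered at each split.
wage.rf <- randomForest(WAGE ~ . , data = wage.train, ntree = 500, mtry = 3)
wage.rf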
binary splits
Most commonly, the groups are formed by a sequence of binary splits.
The basic idea of regression trees
The basic idea is to split the data into groups using the predictors, and then estimate the response within each group by a fixed value.
binary tree
The resulting partition of the data can be described by a binary tree. A binary tree is a tree data structure in which each node has at most two children, referred to as the left child and the right child.
the target value at each leaf
At each leaf the target is estimated by the mean value of the y-variable for all data at that leaf.
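A toy illustration (not rpart itself) of one binary split, with each leaf predicting the mean response of its group; the data are made up:
x <- c(1, 2, 3, 10, 11, 12)   # predictor
y <- c(5, 6, 7, 20, 21, 22)   # response
left  <- y[x <  5]            # observations sent to the left leaf
right <- y[x >= 5]            # observations sent to the right leaf
c(left_leaf = mean(left), right_leaf = mean(right))   # leaf predictions: 6 and 21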
At each stage in tree growth, how do we select the best split?
Which node to split at? The goal is to find the node that helps separate the data into more homogeneous groups with respect to the target variable.
Which variable to split with? At each node, the decision tree algorithm considers splitting the data based on different features or variables. It evaluates each variable to see which one best separates the data into distinct groups. The variable that results in the greatest improvement in predictive accuracy is chosen for splitting at that node.
What value of that variable to split at? Once the variable is chosen for splitting, the algorithm determines the optimal value to split the data. This value could be a specific threshold for continuous variables (e.g., height > 170 cm) or a categorical value for categorical variables (e.g., color = “red”). The algorithm searches for the value that maximizes the improvement in predictive accuracy.
The best split is the one which results in the smallest residual sum of squares (RSS) on the training data.
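A minimal sketch of this search for a single numeric predictor (toy data, not rpart's internal code): try each candidate threshold and keep the one with the smallest RSS.
rss_for_split <- function(x, y, cut) {
  left  <- y[x <= cut]
  right <- y[x >  cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 7, 20, 21, 22)
cuts <- sort(unique(x))
cuts <- cuts[-length(cuts)]                         # drop the largest value (no split there)
rss  <- sapply(cuts, function(cut) rss_for_split(x, y, cut))
cuts[which.min(rss)]                                # best split point (here x <= 3)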
for factors with many levels, how many possible splits to consider?
If there are k levels, then there are 2^(k−1) − 1 possible splits to consider (for example, a factor with 4 levels gives 2^3 − 1 = 7 possible splits).
function which fits a regression tree
The rpart() command fits a regression tree, using a similar syntax to lm().
library(rpart)   # rpart() is provided by the rpart package
wage.rp <- rpart(WAGE ~ . , data = wage.train)
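Once fitted, the tree can be used for prediction; a sketch assuming a held-out data frame wage.test with the same columns as wage.train:
wage.pred <- predict(wage.rp, newdata = wage.test)   # predicted wages (leaf means)
mean((wage.test$WAGE - wage.pred)^2)                 # mean squared prediction error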
how to visualise a regression tree with R code
plot(wage.rp, compress = TRUE, margin = 0.1)
text(wage.rp)
The plot command visualises the tree, whilst the text
command adorns it with labels.
Setting compress=TRUE tends to make the tree more visually appealing, while margin=0.1 adds a bit of whitespace around it.
surrogate splits.
Regression trees can handle missing
data using surrogate splits.
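A sketch of the relevant rpart controls, assuming the wage.rp fit above; maxsurrogate and usesurrogate govern how observations with missing predictor values are passed down the tree:
wage.rp.s <- rpart(WAGE ~ . , data = wage.train,
                   control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
summary(wage.rp.s)   # lists the surrogate splits chosen at each node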
Pruning trees
Pruning trees is a technique used to prevent overfitting and improve the generalization ability of decision trees. Overfitting occurs when a tree captures noise in the training data and performs poorly on unseen data. Pruning helps to simplify the tree by removing branches that do not provide significant improvement in predictive accuracy.
Consider the bias-variance trade-off:
One observation per leaf implies lots of flexibility in the model (so low bias) but high variability.
Many observations per leaf reduce flexibility (introducing bias) but reduce variability.
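A sketch of the two extremes via rpart.control, assuming wage.train as before; minsplit and minbucket set the minimum number of observations at which a node may be split and the minimum allowed in a leaf:
# Very flexible tree: leaves can shrink to a single observation (low bias, high variance).
rpart(WAGE ~ . , data = wage.train,
      control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))
# Much stiffer tree: every leaf must hold at least 50 observations (more bias, less variance).
rpart(WAGE ~ . , data = wage.train,
      control = rpart.control(minbucket = 50))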
complexity parameter
Model Complexity:
Low complexity - high bias - low variability.
High complexity - low bias - high variability.
Specified by the cp argument of rpart(); default cp=0.01.
cp=0.1 - simple tree.
cp=0.0001 - complex tree.
The default value of cp = 0.01 is only a rule-of-thumb.
Pick the value of cp that minimizes prediction error.
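A sketch of fitting at the two ends of the scale, assuming wage.train as before (cp can be passed straight to rpart(), which forwards it to rpart.control()):
wage.rp.simple  <- rpart(WAGE ~ . , data = wage.train, cp = 0.1)      # small, simple tree
wage.rp.complex <- rpart(WAGE ~ . , data = wage.train, cp = 0.0001)   # large, complex tree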
Cross-Validation
The idea is that we split the data into equally sized blocks (subgroups). Each block in turn is set aside as the validation data, with the remaining blocks combining to form the
training set.
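rpart() runs this cross-validation automatically when the tree is grown; a sketch of how an object like wage.rp.3 (printed below) might be produced, assuming a deliberately small cp so there are many candidate tree sizes, with xval setting the number of blocks:
wage.rp.3 <- rpart(WAGE ~ . , data = wage.train,
                   control = rpart.control(cp = 0.0001, xval = 10))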
printcp(wage.rp.3)
The xerror column contains cross validation estimates of the (relative) prediction error.
xstd is the standard error (i.e. an estimate of the uncertainty) for these cross-validation estimates.
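A sketch of using the cptable to pick the cp value with the smallest xerror and prune back to it, assuming the wage.rp.3 fit above:
cpt <- wage.rp.3$cptable
best.cp <- cpt[which.min(cpt[, "xerror"]), "CP"]   # cp with smallest cross-validated error
wage.rp.pruned <- prune(wage.rp.3, cp = best.cp)   # prune the tree back to that complexity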
The forest benefits from the instability of the trees
Each bootstrap resample is likely to generate a different tree as tree building is brittle.
Considering only a subset of the available predictor variables for each split in the tree helps ensure the trees are different.