Chapter 5 - Decision Trees Flashcards

1
Q

Terminal Nodes

A

Terminal Nodes/Leaves represent the partitions of the predictor space

2
Q

Internal Nodes

A

Internal Nodes are points along the tree where splits occur

3
Q

Branches

A

Branches are lines that connect any 2 nodes

4
Q

Stump

A

A decision tree with only one internal node is called a stump

5
Q

Decision Tree Making Algorithm

A
  1. Construct a large tree with g terminal nodes using recursive binary splitting.
  2. Obtain a sequence of best subtrees, as a function of lambda, using cost complexity pruning.
  3. Choose lambda by applying k-fold cross-validation. Select the lambda that results in the lowest cross-validation error.
  4. The best subtree is the subtree created in step 2 with the selected lambda value (see the sketch below).
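A minimal scikit-learn sketch of this procedure for a regression tree (X and y stand for training arrays that are assumed to exist; scikit-learn calls the penalty ccp_alpha rather than lambda):

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor
  from sklearn.model_selection import cross_val_score

  # Step 1: grow a large tree and obtain the pruning path (the candidate lambdas,
  # which scikit-learn exposes as ccp_alphas).
  path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

  # Steps 2-3: score the pruned subtree for each candidate lambda with 5-fold CV.
  cv_mse = [-cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5, scoring="neg_mean_squared_error").mean()
            for a in path.ccp_alphas]

  # Step 4: refit with the lambda that gave the lowest cross-validation error.
  best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
  best_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)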
6
Q

Recursive Binary Splitting

A

A top-down, greedy approach to growing decision trees: at each step the best split at that point is made, without looking ahead to splits further down the tree.
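As an illustration only, a rough sketch of one greedy step for a single numeric predictor x and response y (both assumed to be NumPy arrays): the cutpoint chosen is the one that minimizes the total RSS of the two resulting regions.

  import numpy as np

  def best_split(x, y):
      # Greedy search for the cutpoint s minimizing the RSS of {x < s} and {x >= s}.
      best_s, best_rss = None, np.inf
      for s in np.unique(x)[1:]:            # candidate cutpoints (both regions non-empty)
          left, right = y[x < s], y[x >= s]
          rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
          if rss < best_rss:
              best_s, best_rss = s, rss
      return best_s, best_rss

Recursive binary splitting repeats this search over every predictor at every node, keeps the single best split, and then recurses within each of the two new regions.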

7
Q

Classification Error Rate (Em)

A

The fraction of training observations in node m that do not belong to the most frequent category:
E_m = 1 - max_c(p_m,c), where p_m,c is the proportion of training observations in node m that belong to category c.
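A tiny worked example for a hypothetical node whose category proportions are 0.7, 0.2, and 0.1:

  p = [0.7, 0.2, 0.1]          # hypothetical proportions p_m,c for node m
  E_m = 1 - max(p)             # = 0.3, so 30% of the node's observations are misclassified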

8
Q

Gini Index (Gm)

A

G_m = the sum from c = 1 to w of p_m,c(1 - p_m,c), where w is the number of categories. A measure of node impurity that is small when the node is nearly pure.
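For the same hypothetical node (proportions 0.7, 0.2, 0.1):

  p = [0.7, 0.2, 0.1]
  G_m = sum(pc * (1 - pc) for pc in p)   # = 0.46; a pure node would give 0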

9
Q

Cross Entropy (Dm)

A

D_m = the negative sum from c = 1 to w of p_m,c * ln(p_m,c). Like the Gini index, it is small when the node is nearly pure.
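Again for the hypothetical node with proportions 0.7, 0.2, 0.1:

  import math

  p = [0.7, 0.2, 0.1]
  D_m = -sum(pc * math.log(pc) for pc in p)   # about 0.80; a pure node would give 0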

10
Q

Residual Mean Deviance

A

deviance / (n - g), where n is the number of training observations and g is the number of terminal nodes.
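A hypothetical numeric example, assuming a tree with g = 6 terminal nodes fit to n = 81 observations and a total deviance of 150:

  deviance, n, g = 150.0, 81, 6
  residual_mean_deviance = deviance / (n - g)   # = 2.0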

11
Q

Pure Node

A

A node is said to be pure if all the observations in that node are of the same category

12
Q

Classification Error Rate Sensitivity

A

Classification error rate is not as sensitive to node impurity as the Gini index and the cross entropy are.

13
Q

Error Rate Preferences

A

Gini index and cross entropy are preferred over classification error rate in tree-growing. Classification error rate is generally preferred in tree-pruning if prediction accuracy is the ultimate goal.

14
Q

Cost Complexity Pruning Lambda

A

Cost complexity pruning uses a lambda penalty in its minimization objective, similar to lasso regression. The higher the value of lambda, the greater the complexity penalty and the smaller the resulting tree (and its variance). When lambda = 0, the fitted tree is the largest possible tree. When lambda is sufficiently large, the fitted tree is the simplest possible tree, consisting only of the root node and making no splits.
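A small sketch of this effect in scikit-learn, which calls the penalty ccp_alpha (X and y are assumed training arrays):

  from sklearn.tree import DecisionTreeRegressor

  # Larger ccp_alpha (lambda) -> heavier complexity penalty -> smaller fitted tree.
  for alpha in [0.0, 0.01, 0.1, 1.0]:
      tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)
      print(alpha, tree.get_n_leaves())

With alpha = 0 the largest tree is grown; once alpha is large enough the tree collapses to a single leaf, the root node.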

15
Q

Regression Trees vs Linear Models

A

If the relationship between the response and the predictors is well approximated by a linear model, then linear models will likely outperform trees. If that relationship is highly non-linear, or more complex than a linear model can handle, then trees will likely outperform linear models.

16
Q

Advantages of Trees

A

Easy to interpret and explain. Can be presented visually. Handle categorical variables without the need for dummy variables. Mimic human decision-making.

17
Q

Disadvantages of Trees

A

Not robust - A small change in the data can cause a large change in the final tree. Do not have the same degree of predictive accuracy as other statistical methods.

18
Q

Bagging Algorithm

A
  1. Create b bootstrap samples from the original training dataset.
  2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
  3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all b trees (see the sketch below).
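A minimal scikit-learn sketch of this algorithm for a regression problem (X, y, and X_new are assumed arrays; the default base learner of BaggingRegressor is a decision tree):

  from sklearn.ensemble import BaggingRegressor

  # b = 500 bootstrap samples, one tree per sample; the b predictions are averaged.
  bag = BaggingRegressor(n_estimators=500, bootstrap=True, random_state=0).fit(X, y)
  y_hat = bag.predict(X_new)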
19
Q

Bagging Properties

A

A general-purpose procedure that works for many statistical methods, though it is not typically used for time series data. Bagging reduces variance, but the whole bagged model is difficult to interpret. Increasing b does not cause overfitting. Each bagged tree uses about 2/3 of the original observations.

20
Q

Out of Bag Observations (OOB)

A

The observations that are not used to train a particular bagged tree are called out-of-bag observations. The OOB error is a valid estimate of the test error.
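In scikit-learn the OOB estimate can be requested directly; a sketch, again assuming arrays X and y:

  from sklearn.ensemble import BaggingRegressor

  bag = BaggingRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
  print(bag.oob_score_)   # score computed only from out-of-bag predictions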

21
Q

Random Forests Algorithm

A
  1. Create b bootstrap samples from the original training dataset.
  2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, only a random subset of k variables is considered.
  3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all b trees (see the sketch below).
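A scikit-learn sketch (X, y, and X_new are assumed arrays; max_features plays the role of k, set here to p/3 as an example):

  from sklearn.ensemble import RandomForestRegressor

  # b = 500 trees; at each split only a random subset of k = p/3 predictors is considered.
  rf = RandomForestRegressor(n_estimators=500, max_features=1/3, random_state=0).fit(X, y)
  y_hat = rf.predict(X_new)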
22
Q

Random Forests Properties

A

Bagging is a special case of random forests where k = p. Random forests reduce variance by averaging the results of decorrelated trees. Increasing b does not cause overfitting. Decreasing k reduces the correlation between the trees' predictions.

23
Q

Boosting Algorithm

A

Let z_1 be the actual response variable y.
  1. For k = 1, 2, …, b:
     - Use recursive binary splitting to fit a tree f_k with d splits to the data, using z_k as the response.
     - Update z_k by subtracting lambda * f_k(X), i.e. z_(k+1) = z_k - lambda * f_k(X).
  2. Calculate the boosted model prediction as f(X) = the sum from k = 1 to b of lambda * f_k(X) (see the sketch below).
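A rough sketch of these steps for a regression problem (X, y, and X_new are assumed NumPy arrays; lam, b, and d are hypothetical tuning choices):

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  lam, b, d = 0.01, 1000, 1              # shrinkage, number of trees, splits per tree
  z = y.copy()                           # z_1 = y
  trees = []
  for _ in range(b):
      tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, z)   # d splits -> d + 1 leaves
      z = z - lam * tree.predict(X)      # update z_k by subtracting lambda * f_k(X)
      trees.append(tree)

  # Boosted model prediction: f(X) = sum over k of lambda * f_k(X)
  f_hat = lam * np.sum([t.predict(X_new) for t in trees], axis=0)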

24
Q

Boosting Properties

A

Boosting grows trees sequentially, using information from previously grown trees. Increasing b can cause overfitting, so b is usually selected by cross-validation. Boosting reduces bias. d, the interaction depth, controls the complexity of the boosted model: a tree with d splits has d + 1 terminal nodes, and if d = 1 the boosted model is an additive model. lambda, the shrinkage parameter, controls the rate at which boosting learns: boosting is an example of slow learning, and lambda is typically a small number between 0 and 1. A small lambda may require a large b to achieve a good model.