Tree-based methods Flashcards

1
Q

Can tree-based methods be used for regression or classification problems?

A

BOTH

2
Q

Regression Tree Process

A
  1. Divide the predictor space into J non-overlapping regions.
  2. For every observation that falls in a region, make the same prediction: the mean of the training responses in that region.
  3. Splits are chosen with a top-down, greedy approach that does only what's best at the NEXT step, so it doesn't necessarily find the optimal tree.
  4. Apply regularization (cost-complexity pruning) to prune the tree, using CV to choose the regularization parameter. A worked R sketch follows this list.
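
A minimal sketch in R using the tree package with MASS::Boston as a stand-in dataset (the package, dataset, and variable names are illustrative assumptions, not part of the card):

  library(tree)
  library(MASS)                       # for the Boston housing data

  # Steps 1-3: grow a full regression tree with greedy, top-down splits
  fit <- tree(medv ~ ., data = Boston)

  # Step 4: cross-validate cost-complexity pruning to pick the tree size
  cv <- cv.tree(fit)                  # cv$size vs. cv$dev
  best.size <- cv$size[which.min(cv$dev)]
  pruned <- prune.tree(fit, best = best.size)

  # Predictions are the mean response within each terminal region
  predict(pruned, newdata = Boston[1:5, ])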
3
Q

Classification Tree Process

A

Same as regression, except the prediction is the most commonly occurring class in the terminal node/region, and split quality is measured with a classification criterion (e.g., the Gini index) instead of RSS.
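
A minimal sketch in R, again with the tree package, using the built-in iris data (both are illustrative assumptions):

  library(tree)

  # A factor response makes tree() grow a classification tree
  fit <- tree(Species ~ ., data = iris)

  # The predicted class is the majority class of the terminal region
  predict(fit, newdata = iris[1:5, ], type = "class")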

4
Q

How to handle categorical variables with trees

A

You don't need to dummy code them; trees handle categorical variables automatically by splitting on subsets of the category levels.
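
For instance, in R's tree package (an assumption), a factor column goes straight into the formula with no model.matrix() or dummy coding step:

  library(tree)

  # warpbreaks (base R) has two factor predictors: wool and tension
  fit <- tree(breaks ~ wool + tension, data = warpbreaks)

  # Splits are made on subsets of factor levels directly
  plot(fit); text(fit)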

5
Q

Pros/Cons of Decision Trees

A

Pros: interpretability, most closely mirrors human decision making, nice graphical display, no need to dummy code categorical variables.
Cons: low predictive power and high variance. However, these can be offset by methods that aggregate many decision trees, like bagging, random forests, and boosting.

6
Q

Difference between Bagging and Bootstrap

A

Bagging = Bootstrap AGGregation. The bootstrap is just the resampling step: drawing many training sets by sampling with replacement from the one dataset. Bagging is the whole procedure: build a predictive model on each bootstrapped training set and average the results.

You want to do this with algorithms that are susceptible to high variance, like trees, although it can be applied to almost any method. (R sketch below.)
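
A minimal sketch in R: the randomForest package performs bagging when mtry is set to the full number of predictors (the package, dataset, and ntree value are illustrative assumptions):

  library(randomForest)
  library(MASS)                       # Boston housing data

  p <- ncol(Boston) - 1               # number of predictors

  # Bagging = a forest where all p predictors are candidates at each split
  bag <- randomForest(medv ~ ., data = Boston, mtry = p, ntree = 500)

  # The ensemble prediction is the average over the 500 trees
  predict(bag, newdata = Boston[1:5, ])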

7
Q

What happens when Y is qualitative/categorical when you use bagging?

A

Same bootstrap-and-fit procedure, but instead of averaging, the ensemble prediction becomes a majority vote: each model predicts a class, and the most common prediction wins.

8
Q

Test Error with bagging

A

No separate hold-out set is needed. Each bootstrap sample omits roughly 1/3 of the observations on average; these are the out-of-bag (OOB) observations for that model. Predicting each observation using only the trees that never saw it yields the OOB error, a valid estimate of the test error. (R sketch below.)
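
With R's randomForest (an assumption, reusing the bagged fit from the previous card), the OOB error comes for free:

  # predict() with no newdata returns OOB predictions: each observation
  # is predicted only by trees that did not see it during training
  oob.pred <- predict(bag)

  # OOB mean squared error after the final tree
  tail(bag$mse, 1)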

9
Q

Disadvantages of Bagging

A

The resulting model can be difficult to interpret, because you now have lots of trees instead of one. HOWEVER, you can still obtain an overall summary of each predictor's importance: the total decrease in RSS due to splits on that predictor (bagging regression trees), or the total decrease in the Gini index (bagging classification trees), averaged over all trees.

10
Q

Variable importance in bagged vs. non-bagged trees

A

With non-bagged trees, importance is read off the structure: the topmost splits of the tree are the most important, and the lowest leaves are the least important.
With bagged trees, since you have so many trees, you instead have R output a variable importance metric. (Sketch below.)
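
A minimal sketch in R (randomForest, an assumption; the same call works for bagging or a random forest):

  library(randomForest)
  library(MASS)

  # importance = TRUE asks the fit to track per-predictor importance
  rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)

  importance(rf)                      # importance table, one row per predictor
  varImpPlot(rf)                      # dot chart of the same metrics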

11
Q

How does a random forest work

A

Like bagging, except that at each split the model is only allowed to consider a random sample of m predictors from the full set of p predictors, usually about m ≈ √p. A fresh sample of m predictors is chosen at each split. (R sketch below.)
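
A minimal R sketch (note: randomForest's own default mtry for regression is p/3; forcing √p here to match the card is an assumption):

  library(randomForest)
  library(MASS)

  p <- ncol(Boston) - 1

  # Only mtry randomly chosen predictors are split candidates each time
  rf <- randomForest(medv ~ ., data = Boston,
                     mtry = floor(sqrt(p)), ntree = 500)
  rf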

12
Q

Rationale of Random Forest

A

Counteracts the disadvantage of the top-down, greedy approach shared by every tree in a bagged ensemble. Restricting each split to a random subset of predictors decorrelates the trees, and so reduces variance better than bagging does. Usually a large number of trees is used to allow the error rate to settle down (see the plot sketch below).
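
In R (an assumption, reusing the rf fit from the previous card), you can check that the error has settled:

  # OOB error as a function of the number of trees; the curve should
  # flatten once enough trees have been grown
  plot(rf)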

13
Q

Random forest vs. bagging

A

Random forest is generally superior to bagging, because decorrelating the trees yields a larger variance reduction.

14
Q

What is boosting?

A

Generally, an iterative learning process in which each successive model is fit to the previous model's residuals, so the ensemble iteratively improves accuracy in the areas where it performs worst. There are 3 tuning parameters: (1) the number of trees B, (2) the shrinkage/regularization parameter lambda, and (3) the number of splits d in each tree (the interaction depth). (R sketch below.)
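
A minimal sketch with R's gbm package (the package, dataset, and the specific parameter values are illustrative assumptions):

  library(gbm)
  library(MASS)

  # The three tuning knobs: n.trees (B), shrinkage (lambda),
  # and interaction.depth (d)
  boost <- gbm(medv ~ ., data = Boston,
               distribution = "gaussian",   # squared-error loss
               n.trees = 5000,
               shrinkage = 0.01,
               interaction.depth = 4)

  # gbm requires saying how many trees to use at prediction time
  predict(boost, newdata = Boston[1:5, ], n.trees = 5000)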

15
Q

Bagging vs. random forest

A

In bagging, the trees can be highly correlated: even though there are multiple bootstrapped training sets, every tree uses the same greedy top-down approach, so a strong predictor tends to dominate the top splits of all of them. In a random forest, we force the tree to choose each split from a random subset of the predictors, which decorrelates the trees. (R sketch below.)
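
In R's randomForest (an assumption; p and Boston as in the earlier sketches), the difference is literally one argument:

  # Bagging: all p predictors are split candidates every time
  bag <- randomForest(medv ~ ., data = Boston, mtry = p)

  # Random forest: only a random subset of ~sqrt(p) predictors per split
  rf <- randomForest(medv ~ ., data = Boston, mtry = floor(sqrt(p)))

  # Compare final OOB mean squared errors
  c(bagging = tail(bag$mse, 1), random.forest = tail(rf$mse, 1))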
