Tree-based methods Flashcards
Can tree based methods be used for regression or classification problems?
BOTH
Regression Tree Process
- Divide the predictor space into J non-overlapping regions
- For every observation that falls into one of these regions, make the same prediction: the mean of the training responses in that region.
- Uses a top-down, greedy approach that does only what's best at the NEXT step. Doesn't necessarily find the optimal tree.
- Apply regularization (cost-complexity pruning) to prune the tree, and use CV to choose the regularization parameter. See the sketch below.
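A minimal sketch of this process in Python with scikit-learn (the cards reference R, but the idea is the same; the toy data, names, and alpha grid are illustrative assumptions, not part of the original cards):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Steps 1-2 happen inside fit(): the tree greedily picks top-down binary
# splits, and each leaf predicts the mean response of its region.
# Steps 3-4: cost-complexity pruning; ccp_alpha is the regularization
# parameter, chosen here by cross-validation.
alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
).fit(X, y)
print("CV-chosen alpha:", search.best_params_["ccp_alpha"])
```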
Classification Tree Process
Same as Regression, except the prediction is just the most commonly occurring class in the terminal node / region.
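Same sketch adapted for classification, where each terminal node predicts its majority class (again a toy example with assumed names):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each prediction is the most commonly occurring class in that
# observation's terminal node / region
print(clf.predict(X[:3]))
```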
How to handle categorical variables with trees
You don't need to dummy code them; tree implementations such as R's tree()/rpart() split on the factor levels directly!
Pros/Cons of Decision Trees
Pros: interpretability, most closely mirrors human decision making, nice graphical display, and you don't have to dummy code categorical variables.
Cons: low predictive power, but you can use methods that aggregate many decision trees, like bagging, boosting, and random forests.
Trees also suffer from high variance.
Difference between Bagging and Bootstrap
Bagging = Bootstrap AGGregation: build many predictive models on separate bootstrapped training sets and average the results.
You want to do this with algorithms that are susceptible to high variance, like trees, although it can be applied to almost any method.
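A minimal bagging sketch (scikit-learn's BaggingRegressor handles the resampling and averaging; the data and settings are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# 100 trees, each fit on its own bootstrap resample of the training set;
# the default base learner is a decision tree (a high-variance method).
bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)

# Each prediction is the average of the 100 trees' predictions
print(bag.predict(X[:3]))
```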
What happens when Y is qualitative/categorical when you use bagging?
Same bootstrap-and-aggregate process, but the aggregation becomes a majority vote across the models instead of an average.
Test Error with bagging
You don't need a separate hold-out set: each bootstrapped sample leaves out roughly 1/3 of the observations on average. These Out-Of-Bag (OOB) observations act as a built-in test set, since each one can be predicted using only the trees that never saw it.
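Sketch of the OOB estimate (oob_score=True asks scikit-learn to score each observation with only the trees that didn't train on it; toy data again):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

# Test-error estimate with no validation split: each observation is
# predicted only by the ~1/3 of trees for which it was out-of-bag
print("OOB R^2:", bag.oob_score_)
```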
Disadvantages of Bagging
Can be difficult to interpret the resulting model, because you have lots of trees. HOWEVER, you can obtain an overall summary of the importance of each predictor using the total decrease in RSS (bagged regression trees) or the Gini index (bagged classification trees).
Variable importance in Bagging vs. Non Bagged Trees
With a single (non-bagged) tree, the splits near the top of the tree are the most important and the lowest-level splits matter least, so you can read importance straight off the diagram.
Bagged trees - since you have so many trees, there's no single diagram to read, so you have R output a variable importance metric instead (sketched below).
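A sketch of the analogous output in scikit-learn (its impurity-based feature_importances_ corresponds to the RSS/Gini decrease from the previous card; the data and names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, n_informative=2, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in RSS attributable to splits on each predictor,
# averaged over all trees and normalized to sum to 1
for i in np.argsort(forest.feature_importances_)[::-1]:
    print(f"feature {i}: {forest.feature_importances_[i]:.3f}")
```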
How does a random forest work
Like bagging, except that at each split the tree is only allowed to consider a random sample of m predictors from the full set of p predictors, usually about m = sqrt(p). A fresh sample of m predictors is chosen at each split.
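Minimal random-forest sketch (max_features plays the role of m; the toy data and parameter values are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=16, noise=5.0, random_state=0)

# Bootstrapped trees as in bagging, but each split only considers
# m = sqrt(p) randomly chosen predictors (max_features="sqrt")
rf = RandomForestRegressor(
    n_estimators=500,     # large number of trees so the error rate settles down
    max_features="sqrt",  # the m << p restriction that decorrelates the trees
    random_state=0,
).fit(X, y)
print(rf.predict(X[:3]))
```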
Rationale of Random Forest
Counteracts the disadvantage of the top-down, greedy approach of decision trees: forcing each split to consider only m of the p predictors decorrelates the trees, so averaging them reduces variance more than bagging does. Usually use a large number of trees to allow the error rate to settle down.
Random forest vs. Bagging
Random forest is generally superior to bagging/bootstrap; bagging is just the special case of a random forest with m = p.
What is boosting?
Generally, an iterative learning process whereby models are successively fit to the previous model's residuals. This lets the model iteratively improve accuracy in the areas where it performs poorly. There are 3 tuning parameters: (1) the number of trees B, (2) the shrinkage/regularization parameter lambda, (3) the number of splits d in each tree. Unlike bagging, boosting can overfit if B is too large, so choose B by CV.
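A minimal boosting sketch mapping those three parameters onto scikit-learn's gradient boosting (the parameter values are illustrative, and max_depth stands in for the tree size d):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

boost = GradientBoostingRegressor(
    n_estimators=500,    # (1) number of trees B
    learning_rate=0.01,  # (2) shrinkage / regularization parameter
    max_depth=2,         # (3) keeps each tree small (tree size d)
    random_state=0,
).fit(X, y)  # each tree is fit to the residuals of the ensemble so far
print(boost.predict(X[:3]))
```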
Bagging vs. Random Forest
Bagging - the trees can be highly correlated because they all use the same greedy top-down approach, even though they're built on different bootstrapped training sets. In a random forest, we force each tree to choose its splits from a random subset of the predictors, which decorrelates the trees.