Regression Trees Flashcards
What are the differences between regression, LASSO, and machine learning?
- Regression: the analyst defines the variables, the functional form, and the alternative models, and selects the best one
- LASSO: the analyst defines the variables, the functional form, and the broadest model; the algorithm selects the best model (see the sketch below)
- Machine learning: the analyst defines a set of variables but not the functional form; the algorithm defines the possible models and selects the best one. There is no formula and no coefficients
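A minimal sketch of the LASSO case, assuming scikit-learn and synthetic data: the analyst supplies the broadest model (all candidate x variables) and the algorithm zeros out the coefficients it does not need.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # broadest model: all 10 candidate x variables
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# Cross-validation picks the penalty; coefficients of unneeded x's shrink to zero
lasso = LassoCV(cv=5).fit(X, y)
print(lasso.coef_)                   # near-zero for all but the two true predictors
```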
What is a regression tree?
A model that produces predicted values of y for observations with particular values of the x variables
What algorithm do regression trees use?
CART: Classification and Regression Trees
Define top node and the cutoff point.
Top node: the starting node, containing the whole sample
Cutoff point: the value of a given x variable at which the split happens
When do we stop branching?
By using a stopping rule. This can be as simple as stopping after a certain number of levels, requiring a minimum number of observations per node, or requiring a minimum amount of fit improvement (the complexity parameter)
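As an illustrative sketch (the parameter values here are arbitrary), scikit-learn's DecisionTreeRegressor exposes each of these stopping rules directly:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(
    max_depth=3,                 # stop after a certain number of levels
    min_samples_leaf=20,         # minimum number of observations per terminal node
    min_impurity_decrease=0.01,  # minimum fit improvement required to keep splitting
    random_state=0,
).fit(X, y)
```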
What are terminal nodes?
The nodes that remain at the bottom of the tree after the algorithm stops splitting; each observation ends up in exactly one of them
How do we pick the best possible model with regression trees?
Find the predictor AND the cutoff that yield the smallest RMSE
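A minimal NumPy sketch of this search (best_split is a hypothetical helper, not a library function): try every predictor and every candidate cutoff, predict each side of the split by its average y, and keep the combination with the smallest RMSE.

```python
import numpy as np

def best_split(X, y):
    """Search every predictor and cutoff for the split with the smallest RMSE."""
    best = (None, None, np.inf)          # (feature index, cutoff, RMSE)
    n = len(y)
    for j in range(X.shape[1]):
        # Candidate cutoffs: observed values of x_j, excluding the maximum
        for cutoff in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= cutoff
            right = ~left
            # Each side of the split is predicted by its own average y
            sse = np.sum((y[left] - y[left].mean()) ** 2) \
                + np.sum((y[right] - y[right].mean()) ** 2)
            rmse = np.sqrt(sse / n)
            if rmse < best[2]:
                best = (j, cutoff, rmse)
    return best                          # best (feature, cutoff, RMSE) overall
```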
What are the differences between OLS and CART?
OLS outputs coefficient estimates; CART outputs groups of observations defined by their x values, with the average y in each group
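A small comparison on synthetic data, assuming scikit-learn: OLS returns an intercept and slope, while the fitted tree returns one average y per terminal node (a step function, not a formula).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)

ols = LinearRegression().fit(X, y)
print(ols.intercept_, ols.coef_)   # coefficient estimates

cart = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(np.unique(cart.predict(X)))  # one average y value per terminal node
```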
How can we improve the CART method?
- Use an early stopping rule (short-sighted: a split that looks unimportant now may enable an important split later)
- Prune a tree
How do we prune a tree?
- Grow a large tree using a lenient stopping rule
- Prune it back afterwards, removing the splits that add little to the fit
Define cost complexity pruning.
At each step, delete one node, the one that brings the least improvement to the model, and stop when we cannot improve the model fit by deleting any of the remaining nodes
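A sketch of this workflow assuming scikit-learn, which implements cost complexity pruning via ccp_alpha; the data are synthetic, and picking the tree with the best holdout score is one reasonable selection rule among several:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Candidate complexity parameters (alphas), each pruning one more subtree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit at each alpha and keep the tree that fits the holdout data best
best = max(
    (DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_hold, y_hold),
)
```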
Define variable importance.
Measures how much the fit improves when a particular x variable is used for splitting, summed across all splits in which that variable occurs
i.e., it measures how important a variable is in improving the prediction
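For illustration, scikit-learn trees report a version of this summed-fit-improvement measure as feature_importances_, normalized to sum to one (synthetic data below):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
print(tree.feature_importances_)  # one importance score per x variable
```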
T/F: We want low variable importance in a model.
False: we want high variable importance; it helps determine whether a variable should be included in the model
T/F: Variable importance for a regression tree is a post-prediction diagnostic tool, and therefore should be performed on the holdout set.
True
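One way to compute an importance measure on the holdout set is permutation importance (related to, but distinct from, the impurity-based measure above); a sketch with scikit-learn:

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# Shuffle one x variable at a time on the holdout set and measure the fit loss
result = permutation_importance(tree, X_hold, y_hold, n_repeats=10, random_state=0)
print(result.importances_mean)
```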
Why is a regression tree a regression?
It calculates the average of y as a function of x
Why is the regression tree a non-parametric regression?
There are no parameters (intercept, slope coefficients) that describe how the average of y depends on the values of x.
What is a top down, greedy approach?
Top down: begins at the top node (the whole sample) and works its way down
Greedy: makes the best split at a particular step and does not look ahead for better splits
What can’t we do when the CART algorithm performs automatic pattern detection?
We cannot make the mistake of omitting a pattern that would be important; the algorithm detects such patterns automatically
What are the disadvantages of using automatic pattern detection?
- It might lead to overfitting, even after pruning
- The algorithm is sensitive to extreme values and data errors
- Variable importance still does not tell us how variable x is associated with y, in what functional form, or through what interactions (black box)
Explain how a regression tree works.
First, a regression tree looks at all of the x variables and chooses the variable and the cutoff value that split the observations into two groups with the lowest RMSE. It then repeats this process at each new node, again finding the x variable and cutoff value that yield the lowest RMSE. This continues until it reaches a stopping rule or there is only one observation in each bin.
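Putting the whole description together, here is a self-contained toy implementation (pure NumPy; names like grow_tree are hypothetical, and real CART implementations are far more optimized):

```python
import numpy as np

def grow_tree(X, y, min_leaf=5, depth=0, max_depth=3):
    """Greedy, top-down CART sketch: split on the x and cutoff with the
    smallest RMSE, then recurse until a stopping rule binds."""
    # Stopping rules: depth limit, or too few observations to split further
    if depth == max_depth or len(y) < 2 * min_leaf:
        return {"predict": y.mean()}     # terminal node: average y

    best = None                          # (rmse, feature index, cutoff)
    for j in range(X.shape[1]):
        for cutoff in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= cutoff
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            rmse = np.sqrt(sse / len(y))
            if best is None or rmse < best[0]:
                best = (rmse, j, cutoff)

    if best is None:                     # no split satisfies the rules
        return {"predict": y.mean()}

    _, j, cutoff = best
    left = X[:, j] <= cutoff
    return {
        "feature": j,
        "cutoff": cutoff,
        "left": grow_tree(X[left], y[left], min_leaf, depth + 1, max_depth),
        "right": grow_tree(X[~left], y[~left], min_leaf, depth + 1, max_depth),
    }

def predict(node, x):
    """Follow the splits down to a terminal node and return its average y."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["cutoff"] else node["right"]
    return node["predict"]

# Usage on synthetic data: a step in y at x1 = 5 that the tree should find
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 10.0, 0.0) + rng.normal(size=200)
print(predict(grow_tree(X, y), np.array([7.0, 1.0])))  # close to 10
```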