Regression Trees Flashcards

1
Q

What are the differences between regression, LASSO, and machine learning?

A

Regression: the analyst defines the variables, functional form, and alternative models, and selects the best model
LASSO: the analyst defines the variables, functional form, and the broadest model; the algorithm selects the best model
Machine learning: the analyst defines a set of variables but not the functional form; the algorithm defines the possible models and selects the best one. No formula, no coefficients

2
Q

What is a regression tree?

A

A model that produces predicted values of y for observations with particular values of the x variables

3
Q

What algorithm do regression trees use?

A

CART: Classification and Regression Trees

4
Q

Define top node and the cutoff point.

A

Top node: the starting node, containing the entire sample
Cutoff point: the value of a given x variable at which the split happens

5
Q

When do we stop branching?

A

By using a stopping rule. This can be as simple as stopping after a certain number of levels, at a minimum number of observations per node, or at a minimum amount of fit improvement (the complexity parameter)
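The three stopping rules above can be sketched as one check that fires before each split. This is a minimal pure-Python illustration; the function name and default thresholds are made up for the example, not from the source.

```python
def should_stop(depth, n_obs, fit_improvement,
                max_depth=3, min_obs=20, min_improvement=0.01):
    # Stop branching if any rule is triggered: the tree is deep
    # enough, the node has too few observations, or the candidate
    # split improves fit by less than the complexity-parameter
    # threshold.
    return (depth >= max_depth
            or n_obs < min_obs
            or fit_improvement < min_improvement)
```

In a library such as scikit-learn, the analogous knobs on a tree model are parameters like `max_depth` and `min_samples_leaf`.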

6
Q

What are terminal nodes?

A

The set of nodes after the algorithm stops

7
Q

How do we pick the best possible model with regression trees?

A

Find the predictor AND the cutoff that yield the smallest RMSE
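This search over every predictor and every candidate cutoff can be sketched in pure Python. The function names are illustrative, not from the source; real implementations use faster sorted-scan versions of the same idea.

```python
import math

def rmse_of_split(y_left, y_right):
    # After a split, the prediction in each group is the group mean;
    # the RMSE is computed from deviations around those means.
    sse = 0.0
    for group in (y_left, y_right):
        mean = sum(group) / len(group)
        sse += sum((v - mean) ** 2 for v in group)
    n = len(y_left) + len(y_right)
    return math.sqrt(sse / n)

def best_split(X, y):
    # X: list of observations, each a list of predictor values.
    # Try every predictor and every observed cutoff; keep the
    # (predictor, cutoff) pair with the smallest RMSE.
    best = None
    for j in range(len(X[0])):
        for cutoff in sorted({row[j] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[j] <= cutoff]
            right = [y[i] for i, row in enumerate(X) if row[j] > cutoff]
            if not left or not right:
                continue
            score = rmse_of_split(left, right)
            if best is None or score < best[0]:
                best = (score, j, cutoff)
    return best  # (rmse, predictor index, cutoff)
```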

8
Q

What are the differences between OLS and CART?

A

The output of OLS is a set of coefficient estimates; the output of CART is a set of splits on x values and the average y value within each terminal node

9
Q

How can we improve the CART method?

A
  1. Early stopping rule (short-sighted: a split now may not be as important as a split later)
  2. Prune a tree
10
Q

How do we prune a tree?

A
  1. Grow a tree, with a small stopping rule
  2. Prune it afterwards, then let it grow again
11
Q

Define cost complexity pruning.

A

At each step, delete one node: the one that brings the least improvement to the model. Stop when we cannot improve the model fit by deleting any of the remaining nodes
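The delete-the-weakest-node loop can be sketched as follows. This is a simplified illustration, assuming we already know how much the fit worsens when each node's split is deleted; real cost-complexity pruning recomputes these losses as subtrees are collapsed, and the function name and threshold are made up for the example.

```python
def prune_order(candidate_losses, threshold):
    # candidate_losses: node -> increase in error if that node's
    # split is deleted. Repeatedly delete the node whose removal
    # hurts fit least, stopping once every remaining deletion
    # would worsen fit by more than the threshold.
    remaining = dict(candidate_losses)
    deleted = []
    while remaining:
        node = min(remaining, key=remaining.get)
        if remaining[node] > threshold:
            break
        deleted.append(node)
        del remaining[node]
    return deleted
```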

12
Q

Define variable importance.

A

Measures how much the fit improves when a particular x variable is used for splitting, summed across all splits in which that variable occurs

ie: measures how important a variable is in improving the prediction
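The "summed across all splits" definition is simple to sketch in pure Python. The function name and input shape are illustrative assumptions, not from the source.

```python
def variable_importance(splits):
    # splits: list of (variable_name, fit_improvement) pairs, one
    # per split in the fitted tree. A variable's importance is the
    # total fit improvement summed over all splits that use it.
    totals = {}
    for var, improvement in splits:
        totals[var] = totals.get(var, 0.0) + improvement
    return totals
```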

13
Q

T/F: We want low variable importance in a model.

A

False: we want high variable importance, which helps determine whether a variable should be included in the model

14
Q

T/F: Variable importance for a regression tree is a post prediction diagnostic tool, and therefore should be performed on the holdout set.

A

True

15
Q

Why is a regression tree a regression?

A

It calculates the average y as a function of x

16
Q

Why is the regression tree a non-parametric regression?

A

There are no parameters (intercept, slope coefficients) that describe how average y depends on the values of x.

17
Q

What is a top down, greedy approach?

A

Top down: begins at the top and trickles its way down
Greedy: makes the best split at a particular step and does not look ahead for better splits

18
Q

What can’t we do when the CART algorithm performs automatic pattern detection?

A

We cannot make the mistake of omitting something that would be important: the algorithm considers every variable and interaction automatically

19
Q

What are the disadvantages of using automatic pattern detection?

A
  1. Might lead to overfitting, even after pruning
  2. The algorithm is sensitive to extreme values and data errors
  3. Variable importance still does not tell us how a variable x is associated with y, in what functional form, or through what interactions (black box)
20
Q

Explain how a regression tree works.

A

First, a regression tree looks at all the x variables and chooses which x, and which cutoff value, splits the observations into the two groups with the lowest RMSE. Then it repeats this process over and over again: at each new node, it again determines which x variable and which cutoff value yield the lowest RMSE. This continues until a stopping rule is reached or there is only one observation in each bin.
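The whole procedure described above can be sketched as a short recursive function: find the best (predictor, cutoff) pair, split, and recurse until a stopping rule fires, with each terminal node predicting the mean y. This is a minimal pure-Python illustration with made-up function names and defaults, not a production implementation (which would use sorted scans and a proper complexity parameter).

```python
def grow_tree(X, y, depth=0, max_depth=2, min_obs=2):
    # CART sketch: exhaustively search every predictor and every
    # observed cutoff for the split with the lowest sum of squared
    # errors, then recurse on the two resulting groups.
    mean_y = sum(y) / len(y)
    if depth >= max_depth or len(y) < min_obs:
        return {"predict": mean_y}  # terminal node: average y
    best = None
    for j in range(len(X[0])):
        for cutoff in {row[j] for row in X}:
            li = [i for i, row in enumerate(X) if row[j] <= cutoff]
            ri = [i for i, row in enumerate(X) if row[j] > cutoff]
            if not li or not ri:
                continue
            sse = 0.0
            for idx in (li, ri):
                m = sum(y[i] for i in idx) / len(idx)
                sse += sum((y[i] - m) ** 2 for i in idx)
            if best is None or sse < best[0]:
                best = (sse, j, cutoff, li, ri)
    if best is None:
        return {"predict": mean_y}
    _, j, cutoff, li, ri = best
    return {
        "var": j, "cutoff": cutoff,
        "left": grow_tree([X[i] for i in li], [y[i] for i in li],
                          depth + 1, max_depth, min_obs),
        "right": grow_tree([X[i] for i in ri], [y[i] for i in ri],
                           depth + 1, max_depth, min_obs),
    }

def predict(tree, row):
    # Walk from the top node down to a terminal node.
    while "predict" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["cutoff"] else tree["right"]
    return tree["predict"]
```

Note the greedy, top-down character: each call picks the best split at that step without looking ahead.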