Regression Trees Flashcards
What are the differences between regression, LASSO, and machine learning?
- Regression: the analyst defines the variables, the functional form, and the alternative models, and selects the best one
- LASSO: the analyst defines the variables, the functional form, and the broadest model; the algorithm selects the best model (see the sketch below)
- Machine learning: the analyst defines a set of variables but not the functional form; the algorithm defines the possible models and selects the best one. There is no formula and no coefficients
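A minimal sketch of the LASSO case, assuming scikit-learn and synthetic data: the analyst supplies the broadest model (all candidate x variables) and the algorithm zeros out the coefficients it does not need.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # broadest model: all 10 candidate x variables
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# Cross-validation picks the penalty; coefficients of unneeded x's shrink to zero
lasso = LassoCV(cv=5).fit(X, y)
print(lasso.coef_)                   # near-zero for all but the two true predictors
```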
What is a regression tree?
A model that produces predicted values of y for observations with particular values of the x variables
What algorithm do regression trees use?
CART: Classification and Regression Trees
Define top node and the cutoff point.
Top node: the starting node, containing the whole sample
Cutoff point: the value of a given x variable at which the split happens
When do we stop branching?
By using a stopping rule. This can be as simple as stopping after a certain number of levels, requiring a minimum number of observations per node, or requiring a minimum amount of fit improvement (the complexity parameter)
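As an illustrative sketch (the parameter values here are arbitrary), scikit-learn's DecisionTreeRegressor exposes each of these stopping rules directly:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(
    max_depth=3,                 # stop after a certain number of levels
    min_samples_leaf=20,         # minimum number of observations per terminal node
    min_impurity_decrease=0.01,  # minimum fit improvement required to keep splitting
    random_state=0,
).fit(X, y)
```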
What are terminal nodes?
The nodes that remain at the bottom of the tree after the algorithm stops splitting; each observation ends up in exactly one of them
How do we pick the best possible model with regression trees?
Find the predictor AND the cutoff that yield the smallest RMSE
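A minimal NumPy sketch of this search (best_split is a hypothetical helper, not a library function): try every predictor and every candidate cutoff, predict each side of the split by its average y, and keep the combination with the smallest RMSE.

```python
import numpy as np

def best_split(X, y):
    """Search every predictor and cutoff for the split with the smallest RMSE."""
    best = (None, None, np.inf)          # (feature index, cutoff, RMSE)
    n = len(y)
    for j in range(X.shape[1]):
        # Candidate cutoffs: observed values of x_j, excluding the maximum
        for cutoff in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= cutoff
            right = ~left
            # Each side of the split is predicted by its own average y
            sse = np.sum((y[left] - y[left].mean()) ** 2) \
                + np.sum((y[right] - y[right].mean()) ** 2)
            rmse = np.sqrt(sse / n)
            if rmse < best[2]:
                best = (j, cutoff, rmse)
    return best                          # best (feature, cutoff, RMSE) overall
```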
What are the differences between OLS and CART?
OLS outputs coefficient estimates; CART outputs groups of observations defined by their x values, with the average y in each group
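A small comparison on synthetic data, assuming scikit-learn: OLS returns an intercept and slope, while the fitted tree returns one average y per terminal node (a step function, not a formula).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)

ols = LinearRegression().fit(X, y)
print(ols.intercept_, ols.coef_)   # coefficient estimates

cart = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(np.unique(cart.predict(X)))  # one average y value per terminal node
```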
How can we improve the CART method?
- Use an early stopping rule (short-sighted: a split that looks unimportant now may enable an important split later)
- Prune a tree
How do we prune a tree?
- Grow a large tree using a lenient stopping rule
- Prune it back afterwards, removing the splits that add little to the fit
Define cost complexity pruning.
At each step, delete one node, the one that brings the least improvement to the model, and stop when we cannot improve the model fit by deleting any of the remaining nodes
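A sketch of this workflow assuming scikit-learn, which implements cost complexity pruning via ccp_alpha; the data are synthetic, and picking the tree with the best holdout score is one reasonable selection rule among several:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Candidate complexity parameters (alphas), each pruning one more subtree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit at each alpha and keep the tree that fits the holdout data best
best = max(
    (DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_hold, y_hold),
)
```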
Define variable importance.
Measures how much the fit improves when a particular x variable is used for splitting, summed across all splits in which that variable occurs
i.e., it measures how important a variable is in improving the prediction
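For illustration, scikit-learn trees report a version of this summed-fit-improvement measure as feature_importances_, normalized to sum to one (synthetic data below):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
print(tree.feature_importances_)  # one importance score per x variable
```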
T/F: We want low variable importance in a model.
False: we want high variable importance; it helps determine whether a variable should be included in the model
T/F: Variable importance for a regression tree is a post-prediction diagnostic tool, and therefore should be performed on the holdout set.
True
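One way to compute an importance measure on the holdout set is permutation importance (related to, but distinct from, the impurity-based measure above); a sketch with scikit-learn:

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# Shuffle one x variable at a time on the holdout set and measure the fit loss
result = permutation_importance(tree, X_hold, y_hold, n_repeats=10, random_state=0)
print(result.importances_mean)
```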
Why is a regression tree a regression?
It calculates the average of y as a function of x
Why is the regression tree a non-parametric regression?
There are no parameters (intercept, slope coefficients) that describe how the average of y depends on the values of x.
What is a top down, greedy approach?
Top down: begins at the top node (the whole sample) and works its way down
Greedy: makes the best split at a particular step and does not look ahead for better splits
What can’t we do when the CART algorithm performs automatic pattern detection?
We cannot make the mistake of omitting a pattern that would be important; the algorithm detects such patterns automatically
What are the disadvantages of using automatic pattern detection?
- It might lead to overfitting, even after pruning
- The algorithm is sensitive to extreme values and data errors
- Variable importance still does not tell us how variable x is associated with y, in what functional form, or through what interactions (black box)
Explain how a regression tree works.
First, a regression tree looks at all of the x variables and chooses the variable and the cutoff value that split the observations into two groups with the lowest RMSE. It then repeats this process at each new node, again finding the x variable and cutoff value that yield the lowest RMSE. This continues until it reaches a stopping rule or there is only one observation in each bin.
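Putting the whole description together, here is a self-contained toy implementation (pure NumPy; names like grow_tree are hypothetical, and real CART implementations are far more optimized):

```python
import numpy as np

def grow_tree(X, y, min_leaf=5, depth=0, max_depth=3):
    """Greedy, top-down CART sketch: split on the x and cutoff with the
    smallest RMSE, then recurse until a stopping rule binds."""
    # Stopping rules: depth limit, or too few observations to split further
    if depth == max_depth or len(y) < 2 * min_leaf:
        return {"predict": y.mean()}     # terminal node: average y

    best = None                          # (rmse, feature index, cutoff)
    for j in range(X.shape[1]):
        for cutoff in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= cutoff
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            rmse = np.sqrt(sse / len(y))
            if best is None or rmse < best[0]:
                best = (rmse, j, cutoff)

    if best is None:                     # no split satisfies the rules
        return {"predict": y.mean()}

    _, j, cutoff = best
    left = X[:, j] <= cutoff
    return {
        "feature": j,
        "cutoff": cutoff,
        "left": grow_tree(X[left], y[left], min_leaf, depth + 1, max_depth),
        "right": grow_tree(X[~left], y[~left], min_leaf, depth + 1, max_depth),
    }

def predict(node, x):
    """Follow the splits down to a terminal node and return its average y."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["cutoff"] else node["right"]
    return node["predict"]

# Usage on synthetic data: a step in y at x1 = 5 that the tree should find
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 10.0, 0.0) + rng.normal(size=200)
print(predict(grow_tree(X, y), np.array([7.0, 1.0])))  # close to 10
```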