Regression Trees Flashcards
What are the differences between regression, LASSO, and machine learning?
Regression: the analyst defines the variables and functional form, specifies the candidate models, and selects the best model
LASSO: the analyst defines the variables, the functional form, and the broadest model; the algorithm selects the best model
Machine Learning: the analyst defines a set of variables but not the functional form; the algorithm defines the possible models and selects the best one. No formula, no coefficients
What is a regression tree?
A model that produces predicted values of y for observations with particular values of the x variables
What algorithm do regression trees use?
CART: Classification and Regression Trees
Define top node and the cutoff point.
Top node: the starting point of the sample
Cutoff point: the value at which the split happens for a given variable x
When do we stop branching?
By using a stopping rule. This can be as simple as a maximum number of levels, a minimum number of observations in a node, or a minimum amount of fit improvement (the complexity parameter)
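The three stopping rules can be sketched in a small recursive tree grower. This is an illustrative sketch with one predictor, not the CART implementation of any library: the names `grow`, `max_depth`, `min_obs`, and `min_gain` are hypothetical, and `min_gain` is measured here relative to the parent node's squared error, which is a simplification of how a complexity parameter is usually defined.

```python
def sse(ys):
    """Sum of squared errors around the mean of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def grow(xs, ys, depth=0, max_depth=2, min_obs=2, min_gain=0.01):
    """Grow a one-predictor tree; leaves hold the average y of their observations."""
    leaf = {"n": len(ys), "prediction": sum(ys) / len(ys)}
    # Stopping rule 1: maximum number of levels reached
    if depth >= max_depth:
        return leaf
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        cutoff = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < cutoff]
        right = [y for x, y in zip(xs, ys) if x >= cutoff]
        # Stopping rule 2: each child must keep at least min_obs observations
        if len(left) < min_obs or len(right) < min_obs:
            continue
        child_sse = sse(left) + sse(right)
        if best is None or child_sse < best[1]:
            best = (cutoff, child_sse)
    if best is None:
        return leaf
    # Stopping rule 3: the relative fit improvement must reach min_gain
    parent_sse = sse(ys)
    if parent_sse == 0 or (parent_sse - best[1]) / parent_sse < min_gain:
        return leaf
    cutoff = best[0]
    lx = [x for x in xs if x < cutoff]
    ly = [y for x, y in zip(xs, ys) if x < cutoff]
    rx = [x for x in xs if x >= cutoff]
    ry = [y for x, y in zip(xs, ys) if x >= cutoff]
    return {
        "cutoff": cutoff,
        "left": grow(lx, ly, depth + 1, max_depth, min_obs, min_gain),
        "right": grow(rx, ry, depth + 1, max_depth, min_obs, min_gain),
    }

def count_leaves(node):
    """Number of terminal nodes in the grown tree."""
    if "cutoff" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Made-up data for illustration
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 5, 6, 6, 20, 20, 21, 21]
print(count_leaves(grow(xs, ys, max_depth=1)))             # one level of splits
print(count_leaves(grow(xs, ys, max_depth=2)))             # two levels of splits
print(count_leaves(grow(xs, ys, max_depth=2, min_obs=3)))  # stricter node size
```

Tightening any one of the three arguments makes the tree stop earlier and end up with fewer terminal nodes.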
What are terminal nodes?
The final nodes (leaves) of the tree, where no further splits are made once the algorithm stops
How do we pick the best possible model with regression trees?
Find the predictor AND the cutoff that yields the smallest RMSE
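A minimal sketch of this search, assuming candidate cutoffs are the midpoints between consecutive distinct values of each predictor (a common convention, but an assumption here). The function names `rmse_of_split` and `best_split` and the toy data are hypothetical.

```python
from math import sqrt

def rmse_of_split(xs, ys, cutoff):
    """RMSE when each side of the cutoff is predicted by its own average y."""
    left = [y for x, y in zip(xs, ys) if x < cutoff]
    right = [y for x, y in zip(xs, ys) if x >= cutoff]
    total_sse = 0.0
    for group in (left, right):
        mean = sum(group) / len(group)
        total_sse += sum((y - mean) ** 2 for y in group)
    return sqrt(total_sse / len(ys))

def best_split(data, y):
    """data maps predictor name -> list of values; returns (name, cutoff, rmse)."""
    best = None
    for name, xs in data.items():
        values = sorted(set(xs))
        # Midpoints guarantee that both sides of the split are non-empty
        for lo, hi in zip(values, values[1:]):
            cutoff = (lo + hi) / 2
            score = rmse_of_split(xs, y, cutoff)
            if best is None or score < best[2]:
                best = (name, cutoff, score)
    return best

# Made-up data for illustration
data = {"age": [1, 2, 3, 10, 11, 12], "km": [5, 1, 4, 2, 6, 3]}
price = [20, 21, 19, 8, 9, 7]
name, cutoff, rmse = best_split(data, price)
print(name, cutoff, round(rmse, 3))
```

The search is exhaustive over both predictors and cutoffs, which is why the answer stresses "AND": the best cutoff for one variable is only a candidate until it has been compared with the best cutoffs for all the others.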
What are the differences between OLS and CART?
OLS outputs coefficient estimates; CART outputs bins defined by x values and the average y value within each bin
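The contrast in outputs can be illustrated on the same toy data. This is only a sketch: `ols` is a hand-written simple linear regression, `stump` is a one-split tree whose cutoff is fixed by hand (2.5, a hypothetical choice) rather than searched for, and the data are made up.

```python
def ols(xs, ys):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def stump(xs, ys, cutoff):
    """One-split tree; returns the average y on each side of the cutoff."""
    left = [y for x, y in zip(xs, ys) if x < cutoff]
    right = [y for x, y in zip(xs, ys) if x >= cutoff]
    return {"below": sum(left) / len(left), "above": sum(right) / len(right)}

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
intercept, slope = ols(xs, ys)
bins = stump(xs, ys, cutoff=2.5)
print("OLS coefficients:", intercept, slope)
print("CART bin averages:", bins)
```

OLS summarizes the relationship in two numbers that apply to every x, while the tree reports a separate average y for each bin of x and has no coefficients to interpret.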
How can we improve the CART method?
- Early stopping rule (short-sighted: a split now may not be as important as a split later)
- Prune a tree
How do we prune a tree?
- Grow a large tree, using a lenient stopping rule
- Prune it afterwards, then let it grow again
Define cost complexity pruning.
At each step, delete the one node that brings the least improvement to the model fit. Stop when deleting any further node no longer improves the model fit
Define variable importance.
Measures how much the fit improves when a particular x variable is used for splitting, summed across all splits in which that variable occurs
ie: measures how important a variable is in improving the prediction
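The summing across splits can be sketched as follows. This is an illustration under stated assumptions: fit improvement is measured as the reduction in the sum of squared errors, the tree is grown greedily to a fixed depth, and for simplicity the gains are computed on the same made-up data used to grow the tree. The function name `importance` and its arguments are hypothetical.

```python
from collections import defaultdict

def sse(ys):
    """Sum of squared errors around the mean of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def importance(rows, ys, depth=0, max_depth=2, gains=None):
    """rows: list of dicts (predictor name -> value).
    Returns {name: SSE reduction summed over all splits on that predictor}."""
    if gains is None:
        gains = defaultdict(float)
    if depth >= max_depth or len(ys) < 2:
        return gains
    best = None
    for name in rows[0]:
        values = sorted({r[name] for r in rows})
        for lo, hi in zip(values, values[1:]):
            cutoff = (lo + hi) / 2
            left = [y for r, y in zip(rows, ys) if r[name] < cutoff]
            right = [y for r, y in zip(rows, ys) if r[name] >= cutoff]
            gain = sse(ys) - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, name, cutoff)
    if best is None or best[0] <= 0:
        return gains
    gain, name, cutoff = best
    gains[name] += gain  # credit the improvement to the splitting variable
    left_idx = [i for i, r in enumerate(rows) if r[name] < cutoff]
    right_idx = [i for i, r in enumerate(rows) if r[name] >= cutoff]
    for idx in (left_idx, right_idx):
        importance([rows[i] for i in idx], [ys[i] for i in idx],
                   depth + 1, max_depth, gains)
    return gains

# Made-up data: y depends strongly on age and weakly on km
ages = [1, 1, 2, 2, 9, 9, 10, 10]
kms = [1, 5, 1, 5, 1, 5, 1, 5]
rows = [{"age": a, "km": k} for a, k in zip(ages, kms)]
ys = [20, 18, 20, 18, 10, 8, 10, 8]
gains = importance(rows, ys)
total = sum(gains.values())
for name, gain in gains.items():
    print(name, gain, round(gain / total, 3))
```

Here `km` is used in two splits and `age` in only one, yet `age` ends up far more important, because importance adds up the size of the fit improvement rather than counting splits.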
T/F: We want low variable importance in a model.
False: we want high variable importance, since it shows that a variable contributes to the prediction and helps determine whether it should be included in the model
T/F: Variable importance for a regression tree is a post prediction diagnostic tool, and therefore should be performed on the holdout set.
True
Why is a regression tree a regression?
It calculates the average y as a function of x