Advanced Regression Flashcards
What contexts can trees be used in?
-classification
-decision making
-logistic regression
Trees in regression
split the data into branches, where each branch gets its own regression model that fits its subset of the data better
How it’s done in practice
-not a full regression
-simple regression using only a constant term
-y = a0
a0 = (sum of yi over data points i in the node) / (# data points in the node) = average response in the node
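The constant-leaf formula above can be sketched in a few lines of Python; the `node_responses` values are a hypothetical example:

```python
# Each leaf predicts a constant: the average response of the
# training points that fall into that node (y = a0).
node_responses = [3.0, 5.0, 4.0, 6.0]  # hypothetical y values in one node
a0 = sum(node_responses) / len(node_responses)  # average response in node
```

Every point that lands in this node gets the same prediction, a0.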
Branching method
start with half the data
-for each leaf
-calculate variance
-split on each factor; find the split with the biggest variance decrease, and make that split (if the decrease is more than a threshold)
repeat until no split decreases variance more than threshold
then, using the other half of the data, for each branching point:
-calculate estimation error with and without branching
-if branching increases error, remove branching
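The split-search step above can be sketched as follows. This is a simplified single-factor version (a real tree scans every factor); `best_split` and its threshold parameter are names introduced here for illustration:

```python
def variance(ys):
    """Population variance of a list of responses."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(points, threshold=0.0):
    """Greedy search for the split value that most reduces variance.

    points: list of (x, y) pairs for one factor.
    Returns the split value, or None if no split beats the threshold.
    """
    ys = [y for _, y in points]
    base = variance(ys) * len(ys)  # total squared error before splitting
    best, best_drop = None, threshold
    for x_split, _ in points:
        left = [y for x, y in points if x <= x_split]
        right = [y for x, y in points if x > x_split]
        if not left or not right:
            continue  # reject: one side of the branch would be empty
        drop = base - (variance(left) * len(left) + variance(right) * len(right))
        if drop > best_drop:
            best, best_drop = x_split, drop
    return best
```

For data like [(1, 1), (2, 1), (3, 10), (4, 10)], the search picks x <= 2, which separates the low responses from the high ones and drives both sides' variance to zero.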
Other branching methods
key idea
-use a metric related to the model's quality
-find the best factor to branch with
-check: did this really improve the model?
if not, prune the branch back
rejecting a potential branch
-low improvement benefit
-one side of the branch has too few data points
-rule of thumb: each leaf contains at least 5% of the original data
overfitting can be costly; make sure the benefit of each branch is greater than the cost
random forest differences from branching
-introduce randomness
-generate many different trees
-different strengths and weaknesses
-average may be better than a single tree?
How do we introduce randomness to random trees?
- bootstrapping process
give each tree a slightly different set of data
if we start with n data points, each tree gets n points sampled with replacement, so it may have multiple copies of one point and none of another
2. branching
-choose 1 factor at a time
-randomly choose a small set X of factors (commonly 1 + log(n) of them, where n is the total number of factors)
-choose the best factor within X to branch on
-don't prune the tree
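Both sources of randomness above can be sketched together; `bootstrap_sample` and `random_factor_subset` are illustrative names, and the 1 + log(n) count follows the rule of thumb in the notes:

```python
import math
import random

def bootstrap_sample(data, rng):
    """Draw n points with replacement: some points appear
    multiple times, others not at all."""
    return [rng.choice(data) for _ in range(len(data))]

def random_factor_subset(n_factors, rng):
    """At each split, consider only a small random subset of the
    factors (roughly 1 + log(n) of them)."""
    k = 1 + int(math.log(n_factors))
    return rng.sample(range(n_factors), k)

rng = random.Random(0)           # fixed seed so the sketch is repeatable
data = list(range(10))           # hypothetical data points
sample = bootstrap_sample(data, rng)       # one tree's training data
subset = random_factor_subset(20, rng)     # factors considered at one split
```

Because each tree sees a different bootstrap sample and a different factor subset at each split, no two trees in the forest end up identical.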
Results of random forest
-each tree in the forest has slightly different data
-end up with many different trees (usually 500-1000) (the random forest)
-each tree may give us a different regression model (which to choose?)
How do you pick the tree in the random forest?
You don’t use a single one!
-if it's a regression tree, use the average predicted response across all the trees in your forest
-if it's a classification tree, use the mode (most common predicted response)
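The two aggregation rules above can be sketched with the standard library; the prediction lists are hypothetical outputs from a five-tree forest:

```python
from statistics import mean, mode

# Hypothetical per-tree predictions from a small forest of 5 trees.
regression_preds = [2.1, 2.4, 1.9, 2.0, 2.6]
classification_preds = ["yes", "no", "yes", "yes", "no"]

forest_regression = mean(regression_preds)          # average across trees
forest_classification = mode(classification_preds)  # most common response
```

No single tree's answer is used directly; the forest's answer is the average (regression) or the majority vote (classification).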
Benefits of random forest
better overall estimates: some trees may overfit, but they don't all overfit the same way
-averaging across trees somewhat neutralizes the over-fitting
drawback of random forest
-harder to explain/interpret the results
-doesn't tell us how the variables interact or why a certain sequence of branches is helpful or meaningful, because all the trees' branches are different
random forest good in what situation?
-good as a black-box / default model
-when there's no good reason to try something else
-not good for detailed insight into what's going on
Explainability/interpretability
how easy or difficult it is to know how models create their output
ex of explainability: linear regression
y = a0 + sum over j = 1 to n of aj * xij
how is the value of y affected by different values of the predictors?
the baseline is a0, and each coefficient controls the "weight" of each variable, i.e. how much it impacts the response
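The coefficient-as-weight reading above can be shown directly; the fitted values a0, a1, a2 are hypothetical:

```python
# Hypothetical fitted linear model: y = a0 + a1*x1 + a2*x2
a0, a1, a2 = 10.0, 2.0, -0.5

def predict(x1, x2):
    return a0 + a1 * x1 + a2 * x2

# Explainability: raising x1 by 1 raises y by exactly a1,
# holding x2 fixed -- the coefficient IS the effect.
delta = predict(3.0, 4.0) - predict(2.0, 4.0)
```

This direct coefficient-to-effect reading is exactly what gets lost once predictions are conditional on which branch of a tree a point falls into.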
ex of explainability: linear regression tree
all of the results are conditional based on the branch they are in
-even harder with random forest
-it does give the relative branching importance of each variable
-but not how each variable matters, so it's not precise