Chapter 8 - Bagging, Random Forest, Boosting Flashcards
Bagging - what is it? when do we do it?
Bootstrap Aggregating. In the bootstrap, we replicate our dataset by sampling from it with replacement. In bagging, we average the predictions of a model fit to many bootstrap samples.
when do we bag?
- when a regression method or classifier has a tendency to overfit, bagging reduces the variance of the prediction.
- when n is large, the empirical distribution is similar to the true distribution
- bootstrap samples are like independent realizations of the data
- bagging amounts to averaging the fits from many (approximately) independent datasets, which would reduce the variance by a factor of 1/B if the fits were truly independent (sketch below)
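A minimal sketch of this averaging on synthetic data, assuming scikit-learn decision trees as the high-variance base learner (all names, sizes, and values below are illustrative):

```python
# Bagging by hand: fit a tree to each bootstrap sample, average the predictions.
# Synthetic data; n, p, B are illustrative choices, not from the notes.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, p, B = 200, 5, 100
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=n)

preds = np.zeros((B, n))
for b in range(B):
    idx = rng.integers(0, n, size=n)              # bootstrap sample: n draws with replacement
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds[b] = tree.predict(X)                    # b-th tree's prediction for every observation

y_bagged = preds.mean(axis=0)                     # bagged prediction: average over the B trees
```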
Bagging Decision Trees
disadvantage: each tree built from a BS sample can look completely different, so we lose the interpretability of a single tree. Variable importance gives a summary (sketch after the steps below):
1. for each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we split on that predictor in the tree T_b
2. average this total over all B bootstrap trees
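A sketch of this importance measure, assuming scikit-learn's BaggingRegressor (each fitted tree's feature_importances_ stores its normalized total RSS decrease per predictor, which we then average over the B trees); data are synthetic placeholders:

```python
# Variable importance for bagged trees: average each tree's per-predictor
# RSS (impurity) decrease over the B bootstrap fits.
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=200)

# the default base estimator is a decision tree, so this is bagged trees
bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)

# each tree's feature_importances_ = its normalized total RSS decrease per predictor
importances = np.mean([t.feature_importances_ for t in bag.estimators_], axis=0)
print(importances)    # one importance score per predictor
```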
Out of Bag Error
to estimate the test error of a bagged model, we could use CV, but each bootstrap sample contains only ~63% of the distinct observations.
Idea: use the rest of the observations as a test set
- OOB error: for each observation x_i, find the predictions yhat_b_i from all bootstrap samples b which do not contain x_i (there should be around 0.37B of them), and average these predictions to get yhat_oob_i
- compute the squared error (y_i - yhat_oob_i)^2
- average the error over all observations
- when B is sufficiently large, this is roughly equivalent to LOOCV error
- the OOB/test error decreases and then levels off as we increase B (a large B does not cause overfitting)
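A hand-rolled sketch of this OOB computation on synthetic data (B, n, and the data-generating step are illustrative):

```python
# OOB error by hand: predict each observation only with the trees whose
# bootstrap sample did not contain it, then average the squared errors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, B = 200, 200
X = rng.normal(size=(n, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=n)

oob_sum = np.zeros(n)      # running sum of OOB predictions per observation
oob_count = np.zeros(n)    # number of trees for which observation i is out of bag

for b in range(B):
    idx = rng.integers(0, n, size=n)            # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)       # the ~37% of observations left out
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    oob_sum[oob] += tree.predict(X[oob])
    oob_count[oob] += 1

has_oob = oob_count > 0                         # with large B, essentially all observations
y_oob = oob_sum[has_oob] / oob_count[has_oob]   # yhat_oob_i
oob_mse = np.mean((y[has_oob] - y_oob) ** 2)    # average of (y_i - yhat_oob_i)^2
print(oob_mse)
```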
The problem with bagging trees
the trees built from different BS samples can be highly correlated (e.g., one strong predictor dominates the top split in most trees), so averaging them reduces variance less than averaging independent trees would
Random Forest
- as in bagging, fit a decision tree to each of many BS samples
- when growing each tree, at every split we consider only a random sample of m < p predictors
- this leads to very different (or “uncorrelated”) trees for each sample
- finally, average the predictions of the trees (slightly increasing bias but substantially decreasing variance)
- choosing m: around sqrt(p) is the usual default (p/3 is common for regression), but m can be treated as a tuning parameter (sketch below)
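A minimal random-forest sketch, assuming scikit-learn's RandomForestRegressor, where the m above corresponds to max_features (data and settings are illustrative):

```python
# Random forest: each split considers only m of the p predictors,
# which decorrelates the trees before averaging.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))                         # p = 9 predictors
y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=300)

m = int(np.sqrt(X.shape[1]))                          # m ~ sqrt(p) = 3; treat as tunable
rf = RandomForestRegressor(n_estimators=500, max_features=m,
                           oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)                                  # OOB R^2, handy for tuning m
```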
Boosting (don’t really understand this)
set fhat(x) = 0 and r_i = y_i for i = 1, ..., n
for b = 1 to B iterate:
- fit a decision tree fhat_b with d splits to the current residuals r_1, ..., r_n
- update the model: fhat(x) <- fhat(x) + lambda * fhat_b(x)
- update the residuals: r_i <- r_i - lambda * fhat_b(x_i)
output the final model fhat(x) = sum over b of lambda * fhat_b(x)
boosting learns slowly: each new tree is fit to the residuals of the current model, so observations that are already predicted well contribute small residuals and later trees concentrate on the harder cases; the shrinkage lambda slows the learning further
tuning parameters: the shrinkage lambda, the number of splits d, and the number of trees B (unlike bagging, boosting can overfit if B is too large, so B is chosen by CV); see the sketch below
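A direct sketch of the boosting algorithm above, fitting a small tree to the current residuals at each step and shrinking it by lambda (scikit-learn trees; B, d, lambda, and the data are illustrative):

```python
# Boosting by hand: repeatedly fit a d-split tree to the residuals,
# add a shrunken copy to the model, and update the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.3, size=n)

B, d, lam = 500, 2, 0.01
trees = []
r = y.copy()                                     # r_i = y_i initially
for b in range(B):
    # a tree with d splits has d + 1 terminal nodes
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)
    trees.append(tree)
    r -= lam * tree.predict(X)                   # r_i <- r_i - lambda * fhat_b(x_i)

def f_hat(X_new):
    """Boosted model: fhat(x) = sum over b of lambda * fhat_b(x)."""
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)

print(np.mean((y - f_hat(X)) ** 2))              # training MSE of the boosted model
```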