Chapter 8 - Bagging, Random Forest, Boosting Flashcards

1
Q

Bagging - what is it? when do we do it?

A

Bootstrap Aggregating. In the bootstrap, we replicate our dataset by sampling n observations with replacement. In bagging, we fit the model to many bootstrap samples and average their predictions.

when do we bag?

  1. when a regression method or classifier has a tendency to overfit: bagging reduces the variance of the prediction
  2. it works because, when n is large, the empirical distribution is close to the true distribution
  3. so bootstrap samples behave like (approximately) independent realizations of the data
  4. bagging then amounts to averaging the fits from many roughly independent datasets, which can reduce the variance by a factor of up to 1/B (see the sketch below)
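
A minimal sketch of the bagging recipe, assuming numpy and scikit-learn's DecisionTreeRegressor are available; the helper names bag_trees and bagged_predict are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, B=100, seed=0):
    """Fit one deep (high-variance) tree per bootstrap sample and return the fitted trees."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                     # draw n row indices with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X_new):
    """Average the B tree predictions; the averaging is what reduces the variance."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```

Each tree is grown without pruning, so it overfits its own bootstrap sample; the variance reduction comes entirely from averaging the B predictions.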
2
Q

Bagging Decision Trees

A

disadvantage: each tree built from a BS sample can be completely different, so the result can no longer be read as a single tree. We can still measure variable importance:
1. for each predictor, add up the total amount by which the RSS (or Gini index) decreases every time the predictor is used in tree T_b
2. average this total over all B bootstrap trees (see the sketch below)
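
A hedged sketch of that importance measure, assuming scikit-learn trees such as those returned by the illustrative bag_trees helper above; it leans on feature_importances_, which records each predictor's total (normalized) impurity decrease within a single tree:

```python
import numpy as np

def bagged_importance(trees):
    # feature_importances_ is each tree's total impurity (RSS / Gini) decrease per
    # predictor, normalized within the tree; averaging it over the B bootstrap
    # trees gives the bagged variable-importance measure described above.
    return np.mean([t.feature_importances_ for t in trees], axis=0)
```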

3
Q

Out of Bag Error

A

to estimate the test error of a bagged model we could use CV, but each bootstrap sample contains only about 63% of the distinct observations.

Idea: use the remaining observations as a test set

  • OOB error: for each observation x_i, collect the predictions yhat_b(x_i) from all bootstrap samples b that do not contain x_i (there should be around 0.37 B of them), and average them to get yhat_oob_i
  • compute the squared error (y_i - yhat_oob_i)^2
  • average this error over all n observations
  • when B is large this is virtually equivalent to LOOCV
  • the OOB/test error decreases as we increase B and then levels off (adding more trees does not cause overfitting); see the sketch below
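
A self-contained sketch of the OOB calculation, assuming numpy and scikit-learn; oob_mse is an illustrative name, and the only trick is remembering which rows each bootstrap sample drew:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_mse(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    pred_sum = np.zeros(n)      # running sum of OOB predictions per observation
    pred_cnt = np.zeros(n)      # number of trees that left each observation out (~0.37 * B)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)       # rows not drawn, ~37% of the data
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        pred_sum[oob] += tree.predict(X[oob])
        pred_cnt[oob] += 1
    yhat_oob = pred_sum / np.maximum(pred_cnt, 1)   # average OOB prediction per observation
    return np.mean((y - yhat_oob) ** 2)             # average squared error over all n rows
```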
4
Q

The problem with bagging trees

A

the trees grown from different BS samples can be very similar (highly correlated): for example, if one strong predictor dominates, nearly every tree uses it in the top split, so averaging the trees reduces the variance far less than averaging independent fits would

5
Q

Random Forest

A
  • fit a decision tree to each of many BS samples
  • when growing each tree, consider only a random sample of m < p predictors at each split
  • this leads to very different (or “uncorrelated”) trees across samples
  • finally, average the predictions of the trees (slightly increasing bias but decreasing variance)
  • choosing m: m ≈ sqrt(p) is usually close to optimal, but m can also be treated as a tuning parameter (see the sketch below)
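
A short sketch using scikit-learn's RandomForestRegressor, where max_features plays the role of m (X_train and y_train are placeholder names for illustration):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,       # number of bootstrap trees B
    max_features="sqrt",    # m = sqrt(p) predictors considered at each split
    oob_score=True,         # reuses the out-of-bag idea from the previous card
    random_state=0,
)
# rf.fit(X_train, y_train)
# rf.oob_score_ then gives an OOB estimate of performance without a separate test set.
```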
6
Q

Boosting (don’t really understand this)

A
set fhat(x) = 0, and r_i = y_i for i = 1, ..., n
for b = 1 to B iterate:
- fit a decision tree fhat_b with d splits to the residuals r_1, ..., r_n (used as the response)
- update the fit: fhat(x) <- fhat(x) + lambda * fhat_b(x)
- update the residuals: r_i <- r_i - lambda * fhat_b(x_i)
output the final model:
fhat(x) = sum over b of lambda * fhat_b(x)

boosting learns slowly: each new tree is fit to the current residuals rather than to the outcome y, so the model keeps improving in the regions where it still predicts poorly, and the shrinkage lambda slows this learning down even further

tuning parameters: the shrinkage lambda, the number of splits d, and the number of trees B (unlike bagging, boosting can overfit if B is too large); see the sketch below
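
A minimal sketch of this loop, assuming numpy and scikit-learn; boost_trees and boost_predict are illustrative names, and a tree with d splits is grown via max_leaf_nodes = d + 1:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, B=1000, d=1, lam=0.01):
    """Boosting for regression: repeatedly fit a small tree to the current residuals."""
    r = y.astype(float).copy()          # residuals start at r_i = y_i (fhat = 0)
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)
        r -= lam * tree.predict(X)      # shrink the new fit and update the residuals
        trees.append(tree)
    return trees

def boost_predict(trees, X_new, lam=0.01):
    # fhat(x) = sum over b of lam * fhat_b(x); lam must match the value used in training
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)
```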
