Chapter 28 Bagging and Random Forest Flashcards
WHAT IS THE BOOTSTRAP METHOD? P136
The bootstrap is a powerful statistical method for estimating a quantity from a data sample, e.g. a mean or a standard deviation, or even quantities used in ML algorithms such as learned coefficients.
_________
Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples drawn from that population, using replacement during the sampling process. (After an instance is drawn, it is placed back in the dataset, so the same instance can be picked multiple times.)
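A minimal sketch of the idea in Python, assuming a toy 1-D array of observations (the data, sample count, and seed are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=5, size=200)  # toy sample of 200 observations

# Draw 1,000 bootstrap samples, each the same size as the original data and
# drawn WITH replacement, and compute the statistic of interest on each one.
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("standard error of that estimate:", np.std(boot_means))
```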
WHAT IS BAGGING? P137
Bagging (Bootstrap Aggregation) is the application of the bootstrap procedure to a high-variance ML algorithm, typically decision trees.
WHAT ARE THE STEPS OF BAGGING? P137
1- Create many (e.g. 100) random subsamples of the dataset, with replacement.
2- Train a CART model on each subsample.
3- Given new data, average the predictions from all the models (see the sketch after this list).
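A sketch of these three steps, assuming a toy regression dataset and scikit-learn's DecisionTreeRegressor as the CART model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(100):
    # Step 1: bootstrap subsample -- indices drawn with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit a CART model on the subsample; scikit-learn trees are
    # grown deep and left unpruned by default.
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Step 3: average the predictions from all the models on new data.
X_new = X[:5]
print(np.mean([m.predict(X_new) for m in models], axis=0))
```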
ARE THE TREES IN BAGGING PRUNED? WHY? P137
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. only a few training samples at each leaf node) and the trees are NOT pruned. Such trees have high variance, which is an important characteristic of sub-models whose predictions are combined using bagging.
WHAT PROBLEM CAN THE GREEDY ALGORITHM OF DECISION TREES CAUSE IN BAGGED DECISION TREES? P138
A problem with decision trees is that they are greedy: they choose which variable to split on using a greedy algorithm that minimizes error. As such, even with bagging, the decision trees can have a lot of structural similarities and, in turn, produce highly correlated predictions. Combining predictions from multiple models in an ensemble works better if the predictions from the sub-models are uncorrelated, or at least only weakly correlated.
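One way to see this correlation directly is to inspect the per-tree predictions of a bagged ensemble. A rough sketch using scikit-learn's BaggingRegressor on a toy dataset (names and sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0)
bag.fit(X, y)

# Correlation between the per-tree predictions: values near 1 indicate
# structurally similar trees whose errors will not average out well.
preds = np.array([tree.predict(X) for tree in bag.estimators_])
corr = np.corrcoef(preds)
print("mean pairwise correlation:", corr[np.triu_indices_from(corr, k=1)].mean())
```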
WHAT DOES RANDOM FOREST DO TO AVOID GREEDINESS? P138
Whereas bagged decision trees look through all features to find the best split point, random forest randomly chooses a sample of the features to search at each split.
WHAT IS A GOOD DEFAULT FOR THE NUMBER OF FEATURES HYPERPARAMETER IN RANDOM FOREST? P138
For classification: m = √p
For regression: m = p/3
where p is the total number of features and m is the number of features searched at each split.
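In scikit-learn these defaults map onto the max_features parameter of the random forest estimators (a sketch; the library's own defaults may differ between versions, so they are set explicitly here):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: search sqrt(p) features at each split,
# e.g. p = 16 features -> sqrt(16) = 4 candidates per split.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")

# Regression: search p/3 features at each split, passed as a fraction,
# e.g. p = 12 features -> 12/3 = 4 candidates per split.
reg = RandomForestRegressor(n_estimators=100, max_features=1 / 3)
```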
WHEN USING THE BOOTSTRAP METHOD, WHAT ARE THE SAMPLES THAT ARE NOT PICKED CALLED? P138
Out-Of-Bag samples (OOB)
HOW IS THE PERFORMANCE OF EACH MODEL IN BAGGING AND THE WHOLE ENSEMBLE MEASURED? WHAT IS THIS ESTIMATE OF PERFORMANCE CALLED? P138
The performance of each model on the samples it did not see during training (its out-of-bag samples), averaged across all models, provides an estimated accuracy of the bagged ensemble. This is often called the OOB estimate of performance.
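scikit-learn exposes this directly through the oob_score option. A minimal sketch on a toy classification dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree is evaluated on the samples left out of its bootstrap sample;
# the averaged result is reported as the OOB score after fitting.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB estimate of accuracy:", forest.oob_score_)
```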
HOW CAN WE PREPARE DATA FOR BAGGED CART? P139
Bagged CART does not require any special data preparation other than a good representation of the problem.