Chapter 28 Bagging and Random Forest Flashcards
WHAT IS THE BOOTSTRAP METHOD? P136
The bootstrap is a powerful statistical method for estimating a quantity from a data sample, e.g. a mean or a standard deviation, or even quantities used in ML algorithms such as learned coefficients.
_________
Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples drawn from that population, using replacement during the sampling process. (After an instance is drawn, it is placed back in the dataset, so the same instance can be picked multiple times.)
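A minimal sketch of the idea in Python, assuming a toy 1-D array of observations (the data, sample count, and seed are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=5, size=200)  # toy sample of 200 observations

# Draw 1,000 bootstrap samples, each the same size as the original data and
# drawn WITH replacement, and compute the statistic of interest on each one.
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("standard error of that estimate:", np.std(boot_means))
```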
WHAT IS BAGGING? P137
Bagging (Bootstrap Aggregation) is the application of the bootstrap procedure to a high-variance ML algorithm, typically decision trees.
WHAT ARE THE STEPS OF BAGGING? P137
1- Create many (e.g. 100) random subsamples of the dataset, with replacement.
2- Train a CART model on each subsample.
3- Given new data, average the predictions from all the models (see the sketch after this list).
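A sketch of these three steps, assuming a toy regression dataset and scikit-learn's DecisionTreeRegressor as the CART model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(100):
    # Step 1: bootstrap subsample -- indices drawn with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit a CART model on the subsample; scikit-learn trees are
    # grown deep and left unpruned by default.
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Step 3: average the predictions from all the models on new data.
X_new = X[:5]
print(np.mean([m.predict(X_new) for m in models], axis=0))
```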
ARE THE TREES IN BAGGING PRUNED? WHY? P137
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. only a few training samples at each leaf node) and the trees are NOT pruned. Such trees have high variance, which is an important characteristic of sub-models whose predictions are combined using bagging.
WHAT PROBLEM CAN THE GREEDY ALGORITHM OF DECISION TREES CAUSE IN BAGGED DECISION TREES? P138
A problem with decision trees is that they are greedy: they choose which variable to split on using a greedy algorithm that minimizes error. As such, even with bagging, the decision trees can have a lot of structural similarities and, in turn, produce highly correlated predictions. Combining predictions from multiple models in an ensemble works better if the predictions from the sub-models are uncorrelated, or at least only weakly correlated.
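One way to see this correlation directly is to inspect the per-tree predictions of a bagged ensemble. A rough sketch using scikit-learn's BaggingRegressor on a toy dataset (names and sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0)
bag.fit(X, y)

# Correlation between the per-tree predictions: values near 1 indicate
# structurally similar trees whose errors will not average out well.
preds = np.array([tree.predict(X) for tree in bag.estimators_])
corr = np.corrcoef(preds)
print("mean pairwise correlation:", corr[np.triu_indices_from(corr, k=1)].mean())
```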
WHAT DOES RANDOM FOREST DO TO AVOID GREEDINESS? P138
Whereas bagged decision trees look through all features to find the best split point, random forest randomly chooses a sample of the features to search at each split.
WHAT IS A GOOD DEFAULT FOR THE NUMBER OF FEATURES HYPERPARAMETER IN RANDOM FOREST? P138
For classification: m = √p
For regression: m = p/3
where p is the total number of features and m is the number of features searched at each split.
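In scikit-learn these defaults map onto the max_features parameter of the random forest estimators (a sketch; the library's own defaults may differ between versions, so they are set explicitly here):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: search sqrt(p) features at each split,
# e.g. p = 16 features -> sqrt(16) = 4 candidates per split.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")

# Regression: search p/3 features at each split, passed as a fraction,
# e.g. p = 12 features -> 12/3 = 4 candidates per split.
reg = RandomForestRegressor(n_estimators=100, max_features=1 / 3)
```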
WHEN USING THE BOOTSTRAP METHOD, WHAT ARE THE SAMPLES THAT ARE NOT PICKED CALLED? P138
Out-Of-Bag samples (OOB)
HOW IS THE PERFORMANCE OF EACH MODEL IN BAGGING AND THE WHOLE ENSEMBLE MEASURED? WHAT IS THIS ESTIMATE OF PERFORMANCE CALLED? P138
The performance of each model on the samples it did not see during training (its out-of-bag samples), averaged across all models, provides an estimated accuracy of the bagged ensemble. This is often called the OOB estimate of performance.
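scikit-learn exposes this directly through the oob_score option. A minimal sketch on a toy classification dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree is evaluated on the samples left out of its bootstrap sample;
# the averaged result is reported as the OOB score after fitting.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB estimate of accuracy:", forest.oob_score_)
```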
HOW CAN WE PREPARE DATA FOR BAGGED CART? P139
Bagged CART does not require any special data preparation other than a good representation of the problem.