Random Forest Flashcards
Random Forest in 1 Sentence
An ensemble method that trains many decorrelated decision trees on bootstrapped (bagged) samples of the data, considering a random subset of features at each split, and aggregates their predictions.
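A minimal sketch of that idea, assuming scikit-learn (the toy dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; any feature matrix X and label vector y would do.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 trees, each fit on a bootstrap sample of the rows, with a
# random subset of features considered at every split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Each tree votes; the forest predicts the majority class.
print(clf.predict(X[:5]))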
Tree Depth
The number of successive splits from a tree's root down to its deepest leaf. You can set a maximum depth to stop trees from subsplitting further, which limits overfitting.
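A small sketch, again assuming scikit-learn, of how max_depth caps tree growth (values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

deep = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)  # no cap
capped = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

# Depth of each component tree: uncapped trees grow until leaves are pure.
print([t.get_depth() for t in deep.estimators_])
print([t.get_depth() for t in capped.estimators_])  # all <= 3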
Minimum Samples per Split
vs
Min Samples per Leaf
Both are ways to prune the tree.
Per Split: The minimum number of samples a node must contain before it is allowed to be split.
Per Leaf: The minimum number of samples that each leaf resulting from a split must contain. Tuning this parameter has the effect of smoothing predictions, since it won't allow (near-)empty leaves; see the sketch below.
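A sketch contrasting the two parameters in scikit-learn (the values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# min_samples_split=20: a node with fewer than 20 samples is never split.
# min_samples_leaf=10: no split may produce a child with fewer than 10
# samples, which rules out (near-)empty leaves and smooths predictions.
clf = RandomForestClassifier(
    n_estimators=100,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=0,
).fit(X, y)

# The constraints yield fewer, larger leaves per tree than the defaults.
print([t.get_n_leaves() for t in clf.estimators_[:3]])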
Gini Impurity
A method for calculating the purity of a split. It measures the likelihood that a new sample would be classified incorrectly if it were labeled at random according to the distribution of class labels at that node. Candidate splits are compared by the Gini gain each would produce.
https://youtu.be/7VeUPuFGJHk?t=391
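A small worked example of both quantities, assuming binary class labels:

import numpy as np

def gini_impurity(labels):
    # Class probabilities at this node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Gini = 1 - sum(p_i^2): the chance a random sample is mislabeled
    # when labeled according to the node's class distribution.
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    # Gain = parent impurity minus the size-weighted child impurity.
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]      # a perfect split
print(gini_impurity(parent))            # 0.5
print(gini_gain(parent, left, right))   # 0.5: impurity drops to 0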
OOB Score
Out of Bag Score
Random Forest’s internal accuracy estimate, computed from predictions on the samples that were left out of each component decision tree's bootstrap sample. Roughly equivalent to test-set accuracy.
E.g., DT1 is trained on a bootstrap sample covering roughly 2/3 of the data; DT1 then predicts the remaining ~1/3 it never saw. The accuracy of those predictions, aggregated across all trees, is the OOB score.
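In scikit-learn this is a single flag; a sketch (the dataset is illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True scores each sample using only the trees whose
# bootstrap sample left that sample out.
clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                             random_state=0).fit(X, y)
print(clf.oob_score_)  # accuracy estimate without a held-out set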
Maximum Features (RF)
The number of features considered as split candidates at each node. Typically sqrt(n_features) or log2(n_features); keeping it well below n_features helps decorrelate the trees.
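A sketch of the scikit-learn spellings (the dataset is illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# "sqrt" and "log2" are built-in shorthands; with 16 features both
# consider 4 candidates per split. An int or float gives an exact
# count or fraction of features instead.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             random_state=0).fit(X, y)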