Random Forest Flashcards


1
Q

How does Random Forest differ from traditional tree algorithms?

A

Random Forest is an ensemble method built from bagged decision trees, where only a random subset of features is considered at each split. This prevents every tree from repeatedly choosing the same strong features, which would otherwise produce highly correlated trees.

It then either averages the predictions of the individual trees (regression) or takes a majority vote across the trees (classification) to make the final prediction.
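
As a rough illustration of the difference (a minimal sketch assuming scikit-learn, with the iris dataset as a stand-in), we can compare a single decision tree with a random forest on the same data:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# a single, fully grown decision tree
tree = DecisionTreeClassifier(random_state=0)

# an ensemble of bagged trees, each split drawing from a random feature subset
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("tree   CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())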

2
Q

What hyperparameters can be tuned for a Random Forest, in addition to each individual tree's hyperparameters?

A

A Random Forest is essentially bagged (resampled) decision trees with a random feature subset chosen at each split point, so we have 2 new hyperparameters that we can tune (see the sketch after this list):

n_estimators: the number of decision trees in the forest.

max_features: maximum number of features that are evaluated for splitting each node.
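
A minimal sketch of setting these two forest-level hyperparameters (assuming scikit-learn; the values are arbitrary examples, not recommendations), alongside one ordinary per-tree hyperparameter for contrast:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of decision trees in the forest
    max_features="sqrt",   # max features evaluated at each split
    max_depth=5,           # a regular per-tree hyperparameter, still tunable
    random_state=0,
)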

3
Q

Are Random Forests prone to overfitting? Why?

A

No, Random Forests are NOT prone to overfitting, because bagging (resampling to create the subset trees) and randomized feature selection tend to average out noise in the model. Adding more trees does not cause overfitting, since the randomization process continues to average out noise (more trees generally reduces overfitting in random forests).

In general, bagging algorithms are robust to overfitting.

Having said that, it is possible to overfit with Random Forest models if the underlying decision trees have extremely high variance (e.g., extremely high max_depth and low min_samples_split) and a large percentage of the features is considered at each split point. In the extreme case where every tree is essentially identical, a random forest may overfit the data.
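
To illustrate the "more trees does not overfit" point (a minimal sketch assuming scikit-learn; the dataset and tree counts are arbitrary), the out-of-bag score typically stabilizes rather than degrades as trees are added:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

for n in (25, 50, 100, 300):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                bootstrap=True, random_state=0, n_jobs=-1)
    rf.fit(X, y)
    print(n, "trees -> OOB accuracy:", round(rf.oob_score_, 3))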

4
Q

Explain how Random Forests are constructed.

A
  1. Bootstrap a sample data set by RANDOMLY selecting ROWS (with replacement) from the given data set.
  2. For each node split, consider only a RANDOM SAMPLE SUBSET of FEATURES (in order to de-correlate the splits) and select the candidate feature with the lowest Gini impurity.
  3. LOOP BACK to step 1: RESAMPLE rows into another BOOTSTRAPPED data set and again CONSIDER only a randomized SUBSET of features (step 2) to build another tree. Repeating this produces a VARIETY of BOOTSTRAPPED decision trees, and the VARIETY is what makes RF outperform an individual tree.

"Run the data" down through EACH individual tree resulting from each BOOTSTRAPPED data set to arrive at an INDIVIDUAL decision. Finally, AGGREGATE the INDIVIDUAL tree decisions across the forest ensemble: a majority vote for the FINAL classification, or an average for regression (terminology: bagging).

(We also have the option to validate the trained predictor on out-of-bag data, since roughly 1/3 of the rows go unselected in each bootstrap sample.)
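
A minimal from-scratch sketch of this construction (assuming scikit-learn and NumPy; DecisionTreeClassifier's max_features argument handles the per-split random feature subset, and the counts are arbitrary):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_rows = X.shape[0]

trees = []
for _ in range(100):
    # step 1: bootstrap rows (sampling with replacement)
    idx = rng.integers(0, n_rows, size=n_rows)
    # step 2: grow a tree that considers only a random feature subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# "run the data down" every tree and take the majority vote for one sample
votes = np.array([t.predict(X[:1])[0] for t in trees]).astype(int)
print("ensemble prediction for the first row:", np.bincount(votes).argmax())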

5
Q

Explain what the three main random forest hyperparameters do

A

A common problem with decision trees is that they tend to fit the training data TOO CLOSELY, i.e. overfitting. This has motivated the widespread use of an ensemble learning method called RANDOM FOREST. In RF, MANY decision trees are trained, but each tree receives only a BOOTSTRAPPED sample of observations (i.e. a RANDOM SAMPLE of observations with REPLACEMENT that matches the original number of observations) and each node only considers a SUBSET of features when determining the BEST SPLIT. This forest of RANDOMIZED decision trees (hence the name) VOTES to determine the predicted class.

Being a forest rather than a single decision tree, RF has certain parameters that are either unique to random forests or particularly important:

FIRST, the max_features parameter determines the max number of FEATURES to CONSIDER AT EACH NODE. It TAKES a NUMBER OF ARGUMENT types, including an int (number of features), a float (fraction of features), and "sqrt" (square root of the number of features). In recent scikit-learn versions the default for classifiers is "sqrt" (older versions used "auto", which was equivalent).

SECOND, the bootstrap parameter sets whether each tree is trained on a bootstrap sample drawn with replacement (the default) or on the full training set (bootstrap=False).

THIRD, the n_estimators parameter sets the NUMBER OF DECISION TREES to include in the FOREST.
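
To see all three parameters in use (a minimal tuning sketch assuming scikit-learn; the grid values are arbitrary examples):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],       # number of trees in the forest
    "max_features": ["sqrt", 0.5, None],  # features considered at each split
    "bootstrap": [True, False],           # bootstrap sample vs. full data per tree
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)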

6
Q

Explain how decision trees identify feature importance

A

One of the major benefits of decision trees is interpretability: specifically, we are able to visualize the ENTIRE model.

However, a random forest model is composed of MANY INDIVIDUAL decision trees. This makes a simple, intuitive visualization of a RF impractical. That said, there is an alternative option: we can compare and visualize the RELATIVE IMPORTANCE of EACH FEATURE.

In our single decision tree visualization example, we saw that decision rules based only on petal width were able to classify many observations correctly. Intuitively, THIS MEANS PETAL WIDTH is an IMPORTANT FEATURE in our classifier. More formally, FEATURES whose SPLITS produce the GREATEST MEAN DECREASE in IMPURITY (e.g., Gini or entropy impurity in classification, variance in regression) ARE CONSIDERED MORE IMPORTANT.

However, there are two things to keep in mind regarding feature importance.

First, sklearn requires that we break up nominal categorical features into multiple binary features, which has the effect of SPREADING the IMPORTANCE of that FEATURE ACROSS ALL of the BINARY FEATURES and can often make each feature appear to be unimportant even when the original nominal categorical feature is highly important.

Second, if two features are HIGHLY CORRELATED, ONE FEATURE will CLAIM MUCH of the IMPORTANCE, making the other feature appear FAR LESS important, which has implications for interpretation if not considered.

In sklearn, classification and regression decision trees and random forests report the RELATIVE IMPORTANCE of each feature via the feature_importances_ attribute.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

rf = RandomForestClassifier(random_state=0, n_jobs=-1)

# train model
clf = rf.fit(X, y)
# compute feature importances
importances = clf.feature_importances_
# sort feature importances in descending order
indices = np.argsort(importances)[::-1]
# rearrange feature names so they match the sorted feature importances
names = [iris.feature_names[i] for i in indices]

# create plot
plt.figure()
plt.title('Feature Importances')
# add bars
plt.bar(range(X.shape[1]), importances[indices])
# add feature names as x-axis labels
plt.xticks(range(X.shape[1]), names, rotation=90)
plt.show()
7
Q

Explain how to do feature selection with random forests

A

There are situations where we might want to reduce the number of features in our model, e.g. to REDUCE the model variance or to improve model INTERPRETABILITY by including only the most important features. In sklearn we can use a simple multi-step workflow to create a model with a reduced feature set.

First, we train a RF model using ALL features.

Second, we use the above model to identify the important features.

Third, we create a new feature matrix that includes only the selected important features, using sklearn's SelectFromModel transformer to keep only features with an importance greater than or equal to some threshold value.

Last, we create a new model using only those features.

Caveats:
It must be noted there are two caveats to this approach. First, nominal categorical features that have been one-hot encoded will see their feature importance DILUTED across the binary features. Second, the feature importance of two highly correlated features will be effectively assigned to one feature rather than distributed evenly across both.

from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel

iris = datasets.load_iris()
X,y = iris.data,iris.target
# init RF clf
rf = RandomForestClassifier(random_state=0, n_jobs=-1)
# create a selector that keeps features with importance >= the threshold
selector = SelectFromModel(rf, threshold=0.3)
# new feature matrix using selector
X_important = selector.fit_transform(X,y)
# train random forest using most important features
model = rf.fit(X_important,y)
# new data must have the same (reduced) number of features as X_important
Xnew = [[1, 2], [4, 5], [.4, .5]]
# predict on the new data
model.predict(Xnew)  # array([2,2,0])
8
Q

List the steps in the Random Forest algorithm

A

DO the following, say, 100 times:

  1. Create a bootstrapped data set by sampling rows with replacement (typically about 1/3 of the original rows do NOT end up in the resampled data set).
  2. Create a decision tree with the bootstrapped data, but ONLY USE a RANDOM SUBSET of features (columns) at EACH split.

Say we select max_features = 2; then for each decision tree split, we allow Gini impurity comparisons between only 2 RANDOMLY selected CANDIDATE features to SPLIT a node.

After 100 iterations of steps 1-2, we have 100 trees with a VARIETY of OUTCOMES.

The VARIETY is what makes RF MORE EFFECTIVE than individual decision trees.

After running steps 1 and 2 an arbitrary number of times, the final step is AGGREGATING the individual predictions from each decision tree to make an ensemble prediction; bootstrap sampling plus this AGGREGATION is what is called BAGGING.
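
To make the aggregation step concrete, here is a minimal sketch (assuming scikit-learn; it inspects the fitted forest's estimators_ attribute to collect each tree's individual vote):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# each fitted tree in the forest votes on the first sample
sample = X[:1]
votes = np.array([tree.predict(sample)[0] for tree in rf.estimators_]).astype(int)

# the bagged prediction is the majority vote across all trees
# (note: sklearn's RandomForestClassifier actually averages per-tree class
#  probabilities, which usually agrees with the hard majority vote shown here)
print("majority vote:", np.bincount(votes).argmax())
print("rf.predict:  ", rf.predict(sample)[0])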

9
Q

What is an “out of bag” data set?

A

In the RF data bootstrap process, about 1/3 of the data does not end up in the resampled data set. This left-out data is called the "OUT OF BAG" data set.

A more accurate/descriptive name for this could have been "out of BOOT", since we are talking about the step 1 bootstrap data resampling, not the aggregation of randomly generated decision trees.

Since the out-of-bag data was NOT used to train the individual random decision trees, we can run each out-of-bag row through the trees that never saw it to validate the RF ensemble.

The proportion of out-of-bag samples that are misclassified is called the OUT-OF-BAG ERROR.
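
A minimal sketch of both ideas (assuming scikit-learn and NumPy; the dataset is just a placeholder): first checking how much of the data one bootstrap sample leaves out, then using oob_score=True to get the built-in out-of-bag estimate:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n = len(X)

# fraction of rows left out of one bootstrap sample (expected ~1 - 1/e, about 1/3)
rng = np.random.default_rng(0)
in_bag = np.unique(rng.integers(0, n, size=n))
print("out-of-bag fraction:", 1 - len(in_bag) / n)

# sklearn can compute OOB accuracy for the whole forest automatically
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("OOB error:   ", 1 - rf.oob_score_)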
