Jupyter Notebook 2.4 - RandomForest and Ensembling Flashcards

1
Q

What are random forests, and how do they improve upon individual decision trees?

A

Random forests are an ensemble learning method that combines multiple decision trees to enhance prediction performance and reduce overfitting. Key points include:

Mitigating Sensitivity: Individual decision trees are sensitive to small changes in training data and prone to overfitting. Random forests address these issues by aggregating the predictions of multiple trees, making the model more robust.

Higher Performance: The ensemble of decision trees typically results in higher accuracy and generalization compared to a single tree. This improvement stems from the “wisdom of the crowd” concept, where the collective predictions of many models can outperform those of individual experts.

Bagging Technique: Bagging involves creating multiple bootstrap samples (random samples with replacement) from the training dataset. Each decision tree in the random forest is trained on one of these samples, introducing diversity among the trees.

Ensembling Techniques: Random forests are one type of ensemble method, built on bagging. Other ensembling techniques include voting, boosting, and stacking, each with its own approach to combining model predictions.

By leveraging the strengths of multiple decision trees, random forests provide a powerful and effective solution for both classification and regression tasks.
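
A minimal sketch of the difference in practice, assuming scikit-learn is available (the breast-cancer dataset and the hyperparameter values are illustrative choices, not taken from the notebook):

    # Compare a single decision tree to a random forest on held-out data.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # The ensemble usually generalizes better than the single, fully grown tree.
    print("single tree test accuracy:", tree.score(X_test, y_test))
    print("random forest test accuracy:", forest.score(X_test, y_test))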

2
Q

How are random forests constructed to overcome the limitations of single decision trees?

A

Random forests are built to mitigate the sensitivity and overfitting associated with individual decision trees through the following steps:

Multiple Trees: A random forest typically consists of many decision trees (e.g., 500 trees) trained on different subsets of the training data.

Bootstrap Sampling (Bagging): Each tree is trained using a bootstrap sample of the data, which involves randomly selecting data points with replacement until reaching the desired number of samples (n_samples). This means some data points may appear multiple times in a single tree’s training set, while others may not be included at all.

Random Feature Selection: During the construction of each decision tree, only a random subset of features is considered at each split. This is controlled by the parameter max_features, which limits the number of features used to determine the best split at each node. This randomness in feature selection helps ensure that the trees are diverse.

Averaging Predictions: After training, the predictions of all the trees are averaged (for regression) or voted on (for classification) to produce the final prediction. This aggregation reduces the overall sensitivity and overfitting of the model.
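
A hedged sketch of this recipe written out by hand with NumPy and plain decision trees (the toy data and the choice of max_features="sqrt" are illustrative; scikit-learn's RandomForestRegressor performs all of these steps internally):

    # Bagging + random feature selection + averaging, made explicit.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                      # toy data with 5 features
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

    n_samples = X.shape[0]
    trees = []
    for _ in range(500):                               # e.g. 500 trees
        # Bootstrap sample: n_samples points drawn with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        # max_features limits the features considered at each split.
        tree = DecisionTreeRegressor(max_features="sqrt")
        tree.fit(X[idx], y[idx])
        trees.append(tree)

    # Averaging the per-tree predictions gives the forest's prediction.
    forest_prediction = np.mean([t.predict(X) for t in trees], axis=0)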

3
Q

How are predictions made in random forests for regression tasks?

A

Each decision tree in the random forest makes a prediction based on the input data.
The final prediction for the random forest is obtained by averaging the predictions from all the trees. This approach helps reduce variance and improves accuracy.
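
A minimal sketch verifying this, assuming scikit-learn (the diabetes dataset is just a convenient example): the forest's regression output equals the mean of its trees' individual predictions.

    # The forest's prediction is the average of its trees' predictions.
    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True)
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
    manual_average = per_tree.mean(axis=0)

    # Should print True (up to floating-point error).
    print(np.allclose(manual_average, forest.predict(X)))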

4
Q

How are predictions made in random forests for classification tasks?

A

Each tree provides a predicted probability for each class, rather than just a hard class label.
A soft voting strategy is employed to combine the predictions:
The average of the predicted probabilities from all trees is calculated.
The class with the highest average probability is selected as the final prediction.
This method allows the model to account for the confidence level of each prediction, enhancing the decision-making process.
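
A minimal sketch of the soft-voting step, assuming scikit-learn (the iris dataset is an illustrative choice):

    # Soft voting: average the per-tree class probabilities, then take
    # the class with the highest average probability.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    per_tree_proba = np.array([t.predict_proba(X) for t in forest.estimators_])
    mean_proba = per_tree_proba.mean(axis=0)
    soft_vote = forest.classes_[mean_proba.argmax(axis=1)]

    # Should match the forest's own predictions.
    print(np.array_equal(soft_vote, forest.predict(X)))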

5
Q

What are the important parameters to consider when using random forests in practice?

A

The regularization parameters of the individual decision trees discussed in the previous notebook: max_depth (arguably the most important one, and often the only regularization parameter set to a non-default value), max_features, min_samples_split, min_samples_leaf, max_leaf_nodes, min_impurity_decrease, and min_weight_fraction_leaf.
n_estimators: the number of decision trees in the forest. Increasing it increases the model's expressiveness.
n_jobs: set to -1 to use all available CPU cores, which is useful when training random forests on large datasets.
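
A hedged sketch of these parameters in a scikit-learn call (the specific values are placeholders for illustration, not recommendations from the notebook):

    # Illustrative settings only; tune them on your own data.
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=500,      # number of trees in the forest
        max_depth=10,          # main per-tree regularization knob
        max_features="sqrt",   # features considered at each split
        min_samples_leaf=2,    # further per-tree regularization
        n_jobs=-1,             # use all available CPU cores
        random_state=0,
    )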

6
Q

What is boosting in machine learning?

A

Boosting is an ensemble technique that adds new models sequentially to improve predictions. Each new model is trained on the errors made by the existing ensemble, learning from its mistakes. Boosting typically combines weak learners, such as shallow decision trees, into a single strong learner.
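
A minimal sketch of this sequential behaviour, assuming scikit-learn (the dataset and settings are illustrative); staged_predict shows how the ensemble's predictions improve as weak learners are added:

    # A boosted ensemble of shallow trees, added one at a time.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    booster = GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=0)
    booster.fit(X_train, y_train)

    # Test accuracy after 1, 10, and 50 sequentially added weak learners.
    staged = list(booster.staged_predict(X_test))
    for k in (1, 10, 50):
        print(k, accuracy_score(y_test, staged[k - 1]))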

7
Q

What are two main boosting techniques?

A
  1. AdaBoost: Each additional tree focuses on the instances misclassified by the previous trees.
  2. Gradient Boosting: Each new tree predicts the residual error of the current ensemble, correcting the errors in its predictions (sketched below).
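
A short sketch of the gradient-boosting idea from item 2, written out with NumPy and plain regression trees (the toy data, learning rate, and tree depth are illustrative assumptions):

    # Each new tree is fit to the residual errors of the current ensemble
    # (squared-error setting), and its prediction is added with a small step.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

    learning_rate = 0.1
    prediction = np.zeros_like(y)                       # start from zero
    trees = []
    for _ in range(100):
        residual = y - prediction                       # current errors
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        prediction += learning_rate * tree.predict(X)   # correct the ensemble
        trees.append(tree)
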
8
Q

How does AdaBoost work to improve model accuracy?

A

AdaBoost assigns initial equal weights to each training instance. It trains a weak learner (typically a decision stump) on this data and calculates its error rate to determine its weight. Misclassified instances receive increased weights based on the learner’s performance. This process is repeated for a specified number of learners, each focusing on the difficult cases identified by previous models. Predictions are made by aggregating the weighted predictions of all learners, improving overall accuracy.
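
A minimal usage sketch, assuming scikit-learn 1.2 or later (where the weak learner is passed as estimator); the dataset and the number of learners are illustrative:

    # AdaBoost with depth-1 trees ("decision stumps") as the weak learners.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    stump = DecisionTreeClassifier(max_depth=1)
    ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)
    ada.fit(X_train, y_train)

    print("AdaBoost test accuracy:", ada.score(X_test, y_test))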
