4.2 Ensemble Methods Flashcards

1
Q

What are the two main ensemble methods discussed in the study manual, and how do they differ in their approach to combining multiple models?

A
  • Random forest: create bootstrap samples and fit a tree to each bootstrap sample. Aims to decrease variance by averaging the results of every tree (for regression trees) or taking the majority class (for classification trees).
  • Boosting: combine multiple trees by fitting each tree to the information not captured by the previous trees and aggregating the results of these sequentially grown trees. Aims to reduce the bias of a single tree.

The two main ensemble methods discussed are random forests and boosting. Random forests aggregate predictions from multiple independently built decision trees, while boosting grows trees sequentially, using information from previous trees to minimize a loss function.

2
Q

Describe the process of creating a random forest model, including the key steps involved in building and aggregating the individual decision trees.

A
  1. Create b bootstrap samples from the original training sample.
  2. Fit a decision tree to each bootstrap sample, without pruning, considering only a random subset of features at each split.
  3. Average the results of every tree (for regression trees) or take the majority class (for classification trees) to generate the model's predictions.

The process of creating a random forest model involves several steps. First, bootstrap samples are created from the original training dataset. Then, a decision tree is fitted to each bootstrapped dataset. Trees are allowed to grow large without pruning, and for each split, a random subset of features is considered. Finally, predictions are aggregated across all trees—averaged for regression problems and by majority vote for classification problems.
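
In R this procedure is typically carried out with the randomForest package. Below is a minimal sketch, assuming hypothetical train_data and test_data data frames with a column named target:

  library(randomForest)

  set.seed(42)
  rf_model <- randomForest(
    target ~ .,          # predict target from all other columns
    data = train_data,   # original training sample (bootstrapping is done internally)
    ntree = 500          # b: number of bootstrap samples / unpruned trees
  )

  # Predictions are aggregated across the trees automatically:
  # averaged for regression, majority vote for classification.
  pred <- predict(rf_model, newdata = test_data)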

3
Q

What are some of the important hyperparameters to consider when tuning a random forest model, and how do they affect the model’s performance?

A
  • Number of trees (b): More trees generally do not cause overfitting, but increase computation time.
  • Number of features considered at each split (k): Helps reduce correlation between trees. For regression, the default number of features considered at each split is one-third of the total predictors, while for classification, it is the square root of the number of predictors.
  • Minimum number of observations in terminal nodes: Controls individual tree size.
  • Maximum depth of each tree: Limits tree complexity.
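
In the randomForest package these hyperparameters map roughly to the arguments sketched below (train_data and target are hypothetical; note that randomForest caps tree size via maxnodes rather than an explicit depth argument):

  library(randomForest)

  rf_tuned <- randomForest(
    target ~ .,
    data = train_data,
    ntree    = 1000,                              # b: number of trees
    mtry     = floor((ncol(train_data) - 1) / 3), # k: features tried per split (regression default p/3)
    nodesize = 5,                                 # minimum observations in a terminal node
    maxnodes = 32                                 # caps the number of terminal nodes, limiting complexity
  )
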
4
Q

Explain the boosting algorithm, highlighting the key steps and how the model is updated iteratively.

A

The boosting algorithm begins by initializing predictions to zero. At each of a specified number of iterations, the negative gradient of the loss function is computed (for squared-error loss in regression, this is simply the current residuals), and a new tree is fitted to predict it. The overall prediction is updated by adding a shrunken version of the new tree, and the residuals are updated accordingly. The final boosted model is the sum of all the shrunken trees.
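
A minimal sketch of this procedure with the gbm package, assuming a hypothetical train_data data frame with a numeric target column:

  library(gbm)

  set.seed(42)
  boost_model <- gbm(
    target ~ .,
    data = train_data,
    distribution = "gaussian",   # squared-error loss, so each gradient step fits the residuals
    n.trees = 1000,              # number of sequential iterations/trees
    interaction.depth = 3,       # maximum depth of each tree
    shrinkage = 0.01             # each new tree is added in shrunken form
  )

  # The boosted prediction is the sum of all the shrunken trees.
  pred <- predict(boost_model, newdata = train_data, n.trees = 1000)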

5
Q

Compare and contrast the hyperparameters used in the gbm and xgboost packages for boosting in R. What are some similarities and differences between the two implementations?

A

Hyperparameter | gbm | xgboost
Number of trees | n.trees | nrounds
Maximum depth of each tree | interaction.depth | max_depth
Shrinkage parameter | shrinkage | eta
Portion of observations used for each tree | bag.fraction | subsample
Portion of predictors used for each tree | - | colsample_bytree
Minimum number of observations in a terminal node | n.minobsinnode | -
Minimum reduction of the splitting measure | - | gamma
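
A rough side-by-side sketch of how these arguments appear in each package (train_data and target are hypothetical; xgboost expects a numeric predictor matrix rather than a formula):

  library(gbm)
  library(xgboost)

  # gbm: formula interface on a data frame
  gbm_fit <- gbm(
    target ~ ., data = train_data, distribution = "gaussian",
    n.trees = 500, interaction.depth = 4, shrinkage = 0.05,
    bag.fraction = 0.8, n.minobsinnode = 10
  )

  # xgboost: matrix of predictors plus a label vector
  x <- as.matrix(train_data[, setdiff(names(train_data), "target")])
  xgb_fit <- xgboost(
    data = x, label = train_data$target, objective = "reg:squarederror",
    nrounds = 500, max_depth = 4, eta = 0.05,
    subsample = 0.8, colsample_bytree = 0.7, gamma = 0
  )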

6
Q

How do feature importance measures help in interpreting ensemble models like random forests and boosting? Describe the two main approaches for calculating feature importance.

A

Feature importance measures help identify the most relevant predictors in an ensemble model. Two main approaches include evaluating changes in model performance (such as decreases in SSE or the Gini index) and measuring changes in accuracy after permuting predictor values. For random forest regression, feature importance can be measured by the mean decrease in the error sum of squares and the mean decrease in accuracy on out-of-bag (OOB) samples. For classification, it is measured by the mean decrease in the Gini index and the mean decrease in accuracy on OOB samples.
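
A brief sketch of extracting these measures in R, assuming a hypothetical random forest rf_model fitted with importance = TRUE and a hypothetical gbm model boost_model:

  library(randomForest)

  importance(rf_model)   # mean decrease in accuracy (OOB) and in SSE/Gini for each predictor
  varImpPlot(rf_model)   # plots both importance measures

  # For a boosted gbm model, summary() reports each predictor's relative
  # influence based on reductions in the loss function:
  # summary(boost_model)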

7
Q

What is the purpose of partial dependence plots, and how are they constructed? What are some limitations of using partial dependence plots for model interpretation?

A

Partial dependence plots are used to visualize the marginal effect of a selected variable on the response, after integrating out other variables. The construction process involves:

  1. Select a predictor.
  2. Identify all possible values of the predictor in the training dataset.
  3. Modify the training dataset by setting all rows of the predictor to one of its possible values.
  4. Use the trained model to predict the target variable on the modified dataset.
  5. Average these predictions and record the corresponding predictor value used in step 3.
  6. Repeat steps 3-5 for all other possible values of the predictor.
  7. Plot the average predictions against the predictor values recorded during step 5.

Partial dependence plots have limitations, as they provide summaries rather than exact relationships, may mask important interactions, and some predictions may be unrealistic.
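
A manual sketch of steps 1-7 for a single hypothetical predictor x1, a hypothetical fitted model model, and training data train_data:

  x_values <- sort(unique(train_data$x1))           # step 2: all observed values of the predictor
  avg_pred <- sapply(x_values, function(v) {
    modified <- train_data
    modified$x1 <- v                                # step 3: set every row of x1 to one value
    mean(predict(model, newdata = modified))        # steps 4-5: predict and average
  })
  plot(x_values, avg_pred, type = "l",
       xlab = "x1", ylab = "Average prediction")    # step 7: the partial dependence plot

The pdp package automates the same computation through its partial() function.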

8
Q

Describe the techniques of oversampling and undersampling in the context of addressing imbalanced classes in classification problems. When might you choose one technique over the other?

A

Oversampling and undersampling are techniques used to balance class distributions.

  • Oversampling: consists of duplicating observations from the minority class until both classes have roughly the same number of observations.
  • Undersampling: consists of removing observations from the majority class until both classes have roughly the same number of observations.

The choice between these techniques depends on the size of the data—oversampling is more common for smaller datasets, while undersampling is used for very large datasets. These techniques should only be applied to the training set, not the entire dataset or test set.
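
A base-R sketch of both techniques on a hypothetical training set whose binary target takes values 0 (majority class) and 1 (minority class):

  minority <- train_data[train_data$target == 1, ]
  majority <- train_data[train_data$target == 0, ]

  # Oversampling: duplicate minority rows by sampling with replacement
  train_over <- rbind(majority,
                      minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

  # Undersampling: keep only a random subset of majority rows
  train_under <- rbind(majority[sample(nrow(majority), nrow(minority)), ],
                       minority)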

9
Q

How can the train() function be used to optimize hyperparameters for random forest and boosted models? Provide an example of the key arguments and the interpretation of the cross-validation results.

A

In R, the train() function from the caret package is used for hyperparameter optimization, with cross-validation used to evaluate performance metrics such as RMSE or AUC across different hyperparameter values. For example, when building a random forest model, we can supply a grid of candidate values for the mtry parameter and select the value with the best cross-validated performance.
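
A minimal caret sketch tuning mtry with 5-fold cross-validation (train_data and target are hypothetical):

  library(caret)

  set.seed(42)
  rf_cv <- train(
    target ~ .,
    data = train_data,
    method = "rf",                                # random forest via the randomForest package
    trControl = trainControl(method = "cv", number = 5),
    tuneGrid = expand.grid(mtry = c(2, 4, 6))     # candidate values of mtry
  )

  rf_cv           # cross-validated RMSE (regression) or accuracy (classification) for each mtry
  rf_cv$bestTune  # the mtry value with the best cross-validated performance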

10
Q

Summarize the key advantages and disadvantages of random forests and boosting methods, considering factors such as predictive power, interpretability, and robustness to overfitting.

A

The advantages and disadvantages of random forests and boosting are as follows.

  • Random forests reduce variance, are robust, and handle various data types, but they are less interpretable than single decision trees.
  • Boosting often results in more predictive models and handles different data types and missing values well, but it tends to overfit more than random forests and is also less interpretable.