Trees and ensembles Flashcards
What are the differences between classification trees and regression trees in terms of output and node splitting criteria?
Classification trees predict categorical outcomes and choose splits using impurity measures such as the Gini Index or Cross-Entropy, while regression trees predict continuous values and choose splits that minimize the sum of squared errors (RSS) within the resulting child nodes.
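A minimal scikit-learn sketch of the two tree types; the dataset choices and depth are only illustrative:

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical output, impurity-based splits (Gini or entropy).
X_cls, y_cls = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X_cls, y_cls)

# Regression tree: continuous output, splits chosen to minimize squared error.
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X_reg, y_reg)
```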
How is the Gini Index used in classification trees, and why is it preferred over misclassification error for node splitting?
The Gini Index measures impurity by quantifying how mixed the classes are within a node. It is preferred over misclassification error because it is more sensitive to changes in the node's class probabilities: it can favour a split that produces one nearly pure child even when misclassification error rates the candidate splits as equally good, which leads to better splits during tree growth (see the sketch below).
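A small numerical illustration of this point, using a textbook-style two-class node with 400 samples per class; the candidate splits and counts are assumed for illustration:

```python
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def misclass(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.max(p)

def weighted(impurity, nodes):
    n = sum(sum(c) for c in nodes)
    return sum(sum(c) / n * impurity(c) for c in nodes)

# Two candidate splits of a node containing 400 samples of each class.
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]   # produces one perfectly pure child

print(weighted(misclass, split_a), weighted(misclass, split_b))  # 0.25 vs 0.25: tied
print(weighted(gini, split_a), weighted(gini, split_b))          # 0.375 vs ~0.333: Gini prefers split_b
```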
Explain the process of node splitting in regression trees. How is the optimal split chosen?
Node splitting in regression trees is based on minimizing the sum of squared errors within the child nodes: for each candidate feature and threshold, the squared deviations of each child's responses from that child's mean are summed, and the optimal split is the one giving the largest reduction in total SSE relative to the parent node.
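A rough sketch of the split search on a single feature; the synthetic data and exhaustive threshold scan are illustrative, not how library implementations are optimized:

```python
import numpy as np

def best_split(x, y):
    """Scan thresholds on one feature and return the split minimizing total SSE."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best[1]:
            best = ((x[i - 1] + x[i]) / 2, sse)   # midpoint threshold, total SSE
    return best

# Synthetic example: a step function at x = 4 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4, 1.0, 5.0) + rng.normal(0, 0.5, 200)
print(best_split(x, y))   # the chosen threshold should land near 4
```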
What is CART, and how does it help in preventing overfitting in decision trees?
CART (Classification and Regression Trees) controls overfitting mainly through cost-complexity pruning combined with cross-validation: a tree is first grown large and then pruned back using a cost-complexity measure that trades off the tree's training accuracy against its size, with the trade-off parameter chosen by cross-validation.
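As a practical sketch, scikit-learn exposes the cost-complexity pruning path and a ccp_alpha parameter (corresponding to the penalty on tree size); the dataset and cross-validation setup below are just for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a full tree and compute the effective alphas of its cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha with the best cross-validated accuracy; larger alpha -> smaller tree.
scores = [
    (alpha,
     cross_val_score(DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5).mean())
    for alpha in path.ccp_alphas
]
best_alpha = max(scores, key=lambda t: t[1])[0]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```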
How does bootstrap aggregating (bagging) reduce variance in decision tree models, and what are its limitations?
Bagging reduces variance by averaging the predictions of many decision trees, each grown on a different bootstrap sample of the data. Its main limitation is that the trees are correlated: because every tree can use all features, a few strong predictors tend to dominate the top splits of every tree, so the trees make similar errors and averaging removes less variance than it would for independent trees.
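A quick sketch comparing a single deep tree with a bagged ensemble in scikit-learn; the dataset and number of estimators are illustrative and the exact scores will vary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single deep tree (high variance) vs. an average over 100 bootstrap-trained trees.
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

print(cross_val_score(single, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())   # typically higher, thanks to variance reduction
```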
What is Random Forest, and how does it improve upon bagging?
Random Forest improves upon bagging by introducing feature sampling. In addition to using bootstrap samples, it selects a random subset of features at each node to reduce the correlation between trees, leading to better generalization.
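One way to see the difference is through max_features in scikit-learn's RandomForestClassifier, which controls the per-split feature subset; setting it to all features essentially recovers bagged trees. The dataset and settings below are only illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# "sqrt" is the usual classification choice; 1.0 (all features) reduces to bagged trees.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
bagged_like = RandomForestClassifier(n_estimators=200, max_features=1.0, random_state=0)

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(bagged_like, X, y, cv=5).mean())
```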
Explain the concept of feature sampling in Random Forests. How does it help in decorrelating trees?
In Random Forest, a random subset of features is chosen at each node for splitting, which helps reduce the correlation between trees. This decorrelation leads to lower variance in the ensemble’s predictions.
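A rough experiment one could run to see the decorrelation effect: compare the average pairwise correlation of individual tree predictions with and without per-split feature sampling. The synthetic data and forest sizes are assumptions, and the exact numbers will vary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train = X[:400], X[400:], y[:400]

def mean_tree_correlation(max_features):
    forest = RandomForestRegressor(n_estimators=50, max_features=max_features, random_state=0)
    forest.fit(X_train, y_train)
    preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    return corr[np.triu_indices_from(corr, k=1)].mean()   # average off-diagonal correlation

print(mean_tree_correlation(1.0))      # all features at each split: bagging-like, higher correlation
print(mean_tree_correlation("sqrt"))   # random subset at each split: typically lower correlation
```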
How does AdaBoost work, and how does it differ from bagging in terms of weighting samples and classifiers?
AdaBoost reweights the training data after every round, increasing the weights of misclassified samples so that subsequent weak learners focus on the difficult cases, and it weights each classifier in the final vote according to its accuracy. This contrasts with bagging, where every bootstrap sample is drawn with equal probabilities and every tree contributes equally to the final prediction.
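A minimal from-scratch sketch of discrete AdaBoost with decision stumps, assuming labels encoded as -1/+1; the function names and round count are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost with depth-1 trees (stumps); y must be encoded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                               # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # classifier weight: larger when error is small
        w *= np.exp(-alpha * y * pred)                    # misclassified samples get up-weighted
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)                                # weighted vote of all weak learners
```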
Explain the mathematical formulation of Gradient Boosting. How does it sequentially improve the model?
Gradient Boosting builds an additive model F_m(x) = F_{m-1}(x) + ν h_m(x), where each new weak learner h_m (typically a shallow tree) is fit to the negative gradient of the loss function evaluated at the current predictions; for squared-error loss these pseudo-residuals are just the ordinary residuals, so each model corrects the errors left by the models before it (ν is a small learning rate).
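A minimal sketch of gradient boosting with squared-error loss, where each tree is fit to the current residuals; function names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared-error loss: each tree fits the current residuals."""
    f0 = y.mean()                                    # initial constant prediction
    pred = np.full(len(y), f0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                         # negative gradient of 0.5*(y - F)^2 w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)      # shrink each correction by the learning rate
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # Use the same learning rate that was used during fitting.
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```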
Why is overfitting less of a concern in Random Forests compared to single decision trees?
Random Forests overfit less because they average the predictions of many decorrelated trees, which sharply reduces the ensemble's variance even though each individual tree may overfit; the randomness introduced by both bootstrap sampling and per-split feature sampling is what keeps the trees only weakly correlated.
How does pruning help in decision tree models, and what is the cost-complexity criterion used for pruning?
Pruning removes branches of the tree that provide little predictive power, which reduces overfitting. The cost-complexity criterion minimizes R_α(T) = R(T) + α|T|, where R(T) is the subtree's training error and |T| its number of terminal nodes; larger values of α penalize tree size more heavily and yield smaller trees, and α is typically chosen by cross-validation.
Describe how the splitting criterion in classification trees is based on impurity measures such as Gini Index and Cross-Entropy.
The Gini Index is the probability that a randomly chosen sample from the node would be misclassified if it were labelled at random according to the node's class proportions: G = Σ_k p_k(1 − p_k) = 1 − Σ_k p_k². Cross-Entropy (deviance) is H = −Σ_k p_k log p_k. Both are zero for a pure node and largest when the classes are evenly mixed, so minimizing them drives splits toward pure nodes.
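A small sketch computing both impurity measures for a two-class node at various class proportions; the helper names are made up for illustration:

```python
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def cross_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log(p))

for p1 in (0.01, 0.1, 0.25, 0.5):
    node = [p1, 1 - p1]
    print(f"p={node}: gini={gini(node):.3f}, entropy={cross_entropy(node):.3f}")
# Both measures drop toward 0 as the node becomes pure and peak at p = [0.5, 0.5].
```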
What is the role of ‘out-of-bag’ (OOB) samples in Random Forests, and how are they used for model evaluation?
‘Out-of-bag’ (OOB) samples are the observations not included in a bootstrap sample used to train a particular tree in Random Forest. They are used to estimate the model’s error and tune hyperparameters without the need for a separate validation set.
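In scikit-learn this is exposed through the oob_score option; a brief sketch (the dataset choice is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# With oob_score=True, each tree is evaluated on the samples left out of its bootstrap draw,
# giving a built-in generalization estimate without a separate validation split.
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy estimated from out-of-bag predictions
```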
How does the concept of ‘weak learners’ in Boosting contribute to the overall model performance?
In Boosting, weak learners are simple models that perform slightly better than random guessing. Combining these weak learners sequentially improves the overall model performance, as each learner corrects the mistakes of the previous ones.
What is the difference between AdaBoost and Gradient Boosting in terms of updating weights and minimizing loss functions?
AdaBoost updates the weights of the training samples after every round to emphasize the ones that were misclassified, while Gradient Boosting minimizes a differentiable loss function (e.g., squared error) by fitting each new model to the negative gradient of the loss, which for squared error is simply the residuals of the current ensemble; AdaBoost can itself be viewed as Gradient Boosting with an exponential loss.