Decision Trees Flashcards

1
Q

What do decision trees do?

A

Trees recursively split the feature space into (hyper)-rectangles and fit a constant function to each (hyper)-rectangle
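
A minimal sketch of this piecewise-constant behaviour, using scikit-learn's DecisionTreeRegressor on made-up 1-D data (the data and parameter choices are illustrative assumptions, not part of the card):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data: y = sin(x) plus noise (illustrative assumption)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# A shallow tree splits the x-axis into a few intervals (rectangles in 1-D)
# and predicts the mean of y within each interval: a piecewise-constant fit.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

grid = np.linspace(0, 6, 50).reshape(-1, 1)
print(np.unique(tree.predict(grid)))  # at most 2**3 = 8 distinct constants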

2
Q

What are considered leaves and roots of decision trees?

A

Leaves are the terminal nodes, which hold the constant predictions; the root is the top node, where the first split of the feature space is made.

3
Q

What are some cons of decision trees?

A

Trees have high variance: predictions vary a lot from fit to fit, so they are considered unstable.

Trees tend to overfit the data

Trees lack smoothness for regression

4
Q

What are some pros of decision trees?

A

Trees are highly flexible!

Trees are non-parametric!

Trees are invariant to scale and can handle categorical predictors naturally!

Trees are INTERPRETABLE and easy to describe to managers, marketers, etc.

5
Q

How can we reduce variance in decision trees?

A

Bagging, Boosting or Random Forests

6
Q

What is Bagging? Explain how it works.

A

Short for Bootstrap Aggregation.

Steps:
1. Bootstrap B datasets
2. Fit a deep tree to each dataset (there will be high variance)
3. Combine the trees and find the average of B predictions
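
A minimal sketch of these three steps with scikit-learn's BaggingRegressor (the estimator, data, and parameter values are illustrative assumptions; parameter names follow recent scikit-learn versions):

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Steps 1 + 2: bootstrap B = 100 datasets and fit a deep (unpruned) tree to each.
# Step 3: BaggingRegressor averages the 100 per-tree predictions.
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # deep tree: no max_depth, high variance
    n_estimators=100,                   # B
    bootstrap=True,
    random_state=0,
).fit(X, y)

print(bag.predict(X[:3]))  # averaged predictions for three observations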

7
Q

How do you get the output of bagging?

A

Take a new observation, get the predicted output from each tree, and average the predictions across all B trees.
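
In symbols, with \hat{f}^{*b} the tree fit to the b-th bootstrap sample:

\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)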

8
Q

How does bagging reduce variance?

A

When you average B independent random variables that each have variance sigma^2, the variance of the average is sigma^2/B.
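
A one-line derivation, assuming the B tree predictions are independent with common variance \sigma^2:

\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)\right) = \frac{1}{B^2}\sum_{b=1}^{B}\operatorname{Var}\!\left(\hat{f}^{*b}(x)\right) = \frac{\sigma^2}{B}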

9
Q

Why does the bias remain unchanged after bagging?

A

Bagging works by averaging multiple decision trees trained on different subsets of the data. This averaging process reduces the variance because the individual trees’ errors tend to cancel each other out. However, since each decision tree has the same inherent bias, averaging them does not reduce the bias.
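
In symbols, because expectation is linear and each tree has the same expectation:

\mathbb{E}\!\left[\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)\right] = \frac{1}{B}\sum_{b=1}^{B}\mathbb{E}\!\left[\hat{f}^{*b}(x)\right] = \mathbb{E}\!\left[\hat{f}^{*1}(x)\right]

so the bias of the bagged estimate equals the bias of a single tree.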

10
Q

How does bagging work for classification?

A

Each tree in the bag votes for a class.

The result is a vector [p1 p2 … pK], one entry per class, where each pk is the proportion of the B trees voting for that class; the predicted class is the one with the largest proportion.
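
A tiny illustration of the vote counting (the class labels and votes are made up for the example):

import numpy as np

# Hypothetical votes from B = 10 trees over K = 3 classes (0, 1, 2)
votes = np.array([0, 2, 1, 1, 1, 0, 1, 2, 1, 1])

# Proportion of the B trees voting for each class: the vector [p1, ..., pK]
proportions = np.bincount(votes, minlength=3) / len(votes)
print(proportions)           # [0.2 0.6 0.2]
print(proportions.argmax())  # majority-vote class: 1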

11
Q

How does bagging compare for regression vs classification?

A

For regression, bagging reduces MSE.

For classification this is not always the case: bagging a good classifier can make it better, but bagging a bad classifier can make it worse.

12
Q

What is one major concern if we do bagging with a lot of features?

A

Potentially only a handful of the features will be relevant and trees will be highly correlated.

Highly correlated trees mean that the variance does not get reduced as much. See below.

Correlation Between Models: Bagging works best when the individual models are uncorrelated. If the models are highly correlated, their errors will be similar, and averaging them won’t significantly reduce variance. The goal is to have diverse models whose errors can cancel each other out.

Reducing Correlation: Bagging reduces correlation by training each model on different subsets of the data. This helps ensure that the models make different errors, which improves the overall performance of the ensemble.
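
The standard variance formula makes this precise: for B identically distributed trees with variance \sigma^2 and positive pairwise correlation \rho,

\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

The second term vanishes as B grows, but the floor \rho\sigma^2 remains, so highly correlated trees limit the benefit of averaging.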

13
Q

What is a Random Forest?

A

A Random Forest is an ensemble method that reduces the variance of decision trees by decorrelating them.

Steps:
1. Bootstrap B datasets
2. For each tree, at each split consider only a random subset of the features
3. Take averages of the outputs as before

This helps to reduce the correlation between trees and therefore better deal with variance.
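
A minimal sketch with scikit-learn's RandomForestRegressor, where max_features controls the random subset of features tried at each split (the data and parameter values are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=200,     # B bootstrapped trees
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
).fit(X, y)

print(rf.predict(X[:3]))  # average of the 200 per-tree predictions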

14
Q

Why do experts prefer random forests?

A

They are easily parallelized

Low bias, and lower variance than a single tree or plain bagging

We can examine how important each feature was to the predictions

15
Q

How can we validate a random forest model?

A

Not every observation (x,y) appears in each bootstrapped dataset.

If we use the trees that did not see (x,y) to make predictions on x, we can essentially validate how well our model is performing.
Call this the “Out of Bag” (OOB) error.

An OOB error estimate is almost identical to that obtained by N-fold cross-validation. Unlike many other nonlinear estimators, random forests can be fit in one sequence, with cross-validation being performed along the way.

Once the OOB error stabilizes, the training can be terminated.
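
A minimal sketch of reading the OOB error in scikit-learn, where oob_score=True asks the forest to score each observation only with the trees that never saw it (the data are an illustrative assumption):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,
    oob_score=True,   # compute the out-of-bag estimate during fitting
    random_state=0,
).fit(X, y)

print(rf.oob_score_)  # OOB accuracy; 1 - oob_score_ is the OOB error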

16
Q

What is Adaptive Boosting?

A

Adaptive Boosting, commonly known as AdaBoost, is a powerful ensemble learning technique that combines multiple weak learners to create a strong classifier. Here’s how it works:

Initialization: AdaBoost starts by assigning equal weights to all training samples. These weights determine the importance of each sample in the training process.

Training Weak Learners: AdaBoost iteratively trains a series of weak learners (often decision stumps, which are simple decision trees with one split). In each iteration, it focuses on the samples that were misclassified by the previous weak learner.

Updating Weights: After each weak learner is trained, AdaBoost updates the weights of the training samples. Misclassified samples receive higher weights, making them more important in the next iteration. Correctly classified samples receive lower weights.

Combining Learners: Each weak learner is assigned a weight based on its accuracy. Learners with lower error rates receive higher weights. The final model is a weighted sum of all the weak learners.

Final Prediction: The ensemble makes predictions by taking a weighted vote of the predictions from all the weak learners. The weights ensure that more accurate learners have a greater influence on the final prediction.

The key idea behind AdaBoost is to focus on the samples that are hardest to classify, gradually improving the model’s performance by correcting errors made by previous learners.
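
A minimal sketch with scikit-learn's AdaBoostClassifier using decision stumps as the weak learners (the data and parameter values are illustrative assumptions; parameter names follow recent scikit-learn versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a stump
    n_estimators=200,    # number of boosting rounds
    learning_rate=0.5,   # shrinks each learner's weight in the final sum
    random_state=0,
).fit(X, y)

print(ada.predict(X[:5]))  # weighted vote of the 200 stumps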

17
Q

What is additive modeling?

A

In additive modeling, the final model is constructed by adding together the outputs of multiple simpler models. Each of these simpler models captures different aspects of the data.
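
In symbols, the generic additive expansion, with basis functions b(x; \gamma_m) and weights \beta_m:

f(x) = \sum_{m=1}^{M} \beta_m\, b(x;\, \gamma_m)

In boosting, each b(x; \gamma_m) is a weak learner (e.g. a small tree) and the terms are added one at a time.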

18
Q

What is a white box model and a black box model? Give examples of both.

A

A white box model exposes the explanation for the patterns it has learned directly in its outputs.
* Decision trees, GLMs in general, etc.

A black box model does not.
* Neural networks, XGB, Random Forest, etc.

19
Q

What is a variable importance plot?

A

These plots show the statistical impact of the variables in the model (as measured by the Gini index). They do not, however, provide any sort of explanation in terms of individual cases.

As trees are non-linear, different cases can be affected differently.
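
A minimal sketch of producing such a plot from a fitted random forest, using the impurity-based feature_importances_ attribute (the data are an illustrative assumption):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Gini-based importances, one value per feature, summing to 1
plt.bar(range(X.shape[1]), rf.feature_importances_)
plt.xlabel("feature index")
plt.ylabel("Gini importance")
plt.show()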

20
Q

What is a Shapley Value?

A

Shapley Values are a well-known measure from cooperative game theory, also used in financial modelling. They use a game-theoretic approach to provide explainability.

The Shapley Value of a variable is built by taking its marginal contribution to each possible subset of the other variables, weighting each contribution by the number of variables in that subset, and summing so that all possible combinations of variables are considered (see the formula below).
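
The standard formula, with F the full set of variables, i the variable of interest, and v(S) the model output obtained from subset S:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left( v(S \cup \{i\}) - v(S) \right)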

21
Q

What is TreeSHAP?

A

As tree-based models calculate subsets of variables directly, we can calculate the Shapley Values over the tree cuts.
* This is MUCH faster: polynomial instead of exponential (O(N^3), with N the number of examples)

22
Q

What are the properties of Shapley Values?

A

* Local additivity: the Shapley Value of a subset of variables is the sum of the Shapley Values of each member of the subset.
* Consistency/monotonicity: if a variable's contribution increases or stays the same regardless of the other variables, its Shapley Value does not decrease.
* Missingness: if a variable's importance is zero for all subsets, its Shapley Value will be zero.