8 Trees Flashcards

1
Q

Define a decision tree in one sentence.

A

A non‑parametric model that recursively partitions the feature space into axis‑aligned regions and fits a constant prediction in each region.
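A minimal Python sketch of the idea, using scikit-learn on synthetic data (all settings illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))          # one feature
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)  # noisy target

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)  # axis-aligned splits, depth 2
print(tree.predict([[2.0], [2.1]]))  # points falling in the same leaf get the same constant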

2
Q

What two stopping criteria are most common when growing a tree?

A

Minimum number of observations in a node and maximum tree depth (or minimum impurity decrease).
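These criteria map directly onto scikit-learn arguments; a sketch with illustrative values:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # cap the tree depth
    min_samples_leaf=20,         # require at least 20 observations per leaf
    min_impurity_decrease=1e-3,  # only split if impurity drops by at least this much
)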

3
Q

Why are decision trees considered high‑variance models?

A

Small changes in the data can drastically alter top‑level splits, and those errors propagate down the tree.
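A quick illustration: refitting the same tree on resampled versions of the data can change the root split (synthetic data, illustrative only):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
for seed in range(3):
    idx = np.random.default_rng(seed).integers(0, len(X), size=len(X))  # perturb the data by resampling
    t = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    print(t.tree_.feature[0], t.tree_.threshold[0])  # feature and threshold chosen at the root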

4
Q

Bagging: list the three core steps.

A

1) Draw B bootstrap samples, 2) train one full tree per sample, 3) average (regression) or vote (classification) predictions.
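The three steps written out as a small Python sketch (regression case; B and the base learner are illustrative, inputs assumed to be NumPy arrays):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                           # 1) bootstrap sample (with replacement)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # 2) one full (unpruned) tree per sample
    return trees

def bagged_predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)          # 3) average the predictions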

5
Q

Roughly what percentage of the original observations is expected to be left out of a given bootstrap sample (i.e. the OOB set)?

A

About 37%, since P(not selected in n draws) = (1 − 1/n)^n → e^{−1} ≈ 0.368 as n grows.
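The figure is easy to check by simulation (n and the number of repeats are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 200
left_out = [1 - len(np.unique(rng.integers(0, n, size=n))) / n  # fraction never drawn in one bootstrap
            for _ in range(reps)]
print(np.mean(left_out))  # ≈ 0.368, i.e. about e**-1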

6
Q

State the main purpose of Out‑of‑Bag (OOB) error.

A

Provides an internal, cross‑validation–like estimate of test error without a separate validation set.
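In scikit-learn this is a single flag; a sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)  # OOB accuracy: an internal estimate of test-set accuracy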

7
Q

Random Forest: key extra randomisation beyond bagging?

A

At each split, consider only a random subset of features to choose the best split.
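In scikit-learn this randomisation is controlled by max_features; a sketch (values illustrative):

from sklearn.ensemble import RandomForestClassifier

bagging_like = RandomForestClassifier(max_features=None)     # every feature considered at each split (plain bagging)
random_forest = RandomForestClassifier(max_features="sqrt")  # random subset of ≈ sqrt(p) features per split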

8
Q

Effect of reducing mtry (the feature-subset size) in a Random Forest?

A

Lower correlation between trees → greater variance reduction but slightly higher bias.

9
Q

Why does boosting fit each new tree to residuals/gradients?

A

To correct the mistakes of the current ensemble, moving predictions in the direction of steepest loss decrease.
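A bare-bones sketch for squared loss, where the negative gradient is simply the residual (M, learning rate, and depth are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, lr=0.1, depth=3):
    F = np.full(len(y), y.mean())  # start from a constant prediction
    trees = []
    for _ in range(M):
        residual = y - F                                             # negative gradient of squared loss
        t = DecisionTreeRegressor(max_depth=depth).fit(X, residual)  # fit the next tree to the residuals
        F += lr * t.predict(X)                                       # move predictions toward lower loss
        trees.append(t)
    return y.mean(), trees, lr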

10
Q

Give the additive model form produced by gradient boosting.

A

ŷ(x)=∑_{m=1}^M α_m f_m(x), where each f_m is a weak tree and α_m a shrinkage weight.
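Prediction under this additive form, reusing the gradient_boost sketch above (there α_m is the constant learning rate lr):

import numpy as np

def boosted_predict(F0, trees, lr, X):
    # ŷ(x) = F0 + Σ_m lr · f_m(x)
    return F0 + lr * np.sum([t.predict(X) for t in trees], axis=0)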

11
Q

What is the role of the ‘learning rate’ (shrinkage) in boosting?

A

Scales each tree’s contribution; smaller rates require more trees but improve generalisation by reducing over‑fitting.
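The usual trade-off in scikit-learn terms (settings are illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingRegressor

aggressive = GradientBoostingRegressor(learning_rate=0.3, n_estimators=100)
shrunk = GradientBoostingRegressor(learning_rate=0.03, n_estimators=1000)  # slower, often generalises better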

12
Q

Contrast AdaBoost vs XGBoost in one sentence.

A

AdaBoost re‑weights observations to minimise exponential loss, whereas XGBoost fits trees to first/second‑order gradients of a chosen loss with regularisation and subsampling.
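Side by side in Python (xgboost is a third-party package; settings illustrative):

from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

ada = AdaBoostClassifier(n_estimators=200)            # re-weights observations; exponential loss
xgb = XGBClassifier(n_estimators=200, max_depth=6,    # first/second-order gradients of the chosen loss
                    learning_rate=0.1, subsample=0.8, # row subsampling
                    reg_lambda=1.0)                   # L2 regularisation on leaf weights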

13
Q

True or False: XGBoost always uses depth‑1 stumps as weak learners.

A

False – it typically uses trees of depth 3–8 (XGBoost's default max_depth is 6); the depth is tuneable.

14
Q

List two advantages of Random Forests over a single deep tree.

A

Lower variance and built‑in OOB error estimate (also handles many features robustly).

15
Q

When would you prefer boosting over Random Forests?

A

When the dataset is small-to‑medium, complex patterns exist, and you can afford careful hyper‑parameter tuning for maximal accuracy.

16
Q

What global explainability tool is built into Random Forests?

A

Feature‑importance scores based on total impurity reduction.
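Exposed directly on a fitted forest in scikit-learn (synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)  # mean impurity reduction per feature, normalised to sum to 1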

17
Q

What does TreeSHAP guarantee about the sum of feature contributions for an instance?

A

They add up exactly to the difference between the instance’s prediction and the model’s overall expected prediction (local additivity).
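A quick check of local additivity with the shap package (third-party; the regression forest and data below are synthetic):

import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X)  # one contribution per feature per instance
# contributions + expected prediction should recover each model output exactly
print(np.allclose(phi.sum(axis=1) + explainer.expected_value, model.predict(X)))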

18
Q

Give one disadvantage of AdaBoost on noisy datasets.

A

The exponential loss heavily up‑weights mislabeled/noisy observations, leading to overfitting.

19
Q

Name two hyper‑parameters that regularise XGBoost.

A

Learning rate (η) and the L1/L2 penalties on leaf weights (α and λ in the XGBoost objective).
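As exposed in the xgboost scikit-learn wrapper (values illustrative):

from xgboost import XGBRegressor

model = XGBRegressor(
    learning_rate=0.05,  # eta: shrinks each tree's contribution
    reg_alpha=0.1,       # L1 penalty on leaf weights
    reg_lambda=1.0,      # L2 penalty on leaf weights
)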

20
Q

Quick rule: how many features (mtry) are considered per split for classification RF by default?

A

Approximately √p (commonly ⌊√p⌋), where p is the total number of features.
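In scikit-learn this corresponds to the max_features="sqrt" setting:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_features="sqrt")  # ≈ sqrt(p) features considered at each split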