8 Trees Flashcards
Define a decision tree in one sentence.
A non‑parametric model that recursively partitions the feature space into axis‑aligned regions and fits a constant prediction in each region.
What two stopping criteria are most common when growing a tree?
Minimum number of observations in a node and maximum tree depth (or minimum impurity decrease).
Why are decision trees considered high‑variance models?
Small changes in the data can drastically alter top‑level splits, and those errors propagate down the tree.
Bagging: list the three core steps.
1) Draw B bootstrap samples, 2) train one full tree per sample, 3) average (regression) or vote (classification) predictions.
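A minimal sketch of those three steps, assuming scikit-learn and a synthetic dataset (the sample size and B = 50 trees are illustrative choices, not from the card):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

B, trees = 50, []
for _ in range(B):                                   # 1) draw B bootstrap samples
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # 2) one full tree per sample

votes = np.stack([t.predict(X) for t in trees])      # 3) majority vote (classification)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
```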
What fraction of the original observations is expected to be left out of any bootstrap sample (≃ the OOB set)?
About 37 %, since P(an observation is never selected in n draws) = (1 − 1/n)^n → e^{−1} ≈ 0.368.
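A quick numerical check of that limit using one simulated bootstrap sample (n = 10 000 is an arbitrary illustration size):

```python
import numpy as np

n = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)            # one bootstrap sample of size n
left_out = 1 - len(np.unique(sample)) / n      # fraction of observations never drawn
print(left_out, np.exp(-1))                    # both ≈ 0.368
```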
State the main purpose of Out‑of‑Bag (OOB) error.
Provides an internal, cross‑validation–like estimate of test error without a separate validation set.
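A sketch of that internal estimate with scikit-learn's RandomForestClassifier, which exposes it as oob_score_ when oob_score=True (the data and tree count are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Each observation is scored only by the trees whose bootstrap sample did not contain it
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)   # cross-validation-like accuracy estimate, no held-out set needed
```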
Random Forest: key extra randomisation beyond bagging?
At each split, consider only a random subset of features to choose the best split.
Effect of reducing the mtry (feature subset size) in Random Forest?
Lower correlation between trees → greater variance reduction but slightly higher bias.
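In scikit-learn the subset size is the max_features parameter; a sketch of how one might compare settings by cross-validation (dataset shape and candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=40, n_informative=10, random_state=0)

for m in ("sqrt", 0.5, 1.0):   # smaller subset → less correlated trees, slightly more bias
    rf = RandomForestClassifier(n_estimators=200, max_features=m, random_state=0)
    print(m, cross_val_score(rf, X, y, cv=5).mean())
```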
Why does boosting fit each new tree to residuals/gradients?
To correct the mistakes of the current ensemble, moving predictions in the direction of steepest loss decrease.
Give the additive model form produced by gradient boosting.
ŷ(x)=∑_{m=1}^M α_m f_m(x), where each f_m is a weak tree and α_m a shrinkage weight.
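A hand-rolled sketch of that additive form for squared-error loss, where each f_m is fit to the current residuals and a single shrinkage weight α is reused for every tree (M, α and the tree depth are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

M, alpha = 200, 0.1                   # number of weak trees and shrinkage weight
pred = np.zeros_like(y, dtype=float)
trees = []
for _ in range(M):
    residual = y - pred               # negative gradient of squared-error loss (up to a factor)
    f_m = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(f_m)
    pred += alpha * f_m.predict(X)    # ŷ(x) = Σ_m α_m f_m(x)
```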
What is the role of the ‘learning rate’ (shrinkage) in boosting?
Scales each tree’s contribution; smaller rates require more trees but improve generalisation by reducing over‑fitting.
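The same trade-off in scikit-learn terms, pairing a smaller learning_rate with more estimators (the specific pairs are illustrative, not tuned values):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

# A smaller learning_rate needs more trees to fit equally well,
# but typically generalises better.
for lr, n in ((1.0, 100), (0.1, 1000)):
    gb = GradientBoostingRegressor(learning_rate=lr, n_estimators=n, random_state=0)
    print(lr, n, cross_val_score(gb, X, y, cv=5).mean())
```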
Contrast AdaBoost vs XGBoost in one sentence.
AdaBoost re‑weights observations to minimise exponential loss, whereas XGBoost fits trees to first/second‑order gradients of a chosen loss with regularisation and subsampling.
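The two APIs side by side, assuming scikit-learn plus the xgboost package (hyper-parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier            # assumes the xgboost package is installed

X, y = make_classification(n_samples=500, random_state=0)

# AdaBoost: re-weights observations under exponential loss (stump base learner by default)
ada = AdaBoostClassifier(n_estimators=200).fit(X, y)

# XGBoost: fits trees to first/second-order gradients, with regularisation and subsampling
xgb = XGBClassifier(n_estimators=200, max_depth=6, subsample=0.8, reg_lambda=1.0).fit(X, y)
```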
True or False: XGBoost always uses depth‑1 stumps as weak learners.
False – it typically uses deeper trees (depth 3–8 is common; max_depth is tuneable, with a library default of 6).
List two advantages of Random Forests over a single deep tree.
Lower variance and built‑in OOB error estimate (also handles many features robustly).
When would you prefer boosting over Random Forests?
When the dataset is small-to‑medium, complex patterns exist, and you can afford careful hyper‑parameter tuning for maximal accuracy.
What global explainability tool is built into Random Forests?
Feature‑importance scores based on total impurity reduction.
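A sketch of reading those scores from scikit-learn (synthetic data; feature counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
# Impurity-based importances: total impurity decrease attributable to each feature,
# averaged over trees and normalised to sum to 1
print(rf.feature_importances_)
```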
What does TreeSHAP guarantee about the sum of feature contributions for an instance?
They add up exactly to the difference between the instance’s prediction and the model’s overall expected prediction (local additivity).
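A numerical check of that local-additivity property, assuming the shap package and a random-forest regressor as the tree model:

```python
import numpy as np
import shap                                    # assumes the shap package is installed
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X[:1])             # contributions for a single instance
base = np.ravel(explainer.expected_value)[0]   # model's expected prediction

# base value + sum of contributions ≈ the prediction for that instance
print(base + phi.sum(), model.predict(X[:1])[0])
```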
Give one disadvantage of AdaBoost on noisy datasets.
The exponential loss heavily up‑weights mislabeled/noisy observations, leading to overfitting.
Name two hyper‑parameters that regularise XGBoost.
Learning rate (η) and the L1/L2 penalties on leaf weights (α and λ; reg_alpha and reg_lambda in the API).
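How those knobs appear in the xgboost Python API (values below are illustrative, not recommended defaults):

```python
from xgboost import XGBRegressor   # assumes the xgboost package is installed

model = XGBRegressor(
    learning_rate=0.05,   # eta: shrinks each tree's contribution
    reg_alpha=0.1,        # L1 penalty on leaf weights
    reg_lambda=1.0,       # L2 penalty on leaf weights
    n_estimators=500,
)
```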
Quick rule: how many features (mtry) are considered per split for classification RF by default?
⌊√p⌋ (roughly √p), where p is the total number of features.