Lecture 6 - Decision Trees Flashcards

1
Q

Guess Who Example

A

REFER TO SLIDES

2
Q

What is a Split (or internal) node?

A

A node within the tree at which the data is split on a feature.
- The split label (of the form Xj <= tj) gives the condition for the left-hand branch resulting from that split
- Conversely, the right-hand branch corresponds to Xj > tj

3
Q

What is a Branch?

A

Links or edges connecting nodes

4
Q

What is a Leaf (or terminal) node?

A

A node at the end of a branch, where prediction is done.

5
Q

How is a tree used to make predictions?

A
  • Use the tree to make a decision for a single instance by following the sequence of
    questions and answers down from the root node, going left or right depending on the answer.
  • The class of the leaf node reached is the output class.
  • Tree nodes also record the number of training samples allocated to each node,
    broken down by class.
  • Gini Impurity -> REFER TO SLIDES FOR FORMULA (standard form: Gi = 1 - sum_k pi,k^2, where pi,k is the fraction of class-k training instances in node i; see the sketch below)
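
A minimal Python sketch (not from the slides; the helper name gini_impurity is my own) of the standard Gini computation:

    import numpy as np

    def gini_impurity(class_counts):
        """Gini impurity of a node, given its per-class training-sample counts."""
        p = np.asarray(class_counts, dtype=float)
        p = p / p.sum()               # class proportions pi,k in the node
        return 1.0 - np.sum(p ** 2)   # Gi = 1 - sum_k pi,k^2

    print(gini_impurity([50, 0]))   # 0.0 -- pure node
    print(gini_impurity([25, 25]))  # 0.5 -- maximally mixed binary node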
6
Q

What is a White Box Classifier?

A

REFER TO SLIDES FOR DIAGRAM
(In short: a white box classifier, such as a Decision Tree, makes decisions that are easy to interpret and explain, in contrast to black box models such as neural networks.)

7
Q

What is Classification And Regression Tree (CART)?

A

Classification And Regression Tree (CART) is the algorithm used to train Decision Trees.
- Divide the feature space - the set of possible values for X1, X2, …, Xn - into J distinct and non-overlapping regions, R1, R2, …, RJ
- Regression: Predict every instance that falls into the region Rj as the mean of the response values for the training instances in Rj
- Classification: Predict every instance that falls into the region Rj as the class with the highest probability given the training instances in Rj (see the sketch below)
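
A small scikit-learn sketch (toy data of my own) showing both prediction rules, region means for regression and highest-probability class for classification:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

    # Regression: each region Rj predicts the mean response of its training instances.
    y_reg = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
    reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)
    print(reg.predict([[2.5]]))   # ~1.0, the mean of the left region

    # Classification: each region Rj predicts its highest-probability class.
    y_clf = np.array([0, 0, 0, 1, 1, 1])
    clf = DecisionTreeClassifier(max_depth=1).fit(X, y_clf)
    print(clf.predict([[2.5]]), clf.predict_proba([[2.5]]))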

8
Q

The regions' shapes are boxes; how are these boxes determined?

A

The boxes are determined by recursive binary splitting:
- Top down: it starts at the top of the tree (the root node) and works its way down.
- Greedy algorithm: it searches for an optimal split at the node level, then repeats the process recursively at subsequent levels. But the overall solution is not guaranteed to be optimal.
Thus, the algorithm:
- Finds the best predictor Xj and cutpoint tj that minimise the cost function (R1(j,tj) = {X|Xj <= tj} and R2(j,tj) = {X|Xj > tj}); see the sketch below.
- Repeats the process for the new subspaces.
- Stops when it cannot find a split that will reduce impurity or when a stopping condition is satisfied (e.g. the max depth hyperparameter).
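
An illustrative Python sketch (not the lecture's code) of one greedy step: scan every feature and cutpoint and keep the split with the lowest size-weighted Gini impurity:

    import numpy as np

    def gini(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(X, y):
        """Return (j, tj, cost) minimising the weighted impurity of
        R1 = {X | Xj <= tj} and R2 = {X | Xj > tj}."""
        m, n = X.shape
        best_j, best_t, best_cost = None, None, np.inf
        for j in range(n):                    # try every predictor Xj...
            for t in np.unique(X[:, j]):      # ...and every candidate cutpoint tj
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                if len(left) == 0 or len(right) == 0:
                    continue
                cost = (len(left) * gini(left) + len(right) * gini(right)) / m
                if cost < best_cost:
                    best_j, best_t, best_cost = j, t, cost
        return best_j, best_t, best_cost
    # CART then recurses with the same search on each of the two subspaces.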

9
Q

CART Training Algorithm - Classification

A

REFER TO SLIDES FOR FORMULA
(Standard CART classification cost for a split on feature j with cutpoint tj: J(j, tj) = (m_left/m)*G_left + (m_right/m)*G_right, where G is the Gini impurity and m_left, m_right are the numbers of instances in each subset.)

10
Q

What is the computational complexity?

A

Making predictions requires traversing the Decision Tree from the root to a leaf.
- Traversing the Decision Tree requires going through roughly O(log2(m)) nodes, where m is the number of training instances. Since each node only requires checking the value of one feature, the overall prediction complexity is just O(log2(m)), independent of the number of features. So predictions are very fast, even when dealing with large training sets.
- However, the training algorithm compares all n features (or fewer if max features is set) on all samples at each node. This results in a training complexity of O(n m log2(m)).
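
A quick back-of-the-envelope check (the numbers m and n are illustrative):

    import math

    m = 1_000_000                  # training instances
    n = 10                         # features
    print(math.log2(m))            # ~19.9 node checks per prediction
    print(n * m * math.log2(m))    # ~2e8 comparisons during training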

11
Q

When to use Impurity vs Entropy?

A
  • By default, the Gini impurity measure is used.
  • Entropy is an alternative (set the criterion hyperparameter to “entropy”). The term comes from thermodynamics, where entropy is a measure of molecular disorder.
  • Entropy is frequently used as an impurity measure in ML: a set’s entropy is zero when it contains instances of only one class.
    REFER TO SLIDES FOR FORMULA (standard form: Hi = -sum_k pi,k log2(pi,k), summing over classes k with pi,k != 0; see the sketch below)
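
A minimal sketch (the helper name entropy is my own) alongside the scikit-learn switch:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def entropy(class_counts):
        """H = -sum_k p_k * log2(p_k), over classes with p_k != 0."""
        p = np.asarray(class_counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]                      # zero-probability classes contribute nothing
        return -np.sum(p * np.log2(p))

    print(entropy([50, 0]))    # 0.0 -- a pure set has zero entropy
    print(entropy([25, 25]))   # 1.0 -- maximally mixed binary set

    # Switching the impurity measure in scikit-learn:
    clf = DecisionTreeClassifier(criterion="entropy")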
12
Q

How is overfitting avoided when using trees?

A

A smaller tree with fewer splits (regions) might lead to lower variance and better
interpretation at the cost of some bias.
Therefore, we need to restrict the maximum depth or prune the decision tree.
To avoid overfitting, we need to restrict DTs’ freedom during training, e.g. by setting the max depth hyperparameter to a smaller value.
- DecisionTreeClassifier has other similar hyperparameters to restrict the shape of the decision tree: min samples split, min samples leaf, min weight fraction leaf, max leaf nodes, max features (spelled out in the sketch below).
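
A sketch of the hyperparameters above as they are spelled in scikit-learn (the values are illustrative, not recommendations):

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        max_depth=4,                   # cap the tree's depth
        min_samples_split=10,          # a node needs >= 10 samples to be split
        min_samples_leaf=5,            # every leaf must keep >= 5 samples
        min_weight_fraction_leaf=0.0,  # like min_samples_leaf, as a fraction
        max_leaf_nodes=20,             # cap the total number of leaves
        max_features=None,             # features considered when searching splits
    )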

13
Q

Cost function for regression

A

REFER TO SLIDES FOR FORMULA
(Standard CART regression cost: J(j, tj) = (m_left/m)*MSE_left + (m_right/m)*MSE_right, where MSE is the mean squared error of predicting each region by the mean of its training responses.)

14
Q

Prone to overfitting (graphs)

A

REFER TO SLIDES FOR DIAGRAM

15
Q

Why are Decision Trees high variance models?

A

Small changes to the hyperparameters or to the data may produce very different models.
Even retraining the decision tree on exactly the same data may produce a very different model.
Random forests (RF) may overcome this by averaging predictions over many randomised trees.
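
A short sketch of the random-forest remedy (the dataset choice is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    # Averaging many randomised trees smooths out the variance of any single tree.
    rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
    print(rf.predict(X[:3]))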

16
Q

What is pruning?

A
  • A smaller tree with fewer splits (that is, fewer regions R1, …, RJ) might lead to lower variance and better interpretation at the cost of a little bias.
  • We want to select the tree that gives the lowest error on the test set. One alternative is to estimate the cross-validation error for every possible subtree, but that can be impractical.
  • Another alternative is to grow a large tree T0, and then prune it back to obtain a subtree.
    REFER TO SLIDES FOR FORMULAS (cost-complexity pruning minimises the tree's training error plus a penalty alpha*|T|, where |T| is the number of leaves and alpha >= 0 trades off subtree size against fit; see the sketch below)
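
A sketch of growing a large tree and pruning it back via scikit-learn's cost-complexity pruning (the dataset and split are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Grow a large tree T0, then list the effective alphas for pruning it back.
    path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
        X_train, y_train)

    # Each ccp_alpha corresponds to one candidate subtree; keep the one that
    # scores best on held-out data.
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
        pruned.fit(X_train, y_train)
        print(f"alpha={alpha:.4f}  held-out accuracy={pruned.score(X_test, y_test):.3f}")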