Decision Trees Flashcards

1
Q

What are the common uses of decision tree algorithms?

A
  1. Classification
  2. Regression
  3. Measuring feature importance
  4. Feature selection
2
Q

What are the main hyperparameters that you can tune for decision trees?

A

Generally speaking, we have the following parameters:

max_depth: maximum tree depth

min_samples_split: minimum number of samples for a node to be split

min_samples_leaf: minimum number of samples for each leaf node

max_leaf_nodes: the maximum number of leaf nodes in a tree

max_features: the maximum number of features evaluated when searching for the best split at each node (most relevant for implementations that consider a random subset of features at each split)

Other similar features may be derived from the above hyperparameters.

The “traditional” decision tree is greedy and evaluates every feature at each split point, but many modern implementations (including sklearn’s) allow the split search to consider a random subset of the features, so max_features may or may not be a tunable parameter.
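
A minimal sketch of where these hyperparameters are set in sklearn; the specific values below are arbitrary examples, not recommendations:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# example values only; in practice these are tuned (e.g. via cross-validation)
tree = DecisionTreeClassifier(
    max_depth=4,            # maximum tree depth
    min_samples_split=10,   # minimum samples required to split a node
    min_samples_leaf=5,     # minimum samples required in each leaf
    max_leaf_nodes=20,      # cap on the total number of leaves
    max_features=None,      # consider all features at each split
    random_state=0,
)
model = tree.fit(X, y)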

3
Q

Explain how each hyperparameter affects the model’s ability to learn.

A

Generally speaking,

max_depth: increasing max_depth will increase variance and decrease bias

min_samples_split: increasing min_samples_split (the minimum number of samples for a node to be split) decreases variance and increases bias (regulates overfitting)

min_samples_leaf: increasing min_samples_leaf (the minimum number of samples for each leaf node) decreases variance and increases bias (regulates overfitting)

max_leaf_nodes: increasing max_leaf_nodes increases variance and decreases bias

max_features: increasing max_features increases variance and decreases bias

There may be instances when changing hyperparameters has no effect on the model
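
A small sketch (using the iris data as an arbitrary example) illustrating the max_depth effect: a depth-1 stump underfits (high bias), while an unconstrained tree is free to overfit (high variance):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# compare a heavily constrained tree with an unconstrained one
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy {scores.mean():.3f}")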

4
Q

What metrics are usually used to compute splits?

A

Gini impurity or entropy. Both generally produce similar results.

5
Q

What is Gini impurity?

A

Gini impurity (aka Gini index) measures how often a randomly chosen record would be incorrectly classified if it were labeled randomly according to the class distribution of the set of examples (i.e. a metric for IMPURITY): for class proportions p_i, Gini = 1 - sum_i p_i^2.
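
A minimal sketch of the calculation from a node's class labels (the function name is just for illustration):

import numpy as np

def gini_impurity(labels):
    # proportions of each class present at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0 -> pure node
print(gini_impurity([0, 1, 0, 1]))  # 0.5 -> maximally impure for 2 classes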

6
Q

What do high and low Gini scores mean?

A

Low Gini (near 0) = most records from the sample are in the same class (LOW IMPURITY).

High Gini (maximum of 1 - 1/K for K classes, e.g. 0.5 for a binary problem) = records from sample are spread evenly across classes (HIGH IMPURITY)

7
Q

What is entropy?

A

Entropy is a measure of impurity (disorder) of the class distribution at a node, computed over the non-empty classes as H = -sum_i p_i log2(p_i). It is very similar to Gini in concept, but a slightly different calculation.
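
A companion sketch for entropy, mirroring the Gini function above (illustrative only):

import numpy as np

def entropy(labels):
    # proportions of each (non-empty) class present at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([1, 1, 1, 1]))  # -0.0, i.e. zero -> pure node
print(entropy([0, 1, 0, 1]))  # 1.0 -> maximally impure for 2 classes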

8
Q

What do high and low entropy mean?

A

Low entropy (near 0) = most records from the sample are in the same class.

High entropy (maximum of 1 for a two-class problem; log2(K) for K classes) = records from sample are spread evenly across classes.

9
Q

Are decision trees parametric or non-parametric models?

parametric refers to whether the number of model parameters is fixed before the model is created (trained)

A

Decision trees are non-parametric. The number of model parameters is not determined before creating the model.

10
Q

What are some ways to reduce overfitting in decision trees?

A

To reduce flexibility of a decision tree:

  1. reduce maximum depth
  2. increase min_samples_split
  3. balance your data to prevent bias toward dominant classes
  4. Increase the number of samples
  5. Decrease the number of features
11
Q

How is feature importance evaluated in decision tree-based models?

A

The features that are split on most frequently and closest to the top of the tree (and therefore affect the largest number of samples, producing the largest total decrease in impurity) are considered the most important.
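
In sklearn, these impurity-based importances are exposed on a fitted model via the feature_importances_ attribute; a quick sketch on the iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# importances are normalized to sum to 1; larger = more important
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")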

12
Q

Explain what node purity means.

A

Each leaf node contains a yes/no prediction.

When a leaf node does not have 100% yes or 100% no, we call the node “impure”.

To consider which node is best, we need a way to measure and compare “impurity.”

13
Q

How is Gini impurity computed for a leaf node?

A

e.g. given possible root nodes [chest pain], [good blood circulation], [blocked arteries]

Gini impurity (two classes) = 1 - P(“yes”)^2 - P(“no”)^2

e.g. splitting on [chest pain]:

left leaf (chest pain “yes”): heart disease yes: 105, no: 39
right leaf (chest pain “no”): heart disease yes: 34, no: 125

left leaf Gini = 1 - (105 / (105+39))^2 - (39 / (105+39))^2
= 0.395

right leaf Gini = 1 - (34 / (34+125))^2 - (125 / (34+125))^2
= 0.336

Thus, the total Gini impurity for using chest pain to separate patients with and without heart disease is the wtd avg of the leaf node impurities.

total number left = 144
total number right = 159

Gini impurity for chest pain = wtd avg of Gini impurities for the leaf nodes

= 0.395 * (144 / (144+159)) + 0.336 * (159 / (144+159))
= 0.364

fast forward:
Gini impurity Chest Pain = 0.364
Gini impurity Good Blood Circulation = 0.360
Gini impurity Blocked Arteries = 0.381

thus:
Good Blood Circulation has the LOWEST Gini impurity, so Good Blood Circulation is chosen as the root node of the tree.

As attributes are selected for each node, the number of observations feeding into the next level is reduced as nodes approach the leaves.

For continuous attributes, candidate split values are the midpoints between adjacent sorted numeric values.
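
A short sketch that reproduces the weighted-average calculation above (the helper name leaf_gini is just illustrative):

def leaf_gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

left = leaf_gini(105, 39)   # ~0.395
right = leaf_gini(34, 125)  # ~0.336

n_left, n_right = 105 + 39, 34 + 125
chest_pain_gini = (left * n_left + right * n_right) / (n_left + n_right)
print(round(chest_pain_gini, 3))  # 0.364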

14
Q

Explain decision tree classifiers simply and how to use them

A

One reason for the popularity of tree-based models is their interpretability. In fact, decision trees can literally be drawn out in their complete form to create a HIGHLY INTUITIVE MODEL. From this basic tree system comes a wide variety of extensions from random forests to stacking.

Training a decision tree:

Decision tree learners attempt to find the decision rule that produces the greatest DECREASE in IMPURITY AT A NODE. While there are a number of measurements of impurity, by default DecisionTreeClassifier uses GINI impurity, G(t) = 1 - sum_i p_i^2, where G(t) is the Gini impurity at node t and p_i is the PROPORTION of observations of class i at node t. This process of finding decision rules that decrease impurity is repeated recursively until all leaf nodes are PURE (CONTAIN ONLY ONE CLASS) or some arbitrary cut-off is reached.

If we want to use a different impurity measure than Gini, we can specify that in the “criterion” hyperparameter.

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

# load the iris data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# create and train the decision tree classifier
tree = DecisionTreeClassifier(random_state=0)
model = tree.fit(X, y)

# make a new observation
observation = [[5, 4, 3, 2]]
# predicted class for the new observation
print(model.predict(observation))        # array([1])
# predicted class probabilities
print(model.predict_proba(observation))  # array([[0., 1., 0.]])
15
Q

Explain how a tree regressor works

A

Decision tree regression works similarly to decision tree classification; however, instead of reducing Gini impurity or entropy, potential splits are by default measured on how much they reduce MSE.

Just like DecisionTreeClassifier, we can use the “criterion” hyperparameter to select the desired measurement of split quality, e.g. we can construct a tree whose splits reduce mean absolute error (MAE) instead.

from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor

# load data with two features
# (note: load_boston was removed in scikit-learn 1.2; on newer versions,
#  substitute another regression dataset, e.g. fetch_california_housing)
boston = datasets.load_boston()
X, y = boston.data[:, 0:2], boston.target

# create the decision tree regressor object
tree = DecisionTreeRegressor(random_state=0)
# train the model
model = tree.fit(X, y)

# make a new observation
new_obs = [[0.02, 16]]
# predict the new observation's target
print(model.predict(new_obs))  # array([33.])
16
Q

Explain how to visualize a decision tree learning algorithm

A

One of the advantages of decision tree clfs is that we can viz the entire trained model–making decision trees one of the most interpretable models in ML.

In our solution, if we look at the root node we can see that the decision rule is that if petal width is less than or equal 0.8, then traverse left, else traverse right.

We can also see the Gini impurity of the node (0.667), the number of observations (150), the number of observations in each class ([50,50,50]), and the class that would be predicted if we stopped at that node (setosa).

We can also see that at that node the learner found that a single decision rule (petal width cm <= 0.8) was able to perfectly identify all of the setosa class observations. Furthermore, with one more decision rule with the same feature (petal width cm <= 1.75) the decision tree is able to correctly classify 144 of 150 observations. This makes petal width a very important feature!
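
A minimal sketch of one way to draw the trained iris tree described above, assuming sklearn's plot_tree helper and matplotlib (export_graphviz is another common option):

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = datasets.load_iris()
model = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# draw the full trained tree, showing the decision rule, Gini impurity,
# sample count, per-class counts, and majority class at every node
plt.figure(figsize=(14, 8))
plot_tree(model,
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True)
plt.show()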

17
Q

In clf decision trees, how does the algo determine which feature to split on?

A

After a candidate split, the resulting leaf nodes may contain a mix of “yes” and “no” records, so those leaves are called “impure”.

To determine which separation is best we need a way to measure and compare “impurity”.

We can use Gini impurity, where Gini impurity :=
1 - P(“yes”)^2 - P(“no”)^2

Low Gini indicates low impurity; high Gini indicates high impurity.

The tree algo selects the feature whose split gives the lowest weighted-average Gini impurity.

18
Q

What is Gini impurity and how do we compute it?

A

Gini is a metric that measures how “impure” a leaf node is in a decision tree.

Gini impurity = 1 - P(“yes”)^2 - P(“no”)^2

          [chest pain]
         /            \
  chest pain “yes”     chest pain “no”
  heart disease        heart disease
  yes: 105, no: 39     yes: 34, no: 125

left leaf gini = 1 - (105/144)^2 - (39/144)^2
= 0.395

right leaf gini = 1 - (34/159)^2 - (125/159)^2
= 0.336

Now, to get gini of the entire [chest pain] node, we take a weighted average of the left and right leaf ginis:

chest pain gini = wtd avg of the left and right leaf ginis
= 144/(144+159) * 0.395 + 159/(144+159) * 0.336
= 0.364

Do the above for all possible nodes and assign splitting node as one which has minimum total gini impurity.

IF impurity is lower by NOT attempting another node split among all possible features, then HALT and make the node at this depth a LEAF NODE.

BAM!

19
Q

Explain how a clf decision tree algo works in assigning splits and when to halt.

A
  1. Calculate the weighted-average Gini impurity score for ALL possible feature splits (see the sketch after this list):
    2a. If the node’s own Gini score is LOWER than the best weighted-average score among ALL candidate feature splits, then HALT and make the node a LEAF node.
    2b. If separating the data results in a LOWER Gini score, then split the node on the feature with the minimal weighted-average impurity score.
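
A toy sketch of this split-or-halt rule for binary (yes/no) features; the function names and the boolean feature matrix are illustrative assumptions, not any library's API:

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def choose_split(X_bool, y):
    """Return the best feature column to split on, or None to halt (make a leaf)."""
    best_feature, best_score = None, gini(y)  # a split must beat the node's own impurity
    for j in range(X_bool.shape[1]):
        mask = X_bool[:, j]
        if mask.all() or (~mask).all():
            continue  # this feature does not separate the data at all
        weighted = gini(y[mask]) * mask.mean() + gini(y[~mask]) * (~mask).mean()
        if weighted < best_score:
            best_feature, best_score = j, weighted
    return best_feature
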
20
Q

What is the structural difference between clf and regression decision trees?

A

clf tree nodes ask categorical questions and the values in the leaves are class labels (e.g. True/False).

Regression tree nodes use inequalities (numeric thresholds) and the values in the leaves are continuous numbers.

21
Q

How does a regression decision tree determine continuous inequality values in a leaf?

A

A regression tree leaf is assigned the AVERAGE TARGET y value of the observations in the data set that are PARTITIONED into that leaf by its parent nodes’ inequalities.

22
Q

How does a regression decision tree determine the continuous inequality value to split on at each tree node?

A

Initialize a dict to store {threshold x : SSR value} pairs.

For each candidate threshold x (e.g. the midpoint between adjacent sorted values of the feature), compute:

  1. the average TARGET, ybar, on each side of the THRESHOLD (obs <= x and obs > x)
  2. the SSR, i.e. the sum of squared RESIDUALS (yi - ybar)^2 over both sides of the split from step 1, and store {x key : SSR value} in the dict
  3. then SELECT the THRESHOLD INEQUALITY with the minimal SSR value in the dict and assign it to the NODE (see the sketch below)
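
A minimal NumPy sketch of this threshold search for one feature (best_threshold is an assumed helper name, not from any library):

import numpy as np

def best_threshold(x, y):
    """Return (threshold, SSR) that minimizes the total sum of squared residuals."""
    xs = np.unique(x)
    ssr_by_threshold = {}
    # candidate thresholds: midpoints between adjacent distinct feature values
    for t in (xs[:-1] + xs[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        ssr = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        ssr_by_threshold[t] = ssr
    best = min(ssr_by_threshold, key=ssr_by_threshold.get)
    return best, ssr_by_threshold[best]

# tiny example: the best split separates the low targets from the high ones
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0])
print(best_threshold(x, y))  # threshold 6.5 with the smallest SSR
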
23
Q

What specific values are in the leaf nodes of a regression tree?

A

The leaf nodes store the AVERAGE value given all conditions in the prior parent nodes.

24
Q

How does the decision tree algo know when to assign a node as a terminal leaf?

A

When building the tree top-down, after the PRIOR DECISION THRESHOLDS narrow the observations down to FEWER than min_samples_split number of observations, then that node will be assigned as a LEAF node.