Decision trees Flashcards

1
Q

What is the approximate depth of a decision tree trained (without restrictions) on a training set with one million instances?

A

The depth of a well balanced binary tree containing m leaves is equal to:

log2(m) = log2(10^6) ≈ 20
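
A quick way to check that estimate (a minimal Python sketch):

import math

# Depth needed for a balanced binary tree with one million leaves
m = 1_000_000
print(math.log2(m))  # ≈ 19.93, i.e. roughly 20 levels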

2
Q

Is a node’s Gini impurity generally lower or higher than its parent’s? Is it generally lower/higher, or always lower/higher?

A

A node’s Gini impurity is generally lower than its parent’s. This is due to the CART training algorithm’s cost function, which splits each node in a way that minimizes the weighted sum of its children’s Gini impurities.

However, it is possible for a child node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child’s impurity.
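
A small worked example (hypothetical counts, not taken from any figure in this deck): a parent node holding four class-A instances and one class-B instance has Gini 1 − (4/5)² − (1/5)² = 0.32. Splitting it into children {A, B} and {A, A, A} gives child Ginis of 0.5 and 0.0; the first child is more impure than its parent, yet the weighted sum (2/5) × 0.5 + (3/5) × 0.0 = 0.2 is lower than 0.32, so CART accepts the split.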

3
Q

If a decision tree is overfitting the training set, is it a good idea to try decreasing max_depth?

A

If a decision tree is overfitting the training set, it may be a good idea to decrease max_depth, since this constrains the model, regularizing it.
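
For example, in scikit-learn (a minimal sketch; the iris dataset and the value max_depth=3 are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree keeps growing until every leaf is pure, which invites overfitting
unconstrained = DecisionTreeClassifier(random_state=42).fit(X, y)

# Limiting max_depth constrains (regularizes) the model
regularized = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

print(unconstrained.get_depth(), regularized.get_depth())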

4
Q

If a decision tree is underfitting the training set, is it a good idea to try scaling the input features?

A

Decision trees don’t care whether or not the training data is scaled or centered; that is one of the nice things about them. So if a decision tree underfits the training set, scaling the input features will just be a waste of time.

5
Q

If it takes one hour to train a decision tree on a training set containing 1 million instances, roughly how much time will it take to train another decision tree on a set with 10 million instances?

A

The computational complexity of training a decision tree is O(n × m log(m)), where n is the number of features and m the number of training instances. So if you multiply the training set size by 10, the training time is multiplied by:

(n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m)

With m = 10^6, this ratio is 10 × 7/6 ≈ 11.7 (the base of the logarithm cancels out).

Therefore training on 10 million instances should take roughly 11.7 hours.
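
A quick numeric check (a minimal Python sketch; the one-hour baseline comes from the question):

import math

baseline_hours = 1.0
m = 1_000_000

# Training cost grows as m * log(m); the number of features n cancels out of the ratio
ratio = (10 * m * math.log(10 * m)) / (m * math.log(m))
print(baseline_hours * ratio)  # ≈ 11.67 hours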

6
Q

If your training set contains 100,000 instances, will setting presort=True speed up training?

A

Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, presorting will considerably slow down training.

7
Q

Let’s say you have a decision tree that looks like this figure. How does the model make predictions? And what do gini, samples, value, and class represent?

A

It starts at the root node (depth 0, at the top), then works its way down, decision by decision, until it reaches a leaf. You can think of the decision tree as a collection of subtrees.

value: the number of training instances of each class that this node applies to.

samples: the total number of training instances that this node applies to.

class: the class this node predicts (the majority class among its instances).

gini: a measure of the node’s impurity (0 means the node is pure).
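
For example, you can inspect these quantities on a fitted scikit-learn tree (a minimal sketch; the iris dataset and max_depth=2 are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Text view of the tree: one split rule per non-leaf node, predicted class at each leaf
print(export_text(tree, feature_names=iris.feature_names))

# gini, samples and value for every node are stored on the fitted tree_ object
print(tree.tree_.impurity)        # gini of each node
print(tree.tree_.n_node_samples)  # samples of each node
print(tree.tree_.value)           # value of each node (per-class counts or proportions, depending on the scikit-learn version)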

8
Q

In the decision tree model, what is the Gini impurity?

A

The Gini impurity of the i-th node is given by:

G_i = 1 − Σ_k (p_i,k)²   (the sum runs over the n classes)

p_i,k is the ratio of class k instances among the training instances in the i-th node. A node is pure (Gini = 0) if all the training instances it applies to belong to the same class.

For example, if samples = 54 and value = [0, 49, 5], then the Gini impurity is equal to:

1 − (0/54)² − (49/54)² − (5/54)² ≈ 0.168
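
The same computation as a tiny Python helper (a minimal sketch; gini is a hypothetical helper name, not a scikit-learn function):

def gini(value):
    # value holds the per-class instance counts of a node, e.g. [0, 49, 5]
    total = sum(value)
    return 1 - sum((count / total) ** 2 for count in value)

print(gini([0, 49, 5]))  # ≈ 0.168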

9
Q

What kind of decision trees are produced by the sklearn algorithm? What is the name of this algorithm?

A

sklearn uses the CART algorithm, which produces only binary trees: non-leaf nodes always have two children (i.e., questions only have yes/no answers).

10
Q

Explain the depth concept in a decision tree, as shown in the attached figure.

A

The thick vertical line represents the decision boundary of the root node (depth 0). Since the left-hand area is pure, it cannot be split any further. However, the right-hand area is impure, so the depth-1 right node splits it using the dashed line. If max_depth is set to 2, the tree stops right there. If max_depth were set to 3, then the two depth-2 nodes would each add another decision boundary (represented by the dotted lines).

11
Q

What would be the estimated class probabilities for a decision tree node with value = [0, 49, 5]?

A

0% (0/54)
90.7% (49/54)
9.3% (5/54)
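
These are just the per-class ratios of the node’s value array, which is what scikit-learn’s predict_proba returns for any instance that ends up in that node (a minimal sketch):

value = [0, 49, 5]
total = sum(value)
print([count / total for count in value])  # [0.0, 0.907..., 0.092...]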
12
Q

How does the CART training algorithm work?

A

The Classification and Regression Tree (CART) algorithm trains decision trees by first splitting the training set into two subsets using a single feature k and a threshold t_k.

It searches for the pair (k, t_k) that produces the purest subsets (weighted by their size). Then it splits each of the resulting subsets recursively, using the same logic. It stops recursing once it reaches the maximum depth, or once it cannot find a split that will reduce the impurity.
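
A brute-force sketch of the split search for a single node (a simplified illustration of the idea, not scikit-learn’s implementation; gini and best_split are hypothetical helper names):

import numpy as np
from sklearn.datasets import load_iris

def gini(y):
    # Gini impurity of a set of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Try every feature k and every candidate threshold t_k, keeping the pair
    # that minimizes the weighted Gini impurity of the two resulting subsets.
    best_k, best_t, best_cost = None, None, float("inf")
    m = len(y)
    for k in range(X.shape[1]):
        for t in np.unique(X[:, k]):
            left = X[:, k] <= t
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            cost = (left.sum() / m) * gini(y[left]) + (right.sum() / m) * gini(y[right])
            if cost < best_cost:
                best_k, best_t, best_cost = k, t, cost
    return best_k, best_t, best_cost

X, y = load_iris(return_X_y=True)
print(best_split(X, y))  # feature index, threshold and cost of the root split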

13
Q

What is the CART cost function for classification?

A

J(k, t_k) = (m_left / m) × G_left + (m_right / m) × G_right

Where:

G_left/right measures the impurity of the left/right subset.

m_left/right is the number of instances in the left/right subset.

14
Q

What is a greedy algorithm?

A

A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage with the intent of finding a global optimum. In many problems, a greedy strategy usually does not produce an optimal solution, but nonetheless a greedy heuristic may yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time.

15
Q

What are the 7 hyperparameters for a decision tree provided with sklearn?

A

1) max_depth: the maximum depth of the tree.
2) min_samples_split: the minimum number of samples required to split an internal node.
3) min_samples_leaf: the minimum number of samples required to be at a leaf node.
4) min_weight_fraction_leaf: the minimum weighted fraction of the sum total of sample weights required to be at a leaf node.
5) max_features: the maximum number of features that are evaluated for splitting at each node.
6) max_leaf_nodes: the maximum number of leaf nodes.
7) min_impurity_decrease: a node will be split only if the split induces a decrease of the impurity greater than or equal to this value.
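
A sketch of where these appear in scikit-learn (the specific values below are arbitrary, for illustration only):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                   # maximum depth of the tree
    min_samples_split=10,          # min samples needed to split an internal node
    min_samples_leaf=4,            # min samples required at a leaf node
    min_weight_fraction_leaf=0.0,  # min weighted fraction of sample weights at a leaf
    max_features=None,             # max number of features evaluated per split
    max_leaf_nodes=20,             # maximum number of leaf nodes
    min_impurity_decrease=0.001,   # required impurity decrease to split a node
)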

16
Q

What do we mean by the instability of a decision tree?

A

The main issue with decision trees is that they are very sensitive to small variations in the training data. If you remove one training instance, you may get a completely different tree. In fact, since the training algorithm used by sklearn is stochastic, you may get very different models even on the same training data (unless you set the random_state hyperparameter).
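
For example (a minimal sketch; the iris dataset is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Two fits on the exact same data may pick different (equally good) splits,
# because features are randomly permuted at each split
tree_a = DecisionTreeClassifier().fit(X, y)
tree_b = DecisionTreeClassifier().fit(X, y)

# Fixing random_state makes training reproducible
tree_c = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_d = DecisionTreeClassifier(random_state=42).fit(X, y)
print(export_text(tree_c) == export_text(tree_d))  # True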
