Chapter 9 - Decision Trees Flashcards

1
Q

The process of automatically creating a tree from data is recursive, true or false?

A

true

2
Q

give the ID3 algorithm

A

base case:
    if depth == 0 or all examples in the subsample have the same label:
        return a leaf with the most common label in the subsample

recursive case:
    for each feature:
        split the data on that feature
        calculate the cost of this stump
    pick the feature with the minimum cost

    add left branch = build_tree(left subsample, depth - 1)
    add right branch = build_tree(right subsample, depth - 1)
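
As a concrete illustration of this recursion, here is a minimal Python sketch, assuming binary (0/1) features stored as dicts and a misclassification-count stump cost; the helper names (majority_label, stump_cost, build_tree) are illustrative, not from the course.

from collections import Counter

def majority_label(labels):
    # Most common label in the subsample.
    return Counter(labels).most_common(1)[0][0]

def stump_cost(left_labels, right_labels):
    # Misclassification count if each side predicts its own majority label.
    cost = 0
    for side in (left_labels, right_labels):
        if side:
            cost += len(side) - Counter(side).most_common(1)[0][1]
    return cost

def build_tree(examples, labels, features, depth):
    # Base case: depth exhausted, all labels identical, or no features left.
    if depth == 0 or len(set(labels)) <= 1 or not features:
        return {"leaf": majority_label(labels)}

    # Recursive case: score every feature's stump, keep the cheapest split.
    best = None
    for f in features:
        left = [i for i, x in enumerate(examples) if x[f] == 0]
        right = [i for i, x in enumerate(examples) if x[f] == 1]
        cost = stump_cost([labels[i] for i in left], [labels[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, f, left, right)

    _, f, left, right = best
    if not left or not right:
        # Best split puts everything on one side: stop and return a leaf.
        return {"leaf": majority_label(labels)}

    remaining = [g for g in features if g != f]
    return {
        "feature": f,
        "majority": majority_label(labels),  # kept for the post-pruning sketch later
        "left": build_tree([examples[i] for i in left],
                           [labels[i] for i in left], remaining, depth - 1),
        "right": build_tree([examples[i] for i in right],
                            [labels[i] for i in right], remaining, depth - 1),
    }

A call would look like build_tree(train_examples, train_labels, feature_names, depth=3); real ID3 implementations typically score splits with information gain rather than raw error count (see the later cards).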

3
Q

what do we risk as we split on fewer and fewer examples?

A

we risk creating new nested rules based on only a few examples, which may just be outliers or noise in the data

4
Q

what two restrictions can we place on a decision tree to prevent overfitting?

A

maximum depth

minimum subsample size

5
Q

what is a strong indicator that a decision tree is overfitted?

A

it is very deep

6
Q

what is a decision tree splitting criterion?

A

split based on the number of errors, i.e. count the number of misclassified samples after the split

7
Q

what is another term for information gain?

A

mutual information

8
Q

how do we determine which features would give us the most information?

A

measure the information gain as a score for each feature

9
Q

what does entropy measure?

A

entropy is a mathematical expression that quantifies the amount of randomness in a random variable

10
Q

the entropy for variable X, H(X)=?

A
H(X) = - sum over all x of p(x) log p(x)

(log base 2)
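
A quick Python sketch of this formula, assuming the distribution is given as a list of probabilities that sum to 1:

import math

def entropy(probs):
    # H(X) = -sum over x of p(x) * log2 p(x); zero-probability terms contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit for a fair coin
print(entropy([0.9, 0.1]))  # about 0.469 bits, less random than a fair coin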

11
Q

what is the equation for information gain, I(T;W) = ?

A

H(T) - H(T|W)
or
H(W) - H(W|T)

12
Q

given that information gain is H(T) - H(T|W), and that W can take two values, strong or weak, how do we compute I(T;W) = H(T) - H(T|W)?

A

calculate H(T), the entropy of T

calculate H(T|W=strong), the entropy of T where W is strong
calculate H(T|W=weak), the entropy of T where W is weak

take the weighted average of H(T|W=strong) and H(T|W=weak), weighted by p(W=strong) and p(W=weak), to get H(T|W)

then I(T;W) = H(T) - H(T|W)
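
A small Python sketch of this procedure (the entropy helper and the toy T/W values are made up for illustration):

from collections import Counter
import math

def entropy(labels):
    # Entropy of the empirical distribution of a list of labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(t, w):
    # I(T;W) = H(T) - H(T|W), where H(T|W) is the p(w)-weighted average
    # of the entropies H(T | W=w) over the values w of W.
    n = len(t)
    h_t_given_w = 0.0
    for value in set(w):
        subset = [t_i for t_i, w_i in zip(t, w) if w_i == value]
        h_t_given_w += (len(subset) / n) * entropy(subset)
    return entropy(t) - h_t_given_w

t = ["yes", "yes", "no", "no", "yes", "no"]                 # target labels
w = ["weak", "weak", "strong", "strong", "weak", "strong"]  # wind feature
print(information_gain(t, w))  # 1.0, since W perfectly predicts T here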

13
Q

we get the maximum possible information gain when….

A

H(T|W)=0

14
Q

the maximum possible value of the entropy of T, H(T) = ?

A

log(|T|), where |T| is the number of possible values of T
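
A quick numerical check of this bound, assuming T takes |T| = 8 equally likely values:

import math

k = 8  # number of possible values of T
# A uniform distribution over k values attains the maximum entropy log2(k).
h_uniform = -sum((1 / k) * math.log2(1 / k) for _ in range(k))
print(h_uniform, math.log2(k))  # 3.0 3.0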

15
Q

what is pruning?

A

truncating a decision tree to prevent overfitting

16
Q

what is prepruning?

A

stop the decision tree from growing too large

17
Q

how can we create pre-pruning rules that are based on some knowledge about the training data set?

A

calculate the usefulness of each decision tree split, e.g. its information gain, at the point of computing the split

18
Q

what are the benefits of pre-pruning (2)?

A

reduce the likelihood of overfitting

reduce the overall training time

19
Q

what is the risk of prepruning?

A

it is possible for a split to have low information gain while its child splits would have high information gain, so pre-pruning can stop the tree too early

20
Q

what is post-pruning?

A

deliberately allow the tree to overfit the training data by letting ID3 run until all the training examples are perfectly classified.

Then examine the performance on validation data

systematically remove branches to improve the validation accuracy until it matches the training accuracy.
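
As a rough sketch of one common post-pruning variant (reduced-error pruning), reusing the dict-based tree from the ID3 sketch on card 2, which stores a majority label at each internal node; predict, accuracy, and prune are illustrative names:

def predict(tree, x):
    # Walk down the tree until a leaf is reached.
    while "leaf" not in tree:
        tree = tree["right"] if x[tree["feature"]] == 1 else tree["left"]
    return tree["leaf"]

def accuracy(tree, examples, labels):
    return sum(predict(tree, x) == y for x, y in zip(examples, labels)) / len(labels)

def prune(node, root, val_x, val_y):
    # Post-order: prune the children first, then try collapsing this node
    # into a leaf; keep the change only if validation accuracy does not drop.
    if "leaf" in node:
        return
    prune(node["left"], root, val_x, val_y)
    prune(node["right"], root, val_x, val_y)
    before = accuracy(root, val_x, val_y)
    saved = dict(node)                 # remember the subtree
    node.clear()
    node["leaf"] = saved["majority"]   # tentatively replace it with a leaf
    if accuracy(root, val_x, val_y) < before:
        node.clear()
        node.update(saved)             # pruning hurt validation accuracy: undo

Calling prune(tree, tree, val_examples, val_labels) mutates the overfit tree in place.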

21
Q

Two plausible leaf nodes (or base cases) for the ID3 decision tree algorithm are:

A

When all labels are equal, or the minimum number of examples has been reached

22
Q

In the ID3 algorithm, we choose to split based on

A

mutual information (information gain)

23
Q

what is the pop of a decision tree leaf?

A

the number of training points that arrived at that node

24
Q

what is the err of a leaf?

A

the fraction of the examples arriving at that node that are incorrectly classified

25
Q

how do we know the number of rules in a tree?

A

count the number of possible paths from the root to a leaf.

26
Q

Describe the distribution of a random variable where we will get the lowest entropy

A

The lowest entropy (zero) is obtained for a random variable that has a single event with a probability of 1.0, i.e. a certainty.

27
Q

Describe the distribution of a random variable where we will get the highest entropy

A

The largest entropy for a random variable is obtained when all events are equally likely (a uniform distribution).
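
A tiny sketch contrasting the two extremes from the last two cards (the probabilities are chosen only for illustration):

import math

def entropy(probs):
    # H = -sum of p * log2(p), skipping zero-probability events.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 (prints -0.0): a certain outcome
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 = log2(4): all outcomes equally likely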