Chapter 9 - Decision Trees Flashcards
The process of automatically creating a tree from data is recursive, true or false?
true
give the ID3 algorithm
base case:
  if depth == 0 or all examples have the same label:
    return the most common label in the subsample
recursive case:
  for each feature: split the data and calculate the cost for that stump
  pick the feature with minimum cost
  add left branch = build_tree(left subsample, depth - 1)
  add right branch = build_tree(right subsample, depth - 1)
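A minimal Python sketch of that recursive procedure, assuming binary (0/1) feature values and a misclassification-count cost; the function and variable names here are illustrative, not from the chapter:

```python
from collections import Counter

def majority_label(labels):
    # most common label in the subsample
    return Counter(labels).most_common(1)[0][0]

def stump_cost(y, left_idx, right_idx):
    # number of misclassified examples if each side predicts its own majority label
    def errors(idx):
        if not idx:
            return 0
        maj = majority_label([y[i] for i in idx])
        return sum(y[i] != maj for i in idx)
    return errors(left_idx) + errors(right_idx)

def build_tree(X, y, depth):
    # base case: depth exhausted, or all examples already share one label
    if depth == 0 or len(set(y)) == 1:
        return majority_label(y)

    # recursive case: for each feature, split the data and score the stump,
    # then keep the feature with minimum cost
    best_cost, best_feature, best_left, best_right = None, None, None, None
    for f in range(len(X[0])):
        left = [i for i, row in enumerate(X) if row[f] == 0]
        right = [i for i, row in enumerate(X) if row[f] == 1]
        cost = stump_cost(y, left, right)
        if best_cost is None or cost < best_cost:
            best_cost, best_feature, best_left, best_right = cost, f, left, right

    if not best_left or not best_right:  # degenerate split: nothing to separate
        return majority_label(y)

    # add left and right branches by recursing on each subsample
    return {
        "feature": best_feature,
        "left": build_tree([X[i] for i in best_left], [y[i] for i in best_left], depth - 1),
        "right": build_tree([X[i] for i in best_right], [y[i] for i in best_right], depth - 1),
    }
```

Calling build_tree(X, y, depth=3) returns either a label (a leaf) or a nested dict of splits.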
what do we risk as we split on fewer and fewer examples?
we risk creating new nested rules for only a few examples, which may just be outliers, i.e. noise in the data
what two restrictions can we place on a decision tree to prevent overfitting?
maximum depth
minimum subsample size
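For reference, both restrictions correspond directly to constructor parameters in scikit-learn's DecisionTreeClassifier; a small sketch with hypothetical data and illustrative values:

```python
from sklearn.tree import DecisionTreeClassifier

# hypothetical training data: 6 examples, 2 binary features
X = [[0, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 1]]
y = [0, 1, 1, 0, 1, 0]

clf = DecisionTreeClassifier(
    max_depth=3,          # maximum depth restriction
    min_samples_split=4,  # minimum subsample size required to attempt a split
)
clf.fit(X, y)
```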
what is a strong indicator that a decision tree is overfitted?
it is very deep
what is a decision tree splitting criterion?
a rule for scoring candidate splits; for example, splitting based on the number of errors, i.e. counting the number of misclassified samples after the split
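A small sketch of that error-count criterion, assuming each side of a candidate split predicts its own majority label (the helper name is illustrative):

```python
from collections import Counter

def split_errors(left_labels, right_labels):
    """Number of misclassified samples if each side of the split
    predicts its own majority label."""
    def errors(labels):
        if not labels:
            return 0
        majority_count = Counter(labels).most_common(1)[0][1]
        return len(labels) - majority_count
    return errors(left_labels) + errors(right_labels)

print(split_errors([0, 0, 0], [1, 1]))  # 0: the split separates the classes perfectly
print(split_errors([0, 0, 1], [1, 0]))  # 2: one error on each side
```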
what is another word for information gain
mutual information
how do we determine which features would give us the most information?
measure the information gain as a score for each feature
what does entropy measure?
entropy is a mathematical expression that quantifies the amount of randomness (uncertainty) in a random variable
the entropy for variable X, H(X)=?
H(X) = - Σ_x p(x) log₂(p(x))
(using log base 2)
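A direct translation of that formula (a minimal sketch; probabilities are passed in explicitly, and the 0·log 0 = 0 convention is used):

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), with the convention 0 * log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin
print(entropy([1.0]))       # 0.0 bits: no randomness at all
```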
what is the equation for information gain, I(T;W) = ?
H(T) - H(T|W)
or
H(W) - H(W|T)
Given that information gain is H(T) - H(T|W), and W can take two values, strong or weak, how do we compute I(T;W) = H(T) - H(T|W)?
calculate H(T), the entropy of T
calculate H(T|W=strong), the entropy of T where W is strong
calculate H(T|W=weak), the entropy of T where W is weak
take the weighted average of H(T|W=strong) and H(T|W=weak) to get H(T|W), then subtract it from H(T)
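A worked sketch of those steps with hypothetical counts (9 positive and 5 negative labels overall; 3 positive / 3 negative under strong wind, 6 positive / 2 negative under weak wind; the numbers are purely illustrative):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_T = entropy([9/14, 5/14])        # H(T): entropy of the labels overall
H_T_strong = entropy([3/6, 3/6])   # H(T | W=strong)
H_T_weak = entropy([6/8, 2/8])     # H(T | W=weak)

# H(T|W) is the weighted average, weights = fraction of examples in each branch
H_T_given_W = (6/14) * H_T_strong + (8/14) * H_T_weak

info_gain = H_T - H_T_given_W      # I(T;W)
print(round(info_gain, 3))         # ~0.048 bits for these counts
```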
we get the maximum possible information gain when….
H(T|W)=0
the maximum possible value of the entropy of T, H(T) = ?
log₂(|T|), where |T| is the number of possible values of T
what is pruning?
truncating a decision tree to prevent overfitting
what is prepruning?
stop the decision tree from growing too large
how can we create pre-pruning rules that are based on some knowledge about the training data set?
calculate the usefulness of each decision tree split at the point of computing the split, e.g. its information gain
what are the benefits of pre-pruning (2)?
reduce the likelihood of overfitting
reduce the overall training time
what is the risk of prepruning?
it is possible for a split to have low information gain while its child branches have high information gain, so pre-pruning may stop the tree too early
what is post-pruning?
deliberately allow the tree to overfit to the training data by allowing ID3 to run until all the training data are perfectly classified.
Then examine the performance on validation data
systematically remove branches to improve the validation accuracy until it matches the training accuracy.
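In practice, one concrete way to post-prune is scikit-learn's cost-complexity pruning: grow the full tree, then sweep pruning strengths and keep the tree with the best validation accuracy. A sketch, assuming X_train, y_train, X_val, y_val already exist:

```python
from sklearn.tree import DecisionTreeClassifier

# 1. deliberately let the tree overfit: no depth or size limits
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 2. candidate pruning strengths from cost-complexity pruning
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# 3. refit at each strength and keep the tree with the best validation accuracy
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
```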
Two plausible leaf nodes (or base cases) for the ID3 decision tree algorithm are:
When all labels are equal, or the minimum number of examples has been reached
In the ID3 algorithm, we choose to split based on
mutual information (Information gain)
what is pop in a decision tree leaf?
the number of training points that arrived at that node
what is err of a leaf?
the fraction of the examples arriving at that node that are incorrectly classified
how do we know the number of rules in a tree?
count the number of possible root-to-leaf paths.
Describe the distribution of a random variable where we will get the lowest entropy
The lowest entropy is calculated for a random variable that has a single event with a probability of 1.0, a certainty.
Describe the distribution of a random variable where we will get the highest entropy
The largest entropy occurs when all events are equally likely.
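A quick check of both extremes (illustrative distributions): a certain outcome gives zero entropy, and a uniform distribution over |T| outcomes gives the maximum, log₂(|T|).

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                     # 0.0 bits: a certainty, the lowest entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4): uniform, the highest entropy
```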