Chapter 9: Decision Trees Flashcards
is the process of automatically creating a tree from data recursive, true or false?
true
give the ID3 algorithm
base case:
if depth == 0 or all examples in the subsample have the same label:
return the most common label in the subsample
recursive case:
for each feature: split the data and calculate the cost of the resulting stump
pick the feature with the minimum cost
add left branch = build_tree(left subsample, depth - 1)
add right branch = build_tree(right subsample, depth - 1)
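A minimal Python sketch of this recursion, assuming binary 0/1 features and using the misclassification count as the stump cost; the names (build_tree, most_common_label) are illustrative, not the chapter's own code:

```python
from collections import Counter

def most_common_label(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(X, y, depth):
    # Base case: out of depth, or every example already has the same label.
    if depth == 0 or len(set(y)) == 1:
        return most_common_label(y)
    best = None
    for f in range(len(X[0])):  # recursive case: try splitting on each feature
        left = [i for i, x in enumerate(X) if x[f] == 0]
        right = [i for i, x in enumerate(X) if x[f] == 1]
        if not left or not right:
            continue
        # Cost of this stump: errors made if each side predicts its majority label.
        cost = (sum(y[i] != most_common_label([y[j] for j in left]) for i in left)
                + sum(y[i] != most_common_label([y[j] for j in right]) for i in right))
        if best is None or cost < best[0]:
            best = (cost, f, left, right)
    if best is None:  # no feature splits the data; stop here
        return most_common_label(y)
    _, f, left, right = best
    return {"feature": f,
            "left": build_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
            "right": build_tree([X[i] for i in right], [y[i] for i in right], depth - 1)}

print(build_tree([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1], depth=2))
# -> {'feature': 0, 'left': 0, 'right': 1}
```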
what do we risk as we split on fewer and fewer examples?
we risk creating new nested rules that cover only a few examples, and those examples may just be outliers, i.e. noise in the data
what two restrictions can we place on a decision tree to prevent overfitting?
maximum depth
minimum subsample size
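Both restrictions as exposed by scikit-learn (assuming sklearn is available; the hyperparameter values here are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,         # restriction 1: maximum depth
    min_samples_leaf=5,  # restriction 2: minimum subsample size per leaf
)
clf.fit([[0, 0], [0, 1], [1, 0], [1, 1]] * 3, [0, 0, 1, 1] * 3)
```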
what is a strong indicator that a decision tree is overfitted?
it is very deep
what is a decision tree splitting criterion?
a rule for scoring candidate splits; for example, split based on the number of errors, i.e. count the number of misclassified samples after the split, with each side predicting its majority label
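A small sketch of that error-count criterion, assuming each side of the split predicts its majority label (function names are illustrative):

```python
from collections import Counter

def errors(labels):
    # Misclassifications if this side predicts its majority label.
    return len(labels) - Counter(labels).most_common(1)[0][1]

def split_cost(left_labels, right_labels):
    return errors(left_labels) + errors(right_labels)

print(split_cost([0, 0, 1], [1, 1, 1]))  # -> 1: one stray '1' on the majority-0 side
```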
what is another word for information gain
mutual information
how do we determine which features would give us the most information?
measure the information gain of each feature and use it as a score; split on the feature with the highest gain
what does entropy measure?
entropy is a mathematical expression that quantifies the amount of randomness (uncertainty) in a variable's distribution
the entropy for variable X, H(X)=?
H(X) = -Σ_x p(x) log2 p(x), with the logarithm taken base 2
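A sketch of the formula in Python; terms with p(x) = 0 are skipped because they contribute 0 to the sum:

```python
import math

def entropy(probs):
    # H(X) = -sum over x of p(x) * log2(p(x))
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin, maximum randomness for 2 values
print(entropy([1.0]))       # 0.0 bits: a certain outcome has no randomness
```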
what is the equation for information gain, I(T;W) = ?
H(T) - H(T|W)
or
H(W) - H(W|T)
given that information gain is I(T;W) = H(T) - H(T|W), and W can take two values, strong or weak, how do we compute I(T;W)?
calculate H(T), the entropy of T
calculate H(T|W=strong), the entropy of T over the samples where W is strong, and H(T|W=weak), the entropy of T over the samples where W is weak
take H(T|W) as the weighted average of H(T|W=strong) and H(T|W=weak), weighted by p(W=strong) and p(W=weak)
subtract: I(T;W) = H(T) - H(T|W)
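The same recipe in code, with made-up counts (9 of 14 samples positive; 6 samples have W=strong, 8 have W=weak) purely to make the arithmetic concrete:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_T = entropy([9/14, 5/14])                            # step 1: H(T)
H_T_strong = entropy([3/6, 3/6])                       # step 2: H(T|W=strong)
H_T_weak = entropy([6/8, 2/8])                         #         H(T|W=weak)
H_T_given_W = (6/14) * H_T_strong + (8/14) * H_T_weak  # step 3: weighted average
print(round(H_T - H_T_given_W, 3))                     # step 4: I(T;W) ~= 0.048
```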
we get the maximum possible information gain when….
H(T|W) = 0, i.e. once we know W there is no remaining uncertainty about T
the maximum possible value of the entropy of T, H(T) = ?
log2(|T|), where |T| is the number of values T can take; the maximum is reached when T is uniformly distributed over those values
what is pruning?
truncating a decision tree to prevent overfitting
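One practical form of pruning, assuming scikit-learn is available: cost-complexity pruning via ccp_alpha removes branches whose error reduction is too small to justify the extra nodes (the data and alpha value here are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data with one mislabeled point that an unpruned tree would chase.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5 + [[0, 0]]
y = [0, 0, 1, 1] * 5 + [1]
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)
print(full.get_depth(), pruned.get_depth())  # the pruned tree is shallower
```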