Chapter 9 - Decision Trees Flashcards
The process of automatically creating a tree from data is recursive, true or false?
true
give the ID3 algorithm
base case:
  if depth == 0 or all examples have the same label:
    return the most common label in the subsample
recursive case:
  for each feature: split the data and calculate the cost for that stump
  pick the feature with minimum cost
  add left branch = build_tree(left subsample, depth - 1)
  add right branch = build_tree(right subsample, depth - 1)
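A minimal Python sketch of that recursive procedure, assuming binary (0/1) feature values and a misclassification-count cost; the function and variable names here are illustrative, not from the chapter:

```python
from collections import Counter

def majority_label(labels):
    # most common label in the subsample
    return Counter(labels).most_common(1)[0][0]

def stump_cost(y, left_idx, right_idx):
    # number of misclassified examples if each side predicts its own majority label
    def errors(idx):
        if not idx:
            return 0
        maj = majority_label([y[i] for i in idx])
        return sum(y[i] != maj for i in idx)
    return errors(left_idx) + errors(right_idx)

def build_tree(X, y, depth):
    # base case: depth exhausted, or all examples already share one label
    if depth == 0 or len(set(y)) == 1:
        return majority_label(y)

    # recursive case: for each feature, split the data and score the stump,
    # then keep the feature with minimum cost
    best_cost, best_feature, best_left, best_right = None, None, None, None
    for f in range(len(X[0])):
        left = [i for i, row in enumerate(X) if row[f] == 0]
        right = [i for i, row in enumerate(X) if row[f] == 1]
        cost = stump_cost(y, left, right)
        if best_cost is None or cost < best_cost:
            best_cost, best_feature, best_left, best_right = cost, f, left, right

    if not best_left or not best_right:  # degenerate split: nothing to separate
        return majority_label(y)

    # add left and right branches by recursing on each subsample
    return {
        "feature": best_feature,
        "left": build_tree([X[i] for i in best_left], [y[i] for i in best_left], depth - 1),
        "right": build_tree([X[i] for i in best_right], [y[i] for i in best_right], depth - 1),
    }
```

Calling build_tree(X, y, depth=3) returns either a label (a leaf) or a nested dict of splits.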
what do we risk as we split on fewer and fewer examples?
we risk creating new nested rules for only a few examples, which may just be outliers, i.e. noise in the data
what two restrictions can we place on a decision tree to prevent overfitting?
maximum depth
minimum subsample size
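For reference, both restrictions correspond directly to constructor parameters in scikit-learn's DecisionTreeClassifier; a small sketch with hypothetical data and illustrative values:

```python
from sklearn.tree import DecisionTreeClassifier

# hypothetical training data: 6 examples, 2 binary features
X = [[0, 1], [1, 1], [1, 0], [0, 0], [1, 1], [0, 1]]
y = [0, 1, 1, 0, 1, 0]

clf = DecisionTreeClassifier(
    max_depth=3,          # maximum depth restriction
    min_samples_split=4,  # minimum subsample size required to attempt a split
)
clf.fit(X, y)
```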
what is a strong indicator that a decision tree is overfitted?
it is very deep
what is a decision tree splitting criterion?
a rule for scoring candidate splits; for example, splitting based on the number of errors, i.e. counting the number of misclassified samples after the split
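A small sketch of that error-count criterion, assuming each side of a candidate split predicts its own majority label (the helper name is illustrative):

```python
from collections import Counter

def split_errors(left_labels, right_labels):
    """Number of misclassified samples if each side of the split
    predicts its own majority label."""
    def errors(labels):
        if not labels:
            return 0
        majority_count = Counter(labels).most_common(1)[0][1]
        return len(labels) - majority_count
    return errors(left_labels) + errors(right_labels)

print(split_errors([0, 0, 0], [1, 1]))  # 0: the split separates the classes perfectly
print(split_errors([0, 0, 1], [1, 0]))  # 2: one error on each side
```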
what is another word for information gain
mutual information
how do we determine which features would give us the most information?
measure the information gain as a score for each feature
what does entropy measure?
entropy is a mathematical expression that quantifies the amount of randomness (uncertainty) in a random variable
the entropy for variable X, H(X)=?
H(X) = - Σ_x p(x) log₂(p(x))
(using log base 2)
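A direct translation of that formula (a minimal sketch; probabilities are passed in explicitly, and the 0·log 0 = 0 convention is used):

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), with the convention 0 * log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin
print(entropy([1.0]))       # 0.0 bits: no randomness at all
```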
what is the equation for information gain, I(T;W) = ?
H(T) - H(T|W)
or
H(W) - H(W|T)
Given that information gain is H(T) - H(T|W), and W can take two values, strong or weak, how do we compute I(T;W) = H(T) - H(T|W)?
calculate H(T), the entropy of T
calculate H(T|W=strong), the entropy of T where W is strong
calculate H(T|W=weak), the entropy of T where W is weak
take the weighted average of H(T|W=strong) and H(T|W=weak) to get H(T|W), then subtract it from H(T)
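A worked sketch of those steps with hypothetical counts (9 positive and 5 negative labels overall; 3 positive / 3 negative under strong wind, 6 positive / 2 negative under weak wind; the numbers are purely illustrative):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_T = entropy([9/14, 5/14])        # H(T): entropy of the labels overall
H_T_strong = entropy([3/6, 3/6])   # H(T | W=strong)
H_T_weak = entropy([6/8, 2/8])     # H(T | W=weak)

# H(T|W) is the weighted average, weights = fraction of examples in each branch
H_T_given_W = (6/14) * H_T_strong + (8/14) * H_T_weak

info_gain = H_T - H_T_given_W      # I(T;W)
print(round(info_gain, 3))         # ~0.048 bits for these counts
```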
we get the maximum possible information gain when….
H(T|W)=0
the maximum possible value of the entropy of T, H(T) = ?
log₂(|T|), where |T| is the number of possible values of T
what is pruning?
truncating a decision tree to prevent overfitting
what is prepruning?
stop the decision tree from growing too large
how can we create pre-pruning rules that are based on some knowledge about the training data set?
calculate the usefulness of each decision tree split at the point of computing the split, e.g. its information gain
what are the benefits of pre-pruning (2)?
reduce the likelihood of overfitting
reduce the overall training time
what is the risk of prepruning?
it is possible for a split to have low information gain while its child branches have high information gain, so pre-pruning may stop the tree too early
what is post-pruning?
deliberately allow the tree to overfit to the training data by allowing ID3 to run until all the training data are perfectly classified.
Then examine the performance on validation data
systematically remove branches to improve the validation accuracy until it matches the training accuracy.
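In practice, one concrete way to post-prune is scikit-learn's cost-complexity pruning: grow the full tree, then sweep pruning strengths and keep the tree with the best validation accuracy. A sketch, assuming X_train, y_train, X_val, y_val already exist:

```python
from sklearn.tree import DecisionTreeClassifier

# 1. deliberately let the tree overfit: no depth or size limits
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 2. candidate pruning strengths from cost-complexity pruning
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# 3. refit at each strength and keep the tree with the best validation accuracy
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
```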
Two plausible leaf nodes (or base cases) for the ID3 decision tree algorithm are:
When all labels are equal, or the minimum number of examples has been reached
In the ID3 algorithm, we choose to split based on
mutual information (Information gain)
what is pop in a decision tree leaf?
the number of training points that arrived at that node
what is err of a leaf?
the fraction of the examples arriving at that node that are incorrectly classified
how do we know the number of rules in a tree?
count the number of possible root-to-leaf paths.
Describe the distribution of a random variable where we will get the lowest entropy
The lowest entropy is calculated for a random variable that has a single event with a probability of 1.0, a certainty.
Describe the distribution of a random variable where we will get the highest entropy
The largest entropy occurs when all events are equally likely.
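A quick check of both extremes (illustrative distributions): a certain outcome gives zero entropy, and a uniform distribution over |T| outcomes gives the maximum, log₂(|T|).

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))                     # 0.0 bits: a certainty, the lowest entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4): uniform, the highest entropy
```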