Chapter 5: Decision Trees Flashcards
can decision trees be used for both regression and classification problems?
yes
how to make a prediction within a decision tree?
locate the terminal node (region) to which the observation belongs, and use the average (regression) or the majority class (classification) of the target variable in that region as the prediction
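Not part of the flashcards, but a minimal sketch of this prediction rule, assuming scikit-learn and made-up toy data:

```python
# Sketch: fit a regression tree and a classification tree, then predict.
# The toy data and max_depth choice are illustrative assumptions only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_num = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])   # numeric target
y_cat = np.array(["A", "A", "A", "B", "B", "B"])   # categorical target

reg = DecisionTreeRegressor(max_depth=2).fit(X, y_num)
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_cat)

# A new observation is dropped down the tree to its terminal node; the tree
# returns that node's average (regression) or majority class (classification).
print(reg.predict([[2.5]]))   # roughly the mean of the left node's y values
print(clf.predict([[2.5]]))   # the majority class in that node, here "A"
```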
what is a node? what are the two types of nodes?
a point on a decision tree that corresponds to a subset of the training data.
- root node: node at the top of the tree representing the full dataset
- terminal nodes: nodes at the bottom of the tree; endpoints that are not split any further
what is a measure of complexity for a tree?
depth.
the number of tree splits needed to go from the tree’s root to the furthest terminal node.
larger depth = more complex
T/F: When making a split in a tree, we separate the training observations into two subgroups according to the value or level of one and only one predictor at a time.
True
every time we make a binary split, there are two related decisions that we have to make:
1. which predictor to split on
2. given the split predictor, the corresponding cutoff value
what is the goal for terminal nodes?
to create terminal nodes that contain similar observations (that are easy to predict and analyze)
what is the measure of impurity for a regression tree?
RSS
- the lower the RSS, the more homogeneous the target observations are.
write out the equation for the RSS of a node m.
RSS[m] = sum over the training observations i in Rm of (y[i] - ybar[Rm])^2, where ybar[Rm] is the average of the target values in node m
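A quick numeric check of this node-RSS definition (not from the flashcards; the target values below are made up), assuming NumPy:

```python
# Node RSS: RSS[m] = sum_i (y_i - mean of y in node m)^2, for one node's targets.
import numpy as np

y_node = np.array([2.0, 3.0, 5.0, 6.0])        # target values falling in node m
rss_m = np.sum((y_node - y_node.mean()) ** 2)  # mean is 4.0, so RSS = 4+1+1+4 = 10.0
print(rss_m)
```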
what are the three measures of node impurity for classification trees? do we want a large or small value?
- classification error rate
- gini index
- entropy
we want a smaller value = a more homogeneous node (think of the classification error rate)
in terms of classification trees, what does p[m,k] mean? what is the formula?
it is the proportion of target observations in terminal node m (Rm) that are part of the kth class
p[m,k] = # of training observations of kth class in Rm / total # of observations in Rm
what is the classification error rate? what is the formula?
it is the proportion of misclassified observations in a terminal node m.
E[m] = 1 - max over k of p[m,k]
recall, max over k of p[m,k] is the proportion of observations in Rm belonging to the most common level of the target variable. these observations are correctly classified in that terminal node. then, the complement of that is the proportion of observations that are misclassified
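A small illustration of p[m,k] and E[m] for one node (an assumption, not part of the deck; the class labels are made up), again using NumPy:

```python
# Compute p[m,k] for each class in a node, then E[m] = 1 - max_k p[m,k].
import numpy as np

labels_node = np.array(["A", "A", "A", "B", "C"])    # training labels in node m
classes, counts = np.unique(labels_node, return_counts=True)
p_mk = counts / counts.sum()                         # p[m,k] for each class k
error_rate = 1 - p_mk.max()                          # classification error rate E[m]
print(dict(zip(classes, p_mk)))                      # {'A': 0.6, 'B': 0.2, 'C': 0.2}
print(error_rate)                                    # 0.4
```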
what does it mean when the classification error rate is close to 0?
this means that almost all of the training observations in Rm are correctly classified
what is a common criticism against the classification error rate?
that it is not sensitive enough to node purity: it depends on the class proportions only through the maximum proportion, so nodes with quite different class distributions can share the same error rate
which two measures capture node impurity more effectively?
gini index and entropy
what is the gini index? what is the formula for it?
the gini index is a measure of variance across the K classes of the categorical target variable.
high degree of purity = lower gini index (because more variability = less purity)
write out the formula
G[m] = sum over k = 1 to K of p[m,k] * (1 - p[m,k])  (equivalently, 1 - sum of p[m,k]^2)
when does gini index achieve its minimum/maximum value?
max value: when the observations are evenly spread across the K classes (each p[m,k] = 1/K)
min value (0): when p[m,k] = 1 for some k, i.e., the node is perfectly pure
what is the formula for entropy?
D[m] = - sum over k = 1 to K of p[m,k] * log(p[m,k]), with the convention 0 * log(0) = 0 (the log is often taken base 2; the base only rescales the measure)
the entropy increases with the degree of impurity in the node
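A quick check of the gini index and entropy formulas above (illustrative only; the helper names and proportions are made up), assuming NumPy:

```python
# Gini index and entropy for a node's class proportions p[m,k]:
# both are small for a nearly pure node and largest for an evenly mixed one.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # ~0.18, ~0.47 (nearly pure node)
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5, 1.0 (most impure node)
```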
draw the graph of all 3 impurity measures
(sketch: for a binary target, plot each measure against p, the proportion of observations in one of the two classes. all three curves equal 0 at p = 0 and p = 1 and peak at p = 0.5; the gini index and entropy are smooth, concave curves, while the classification error rate is a piecewise-linear "tent")
all three impurity measures behave similarly, when do all three achieve their max/min value?
max: p = 0.5 (when the node is most impure; here p is the proportion of observations in one class of a binary target)
min: p = 0 or 1 (when the node is perfectly pure)
what is recursive binary splitting? what is the first step?
- decide on an impurity measure
- now we can construct the first binary split with the goal of minimizing this impurity measure.
- after the first split, there will be 2 regions in the predictor space. then, we look for the best predictor and corresponding cutoff point for splitting one of the two regions. (could be the same predictor used in the first split)
- this continues recursively….
T/F: recursive binary splitting is a greedy algorithm.
true: in each step of the building process, we make the split that is currently the best, instead of looking ahead and selecting a split that results in a better tree in a future step
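A rough sketch of the greedy step just described: scan every predictor and every candidate cutoff and keep the single split that most reduces impurity (RSS in the regression case). This is not from the flashcards; the helper names (node_rss, best_split) and the toy data are assumptions, and real implementations are far more efficient.

```python
# Greedy search for one binary split of a regression node: try every predictor
# and cutoff, keep the pair minimizing the combined RSS of the two child nodes.
import numpy as np

def node_rss(y):
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    best = None  # (combined_rss, predictor_index, cutoff)
    n_obs, n_pred = X.shape
    for j in range(n_pred):
        for cutoff in np.unique(X[:, j]):
            left = X[:, j] <= cutoff
            right = ~left
            if left.all() or right.all():
                continue  # a split must create two non-empty regions
            rss = node_rss(y[left]) + node_rss(y[right])
            if best is None or rss < best[0]:
                best = (rss, j, cutoff)
    return best

X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 9.0], [4.0, 8.0]])
y = np.array([1.0, 1.2, 3.0, 3.1])
print(best_split(X, y))  # here: split on predictor 0 at cutoff 2.0 (ties possible)
```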
how do we control the complexity of a tree?
cost complexity pruning
what is cost complexity pruning?
decision trees can easily be made too complex and therefore overfit, capturing noise as well as signal in the training data
the idea behind cost complexity pruning is that we grow a very large and complex tree, and then prune it back to get a subtree. this algorithm is parameterized by a complexity parameter.
in cost complexity pruning, what is the penalized objective function?
relative training error + complexity parameter * # of terminal nodes
in cost complexity pruning, what is the relative training error in the penalized objective function?
it is the training error of the tree divided by the training error of the simplest tree (the root-only tree with no splits)
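A tiny numeric sketch of the relative training error idea for a regression tree (not from the flashcards; the target values and the tree's predictions below are made up):

```python
# Relative training error = RSS of the fitted tree / RSS of the root-only tree.
import numpy as np

y = np.array([1.0, 1.2, 3.0, 3.1])                 # training target
root_rss = np.sum((y - y.mean()) ** 2)             # simplest tree predicts the overall mean
preds = np.array([1.1, 1.1, 3.05, 3.05])           # suppose each terminal node predicts its own mean
tree_rss = np.sum((y - preds) ** 2)
print(tree_rss / root_rss)                         # relative training error, here ~0.007
```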
in cost complexity pruning, what is the penalty term?
complexity parameter * # of terminal nodes
this is in place to prevent overfitting. The role played by the complexity parameter is the same as the role played by lambda in regularization
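For context (an assumption, not part of the deck): scikit-learn exposes cost complexity pruning through the ccp_alpha parameter, whose objective is total terminal-node impurity + ccp_alpha * (# of terminal nodes); this parallels the relative-training-error form above, though the scaling differs. A minimal sketch with synthetic data:

```python
# Grow a large regression tree, then refit at increasing ccp_alpha values:
# a larger penalty per terminal node yields a smaller pruned subtree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = big_tree.cost_complexity_pruning_path(X, y)   # candidate alphas for pruning

for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(alpha, pruned.get_n_leaves())              # leaf count shrinks as alpha grows
```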
cost complexity pruning: what happens if the complexity parameter (cp) =0?
there is no price to pay for making the tree complex, so the fitted tree is identical to the large, overgrown tree with no pruning at all
cost complexity pruning: what happens as cp increases from 0 to 1?
the penalty term becomes greater and branches are snipped off the tree from the bottom up (the least beneficial splits are removed first) to form new, larger terminal nodes.
the squared bias of the tree predictions will increase, but the variance will drop
if cp = 1, the fitted tree consists of the root node only (no splits are made).