Decision Trees - Classification Trees Flashcards
Trees for Binary Target
If the ith observation is in node j, then the prediction for that observation, ŷ_i, equals the majority target class of all observations in node j. Similarly, the predicted probability that the ith observation is positive equals the proportion of positive observations in node j.
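A minimal sketch of this rule (illustrative Python, not from the flashcards), assuming binary targets coded 0/1; breaking ties toward class 1 is an arbitrary choice here:

```python
import numpy as np

def node_prediction(targets):
    """Return (majority class, predicted probability of class 1) for one node."""
    targets = np.asarray(targets)
    p1 = targets.mean()            # proportion of positive observations in the node
    y_hat = 1 if p1 >= 0.5 else 0  # majority target class (ties go to 1 here)
    return y_hat, p1

print(node_prediction([1, 1, 0, 1]))  # (1, 0.75)
```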
Instead of minimizing SSE (as a regression tree does), a classification tree minimizes an impurity measure, such as the Gini index or entropy.
Gini index: G = 1 - p_0^2 - p_1^2
Entropy: E = -p_0 log_2(p_0) - p_1 log_2(p_1)
– p_0 is the proportion of observations with a target of 0
– p_1 is the proportion of observations with a target of 1
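A quick sketch computing both impurity measures from the class proportions defined above; the function names are illustrative, and 0·log_2(0) is treated as 0 so a pure node has zero entropy:

```python
import math

def gini(p0, p1):
    return 1 - p0**2 - p1**2

def entropy(p0, p1):
    # Skip p = 0 terms so a pure node returns 0 rather than raising a math error.
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

print(gini(0.5, 0.5))     # 0.5, the maximum Gini index for two classes
print(entropy(0.5, 0.5))  # 1.0, the maximum entropy for two classes
print(gini(1.0, 0.0), entropy(1.0, 0.0))  # 0.0 0.0 for a pure node
```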
Minimizing an impurity measure is equivalent to maximizing the information gain in that measure. Information gain is calculated as the measure for the parent node minus the weighted average of the measure across the child nodes.
– Information gain = I_p - (n_L·I_L + n_R·I_R)/n_p, where I is the impurity measure (G or E), n is the number of observations in a node, and the subscripts p, L, and R denote the parent node, left child, and right child.
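A self-contained sketch of that calculation; the function names and toy split are illustrative, and the impurity function passed in can be the Gini index or entropy:

```python
def gini(p0, p1):
    return 1 - p0**2 - p1**2

def information_gain(impurity, parent, left, right):
    """Parent impurity minus the weighted average impurity of the child nodes."""
    def node_impurity(targets):
        p1 = sum(targets) / len(targets)
        return impurity(1 - p1, p1)
    n_L, n_R = len(left), len(right)
    weighted = (n_L * node_impurity(left) + n_R * node_impurity(right)) / (n_L + n_R)
    return node_impurity(parent) - weighted

parent = [0, 0, 1, 1]
print(information_gain(gini, parent, [0, 0], [1, 1]))  # 0.5: a perfect split
```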
Pruning
When pruning a classification tree, possible CV metrics include the classification error rate and accuracy (its complement).
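One common way to do this in practice, sketched with scikit-learn's cost-complexity pruning (the dataset here is just a stand-in; any binary-target data would do):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate complexity parameters come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Choose the alpha whose pruned tree maximizes CV accuracy
# (equivalently, minimizes the CV classification error rate).
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5, scoring="accuracy").mean()
          for a in path.ccp_alphas]
print(path.ccp_alphas[int(np.argmax(scores))])
```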
Classification trees and factor variables
Because day of month is coded as a categorical variable, it has 31 levels. This means the number of ways to split day of month into two groups is very large, making it likely that the tree will find spurious splits that happen to produce information gain on that particular training set.
Decision trees tend to create splits on categorical variables with many levels because it is easier to find a split with large information gain. However, splitting on these variables will likely lead to overfitting.
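The combinatorics behind "very large": a categorical variable with k levels can be partitioned into two non-empty groups in 2^(k-1) - 1 ways, which this short check (illustrative, not from the flashcards) confirms for day of month:

```python
# Number of ways to split k categorical levels into two non-empty groups.
for k in (2, 5, 31):
    print(k, 2**(k - 1) - 1)
# 2 1
# 5 15
# 31 1073741823  <- day of month: over a billion candidate splits
```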
Decision trees split the levels of a categorical variable into groups. The more levels the variable has, the more potential ways there are to split it into groups. The decision tree algorithm identifies which variables to split on, and into which groups, by maximizing information gain. Decision trees naturally allow for interactions between categorical variables through how the tree is built: a leaf node can lie below two or more ancestor nodes that each split on a categorical variable, and the path through those splits represents the interaction of those variables. The tree may also split on the same variable more than once.
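A small sketch of such an interaction, assuming scikit-learn: the target below is the XOR of two binary features, so neither feature is informative on its own, yet successive splits on both classify perfectly, and each leaf's path through both splits is the interaction:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 25  # two binary features, repeated
y = [a ^ b for a, b in X]                  # XOR target: a pure interaction

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # splits on both x1 and x2
```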