Decision Tree Flashcards
explain tree nodes
A job offer to be considered
begins at the root node, from which it is passed through decision nodes that require
choices to be made based on the attributes of the job. These choices split the data
across branches that indicate potential outcomes of a decision, depicted here as yes
or no outcomes, though in some cases there may be more than two possibilities.
When a final decision can be made, the tree is terminated by leaf nodes (also
known as terminal nodes) that denote the action to be taken as the result of the series
of decisions. In the case of a predictive model, the leaf nodes provide the expected
result given the series of events in the tree.
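As a rough sketch (the attribute names and thresholds below are hypothetical, not taken from the original example), the same structure can be written as nested if/else logic in R, where the first test plays the role of the root node, later tests act as decision nodes, and each returned value is a leaf node:

# Hypothetical decision tree for evaluating a job offer
evaluate_offer <- function(salary, commute_hours, offers_coffee) {
  if (salary < 50000) {           # root node: first attribute tested
    return("Decline offer")       # leaf node
  }
  if (commute_hours > 1) {        # decision node
    return("Decline offer")       # leaf node
  }
  if (offers_coffee) {            # decision node
    return("Accept offer")        # leaf node
  }
  return("Consider further")      # leaf node
}

evaluate_offer(salary = 60000, commute_hours = 0.5, offers_coffee = TRUE)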
Benefits of decision trees
A great benefit of decision tree algorithms is that the flowchart-like tree structure is not
necessarily exclusively for the learner’s internal use. After the model is created, many
decision tree algorithms output the resulting structure in a human-readable format.
This provides tremendous insight into how and why the model works or doesn't work.
In fact, decision trees are perhaps the single most widely used machine learning
technique, and can be applied to model almost any type of data, often with excellent
out-of-the-box performance.
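As a minimal sketch (assuming the C50 package is installed and using R's built-in iris data purely for illustration), this human-readable output can be seen by printing a model summary:

library(C50)

# Fit a C5.0 decision tree predicting species from the four measurements
model <- C5.0(Species ~ ., data = iris)

# summary() prints the tree as readable decisions, along with
# training error and attribute usage statistics
summary(model)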
Negatives of decision trees
It is worth noting some scenarios
where trees may not be an ideal fit. One such case might be a task where the data
has a large number of nominal features with many levels or it has a large number
of numeric features. These cases may result in a very large number of decisions and
an overly complex tree. They may also contribute to the tendency of decision trees
to overfit data, though as we will soon see, even this weakness can be overcome by
adjusting some simple parameters.
recursive partitioning
This approach
is also commonly known as divide and conquer because it splits the data into
subsets, which are then split repeatedly into even smaller subsets, and so on,
until the algorithm determines that the data within the subsets are sufficiently
homogeneous or another stopping criterion has been met.
How splitting a dataset creates a tree
To see how splitting a dataset can create a decision tree, imagine a bare root node
that will grow into a mature tree. At first, the root node represents the entire dataset,
since no splitting has transpired. Next, the decision tree algorithm must choose a
feature to split upon; ideally, it chooses the feature most predictive of the target class.
The examples are then partitioned into groups according to the distinct values of this
feature, and the first set of tree branches is formed.
Working down each branch, the algorithm continues to divide and conquer the data,
choosing the best candidate feature each time to create another decision node, until a
stopping criterion is reached. Divide and conquer might stop at a node when (as in the sketch after this list):
• All (or nearly all) of the examples at the node have the same class
• There are no remaining features to distinguish among the examples
• The tree has grown to a predefined size limit
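The following is a simplified sketch of divide and conquer, not the C5.0 algorithm itself; the function names, the tiny example dataset, and the depth limit are all illustrative assumptions. It chooses the feature with the highest information gain at each node and applies the three stopping rules listed above:

# Entropy of a vector of class labels
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log2(p))
}

# Toy recursive partitioning over nominal features
grow_tree <- function(data, target, features, depth = 0, max_depth = 3) {
  classes <- data[[target]]

  # Stop if: all examples share one class, no features remain,
  # or the predefined size (depth) limit is reached
  if (length(unique(classes)) == 1 ||
      length(features) == 0 ||
      depth >= max_depth) {
    return(list(leaf = TRUE, prediction = names(which.max(table(classes)))))
  }

  # Choose the feature whose split gives the largest reduction in entropy
  gain <- sapply(features, function(f) {
    parts <- split(classes, data[[f]])
    weighted <- sum(sapply(parts, function(p) length(p) / length(classes) * entropy(p)))
    entropy(classes) - weighted
  })
  best <- features[which.max(gain)]

  # Divide the data on the best feature, then conquer each branch recursively
  branches <- lapply(split(data, data[[best]]), function(subset) {
    grow_tree(subset, target, setdiff(features, best), depth + 1, max_depth)
  })
  list(leaf = FALSE, feature = best, branches = branches)
}

# Tiny made-up dataset to exercise the sketch
df <- data.frame(outlook = c("sunny", "sunny", "rain", "rain", "overcast"),
                 windy   = c("yes", "no", "yes", "no", "no"),
                 play    = c("no", "yes", "no", "yes", "yes"))
str(grow_tree(df, target = "play", features = c("outlook", "windy")))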
C5.0 Algorithm
Strengths
• An all-purpose classifier that does well on most problems
• Highly automatic learning process, which can handle numeric or nominal features, as well as missing data
• Excludes unimportant features
• Can be used on both small and large datasets
• Results in a model that can be interpreted without a mathematical background (for relatively small trees)
• More efficient than other complex models
Weaknesses
• Decision tree models are often biased toward splits on features having a large number of levels
• It is easy to overfit or underfit the model
• Can have trouble modeling some relationships due to reliance on axis-parallel splits
• Small changes in the training data can result in large changes to decision logic
• Large trees can be difficult to interpret and the decisions they make may seem counterintuitive
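As a rough usage sketch (assuming the C50 package, the built-in iris data, and an arbitrary train/test split chosen only for illustration), the classifier can be trained and evaluated in a few lines:

library(C50)

set.seed(123)  # arbitrary seed so the split is reproducible
train_rows <- sample(nrow(iris), 0.8 * nrow(iris))
iris_train <- iris[train_rows, ]
iris_test  <- iris[-train_rows, ]

# Train a C5.0 tree and classify the held-out examples
model <- C5.0(Species ~ ., data = iris_train)
predictions <- predict(model, iris_test)

# Confusion matrix and overall accuracy
table(predicted = predictions, actual = iris_test$Species)
mean(predictions == iris_test$Species)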
purity
The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure.
entropy
C5.0 uses entropy, a concept borrowed from
information theory that quantifies the randomness, or disorder, within a set of class
values. Sets with high entropy are very diverse and provide little information about
other items that may also belong in the set, as there is no apparent commonality.
The decision tree hopes to find splits that reduce entropy, ultimately increasing
homogeneity within the groups.
entropy measurement, bits
Typically, entropy is measured in bits. If there are only two possible classes, entropy
values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each
case, the minimum value indicates that the sample is completely homogenous, while
the maximum value indicates that the data are as diverse as possible, and no group
has even a small plurality.
In mathematical notation, entropy is specified as follows:
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
In this formula, for a given segment of data (S), the term c refers to the number
of class levels and p_i refers to the proportion of values falling into class level i. For
example, suppose we have a partition of data with two classes: red (60 percent) and
white (40 percent). We can calculate the entropy as follows:
> -0.60 * log2(0.60) - 0.40 * log2(0.40)
[1] 0.9709506
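To generalize this calculation, a small helper (a sketch; the function name is arbitrary and not from the original text) computes entropy from any vector of class proportions:

# Entropy in bits given the proportion of examples in each class;
# the 0 * log2(0) case is treated as 0 by convention
entropy <- function(proportions) {
  -sum(ifelse(proportions > 0, proportions * log2(proportions), 0))
}

entropy(c(0.60, 0.40))  # two classes, 60/40 split: about 0.971 bits
entropy(c(0.50, 0.50))  # maximum entropy for two classes: 1 bit
entropy(c(1.00))        # completely pure segment: 0 bits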
What does C5.0 use?
C5.0 uses entropy, a concept borrowed from information theory that quantifies the
randomness, or disorder, within a set of class values. Typically, entropy is
measured in bits.
splitting on entropy in a decision tree
For a two-class outcome, entropy peaks at a 50-50 split of the classes; as one class
increasingly dominates the other, the entropy reduces to zero.
To use entropy to determine the optimal feature to split upon, the algorithm
calculates the change in homogeneity that would result from a split on each possible
feature, which is a measure known as information gain. The information gain for a
feature F is calculated as the difference between the entropy in the segment before
the split (S1) and the partitions resulting from the split (S2):
InfoGain(F) = Entropy(S1) - Entropy(S2)
information gain
One complication is that after a split, the data is divided into more than one partition.
Therefore, the function to calculate Entropy(S2) needs to consider the total entropy
across all of the partitions. It does this by weighing each partition’s entropy by the
proportion of records falling into the partition. This can be stated in a formula as:
Entropy(S) = \sum_{i=1}^{n} w_i \times Entropy(P_i)
In simple terms, the total entropy resulting from a split is the sum of the entropy
of each of the n partitions weighted by the proportion of examples falling in the
partition (w_i).
The higher the information gain, the better a feature is at creating homogeneous
groups after a split on this feature. If the information gain is zero, there is no
reduction in entropy for splitting on this feature. On the other hand, the maximum
information gain is equal to the entropy prior to the split. This would imply that
the entropy after the split is zero, which means that the split results in completely
homogeneous groups.
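As a small worked sketch (the class labels, the candidate feature, and the helper name are made up for illustration), information gain for one candidate split can be computed directly from these formulas:

# Entropy of a vector of class labels
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log2(p))
}

# Hypothetical segment: 60 percent red, 40 percent white,
# plus one candidate feature to split on
class_labels <- c("red", "red", "red", "white", "white")
feature      <- c("A",   "A",   "B",   "B",     "B")

entropy_before <- entropy(class_labels)                # Entropy(S1)

# Weighted entropy across the partitions created by the split
partitions    <- split(class_labels, feature)
weights       <- sapply(partitions, length) / length(class_labels)
entropy_after <- sum(weights * sapply(partitions, entropy))  # Entropy(S2)

info_gain <- entropy_before - entropy_after
info_gain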
pruning
A decision tree can continue to grow indefinitely, choosing splitting features and
dividing the data into smaller and smaller partitions until each example is perfectly
classified or the algorithm runs out of features to split on. However, if the tree grows
overly large, many of the decisions it makes will be overly specific and the model
will be overfitted to the training data. The process of pruning a decision tree involves
reducing its size such that it generalizes better to unseen data.
pre-pruning
One solution to this problem is to stop the tree from growing once it reaches a certain
number of decisions or when the decision nodes contain only a small number of
examples. This is called early stopping or pre-pruning the decision tree. As the tree
avoids doing needless work, this is an appealing strategy. However, one downside
to this approach is that there is no way to know whether the tree will miss subtle, but
important patterns that it would have learned had it grown to a larger size.
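In the C50 package, one hedged way to approximate pre-pruning is to raise the minCases setting of C5.0Control, which limits how small the groups created by a split may be (the value of 10 below is an arbitrary choice for illustration, and iris is used only as convenient example data):

library(C50)

# Larger minCases values prevent splits that would isolate only a
# handful of examples, so the tree stops growing earlier
ctrl  <- C5.0Control(minCases = 10)
model <- C5.0(Species ~ ., data = iris, control = ctrl)
summary(model)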
post-pruning
An alternative, called post-pruning, involves growing a tree that is intentionally
too large and pruning leaf nodes to reduce the size of the tree to a more appropriate
level. This is often a more effective approach than pre-pruning, because it is quite
difficult to determine the optimal depth of a decision tree without growing it first.
Pruning the tree later on allows the algorithm to be certain that all the important data
structures were discovered.
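In the C50 package, C5.0 applies its pruning after the tree is grown; as a hedged sketch, the confidence factor CF in C5.0Control influences how aggressively branches are pruned back (smaller values generally prune more; the 0.10 below is an arbitrary illustration, with iris again used only as example data):

library(C50)

# A lower confidence factor prunes more aggressively than the
# default of CF = 0.25, typically yielding a smaller tree
ctrl  <- C5.0Control(CF = 0.10)
model <- C5.0(Species ~ ., data = iris, control = ctrl)
summary(model)  # compare the tree size with the default settings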