Decision Tree Flashcards
explain tree nodes
A job offer to be considered
begins at the root node, from which it is passed through decision nodes that require
choices to be made based on the attributes of the job. These choices split the data
across branches that indicate potential outcomes of a decision, depicted here as yes
or no outcomes, though in some cases there may be more than two possibilities.
When a final decision can be made, the tree is terminated by leaf nodes (also
known as terminal nodes) that denote the action to be taken as the result of the series
of decisions. In the case of a predictive model, the leaf nodes provide the expected
result given the series of events in the tree.
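As a rough sketch (the attribute names and thresholds below are hypothetical, not taken from the original example), the same structure can be written as nested if/else logic in R, where the first test plays the role of the root node, later tests act as decision nodes, and each returned value is a leaf node:

# Hypothetical decision tree for evaluating a job offer
evaluate_offer <- function(salary, commute_hours, offers_coffee) {
  if (salary < 50000) {           # root node: first attribute tested
    return("Decline offer")       # leaf node
  }
  if (commute_hours > 1) {        # decision node
    return("Decline offer")       # leaf node
  }
  if (offers_coffee) {            # decision node
    return("Accept offer")        # leaf node
  }
  return("Consider further")      # leaf node
}

evaluate_offer(salary = 60000, commute_hours = 0.5, offers_coffee = TRUE)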
Benefits of decision trees
A great benefit of decision tree algorithms is that the flowchart-like tree structure is not
necessarily exclusively for the learner’s internal use. After the model is created, many
decision tree algorithms output the resulting structure in a human-readable format.
This provides tremendous insight into how and why the model works or doesn't work.
In fact, decision trees are perhaps the single most widely used machine learning
technique, and can be applied to model almost any type of data, often with excellent
out-of-the-box performance.
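As a minimal sketch (assuming the C50 package is installed and using R's built-in iris data purely for illustration), this human-readable output can be seen by printing a model summary:

library(C50)

# Fit a C5.0 decision tree predicting species from the four measurements
model <- C5.0(Species ~ ., data = iris)

# summary() prints the tree as readable decisions, along with
# training error and attribute usage statistics
summary(model)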
Negatives of decision trees
It is worth noting some scenarios
where trees may not be an ideal fit. One such case might be a task where the data
has a large number of nominal features with many levels or it has a large number
of numeric features. These cases may result in a very large number of decisions and
an overly complex tree. They may also contribute to the tendency of decision trees
to overfit data, though as we will soon see, even this weakness can be overcome by
adjusting some simple parameters.
recursive partitioning
This approach
is also commonly known as divide and conquer because it splits the data into
subsets, which are then split repeatedly into even smaller subsets, and so on,
until the algorithm determines that the data within the subsets are sufficiently
homogeneous or another stopping criterion has been met.
How splitting a dataset creates a tree
To see how splitting a dataset can create a decision tree, imagine a bare root node
that will grow into a mature tree. At first, the root node represents the entire dataset,
since no splitting has transpired. Next, the decision tree algorithm must choose a
feature to split upon; ideally, it chooses the feature most predictive of the target class.
The examples are then partitioned into groups according to the distinct values of this
feature, and the first set of tree branches is formed.
Working down each branch, the algorithm continues to divide and conquer the data,
choosing the best candidate feature each time to create another decision node, until a
stopping criterion is reached. Divide and conquer might stop at a node when (as in the sketch after this list):
• All (or nearly all) of the examples at the node have the same class
• There are no remaining features to distinguish among the examples
• The tree has grown to a predefined size limit
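The following is a simplified sketch of divide and conquer, not the C5.0 algorithm itself; the function names, the tiny example dataset, and the depth limit are all illustrative assumptions. It chooses the feature with the highest information gain at each node and applies the three stopping rules listed above:

# Entropy of a vector of class labels
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log2(p))
}

# Toy recursive partitioning over nominal features
grow_tree <- function(data, target, features, depth = 0, max_depth = 3) {
  classes <- data[[target]]

  # Stop if: all examples share one class, no features remain,
  # or the predefined size (depth) limit is reached
  if (length(unique(classes)) == 1 ||
      length(features) == 0 ||
      depth >= max_depth) {
    return(list(leaf = TRUE, prediction = names(which.max(table(classes)))))
  }

  # Choose the feature whose split gives the largest reduction in entropy
  gain <- sapply(features, function(f) {
    parts <- split(classes, data[[f]])
    weighted <- sum(sapply(parts, function(p) length(p) / length(classes) * entropy(p)))
    entropy(classes) - weighted
  })
  best <- features[which.max(gain)]

  # Divide the data on the best feature, then conquer each branch recursively
  branches <- lapply(split(data, data[[best]]), function(subset) {
    grow_tree(subset, target, setdiff(features, best), depth + 1, max_depth)
  })
  list(leaf = FALSE, feature = best, branches = branches)
}

# Tiny made-up dataset to exercise the sketch
df <- data.frame(outlook = c("sunny", "sunny", "rain", "rain", "overcast"),
                 windy   = c("yes", "no", "yes", "no", "no"),
                 play    = c("no", "yes", "no", "yes", "yes"))
str(grow_tree(df, target = "play", features = c("outlook", "windy")))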
C5.0 Algorithm
Strengths
• An all-purpose classifier that does well on most problems
• Highly automatic learning process, which can handle numeric or nominal features, as well as missing data
• Excludes unimportant features
• Can be used on both small and large datasets
• Results in a model that can be interpreted without a mathematical background (for relatively small trees)
• More efficient than other complex models
Weaknesses
• Decision tree models are often biased toward splits on features having a large number of levels
• It is easy to overfit or underfit the model
• Can have trouble modeling some relationships due to reliance on axis-parallel splits
• Small changes in the training data can result in large changes to decision logic
• Large trees can be difficult to interpret and the decisions they make may seem counterintuitive
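As a rough usage sketch (assuming the C50 package, the built-in iris data, and an arbitrary train/test split chosen only for illustration), the classifier can be trained and evaluated in a few lines:

library(C50)

set.seed(123)  # arbitrary seed so the split is reproducible
train_rows <- sample(nrow(iris), 0.8 * nrow(iris))
iris_train <- iris[train_rows, ]
iris_test  <- iris[-train_rows, ]

# Train a C5.0 tree and classify the held-out examples
model <- C5.0(Species ~ ., data = iris_train)
predictions <- predict(model, iris_test)

# Confusion matrix and overall accuracy
table(predicted = predictions, actual = iris_test$Species)
mean(predictions == iris_test$Species)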
purity
The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure.
entropy
C5.0 uses entropy, a concept borrowed from
information theory that quantifies the randomness, or disorder, within a set of class
values. Sets with high entropy are very diverse and provide little information about
other items that may also belong in the set, as there is no apparent commonality.
The decision tree hopes to find splits that reduce entropy, ultimately increasing
homogeneity within the groups.
entropy measurement, bits
Typically, entropy is measured in bits. If there are only two possible classes, entropy
values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each
case, the minimum value indicates that the sample is completely homogenous, while
the maximum value indicates that the data are as diverse as possible, and no group
has even a small plurality.
In mathematical notation, entropy is specified as follows:
Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
In this formula, for a given segment of data (S), the term c refers to the number
of class levels and p_i refers to the proportion of values falling into class level i. For
example, suppose we have a partition of data with two classes: red (60 percent) and
white (40 percent). We can calculate the entropy as follows:
> -0.60 * log2(0.60) - 0.40 * log2(0.40)
[1] 0.9709506
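To generalize this calculation, a small helper (a sketch; the function name is arbitrary and not from the original text) computes entropy from any vector of class proportions:

# Entropy in bits given the proportion of examples in each class;
# the 0 * log2(0) case is treated as 0 by convention
entropy <- function(proportions) {
  -sum(ifelse(proportions > 0, proportions * log2(proportions), 0))
}

entropy(c(0.60, 0.40))  # two classes, 60/40 split: about 0.971 bits
entropy(c(0.50, 0.50))  # maximum entropy for two classes: 1 bit
entropy(c(1.00))        # completely pure segment: 0 bits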
What does C5.0 use?
C5.0 uses entropy, a concept borrowed from information theory that quantifies the
randomness, or disorder, within a set of class values. Typically, entropy is
measured in bits.
splitting on entropy in a decision tree
For a two-class outcome, entropy peaks at a 50-50 split of the classes; as one class
increasingly dominates the other, the entropy reduces to zero.
To use entropy to determine the optimal feature to split upon, the algorithm
calculates the change in homogeneity that would result from a split on each possible
feature, which is a measure known as information gain. The information gain for a
feature F is calculated as the difference between the entropy in the segment before
the split (S1) and the partitions resulting from the split (S2):
InfoGain(F) = Entropy(S1) - Entropy(S2)
information gain
One complication is that after a split, the data is divided into more than one partition.
Therefore, the function to calculate Entropy(S2) needs to consider the total entropy
across all of the partitions. It does this by weighing each partition’s entropy by the
proportion of records falling into the partition. This can be stated in a formula as:
Entropy(S) = \sum_{i=1}^{n} w_i \times Entropy(P_i)
In simple terms, the total entropy resulting from a split is the sum of the entropy
of each of the n partitions weighted by the proportion of examples falling in the
partition (w_i).
The higher the information gain, the better a feature is at creating homogeneous
groups after a split on this feature. If the information gain is zero, there is no
reduction in entropy for splitting on this feature. On the other hand, the maximum
information gain is equal to the entropy prior to the split. This would imply that
the entropy after the split is zero, which means that the split results in completely
homogeneous groups.
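As a small worked sketch (the class labels, the candidate feature, and the helper name are made up for illustration), information gain for one candidate split can be computed directly from these formulas:

# Entropy of a vector of class labels
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log2(p))
}

# Hypothetical segment: 60 percent red, 40 percent white,
# plus one candidate feature to split on
class_labels <- c("red", "red", "red", "white", "white")
feature      <- c("A",   "A",   "B",   "B",     "B")

entropy_before <- entropy(class_labels)                # Entropy(S1)

# Weighted entropy across the partitions created by the split
partitions    <- split(class_labels, feature)
weights       <- sapply(partitions, length) / length(class_labels)
entropy_after <- sum(weights * sapply(partitions, entropy))  # Entropy(S2)

info_gain <- entropy_before - entropy_after
info_gain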
pruning
A decision tree can continue to grow indefinitely, choosing splitting features and
dividing the data into smaller and smaller partitions until each example is perfectly
classified or the algorithm runs out of features to split on. However, if the tree grows
overly large, many of the decisions it makes will be overly specific and the model
will be overfitted to the training data. The process of pruning a decision tree involves
reducing its size such that it generalizes better to unseen data.
pre-pruning
One solution to this problem is to stop the tree from growing once it reaches a certain
number of decisions or when the decision nodes contain only a small number of
examples. This is called early stopping or pre-pruning the decision tree. As the tree
avoids doing needless work, this is an appealing strategy. However, one downside
to this approach is that there is no way to know whether the tree will miss subtle, but
important patterns that it would have learned had it grown to a larger size.
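In the C50 package, one hedged way to approximate pre-pruning is to raise the minCases setting of C5.0Control, which limits how small the groups created by a split may be (the value of 10 below is an arbitrary choice for illustration, and iris is used only as convenient example data):

library(C50)

# Larger minCases values prevent splits that would isolate only a
# handful of examples, so the tree stops growing earlier
ctrl  <- C5.0Control(minCases = 10)
model <- C5.0(Species ~ ., data = iris, control = ctrl)
summary(model)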
post-pruning
An alternative, called post-pruning, involves growing a tree that is intentionally
too large and pruning leaf nodes to reduce the size of the tree to a more appropriate
level. This is often a more effective approach than pre-pruning, because it is quite
difficult to determine the optimal depth of a decision tree without growing it first.
Pruning the tree later on allows the algorithm to be certain that all the important data
structures were discovered.
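In the C50 package, C5.0 applies its pruning after the tree is grown; as a hedged sketch, the confidence factor CF in C5.0Control influences how aggressively branches are pruned back (smaller values generally prune more; the 0.10 below is an arbitrary illustration, with iris again used only as example data):

library(C50)

# A lower confidence factor prunes more aggressively than the
# default of CF = 0.25, typically yielding a smaller tree
ctrl  <- C5.0Control(CF = 0.10)
model <- C5.0(Species ~ ., data = iris, control = ctrl)
summary(model)  # compare the tree size with the default settings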