Decision Trees and ID3 Algorithm Flashcards

1
Q

Decision Tree Learning

A

A practical inductive inference (supervised learning) method.
Finds a Boolean function of the attributes
Decision trees can be extended to functions with more than two output values
Widely used and robust to noise
Can handle OR and AND expressions
Completely expressive hypothesis space
Easily interpretable.

2
Q

Object Attribute Value (OAV) Data

A

Objects have attributes, and attributes have values. For example: a car has a door, and the door can be red, etc.
Usually stored as labelled CSV, JSON, or XML
One of the attributes, usually the first, is the Object ID
Attributes can be numeric, categorical, or Boolean
Many Kaggle datasets are OAV data in CSV or JSON format.

3
Q

Concept

A

A concept is a subset of all possible OAV tuples of an object. The number of possible tuples is the number of attribute-value combinations. A class label is used to define the concept - e.g. which cars are desirable and which are not.

4
Q

One-Hot Encoding

A

A scheme in which categorical variables are encoded as sparse Boolean vectors. This helps ML algorithms because each encoding is a high-dimensional vector rather than a scalar or a symbol.
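A minimal Python sketch of the idea; the colour attribute and its values are made up for illustration, not taken from the cards:

# One-hot encoding: each categorical value becomes a sparse Boolean vector.
colours = ["red", "green", "blue"]                      # illustrative categorical values
index = {value: i for i, value in enumerate(colours)}
def one_hot(value):
    vector = [0] * len(colours)
    vector[index[value]] = 1
    return vector
print(one_hot("green"))   # [0, 1, 0]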

5
Q

Classification

A

The process of adding a new attribute, the class label, to OAV data. We start with a labelled OAV dataset, which has a class label column, and find a model for the class attribute as a function of the values of the other attributes. Using the learned model, previously unseen records should be assigned to classes as accurately as possible.

6
Q

What is a Decision Tree

A

A predictive model in the form of an upside-down tree. Each node (except the leaves) represents an attribute of a given object. Each edge represents a value of the parent attribute. Each leaf is a class label. Boolean attribute values are not usually shown: left means yes and right means no.
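One way to picture this structure is as a nested dictionary. A minimal Python sketch, assuming made-up weather attributes and Yes/No labels (illustrative only, not from the cards):

# Each internal node maps an attribute to {value: subtree}; each leaf is a class label.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}
def classify(tree, example):
    """Walk from the root to a leaf, following the example's attribute values."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                 # the node's attribute
        tree = tree[attribute][example[attribute]]   # follow the matching edge
    return tree                                      # leaf = class label
print(classify(tree, {"Outlook": "Rain", "Wind": "Weak"}))   # Yes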

7
Q

The tree itself forms a hypothesis:

A

Disjunctions (ORs) of conjunctions (ANDs)
Each path from the root to a leaf forms a conjunction of constraints on attributes (i.e. a rule)
Separate branches are disjunctions (illustrated below)
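Continuing the made-up weather tree sketched above, one way to write its hypothesis as ORs of ANDs, one conjunction per positive root-to-leaf path:

# Disjunction of conjunctions corresponding to the sketch tree (illustrative attributes).
def hypothesis(outlook, humidity, wind):
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))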

8
Q

Naive Decision Tree Construction

A

Given a labelled dataset, we must construct a decision tree that segments the dataset into subsets which are all labelled Yes or all labelled No. Each subset is associated with a rule - a path from the root to a leaf. Ideally we should create the shortest possible tree, since shorter trees are more general; however, constructing the shortest possible decision tree requires brute force.
For any given dataset there is a combinatorial number of decision trees that are consistent with the training set.

9
Q

Types of problems decision tree learning is good for:

A

Instances represented by OAV data
Attributes take on a small number of discrete values (as in Mitchell's book)
Can be extended to real-valued attributes
Target function has discrete output values
Assume Boolean functions
Can be extended to multiple output values

10
Q

Decision Tree Uses

A

Hypothesis space can include disjunctive expressions. In fact, the hypothesis space is the complete space of finite discrete-valued functions.
Robust to imperfect training data: classification errors, errors in attribute values, missing attribute values.
E.g. medical diagnosis

11
Q

Decision Tree Construction

A

Given a labelled OAV training set, how does one create a decision tree for the class label? If the training set is small, this can probably be done by visual inspection.
Decision trees are usually not unique: given a labelled OAV training set, there may be many trees that are consistent with the training examples.

12
Q

Which Decision Trees do we choose?

A

The set of all decision trees that are consistent with (fit) a labelled OAV training set is called the hypothesis space. Occam's razor says we should choose the simplest one. It has been shown that the shortest decision tree is the best (it is more general).

13
Q

How do we find the shortest decision tree?

A

Not by inspection, unless the tree is very small. Use the ID3 algorithm (Iterative Dichotomiser 3). It finds a short tree, but not necessarily the shortest, since that would require an exhaustive brute-force search.

14
Q

ID3 Algorithm Definition

A

A top-down, greedy search through the space of possible decision trees. Decision trees represent hypotheses, so this is a search through hypothesis space. As you proceed down the tree, choose an attribute for each successive node. No backtracking. Goes from root to leaves.

15
Q

What is Greedy Search?

A

At each step, make the decision that gives the greatest improvement in whatever you are trying to optimize. Don't backtrack unless you hit a dead end. Not likely to find the global optimum, but generally works well.
At each node, decide which attribute best classifies the training data at that point. Do this for each branch of the tree; the end result is a tree structure representing the hypothesis that works best for the training data.

16
Q

ID3 Algorithm (How it works)

A

Finds a shallow tree consistent with the training data. Uses two important concepts:
Entropy
Information Gain
Both are used to determine which attribute is the best class discriminator. Choosing the best discriminator makes the tree shorter, which generalizes better.
ID3 has better time complexity, but C4.5 is better at finding a shorter tree.
Entropy underlies information gain, which determines which attribute best classifies the data.

17
Q

Information Gain

A

Statistical quantity measuring how well an attribute classifies the data. Calculate the information gain for each attribute. Choose attribute with greatest information gain.

18
Q

Entropy

A

Entropy measures information content of a random process:
Takes on largest value when events are equi-probable.
Takes on smallest value when only one event has a non-zero probability.

19
Q

Computing Entropy

A

H(S) = -p(+) log2(p(+)) - p(-) log2(p(-))
Example: to find the entropy of a class of 50 students, 30 girls and 20 boys, let girls be positive and boys be negative, then substitute the corresponding proportions, p(+) = 30/50 and p(-) = 20/50, into the equation (worked out below).
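A worked Python sketch of the card's 30/20 example:

import math
# 50 students: 30 girls (positive), 20 boys (negative).
p_pos, p_neg = 30 / 50, 20 / 50
H = -(p_pos * math.log2(p_pos) + p_neg * math.log2(p_neg))
print(round(H, 3))   # 0.971 bits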

20
Q

Computing Information Gain

A

Gain(S, A) = H(S) - Σv P(A = v) H(Sv), where the sum runs over the values v of attribute A

The attribute with the highest information gain is usually chosen as the splitting attribute at each node in the decision tree. The goal is to maximize information gain at each decision node, thereby incrementally reducing uncertainty or entropy and making the data subsets at the leaves of the tree as pure as possible.
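A rough Python sketch of the gain computation; the dict-of-attributes representation, the helper names, and the tiny Humidity example are assumptions for illustration, not a standard API:

from collections import Counter
import math
def entropy(labels):
    """H(S) = -sum of p_i * log2(p_i) over the class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())
def information_gain(examples, labels, attribute):
    """Gain(S, A) = H(S) - sum over values v of P(A = v) * H(Sv)."""
    total = len(examples)
    remainder = 0.0
    for value in set(example[attribute] for example in examples):
        subset = [label for example, label in zip(examples, labels)
                  if example[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder
examples = [{"Humidity": "High"}, {"Humidity": "High"},
            {"Humidity": "Normal"}, {"Humidity": "Normal"}]
labels = ["No", "No", "Yes", "Yes"]
print(information_gain(examples, labels, "Humidity"))   # 1.0: a perfect split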

21
Q

ID3 for Boolean Valued Class Labels

A

Calculate the entropy once for the full training set, using its positive and negative subsets. Then determine which single attribute best classifies the training examples using information gain: for each attribute, compute the information gain, and use the attribute with the greatest information gain as the root. The function p() denotes a proportion.

22
Q

ID3 Choosing the Root Node

A

Loop over all attributes and compute the information gain for each one. Choose as the root the attribute with the greatest information gain.
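Root selection then reduces to an argmax over attributes. A sketch that assumes the information_gain helper from the Computing Information Gain card is in scope (choose_attribute is this sketch's own name):

def choose_attribute(examples, labels, attributes):
    """Return the attribute with the greatest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, labels, a))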

23
Q

ID3 Restrictions

A

Attributes are excluded from consideration if they already appear higher in the tree.
The process continues at each new leaf node until either every attribute has already been included along the path through the tree, or the training examples associated with the leaf all have the same target attribute value.

24
Q

ID3 Considerations

A

If there are contradictions in the data, we always take the majority vote. This handles noisy data. Attributes are eliminated once they are assigned to a node and are never reconsidered on the same root-to-leaf path.

25
Q

ID3 Gini Impurity

A

An alternative to information gain as a measure of the information-theoretic purity of a node. Defined as Gini(S) = 1 - Σi pi^2, where the sum runs over the set of classes and pi is the probability of choosing an item that belongs to class i.

Ex. If we have 100 elements and 3 classes, Class 1 with 40 elements and the other two with 30 each, first calculate the probability of each class, then compute the Gini impurity. For this dataset it is 0.66, which falls within the expected range of 0 to 1. This value indicates the degree of impurity or disorder present in the dataset. Much quicker to compute than information gain (see the sketch below).
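A Python sketch of the card's example; gini() is this sketch's own helper, not a library function:

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)
# 100 elements: Class 1 has 40, Classes 2 and 3 have 30 each.
print(gini([40, 30, 30]))   # ≈ 0.66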

26
Q

Information Gain Vs Gini Impurity

A

Bias: Information gain tends to favor attributes with more levels, i.e. a higher reduction in entropy. Gini impurity is less susceptible to this bias.

Tree Structure: Trees grown using information gain tend to be deeper and more balanced, while trees grown using Gini can be shallower and more skewed.

Performance: It is generally advisable to experiment with both to see which yields better results for the specific problem.

Suitability: Gini is often used for classification problems; information gain is used for both classification and regression, as it has roots in information theory.

27
Q

ID3 Summary

A

Once the attribute is selected for the current node, generate child nodes, one for each possible value of the selected attribute.
Partition the examples at this node using the possible values of this attribute and assign these subsets of examples to the appropriate child nodes.
Repeat for each child node until all examples associated with a node are either all positive or all negative (a recursive sketch follows below).
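A compact recursive sketch of this loop, assuming the information_gain helper from the Computing Information Gain card is in scope; the id3 function name and dict representation are this sketch's own:

from collections import Counter
def id3(examples, labels, attributes):
    """Return a class label (leaf) or a nested {attribute: {value: subtree}} dict."""
    if len(set(labels)) == 1:                  # all positive or all negative
        return labels[0]
    if not attributes:                         # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    node = {best: {}}
    for value in set(example[best] for example in examples):
        pairs = [(e, l) for e, l in zip(examples, labels) if e[best] == value]
        sub_examples = [e for e, _ in pairs]
        sub_labels = [l for _, l in pairs]
        remaining = [a for a in attributes if a != best]
        node[best][value] = id3(sub_examples, sub_labels, remaining)
    return node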

28
Q

ID3 - Choosing the best attribute

A

Some possibilities:
- Random: Select any attribute at random
- Least-values: Choose the attribute with the smallest number of possible values
- Most-values: Choose the attribute with largest number of possible values
- Max-Gain: Choose the attribute that has the largest expected information gain, i.e. select the attribute that will result in the smallest expected size of the subtrees rooted at its children
- Min-Gini: Choose the attribute that has the smallest Gini Impurity value.
ID3 uses max-gain to select the best attribute at each node of the decision tree under construction.

29
Q

Inductive Bias of ID3

A

Every learning algorithm has an inductive bias (Restriction vs Preference).

ID3:
Searches complete hypothesis space
Performs an incomplete search through this space looking for simplest tree
Called a preference (or search) bias

Candidate-Elimination:
Searches an incomplete hypothesis space
Performs a complete search finding all valid hypotheses
Called a restriction (or language) bias

Typically a preference bias is better since you don’t limit your search up front by restricting the hypothesis space.

30
Q

How well do decision trees work?

A

Decision trees are at least as accurate as human experts (studies have shown).
In a study on diagnosing breast cancer:
Humans correctly classified the cases 65% of the time
The decision tree classified them correctly 72% of the time
British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms
It replaced an earlier rule-based expert system

31
Q

Noisy Data and Overfitting

A

Many kinds of noise can occur in training examples:

Two examples have the same attribute/value pairs but different classifications
Some attribute values are incorrect because of:
- Errors in the data acquisition process
- Errors in the preprocessing phase

The classification is wrong because of some error or mistake.
Some attributes may be irrelevant to the decision-making process, e.g. eye colour may be irrelevant to prognosis. Irrelevant attributes can result in overfitting the training data.

32
Q

C4.5 Decision Tree Algorithm

A

C4.5 is an algorithm used to generate decision trees, which can be employed for classification tasks.

Handling Missing Values: C4.5 can handle missing attribute values
Continuous Attributes: can work with categorical and continuous attributes

Pruning: To avoid overfitting, C4.5 uses a pruning strategy that replaces sub-trees with leaf nodes if the pruning results in minimal loss of accuracy

Rule Derivation: Post tree generation, it can convert the decision tree to a set of if-then rules to improve readability.

Attribute Selection: It employs a normalized version of information gain known as Gain Ratio to avoid bias toward attributes with a large number of distinct values.
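A rough sketch of the gain ratio using its standard definition (not quoted from the card), again assuming the information_gain helper from the Computing Information Gain card is in scope:

from collections import Counter
import math
def split_info(examples, attribute):
    """SplitInfo(S, A) = -sum over values v of |Sv|/|S| * log2(|Sv|/|S|)."""
    total = len(examples)
    counts = Counter(example[attribute] for example in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
def gain_ratio(examples, labels, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A); undefined when SplitInfo is 0."""
    return information_gain(examples, labels, attribute) / split_info(examples, attribute)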

33
Q

ID3 Tools Used

A

RapidMiner
DecisionTree (python package)

34
Q

Decision Tree Learning

A

Inducing decision trees is one of the most widely used learning methods.
Can outperform human experts in many problems

35
Q

Decision Tree Learning Strengths

A

Fast
Simple to implement
Can convert results to a set of easily interpretable rules
Empirically valid in many commercial products
Handles noisy data

36
Q

Decision Tree Learning Weaknesses

A

Univariate splits: partitioning uses only one attribute at a time, which limits the types of possible trees
Large decision trees may be hard to understand
Requires fixed-length feature vectors

37
Q

Summary of ID3 Inductive Bias

A

Shorter trees are preferred over longer trees. ID3 accepts the first tree it finds. The information gain heuristic places high-information-gain attributes near the root, and the greedy search method is an approximation to finding the shortest (shallowest) tree. Shorter trees are preferred due to Occam's razor.