Decision Trees and ID3 Algorithm Flashcards
Decision Tree Learning
A practical inductive inference (supervised learning) method.
Finds a Boolean function of the attributes.
Decision trees can be extended to functions with more than two output values.
Widely used. Robust to noise.
Can handle OR and AND expressions.
Completely expressive hypothesis space.
Easily interpretable.
Object Attribute Value (OAV) Data
Objects have attributes, and attributes have values. For example: a car has a door, and the door can be red, etc.
Usually stored as a labelled CSV, JSON, or XML file.
One of the attributes, usually the first, is the object ID.
Attributes can be numeric, categorical, or Boolean.
Many Kaggle datasets are OAV data in CSV or JSON format.
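As a minimal sketch, a tiny hypothetical OAV table in CSV form can be read with Python's standard csv module (the column names car_id, doors, colour, engine, desirable are made-up examples):

```python
import csv
import io

# Hypothetical OAV data: the first column is the object ID,
# the remaining columns are attributes and the last is the class label.
raw = io.StringIO(
    "car_id,doors,colour,engine,desirable\n"
    "c1,2,red,petrol,yes\n"
    "c2,4,blue,diesel,no\n"
    "c3,4,red,petrol,yes\n"
)

for row in csv.DictReader(raw):
    print(row["car_id"], row["doors"], row["colour"], row["engine"], row["desirable"])
```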
Concept
A concept is a subset of all possible OAV tuples for an object. The number of possible tuples is the number of all attribute-value combinations. A class label is used to define the concept, e.g. which cars are desirable and which are not.
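As a quick illustration with made-up attribute domains, the number of possible tuples is the product of the domain sizes, and a concept is any subset of those tuples:

```python
from math import prod

# Hypothetical attribute domains for a "car" object.
domains = {
    "doors": [2, 4],
    "colour": ["red", "blue", "green"],
    "engine": ["petrol", "diesel"],
}

# Every possible OAV tuple is one combination of attribute values.
n_tuples = prod(len(values) for values in domains.values())
print(n_tuples)  # 2 * 3 * 2 = 12 possible tuples; a concept is any subset of them
```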
One-Hot Encoding
A scheme where categorical variables are encoded as sparse Boolean vectors, with a single 1 marking the active category. This helps ML algorithms that expect numeric input, because each category becomes a high-dimensional vector rather than an arbitrary scalar or symbol, so no artificial ordering is imposed on the categories.
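A minimal sketch of one-hot encoding in plain Python (the categories and values are made-up examples):

```python
# Hypothetical categorical attribute with a fixed category order.
categories = ["red", "blue", "green"]
values = ["blue", "red", "blue"]

def one_hot(value, categories):
    """Return a Boolean vector with a single 1 at the matching category's position."""
    return [1 if value == category else 0 for category in categories]

encoded = [one_hot(v, categories) for v in values]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
```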
Classification
The process of adding a new attribute, the class label, to the OAV data. We start with a labelled OAV dataset, which has a class label column, and find a model for the class attribute as a function of the values of the other attributes. Using the learned model, previously unseen records should be assigned to classes as accurately as possible.
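As a sketch of the fit/predict pattern, assuming scikit-learn is available and using a tiny made-up dataset of one-hot-encoded attributes:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each row holds the encoded attribute values
# of one object; y holds the class labels.
X = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
y = ["yes", "no", "yes", "no"]

model = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits
model.fit(X, y)

# Assign a previously unseen record to a class.
print(model.predict([[1, 0, 0, 1]]))
```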
What is a Decision Tree
A predictive model in the form of an upside-down tree. Each node (except the leaves) represents an attribute of a given object. Each edge represents a value of the parent attribute. Each leaf is a class label. Boolean attribute values are not usually shown: left means yes and right means no.
The tree itself forms a hypothesis:
Disjunctions (ORs) of conjunctions (ANDs)
Each path from the root to a leaf forms a conjunction of constraints on attributes (aka a rule)
Separate branches are disjunctions
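A small sketch using a hand-written nested-dict tree for a made-up "desirable car" concept, showing how each root-to-leaf path reads as a conjunction (a rule) and how the paths for a class combine into a disjunction:

```python
# Hypothetical tree: internal nodes map an attribute to {value: subtree},
# leaves are class labels.
tree = {
    "engine": {
        "petrol": {"colour": {"red": "yes", "blue": "no"}},
        "diesel": "no",
    }
}

def paths(node, conditions=()):
    """Yield (rule, label) pairs, one per root-to-leaf path."""
    if not isinstance(node, dict):              # leaf: class label
        yield " AND ".join(conditions), node
        return
    (attribute, branches), = node.items()
    for value, subtree in branches.items():
        yield from paths(subtree, conditions + (f"{attribute}={value}",))

rules = [f"({rule})" for rule, label in paths(tree) if label == "yes"]
print(" OR ".join(rules))  # (engine=petrol AND colour=red)
```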
Naive Decision Tree Construction
Given a labelled dataset, we must construct a decision tree that segments the dataset into subsets which are either all labelled Yes or all labelled No. Each subset is associated with a rule, i.e. a path from the root to a leaf. Ideally we create the shortest possible tree, since shorter trees are more general. Constructing the shortest possible decision tree requires brute-force search.
For any given dataset there are combinatorially many decision trees that are consistent with the training set.
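One way to get a feel for why brute force is infeasible: over n Boolean attributes there are 2^(2^n) distinct Boolean target functions, and each is representable by at least one decision tree, so the search space explodes quickly:

```python
# Number of distinct Boolean functions of n Boolean attributes is 2 ** (2 ** n);
# each one is representable by at least one decision tree.
for n in range(1, 6):
    print(n, 2 ** (2 ** n))
# 1 4
# 2 16
# 3 256
# 4 65536
# 5 4294967296
```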
Types of problems decision tree learning is good for:
Instances represented by OAV data
Attributes take on a small number of discrete values (As in Mitchell’s book)
Can be extended to real-valued attributes
Target function has discrete output values
Assume Boolean functions
Can be extended to use multiple output values
Decision Tree Uses
Hypothesis space can include disjunctive expressions. In fact, the hypothesis space is the complete space of finite discrete-valued functions.
Robust to imperfect training data: classification errors, errors in attribute values, missing attribute values.
E.g. medical diagnosis.
Decision Tree Construction
Given a labelled OAV training set, how does one create a decision tree for the class label? If the training set is small, this can probably be done by visual inspection.
Decision trees are usually not unique. Given a labelled OAV training set, there may be many different trees that are consistent with the training examples.
Which Decision Trees do we choose?
The set of all decision trees that are consistent with (fit) a labelled OAV training set is called the hypothesis space. Occam's razor says we should choose the simplest one: shorter decision trees tend to be more general.
How do we find the shortest decision tree?
Not by inspection, unless the tree is very small. Use the ID3 algorithm (Iterative Dichotomiser 3). It finds a short tree, but not necessarily the shortest, since that would require exhaustive brute-force search.
ID3 Algorithm Definition
A top-down, greedy search through the space of possible decision trees. Decision trees represent hypotheses, so this is a search through hypothesis space. As you proceed down, choose an attribute for each successive node. No backtracking. Proceeds from the root to the leaves.
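ID3 chooses the attribute to place at a node by information gain, the reduction in entropy of the class labels after splitting on that attribute. A self-contained sketch on a made-up sample:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one attribute."""
    n, remainder = len(labels), 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical training rows and class labels.
rows = [{"colour": "red", "doors": 2}, {"colour": "red", "doors": 4},
        {"colour": "blue", "doors": 2}, {"colour": "blue", "doors": 4}]
labels = ["yes", "yes", "no", "no"]

print(information_gain(rows, labels, "colour"))  # 1.0: colour splits perfectly
print(information_gain(rows, labels, "doors"))   # 0.0: doors tells us nothing
```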
What is Greedy Search?
At each step, make the decision which gives the greatest improvement in what you are trying to optimize. Don't backtrack unless you hit a dead end. Not likely to find the global optimum, but generally works well.
At each node, decide which attribute best classifies the training data that reaches that point. Do this for each branch of the tree; the end result is a tree structure representing the hypothesis which works best for the training data.
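Putting the pieces together, a compact ID3-style sketch (self-contained, so the entropy and gain helpers are repeated; the dataset is made up). At each node it greedily picks the attribute with the highest information gain and recurses on every branch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(rows, labels, attr):
    n, remainder = len(labels), 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                        # all examples agree: leaf
        return labels[0]
    if not attributes:                               # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))  # greedy step
    node = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx], remaining)
    return node

# Hypothetical training set.
rows = [{"colour": "red", "doors": 2}, {"colour": "red", "doors": 4},
        {"colour": "blue", "doors": 2}, {"colour": "blue", "doors": 4}]
labels = ["yes", "yes", "no", "no"]
print(id3(rows, labels, ["colour", "doors"]))
# {'colour': {'red': 'yes', 'blue': 'no'}} (branch order may vary)
```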