Lecture 9 - Decision Trees Flashcards
1
Q
Decision Tree
A
- Data-driven method
- Popular classification technique
Reasons
- Performs well across a wide range of situations
- Does not require much effort from the analyst
- Easily understandable by consumers
- At least when the trees are not too large
- Can be used for both:
- Classification, called classification trees
- Prediction, called regression trees
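A minimal sketch of both uses with scikit-learn (the library, toy datasets, and parameters are illustrative choices, not part of the lecture):

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: predicts a class label
X_cls, y_cls = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)
print(clf.predict(X_cls[:1]))   # a class index

# Regression tree: predicts a numeric value
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
print(reg.predict(X_reg[:1]))   # a real-valued prediction
```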
2
Q
Example
A
3
Q
Nodes
A
- Conditions in the nodes give the splitting value on a predictor
- The number inside the node gives the number of records after the split
- The bracket provides the number of records per class: [not acceptor, acceptor]
- The leaf nodes, called terminal nodes, are colour-coded to indicate non-acceptor (orange) or acceptor (blue)
4
Q
Trees are easily translated into a set of rules
A
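One way to see this in practice (a sketch, assuming scikit-learn; the dataset and depth are arbitrary): export_text prints a fitted tree as nested if/then conditions, and each root-to-leaf path reads as one rule.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Every root-to-leaf path in the printout is one IF ... THEN ... rule
print(export_text(tree, feature_names=list(iris.feature_names)))
```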
5
Q
Induction (with a Greedy Strategy)
A
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training instances are at the root
- The instances are then partitioned recursively based on selected attributes (a sketch of this recursion follows below)
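A bare-bones sketch of that recursion (illustrative Python, not the lecture's code), assuming each record is a dict with a "class" key and using Gini impurity for the greedy choice:

```python
from collections import Counter, defaultdict

def gini(records):
    # Gini impurity of a set of records (each record is a dict with a "class" key)
    counts = Counter(r["class"] for r in records)
    n = len(records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def partition(records, attr):
    # Group records by their value of `attr` (a multi-way split)
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return groups

def grow_tree(records, attributes):
    labels = {r["class"] for r in records}
    # Stop: the node is pure, or there is nothing left to split on
    if len(labels) == 1 or not attributes:
        return Counter(r["class"] for r in records).most_common(1)[0][0]

    # Greedy step: pick the attribute whose split gives the lowest
    # weighted child impurity right now (no look-ahead, no backtracking)
    def weighted_impurity(attr):
        groups = partition(records, attr)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    best = min(attributes, key=weighted_impurity)
    rest = [a for a in attributes if a != best]
    return {"split_on": best,
            "children": {v: grow_tree(subset, rest)
                         for v, subset in partition(records, best).items()}}

data = [
    {"outlook": "sunny", "windy": "no",  "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "no",  "class": "play"},
]
print(grow_tree(data, ["outlook", "windy"]))
```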
6
Q
Issues with Induction (with a Greedy Strategy)
A
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
Specifying Test Condition
- Depends on attribute type:
- Nominal
- Ordinal
- Continuous
- Depends on number of ways to split:
- Binary split, i.e., 2-way
- Multi-way split
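A small illustration of the two split shapes for a nominal attribute (the attribute name and values are made up):

```python
# Multi-way split: one branch per distinct value of the attribute
multi_way = {"single": "branch_1", "married": "branch_2", "divorced": "branch_3"}

# Binary (2-way) split: the values are grouped into two subsets
binary = ({"single", "divorced"}, {"married"})

record = {"marital_status": "divorced"}
print(multi_way[record["marital_status"]])                # branch_3
print(0 if record["marital_status"] in binary[0] else 1)  # 0 (left branch)
```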
7
Q
Splitting based on nominal attributes
A
8
Q
Splitting based on Continuous Attributes
A
Discretization vs. binary split
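A minimal sketch of the binary option, A < v versus A >= v (the data and helper names are made up): sort the values, try midpoints between consecutive values as candidate thresholds, and keep the one with the lowest weighted impurity.

```python
def gini_of(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    # Choose the split point v minimising the weighted Gini of {x < v} vs {x >= v}
    pairs = sorted(zip(values, labels))
    best_v, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        v = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate midpoint
        left = [c for x, c in pairs if x < v]
        right = [c for x, c in pairs if x >= v]
        if not left or not right:
            continue
        score = (len(left) * gini_of(left) + len(right) * gini_of(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v

# Toy example: annual income (in thousands) vs. loan acceptance
print(best_threshold([20, 35, 50, 90, 120], ["no", "no", "yes", "yes", "yes"]))  # 42.5
```

Discretization instead buckets the attribute into ranges once and then treats the ranges as ordinal values.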
9
Q
Determining the Best Split
A
10
Q
Information gain
A
- Used to determine which feature/attribute provides the maximum information about a class
- Split records based on an attribute test that optimises a certain criterion
- Need a measure of node impurity, e.g., Gini Index, Entropy, etc. (a sketch follows below)
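As a sketch of the criterion itself (standard definitions; the toy labels are made up): the gain of a split is the parent node's impurity minus the weighted impurity of its children, here measured with entropy.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    n = len(parent_labels)
    weighted_children = sum(len(child) / n * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - weighted_children

# A split that sends 4 records left and 4 records right
parent = ["yes"] * 4 + ["no"] * 4
left, right = ["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]
print(information_gain(parent, [left, right]))  # about 0.19 bits: the split adds information
```

Swapping entropy for the Gini index in the same formula gives the impurity-decrease criterion used with Gini-based trees.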
11
Q
Information gain (visual)
A
12
Q
Gini Index
A
13
Q
Entropy measure
A
14
Q
Combined impurity
A
15
Q
Categorical Attributes
A
16
Q
Stopping Criteria for Tree Induction
A
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
17
Q
How to Address Overfitting
A
Pre-Pruning
- Stop the algorithm before the tree is fully grown
- Typical stopping conditions for a node:
- Stop if all instances belong to the same class
- Stop if all attribute values are the same
- More restrictive conditions:
- Stop if number of instances is less than some user-specified threshold
- Stop if expanding the current node does not improve impurity measures, e.g., Gini or information gain
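In practice these conditions are usually exposed as hyperparameters of the induction algorithm; as a sketch, scikit-learn's DecisionTreeClassifier has thresholds matching the conditions above (the specific values here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument is a pre-pruning rule: stop splitting a node if ...
clf = DecisionTreeClassifier(
    max_depth=4,                 # ... the tree is already this deep
    min_samples_split=20,        # ... the node holds fewer instances than this
    min_impurity_decrease=0.01,  # ... the best split barely reduces impurity
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
```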
18
Q
How to Address Overfitting
A
Post-pruning
- Grow the decision tree in its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If generalisation error improves after trimming, replace sub-tree by a leaf node
- Class label of leaf node is determined from majority class of instances in the sub-tree
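scikit-learn does not expose this exact bottom-up, generalisation-error scheme; the closest built-in post-pruning is minimal cost-complexity pruning, sketched here as a substitute: grow the full tree, compute candidate pruning strengths, and keep the pruned tree that scores best on held-out data (the dataset and random_state are arbitrary).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

# Grow the decision tree to its entirety, then list candidate pruning strengths
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Keep the pruned tree that generalises best on the held-out split
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))
```

The trimming decision here is driven by a single complexity penalty rather than per-sub-tree error checks, but the effect is the same: sub-trees that do not improve generalisation are replaced by leaves.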
19
Q
Pros and cons of decision trees
A
Advantages:
- Easy to understand (domain experts love them)
- Easy to generate rules
Disadvantages:
- May suffer from overfitting
- Classifies by rectangular partitioning (so does not handle correlated features very well)
- Can be quite large - pruning is necessary
- Does not handle streaming data easily
- … but a few successful ideas/techniques exist