L4 - Decision Tree Flashcards
What are the learning objectives for classification using Decision Trees?
Describe the concept of a decision tree
Explain how to build a decision tree
Analyse the strengths and weaknesses of decision trees
Apply decision trees
What is the lesson outline?
What is a decision tree
Building/growing a tree (e.g. information gain and entropy)
Pruning the tree
Improving the tree (e.g. ensemble method: AdaBoost)
Define a decision tree?
A decision tree uses a tree structure to model the relationships among features and outcomes
Why use a decision tree over other classifiers?
(1) Classification mechanism needs to be transparent (e.g. credit scoring process)
We need to understand the classification rules, the decision process and the criteria that were used to reach that decision (e.g. to prevent bias and provide a transparent process)
(2) Results need to be shared with others for future business practice
Key idea for DT?
Divide and conquer
How to build a decision tree?
(1) Split the data into subsets (by feature)
(2) Split those subsets repeatedly into smaller subsets
(3) Repeat the process until the data within the subsets are sufficiently homogeneous, e.g. most samples at a node are of the same class, or until a predefined size limit is reached (see the sketch below)
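A minimal Python sketch of this divide-and-conquer loop (the purity threshold, depth limit and example data are illustrative assumptions, not from the lesson; a real learner would choose each split by information gain, covered below):

from collections import Counter

def majority_fraction(labels):
    # Fraction of samples at this node belonging to the most common class
    return Counter(labels).most_common(1)[0][1] / len(labels)

def build_tree(rows, labels, features, depth=0, max_depth=3, purity=0.9):
    # Stop when the node is sufficiently homogeneous, no features remain,
    # or the predefined size (depth) limit is reached
    if majority_fraction(labels) >= purity or not features or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: the majority class
    feature = features[0]  # naive choice; real trees pick the best split
    tree = {}
    for value in set(row[feature] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[feature] == value]
        tree[(feature, value)] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep],
            features[1:], depth + 1, max_depth, purity)
    return tree

rows = [{"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "sunny"}]
labels = ["no", "yes", "no"]
print(build_tree(rows, labels, ["outlook"]))
# e.g. {('outlook', 'sunny'): 'no', ('outlook', 'rain'): 'yes'}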
What does sufficiently homogeneous mean for DT?
(1) Most samples at the node have the same class
(2) There are no remaining features to distinguish among samples
(3) The predefined size limit is reached
What is entropy?
Measurement of uncertainty in a data set
Give the equation for entropy? Explain the components?
Entropy(S) = - Σ (i = 1 to C) p_i * log2(p_i)
S = the data set
C = the number of classes
p_i = the proportion of samples in class i
Each class contributes its own -p_i * log2(p_i) term, so a data set with more classes has more terms in the sum.
Explain entropy with the following information: a data set S has two classes, red (60%) and blue (40%)?
Entropy(S) = -0.6 * log2(0.6) - 0.4 * log2(0.4) = 0.97
The 0.6 term is the proportion of the red class and the 0.4 term is the proportion of the blue class
Steps - (a) split on the feature with the largest information gain (b) measure the entropy in each resulting subset
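A short Python check of this computation, plus a sketch of the information gain used in step (a) (the perfect split shown is hypothetical, chosen for illustration):

import math

def entropy(labels):
    # Entropy(S) = - sum over classes of p_i * log2(p_i)
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

S = ["red"] * 6 + ["blue"] * 4       # 60% red, 40% blue
print(round(entropy(S), 2))          # 0.97, matching the worked example

def information_gain(parent, subsets):
    # Gain = Entropy(parent) - weighted average entropy of the subsets
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# A split that separates the classes perfectly recovers all 0.97 bits
print(round(information_gain(S, [["red"] * 6, ["blue"] * 4]), 2))  # 0.97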
Interpret an entropy value?
Measure of uncertainty between 0 and 1
0 - means no uncertainty (complete certainty)
1 - means complete uncertainty
Explain entropy for 2 classes or n classes?
For two classes, entropy ranges from 0 to 1
For n classes, entropy ranges from 0 to log2(n)
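A quick, self-contained Python check of both ranges (the class labels here are arbitrary):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "b"]))            # 1.0: two balanced classes
print(entropy(["a", "b", "c", "d"]))  # 2.0: four balanced classes, log2(4) = 2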
How does entropy link to homogeneity?
0 entropy means completely homogenous
1 entropy means complete heterogeneity (completely diverse)
Explain graphically how entropy relates to the proportion of class X in a two-class (X, Y) sample?
Entropy (y-axis) - the level of uncertainty (heterogeneity)
X (x-axis) - the proportion of class X in the sample
If X is 100% of the sample then entropy is 0 because there is complete certainty
If X is 0% then entropy is 0 because there is complete certainty
If X is 50% in the sample then entropy is 1 because this is the largest possible amount of uncertainty that could exist in the data set
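A small Python sketch tracing that curve for a two-class sample (the grid of proportions is arbitrary):

import math

def two_class_entropy(p):
    # Entropy of a two-class set where class X has proportion p
    if p in (0.0, 1.0):
        return 0.0  # complete certainty at the extremes
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(p, round(two_class_entropy(p), 2))
# prints 0.0, 0.81, 1.0, 0.81, 0.0:
# entropy is 0 at the extremes and peaks at 1 when the classes are 50/50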
How to measure homogeneity?
Use entropy as a measure of uncertainty
Uncertainty = heterogeneity
Certainty = homogeneity