Data Mining - Lecture Decision Trees Flashcards
Why is the decision tree a popular classification technique?
- Performs well in a wide range of situations
- Does not require much effort from the analyst
- Easy to understand
In each node there are brackets. Based on the lecture, what do we assume is the order in the brackets?
[Non-acceptor, acceptor] -> meaning [Negative, Positive]
Unless it is stated otherwise.
Where do we look if we want to see what is incorrectly predicted, or the counts of TN/TP/FP/FN?
At the leaf nodes: look at their color and the values in their brackets.
Which two types of split do you have for nominal attributes?
- Multi-way split: you can split into as many categories as you want.
- Binary split: you split into two subsets. You might need to combine attribute values.
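A minimal Python sketch (my own made-up attribute values, not from the lecture) of the difference: a multi-way split keeps one branch per category, while a binary split has to group the categories into two subsets.

```python
# Hypothetical values of a nominal attribute (not from the lecture).
values = ["Single", "Married", "Divorced"]

# Multi-way split: one branch per category.
multi_way = [{"Single"}, {"Married"}, {"Divorced"}]

# Binary split: two subsets, so some categories must be combined.
binary = [{"Single"}, {"Married", "Divorced"}]

def branch_of(value, partition):
    """Return the index of the branch a record with this value falls into."""
    for i, subset in enumerate(partition):
        if value in subset:
            return i
    raise ValueError(f"{value!r} is not covered by the partition")

print(branch_of("Married", multi_way))  # branch 1
print(branch_of("Married", binary))     # branch 1, shared with "Divorced"
```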
Which two types of split do you have for continuous attributes?
- Discretization: basically a multi-way split, but each category is then a range of values.
- Binary decision: two subsets. You have to find the best cut among the possible splits.
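A minimal sketch (made-up values, not from the lecture) of how candidate cuts for a binary decision can be found: sort the values, take the midpoints between consecutive distinct values, and evaluate each candidate with an impurity measure such as the Gini index (see the later cards).

```python
# Hypothetical continuous attribute values (e.g. income in thousands).
values = [30, 45, 45, 60, 75, 90]

# Candidate cuts: midpoints between consecutive distinct values.
distinct = sorted(set(values))
candidate_cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
print(candidate_cuts)  # [37.5, 52.5, 67.5, 82.5]

# Each cut defines a binary split "value <= cut" vs "value > cut";
# the best cut is the one whose split has the lowest combined impurity.
```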
How do you determine the best split?
You look at the information gain per split. You compute a measure of impurity, which can be either the Gini index or entropy, and then you look at which split gives the highest information gain (the largest drop in impurity).
The lower the Gini, the higher the information gain, the better.
You use this when you are comparing splits!
How do you compute the Gini index?
If you have a split, you have two (or more) classes in each node.
For each class, you divide the #records in that class by the #records in that node. This way you have the proportion per class.
You square those proportions and subtract them all from 1. That is your Gini.
Example: Class 1 has 2 records and Class 2 has 4. Total in the node = 6.
1 - (2/6)^2 - (4/6)^2 = Gini ≈ 0.444
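A minimal Python sketch of this computation, using the class counts from the example above:

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of records per class."""
    total = sum(class_counts)
    proportions = [count / total for count in class_counts]
    return 1 - sum(p ** 2 for p in proportions)

# Class 1 has 2 records, Class 2 has 4 records, node total = 6.
print(gini([2, 4]))  # 1 - (2/6)**2 - (4/6)**2 = 0.444...
```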
What are the numbers in the nodes?
The main number is the total amount in the node.
The number in the brackets is the amount per class.
Remember to compute TP, TN etc. only with the leaf nodes.
What is the combined impurity?
You calculate the Gini index for each of the child nodes into which the node one layer above is split.
You then take a weighted average to get the combined Gini:
(#records node 1 / (#records node 1 + #records node 2)) * Gini node 1 + (#records node 2 / (#records node 1 + #records node 2)) * Gini node 2
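A minimal sketch of this weighted average (the class counts per child node are made up for illustration):

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of records per class."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

def combined_gini(children):
    """Weighted-average Gini of the child nodes produced by a split."""
    total = sum(sum(counts) for counts in children)
    return sum(sum(counts) / total * gini(counts) for counts in children)

# Example split: child node 1 holds class counts [2, 4], child node 2 holds [5, 1].
# Combined Gini = 6/12 * gini([2, 4]) + 6/12 * gini([5, 1]) ≈ 0.361
print(combined_gini([[2, 4], [5, 1]]))
```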
What is the entropy measure?
Similar to Gini, but with a different computation:
- (proportion class 1) * log2(proportion class 1) - (proportion class 2) * log2(proportion class 2) - etc.
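A minimal sketch of the entropy computation, analogous to the Gini sketch above (class counts made up for illustration):

```python
import math

def entropy(class_counts):
    """Entropy of a node, given the number of records per class."""
    total = sum(class_counts)
    proportions = [count / total for count in class_counts if count > 0]
    return -sum(p * math.log2(p) for p in proportions)

# Class 1 has 2 records, Class 2 has 4 records, node total = 6.
print(entropy([2, 4]))  # -(2/6)*log2(2/6) - (4/6)*log2(4/6) = 0.918...
```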