L19 - Classification Trees Flashcards
What is a decision tree?
A supervised learning algorithm used for classification (and sometimes regression).
What are the 2 common types of decision trees?
Regression Tree
Classification Tree
What is the goal of a classification tree? What rules does it follow to achieve this?
Goal: Make a classification based on simple decision rules that are learnt from historic data.
What is a classification tree and how does it work?
- A set of conditional statements. New data follows a path of branches based on its response to the conditionals in the tree.
- A classification is made when the new data reaches a leaf. The leaf is the class that will be assigned.
Can a tree hold a mixture of data types?
Yes.
What are the steps for interpreting a classification tree?
- Compare the new data against the condition of the root node.
- Follow either the true or false branch.
- Repeat until a leaf is reached.
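The steps above can be sketched in code. This is a minimal illustration, not a library implementation: the `Node` and `Leaf` classes, the `classify` function, and the `age` / `owns_home` features are all hypothetical names made up for the example. It also shows one tree mixing a numeric and a categorical feature.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # the class assigned when this leaf is reached


@dataclass
class Node:
    condition: Callable[[dict], bool]  # the conditional statement at this node
    if_true: Union["Node", Leaf]       # branch followed when the condition holds
    if_false: Union["Node", Leaf]      # branch followed otherwise


def classify(tree, sample):
    """Walk from the root to a leaf and return the leaf's class."""
    node = tree
    while isinstance(node, Node):
        node = node.if_true if node.condition(sample) else node.if_false
    return node.label


# Hypothetical tree: a numeric test at the root, a categorical test below it.
tree = Node(
    condition=lambda s: s["age"] < 30,
    if_true=Leaf("no"),
    if_false=Node(
        condition=lambda s: s["owns_home"] == "yes",
        if_true=Leaf("yes"),
        if_false=Leaf("no"),
    ),
)

print(classify(tree, {"age": 45, "owns_home": "yes"}))  # yes
print(classify(tree, {"age": 22, "owns_home": "yes"}))  # no
```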
When building a classification tree, what is the process followed for choosing the root statement?
Using historic data, establish the relationships between features and the dependent variable.
For each feature, create a small test tree in which a statement about that feature is the root and the true and false branches lead directly to decision leaves.
The statement with the purest leaves should be chosen. However, if a leaf is impure (it contains a mixture of classes), its impurity needs to be measured and dealt with.
What are the 2 methods for measuring leaf impurity?
Percentage: calculate the true percentage via (true P’s / total P’s).
Gini’s impurity: calculate the Gini Impurity for every candidate root.
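What are the steps of Gini’s Impurity?
- For each feature being tested as root, find the Gini Impurity of the leaves of the feature-root test trees.
- Calculate the total GI for the feature as the weighted average of the leaf impurities: Total GI = (n_left / n) * leftGI + (n_right / n) * rightGI, where n_left and n_right are the numbers of samples in each leaf and n = n_left + n_right.
- Repeat for all features.
- The feature with the lowest total GI is set as the root of the classification tree.
- Repeat the process at each new node to build the rest of the tree.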
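A minimal sketch of the root-selection steps, assuming yes/no (boolean) features and assuming the total GI is the size-weighted sum of the two leaf impurities. The function names and the tiny data set (`likes_popcorn`, `likes_soda`, `loves_film`) are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one leaf: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # an empty leaf contributes nothing
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def total_gini(rows, feature, target):
    """Weighted Gini impurity of the two leaves produced by splitting on `feature`."""
    true_leaf = [r[target] for r in rows if r[feature]]
    false_leaf = [r[target] for r in rows if not r[feature]]
    n = len(rows)
    return (len(true_leaf) / n) * gini(true_leaf) + (len(false_leaf) / n) * gini(false_leaf)


def choose_root(rows, features, target):
    """Return the feature whose test tree has the lowest total Gini impurity."""
    return min(features, key=lambda f: total_gini(rows, f, target))


# Made-up historic data, for illustration only.
rows = [
    {"likes_popcorn": True,  "likes_soda": True,  "loves_film": True},
    {"likes_popcorn": True,  "likes_soda": False, "loves_film": False},
    {"likes_popcorn": False, "likes_soda": True,  "loves_film": True},
    {"likes_popcorn": False, "likes_soda": False, "loves_film": False},
    {"likes_popcorn": True,  "likes_soda": True,  "loves_film": True},
    {"likes_popcorn": False, "likes_soda": True,  "loves_film": False},
]

print(choose_root(rows, ["likes_popcorn", "likes_soda"], "loves_film"))  # likes_soda
```

Here `likes_soda` scores 0.25 and `likes_popcorn` scores about 0.444, so `likes_soda` becomes the root. The same selection would then be repeated on the rows that reach each impure leaf.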
What is the equation of Gini’s impurity?
1 - ( (prob. of yes)^2 + (prob. of no)^2 )
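A short worked example for a single leaf, assuming two classes and the form in which both squared probabilities are subtracted from 1. The function name `gini_impurity` and the counts are illustrative only.

```python
def gini_impurity(yes_count, no_count):
    """Gini impurity of a two-class leaf, computed from its yes/no counts."""
    total = yes_count + no_count
    if total == 0:
        raise ValueError("leaf has no samples")
    p_yes = yes_count / total
    p_no = no_count / total
    return 1.0 - (p_yes ** 2 + p_no ** 2)


print(gini_impurity(3, 1))  # 0.375 -> 1 - (0.75^2 + 0.25^2)
print(gini_impurity(2, 2))  # 0.5   -> the most impure a two-class leaf can be
print(gini_impurity(4, 0))  # 0.0   -> a pure leaf
```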
What is pruning and why is it needed?
Pruning is a data compression technique that removes non-critical components of a classification tree.
It is needed to prevent overfitting.
What can cause a classification tree to be overfit?
Continuing to split until every impurity is removed can cause overfitting, because the tree begins to fit noise in the historic data.
What is post-pruning?
Pruning occurs after the classification tree has been built.
Pruning is done to the point at which the cross-validated error is at a minimum.
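As a rough sketch of that selection rule, suppose each candidate pruned tree has already been scored by cross-validation. The function `best_pruning_level`, the tie-breaking towards the smaller tree, and the error values are assumptions made for illustration; they are not measured results.

```python
def best_pruning_level(cv_errors):
    """Pick the tree size with the lowest cross-validated error.

    cv_errors maps the number of leaves kept to the cross-validated error.
    Ties are broken in favour of the smaller tree (an assumption of this sketch).
    """
    return min(cv_errors, key=lambda leaves: (cv_errors[leaves], leaves))


# Made-up errors: too little pruning (16 leaves) and too much (1 leaf) both do worse.
cv_errors = {1: 0.40, 2: 0.28, 4: 0.21, 8: 0.24, 16: 0.31}

print(best_pruning_level(cv_errors))  # 4
```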
What is pre-pruning?
Stop the tree-building process before it produces leaves with very small samples, thus preventing overfitting.
Upon each splitting of the tree, the cross-validation error is measured. If the split does not reduce the error substantially, the tree creation is stopped.
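One way to express such a stopping rule, as a sketch only: the function name, the `min_leaf_size` and `min_improvement` thresholds, and their default values are hypothetical choices, not fixed parts of the method.

```python
def should_stop_splitting(n_left, n_right, error_before, error_after,
                          min_leaf_size=5, min_improvement=0.01):
    """Return True if a proposed split should be rejected (pre-pruning)."""
    # Stop if the split would create a leaf with very few samples.
    if min(n_left, n_right) < min_leaf_size:
        return True
    # Stop if the split does not reduce the cross-validation error enough.
    return (error_before - error_after) < min_improvement


print(should_stop_splitting(3, 40, 0.30, 0.20))    # True: one leaf is too small
print(should_stop_splitting(20, 25, 0.30, 0.295))  # True: error barely improves
print(should_stop_splitting(20, 25, 0.30, 0.22))   # False: keep splitting
```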