Classification (Decision Trees) Flashcards
What is the goal of decision trees?
To create a model that predicts the value of a target variable by learning simple decision rules inferred from features
What are decision trees?
A non-parametric supervised learning method used for classification
What are benefits of decision trees?
- require relatively little data and yield interpretable decision rules
- make no underlying assumptions about the data distribution
- handle multidimensional data
- achieve good accuracy
What is the splitting criterion?
Tells us which attribute to test at node N by determining the “best” way to separate or partition tuples into individual classes
What does it mean for a partition to be pure?
If all tuples belong to the same class
What does information gain do?
It selects the attribute that minimizes the information needed to classify tuples in the resulting partitions, i.e., the split that reflects the least randomness or “impurity”
(T/F): Info gain guarantees that a simple tree is found
True; it guarantees that a simple (though not necessarily the simplest) tree is found
What does Gain(A) tell us?
Tells us how much would be gained by branching on A: the difference between the original information requirement and the new one, Gain(A) = Info(D) − Info_A(D).
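A minimal sketch of the two cards above, in pure Python. The helper names (`entropy`, `info_gain`) and the toy 9-yes/5-no data are illustrative assumptions, not from the cards:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, partitions):
    """Gain(A) = Info(D) - Info_A(D): original info required minus the
    weighted info required after partitioning D on attribute A."""
    n = len(labels)
    info_a = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - info_a

# Hypothetical toy data: 14 tuples (9 "yes", 5 "no") split three ways
# by a made-up discrete attribute A.
labels = ["yes"] * 9 + ["no"] * 5
parts = [["yes"] * 2 + ["no"] * 3,   # A = value 1
         ["yes"] * 4,                # A = value 2 (pure partition)
         ["yes"] * 3 + ["no"] * 2]   # A = value 3
print(round(info_gain(labels, parts), 3))  # ≈ 0.247
```

Note that the pure partition contributes zero entropy, which is exactly why info gain favors splits that produce pure partitions.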
How can you determine the “best” split point for A if it is CONTINUOUS-VALUED?
- sort the values of A in increasing order
- consider the midpoint of each pair of adjacent values as a candidate split point: (a_i + a_{i+1}) / 2
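The two steps above can be sketched in a few lines; the helper name `candidate_split_points` is assumed for illustration:

```python
def candidate_split_points(values):
    """Sort the values of A, then take the midpoint of each adjacent pair
    as a candidate split point for a continuous-valued attribute."""
    a = sorted(values)
    return [(a[i] + a[i + 1]) / 2 for i in range(len(a) - 1)]

print(candidate_split_points([3.0, 1.0, 2.0, 4.0]))  # [1.5, 2.5, 3.5]
```

Each candidate split point would then be evaluated (e.g., by info gain), and the best one chosen as the test at the node.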
(T/F): For info gain, pick the feature and split that MAXIMIZES weighted entropy
False; pick the feature and split that MINIMIZES weighted entropy
What kind of DT is info gain used in?
ID3
What kind of DT is the Gini index used in?
CART
What is Gini Index?
An alternative measure of impurity of class labels among a particular set of tuples
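A small sketch of the Gini index, Gini(D) = 1 − Σ p_i², using the same hypothetical 9-yes/5-no data as before (the data and helper name are assumptions, not from the cards):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2): impurity of the class labels in D.
    0 for a pure partition; larger means more mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(round(gini(["yes"] * 9 + ["no"] * 5), 3))  # ≈ 0.459
print(gini(["yes"] * 4))                         # 0.0 (pure partition)
```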
If A is discrete-valued with v possible values, how many possible ways are there to form 2 partitions of the data based on a binary split on A?
2^v − 2 ways (every subset of A’s values except the empty set and the full set)
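The 2^v − 2 count can be checked by enumerating the candidate subsets directly; the helper name `binary_splits` and the example values are assumptions for illustration:

```python
from itertools import combinations

def binary_splits(values):
    """Every non-empty proper subset S of A's values; each S (with its
    complement) defines one candidate binary split A in S / A not in S."""
    subsets = []
    for r in range(1, len(values)):  # exclude empty set (r=0) and full set
        subsets.extend(combinations(values, r))
    return subsets

vals = ["low", "medium", "high"]   # v = 3
print(len(binary_splits(vals)))    # 2^3 - 2 = 6
```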
(T/F): We want to MINIMIZE the gini index
True