Chp 5 Decision Tree Flashcards
Greedy Strategy
Split the records based on an attribute test that optimizes a certain criterion (e.g., Gini index or information gain)
Binary Split
Divides values into two subsets
Multi-way split
Uses as many partitions as there are distinct values
How to specify a test condition for continuous attributes?
Sort the attribute values
Create candidate split positions at the midpoints between adjacent values
Compute the Gini index at each candidate position and pick the best one (see the sketch below)
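A minimal Python sketch of this procedure; the values, labels, and helper names below are hypothetical illustrations, not from the cards:

```python
# Hypothetical sketch: choosing a split position for a continuous attribute.
def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Sort the values, form candidate thresholds at the midpoints between
    adjacent distinct values, and return (weighted Gini, best threshold)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal adjacent values give no midpoint
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = min(best, (weighted, t))
    return best

# Toy example: annual income (in $1000s) vs. a binary class label.
print(best_split([60, 70, 75, 85, 90, 95], ["N", "N", "N", "Y", "Y", "N"]))
```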
The lower the Gini index
the better the split
Gini Index
Gini(t) = 1 − Σᵢ p(i|t)², where p(i|t) is the fraction of class-i records at node t; 0 for a pure node
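A worked example with hypothetical class counts:

```latex
% Hypothetical node with 4 records of class C1 and 2 of class C2:
\mathrm{Gini}(t) = 1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2
                 = 1 - \tfrac{16}{36} - \tfrac{4}{36} \approx 0.444
```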
Entropy
Amount of uncertainty involved in a distribution
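For reference, the standard entropy formula at a node t, with a worked example using hypothetical class counts:

```latex
\mathrm{Entropy}(t) = -\sum_{i} p(i \mid t)\,\log_2 p(i \mid t)
% e.g., a hypothetical node with class counts (4, 2):
\mathrm{Entropy} = -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} \approx 0.918
```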
In decision tree algorithms, entropy measures
Purity of a node (lower entropy means a purer node; a pure node has entropy 0)
Purity
The fraction of observations belonging to a particular class in a node
Pure node
A node is pure if all observations in it belong to the same class
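A quick worked example with hypothetical counts:

```latex
% Hypothetical node with 8 records of class C1 and 2 of class C2:
\text{purity of } C_1 = \tfrac{8}{10} = 0.8,
\qquad \text{pure node: purity} = \tfrac{10}{10} = 1
```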
Information
Amount of certainty involved in a distribution
We need to choose a split in a decision tree that maximizes
Information Gain
Information Gain formula
Entropy before the split − weighted average entropy of the child nodes after the split (see below)
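Spelled out, for a split producing k children, where n is the number of records at the parent and n_j the number in child j (the standard form of the measure):

```latex
\mathrm{Gain}(\text{split}) = \mathrm{Entropy}(\text{parent})
  - \sum_{j=1}^{k} \frac{n_j}{n}\,\mathrm{Entropy}(\text{child}_j)
```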
Steps to determine a split?
Calculate the entropy at the current (root) node
Calculate the information gain for each candidate attribute split
Pick the attribute split with the highest information gain (sketched in Python below)
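A minimal Python sketch of these steps; the toy dataset and attribute names are hypothetical:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2 p_i)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(rows, labels, attr):
    """Entropy at the parent minus the weighted entropy of the
    children produced by a multi-way split on attribute `attr`."""
    parent = entropy(labels)
    n = len(labels)
    children = {}
    for row, lab in zip(rows, labels):
        children.setdefault(row[attr], []).append(lab)
    after = sum(len(subset) / n * entropy(subset) for subset in children.values())
    return parent - after

# Hypothetical toy data: pick the attribute with the highest gain.
rows = [{"outlook": "sunny", "windy": "yes"},
        {"outlook": "sunny", "windy": "no"},
        {"outlook": "rain",  "windy": "yes"},
        {"outlook": "rain",  "windy": "no"}]
labels = ["N", "Y", "N", "Y"]
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)  # "windy": its split yields pure children, gain = 1.0
```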
Decision trees are non/parametric
Nonparametric, because no fixed set of parameters or distributional form is assumed; the tree structure is learned entirely from the data
Time complexity of decision trees?
Building the tree: roughly O(d · n log n) for n records and d attributes, since continuous attributes must be sorted; classifying a test record: O(tree depth), which is O(log n) for a balanced tree
Does multicollinearity affect decision tree accuracy?
No; the tree simply splits on one of the correlated attributes, which may just add extra height (redundant splits)
Finding the optimal decision tree is
computationally expensive (NP-complete), which is why greedy heuristics are used in practice
Decision trees can create
rectilinear (axis-parallel) decision boundaries
Decision trees cannot create
oblique (diagonal) boundaries, because each test condition involves a single attribute, so every split is parallel to an axis
Model =
Algorithm + hypothesis
4 steps of selecting a model
Prepare training data
Choose hypothesis set and algorithm
Tune algorithm
Train the model, apply it to out-of-sample data (the test set), and evaluate the results (see the sketch below)
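A minimal end-to-end sketch of these four steps using scikit-learn (assuming it is available; the iris dataset and hyperparameter values are just illustrations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Prepare training data (hold out a test set for out-of-sample evaluation).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Choose hypothesis set and algorithm: a decision tree classifier.
# 3. Tune the algorithm, e.g., cap the depth to limit overfitting.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# 4. Train the model, then evaluate it on the out-of-sample test set.
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```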