Chapter 5: Decision Tree Flashcards
Greedy Strategy
Split the records based on an attribute test that optimizes a certain criterion
Binary Split
Divides values into two subsets
Multi-way split
Use as many partitions as there are distinct values
How to split on a continuous attribute
Sort the attribute values
Create candidate split positions at the halfway points between adjacent values
Compute the Gini index of each candidate split
The lower the Gini index
the better the split
Gini Index
A measure of node impurity: Gini = 1 - (sum over classes i of p_i^2), where p_i is the fraction of the node's observations in class i; a pure node has Gini 0
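A minimal sketch (not from the cards) of the continuous-attribute split procedure above, scoring each midpoint with the Gini index just defined; the function names and the income/default data are invented for illustration.

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class fractions.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_gini_split(values, labels):
    # Sort, form midpoints between adjacent values, and score each
    # candidate split by the weighted Gini of its two children.
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    n = len(values)
    best_split, best_score = None, float("inf")
    for m in np.unique((values[:-1] + values[1:]) / 2.0):
        left, right = labels[values <= m], labels[values > m]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_split, best_score = m, score
    return best_split, best_score

# Invented data: annual income (in $1000s) vs. whether a loan defaulted.
income  = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
default = [0,  0,  0,  1,  1,  1,  0,   0,   0,   0]
print(best_gini_split(income, default))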
Entropy
The amount of uncertainty in a distribution: Entropy = -(sum over classes i of p_i * log2(p_i))
In decision tree algorithms, entropy measures
Node purity: a pure node has entropy 0, and a maximally mixed node has maximum entropy
Purity
The fraction of observations belonging to a particular class in a node
Pure node
If all observations belong to the same class
Information
Amount of certainty involved in a distribution
We need to choose a split in a decision tree that maximizes
Information Gain
Information Gain formula
Entropy before the split - weighted average entropy of the child nodes after the split
Steps to determine split?
Calculate the entropy at the root node
Calculate the information gain for each candidate attribute split
Pick the attribute split with the highest information gain
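A hedged sketch of these three steps for a single categorical attribute; the helper functions and the tiny outlook/play dataset are invented for illustration.

import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class fractions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    # Entropy before the split minus the weighted entropy after it.
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    before = entropy(labels)                      # step 1: root entropy
    after = sum(
        (attribute == v).mean() * entropy(labels[attribute == v])
        for v in np.unique(attribute)
    )
    return before - after

# Invented example: does 'outlook' predict 'play'?
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(information_gain(outlook, play))            # steps 2-3: compare gains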
Decision trees are non/parametric
Nonparametric, because the tree's structure is learned from the data rather than fixed in advance by a set of parameters
Time complexity of building a decision tree
Roughly O(m * n log n) for m attributes and n records; classifying a record afterwards is fast, O(depth of tree), about O(log n) for a balanced tree
Does multicollinearity affect decision tree accuracy?
No, it just adds extra height to the tree (redundant splits on correlated attributes)
Finding the optimal decision tree is
expensive (NP-complete), which is why greedy heuristics are used in practice
Decision tree can create
rectilinear (axis-parallel) decision boundaries, since each split tests a single attribute
Decision trees cannot create
oblique or curved boundaries; every boundary segment must be parallel to an attribute axis
Model =
Algorithm + hypothesis
4 steps of selecting a model
Prepare training data
Choose hypothesis set and algorithm
Tune algorithm
Train the model, fit it to out-of-sample data (the test set), and evaluate the results
Goal of model selection
Select the best model from training phase
Two ways to evaluate models
Model Checking
Performance Estimation
The best model is the one that gives you _ and _ well on testing set
Smallest prediction error
Generalizes
Model Checking
Given a dataset, divide it into a training and a testing dataset
What should you use if randomly selected test points are not representative of the population in general?
Cross Validation
Cross validation approaches (3)
Holdout
K-fold cross validation
Leave one out cross validation
Cross validation
An approach to systematically create and evaluate multiple models on multiple subsets of the dataset
Holdout method
Randomly split the data, e.g., 80% for training and 20% for testing
k-fold cross validation
Split the data into k chunks, train on k-1 chunks, and test on the remaining chunk. Repeat k times, with each chunk serving as the test set once, and average the error
Leave one out cross-validation
Extreme version of k-fold where k = n: the data is split into n chunks, so each test fold contains a single observation
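A minimal scikit-learn sketch of all three approaches; the synthetic dataset and the choice of a decision tree model are assumptions made for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import (
    train_test_split, cross_val_score, KFold, LeaveOneOut)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)  # synthetic data
model = DecisionTreeClassifier(random_state=0)

# Holdout: one random 80/20 split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold: train on k-1 chunks, test on the remaining chunk, k times.
kfold_acc = cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean()

# Leave-one-out: k = n, each test fold is a single observation.
loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()

print(holdout_acc, kfold_acc, loo_acc)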
Use of k-fold or LOOCV resampling methods is more robust if
the data is split into training, validation, and testing sets
Typical application of holdout methods is to
determine a stopping point with respect to error: stop when the test-set error starts increasing (a sign of overfitting)
Why split data into three parts?
If your model has hyperparameters, you can tune them on the validation dataset
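One hedged way to use the three-way split: tune a hyperparameter (max_depth, chosen here only as an example) on the validation set, then evaluate once on the untouched test set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # synthetic data
# 60% train, 20% validation, 20% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune the hyperparameter on the validation set...
best_depth = max(
    range(1, 11),
    key=lambda d: DecisionTreeClassifier(max_depth=d, random_state=0)
                  .fit(X_tr, y_tr).score(X_val, y_val))

# ...then evaluate exactly once on the test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
print(best_depth, final.score(X_te, y_te))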
two ways to do performance evaluation of a model
Confusion Matrix
Receiver Operating Characteristics (ROC) curve
Confusion Matrix
A table of predicted versus actual classes; numerous metrics (accuracy, precision, recall, etc.) are computed from it
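A quick sketch with scikit-learn's confusion_matrix; the label vectors are made up for illustration.

from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score)

# Invented actual and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns predicted: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))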
ROC curve
Characterizes the trade-off between positive hits (true positives) and false alarms (false positives)
Can you minimize both FP and FN?
No; lowering one typically raises the other, so the classification threshold trades them off
What does the ROC plot?
The true positive rate against the false positive rate
What does the middle (diagonal) line on an ROC plot mean?
A random guess (true positive rate equals false positive rate)
AUC standards
.5-.6 fail
.6-.7 poor
.7-.8 fair
.8-.9 good
.9-1 excellent
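A short sketch computing the ROC curve and AUC with scikit-learn; the synthetic data and model settings are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)  # synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]     # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)        # ROC: TPR plotted against FPR
print("AUC:", roc_auc_score(y_te, scores))   # area under that curve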
Overfitting
the model picks up nuances of the training data, matching it too closely to generalize
Underfitting
the model is so simple that it cannot capture the patterns in the data
Prepruning
Halt the growth of the tree early, based on some constraint
Pro: produces shorter trees
Con: hard to know when to stop
Post-pruning
Grow the tree to its maximum size, then trim it back
Pro: generally gives better results
Con: wastes compute cycles growing branches that are later pruned
How to prune
Use the complexity parameter (cp): keep splitting only while each split improves the fit by at least cp
How to find cp
Plot the cross-validation error against cp; use the cp value where the error starts to rise again after decreasing
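The cp described above is R rpart's complexity parameter; a rough scikit-learn analogue (an assumption, not the lecture's exact method) is cost-complexity pruning via ccp_alpha, choosing the value with the lowest cross-validated error.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # synthetic data

# Candidate pruning strengths come from the full tree's pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]   # drop the trivial root-only tree

# Pick the alpha whose cross-validated error is lowest (error = 1 - accuracy).
cv_err = [1 - cross_val_score(
              DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmin(cv_err))]
print(best_alpha)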
What does pruning do
Gives tree that is more generalized
Most real-world datasets have what type of class distribution?
imbalanced
Cost sensitive learning
Penalizes the model more heavily when it commits a false negative error (asymmetric misclassification costs)
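One common way to approximate this in scikit-learn (an assumption, not necessarily the lecture's prescribed method) is per-class weights; here errors on the rare positive class are made ten times as costly.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 10% positives.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

# Weight errors on the rare positive class 10x as heavily,
# discouraging false negatives.
model = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
model.fit(X, y)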
Class imbalance mitigation techniques (3)
Cost sensitive learning
Sampling techniques
Synthetic data
Synthetic data
May be generated, if possible, to make the class distribution balanced
Sampling techniques
modify the class distribution such that the rare class is well represented in the training set
Undersampling + con
Uses fewer of the majority-class observations for training
Con: useful observations may not be part of the sample
Oversampling + con
Uses more of the minority-class observations (e.g., by replicating them) for training
Con: if the training data is noisy, oversampling may amplify the noise
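A minimal sketch of both sampling techniques using sklearn.utils.resample on invented imbalanced data.

import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
majority = rng.normal(size=(900, 2))   # invented majority-class observations
minority = rng.normal(size=(100, 2))   # invented minority-class observations

# Undersampling: keep fewer majority observations (may discard useful ones).
maj_down = resample(majority, n_samples=100, replace=False, random_state=0)

# Oversampling: repeat minority observations (may amplify noise).
min_up = resample(minority, n_samples=900, replace=True, random_state=0)

print(maj_down.shape, min_up.shape)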