Classification Flashcards
What are the two types of classification methods: build the model first, or create a model for each query?
Eager builds the model (e.g. a tree) first. Creating the model takes a while, processing queries doesn’t.
Lazy doesn’t build a tree up front. Queries take longer. Lazy adapts better to new data.
What is entropy for a 50/50 split?
What is entropy for a 25,25,25,25 split (one attribute)?
What is entropy for two 50/50 splits?
50/50: 1 bit.
25/25/25/25: 2 bits.
(50/50)(50/50): 1 bit (weighted average over two partitions that are each 50/50).
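Quick check of the three answers above, as a minimal Python sketch. The function names are made up for illustration, and the third card is read here as the size-weighted average entropy of two partitions that are each 50/50 (an assumption about what the card means).

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    # Empty classes are skipped: 0 * log2(0) is treated as 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(partitions):
    """Average entropy of several partitions, weighted by partition size."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * entropy(p) for p in partitions)

print(entropy([50, 50]))                    # 1.0
print(entropy([25, 25, 25, 25]))            # 2.0
print(weighted_entropy([[5, 5], [5, 5]]))   # 1.0
```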
Information Gain and Gain Ratio
Gain: old entropy - new entropy
Ratio: gain / split information (the entropy of the branch sizes). Penalizes attributes that split into many small branches.
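A small Python sketch of both measures, using the C4.5 definition of gain ratio (gain divided by split information). The function names and the example counts are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """parent: class counts before the split; children: class counts per branch."""
    n = sum(parent)
    new_entropy = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - new_entropy

def gain_ratio(parent, children):
    """Gain divided by split information (entropy of the branch sizes)."""
    split_info = entropy([sum(ch) for ch in children])
    return information_gain(parent, children) / split_info

# Hypothetical data: 10 yes / 10 no, split into two pure branches.
parent = [10, 10]
children = [[10, 0], [0, 10]]
print(information_gain(parent, children))  # 1.0
print(gain_ratio(parent, children))        # 1.0
```

Note the sketch does not guard against a split with a single branch, where split information is 0.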
What is Gini for 50/50 split?
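Gini is 0 for a pure node and gets larger as the split gets worse. The maximum for k equal classes is 1 - 1/k, so it approaches 1 only with many classes.
50/50: 0.5
25/25/25/25: 0.75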
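The Gini numbers above can be checked with a few lines of Python. This is a minimal sketch; `gini` is an illustrative name.

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([50, 50]))           # 0.5
print(gini([25, 25, 25, 25]))   # 0.75
print(gini([100, 0]))           # 0.0 (pure node)
```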
Three ways to turn a numerical value into an ordinal value. (What's this called?)
Discretizing:
1) Equal Width: select a fixed range size; every bin spans the same range.
2) Equal Depth: select a bin size; every bin holds the same number of values.
3) Distance-based ???
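A rough Python sketch of the first two methods only. The function names and the sample ages are made up for illustration.

```python
def equal_width_bins(values, width):
    """Assign each value a bin index using a fixed range size."""
    low = min(values)
    return [int((v - low) // width) for v in values]

def equal_depth_bins(values, depth):
    """Sort the values and put `depth` of them in each bin."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up sample
print(equal_width_bins(ages, 10))  # [0, 0, 1, 1, 1, 2, 2, 2, 3]
print(equal_depth_bins(ages, 3))   # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```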
What's an entropy-based discretization method?
Done recursively. For all possible splits in your data, find the split that makes entropy lowest. Repeat on either side of split…
Keep going as long as information gain is greater than a threshold.
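One way the recursion could look in Python. This is a sketch under assumptions: cut points are placed midway between neighboring values, ties go to the first best split, and all names and the sample data are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(points):
    """points: (value, label) pairs sorted by value. Returns (gain, index)."""
    labels = [lab for _, lab in points]
    base = entropy(labels)
    best = (0.0, None)
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot cut between equal values
        left, right = labels[:i], labels[i:]
        new = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - new > best[0]:
            best = (base - new, i)
    return best

def discretize(points, threshold):
    """Return sorted cut points found by recursive entropy splitting."""
    points = sorted(points)
    gain, i = best_split(points)
    if i is None or gain <= threshold:
        return []  # stop: no split gains more than the threshold
    cut = (points[i - 1][0] + points[i][0]) / 2
    return discretize(points[:i], threshold) + [cut] + discretize(points[i:], threshold)

data = [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (10, "a"), (11, "a")]
print(discretize(data, 0.1))  # [2.5, 7.0]
```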
How do you pre-prune or post-prune a decision tree?
Pre-pruning: stop splitting a node once information gain falls below a threshold.
Post-pruning: build the full tree, then simplify it by sub-tree replacement or sub-tree raising.
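3 methods for dividing training and test data
Holdout: divide a dataset into 2/3 training, 1/3 test.
Random Sampling: repeat the holdout split several times with different random subsets and average the results.
Cross Validation: split into k folds; train on k-1 folds, test on the remaining one, and rotate so every fold is the test set once.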
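A minimal Python sketch of the holdout and k-fold splits, with no libraries. The function names and the fixed seed are assumptions for illustration.

```python
import random

def holdout_split(records, seed=0):
    """Shuffle a copy, then keep 2/3 for training and 1/3 for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(records, k):
    """Yield (train, test) pairs; each record lands in a test set exactly once."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

data = list(range(9))
train, test = holdout_split(data)
print(len(train), len(test))  # 6 3
for train, test in k_fold_splits(data, 3):
    print(test)  # [0, 3, 6] then [1, 4, 7] then [2, 5, 8]
```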
Describe a method to test your classifier that penalizes mispredictions.
Cost matrix: a confusion matrix where a cost is assigned to each quadrant. In the cancer screening example, TP = -1, FP = 1, FN = 100, TN = 0.
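To see why this matters, here is a small Python sketch using the costs from the card. The two confusion matrices are invented; the point is that the less accurate model can still be the cheaper one when false negatives are expensive.

```python
# Costs from the cancer-screening card; the confusion counts are invented.
COST = {"TP": -1, "FP": 1, "FN": 100, "TN": 0}

def total_cost(confusion, cost=COST):
    """Sum of (count in each quadrant) * (cost of that quadrant)."""
    return sum(confusion[cell] * cost[cell] for cell in cost)

model_a = {"TP": 40, "FP": 60, "FN": 10, "TN": 890}   # 93.0% accurate
model_b = {"TP": 48, "FP": 150, "FN": 2, "TN": 800}   # 84.8% accurate
print(total_cost(model_a))  # 1020
print(total_cost(model_b))  # 302
```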
What's the rule-based eager method?
Sequential Covering Algorithm.
Ponder: “As rules grow, certainty increases, coverage decreases.”
What weakness of trees does the Sequential Covering algorithm really beat?
Subtree duplication: the same subtree showing up under several branches of the tree.
Two attributes of a good rule set
Exclusive and Exhaustive:
Exclusive: no entry will match two rules.
Exhaustive: any inbound query will hit some rule.
What two attributes do you rank rules on?
Coverage: Fraction of records that satisfy Antecedent.
Accuracy: Fraction of the records that satisfy the antecedent that also satisfy the consequent.
If X, then Y.
Lots of X’s, high coverage. Lots of XY among the X’s, then good accuracy.
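A short Python sketch of both measures for one rule, using the usual definitions: coverage is the share of all records matching the antecedent, accuracy is the share of those matched records that also match the consequent. The function name and the records are made up.

```python
def coverage_and_accuracy(records, antecedent, consequent):
    """Return (coverage, accuracy) of the rule 'if antecedent then consequent'."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if consequent(r)]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Made-up records for the rule "if outlook is sunny, then play is no".
records = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
]
cov, acc = coverage_and_accuracy(
    records,
    lambda r: r["outlook"] == "sunny",
    lambda r: r["play"] == "no",
)
print(cov, acc)  # coverage 3/6, accuracy 2/3
```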
Naive Bayes probability summary
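For each attribute value, you need P(d1 | X): the fraction of class-X records that have that value. So of 9 days to play golf, 3 of them were sunny: 3/9 = 1/3.
Multiply all of those together, times P(X), and divide by P(d1, d2, d3). The denominator is the same for every class, so it can be skipped when you only need the most likely class.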
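The calculation as a minimal Python sketch. Only the "3 sunny days out of 9" figure comes from the card; every other number and all names are hypothetical. Normalizing by the sum of the scores stands in for dividing by the probability of the evidence.

```python
def naive_bayes_posteriors(priors, likelihoods, query):
    """Score each class as P(class) * product of P(value | class), then normalize."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for attribute, value in query.items():
            score *= likelihoods[cls][attribute][value]
        scores[cls] = score
    # Dividing by the total plays the role of dividing by P(d1, d2, ...).
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical table: 9 "yes" days and 5 "no" days out of 14.
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": {"outlook": {"sunny": 3 / 9}, "windy": {"true": 3 / 9}},
    "no": {"outlook": {"sunny": 3 / 5}, "windy": {"true": 3 / 5}},
}
posterior = naive_bayes_posteriors(
    priors, likelihoods, {"outlook": "sunny", "windy": "true"}
)
print(max(posterior, key=posterior.get))  # no
```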
Remind yourself how to handle zeros in data
It's Laplace (add-one) smoothing: add one to the count of each value of the attribute, so no conditional probability comes out as zero.
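The add-one trick, sketched in Python. The function name and the counts are hypothetical; the denominator grows by the number of possible values so the smoothed probabilities still sum to 1.

```python
def smoothed_probability(count, class_total, num_values):
    """Add one to every value's count so no estimate is exactly zero."""
    return (count + 1) / (class_total + num_values)

# Hypothetical: 9 "yes" days, outlook has 3 possible values,
# and one value never shows up among the "yes" days.
print(smoothed_probability(0, 9, 3))  # 1/12 instead of 0
print(smoothed_probability(3, 9, 3))  # 4/12 instead of 3/9
```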