Classification Flashcards

Question 1

Q

What are the two types of classification methods: build the model up front, or build a model for each query?

Answer

A

Eager: builds the model (e.g., a tree) up front. Training takes a while; answering queries is fast.
Lazy: builds no model in advance. Each query takes longer, but lazy methods adapt better to new data.

Question 2

Q

What is entropy for a 50/50 split?
What is entropy for a 25,25,25,25 split (one attribute)?
What is entropy for two 50/50 splits?

Answer

A

50/50: 1 bit
25/25/25/25: 2 bits
(50/50)(50/50): 1 bit (the weighted average over the two partitions)
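
A minimal sketch to check these numbers, assuming entropy is measured in bits (log base 2) and that "two 50/50 splits" means the size-weighted average over two partitions. The helper names are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def weighted_entropy(partitions):
    """Size-weighted average entropy over several partitions."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * entropy(p) for p in partitions)

two_way = entropy(["a", "b"])                            # 50/50
four_way = entropy(["a", "b", "c", "d"])                 # 25/25/25/25
two_splits = weighted_entropy([["a", "b"], ["a", "b"]])  # (50/50)(50/50)
print(two_way, four_way, two_splits)
```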

Question 3

Q

Information Gain and Gain Ratio

Answer

A

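Gain: old entropy - new entropy (the weighted entropy after the split)
Ratio: gain / split information, where split information is the entropy of the partition sizes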
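
A small sketch of both measures, assuming the usual C4.5-style definition of gain ratio (information gain divided by split information). Function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(parent)
    after = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - after

def split_info(parent, children):
    """Entropy of the partition sizes; grows with the number of partitions."""
    total = len(parent)
    return -sum(len(c) / total * log2(len(c) / total) for c in children if c)

def gain_ratio(parent, children):
    """Information gain normalized by split information."""
    si = split_info(parent, children)
    return info_gain(parent, children) / si if si > 0 else 0.0

parent = ["y", "y", "n", "n"]
two_way = [["y", "y"], ["n", "n"]]             # clean binary split
one_per_record = [["y"], ["y"], ["n"], ["n"]]  # e.g., splitting on a unique ID

print(info_gain(parent, two_way), gain_ratio(parent, two_way))
print(info_gain(parent, one_per_record), gain_ratio(parent, one_per_record))
```

Both splits have the same information gain, but the one-record-per-partition split gets half the gain ratio.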

Question 4

Q

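What is Gini for a 50/50 split? For a 25/25/25/25 split?

Hint: Gini is 0 for a pure node and rises as records spread evenly over more classes; the maximum for k classes is 1 - 1/k, which approaches 1 as k grows.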

Answer

A

50/50: 0.5

25/25/25/25: 0.75
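
A quick check of these values, assuming the standard Gini impurity formula (1 minus the sum of squared class proportions). The function name is just for illustration.

```python
def gini(proportions):
    """Gini impurity for a list of class proportions that sum to 1."""
    return 1.0 - sum(p * p for p in proportions)

print(gini([0.5, 0.5]))                # 0.5
print(gini([0.25, 0.25, 0.25, 0.25]))  # 0.75
print(gini([1.0]))                     # 0.0 for a pure node
```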

Question 5

Q

Three ways to turn a numerical value into an ordinal value. (What’s this called?)

Answer

A

Discretizing:

1) Equal Width: choose a fixed range size for every bin
2) Equal Depth: choose a fixed number of values per bin
3) Distance-based ???
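
A rough sketch of the first two methods. It assumes you pick the number of bins (rather than the range size or the bin size directly); both function names are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    """Assign bin indices so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(ages, 3))  # the outlier leaves the middle bin empty
print(equal_depth_bins(ages, 3))  # three values per bin
```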

Question 6

Q

What’s an entropy-based discretization method?

Answer

A

Done recursively. Among all possible split points in your data, find the one that gives the lowest entropy. Repeat on each side of the split…
Continue as long as the information gain is greater than a threshold.
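
A sketch of the recursion, assuming the input is a list of (value, label) pairs already sorted by value and that cut points go midway between neighbors. The `min_gain` threshold of 0.1 and the function name are arbitrary choices for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def entropy_cut_points(pairs, min_gain=0.1):
    """Recursively pick the lowest-entropy cut until the gain drops to min_gain."""
    labels = [label for _, label in pairs]
    before = entropy(labels)
    best_gain, best_i = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left, right = labels[:i], labels[i:]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if before - after > best_gain:
            best_gain, best_i = before - after, i
    if best_i is None or best_gain <= min_gain:
        return []
    cut = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    return (entropy_cut_points(pairs[:best_i], min_gain)
            + [cut]
            + entropy_cut_points(pairs[best_i:], min_gain))

data = [(1, "n"), (2, "n"), (3, "y"), (4, "y"), (5, "n"), (6, "n")]
cuts = entropy_cut_points(data)
print(cuts)
```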

Question 7

Q

How do you pre-prune or post-prune a decision tree?

Answer

A

Pre-pruning: stop splitting a node when the information gain falls below a threshold.
Post-pruning: subtree replacement or subtree raising.

Question 8

Q

Three methods for dividing data into training and test sets

Answer

A

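Holdout: divide the dataset into 2/3 training and 1/3 test.
Random Subsampling: repeat holdout several times with different random splits and average the results.
Cross-Validation: split the data into k folds; each fold serves as the test set once (k-fold).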
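
A bare-bones sketch of k-fold splitting. It assumes the records are already in random order (no shuffling here), and `k_fold_splits` is a made-up name.

```python
def k_fold_splits(records, k):
    """Yield (train, test) pairs; each record lands in exactly one test fold."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

splits = list(k_fold_splits(list(range(6)), 3))
for train, test in splits:
    print("train:", train, "test:", test)
```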

Question 9

Q

Describe a method to test your classifier that penalizes mispredictions

Answer

A

Cost matrix: a confusion matrix where a cost is assigned to each quadrant.
In the cancer screening example:
TP = -1
FP = 1
FN = 100
TN = 0
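
A small sketch that scores two classifiers with the costs above. The prediction lists are invented for illustration, with True meaning "has cancer".

```python
COSTS = {"TP": -1, "FP": 1, "FN": 100, "TN": 0}

def total_cost(actual, predicted, costs=COSTS):
    """Sum the cost of every prediction according to its confusion-matrix cell."""
    total = 0
    for a, p in zip(actual, predicted):
        if a and p:
            total += costs["TP"]
        elif not a and p:
            total += costs["FP"]
        elif a and not p:
            total += costs["FN"]
        else:
            total += costs["TN"]
    return total

actual = [True, True, False, False]
cautious = [True, True, True, False]    # one false alarm
careless = [True, False, False, False]  # one missed case

print(total_cost(actual, cautious))  # -1
print(total_cost(actual, careless))  # 99
```

Both classifiers get 3 of 4 right, but the missed case dominates the cost.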

Question 10

Q

What’s the rule-based eager method?

Answer

A

Sequential Covering Algorithm.

Ponder: “As rules grow, certainty increases, coverage decreases.”

Question 11

Q

What weakness of trees does the Sequential Covering algorithm really beat?

Answer

A

Subtree duplication (the same subtree repeated in several branches of the tree).

Question 12

Q

Two attributes of a good rule set

Answer

A

Exclusive and Exhaustive:
Exclusive: no record will match two rules.
Exhaustive: every inbound query will match some rule.

Question 13

Q

What two attributes do you rank rules on?

Answer

A

Coverage: fraction of all records that satisfy the antecedent.
Accuracy: fraction of the records satisfying the antecedent that also satisfy the consequent.
If X, then Y.
Lots of X’s: high coverage. Most of those X’s are also Y: good accuracy.
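
A minimal sketch of both measures for one rule, assuming accuracy is measured over the records the rule covers. The toy records and the rule "if outlook is rain, then play is yes" are invented for illustration.

```python
def coverage_and_accuracy(records, antecedent, consequent):
    """Return (coverage, accuracy) for the rule 'if antecedent then consequent'."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if consequent(r)]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

records = [
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]

cov, acc = coverage_and_accuracy(
    records,
    antecedent=lambda r: r["outlook"] == "rain",
    consequent=lambda r: r["play"] == "yes",
)
print(cov, acc)  # 0.5 1.0
```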

Question 14

Q

Naive Bayes probability summary

Answer

A

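For each attribute value d in the query, you need P(d | X), the fraction of class-X records with that value. So if 3 of the 9 days golf was played were sunny, P(sunny | play) = 3/9 = 1/3.
Multiply all of those together, times P(X), and divide by P(d1 d2 d3): P(X | d1 d2 d3) = P(d1 | X) P(d2 | X) P(d3 | X) P(X) / P(d1 d2 d3). The denominator is the same for every class, so it can be skipped when comparing classes.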
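
A toy sketch of the calculation, leaving out the shared denominator so the scores are unnormalized. The four-row dataset and the function name are made up for illustration, and no smoothing is applied.

```python
from collections import Counter

def naive_bayes_scores(rows, labels, query):
    """Return P(class) * product of P(attribute value | class) for each class."""
    scores = {}
    for cls, n_cls in Counter(labels).items():
        score = n_cls / len(labels)  # prior P(class)
        for attr, value in query.items():
            matches = sum(
                1 for row, lab in zip(rows, labels)
                if lab == cls and row[attr] == value
            )
            score *= matches / n_cls  # P(value | class) from raw counts
        scores[cls] = score
    return scores

rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["yes", "yes", "yes", "no"]

scores = naive_bayes_scores(rows, labels, {"outlook": "sunny", "windy": False})
print(scores)  # "yes" scores about 3/4 * 2/3 * 1; "no" scores 0
```

The "no" class scores exactly zero because no "no" record is sunny, which is the zero-count problem.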

Question 15

Q

Remind yourself how to handle zeros in data

Answer

A

Laplace (add-one) smoothing: add one to the count of each attribute value, so that no conditional probability comes out as zero.
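
A one-function sketch of add-one smoothing, assuming the attribute has a known number of possible values. The counts (9 records, 3 outlook values) are illustrative.

```python
def smoothed_probability(match_count, class_count, n_values):
    """P(value | class) with one extra count added for each possible value."""
    return (match_count + 1) / (class_count + n_values)

# A value never seen with this class no longer gets probability zero.
unseen = smoothed_probability(0, 9, 3)
seen = smoothed_probability(3, 9, 3)
print(unseen, seen)
```

With counts 0, 3, and 6 over 9 records, the smoothed probabilities 1/12, 4/12, and 7/12 still sum to 1.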

Question 16

Q

SVM: Support Vector Machine basics

Answer

A

This is the one that tries to find a hyperplane that separates the classes, ideally with the widest margin.

Question 17

Q

What are four applications of classification analysis

Answer

A

Explaining Why.  
Credit approval
Target Marketing (likely to buy)
Is medical treatment effective?
Medical Diagnosis

Question 18

Q

What is KNN classification?

Answer

A

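K-Nearest Neighbors: classify a record by majority vote among the k training records closest to it. Closeness needs a distance or similarity measure over the attribute vectors, e.g., Euclidean distance or cosine similarity.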
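
A minimal KNN sketch, assuming numeric attribute vectors and Euclidean distance (cosine similarity would be another choice). The training points and the function name are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"),
]

print(knn_predict(train, (0.5, 0.5)))  # a
print(knn_predict(train, (5, 5.5)))    # b
```

There is no training step: all the work happens at query time, which is what makes it lazy.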

Question 19

Q

List all the lazy and non-lazy methods you can think of.

Answer

A

Lazy: KNN, range query (similar to KNN)
Eager: Naive Bayes
Support vector machine
Tree generation
Sequential covering algorithm

Question 20

Q

Supervised versus Unsupervised learning

Answer

A

Supervised (like tree generation): training data has labels or classes. This is classification.
Unsupervised: class labels are unknown; try to discover classes/labels (clustering).

Question 21

Q

What is an advantage of Gain Ratio over Information Gain

Answer

A

Gain ratio tries to penalize splits into many partitions (e.g., splitting on first name).

Question 22

Q

What is a class-based discretization process?

Answer

A

Place breakpoints in numeric data.
1) Place breakpoints between adjacent numeric values where the class changes.
2) Set a minimum number of values per bin, then place breakpoints where the majority class changes.
(YNYYY) (NNYYY) (NYYN)
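
A sketch of step 1 only, assuming (value, label) pairs sorted by value and breakpoints placed midway between neighbors. The function name and sample data are hypothetical; the minimum-size merging of step 2 is not shown.

```python
def class_change_breakpoints(pairs):
    """Return a cut point between every pair of neighbors whose class differs."""
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts

temps = [(64, "Y"), (65, "N"), (68, "Y"), (69, "Y"), (70, "Y"), (71, "N")]
breaks = class_change_breakpoints(temps)
print(breaks)  # [64.5, 66.5, 70.5]
```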

(22 cards)