Classification Flashcards

1
Q

What are the two types of classification methods: building the model up front versus building a model for each query?

A

Eager builds the model (e.g., a tree) first: training takes a while, but processing queries doesn't.
Lazy doesn't build a model: queries take longer, but lazy adapts better to new data.

2
Q

What is entropy for a 50/50 split?
What is entropy for a 25,25,25,25 split (one attribute)?
What is entropy for two 50/50 splits?

A

50/50: 1 bit.
25/25/25/25: 2 bits.
Two 50/50 branches: the weighted entropy is still 1 bit.
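The arithmetic on this card can be checked with a short Python sketch (the `entropy` helper is mine, using the standard Shannon formula in bits):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

e_half = entropy([0.5, 0.5])          # 50/50 split: 1 bit
e_quarters = entropy([0.25] * 4)      # four-way uniform split: 2 bits
# Two branches that are each 50/50: the weighted average is still 1 bit.
e_two_branches = 0.5 * entropy([0.5, 0.5]) + 0.5 * entropy([0.5, 0.5])
```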

3
Q

Information Gain and Gain Ratio

A

Gain: old entropy - new (weighted) entropy.
Gain Ratio: information gain / split info, where split info is the entropy of the branch sizes themselves.
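A minimal Python sketch of both measures, assuming the usual C4.5-style definitions (function names are mine; counts are per-class record counts):

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent_counts, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children):
    """Information gain normalized by the entropy of the branch sizes."""
    split_info = entropy([sum(c) for c in children])
    return info_gain(parent_counts, children) / split_info
```

A perfect binary split of a balanced parent gives gain 1 and split info 1, so the ratio is also 1.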

4
Q

What is Gini for 50/50 split?

Gini = 1 - Σp². Its maximum is 1 - 1/n for n equally likely classes, so it approaches 1 as an evenly mixed split spreads over more classes.

A

50/50: 0.5

0.25/0.25/0.25/0.25: 0.75
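Both values on this card follow directly from the 1 - Σp² formula (helper name is mine):

```python
def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1 - sum(p * p for p in probs)

g2 = gini([0.5, 0.5])     # two balanced classes: 0.5
g4 = gini([0.25] * 4)     # four balanced classes: 0.75 (= 1 - 1/4)
```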

5
Q

Three ways to turn a numerical value into an ordinal value. (What's this called?)

A

Discretization:

1) Equal width: divide the value range into intervals of equal size.
2) Equal depth: put an equal number of records in each interval.
3) Distance/cluster-based: group values that lie close together.
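The first two methods can be sketched in a few lines of Python (helper names are mine; the equal-depth version assumes, for simplicity, that the record count divides evenly by k):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(sorted_values, k):
    """Split already-sorted values into k bins with equal counts."""
    size = len(sorted_values) // k
    return [sorted_values[i * size:(i + 1) * size] for i in range(k)]
```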

6
Q

What's an entropy-based discretization method?

A

Done recursively. Over all possible split points in the data, find the split that minimizes the (weighted) entropy. Repeat on each side of the split...
Continue as long as the information gain exceeds a threshold.
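The recursive procedure can be sketched as follows (a hypothetical implementation, not from the cards; input is sorted (value, label) pairs and the threshold is arbitrary):

```python
import math

def entropy(labels):
    """Entropy in bits of a label sequence."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(pairs):
    """Return (gain, index) of the binary split with the highest info gain."""
    labels = [y for _, y in pairs]
    base = entropy(labels)
    best = (0.0, None)
    for i in range(1, len(pairs)):
        left, right = labels[:i], labels[i:]
        new = (len(left) / len(labels) * entropy(left)
               + len(right) / len(labels) * entropy(right))
        if best[1] is None or base - new > best[0]:
            best = (base - new, i)
    return best

def discretize(pairs, threshold=0.1):
    """Recursively collect cut points while info gain exceeds the threshold."""
    gain, i = best_split(pairs)
    if i is None or gain <= threshold:
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return discretize(pairs[:i], threshold) + [cut] + discretize(pairs[i:], threshold)
```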

7
Q

How do you pre-prune or post-prune a decision tree?

A

Pre-pruning: stop growing when information gain falls below a threshold.
Post-pruning: sub-tree replacement (replace a subtree with a leaf),
or sub-tree raising (replace a subtree with one of its branches).

8
Q

Three methods for dividing data into training and test sets

A

Holdout: divide the dataset, e.g., 2/3 training and 1/3 test.
Random subsampling: repeat holdout several times with different random splits and average the results.
Cross-validation (k-fold subsets): each fold serves as the test set exactly once.
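The k-fold idea can be sketched as (a hypothetical helper, not from the cards):

```python
def k_fold_splits(records, k):
    """Yield (train, test) pairs; each fold is the test set exactly once."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test
```

With 6 records and k=3, every record appears in a test set exactly once and in a training set twice.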

9
Q

Describe a method of testing your classifier that penalizes mis-predictions

A
Cost matrix: a confusion matrix where a cost is assigned to each quadrant.
In the cancer-screening example,
TP = -1
FP = 1
FN = 100
TN = 0
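Scoring a classifier against the card's cost matrix is a one-liner over (actual, predicted) pairs (the dictionary layout and class names are mine):

```python
# Rows = actual class, columns = predicted class, using the card's numbers:
# a true positive earns -1, a false negative costs 100, etc.
COST = {("pos", "pos"): -1,   # TP
        ("neg", "pos"): 1,    # FP
        ("pos", "neg"): 100,  # FN
        ("neg", "neg"): 0}    # TN

def total_cost(actual, predicted):
    """Sum the cost-matrix entry for each (actual, predicted) pair."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))
```

A classifier with low raw error can still score badly here if its few mistakes are false negatives.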
10
Q

What's the rule-based eager method?

A

Sequential Covering Algorithm.

Ponder: "As a rule grows (gains conjuncts), certainty increases but coverage decreases."

11
Q

What weakness of trees does the Sequential Covering algorithm avoid?

A

Subtree replication: a tree must repeat the same subtree under multiple branches to express some rule sets; rules state each condition only once.

12
Q

Two attributes of a good rule set

A

Exclusive and exhaustive.
Exclusive: no record matches two rules.
Exhaustive: every inbound query matches at least one rule.

13
Q

What two attributes do you rank rules on?

A

Coverage: fraction of records that satisfy the antecedent.
Accuracy: of the records that satisfy the antecedent, the fraction that also satisfy the consequent.
If X, then Y.
Lots of X's: high coverage. X's that are mostly also Y's: high accuracy.

14
Q

Naive Bayes probability summary

A

For each class C, you need P(attribute value | C) for every attribute. E.g., if golf was played on 9 days and 3 of those were sunny, P(sunny | play) = 3/9 = 1/3.
Multiply all of those together with the prior P(C). Formally you also divide by the evidence P(X), but it is the same for every class, so just compare the numerators.
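The whole computation fits in one small function (a sketch; the data layout, with a "class" key per record, is my assumption):

```python
def naive_bayes_score(record, klass, data):
    """P(class) times the product of P(attr=value | class).
    The evidence P(X) is omitted since it is constant across classes."""
    in_class = [r for r in data if r["class"] == klass]
    score = len(in_class) / len(data)          # prior P(class)
    for attr, value in record.items():
        score *= sum(1 for r in in_class if r[attr] == value) / len(in_class)
    return score
```

Classify by computing the score for every class and taking the largest.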

15
Q

Remind yourself how to handle zeros in data

A

Laplacian correction (add-one smoothing): add 1 to the count of each attribute value, so no conditional probability is ever zero.
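The correction is a one-line adjustment to each estimated probability (function name is mine; n_values is the number of distinct values the attribute can take):

```python
def smoothed_prob(count, class_total, n_values):
    """Laplacian correction: add 1 to the count and n_values to the total,
    so an unseen value gets a small nonzero probability instead of zero."""
    return (count + 1) / (class_total + n_values)
```

E.g., an attribute value never seen among 9 in-class records, out of 3 possible values, gets probability 1/12 rather than 0 (which would zero out the whole product).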

16
Q

SVM: Support Vector Machine basics

A

This is the one that tries to find a separating hyperplane (with maximum margin) between the classes.

17
Q

What are four applications of classification analysis

A
Explaining why (descriptive modeling).
Credit approval.
Target marketing (who is likely to buy).
Judging whether a medical treatment is effective.
Medical diagnosis.
18
Q

What is KNN classification?

A

K-Nearest Neighbors: classify a query by majority vote among its k closest training records under a distance measure (Euclidean is common; cosine similarity is typical for text vectors).
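A minimal KNN sketch, assuming Euclidean distance and numeric feature tuples (both my assumptions; the cards don't fix a metric):

```python
import math
from collections import Counter

def knn_classify(query, data, k=3):
    """data: list of (vector, label). Majority vote among the k nearest
    training points by Euclidean distance."""
    nearest = sorted(data, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note there is no training step at all: this is why KNN is the canonical lazy method.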

19
Q

List all the lazy and non-lazy methods you can think of.

A

Lazy: KNN, range query (similar to KNN).
Eager: Naive Bayes, support vector machines,
tree induction (decision trees),
Sequential Covering Algorithm.

20
Q

Supervised versus Unsupervised learning

A
Supervised (e.g., tree induction): training data comes with labels or classes. This is classification.
Unsupervised: class labels are unknown; try to discover classes/labels. (Clustering.)
21
Q

What is an advantage of Gain Ratio over Information Gain

A

Gain ratio penalizes attributes that produce many small partitions (the split-on-first-name example), which information gain unfairly favors.

22
Q

What is a class-based discretization process?

A
Place breakpoints in numeric data:
1) Place a breakpoint between numeric values wherever the class changes.
2) Set a minimum number of values per interval; place breakpoints where the majority class changes.
(YNYYY) (NNYYY) (NYYN)
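Method 1 above can be sketched directly (a hypothetical helper; input is sorted (value, label) pairs, and the breakpoint is placed midway between adjacent values):

```python
def class_change_breakpoints(pairs):
    """pairs: sorted (value, label). Put a breakpoint at the midpoint of
    every adjacent pair whose labels differ."""
    return [(pairs[i - 1][0] + pairs[i][0]) / 2
            for i in range(1, len(pairs))
            if pairs[i - 1][1] != pairs[i][1]]
```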