Association Flashcards
Define support and confidence formulas
Support(ABC) = σ(ABC) / N. Confidence(AB->C) = σ(ABC) / σ(AB). Here σ(X) is the number of transactions containing X and N is the total number of transactions.
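A minimal sketch of these two formulas in Python, on a made-up toy transaction list (the data and names here are illustrative, not from the cards):

```python
# Toy transactions; sigma(X) = number of transactions containing every item of X.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]
N = len(transactions)

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    return sigma(itemset) / N                                  # sigma(ABC) / N

def confidence(antecedent, consequent):
    return sigma(antecedent | consequent) / sigma(antecedent)  # sigma(ABC) / sigma(AB)

print(support({"A", "B", "C"}))       # 2/4 = 0.5
print(confidence({"A", "B"}, {"C"}))  # 2/3 ≈ 0.67
```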
What do you know about (AB->C), (A->BC), (AC->B) and (ABC -> C)
The first three all share the same numerator, σ(ABC), so they have the same support; only the confidence denominators differ. The last one (ABC->C) doesn't make sense: C appears on both sides.
Apriori Principle
If a superset has large support, all of its subsets must have at least that support. Equivalently, if a subset is infrequent, all of its supersets must be infrequent.
Anti-monotone principle
This phenomenon is called what?
Use the book's word, and describe monotone and anti-monotone.
Similar, if not identical, to Apriori: once we find a subset that is "interesting" (satisfies the constraint), we don't have to check its supersets, since they are interesting too.
Upward closed.
Monotone: if an itemset satisfies a constraint, so do all of its supersets.
Anti-monotone: if an itemset violates a constraint, so do all of its supersets.
How many database scans does association mining with the Apriori method require?
Without Apriori: as many scans as the size of your largest k-itemset. With Apriori: as many as the size of the largest 'interesting' (frequent) k-itemset.
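A rough sketch of the level-wise loop, assuming a hypothetical `generate_candidates` helper (one version is sketched under the candidate-generation card below); the point is that each pass over the transactions handles one candidate size k:

```python
def apriori(transactions, minsup):
    # Scan 1: frequent 1-itemsets.
    items = {frozenset({i}) for t in transactions for i in t}
    frequent = {1: count_and_filter(items, transactions, minsup)}
    k = 1
    while frequent[k]:
        candidates = generate_candidates(frequent[k])          # F(k-1) x F(k-1) merge + prune
        frequent[k + 1] = count_and_filter(candidates, transactions, minsup)  # one more scan
        k += 1
    return frequent

def count_and_filter(candidates, transactions, minsup):
    counts = {c: 0 for c in candidates}
    for t in transactions:                    # the database scan for this level
        for c in counts:
            if c <= t:                        # candidate contained in the transaction
                counts[c] += 1
    return {c for c, n in counts.items() if n / len(transactions) >= minsup}
```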
4 ways to generate candidate k-itemsets.
Which require pruning?
- Brute force: generate every possible k-itemset. Prune.
- SQL join: join the DB table of (k-1)-itemsets to itself. Join logic: first k-2 items identical, last item different. Sounds familiar.
- Fk-1 x F1: combine each frequent (k-1)-itemset with each frequent 1-item. Avoid duplicates by keeping every itemset in lexicographic order. Prune.
- Fk-1 x Fk-1: merge two frequent (k-1)-itemsets only if their first k-2 items are identical and their last items differ.
All methods require pruning because all of them run the risk of generating too many candidates (including candidates with infrequent subsets).
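A sketch of the Fk-1 x Fk-1 merge plus Apriori subset pruning, with itemsets kept as sorted tuples (the example F2 is made up):

```python
from itertools import combinations

def generate_candidates(frequent_k_minus_1):
    """frequent_k_minus_1: set of sorted (k-1)-tuples that are frequent."""
    prev = sorted(frequent_k_minus_1)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            # Merge only if the first k-2 items match and the last items differ.
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.add(tuple(sorted(set(a) | set(b))))
    # Prune: every (k-1)-subset of a surviving candidate must itself be frequent.
    return {c for c in candidates
            if all(sub in frequent_k_minus_1 for sub in combinations(c, len(c) - 1))}

f2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
print(generate_candidates(f2))
# {('A', 'B', 'C')} -- ('B', 'C', 'D') is pruned because ('C', 'D') is not frequent
```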
An itemset is closed if…
An itemset is maximal if…
Closed: X is closed if every proper superset of X has strictly lower support.
Closed sets are saved; non-closed sets can be discarded, since their support can be recovered from the closed sets.
Maximal: X is maximal if X is frequent and none of its supersets are frequent.
Given this list of closed frequent itemsets:
{C} = 0.3 , {A,C} = 0.2
{B,E} = 0.3 , {B,C,E} = 0.2
What do you know about:
E? EC? D? A? B?
E = 0.3 (its closed superset {B,E} has support 0.3).
EC = 0.2 (its only closed superset is {B,C,E}).
A = 0.2 (from {A,C}); B = 0.3 (from {B,E}).
D appears in no closed frequent itemset, so D is not frequent; its exact support is unknown.
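A small sketch of the lookup rule used above (the support of a frequent itemset is the maximum support among its closed supersets), using the numbers from this card:

```python
closed = {
    frozenset({"C"}): 0.3,
    frozenset({"A", "C"}): 0.2,
    frozenset({"B", "E"}): 0.3,
    frozenset({"B", "C", "E"}): 0.2,
}

def support_from_closed(itemset):
    sups = [s for c, s in closed.items() if itemset <= c]
    return max(sups) if sups else None    # None: not covered, so not frequent

for q in [{"E"}, {"E", "C"}, {"A"}, {"B"}, {"D"}]:
    print(sorted(q), support_from_closed(frozenset(q)))
# ['E'] 0.3 / ['C', 'E'] 0.2 / ['A'] 0.2 / ['B'] 0.3 / ['D'] None
```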
What benefit comes from knowing maximal itemsets?
From them, you can derive all frequent itemsets: every non-empty subset of a maximal itemset is frequent.
You don't know their support, but you know they are frequent.
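A sketch of that derivation, enumerating the subsets of some hypothetical maximal itemsets:

```python
from itertools import combinations

maximal = [{"A", "C"}, {"B", "C", "E"}]   # hypothetical maximal frequent itemsets

def all_frequent(maximal_sets):
    freq = set()
    for m in maximal_sets:
        items = sorted(m)
        for r in range(1, len(items) + 1):
            freq.update(map(frozenset, combinations(items, r)))
    return freq

print(sorted(map(sorted, all_frequent(maximal))))
# [['A'], ['A','C'], ['B'], ['B','C'], ['B','C','E'], ['B','E'], ['C'], ['C','E'], ['E']]
```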
Confidence-based pruning:
Which of these do we care about?
ABC->D > threshold (pass)
ABC->D < threshold
C->ABD > threshold
C->ABD < threshold
The one we care about is ABC->D under threshold:
we can then prune every rule whose antecedent is a subset of ABC,
so AB->CD, AC->BD, BC->AD (and A->BCD, etc.) are out.
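A rough sketch of rule generation with this pruning for a single frequent itemset, assuming a σ(·) count function is available from the counting phase (the names here are illustrative):

```python
from itertools import combinations

def rules_with_pruning(itemset, sigma, minconf):
    """Rules from one frequent itemset. If a rule fails the confidence threshold,
    every rule with a larger consequent (smaller antecedent, so a larger
    denominator sigma(antecedent)) must also fail, and the branch is dropped."""
    itemset = frozenset(itemset)
    kept = []
    consequents = [frozenset({i}) for i in itemset]    # start with 1-item consequents
    while consequents:
        passing = []
        for cons in consequents:
            ante = itemset - cons
            if not ante:
                continue
            conf = sigma(itemset) / sigma(ante)
            if conf >= minconf:
                kept.append((ante, cons, conf))
                passing.append(cons)                   # only passing branches grow
        # Merge passing consequents that differ by one item to get the next size.
        consequents = list({a | b for a, b in combinations(passing, 2)
                            if len(a | b) == len(a) + 1})
    return kept
```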
How to implement Multiple Minimum Supports?
What does this do to your monotonicity?
Minimum support based on 1-itemsets.
MST(Coke) = 3%
MST(Diamonds) = 0.1%
MST(Coke, Diamonds) = min(MST(Coke), MST(Diamonds)) = 0.1%
Result is no longer anti-monotone.
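A tiny sketch of the min rule, with hypothetical per-item thresholds matching this card:

```python
MIS = {"coke": 0.03, "diamonds": 0.001}   # per-item minimum supports

def mis(itemset):
    return min(MIS[i] for i in itemset)

print(mis({"coke"}))                # 0.03
print(mis({"coke", "diamonds"}))    # 0.001
# {coke} can fail its 3% threshold while {coke, diamonds} clears its 0.1% one,
# which is why support pruning is no longer anti-monotone here.
```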
Give examples of constraints that are monotone and anti-monotone.
Sum with LessThan, i.e. sum(S) <= v over non-negative values, is anti-monotone; sum(S) >= v is monotone.
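A quick worked sketch of both behaviors on hypothetical non-negative prices:

```python
price = {"A": 2, "B": 5, "C": 9}   # made-up item prices

def total(itemset):
    return sum(price[i] for i in itemset)

# sum <= v can be lost by growing the itemset (anti-monotone constraint):
print(total({"A"}) <= 8, total({"A", "C"}) <= 8)   # True, False

# sum >= v, once satisfied, stays satisfied for supersets (monotone constraint):
print(total({"C"}) >= 8, total({"B", "C"}) >= 8)   # True, True
```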
What datatypes are common for attributes?
Boolean (binomial)
Discrete (categorical)
Integer (e.g., age)
Ordinal (ordered categories)
What's an eager classification method?
What is the other type? Give examples of both.
Eager methods build the model (e.g., the decision tree) in advance; lazy methods defer the work until a new instance needs classifying.
Eager:
Hunt's recursive decision-tree algorithm.
Lazy:
Naive Bayes; k-nearest neighbors (instance-based).
Name some preprocessing steps
Fusing data from multiple sources
Cleaning: removing duplicates and noise
Selecting relevant records
What's post-processing?
Usually data visualization.
Verifying the results with statistical measures and hypothesis testing.
Ensures only valid and useful results enter the Decision Support System.
Classification Vocab.
Give descriptive terms for attributes in classification.
The target variable is sometimes called the dependent variable.
The other attributes are called explanatory or independent variables.
What are the major methods used in cluster analysis?
K-Means (partitional; sounds similar to genetic programming)
Hierarchical: builds something like a family tree (a dendrogram)
Density-based methods
What is usually larger? The support ratio or the confidence ratio?
Confidence, always.
Support = σ(AB) / N
Confidence = σ(AB) / σ(A); since σ(A) <= N, confidence >= support.
Name some places where anti-monotonicity is handy
Obviously, Apriori support counting (support-based pruning).
Rule generation (if ABC->D doesn't satisfy the confidence threshold, then AB->CD won't).
The candidate-generation phase of constraint-based mining.
Discuss why infrequent itemsets can be important.
If {X, Y} is infrequent even when minsup is very low,
it is likely that the 'negative' itemsets {X, ¬Y}, {¬X, Y}, or {¬X, ¬Y} are interesting (a negative association).
Define lift, and how you screwed up the homework
Lift(X->Y) = sup(X∪Y) / (sup(X) * sup(Y)) = N * σ(X∪Y) / (σ(X) * σ(Y)).
There's an extra N in there: the N's in sup(XY) = σ(XY)/N and sup(X) = σ(X)/N cancel (that ratio is just the confidence), but sup(Y) = σ(Y)/N still carries an N, so written in counts an N reappears in the numerator.
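A tiny numeric sketch of that extra N, with made-up counts:

```python
N = 1000                                   # total transactions
sigma_X, sigma_Y, sigma_XY = 200, 250, 80  # hypothetical counts

lift = (sigma_XY / N) / ((sigma_X / N) * (sigma_Y / N))
print(lift)                                # 1.6 == N * sigma_XY / (sigma_X * sigma_Y)

print(sigma_XY / (sigma_X * sigma_Y))      # 0.0016 -- the "forgot the N" answer, off by N
```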
How are support and confidence affected as you go down a hierarchy,
say from Computer to Laptop to Dell?
For Computer -> Printer vs. Dell -> Printer:
Support decreases: fewer transactions contain Dell than contain Computer, so σ(Dell, Printer) <= σ(Computer, Printer).
Confidence tends to increase because the denominator σ(Dell) shrinks, but the numerator shrinks too, so it isn't guaranteed.
How does decreasing the support threshold increase computational complexity?
Two ways:
You get more frequent candidates, so more counting.
You get larger itemsets, so more passes over the data.
What 3 factors affect association computational complexity?
Number of items
Number of transactions
Average Transaction Width
3 factors we use to describe itemset-generation algorithms (and more!)
Complete: finds all the frequent itemsets.
Sound: finds only frequent (interesting) itemsets.
Efficient: avoids generating the same itemset twice.