Association Flashcards
Define support and confidence formulas
Support(ABC) = σ(ABC) / N, where σ is the support count and N is the total number of transactions.
Confidence(AB->C) = σ(ABC) / σ(AB)
What do you know about (AB->C), (A->BC), (AC->B) and (ABC -> C)
The first three all share the same numerator, σ(ABC), so they have the same support (their confidences differ because the denominators differ). The last one doesn't make sense: C appears on both sides of the rule.
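A minimal sketch of these formulas in Python (the transactions and item names below are made up for illustration):

# Sketch: support and confidence from a toy transaction list.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "C", "D"},
]
N = len(transactions)

def sigma(itemset):
    """Support count: how many transactions contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

support_ABC = sigma({"A", "B", "C"}) / N                  # σ(ABC) / N
conf_AB_C = sigma({"A", "B", "C"}) / sigma({"A", "B"})    # σ(ABC) / σ(AB)
conf_AC_B = sigma({"A", "B", "C"}) / sigma({"A", "C"})    # same numerator, different denominator
print(support_ABC, conf_AB_C, conf_AC_B)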
Apriori Principle
If a superset is frequent, all of its subsets must be frequent; if a subset is infrequent, all of its supersets must be infrequent (support can only shrink as items are added).
Anti-monotone principle
This phenomenon is called what?
Use the book's word, and describe monotone and anti-monotone.
Similar if not identical to Apriori: a superset can never be more "interesting" (have higher support) than its subsets, so once we have found all the interesting subsets we don't have to worry about their supersets being more interesting.
Upward closed.
Monotone: if an itemset satisfies a constraint, so do all of its supersets.
Anti-monotone: if an itemset violates a constraint, so do all of its supersets.
How many database scans does an association using the Apriori method require?
Without Apriori: as many scans as the size of your largest k-itemset. With Apriori: as many as the size of the largest "interesting" (frequent) k-itemset.
4 ways to Generate candidate k-itemsets.
Which require pruning?
- Brute force: Generate every possible k-itemset. Prune.
- SQL Join: Join the DB table of frequent (k-1)-itemsets to itself. Join logic: first k-2 items identical, last item different (e.g., first two identical, third different). Sounds familiar: it is the same merge condition as Fk-1 x Fk-1 below.
- Fk-1 x F1: Combine each frequent (k-1)-itemset with each frequent 1-itemset. Avoid duplicates by keeping every itemset in lexicographic order. Prune.
- Fk-1 x Fk-1: Merge two frequent (k-1)-itemsets only if their first k-2 items are identical and their last items differ (see the sketch after this list).
All methods require pruning, because all of them risk generating candidates that are not actually frequent.
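A minimal sketch of the Fk-1 x Fk-1 merge plus subset pruning, assuming itemsets are stored as sorted tuples (toy code, not the book's):

# Sketch of Fk-1 x Fk-1 candidate generation with subset pruning.
# frequent_k1 holds the frequent (k-1)-itemsets as sorted tuples.
from itertools import combinations

def generate_candidates(frequent_k1):
    frequent = set(frequent_k1)
    candidates = set()
    for a in frequent_k1:
        for b in frequent_k1:
            # Merge only if the first k-2 items match and the last items differ.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune: every (k-1)-subset of the candidate must itself be frequent.
                if all(sub in frequent for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates

# Example: frequent 2-itemsets -> candidate 3-itemsets.
f2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(generate_candidates(f2))  # {('A', 'B', 'C')}; ('B','C','D') is pruned because ('C','D') is not frequent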
An itemset is closed if…
An itemset is maximal if…
Closed: X is closed if every proper superset of X has strictly lower support.
Closed sets will be saved. Non-closed sets can be discarded for convenience (their support can be recovered from the closed sets).
Maximal: X is maximal if it is frequent and none of its supersets are frequent.
Given, this list of closed frequent itemsets:
{C} = 0.3 , {A,C} = 0.2
{B,E} = 0.3 , {B,C,E} = 0.2
What do you know about:
E? EC? D? A? B?
E = 0.3 (its closed supersets are {B,E} = 0.3 and {B,C,E} = 0.2; take the max).
CE = 0.2 (its only closed superset is {B,C,E}).
A = 0.2 (from {A,C}); B = 0.3 (from {B,E}).
D appears in no closed frequent itemset, so D is not frequent (its exact support is unknown).
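A sketch of the rule being used here: the support of any frequent itemset equals the largest support among the closed frequent itemsets that contain it (toy code):

# Sketch: derive supports from the closed frequent itemsets above.
closed = {
    frozenset({"C"}): 0.3,
    frozenset({"A", "C"}): 0.2,
    frozenset({"B", "E"}): 0.3,
    frozenset({"B", "C", "E"}): 0.2,
}

def support_from_closed(itemset):
    """Max support over closed supersets; None means the itemset is not frequent."""
    sups = [s for c, s in closed.items() if itemset <= c]
    return max(sups) if sups else None

for x in [{"E"}, {"C", "E"}, {"A"}, {"B"}, {"D"}]:
    print(sorted(x), support_from_closed(frozenset(x)))
# E -> 0.3, CE -> 0.2, A -> 0.2, B -> 0.3, D -> None (not frequent)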
What benefit comes from knowing maximal itemsets?
From them you can derive all frequent itemsets: every frequent itemset is a subset of some maximal one.
You don't know their supports, but you know they are frequent ("important").
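A sketch of that derivation, with made-up maximal itemsets:

# Sketch: enumerate all frequent itemsets from the maximal ones (supports stay unknown).
from itertools import combinations

maximal = [{"A", "C"}, {"B", "C", "E"}]  # made-up maximal frequent itemsets
frequent = set()
for m in maximal:
    items = sorted(m)
    for r in range(1, len(items) + 1):
        frequent.update(combinations(items, r))
print(sorted(frequent))
# [('A',), ('A','C'), ('B',), ('B','C'), ('B','C','E'), ('B','E'), ('C',), ('C','E'), ('E',)]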
Confidence-based pruning:
Which of these do we care about?
ABC->D > threshold (pass)
ABC->D < threshold
C->ABD > threshold
C->ABD < threshold
ABC->D below the threshold is the one we care about.
Once it fails, every rule over the same itemset whose antecedent is a subset of ABC can be pruned.
AB->CD, AC->BD, BC->AD (and A->BCD, etc.) are all out.
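A tiny check of why the failing rule prunes the rest (the counts below are made up): all of these rules share the numerator σ(ABCD), and shrinking the antecedent can only enlarge the denominator.

# Sketch: rules over the same itemset ABCD share the numerator sigma(ABCD).
# Moving items from the antecedent to the consequent enlarges the antecedent's
# support count, so confidence can only drop.
sigma = {("A","B","C","D"): 10, ("A","B","C"): 40, ("A","B"): 80}   # made-up counts
conf_ABC_D = sigma[("A","B","C","D")] / sigma[("A","B","C")]  # 0.25
conf_AB_CD = sigma[("A","B","C","D")] / sigma[("A","B")]      # 0.125
assert conf_AB_CD <= conf_ABC_D  # so if ABC->D fails the threshold, AB->CD fails too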
How to implement Multiple Minimum Supports?
What does this do to your monotonicity?
Assign each item (1-itemset) its own minimum support threshold (MST); an itemset's threshold is the minimum over its items.
MST(Coke) = 3%
MST(Diamonds) = 0.1%
MST(Coke, Diamonds) = min(MST(Coke), MST(Diamonds)) = 0.1%
Result is no longer anti-monotone.
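A tiny sketch of the item-wise thresholds and the broken anti-monotonicity (all numbers are made up):

# Sketch: multiple minimum supports (MST per item; itemset MST = min over its items).
MST = {"Coke": 0.03, "Diamonds": 0.001}
support = {"Coke": 0.02, ("Coke", "Diamonds"): 0.0015}  # made-up supports

def mst(itemset):
    return min(MST[i] for i in itemset)

print(support["Coke"] >= MST["Coke"])                               # False: {Coke} is infrequent
print(support[("Coke", "Diamonds")] >= mst(("Coke", "Diamonds")))   # True: its superset is frequent
# A frequent itemset now has an infrequent subset, so the property is no longer anti-monotone.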
Give examples of constraints that are monotone and anti-monotone.
Anti-monotone: sum(S) <= v over non-negative values (adding items can only raise the sum, so once violated it stays violated). Monotone: sum(S) >= v (once satisfied, adding items keeps it satisfied). Minimum support itself is anti-monotone.
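A tiny check that sum(S) <= v is anti-monotone for non-negative prices (prices made up):

# Sketch: once an itemset violates sum(S) <= v, every superset violates it too.
price = {"A": 3, "B": 5, "C": 9}
v = 10

def satisfies(itemset):
    return sum(price[i] for i in itemset) <= v

print(satisfies({"A", "B"}))       # True  (8 <= 10)
print(satisfies({"A", "C"}))       # False (12 > 10)
print(satisfies({"A", "B", "C"}))  # False: the superset of a violating set also violates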
What datatypes are common for attributes?
Boolean (binomial)
Discrete (nominal)
Integer like age
Ordinal - ordered
Whats an eager classification method?
What is the other type, give examples of both
Eager methods build the model (e.g., a decision tree) in advance, before any test records are seen; lazy methods delay the work until a record must be classified.
Eager:
Hunt’s recursive decision tree algorithm.
Lazy:
k-nearest neighbor (the stored training records are only consulted when a test record arrives).
Name some preprocessing steps
Fusing data from multiple sources
Cleaning: Remove duplicates and noise
Selecting relevant records