6. Frequent Itemsets Flashcards
What is the market basket model?
Models the data as a large set of items and a large number of baskets; each basket is a small subset of the items.
What is the goal of the market basket model?
We want to find association rules.
Ex: People who bought {x, y, z} tend to buy {v, w}
What is support?
For a given itemset I, the number of baskets containing all of the items in I
What is a frequent itemset?
Sets of items that appear in at least s baskets where s is the support threshold
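As a concrete sketch (the basket contents and the threshold s = 2 are made-up examples), support and frequent itemsets can be computed by brute force:

```python
from itertools import combinations

# Hypothetical example baskets; s is the support threshold.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk", "diapers"},
    {"bread", "beer"},
]
s = 2

def support(itemset, baskets):
    """Number of baskets containing every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)

# Frequent singletons and pairs (brute force; fine for tiny data).
items = set().union(*baskets)
frequent = [frozenset(c)
            for k in (1, 2)
            for c in combinations(sorted(items), k)
            if support(set(c), baskets) >= s]
```

Here {milk, bread} appears in 2 of the 4 baskets, so it is frequent at s = 2, while {diapers} (1 basket) is not.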
What is an association rule?
If-then rules about the contents of a basket
{i1, i2, …, ik} -> j means if a basket contains i1…ik it is likely to contain j
How do we calculate the confidence of an association rule?
conf(I -> j) = support(I U {j}) / support(I)
How do we calculate the interest of an association rule and when do we say a rule is interesting?
Interest(I -> j) = conf(I -> j) - Pr[j]
Interesting rules have interest values that are large in absolute value (typically |Interest| > 0.5), whether positive or negative
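A small sketch of both formulas (the basket data is a made-up example):

```python
# Hypothetical example baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk", "diapers"},
    {"bread", "beer"},
]

def support(itemset, baskets):
    """Number of baskets containing every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    # conf(I -> j) = support(I U {j}) / support(I)
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    # Interest(I -> j) = conf(I -> j) - Pr[j], where Pr[j] is the
    # fraction of all baskets that contain j.
    return confidence(I, j, baskets) - support({j}, baskets) / len(baskets)
```

For example, conf({milk} -> bread) = 2/3 while Pr[bread] = 3/4, so the rule has slightly negative interest: buying milk makes bread marginally *less* likely than average here.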
What is the process for mining association rules?
- Find all frequent itemsets I
- For every nonempty proper subset A of I, generate the rule A -> I \ A
- Output rules above the confidence threshold
The rationale behind this is that since I is frequent, A is also frequent
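The steps above can be sketched as follows (the baskets and threshold are made-up examples; note that conf(A -> I \ A) = support(I) / support(A)):

```python
from itertools import combinations

# Hypothetical example baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk", "diapers"},
    {"bread", "beer"},
]

def support(itemset, baskets):
    return sum(1 for b in baskets if itemset <= b)

def mine_rules(frequent_itemsets, baskets, conf_threshold):
    """For each frequent itemset I, generate A -> I \\ A for every
    nonempty proper subset A, keeping rules above the threshold."""
    rules = []
    for I in frequent_itemsets:
        for r in range(1, len(I)):
            for A in map(frozenset, combinations(I, r)):
                # conf(A -> I \ A) = support(I) / support(A)
                conf = support(I, baskets) / support(A, baskets)
                if conf >= conf_threshold:
                    rules.append((A, I - A, conf))
    return rules
```

On this data, the frequent itemset {milk, bread} yields the two rules {milk} -> {bread} and {bread} -> {milk}, each with confidence 2/3.
<test>
rules = mine_rules([frozenset({"milk", "bread"})], baskets, 0.6)
assert len(rules) == 2
assert all(abs(conf - 2/3) < 1e-9 for _, _, conf in rules)
assert any(A == frozenset({"milk"}) and B == frozenset({"bread"}) for A, B, _ in rules)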
What is the bottleneck for finding frequent itemsets?
Main memory is the bottleneck.
We need to count something within the baskets but the number of different things we can count is limited by main memory
What is the hardest problem for finding frequent itemsets and why?
Finding frequent pairs is hardest. Frequent pairs are common, and main memory usually cannot hold a count for every candidate pair. The probability of being frequent drops exponentially with itemset size, so frequent triples and larger sets are far rarer.
What is the naive algorithm for finding frequent pairs?
Read the file once, counting in main memory the occurrences of each pair.
From each basket of n items we generate its n(n-1)/2 pairs by two nested loops
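The naive algorithm might be sketched like this (the basket data is a made-up example; sorting each basket gives every pair a canonical (i, j) order with i < j):

```python
from collections import defaultdict

# Hypothetical example baskets.
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "beer"],
    ["bread", "beer"],
]

def naive_pair_counts(baskets):
    """One pass over the baskets; two nested loops generate the
    n(n-1)/2 pairs of each basket of n items."""
    counts = defaultdict(int)
    for basket in baskets:
        items = sorted(basket)  # canonical order, so each pair is counted once
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                counts[(items[a], items[b])] += 1
    return dict(counts)
```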
What is the issue with the naive algorithm for finding frequent pairs?
It fails if (#items)² exceeds main memory: with n items there are n(n-1)/2 pair counts to keep, which may not fit
What is the triangular matrix approach to counting frequent pairs?
Number the items 1, …, n and count each pair {i, j} only once, requiring i < j.
Keep the pair counts in a one-dimensional array, in lexicographic order of (i, j).
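A sketch of the flat-array indexing, assuming items have been renumbered 0 … n-1 (a 0-based variant of the usual lexicographic layout):

```python
def pair_index(i, j, n):
    """Index of pair {i, j} (0 <= i < j < n) in a flat array of
    n*(n-1)//2 counts stored in lexicographic order of (i, j)."""
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

def triangular_counts(baskets, n):
    """Count all pairs using the triangular-matrix layout."""
    counts = [0] * (n * (n - 1) // 2)
    for basket in baskets:
        items = sorted(basket)
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                counts[pair_index(items[a], items[b], n)] += 1
    return counts
```

With n = 4, the pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) map to array slots 0 through 5, so no space is wasted on pairs with i >= j.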
What is the triples approach to counting frequent pairs?
We keep a table of triples [i, j, c] where the count of the pair {i, j} is c
How much memory is used by the triangular matrix and triples approach for counting frequent pairs?
The triangular matrix uses 4 bytes per pair, for all possible pairs.
The triples approach uses 12 bytes per pair, but only for pairs that actually occur (count > 0).
Triples therefore beat the triangular matrix when fewer than 1/3 of all possible pairs occur.