6. Frequent Itemsets Flashcards
What is the market basket model?
Depicts the data as a large set of items and a large set of baskets, where each basket is a small subset of the items
What is the goal of the market basket model?
We want to find association rules.
Ex: People who bought {x, y, z} tend to buy {v, w}
What is support?
For a given itemset I, the support of I is the number of baskets containing all items in I
What is a frequent itemset?
Sets of items that appear in at least s baskets where s is the support threshold
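A minimal sketch of both definitions in Python, assuming baskets are represented as sets and the candidate itemsets are listed explicitly (the baskets, items, and threshold here are illustrative):

```python
# Illustrative baskets and support threshold (assumptions, not from the source)
baskets = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "beer", "diapers"},
]
s = 2  # support threshold

def support(itemset, baskets):
    """Number of baskets containing all items in the itemset."""
    return sum(1 for b in baskets if itemset <= b)

candidates = [{"milk"}, {"bread"}, {"milk", "bread"}, {"bread", "beer"}]
frequent = [I for I in candidates if support(I, baskets) >= s]  # in >= s baskets
```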
What is an association rule?
If-then rules about the contents of a basket
{i1, i2, …, ik} -> j means that if a basket contains i1, …, ik, then it is likely to also contain j
How do we calculate the confidence of an association rule?
conf(I -> j) = support(I U {j}) / support(I)
How do we calculate the interest of an association rule and when do we say a rule is interesting?
Interest(I -> j) = conf(I -> j) - Pr[j]
Interesting rules are those whose interest has high magnitude, positive or negative (usually |Interest| above 0.5)
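A short sketch of both formulas, reusing the support function and baskets from the sketch above:

```python
def confidence(I, j, baskets):
    """conf(I -> j) = support(I U {j}) / support(I)."""
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    """Interest(I -> j) = conf(I -> j) - Pr[j], where Pr[j] is the
    fraction of all baskets containing j."""
    return confidence(I, j, baskets) - support({j}, baskets) / len(baskets)

# e.g. interest({"milk"}, "bread", baskets)
```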
What is the process for mining association rules?
- Find all frequent itemsets I
- For every nonempty proper subset A of I, generate the rule A -> I \ A
- Output rules above the confidence threshold
The rationale behind this is that since I is frequent, A is also frequent (by monotonicity), so support(A) has already been computed
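A sketch of the rule-generation step, assuming the frequent itemsets and their support counts have already been found and stored in a dict keyed by frozenset (by monotonicity, every subset's support is then available):

```python
from itertools import combinations

def generate_rules(freq_supports, conf_threshold):
    """freq_supports: dict mapping frozenset -> support count for every
    frequent itemset (subsets included, by monotonicity)."""
    rules = []
    for I, sup_I in freq_supports.items():
        if len(I) < 2:
            continue
        for r in range(1, len(I)):                  # nonempty proper subsets A
            for A in map(frozenset, combinations(I, r)):
                conf = sup_I / freq_supports[A]     # conf(A -> I \ A)
                if conf >= conf_threshold:
                    rules.append((A, I - A, conf))
    return rules
```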
What is the bottleneck for finding frequent itemsets?
Main memory is the bottleneck.
As we read baskets we need to count something (items, pairs, etc.), and the number of different things we can count is limited by what fits in main memory
What is the hardest problem for finding frequent itemsets and why?
Finding frequent pairs is hardest. Frequent pairs are common, and we don't have enough memory to count all candidate pairs; for larger itemsets, the probability of being frequent drops exponentially with size, so there are far fewer candidates.
What is the naive algorithm for finding frequent pairs?
Read the file once, counting in main memory the occurrences of each pair.
From each basket of n items we generate its n(n-1)/2 pairs by two nested loops
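A sketch of the naive pass, reusing the illustrative baskets above; itertools.combinations plays the role of the two nested loops:

```python
from itertools import combinations
from collections import defaultdict

# One pass over the data, keeping every pair count in main memory
pair_counts = defaultdict(int)
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # n(n-1)/2 pairs per basket
        pair_counts[pair] += 1
```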
What is the issue with the naive algorithm for finding frequent pairs?
It fails if the square of the number of items exceeds main memory, i.e., we cannot keep a count for every possible pair
What is the triangular matrix approach to counting frequent pairs?
We count each pair of items {i, j} only once, ordered so that i < j
Keep the pair counts in a one-dimensional array, in lexicographic order
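A sketch of one standard way to linearize the triangular matrix, with items numbered 0 to n-1 and the index formula following from the lexicographic ordering:

```python
n = 4                              # number of distinct items (illustrative)
counts = [0] * (n * (n - 1) // 2)  # one count slot per possible pair

def pair_index(i, j, n):
    """Index of pair {i, j} (i < j) when pairs are stored
    in lexicographic order in a 1-d array."""
    assert 0 <= i < j < n
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

counts[pair_index(0, 2, n)] += 1   # count one occurrence of pair {0, 2}
```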
What is the triples approach to counting frequent pairs?
We keep a table of triples [i, j, c] where the count of the pair {i, j} is c
How much memory is used by the triangular matrix and triples approach for counting frequent pairs?
Triangular matrix uses 4 bytes per pair
Triples use 12 bytes per pair, but only for the pairs that actually occur (count > 0)
When is the triples approach for counting frequent pairs better than triangular matrix?
It is better if fewer than 1/3 of the possible pairs actually occur
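The 1/3 figure follows from the byte counts above: with n items there are n(n-1)/2 possible pairs, the matrix pays 4 bytes for every one of them, and triples pay 12 bytes only for the p pairs that occur. In LaTeX:

```latex
12p \;<\; 4\cdot\frac{n(n-1)}{2}
\quad\Longleftrightarrow\quad
p \;<\; \frac{1}{3}\cdot\frac{n(n-1)}{2}
```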
What is the issue with the triples and triangular matrix approaches to counting frequent pairs?
We may not have enough memory to fit all the pair counts, and swapping counts to and from disk is too slow to be a good solution
What are the 2 key ideas that the A-Priori algorithm is based on?
- Monotonicity: If a set of items I appears in at least s baskets, then so does every subset of I
- Contrapositive for pairs: If item i does not appear in s baskets, then no pair including i can appear in s baskets
What happens during the first pass of the A-Priori algorithm?
Read baskets and count in main memory the occurrences of each individual item. Any item that appears at least s times is frequent
What happens during the second pass of the A-Priori algorithm?
Read baskets again and count in main memory only the pairs where both elements are frequent
What memory is required for each pass of the A-Priori algorithm?
Pass 1: Memory proportional to number of items
Pass 2: Memory proportional to the square of the number of frequent items only, plus a list of the frequent items so we know what needs to be counted
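A compact two-pass sketch, assuming the baskets can be scanned twice and using a plain dict in place of the triangular matrix:

```python
from itertools import combinations
from collections import defaultdict

def apriori_pairs(baskets, s):
    # Pass 1: count individual items; keep those appearing >= s times
    item_counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose elements are both frequent
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```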
How do we generalize A-Priori to work for frequent triples and above?
- For each k we construct 2 sets of k-tuples
- Ck are the candidate k-tuples that may be frequent sets based on information from the pass for k-1
- Lk contains the set of truly frequent k-tuples
- Start with C1 = all items
- Count the items to get L1, the set of frequent items
- Construct C2, the candidate pairs, from L1
- Repeat: from Lk-1 construct Ck, count the candidates, and keep those meeting the threshold as Lk
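A sketch of the Ck/Lk loop; the candidate construction here (union pairs of frequent (k-1)-sets, then prune by monotonicity) is one simple way to build Ck, not necessarily the most efficient:

```python
from itertools import combinations

def apriori(baskets, s):
    # L1: the frequent individual items
    items = {i for b in baskets for i in b}
    L = [frozenset([i]) for i in items
         if sum(i in b for b in baskets) >= s]
    all_frequent, k = list(L), 2
    while L:
        prev = set(L)
        # Ck: unions of frequent (k-1)-sets that have size k, pruned so
        # that every (k-1)-subset is frequent (monotonicity)
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        Ck = {c for c in Ck
              if all(frozenset(sub) in prev
                     for sub in combinations(c, k - 1))}
        # Lk: the candidates whose support meets the threshold
        L = [c for c in Ck if sum(c <= b for b in baskets) >= s]
        all_frequent += L
        k += 1
    return all_frequent
```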
What is the issue with the A-Priori algorithm?
In pass 1, most memory is idle. We are only storing individual item counts
What do we do in the first pass of the PCY algorithm that is different from A-Priori?
In addition to the item counts, we maintain a hash table with as many buckets as fit in memory. Every pair of items in each basket is hashed to a bucket and that bucket's count is incremented. We do not record the pairs themselves, just the counts of the buckets they hash into
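A sketch of PCY's first pass; NUM_BUCKETS and Python's built-in hash are illustrative stand-ins for "as many buckets as fit in memory" and a pair hash function (hash is salted per process, but it is consistent within one run, which is all the two passes need):

```python
from itertools import combinations
from collections import defaultdict

NUM_BUCKETS = 50_021  # illustrative; in practice, as many as fit in memory

def pcy_pass1(baskets, s):
    item_counts = defaultdict(int)
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        # Hash each pair to a bucket and bump the bucket's count;
        # the pair itself is not recorded
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [int(c >= s) for c in bucket_counts]  # 1 bit per bucket
    return frequent_items, bitmap
```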
What does the count of a PCY bucket tell us?
If a bucket's total count is less than s, none of the pairs that hash to it can be frequent.
If a bucket contains a frequent pair, then the bucket is frequent (its count is at least s); the converse does not hold, so a frequent bucket only tells us its pairs are still candidates
What do we do during the second pass of the PCY algorithm?
Only count pairs {i, j} where both i and j are frequent items and {i, j} hashes to a bucket found frequent on the first pass
What memory is required for each pass of the PCY algorithm?
Pass 1: Memory for item counts and a hash table for pairs
Pass 2: Memory for the frequent items, the hash table compressed to a bitmap (one bit per bucket), and counts of the candidate pairs
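And a matching sketch of the second pass, assuming pcy_pass1 above; a pair is counted only if both items are frequent and it hashes to a frequent bucket:

```python
from itertools import combinations
from collections import defaultdict

def pcy_pass2(baskets, s, frequent_items, bitmap):
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % len(bitmap)]:  # frequent bucket only
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```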