6. Frequent Itemsets Flashcards
What is the market basket model?
Depicts the data as a large set of items and a large set of baskets, where each basket is a small subset of the items
What is the goal of the market basket model?
We want to find association rules.
Ex: People who bought {x, y, z} tend to buy {v, w}
What is support?
For a given itemset I, the support of I is the number of baskets containing all items in I
What is a frequent itemset?
Sets of items that appear in at least s baskets where s is the support threshold
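A minimal sketch of both definitions in Python, assuming baskets are represented as sets and the candidate itemsets are listed explicitly (the baskets, items, and threshold here are illustrative):

```python
# Illustrative baskets and support threshold (assumptions, not from the source)
baskets = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "beer", "diapers"},
]
s = 2  # support threshold

def support(itemset, baskets):
    """Number of baskets containing all items in the itemset."""
    return sum(1 for b in baskets if itemset <= b)

candidates = [{"milk"}, {"bread"}, {"milk", "bread"}, {"bread", "beer"}]
frequent = [I for I in candidates if support(I, baskets) >= s]  # in >= s baskets
```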
What is an association rule?
If-then rules about the contents of a basket
{i1, i2, …, ik} -> j means that if a basket contains i1, …, ik, then it is likely to also contain j
How do we calculate the confidence of an association rule?
conf(I -> j) = support(I U {j}) / support(I)
How do we calculate the interest of an association rule and when do we say a rule is interesting?
Interest(I -> j) = conf(I -> j) - Pr[j]
Interesting rules are those whose interest has high magnitude, positive or negative (usually |Interest| above 0.5)
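A short sketch of both formulas, reusing the support function and baskets from the sketch above:

```python
def confidence(I, j, baskets):
    """conf(I -> j) = support(I U {j}) / support(I)."""
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    """Interest(I -> j) = conf(I -> j) - Pr[j], where Pr[j] is the
    fraction of all baskets containing j."""
    return confidence(I, j, baskets) - support({j}, baskets) / len(baskets)

# e.g. interest({"milk"}, "bread", baskets)
```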
What is the process for mining association rules?
- Find all frequent itemsets I
- For every nonempty proper subset A of I, generate the rule A -> I \ A
- Output rules above the confidence threshold
The rationale behind this is that since I is frequent, A is also frequent (by monotonicity), so support(A) has already been computed
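A sketch of the rule-generation step, assuming the frequent itemsets and their support counts have already been found and stored in a dict keyed by frozenset (by monotonicity, every subset's support is then available):

```python
from itertools import combinations

def generate_rules(freq_supports, conf_threshold):
    """freq_supports: dict mapping frozenset -> support count for every
    frequent itemset (subsets included, by monotonicity)."""
    rules = []
    for I, sup_I in freq_supports.items():
        if len(I) < 2:
            continue
        for r in range(1, len(I)):                  # nonempty proper subsets A
            for A in map(frozenset, combinations(I, r)):
                conf = sup_I / freq_supports[A]     # conf(A -> I \ A)
                if conf >= conf_threshold:
                    rules.append((A, I - A, conf))
    return rules
```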
What is the bottleneck for finding frequent itemsets?
Main memory is the bottleneck.
As we read baskets we need to count something (items, pairs, etc.), and the number of different things we can count is limited by what fits in main memory
What is the hardest problem for finding frequent itemsets and why?
Finding frequent pairs is hardest. Frequent pairs are common, and we don't have enough memory to count all candidate pairs; for larger itemsets, the probability of being frequent drops exponentially with size, so there are far fewer candidates.
What is the naive algorithm for finding frequent pairs?
Read the file once, counting in main memory the occurrences of each pair.
From each basket of n items we generate its n(n-1)/2 pairs by two nested loops
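A sketch of the naive pass, reusing the illustrative baskets above; itertools.combinations plays the role of the two nested loops:

```python
from itertools import combinations
from collections import defaultdict

# One pass over the data, keeping every pair count in main memory
pair_counts = defaultdict(int)
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # n(n-1)/2 pairs per basket
        pair_counts[pair] += 1
```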
What is the issue with the naive algorithm for finding frequent pairs?
It fails if the square of the number of items exceeds main memory, i.e., we cannot keep a count for every possible pair
What is the triangular matrix approach to counting frequent pairs?
We count each pair of items {i, j} only once, ordered so that i < j
Keep the pair counts in a one-dimensional array, in lexicographic order
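A sketch of one standard way to linearize the triangular matrix, with items numbered 0 to n-1 and the index formula following from the lexicographic ordering:

```python
n = 4                              # number of distinct items (illustrative)
counts = [0] * (n * (n - 1) // 2)  # one count slot per possible pair

def pair_index(i, j, n):
    """Index of pair {i, j} (i < j) when pairs are stored
    in lexicographic order in a 1-d array."""
    assert 0 <= i < j < n
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

counts[pair_index(0, 2, n)] += 1   # count one occurrence of pair {0, 2}
```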
What is the triples approach to counting frequent pairs?
We keep a table of triples [i, j, c] where the count of the pair {i, j} is c
How much memory is used by the triangular matrix and triples approach for counting frequent pairs?
Triangular matrix uses 4 bytes per pair
Triples use 12 bytes per pair, but only for the pairs that actually occur (count > 0)
When is the triples approach for counting frequent pairs better than triangular matrix?
It is better if fewer than 1/3 of the possible pairs actually occur
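The 1/3 figure follows from the byte counts above: with n items there are n(n-1)/2 possible pairs, the matrix pays 4 bytes for every one of them, and triples pay 12 bytes only for the p pairs that occur. In LaTeX:

```latex
12p \;<\; 4\cdot\frac{n(n-1)}{2}
\quad\Longleftrightarrow\quad
p \;<\; \frac{1}{3}\cdot\frac{n(n-1)}{2}
```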
What is the issue with the triples and triangular matrix approaches to counting frequent pairs?
We may not have enough memory to fit all the pair counts, and swapping counts to and from disk is too slow to be a good solution
What are the 2 key ideas that the A-Priori algorithm is based on?
- Monotonicity: If a set of items I appears in at least s baskets, then so does every subset of I
- Contrapositive for pairs: If item i does not appear in s baskets, then no pair including i can appear in s baskets
What happens during the first pass of the A-Priori algorithm?
Read baskets and count in main memory the occurrences of each individual item. Any item that appears at least s times is frequent
What happens during the second pass of the A-Priori algorithm?
Read baskets again and count in main memory only the pairs where both elements are frequent
What memory is required for each pass of the A-Priori algorithm?
Pass 1: Memory proportional to number of items
Pass 2: Memory proportional to the square of the number of frequent items only, plus a list of the frequent items so we know what needs to be counted
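A compact two-pass sketch, assuming the baskets can be scanned twice and using a plain dict in place of the triangular matrix:

```python
from itertools import combinations
from collections import defaultdict

def apriori_pairs(baskets, s):
    # Pass 1: count individual items; keep those appearing >= s times
    item_counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose elements are both frequent
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```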
How do we generalize A-Priori to work for frequent triples and above?
- For each k we construct 2 sets of k-tuples
- Ck are the candidate k-tuples that may be frequent sets based on information from the pass for k-1
- Lk contains the set of truly frequent k-tuples
- Start with C1 = all items
- Count the items to get L1, the set of frequent items
- Construct C2, the candidate pairs, from L1
- Repeat: from Lk-1 construct Ck, count the candidates, and keep those meeting the threshold as Lk
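A sketch of the Ck/Lk loop; the candidate construction here (union pairs of frequent (k-1)-sets, then prune by monotonicity) is one simple way to build Ck, not necessarily the most efficient:

```python
from itertools import combinations

def apriori(baskets, s):
    # L1: the frequent individual items
    items = {i for b in baskets for i in b}
    L = [frozenset([i]) for i in items
         if sum(i in b for b in baskets) >= s]
    all_frequent, k = list(L), 2
    while L:
        prev = set(L)
        # Ck: unions of frequent (k-1)-sets that have size k, pruned so
        # that every (k-1)-subset is frequent (monotonicity)
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        Ck = {c for c in Ck
              if all(frozenset(sub) in prev
                     for sub in combinations(c, k - 1))}
        # Lk: the candidates whose support meets the threshold
        L = [c for c in Ck if sum(c <= b for b in baskets) >= s]
        all_frequent += L
        k += 1
    return all_frequent
```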
What is the issue with the A-Priori algorithm?
In pass 1, most memory is idle. We are only storing individual item counts
What do we do in the first pass of the PCY algorithm that is different from A-Priori?
In addition to the item counts, we maintain a hash table with as many buckets as fit in memory. Every pair of items in each basket is hashed to a bucket and that bucket's count is incremented. We do not record the pairs themselves, just the counts of the buckets they hash into
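A sketch of PCY's first pass; NUM_BUCKETS and Python's built-in hash are illustrative stand-ins for "as many buckets as fit in memory" and a pair hash function (hash is salted per process, but it is consistent within one run, which is all the two passes need):

```python
from itertools import combinations
from collections import defaultdict

NUM_BUCKETS = 50_021  # illustrative; in practice, as many as fit in memory

def pcy_pass1(baskets, s):
    item_counts = defaultdict(int)
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        # Hash each pair to a bucket and bump the bucket's count;
        # the pair itself is not recorded
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [int(c >= s) for c in bucket_counts]  # 1 bit per bucket
    return frequent_items, bitmap
```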
What does the count of a PCY bucket tell us?
If a bucket's total count is less than s, none of the pairs that hash to it can be frequent.
If a bucket contains a frequent pair, then the bucket is frequent (its count is at least s); the converse does not hold, so a frequent bucket only tells us its pairs are still candidates
What do we do during the second pass of the PCY algorithm?
Only count pairs {i, j} where both i and j are frequent items and {i, j} hashes to a bucket found frequent on the first pass
What memory is required for each pass of the PCY algorithm?
Pass 1: Memory for item counts and a hash table for pairs
Pass 2: Memory for the frequent items, the hash table compressed to a bitmap (one bit per bucket), and counts of the candidate pairs
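And a matching sketch of the second pass, assuming pcy_pass1 above; a pair is counted only if both items are frequent and it hashes to a frequent bucket:

```python
from itertools import combinations
from collections import defaultdict

def pcy_pass2(baskets, s, frequent_items, bitmap):
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % len(bitmap)]:  # frequent bucket only
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```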