6. Frequent Itemsets Flashcards

1
Q

What is the market basket model?

A

Depicts data as a large set of items which may fall into any number of a large amount of baskets. Each basket is a small subset of items

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the goal of the market basket model?

A

We want to find association rules.
Ex: People who bought {x, y, z} tend to buy {v, w}

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is support?

A

The number of baskets containing all items in I for a given itemset I

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is a frequent itemset?

A

Sets of items that appear in at least s baskets where s is the support threshold

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is an association rule?

A

If-then rules about the contents of a basket
{i1, i2, …, ik} -> j means if a basket contains i1…ik it is likely to contain j

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How do we calculate the confidence of an association rule?

A

conf(I -> j) = support(I U j) / support(I)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How do we calculate the interest of an association rule and when do we say a rule is interesting?

A

Interest(I -> j) = conf(I -> j) - Pr[j]
Interesting rules have high positive or negative interest values (usually above 0.5)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the process for mining association rules?

A
  1. Find all frequent itemsets I
  2. For every subset A of I generate rule A -> I \ A
  3. Output rules above the confidence threshold

The rationale behind this is that since I is frequent, A is also frequent

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the bottleneck for finding frequent itemsets?

A

Main memory is the bottleneck.
We need to count something within the baskets but the number of different things we can count is limited by main memory

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is the hardest problem for finding frequent itemsets and why?

A

Finding frequent pairs is hardest because frequent pairs are common and we don’t have enough memory to store them all. The probability of being frequent drops exponentially with size.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is the naive algorithm for finding frequent pairs?

A

Read the file once, counting in main memory the occurrences of each pair.

From each basket of n items we generate its n(n-1)/2 pairs by two nested loops

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the issue with the naive algorithm for finding frequent pairs?

A

It fails if the number of items squared exceeds main memory

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is the triangular matrix approach to counting frequent pairs?

A

We count a pair of items {i, j} if i < j
Keep pair counts in lexicographic order

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is the triples approach to counting frequent pairs?

A

We keep a table of triples [i, j, c] where the count of the pair {i, j} is c

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How much memory is used by the triangular matrix and triples approach for counting frequent pairs?

A

Triangular matrix uses 4 bytes per pair
Triples uses 12 bytes per occurring pair but only for pairs with count > 0

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

When is the triples approach for counting frequent pairs better than triangular matrix?

A

It is better if less than 1/3 of possible pairs actually occurs

17
Q

What is the issue with the triples and triangular matrix approaches to counting frequent pairs?

A

We may not have enough memory to fit the pairs. Swapping counts in and out of disk is not a good solution

18
Q

What are the 2 key ideas that they A-Priori algorithm is based on?

A
  1. Monotonicity: If a set of items I appears at least s times so does every subset of I
  2. Contrapositive for pairs: If item i does not appear in s baskets then no pair including i can appear in s baskets
19
Q

What happens during the first pass of the A-Priori algorithm?

A

Read baskets and count in main memory the occurrences of each individual item. Any items that appear more than s times are frequent

20
Q

What happens during the second pass of the A-Priori algorithm?

A

Read baskets again and count in main memory only the pairs where both elements are frequent

21
Q

What memory is required for each pass of the A-Priori algorithm?

A

Pass 1: Memory proportional to number of items
Pass 2: Memory proportional to square of frequent items only plus a list of frequent items so you know what needs to be counted

22
Q

How do we generalize A-Priori to work for frequent triples and above?

A
  1. For each k we construct 2 sets of k-tuples
  2. Ck are the candidate k-tuples that may be frequent sets based on information from the pass for k-1
  3. Lk contains the set of truly frequent k-tuples
  4. We start with all the items
  5. Count the items to get the set of frequent items
  6. Construct a new set of potential 2-tuples using results from above
  7. Repeat
23
Q

What is the issue with the A-Priori algorithm?

A

In pass 1, most memory is idle. We are only storing individual item counts

24
Q

What do we do in the first pass of the PCY algorithm that is different from A-Priori?

A

We maintain a hash table with as many buckets as fit in memory in addition to item counts. We keep a count for each bucket into which pairs of items are hashed. We do not record the pair, just the count that hash to the bucket

25
Q

What does the count of a PCY bucket tell us?

A

If total count is less than s, none of its pairs can be frequent.
If a bucket contains a frequent pair then the bucket is frequent

26
Q

What do we do during the second pass of the PCY algorithm?

A

Only count pairs that hash to buckets found to be frequent

27
Q

What memory is required for each pass of the PCY algorithm?

A

Pass 1: Memory for item counts and a hash table for pairs
Pass 2: Memory for frequent items, buckets represented as a bitmap, and counts of candidate pairs

28
Q
A