Association Analysis Flashcards

1
Q

Definition of itemset and its support.

A

Dataset T = {t1, t2, . . . , tN} of N transactions (i.e., baskets) over a set I
of d items, with ti ⊆ I for 1 ≤ i ≤ N.

An itemset is a subset X ⊆ I, and its support w.r.t. T, denoted by
SuppT(X), is the fraction of transactions of T that contain X (i.e.,
|TX| / N, where TX ⊆ T is the subset of transactions that contain X).
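
A minimal sketch of the definition on toy data (the transactions below are made up for illustration):

# Toy transactions (made up for illustration)
T = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]
N = len(T)

def supp(X):
    """Supp_T(X): fraction of the N transactions that contain the itemset X."""
    return sum(1 for t in T if set(X) <= t) / N

print(supp({"bread", "milk"}))  # 2 of the 4 baskets contain both items -> 0.5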

2
Q

Definition of association rule.

A

An association rule is a rule r : X → Y, with X, Y ⊂ I, X, Y ≠ ∅, and X ∩ Y = ∅.
Its support and confidence w.r.t. T, denoted by SuppT(r) and ConfT(r), respectively, are defined as
• SuppT(r) = SuppT(X ∪ Y)
• ConfT(r) = SuppT(X ∪ Y) / SuppT(X).
X and Y are called, respectively, the rule’s antecedent and consequent.
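
A minimal sketch on the same toy transactions as in card 1’s snippet (not from the slides):

# Support and confidence of a rule X -> Y on toy data
T = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]

def supp(X):
    return sum(1 for t in T if set(X) <= t) / len(T)

def conf(X, Y):
    """Conf_T(X -> Y) = Supp_T(X u Y) / Supp_T(X)."""
    return supp(set(X) | set(Y)) / supp(X)

print(supp({"bread", "milk"}), conf({"bread"}, {"milk"}))  # 0.5 and 0.5/0.75 = 0.666...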

3
Q

For set I of d items:

1) what is the number of possible distinct non-empty itemsets?
2) what is the number of distinct association rules?

A

1) 2^d − 1

2) 3^d − 2^(d+1) + 1
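
A brute-force sanity check of both formulas for small d (a sketch; rules are counted as ordered pairs of non-empty, disjoint itemsets, as in the definition of card 2):

from itertools import combinations

def counts(d):
    """Count non-empty itemsets and association rules over d items by enumeration."""
    items = range(d)
    itemsets = [frozenset(c) for k in range(1, d + 1) for c in combinations(items, k)]
    # Rules: ordered pairs (X, Y) of non-empty, disjoint itemsets
    rules = sum(1 for X in itemsets for Y in itemsets if not X & Y)
    return len(itemsets), rules

for d in range(1, 6):
    n_sets, n_rules = counts(d)
    assert n_sets == 2**d - 1 and n_rules == 3**d - 2**(d + 1) + 1
    print(d, n_sets, n_rules)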

4
Q

Definition of the “Anti-monotonicity of support” property.

A

For every X, Y ⊆ I, X ⊆ Y ⇒ SuppT(X) ≥ SuppT(Y).

This can easily be seen through the Hasse diagram.

Important consequences, for a given support threshold:

1) X is frequent ⇒ every W ⊆ X is frequent (downward closure)
2) X is not frequent ⇒ every W ⊇ X is not frequent (note that W contains X)
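
A one-line justification, in the notation of card 1 (TX = subset of transactions containing X): every transaction that contains Y also contains X, hence

T_Y \subseteq T_X \;\Rightarrow\; \mathrm{Supp}_T(Y) = \frac{|T_Y|}{N} \le \frac{|T_X|}{N} = \mathrm{Supp}_T(X).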

5
Q

Pseudocode of the A-priori algorithm that computes the set of frequent itemsets from a dataset T of transactions.

A
A-PRIORI:
k ← 1
Compute F1 = {i ∈ I : SuppT({i}) ≥ minsup}
Compute O1 = {(X, SuppT(X)) : X ∈ F1}
repeat
   k ← k + 1
   Ck ← APRIORI-GEN(Fk−1) /* candidates */
   for each c ∈ Ck do σ(c) ← 0
   for each t ∈ T do
      for each c ∈ Ck do
         if c ⊆ t then σ(c) ← σ(c) + 1
   Fk ← {c ∈ Ck : σ(c) ≥ N · minsup}
   Ok ← {(X, σ(X)/N) : X ∈ Fk}
until Fk = ∅
return the union of all Ok

APRIORI-GEN(F):
Let ℓ − 1 be the size of each itemset in F
Φ ← ∅
/* Candidate generation */
for each X, Y ∈ F s.t. X ≠ Y ∧ X[1 . . . ℓ − 2] = Y[1 . . . ℓ − 2] do
   add X ∪ Y to Φ
/* Candidate pruning */
for each Z ∈ Φ do
   for each Y ⊂ Z s.t. |Y| = ℓ − 1 do
      if Y ∉ F then {remove Z from Φ; exit inner loop}
return Φ
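
For reference, a compact Python sketch that follows the pseudocode above (my own code, not the course’s reference implementation; itemsets are kept as sorted tuples so that the (ℓ−2)-prefix join works):

from itertools import combinations

def apriori_gen(F_prev):
    """Join frequent (l-1)-itemsets sharing their first l-2 items, then prune."""
    F_prev = sorted(F_prev)
    l_minus_1 = len(F_prev[0]) if F_prev else 0
    candidates = set()
    for i in range(len(F_prev)):
        for j in range(i + 1, len(F_prev)):
            X, Y = F_prev[i], F_prev[j]
            if X[:-1] == Y[:-1]:                       # same (l-2)-prefix
                candidates.add(tuple(sorted(set(X) | set(Y))))
    # Candidate pruning: every (l-1)-subset must itself be frequent
    F_set = set(F_prev)
    return [c for c in candidates
            if all(sub in F_set for sub in combinations(c, l_minus_1))]

def apriori(T, minsup):
    """Return {itemset: support} for all frequent itemsets of T (list of sets)."""
    N = len(T)
    counts = {}
    for t in T:                                        # F1: frequent single items
        for i in t:
            counts[(i,)] = counts.get((i,), 0) + 1
    F = [x for x, c in counts.items() if c / N >= minsup]
    out = {x: counts[x] / N for x in F}
    while F:
        C = apriori_gen(F)
        sigma = {c: 0 for c in C}
        for t in T:                                    # one pass to count candidate supports
            for c in C:
                if set(c) <= t:
                    sigma[c] += 1
        F = [c for c in C if sigma[c] >= N * minsup]
        out.update({c: sigma[c] / N for c in F})
    return out

For example, apriori([{"a","b"}, {"a","c"}, {"a","b","c"}], minsup=0.6) returns the five frequent itemsets a, b, c, ab, ac with their supports.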

6
Q

Prove the correctness of the A-priori algorithm.

A
  • Study the proof on the slides!!
  • For ℓ = 1 the claim is true (F1 is computed by brute force).
  • Assume the claim holds for ℓ (inductive hypothesis).
  • For ℓ + 1, consider a frequent X and write it as X = X′ ∪ X′′ (see the reconstruction below).
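
A hedged reconstruction of the inductive step (my own wording, to be checked against the slides): assume every frequent ℓ-itemset belongs to Fℓ, let X be frequent with |X| = ℓ + 1 and items x1 < · · · < xℓ+1, and set X′ = X \ {xℓ+1}, X′′ = X \ {xℓ}. Then

X', X'' \subseteq X \;\Rightarrow\; \mathrm{Supp}_T(X'),\, \mathrm{Supp}_T(X'') \ge \mathrm{Supp}_T(X) \ge \mathit{minsup} \;\Rightarrow\; X', X'' \in F_\ell ,

so APRIORI-GEN joins them (they share their first ℓ − 1 items) and generates X = X′ ∪ X′′ as a candidate; X survives pruning because all of its ℓ-subsets are frequent, its support is then counted exactly over T, and therefore X ∈ Fℓ+1. Since candidate supports are exact, Fℓ+1 also contains no non-frequent itemset.
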
7
Q

Efficiency of A-Priori.

A

For dataset T over d items and a threshold minsup, let M be the number of frequent itemsets returned by A-Priori.
Then, the algorithm computes the support for ≤ d + min{M^2, dM} itemsets.

  • Study the proof on the slides!
  • |C1| = d (easy).
  • |Ck| can be upper-bounded in two ways, for k ≥ 2.
  • Sum over all |Ck| (see the sketch below).
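
A sketch of the counting argument (my reconstruction, to be checked against the slides): supports are computed only for the d singletons in C1 and for the candidates produced by APRIORI-GEN, and for every k ≥ 2

|C_k| \le \min\{\, |F_{k-1}|^2 ,\; d\,|F_{k-1}| \,\},

since each candidate is the union of a pair of distinct itemsets of Fk−1, and also the union of an itemset of Fk−1 with one of the d items. Summing over k and using Σk |Fk−1| ≤ M gives at most d + min{M^2, dM} itemsets overall.
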
8
Q

Association analysis: approaches to handle large inputs when computing frequent itemsets.

A

1) Partition-based approach (a code sketch follows after the remarks):
• Partition T into ℓ subsets T1, T2, . . . , Tℓ.
• For each i separately, determine the set FTi of frequent itemsets w.r.t. Ti and minsup.
• Compute the support (w.r.t. T) for the itemsets in ∪_{i=1}^{ℓ} FTi, returning those of support ≥ minsup.

2) Sampling approach:
• Compute the frequent itemsets from a small sample of T.

Remarks:
• The partition-based approach computes the exact output but may be inefficient.
• The sampling approach computes an approximation but is more efficient.
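
A minimal sketch of the partition-based approach (using the hypothetical apriori helper from card 5’s snippet, with T a list of transaction sets split into near-equal chunks):

def partition_frequent_itemsets(T, minsup, n_parts):
    """Exact frequent itemsets of T via the partition-based approach (sketch)."""
    N = len(T)
    size = -(-N // n_parts)                     # ceil(N / n_parts)
    parts = [T[i:i + size] for i in range(0, N, size)]
    # Phase 1: local frequent itemsets of each part form the candidate set
    candidates = set()
    for Ti in parts:
        candidates |= set(apriori(Ti, minsup))  # hypothetical helper from card 5
    # Phase 2: one pass over T to compute the exact global supports
    result = {}
    for X in candidates:
        s = sum(1 for t in T if set(X) <= t) / N
        if s >= minsup:
            result[X] = s
    return result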

9
Q

Show that FT ⊆ ∪_{i=1}^{ℓ} FTi (correctness of the partition-based approach for computing the frequent itemsets).

A
  • FT = the set of frequent itemsets w.r.t. T
  • FTi = the set of frequent itemsets w.r.t. Ti

Solution (spelled out in the chain below):

  • x ∈ FT ⇒ |Tx| ≥ minsup · N, hence |Tx|/ℓ ≥ minsup · N/ℓ
  • Pigeonhole principle ⇒ there exists i s.t. |Ti,x| ≥ minsup · N/ℓ ⇒ x ∈ FTi ⇒ x ∈ ∪_{i=1}^{ℓ} FTi
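
Spelled out as a chain (assuming equal-sized subsets with |Ti| = N/ℓ, which the last inequality needs, and writing Ti,x for the transactions of Ti that contain x):

x \in F_T \;\Rightarrow\; |T_x| \ge \mathit{minsup}\cdot N \;\Rightarrow\; \exists\, i :\ |T_{i,x}| \ge \frac{|T_x|}{\ell} \ge \mathit{minsup}\cdot\frac{N}{\ell} = \mathit{minsup}\cdot|T_i| \;\Rightarrow\; x \in F_{T_i} \subseteq \bigcup_{i=1}^{\ell} F_{T_i},

where the second implication is the pigeonhole principle applied to the |Tx| transactions of Tx spread over the ℓ subsets.
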
10
Q

Sampling-based approach: definition of approximate frequent itemsets.

A

Consider a dataset T of transactions over the set of items I, a support threshold minsup ∈ (0, 1], and a parameter ϵ > 0.
A set C of pairs (X, sX), with X ⊆ I and sX ∈ (0, 1], is an ϵ-approximation of the set FT,minsup of frequent itemsets and their supports if the following conditions are satisfied:
1) For each X ∈ FT,minsup there exists a pair (X, sX) ∈ C
2) For each (X, sX) ∈ C:
   • SuppT(X) ≥ minsup − ϵ
   • |SuppT(X) − sX| ≤ ϵ.
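
A small sketch of a checker for this definition (hypothetical names: C is a dict mapping itemsets to sX, F_true is FT,minsup computed exactly, and supp is the plain support from card 1):

def is_eps_approximation(C, F_true, T, minsup, eps):
    """Check the two conditions of the epsilon-approximation definition (sketch)."""
    def supp(X):                                   # Supp_T(X), as in card 1
        return sum(1 for t in T if set(X) <= t) / len(T)
    cond1 = all(X in C for X in F_true)            # 1) every true frequent itemset appears in C
    cond2 = all(supp(X) >= minsup - eps and abs(supp(X) - sX) <= eps
                for X, sX in C.items())            # 2) supports close, nothing far below minsup
    return cond1 and cond2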

11
Q

Sampling-based approach: general algorithm.

A

1) Fix a suitably lowered threshold f(minsup) < minsup.
2) Draw a random sample S from T with uniform probability and with replacement.
3) Return the set of pairs C = {(X, sX = SuppS(X)) : X ∈ FS,f(minsup)}
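
A minimal sketch of the three steps (hypothetical apriori helper as in card 5’s snippet; choosing f(minsup) and the sample size, which depend on ϵ and on the desired failure probability, is outside this card):

import random

def sample_frequent_itemsets(T, minsup, lowered_minsup, sample_size):
    """Sampling-based approach (sketch): mine a uniform with-replacement sample."""
    S = [random.choice(T) for _ in range(sample_size)]   # step 2: uniform, with replacement
    # Step 3: pairs (X, Supp_S(X)) for itemsets frequent in S w.r.t. the lowered threshold
    return apriori(S, lowered_minsup)                    # hypothetical helper from card 5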
