Association Analysis Flashcards

1
Q

Definition of itemset and its support.

A

Dataset T = {t1, t2, . . . , tN} of N transactions (i.e., baskets) over a set I
of d items, with ti ⊆ I for 1 ≤ i ≤ N.

An itemset is a subset X ⊆ I, and its support w.r.t. T, denoted by
SuppT(X), is the fraction of transactions of T that contain X (i.e.,
|TX| / N, where TX ⊆ T is the subset of transactions that contain X).
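
A minimal sketch of the definition on toy data (the transactions below are made up for illustration):

# Toy transactions (made up for illustration)
T = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]
N = len(T)

def supp(X):
    """Supp_T(X): fraction of the N transactions that contain the itemset X."""
    return sum(1 for t in T if set(X) <= t) / N

print(supp({"bread", "milk"}))  # 2 of the 4 baskets contain both items -> 0.5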

2
Q

Definition of association rule.

A

An association rule is a rule r : X → Y, with X, Y ⊂ I, X, Y ≠ ∅, and X ∩ Y = ∅.
Its support and confidence w.r.t. T, denoted by SuppT(r) and ConfT(r), respectively, are defined as
• SuppT(r) = SuppT(X ∪ Y)
• ConfT(r) = SuppT(X ∪ Y) / SuppT(X).
X and Y are called, respectively, the rule’s antecedent and consequent.
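
A minimal sketch on the same toy transactions as in card 1’s snippet (not from the slides):

# Support and confidence of a rule X -> Y on toy data
T = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]

def supp(X):
    return sum(1 for t in T if set(X) <= t) / len(T)

def conf(X, Y):
    """Conf_T(X -> Y) = Supp_T(X u Y) / Supp_T(X)."""
    return supp(set(X) | set(Y)) / supp(X)

print(supp({"bread", "milk"}), conf({"bread"}, {"milk"}))  # 0.5 and 0.5/0.75 = 0.666...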

3
Q

For set I of d items:

1) what is the number of possible distinct non-empty itemsets?
2) what is the number of distinct association rules?

A

1) 2^d − 1

2) 3^d − 2^(d+1) + 1
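
A brute-force sanity check of both formulas for small d (a sketch; rules are counted as ordered pairs of non-empty, disjoint itemsets, as in the definition of card 2):

from itertools import combinations

def counts(d):
    """Count non-empty itemsets and association rules over d items by enumeration."""
    items = range(d)
    itemsets = [frozenset(c) for k in range(1, d + 1) for c in combinations(items, k)]
    # Rules: ordered pairs (X, Y) of non-empty, disjoint itemsets
    rules = sum(1 for X in itemsets for Y in itemsets if not X & Y)
    return len(itemsets), rules

for d in range(1, 6):
    n_sets, n_rules = counts(d)
    assert n_sets == 2**d - 1 and n_rules == 3**d - 2**(d + 1) + 1
    print(d, n_sets, n_rules)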

4
Q

Definition of the “Anti-monotonicity of support” property.

A

For every X, Y ⊆ I, X ⊆ Y ⇒ SuppT(X) ≥ SuppT(Y).

This can easily be seen through the Hasse diagram.

Important consequences, for a given support threshold:

1) X is frequent ⇒ every W ⊆ X is frequent (downward closure)
2) X is not frequent ⇒ every W ⊇ X is not frequent (note that W contains X)
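
A one-line justification, in the notation of card 1 (TX = subset of transactions containing X): every transaction that contains Y also contains X, hence

T_Y \subseteq T_X \;\Rightarrow\; \mathrm{Supp}_T(Y) = \frac{|T_Y|}{N} \le \frac{|T_X|}{N} = \mathrm{Supp}_T(X).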

5
Q

Pseudocode of the A-priori algorithm that computes the set of frequent itemsets from a dataset T of transactions.

A
A-PRIORI:
k ← 1
Compute F1 = {i ∈ I : SuppT({i}) ≥ minsup}
Compute O1 = {(X, SuppT(X)) : X ∈ F1}
repeat
   k ← k + 1
   Ck ← APRIORI-GEN(Fk−1) /* candidates */
   for each c ∈ Ck do σ(c) ← 0
   for each t ∈ T do
      for each c ∈ Ck do
         if c ⊆ t then σ(c) ← σ(c) + 1
   Fk ← {c ∈ Ck : σ(c) ≥ N · minsup}
   Ok ← {(X, σ(X)/N) : X ∈ Fk}
until Fk = ∅
return the union of all Ok

APRIORI-GEN(F):
Let ℓ − 1 be the size of each itemset in F
Φ ← ∅
/* Candidate generation */
for each X, Y ∈ F s.t. X ≠ Y ∧ X[1 . . . ℓ − 2] = Y[1 . . . ℓ − 2] do
   add X ∪ Y to Φ
/* Candidate pruning */
for each Z ∈ Φ do
   for each Y ⊂ Z s.t. |Y| = ℓ − 1 do
      if Y ∉ F then {remove Z from Φ; exit inner loop}
return Φ
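
For reference, a compact Python sketch that follows the pseudocode above (my own code, not the course’s reference implementation; itemsets are kept as sorted tuples so that the (ℓ−2)-prefix join works):

from itertools import combinations

def apriori_gen(F_prev):
    """Join frequent (l-1)-itemsets sharing their first l-2 items, then prune."""
    F_prev = sorted(F_prev)
    l_minus_1 = len(F_prev[0]) if F_prev else 0
    candidates = set()
    for i in range(len(F_prev)):
        for j in range(i + 1, len(F_prev)):
            X, Y = F_prev[i], F_prev[j]
            if X[:-1] == Y[:-1]:                       # same (l-2)-prefix
                candidates.add(tuple(sorted(set(X) | set(Y))))
    # Candidate pruning: every (l-1)-subset must itself be frequent
    F_set = set(F_prev)
    return [c for c in candidates
            if all(sub in F_set for sub in combinations(c, l_minus_1))]

def apriori(T, minsup):
    """Return {itemset: support} for all frequent itemsets of T (list of sets)."""
    N = len(T)
    counts = {}
    for t in T:                                        # F1: frequent single items
        for i in t:
            counts[(i,)] = counts.get((i,), 0) + 1
    F = [x for x, c in counts.items() if c / N >= minsup]
    out = {x: counts[x] / N for x in F}
    while F:
        C = apriori_gen(F)
        sigma = {c: 0 for c in C}
        for t in T:                                    # one pass to count candidate supports
            for c in C:
                if set(c) <= t:
                    sigma[c] += 1
        F = [c for c in C if sigma[c] >= N * minsup]
        out.update({c: sigma[c] / N for c in F})
    return out

For example, apriori([{"a","b"}, {"a","c"}, {"a","b","c"}], minsup=0.6) returns the five frequent itemsets a, b, c, ab, ac with their supports.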

6
Q

Prove the correctness of the A-priori algorithm.

A
  • Study the proof on the slides!!
  • For ℓ = 1 the claim is true (F1 is computed by brute force).
  • Assume the claim holds for ℓ (inductive hypothesis).
  • For ℓ + 1, consider a frequent X and write it as X = X′ ∪ X′′ (see the reconstruction below).
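
A hedged reconstruction of the inductive step (my own wording, to be checked against the slides): assume every frequent ℓ-itemset belongs to Fℓ, let X be frequent with |X| = ℓ + 1 and items x1 < · · · < xℓ+1, and set X′ = X \ {xℓ+1}, X′′ = X \ {xℓ}. Then

X', X'' \subseteq X \;\Rightarrow\; \mathrm{Supp}_T(X'),\, \mathrm{Supp}_T(X'') \ge \mathrm{Supp}_T(X) \ge \mathit{minsup} \;\Rightarrow\; X', X'' \in F_\ell ,

so APRIORI-GEN joins them (they share their first ℓ − 1 items) and generates X = X′ ∪ X′′ as a candidate; X survives pruning because all of its ℓ-subsets are frequent, its support is then counted exactly over T, and therefore X ∈ Fℓ+1. Since candidate supports are exact, Fℓ+1 also contains no non-frequent itemset.
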
7
Q

Efficiency of A-Priori.

A

For dataset T over d items and a threshold minsup, let M be the number of frequent itemsets returned by A-Priori.
Then, the algorithm computes the support for ≤ d + min{M^2, dM} itemsets.

  • Study the proof on the slides!
  • |C1| = d (easy).
  • |Ck| can be upper-bounded in two ways, for k ≥ 2.
  • Sum over all |Ck| (see the sketch below).
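
A sketch of the counting argument (my reconstruction, to be checked against the slides): supports are computed only for the d singletons in C1 and for the candidates produced by APRIORI-GEN, and for every k ≥ 2

|C_k| \le \min\{\, |F_{k-1}|^2 ,\; d\,|F_{k-1}| \,\},

since each candidate is the union of a pair of distinct itemsets of Fk−1, and also the union of an itemset of Fk−1 with one of the d items. Summing over k and using Σk |Fk−1| ≤ M gives at most d + min{M^2, dM} itemsets overall.
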
8
Q

Association analysis: approaches to handle large inputs when computing frequent itemsets.

A

1) Partition-based approach (a code sketch follows after the remarks):
• Partition T into ℓ subsets T1, T2, . . . , Tℓ.
• For each i separately, determine the set FTi of frequent itemsets w.r.t. Ti and minsup.
• Compute the support (w.r.t. T) for the itemsets in ∪_{i=1}^{ℓ} FTi, returning those of support ≥ minsup.

2) Sampling approach:
• Compute the frequent itemsets from a small sample of T.

Remarks:
• The partition-based approach computes the exact output but may be inefficient.
• The sampling approach computes an approximation but is more efficient.
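
A minimal sketch of the partition-based approach (using the hypothetical apriori helper from card 5’s snippet, with T a list of transaction sets split into near-equal chunks):

def partition_frequent_itemsets(T, minsup, n_parts):
    """Exact frequent itemsets of T via the partition-based approach (sketch)."""
    N = len(T)
    size = -(-N // n_parts)                     # ceil(N / n_parts)
    parts = [T[i:i + size] for i in range(0, N, size)]
    # Phase 1: local frequent itemsets of each part form the candidate set
    candidates = set()
    for Ti in parts:
        candidates |= set(apriori(Ti, minsup))  # hypothetical helper from card 5
    # Phase 2: one pass over T to compute the exact global supports
    result = {}
    for X in candidates:
        s = sum(1 for t in T if set(X) <= t) / N
        if s >= minsup:
            result[X] = s
    return result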

9
Q

Show that FT ⊆ ∪_{i=1}^{ℓ} FTi (correctness of the partition-based approach for computing the frequent itemsets).

A
  • FT = the set of frequent itemsets w.r.t. T
  • FTi = the set of frequent itemsets w.r.t. Ti

Solution (spelled out in the chain below):

  • x ∈ FT ⇒ |Tx| ≥ minsup · N, hence |Tx|/ℓ ≥ minsup · N/ℓ
  • Pigeonhole principle ⇒ there exists i s.t. |Ti,x| ≥ minsup · N/ℓ ⇒ x ∈ FTi ⇒ x ∈ ∪_{i=1}^{ℓ} FTi
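
Spelled out as a chain (assuming equal-sized subsets with |Ti| = N/ℓ, which the last inequality needs, and writing Ti,x for the transactions of Ti that contain x):

x \in F_T \;\Rightarrow\; |T_x| \ge \mathit{minsup}\cdot N \;\Rightarrow\; \exists\, i :\ |T_{i,x}| \ge \frac{|T_x|}{\ell} \ge \mathit{minsup}\cdot\frac{N}{\ell} = \mathit{minsup}\cdot|T_i| \;\Rightarrow\; x \in F_{T_i} \subseteq \bigcup_{i=1}^{\ell} F_{T_i},

where the second implication is the pigeonhole principle applied to the |Tx| transactions of Tx spread over the ℓ subsets.
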
10
Q

Sampling-based approach: definition of approximate frequent itemsets.

A

Consider a dataset T of transactions over the set of items I, a support threshold minsup ∈ (0, 1], and a parameter ϵ > 0.
A set C of pairs (X, sX), with X ⊆ I and sX ∈ (0, 1], is an ϵ-approximation of the set FT,minsup of frequent itemsets and their supports if the following conditions are satisfied:
1) For each X ∈ FT,minsup there exists a pair (X, sX) ∈ C
2) For each (X, sX) ∈ C:
   • SuppT(X) ≥ minsup − ϵ
   • |SuppT(X) − sX| ≤ ϵ.
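
A small sketch of a checker for this definition (hypothetical names: C is a dict mapping itemsets to sX, F_true is FT,minsup computed exactly, and supp is the plain support from card 1):

def is_eps_approximation(C, F_true, T, minsup, eps):
    """Check the two conditions of the epsilon-approximation definition (sketch)."""
    def supp(X):                                   # Supp_T(X), as in card 1
        return sum(1 for t in T if set(X) <= t) / len(T)
    cond1 = all(X in C for X in F_true)            # 1) every true frequent itemset appears in C
    cond2 = all(supp(X) >= minsup - eps and abs(supp(X) - sX) <= eps
                for X, sX in C.items())            # 2) supports close, nothing far below minsup
    return cond1 and cond2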

11
Q

Sampling-based approach: general algorithm.

A

1) Fix a suitably lowered threshold f(minsup) < minsup.
2) Draw a random sample S from T with uniform probability and with replacement.
3) Return the set of pairs C = {(X, sX = SuppS(X)) : X ∈ FS,f(minsup)}
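
A minimal sketch of the three steps (hypothetical apriori helper as in card 5’s snippet; choosing f(minsup) and the sample size, which depend on ϵ and on the desired failure probability, is outside this card):

import random

def sample_frequent_itemsets(T, minsup, lowered_minsup, sample_size):
    """Sampling-based approach (sketch): mine a uniform with-replacement sample."""
    S = [random.choice(T) for _ in range(sample_size)]   # step 2: uniform, with replacement
    # Step 3: pairs (X, Supp_S(X)) for itemsets frequent in S w.r.t. the lowered threshold
    return apriori(S, lowered_minsup)                    # hypothetical helper from card 5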
