Data Mining: Association Rules Flashcards

1
Q

Knowledge Discovery Process

A

Process of extracting useful knowledge from data. Main steps: data selection, preprocessing and cleaning, transformation, data mining, interpretation and evaluation of the extracted patterns.

2
Q

Analysis techniques, methods

A

Descriptive methods:

Extract interpretable models describing data, for example client segmentation.

Predictive methods:

Exploit some known variables to predict unknown or future values of other variables, for example spam email detection.

3
Q

Attribute types

A

Nominal: ID, eye color, zip codes

Ordinal: Rankings, grades, size in {tall, medium, short}

Interval: calendar dates, temperatures in Celsius

Ratio: temperature in Kelvin, length, time, counts

4
Q

Nominal attribute possesses

A

Distinctness

5
Q

Ordinal attribute possesses

A

Distinctness, order

6
Q

Interval attribute possesses

A

Distinctness, order, addition

7
Q

Ratio attribute possesses

A

Distinctness, order, addition, multiplication

8
Q

Data quality problems

A

Noise, outliers, missing values, duplicate data

9
Q

Important characteristics of structured data

A

Dimensionality: curse of dimensionality

Sparsity: Only presence counts

Resolution: Patterns depend on the scale

10
Q

Aggregation

A

Combining two or more attributes (or objects) into a single one.

Purpose:

Data reduction: reduce the number of attributes or objects (other data reduction techniques: sampling, feature selection, discretization)

Change of scale: from regions into states

Stability: aggregated data tends to be more stable (less variability)
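
As a quick illustration of the change of scale, a minimal pandas sketch (the region/city/sales columns and values are illustrative assumptions, not from the card):

```python
import pandas as pd

# Daily sales per city: fine-grained, relatively noisy data
daily = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "city":   ["Turin", "Milan", "Rome", "Naples", "Bari"],
    "sales":  [120, 95, 210, 80, 60],
})

# Aggregate cities into regions: fewer objects, more stable aggregated values
by_region = daily.groupby("region")["sales"].agg(["sum", "mean"])
print(by_region)
```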

11
Q

Sampling

A

Sampling is necessary because processing the entire data set is too expensive.

Sampling works if the sample set is representative of the entire dataset.

A sample is representative if it has approximately the same properties as the original data set.

Types:

Simple Random: Randomly selected

Without replacement: An object can be taken only once

With replacement: The same object can be taken more than once.

Stratified: Split data into several partitions, take random samples from each partition
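
A minimal sketch of the three sampling types with NumPy and pandas (the column name "label", the sample sizes and the 10% sampling fraction are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame({
    "value": rng.normal(size=1000),
    "label": rng.choice(["a", "b", "c"], size=1000, p=[0.7, 0.2, 0.1]),
})

# Simple random sampling without replacement: each object appears at most once
without_repl = data.sample(n=100, replace=False, random_state=42)

# Simple random sampling with replacement: the same object can be drawn again
with_repl = data.sample(n=100, replace=True, random_state=42)

# Stratified sampling: partition by 'label', draw the same fraction from each partition
stratified = data.groupby("label", group_keys=False).sample(frac=0.1, random_state=42)
```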

12
Q

Dimensionality reduction

A

When dimensionality increases, data becomes increasingly sparse in the space it occupies, and definitions of distance and density between points become less meaningful (curse of dimensionality). To counter this we use dimensionality reduction:

Principal Component Analysis (PCA): find the projection that captures the largest amount of variation in the data.

Singular Value Decomposition (SVD).

Feature subset selection: remove redundant and irrelevant features.
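
A minimal PCA sketch with scikit-learn (the 2-component choice and the random data are illustrative assumptions, not part of the card):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects, 10 attributes

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)        # shape (200, 2)

print(pca.explained_variance_ratio_)    # fraction of variance captured by each component
```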

13
Q

Feature subset selection techniques

A

Brute force: try all possible feature subsets as input to the data mining algorithm

Embedded: features are selected naturally by the data mining algorithm

Filter: features are selected before the data mining algorithm is run

Wrapper: use the data mining algorithm as a black box to find the best feature subset
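
A minimal sketch of the filter and wrapper approaches with scikit-learn (the synthetic dataset and the choice of keeping 5 features are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter: score features independently of the mining algorithm, keep the best 5
X_filtered = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: use a classifier as a black box and recursively eliminate features
wrapper = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapped = wrapper.fit_transform(X, y)
```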

14
Q

Feature Creation

A

Create new attributes that better represent the information in the data set.

Feature Extraction: domain specific

Mapping Data to new space: for example Fourier Transform

Feature Construction: combine features

15
Q

Discretization

A

Split a continuous attribute domain into discrete intervals.

Reduces cardinality of attribute domain.

Techniques:

  • N intervals with the same width (incremental, easy to implement, can be badly affected by outliers and sparse data)
  • N intervals with approximately the same cardinality (non-incremental, better for sparse data and outliers)
  • Clustering (fits well sparse data)
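
A minimal sketch of equal-width and equal-frequency binning with pandas (the 4-bin choice and the skewed sample data are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=10, size=1000))   # skewed continuous attribute

# Equal-width: 4 intervals of the same width (sensitive to outliers)
equal_width = pd.cut(values, bins=4)

# Equal-frequency: 4 intervals with approximately the same number of objects
equal_freq = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```
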
16
Q

Attribute transformation

A

Function that maps attribute values to a new set of values.

Example: normalization (min-max, z-score, decimal scaling)
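
A minimal sketch of the three normalization schemes with NumPy (the sample vector is an illustrative assumption):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

# Min-max normalization: rescale values into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by the smallest power of 10 that brings |values| below 1
decimal_scaled = x / (10 ** np.ceil(np.log10(np.abs(x).max())))
```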

17
Q

Similarity/Dissimilarity for simple attributes

A

Nominal: s = 1 if p = q, s = 0 otherwise (d = 1 - s)

Ordinal: map values to integers 0..n-1, d = |p - q| / (n - 1), s = 1 - d

Interval or Ratio: d = |p - q|, s = -d or s = 1 / (1 + d)

18
Q

Minkowski distance

A

dist(p, q) = (sum_k |p_k - q_k|^r)^(1/r)

r = 1: city block / Manhattan distance (L1 norm); Hamming distance when applied to binary vectors

r = 2: Euclidean distance

r -> ∞: supremum (L∞) distance, the maximum difference between any component of the vectors
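
A minimal NumPy sketch of the three special cases (the two sample vectors are illustrative assumptions):

```python
import numpy as np

def minkowski(p, q, r):
    """Minkowski distance of order r between two vectors."""
    if np.isinf(r):
        return np.abs(p - q).max()              # supremum (L-infinity) distance
    return (np.abs(p - q) ** r).sum() ** (1 / r)

p = np.array([0.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

print(minkowski(p, q, 1))        # city block / Manhattan: 6.0
print(minkowski(p, q, 2))        # Euclidean: ~4.47
print(minkowski(p, q, np.inf))   # supremum: 4.0
```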

19
Q

Properties of metrics

A

d(p, q) >= 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness)

d(p, q) = d(q, p) for all p and q (symmetry)

d(p, r) <= d(p, q) + d(q, r) for all p, q and r (triangle inequality)

20
Q

Common properties of similarities

A

s(p,q) = 1 only if p = q

s(p, q) = s(q, p) for all p and q.

21
Q

Similarity between binary vectors

A

Simple Matching Coefficient:

SMC = (number of matches) / (number of attributes) = (M00 + M11) / (M00 + M01 + M10 + M11)

Jaccard coefficient:

J = (number of 11 matches) / (number of attributes not both zero) = M11 / (M01 + M10 + M11)
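
A minimal sketch computing both measures for two binary vectors (the example vectors are illustrative assumptions):

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))   # attributes where both vectors are 1
m00 = np.sum((p == 0) & (q == 0))   # attributes where both vectors are 0
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m11 + m00 + m10 + m01)   # counts 0-0 matches too
jaccard = m11 / (m11 + m10 + m01)             # ignores 0-0 matches

print(smc, jaccard)   # 0.9 and ~0.667 for these vectors
```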

22
Q

Cosine similarity

A

cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), the cosine of the angle between the two vectors; commonly used for document (word frequency) vectors.

23
Q

Combining similarities and combining weighted similarities

24
Q

Goal of association rules

A

Exploratory technique.

Extraction of frequent correlations or patterns from a transactional database.

Example:

diapers => beer

  • 2% of transactions contain both items
  • 30% of the transactions containing diapers also contain beer
25
Transactional formats
A transaction can be a set of: - Market basket data - Textual data - Structured data
26
Association rule: Itemset
Set including one or more items
27
Association rule: k-itemset
Itemset with cardinality of k
28
Association rule: Support count(#)
Frequency of occurrence of an itemset. Example: #{Diapers, Beer} = 2
29
Association rule: support
The fraction of transactions that contain an itemset. Example: sup({Beer, Diapers}) = 2/5
30
Association rule, Frequent itemset
An itemset whose support is greater than or equal to a minsup threshold.
31
Association rule, rule quality metrics
Given a rule A => B: Support := #{A, B} / |T|, where |T| is the cardinality of the transactional database. Confidence := sup(A, B) / sup(A), i.e. the conditional probability of finding B having found A.
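
A minimal sketch computing support and confidence of a rule on a toy transactional database (the five transactions and the rule {Diapers} => {Beer} are illustrative assumptions):

```python
# Toy transactional database: each transaction is a set of items
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(itemset, db):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in db) / len(db)

A, B = {"Diapers"}, {"Beer"}
rule_support = support(A | B, transactions)                      # sup(A, B) = #{A, B} / |T|
rule_confidence = support(A | B, transactions) / support(A, transactions)

print(rule_support)     # 3/5 = 0.6
print(rule_confidence)  # 3/4 = 0.75
```
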
32
Association rule extraction
Given a set of transactions, association rule mining is the extraction of the rules satisfying the constraints: support >= minsup threshold and confidence >= minconf threshold. The result is complete (all the rules satisfying both constraints) and correct (only the rules satisfying both constraints).
33
Association rule, frequent itemset generation and extraction of association rules
Frequent itemset generation: many different techniques - level-wise approaches (Apriori), approaches without candidate generation (FP-growth), others. This is the most computationally expensive step.
Extraction of association rules: generation of all possible binary partitionings of each frequent itemset, possibly enforcing a confidence threshold.
34
Association rule, Apriori principle
If an itemset is frequent, then all of its subsets must also be frequent: the support of an itemset can never exceed the support of any of its subsets. This reduces the number of candidates to be considered.
35
Association rule, apriori algo
1. Candidate generation
   - Join step: generate candidates of length k+1 by joining frequent itemsets of length k
   - Prune step: apply the Apriori principle, i.e. prune length-(k+1) candidates that contain at least one k-itemset that is not frequent
2. Frequent itemset generation
   - Scan the DB to count the support of the length-(k+1) candidates
   - Prune candidates below minsup
Counting the support of candidates: candidate itemsets are stored in a hash tree; a subset function finds all the candidates contained in each transaction.
Performance issues: candidate sets may be huge (2-itemset candidate generation is the most critical step; extracting long frequent itemsets requires generating all their frequent subsets). Multiple database scans: n+1 scans when the length of the longest frequent pattern is n.
Factors affecting performance: minimum support threshold, dimensionality.
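
A minimal, simplified Apriori sketch in Python (the toy transactions and minsup = 3 are illustrative assumptions; for brevity the join step enumerates all (k+1)-combinations of frequent items and relies on the prune step, whereas real implementations join k-itemsets sharing a common (k-1)-prefix and count supports with a hash tree):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support count} for all frequent itemsets (level-wise)."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {iset: c for iset, c in counts.items() if c >= minsup}
    frequent = dict(current)

    k = 1
    while current:
        # Join step (simplified): candidate (k+1)-itemsets from the frequent items
        items = sorted({i for iset in current for i in iset})
        candidates = [frozenset(c) for c in combinations(items, k + 1)]
        # Prune step (Apriori principle): every k-subset must be frequent
        candidates = [c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))]
        # Scan the DB to count candidate supports, keep those above minsup
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
for itemset, count in apriori(transactions, minsup=3).items():
    print(set(itemset), count)
```
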
36
Association rule, FP-growth algo
Exploits a compressed representation of the database, the FP-tree: high compression for dense data distributions, complete representation for frequent pattern mining.
Frequent pattern mining: recursive visits of the FP-tree, with a divide-and-conquer approach.
Only 2 DB scans: count item supports + build the FP-tree.
The FP-tree is essentially a trie (prefix tree) whose nodes store how many transactions share the prefix ending at that node. Each node with item k also points to the next node with item k and to the corresponding entry in the header table; the header table stores the support of each single item.
Algorithm: scan the header table starting from the lowest-support item. For each item i in the header table, extract the frequent itemsets including item i and the items preceding it in the header table.
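
For comparison with the sketch above, a minimal end-to-end example assuming the mlxtend library is available (the toy transactions and the 0.6/0.7 thresholds are illustrative assumptions):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemset mining without candidate generation (FP-growth), then rule extraction
frequent_itemsets = fpgrowth(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```
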
37
Association rule, vertical data layout
38
Association rule, Frequent Itemset
39
Association rule, closed itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset.
40
Association rule, maximal, closed itemsets set
41
Association rule, effect of support threshold
minsup too high: itemsets including rare but interesting items may be lost. minsup too low: too many frequent itemsets are extracted -> computationally expensive.
42
Confidence
Objective measure. Not always reliable: it ignores the support of the rule head (the consequent), so its value is affected by how frequent the consequent is in the data.
43
Correlation
Objective measure. For a rule r: A => B, correlation (lift) = P(A, B) / (P(A)P(B)) = conf(r) / sup(B). correlation = 1 => statistical independence; correlation > 1 => positive correlation; correlation < 1 => negative correlation.
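
A small worked example for the rule {Diapers} => {Beer} on the 5-transaction toy database used in the sketches above (the support values are the illustrative assumptions):

```python
# Rule r: {Diapers} => {Beer}
sup_A_B = 3 / 5      # sup(Diapers, Beer)
sup_A = 4 / 5        # sup(Diapers)
sup_B = 3 / 5        # sup(Beer)

confidence = sup_A_B / sup_A            # 0.75
correlation = confidence / sup_B        # = sup(A, B) / (sup(A) * sup(B)) = 1.25

print(correlation)   # > 1: Diapers and Beer are positively correlated
```
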
44
Weighted association rules
Consider item/transaction weights during association rule extraction. Extend the rule quality measures -> weighted support, weighted confidence. Apply ad-hoc weight aggregation functions -> min, max, avg.
45
Hierarchical association rules
Enable aggregation over attributes in a dataset. Typically user provided. Examples: time hierarchy, product category, location hierarchy.
46
Generalized itemsets association rules
Sets of items at different generalization levels. A generalized itemset covers a transaction when: all its generalized items are ancestors of items included in the transaction, and its data items are included in the transaction. Generalized itemset support: ratio between the number of covered transactions and the dataset cardinality.
47
High level, cross level, low level rules
High-level rules -> only generalized itemsets (high-level info). Cross-level rules -> generalized items and data items combined. Low-level rules -> only data items.