Association Analysis Flashcards
Which application areas can benefit from detecting co-occurrence relationships?
Marketing: Identify items that are bought together for marketing purposes
Inventory Management: Identify parts that are often needed together for repairs to equip the repair vehicle
Usage Mining: Identify words that frequently appear together in search queries to offer auto-completion
Describe correlation analysis. Which technique can be used for continuous variables, and which for binary variables?
- Measure the degree of dependency between two variables
Continuous variable: Pearson's correlation coefficient
Binary variable: Phi coefficient
Value range:
+1: perfect positive correlation
0: no correlation (independent variables have correlation 0)
-1: perfect negative correlation
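A minimal sketch of both coefficients in plain Python. The basket data and the function names `pearson` and `phi` are made up for illustration. On 0/1 data the phi coefficient gives the same value as Pearson's coefficient, which the last lines check.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def phi(xs, ys):
    """Phi coefficient computed from the 2x2 contingency table of two 0/1 lists."""
    pairs = list(zip(xs, ys))
    n11 = pairs.count((1, 1))  # both present
    n10 = pairs.count((1, 0))  # only x present
    n01 = pairs.count((0, 1))  # only y present
    n00 = pairs.count((0, 0))  # both absent
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator

# Hypothetical data: 1 = item is in the basket, 0 = it is not.
bread = [1, 1, 0, 1, 0, 1]
butter = [1, 1, 0, 0, 0, 1]

print(round(phi(bread, butter), 4))      # 0.7071
print(round(pearson(bread, butter), 4))  # 0.7071, same value on binary data
```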
What is the shortcoming with correlations between products in shopping baskets?
- Correlation only captures relationships between pairs of items, not among more than two items
What is the benefit of association analysis compared to correlations?
- Association analysis can find co-occurrence relationships among multiple items (descriptive method); it focuses on the items that occur in transactions
What can association analysis not find?
- Causal relationships
What is an itemset?
- collection of one or more items (e.g. in a transaction)
- k-itemset: An itemset that contains k items
Define Support count
- frequency of occurrence of an itemset
Define Support
- fraction of transactions that contain an itemset
Define frequent itemset
- an itemset whose support is >= the minimum support threshold (minsup)
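The three definitions above (support count, support, frequent itemset) can be sketched as follows. The transactions and the names `support_count`, `support` and `is_frequent` are illustrative assumptions, not part of the cards.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support reaches the minsup threshold."""
    return support(itemset, transactions) >= minsup

three_itemset = {"milk", "diapers", "beer"}          # a 3-itemset
print(support_count(three_itemset, transactions))     # 2
print(support(three_itemset, transactions))           # 0.4
print(is_frequent(three_itemset, transactions, 0.5))  # False
```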
What is the difference between the rule evaluation metrics Support and Confidence?
X (Condition) -> Y (Consequent)
Support:
- fraction of transactions that contain both X and Y
Confidence:
- how often items in Y appear in transactions that contain X, i.e. (support count of X and Y together) / (support count of X)
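A small sketch of both metrics for a single rule X -> Y. The transactions and the function names `rule_support` and `rule_confidence` are made up for illustration.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    return support_count(x | y, transactions) / len(transactions)

def rule_confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support_count(x | y, transactions) / support_count(x, transactions)

x, y = {"milk", "diapers"}, {"beer"}
print(rule_support(x, y, transactions))     # 0.4: 2 of 5 transactions contain X and Y
print(rule_confidence(x, y, transactions))  # 2/3: 2 of the 3 transactions with X contain Y
```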
What are the main challenges of association analysis?
1) Mining associations from large amounts of data can be computationally expensive (need to apply smart pruning strategies)
2) Algorithms often discover a large number of associations (many are irrelevant or redundant, user needs to select the relevant subset)
What is the goal of association rule mining?
- Find all rules that have support >= the minsup threshold and confidence >= the minconf threshold
Explain the Brute Force Approach for Association Rule mining
1) List all possible association rules
2) Compute support and confidence for each rule
3) Remove rules that fail the minsup or minconf threshold
Attention: Computationally prohibitive due to large number of candidates!
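The three brute-force steps can be sketched as below, assuming transactions are given as Python sets. The data and the function name `brute_force_rules` are hypothetical. With d distinct items there are 3^d - 2^(d+1) + 1 candidate rules, so even the 6 items here already give 602 candidates.

```python
from itertools import combinations

# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def brute_force_rules(transactions, minsup, minconf):
    """Enumerate every rule X -> Y, score it, and keep those passing both thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    candidates = 0
    kept = []
    for k in range(2, len(items) + 1):
        for itemset in combinations(items, k):
            full = set(itemset)
            count_xy = sum(1 for t in transactions if full <= t)
            # Step 1: every non-empty proper subset X of the itemset gives a rule X -> rest.
            for r in range(1, k):
                for antecedent in combinations(itemset, r):
                    candidates += 1
                    x = set(antecedent)
                    count_x = sum(1 for t in transactions if x <= t)
                    if count_x == 0:
                        continue  # X never occurs, so the rule has support 0
                    # Step 2: compute support and confidence.
                    sup = count_xy / n
                    conf = count_xy / count_x
                    # Step 3: discard rules below either threshold.
                    if sup >= minsup and conf >= minconf:
                        kept.append((frozenset(x), frozenset(full - x), sup, conf))
    return candidates, kept

candidates, rules = brute_force_rules(transactions, minsup=0.4, minconf=0.6)
print(candidates)  # 602 candidate rules for only 6 distinct items
```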
What happens with rules that originate from the same itemset?
- The rules have the same support but can have different confidence
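To illustrate, the sketch below lists all rules from one itemset. The data and the name `rules_from_itemset` are made up, and the itemset is assumed to occur in at least one transaction. Support is identical for all six rules because it depends only on the whole itemset; confidence varies with the left-hand side.

```python
from itertools import combinations

# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, transactions):
    """All binary partitions X -> Y of one itemset, with support and confidence."""
    items = sorted(itemset)
    n = len(transactions)
    count_all = support_count(set(items), transactions)
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            x = set(antecedent)
            y = set(items) - x
            sup = count_all / n                             # same for every rule
            conf = count_all / support_count(x, transactions)  # depends on X
            rules.append((x, y, sup, conf))
    return rules

for x, y, sup, conf in rules_from_itemset({"milk", "diapers", "beer"}, transactions):
    print(sorted(x), "->", sorted(y), "support", sup, "confidence", round(conf, 2))
```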
Explain the two-step approach of rule generation; is this approach computationally better compared to the brute force approach?
1) Frequent itemset generation (all itemsets whose support >= minsup)
2) Rule generation (high confidence rules from each frequent itemset, each rule is a binary partitioning of a frequent itemset)
-> Yes, rules are only generated from frequent itemsets, but frequent itemset generation itself is still computationally expensive
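A minimal sketch of the two steps, assuming transactions are given as Python sets. Step 1 here simply enumerates all itemsets level by level, without any smart pruning, and stops once a level contains no frequent itemset. The data and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(transactions, minsup):
    """Step 1: map every itemset with support >= minsup to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        found_at_level = False
        for itemset in combinations(items, k):
            sup = sum(1 for t in transactions if set(itemset) <= t) / n
            if sup >= minsup:
                frequent[frozenset(itemset)] = sup
                found_at_level = True
        if not found_at_level:
            break  # no frequent k-itemset means no larger itemset can be frequent
    return frequent

def generate_rules(frequent, minconf):
    """Step 2: binary partitions of each frequent itemset with confidence >= minconf."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so its support is stored.
                conf = sup / frequent[x]
                if conf >= minconf:
                    rules.append((x, itemset - x, sup, conf))
    return rules

frequent = frequent_itemsets(transactions, minsup=0.4)
rules = generate_rules(frequent, minconf=0.6)
for x, y, sup, conf in rules:
    print(sorted(x), "->", sorted(y), "support", sup, "confidence", round(conf, 2))
```

Step 2 only looks up supports that step 1 already computed, so no rule built from an infrequent itemset is ever scored. The cost that remains is counting supports in step 1.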