Association Analysis Flashcards
Which application areas can benefit from detecting
co-occurrence relationships?
Marketing: Identify items that are bought together for marketing purposes
Inventory Management: Identify parts that are often needed together for repairs to equip the repair vehicle
Usage Mining: Identify words that frequently appear together in search queries to offer auto-completion
Describe correlation analysis. Which technique can be used for continuous variables and which for binary variables?
- Measures the degree of dependency between two variables
Continuous variables: Pearson's correlation coefficient
Binary variables: Phi coefficient
Value range:
+1: positive correlation
0: no correlation (independent variables)
-1: negative correlation
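The two coefficients above can be sketched in a few lines of plain Python; the sample vectors are made up for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient for two continuous variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def phi(x, y):
    """Phi coefficient for two binary (0/1) variables, computed
    from the counts of the 2x2 contingency table."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated -> 1.0
print(phi([1, 1, 0, 0], [1, 1, 0, 0]))      # identical binary vectors -> 1.0
```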
What is the shortcoming with correlations between products in shopping baskets?
- Correlation can only find relationships between two items, not among multiple items
What is the benefit of association analysis compared to correlations?
- Association analysis can find co-occurrence relationships among multiple items (descriptive method) - focuses on items that occur together
What can association analysis not find?
- Causal relationships
What is an itemset?
- collection of one or more items (e.g. in a transaction)
- k-itemset: An itemset that contains k items
Define Support count
- frequency of occurrence of an itemset
Define Support
- fraction of transactions that contain an itemset
Define frequent itemset
- an itemset whose support is >= minimal support (minsup)
What is the difference between the rule evaluation metrics Support and Confidence?
X (Condition) -> Y (Consequent)
Support:
- fraction of transactions that contain both X and Y
Confidence:
- how often items in Y appear in transactions that contain X (confidence(X -> Y) = support(X ∪ Y) / support(X))
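Both metrics are easy to compute directly; a minimal sketch on a small made-up basket dataset:

```python
# Toy transaction database (item names invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """How often items in Y appear in transactions that contain X."""
    return support(set(x) | set(y)) / support(x)

print(support({"milk", "diapers", "beer"}))       # 2 of 5 transactions -> 0.4
print(confidence({"milk", "diapers"}, {"beer"}))  # 2 of 3 -> ~0.67
```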
What are the main challenges of association analysis?
1) Mining associations from large amounts of data can be computationally expensive (need to apply smart pruning strategies)
2) Algorithms often discover a large number of associations (many are irrelevant or redundant, user needs to select the relevant subset)
What is the goal of association rule mining?
- Find all rules that have support >= the minsup threshold and confidence >= the minconf threshold
Explain the Brute Force Approach for Association Rule mining
1) List all possible association rules
2) Compute support and confidence for each rule
3) Remove rules that fail the minsup or minconf threshold
Attention: Computationally prohibitive due to the large number of candidates!
What happens with rules that originate from the same itemset?
- The rules have the same support but can have different confidence
Explain the two-step approach of rule generation; is this approach computationally better compared to the brute force approach?
1) Frequent itemset generation (all itemsets whose support >= minsup)
2) Rule generation (high confidence rules from each frequent itemset, each rule is a binary partitioning of a frequent itemset)
-> Frequent itemset generation is still computationally expensive
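Step 2 can be sketched directly: every rule is a binary partitioning of a frequent itemset into a non-empty condition X and the consequent Y = itemset - X.

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """Enumerate all binary partitionings X -> Y of a frequent itemset."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):          # X must be a proper, non-empty subset
        for x in combinations(itemset, r):
            x = frozenset(x)
            rules.append((x, itemset - x))    # Y is the rest of the itemset
    return rules

for x, y in rules_from_itemset({"a", "b", "c"}):
    print(set(x), "->", set(y))
```

A k-itemset yields 2^k - 2 candidate rules (here 6); in the full algorithm each rule is then kept only if its confidence >= minconf.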
What is the difference between a frequent itemset and a candidate itemset?
- Frequent itemset has support >= minsup
- A candidate itemset is a potentially frequent itemset; its support must still be counted to check whether it is >= minsup
Explain the Brute Force Approach for frequent itemset generation
- Each itemset is a candidate frequent itemset
- Count the support of each candidate by scanning the database (match each transaction against every candidate)
Complexity is O(NMw), with N transactions, M candidate itemsets, and w the maximum transaction width
-> expensive because M = 2^d (d = number of items)
Explain why the brute force approach is not feasible (with the example of amazon)
- Amazon has 10 million books
- 2^10,000,000 possible itemsets
-> Most itemsets will not be relevant (e.g. books on Inuit cooking and data mining bought together)
-> Intuition for the algorithm is necessary: all itemsets containing an Inuit cooking book are likely infrequent
Explain the Apriori principle in regard to frequent itemsets
If an itemset is frequent, then all of its subsets must also be frequent
What is the anti-monotone property of support?
- Support of an itemset never exceeds the support of its subsets
How can you use the apriori principle for pruning?
- If an itemset is infrequent then all of its supersets must also be infrequent
- Prune the supersets of an infrequent itemset
Explain the Apriori Algorithm for frequent itemset generation
- k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
- 1 Generate length-(k+1) candidate itemsets from length-k frequent itemsets
- 2 Prune candidate itemsets according to the Apriori principle
- 3 Count the support of each candidate (scan the DB)
- 4 Eliminate candidates that are infrequent
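The loop above can be sketched compactly; a minimal, unoptimized version on a small made-up transaction list:

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def apriori(transactions, minsup):
    """Return all frequent itemsets (support >= minsup) via the Apriori loop."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Generate frequent itemsets of length 1
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= minsup}
    all_freq, k = set(freq), 1
    while freq:
        # 1) Candidate generation: unions of frequent k-itemsets of size k+1
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # 2) Apriori pruning: every k-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # 3) + 4) Count support by scanning the DB, eliminate infrequent candidates
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= minsup}
        all_freq |= freq
        k += 1
    return all_freq

for itemset in sorted(apriori(transactions, minsup=0.6), key=len):
    print(set(itemset))
```

With minsup = 0.6 the pruning step already discards e.g. {bread, milk, beer}, because its subset {bread, beer} appears in only 2 of 5 transactions.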
Briefly state what the FP-Growth frequent itemset generation method is
An itemset generation algorithm that compresses the data into a tree structure (FP-tree) held in memory
How can you apply frequent itemsets in e-commerce?
- Take the top-k frequent itemsets of size 2 that contain item A
- Rank the second item B according to criteria such as the profit made by selling it, the wish to reduce stock of item B, or knowledge about customer preferences
- Offer special price for combination with top-ranked second item