Association Analysis Flashcards

1
Q

Which application areas can benefit from detecting co-occurrence relationships?

A

Marketing: Identify items that are bought together for marketing purposes

Inventory Management: Identify parts that are often needed together for repairs, so that the repair vehicle can be equipped accordingly

Usage Mining: Identify words that frequently appear together in search queries to offer auto-completion

2
Q

Describe correlation analysis. Which technique can be used for continuous variables and which for binary variables?

A
  • Measures the degree of dependency between two variables

Continuous variables: Pearson's correlation coefficient
Binary variables: Phi coefficient

Value range:

1: positive correlation
0: independent (no correlation)
-1: negative correlation
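A minimal sketch of both measures in Python, assuming NumPy is available; the two toy variable pairs are invented for illustration:

```python
import numpy as np

# Continuous variables: Pearson's correlation coefficient
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
pearson = np.corrcoef(x, y)[0, 1]  # close to +1: strong positive correlation

# Binary variables: phi coefficient from the 2x2 contingency table
#            b=1   b=0
#   a=1      n11   n10
#   a=0      n01   n00
a = np.array([1, 1, 0, 0, 1, 0, 1, 0])
b = np.array([1, 1, 0, 1, 1, 0, 0, 0])
n11 = np.sum((a == 1) & (b == 1))
n10 = np.sum((a == 1) & (b == 0))
n01 = np.sum((a == 0) & (b == 1))
n00 = np.sum((a == 0) & (b == 0))
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)
print(pearson, phi)  # phi = 0.5 for this toy data
```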

3
Q

What is the shortcoming with correlations between products in shopping baskets?

A
  • Correlation can only find relationships between two items, not between multiple items
4
Q

What is the benefit of association analysis compared to correlations?

A
  • Association analysis can find co-occurrence relationships between multiple items (descriptive method)
  • focuses on items that actually occur together
5
Q

What can association analysis not find?

A
  • Causal relationships
6
Q

What is an itemset?

A
  • collection of one or more items (e.g. in a transaction)

- k-itemset: An itemset that contains k items

7
Q

Define Support count

A
  • frequency of occurrence of an itemset
8
Q

Define Support

A
  • fraction of transactions that contain an itemset
9
Q

Define frequent itemset

A
  • an itemset whose support is >= minimal support (minsup)
10
Q

What is the difference between the rule evaluation metrics Support and Confidence?

A

X (Condition) -> Y (Consequent)
Support:
- fraction of transactions that contain both X and Y

Confidence:
- how often items in Y appear in transactions that contain X
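A small worked example tying the last few definitions together (support count, support, confidence); the five toy transactions are invented for illustration:

```python
# Toy transaction database (invented for illustration)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
N = len(transactions)

X = {"Milk", "Diapers"}  # condition
Y = {"Beer"}             # consequent

sigma_X = sum(X <= t for t in transactions)        # support count of X
sigma_XY = sum(X | Y <= t for t in transactions)   # support count of X and Y together

support = sigma_XY / N           # fraction of transactions containing both X and Y
confidence = sigma_XY / sigma_X  # how often Y appears in transactions that contain X
print(sigma_XY, support, confidence)   # 2, 0.4, ~0.67
```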

11
Q

What are the main challenges of association analysis?

A

1) Mining associations from large amounts of data can be computationally expensive (need to apply smart pruning strategies)
2) Algorithms often discover a large number of associations (many are irrelevant or redundant, user needs to select the relevant subset)

12
Q

What is the goal of association rule mining?

A
  • Find all rules that have support >= the minsup threshold and confidence >= the minconf threshold
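In practice this goal maps directly onto library calls. A sketch using the mlxtend package (the one-hot DataFrame and both thresholds are assumptions chosen for illustration; exact signatures may vary between versions):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions (True = item present), invented for illustration
df = pd.DataFrame(
    [[True, True, False], [True, True, True], [False, True, True], [True, False, True]],
    columns=["Bread", "Milk", "Beer"],
)

# Step 1: all itemsets whose support >= minsup
frequent = apriori(df, min_support=0.5, use_colnames=True)

# Step 2: all rules built from those itemsets whose confidence >= minconf
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```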
13
Q

Explain the Brute Force Approach for Association Rule mining

A

1) List all possible association rules
2) Compute support and confidence for each rule
3) Remove rules that fail the minsup and minconf thresholds

Attention: computationally prohibitive due to the large number of candidates!

14
Q

What happens with rules that originate from the same itemset?

A
  • The rules have the same support but can have different confidence
15
Q

Explain the two-step approach of rule generation; is this approach computationally better compared to the brute force approach?

A

1) Frequent itemset generation (all itemsets whose support >= minsup)
2) Rule generation (high confidence rules from each frequent itemset, each rule is a binary partitioning of a frequent itemset)

-> Frequent itemset generation is still computationally expensive

16
Q

What is the difference between a frequent itemset and a candidate itemset?

A
  • Frequent itemset has support >= minsup

- Candidate itemset is potentially a frequent itemset (it counts as frequent only if its support turns out to be >= minsup)

17
Q

Explain the Brute Force Approach for frequent itemset generation

A
  • Each itemset is a candidate frequent itemset
  • Count the support of each candidate by scanning the database (match each transaction against every candidate)

Complexity is O(N·M·w) with N = number of transactions, M = number of candidate itemsets, w = maximum transaction width
-> expensive because M = 2^d (d = number of items)
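A sketch of this brute-force enumeration in Python; the toy transactions are invented, and the nested candidate/transaction matching makes the O(N·M·w) cost visible:

```python
from itertools import combinations

transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]  # N transactions
items = sorted(set().union(*transactions))                            # d distinct items

# M = 2^d - 1 candidate itemsets (every non-empty subset of the items)
candidates = [
    frozenset(c) for k in range(1, len(items) + 1) for c in combinations(items, k)
]

# Match every transaction (width <= w) against every candidate: O(N * M * w)
support_count = {c: sum(c <= t for t in transactions) for c in candidates}
print(support_count[frozenset({"A", "B"})])  # 2
```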

18
Q

Explain why the brute force approach is not feasible (using the example of Amazon)

A
  • Amazon has 10 million books
  • 2^10,000,000 possible itemsets
  • Most itemsets will not be relevant (e.g. books on Inuit cooking and data mining bought together)
  • Intuition needed for the algorithm: all itemsets containing the Inuit cooking book are likely infrequent
19
Q

Explain the Apriori principle in regard to frequent itemsets

A

If an itemset is frequent, then all of its subsets must also be frequent

20
Q

What is the anti-monotone property of support?

A
  • Support of an itemset never exceeds the support of its subsets
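Stated formally (writing s(·) for support): for all itemsets X ⊆ Y, it holds that s(X) ≥ s(Y). Adding items to an itemset can therefore never increase its support, which is exactly what the Apriori pruning in the following cards exploits.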
21
Q

How can you use the apriori principle for pruning?

A
  • If an itemset is infrequent, then all of its supersets must also be infrequent
  • Prune all supersets of an itemset that is infrequent
22
Q

Explain the Apriori Algorithm for frequent itemset generation

A
  1. k = 1
  2. Generate frequent itemsets of length 1
  3. Repeat until no new frequent itemsets are identified:
     3.1 Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
     3.2 Prune candidate itemsets according to the Apriori principle
     3.3 Count the support of each candidate (scan the DB)
     3.4 Eliminate candidates that are infrequent

(A runnable sketch of these steps follows below.)
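A compact sketch, assuming transactions are given as Python sets and ignoring efficiency tricks such as hash trees; the function name `apriori` and the toy database are illustrative only:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    n_required = minsup * len(transactions)

    def support_count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Steps 1-2: frequent itemsets of length k = 1
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in support_count(items).items() if s >= n_required}
    result = dict(frequent)

    # Step 3: repeat until no new frequent itemsets are identified
    k = 1
    while frequent:
        # 3.1 generate length-(k+1) candidates from the length-k frequent itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # 3.2 prune candidates that have an infrequent length-k subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # 3.3 count support by scanning the DB, 3.4 eliminate infrequent candidates
        frequent = {c: s for c, s in support_count(candidates).items() if s >= n_required}
        result.update(frequent)
        k += 1
    return result

# Toy usage (invented data), minsup = 0.5
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]
print(apriori(transactions, minsup=0.5))
```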
23
Q

Briefly state what the FP-Growth frequent itemset generation method is

A

An itemset generation algorithm that compresses the data into a tree structure (FP-tree) held in memory

24
Q

How can you apply frequent itemsets in e-commerce?

A
  • Take the top-k frequent itemsets of size 2 that contain item A
  • Rank the second item according to criteria such as the profit made by selling it, whether you want to reduce the stock of item B, or knowledge about customer preferences
  • Offer a special price for the combination with the top-ranked second item
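A minimal sketch of this selection step, assuming the frequent 2-itemsets and their support counts are already available; the counts are invented, and the ranking score here is simply the support (a profit- or stock-based score would replace it):

```python
# Frequent 2-itemsets with support counts, as produced by itemset generation (invented)
frequent_pairs = {
    frozenset({"A", "B"}): 120,
    frozenset({"A", "C"}): 95,
    frozenset({"A", "D"}): 60,
    frozenset({"B", "C"}): 80,
}

def top_k_partners(item, k, score=lambda pair, count: count):
    """Top-k items that co-occur with `item`, ranked by a pluggable score."""
    pairs = [(p, c) for p, c in frequent_pairs.items() if item in p]
    pairs.sort(key=lambda pc: score(*pc), reverse=True)
    return [next(iter(p - {item})) for p, _ in pairs[:k]]

print(top_k_partners("A", k=2))   # ['B', 'C'] -> offer a bundle price with 'B'
```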
25
Q

Briefly explain how rules are generated

A
  • Given a frequent itemset L, find all non-empty subsets X ⊂ L such that the rule X -> (L - X) satisfies the minimum confidence requirement
26
Q

Explain the challenge with the large number of candidate rules

A
  • Frequent itemset {A,B,C,D}
  • Candidate association rules: 2^k - 2 (here 2^4 - 2 = 14)
  • -> Large number of candidate rules for large frequent itemsets
  • (this count ignores the rules with an empty condition or an empty consequent side)
27
Q

Explain Anti-monotone confidence

A

Anti-monotone confidence = moving items from the condition to the consequent side cannot increase confidence

  • In general, confidence does not have an anti-monotone property (the confidence of a subset can be larger or smaller than the confidence of the frequent itemset)
  • Confidence of rules generated from the same itemset does have an anti-monotone property

(If a customer bought three of the books, it is likely they will buy the fourth book; if they only bought one book, it is unlikely that they will buy the three other books from the frequent itemset.)

28
Q

How can you efficiently prune candidate rules?

A
  • By using the anti-monotone property of confidence we know that moving items from the left to the right side cannot increase the confidence
  • If a rule with many items on the left side and one item on the right side is a low-confidence rule, you can prune all rules that are sub-rules of that rule
29
Q

How can you generate candidate rules within the apriori algorithm?

A

1) Merge two rules that share the same prefix in the rule consequent (CD->AB and BD-> AC results in D->ABC)
2) Prune rule D->ABC if one of its parent rules does not have high confidence (e.g. AD->BC)

Possible because:

  • all information required for the confidence computation has already been created during itemset generation
  • Thus, there is no need to scan the transaction data T anymore
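A simplified sketch of rule generation from a single frequent itemset that reuses the support counts from itemset generation; the `support_counts` numbers and the threshold are invented, and the level-wise merging of consequents described above is omitted for brevity (every non-empty subset is simply tried as the condition):

```python
from itertools import combinations

# Support counts as they would come out of frequent itemset generation (invented numbers)
support_counts = {
    frozenset("ABCD"): 10,
    frozenset("ABC"): 12, frozenset("ABD"): 14, frozenset("ACD"): 13, frozenset("BCD"): 11,
    frozenset("AB"): 20, frozenset("AC"): 18, frozenset("AD"): 16,
    frozenset("BC"): 15, frozenset("BD"): 19, frozenset("CD"): 17,
    frozenset("A"): 30, frozenset("B"): 28, frozenset("C"): 25, frozenset("D"): 26,
}

def rules_from_itemset(itemset, minconf):
    """All rules X -> (itemset - X) whose confidence is >= minconf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(len(itemset) - 1, 0, -1):        # shrink the condition side
        for condition in map(frozenset, combinations(itemset, size)):
            conf = support_counts[itemset] / support_counts[condition]
            if conf >= minconf:
                rules.append((set(condition), set(itemset - condition), conf))
    return rules

for cond, cons, conf in rules_from_itemset("ABCD", minconf=0.6):
    print(cond, "->", cons, round(conf, 2))
```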
30
Q

How can you handle categorical attributes in association analysis?

A
  • Transform categorical attributes into asymmetric binary variables (one new "item" for each distinct attribute=value pair, e.g. Color=green, Color=blue, ...)
31
Q

What is the issue with the transformation of categorical attributes into asymmetric binary variables?

A
  • If an attribute has many possible values, many of them will have very low support
    (potential solution: aggregate low-support attribute values)
  • If the distribution of attribute values is highly skewed, drop the highly frequent item (e.g. 95% of visitors have Buy = No, so most items would be associated with Buy = No)
32
Q

How can you handle continuous attributes for association analysis?

A
  • Transform continuous attributes into binary variables with discretization (equal-width / equal-frequency binning)
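A short pandas sketch of both transformations from the last few cards; the column names, data, and bin counts are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["green", "blue", "green", "red"],
    "age":   [23, 45, 31, 62],
})

# Categorical attribute -> one asymmetric binary "item" per attribute=value pair
color_items = pd.get_dummies(df["color"], prefix="color")   # color_green, color_blue, ...

# Continuous attribute -> discretize, then binarize the resulting intervals
age_bins_width = pd.cut(df["age"], bins=3)   # equal-width binning
age_bins_freq = pd.qcut(df["age"], q=2)      # equal-frequency binning
age_items = pd.get_dummies(age_bins_width, prefix="age")

basket = pd.concat([color_items, age_items], axis=1)
print(basket)
```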
33
Q

What is the issue with the discretization of continuous attributes?

A
  • Size of the discretization intervals affects support & confidence
  • If intervals are too small -> may not have enough support
  • If intervals are too large -> rules may not have enough confidence
34
Q

Why do you need further interestingness measures to improve the results of association rule mining?

A
  • Association rule algorithms tend to produce too many rules (many are uninteresting or redundant)
  • Interestingness depends on application area
  • Interestingness measures can be used to prune or rank rules

(Originally, support and confidence were the only interestingness measures.)

35
Q

In which case can confidence be misleading when judging rules?

A
  • The confidence of a rule can be high
  • But the support of the consequent Y can be even higher
  • -> The rule is then misleading because the condition X actually reduces the probability of Y

We want c(X->Y) > s(Y)

36
Q

What is the lift measure?

A

Lift = c(X->Y) / s(Y)

  • Confidence normalized by support of consequent

Interpretation:

  • Lift > 1: X and Y are positively correlated
  • Lift = 1: X and Y are independent
  • Lift < 1: X and Y are negatively correlated
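A tiny numeric check of these cases, with invented counts (1000 transactions, 600 containing Y, 200 containing X, 150 containing both):

```python
N, n_Y, n_X, n_XY = 1000, 600, 200, 150   # invented counts

confidence = n_XY / n_X        # c(X->Y) = 0.75, looks like a strong rule
support_Y = n_Y / N            # s(Y)    = 0.60
lift = confidence / support_Y  # 1.25 > 1: X and Y really are positively correlated
print(confidence, support_Y, lift)
```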
37
Q

For which task is the human always responsible when doing association analysis?

A
  • Human needs to interpret the results based on the knowledge about the application domain
  • Which patterns are meaningful / surprising?