Association Analysis Flashcards

1
Q

Which application areas can benefit from detecting co-occurrence relationships?

A

  • Marketing: identify items that are bought together for marketing purposes
  • Inventory management: identify parts that are often needed together for repairs, to equip the repair vehicle accordingly
  • Usage mining: identify words that frequently appear together in search queries to offer auto-completion

2
Q

Describe correlation analysis. Which technique can be used for continuous variables and which for binary variables?

A
  • Measure the degree of dependency between two variables

Continuous variables: Pearson's correlation coefficient
Binary variables: Phi coefficient

Value range:

+1: positive correlation
0: variables are independent
-1: negative correlation

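A minimal sketch of both measures, assuming NumPy is available; the data arrays are invented for illustration. The Phi coefficient is computed here as Pearson's r on 0/1-coded variables, which is mathematically equivalent.

    import numpy as np

    # Continuous variables: Pearson's correlation coefficient
    height = np.array([1.62, 1.75, 1.80, 1.68, 1.90])
    weight = np.array([58.0, 72.0, 80.0, 65.0, 95.0])
    pearson_r = np.corrcoef(height, weight)[0, 1]   # value in [-1, 1]

    # Binary variables: the Phi coefficient (Pearson's r on 0/1 data)
    bought_beer  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    bought_chips = np.array([1, 0, 1, 0, 0, 0, 1, 1])
    phi = np.corrcoef(bought_beer, bought_chips)[0, 1]

    print(f"Pearson r = {pearson_r:.2f}, Phi = {phi:.2f}")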
3
Q

What is the shortcoming of correlations between products in shopping baskets?

A
  • Correlation can only find relationships between two items, not among multiple items
4
Q

What is the benefit of association analysis compared to correlations?

A
  • Association analysis can find co-occurrence relationships among multiple items (descriptive method)
  • It focuses on items that actually occur together
5
Q

What can association analysis not find?

A
  • Causal relationships
6
Q

What is an itemset?

A
  • collection of one or more items (e.g. in a transaction)
  • k-itemset: an itemset that contains k items

7
Q

Define Support count

A
  • frequency of occurrence of an itemset (absolute number of transactions that contain it)
8
Q

Define Support

A
  • fraction of transactions that contain an itemset
9
Q

Define frequent itemset

A
  • an itemset whose support is >= minimal support (minsup)
10
Q

What is the difference between the rule evaluation metrics Support and Confidence?

A

X (Condition) -> Y (Consequent)

Support:
- fraction of transactions that contain both X and Y

Confidence:
- how often items in Y appear in transactions that contain X

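A small illustrative Python sketch of how these two metrics could be computed for a rule X -> Y; the transaction data and helper names are made up for the example.

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    ]

    def support(itemset, transactions):
        # fraction of transactions that contain every item of the itemset
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(condition, consequent, transactions):
        # how often the consequent appears in transactions containing the condition
        both = set(condition) | set(consequent)
        return support(both, transactions) / support(condition, transactions)

    print(support({"diapers", "beer"}, transactions))       # s(X u Y) = 0.6
    print(confidence({"diapers"}, {"beer"}, transactions))  # c(X -> Y) = 0.75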
11
Q

What are the main challenges of association analysis?

A

1) Mining associations from large amounts of data can be computationally expensive (need to apply smart pruning strategies)
2) Algorithms often discover a large number of associations (many are irrelevant or redundant, user needs to select the relevant subset)

12
Q

What is the goal of association rule mining?

A
  • Find all rules that have support >= the minsup threshold and confidence >= the minconf threshold
13
Q

Explain the Brute Force Approach for Association Rule mining

A

1) List all possible association rules
2) Compute support and confidence for each rule
3) Remove rules that fail the minsup and minconf thresholds

Attention: Computationally prohibitive due to large number of candidates!

14
Q

What happens with rules that originate from the same itemset?

A
  • The rules have the same support but can have different confidence
15
Q

Explain the two-step approach of rule generation; is this approach computationally better compared to the brute force approach?

A

1) Frequent itemset generation (all itemsets whose support >= minsup)
2) Rule generation (high confidence rules from each frequent itemset, each rule is a binary partitioning of a frequent itemset)

-> Frequent itemset generation is still computationally expensive

16
Q

What is the difference between a frequent itemset and a candidate itemset?

A
  • A frequent itemset has support >= minsup
  • A candidate itemset is potentially a frequent itemset (it is one only if its support is >= minsup)

17
Q

Explain the Brute Force Approach for frequent itemset generation

A
  • Each itemset is a candidate frequent itemset
  • Count the support of each candidate by scanning the database (match each transaction against every candidate)

Complexity is O(N*M*w), with N = number of transactions, M = number of candidate itemsets, w = maximum transaction width
-> expensive because M = 2^d (d = number of items)

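A sketch of the brute-force counting in Python; the tiny transaction list is invented. It enumerates every non-empty candidate itemset and scans the database once per candidate, which is exactly what makes the approach infeasible for realistic d.

    from itertools import combinations

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
    ]
    items = sorted(set().union(*transactions))    # d distinct items
    minsup = 2 / 3

    frequent = {}
    for k in range(1, len(items) + 1):            # M = 2^d candidates overall
        for candidate in combinations(items, k):
            # one database scan per candidate: O(N * w) comparisons
            count = sum(set(candidate) <= t for t in transactions)
            if count / len(transactions) >= minsup:
                frequent[candidate] = count / len(transactions)

    print(frequent)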
18
Q

Explain why the brute force approach is not feasible (using the example of Amazon)

A
  • Amazon has 10 million books
  • 2^10,000,000 possible itemsets
  • -> Most itemsets will not be relevant (e.g. books on Inuit cooking and data mining bought together)
  • -> Intuition for the algorithm: all itemsets containing a book on Inuit cooking are likely infrequent
19
Q

Explain the Apriori principle in regard to frequent itemsets

A

If an itemset is frequent, then all of its subsets must also be frequent

20
Q

What is the anti-monotone property of support?

A
  • Support of an itemset never exceeds the support of its subsets
21
Q

How can you use the Apriori principle for pruning?

A
  • If an itemset is infrequent then all of its supersets must also be infrequent
  • Prune all supersets of an itemset that is infrequent
22
Q

Explain the Apriori Algorithm for frequent itemset generation

A
  1. Let k = 1
  2. Generate frequent itemsets of length 1
  3. Repeat until no new frequent itemsets are identified:
     3.1 Generate length k+1 candidate itemsets from the length k frequent itemsets
     3.2 Prune candidate itemsets according to the Apriori principle
     3.3 Count the support of each remaining candidate (scan the DB)
     3.4 Eliminate candidates that are infrequent
  (a Python sketch follows below)
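A compact, unoptimized Python sketch of these steps; candidate generation here simply joins frequent k-itemsets, and all names and the example data are illustrative rather than a canonical implementation.

    from itertools import combinations

    def apriori(transactions, minsup):
        n = len(transactions)
        support = lambda itemset: sum(itemset <= t for t in transactions) / n

        # steps 1-2: k = 1, frequent itemsets of length 1
        items = sorted(set().union(*transactions))
        frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
        all_frequent = list(frequent)

        k = 1
        while frequent:                                   # step 3: repeat until nothing new
            # 3.1 generate length k+1 candidates from length k frequent itemsets
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
            # 3.2 prune: every k-subset of a candidate must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k))}
            # 3.3 count support (scan the DB) and 3.4 eliminate infrequent candidates
            frequent = [c for c in candidates if support(c) >= minsup]
            all_frequent.extend(frequent)
            k += 1
        return all_frequent

    baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
               {"milk", "diapers", "beer"}, {"bread", "milk", "diapers", "beer"}]
    print(apriori(baskets, minsup=0.5))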
23
Q

Briefly state what the FP-Growth frequent itemset generation method is

A

A frequent itemset generation algorithm that compresses the data into a tree structure (FP-tree) held in memory

24
Q

How can you apply frequent itemsets in e-commerce?

A
  • Take the top-k frequent itemsets of size 2 that contain item A
  • Rank the second item according to criteria such as the profit made by selling it, whether you want to reduce the stock of item B, or knowledge about customer preferences
  • Offer special price for combination with top-ranked second item
25
Q

Briefly explain how rules are generated

A
  • Given a frequent itemset, find all rules (binary partitionings of the itemset into condition and consequent) that satisfy the minimum confidence requirement
26
Q

Explain the challenge with the large number of candidate rules

A
  • A frequent k-itemset such as {A,B,C,D} yields 2^k - 2 candidate association rules (this count ignores rules with an empty condition or empty consequent)
  • -> Large number of candidate rules for large frequent itemsets
27
Q

Explain anti-monotone confidence

A
  • Anti-monotone confidence: moving items from the condition side to the consequent side cannot increase the confidence
  • In general, confidence does not have an anti-monotone property (the confidence of a subset can be larger or smaller than the confidence of the frequent itemset)
  • Confidence of rules generated from the same itemset does have an anti-monotone property (if a customer bought three of the books, it is likely he will buy the fourth; if he bought only one book, it is unlikely he will buy the other three books of the frequent itemset)
28
Q

How can you efficiently prune candidate rules?

A
  • Using the anti-monotone property of confidence, we know that moving elements from the left side to the right side cannot increase the confidence
  • If a rule with many items on the left side and one item on the right side has low confidence, you can prune all rules that are sub-rules of that rule
29
Q

How can you generate candidate rules within the Apriori algorithm?

A
  1) Merge two rules that share the same prefix in the rule consequent (e.g. CD->AB and BD->AC result in D->ABC)
  2) Prune rule D->ABC if one of its parent rules (e.g. AD->BC) does not have high confidence

This is possible because:
  • all information required for the confidence computation has already been created during frequent itemset generation
  • thus, there is no need to scan the transaction data T again
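A simplified Python sketch of rule generation from a single frequent itemset using the anti-monotone property of confidence (level-wise growth of the consequent). It is not the exact merge-by-consequent-prefix procedure described above, and all names and data are illustrative.

    def gen_rules(freq_itemset, transactions, minconf):
        n = len(transactions)
        support = lambda s: sum(s <= t for t in transactions) / n
        s_full = support(freq_itemset)

        rules = []
        consequents = [frozenset([i]) for i in freq_itemset]   # start with 1-item consequents
        while consequents:
            kept = []
            for cons in consequents:
                cond = freq_itemset - cons
                if not cond:
                    continue
                conf = s_full / support(cond)      # c(cond -> cons)
                if conf >= minconf:
                    rules.append((cond, cons, conf))
                    kept.append(cons)              # only high-confidence rules are extended
            # grow consequents only from rules that met minconf (anti-monotone pruning)
            consequents = {a | b for a in kept for b in kept if len(a | b) == len(a) + 1}
        return rules

    baskets = [frozenset(t) for t in [{"A", "B", "C", "D"}, {"A", "B", "C"},
                                      {"A", "B", "D"}, {"B", "C", "D"}]]
    print(gen_rules(frozenset({"A", "B", "C", "D"}), baskets, minconf=0.3))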
30
Q

How can you handle categorical attributes in association analysis?

A
  • Transform categorical attributes into asymmetric binary variables (one new "item" for each distinct attribute-value pair), e.g. Color=green, Color=blue, ...
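A small sketch of this transformation, assuming pandas is available; the column names and values are made up.

    import pandas as pd

    df = pd.DataFrame({"color": ["green", "blue", "green"],
                       "size":  ["M", "L", "M"]})
    # one binary "item" per attribute-value pair: color_green, color_blue, size_M, size_L
    items = pd.get_dummies(df)
    print(items)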
31
Q

What is the issue with the transformation of categorical attributes into asymmetric binary variables?

A
  • If an attribute has many possible values, many of them will have very low support (potential solution: aggregate low-support attribute values)
  • If the distribution of attribute values is highly skewed, drop the highly frequent item (e.g. 95% of visitors have Buy=no, so most itemsets would be associated with Buy=no)
32
Q

How can you handle continuous attributes in association analysis?

A
  • Transform continuous attributes into binary variables via discretization (equal-width / equal-frequency binning)
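A short sketch of both binning variants, assuming pandas; the data is invented for illustration.

    import pandas as pd

    age = pd.Series([18, 22, 25, 31, 40, 44, 52, 60, 65, 70])
    equal_width = pd.cut(age, bins=4)    # 4 intervals of equal width
    equal_freq  = pd.qcut(age, q=4)      # 4 intervals with roughly the same number of values
    # each resulting interval can then become a binary item, e.g. via pd.get_dummies(equal_width)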
33
Q

What is the issue with the discretization of continuous attributes?

A
  • The size of the discretization intervals affects support and confidence
  • If intervals are too small -> itemsets may not have enough support
  • If intervals are too large -> rules may not have enough confidence
34
Q

Why do you need further interestingness measures to improve the results of association rule mining?

A
  • Association rule algorithms tend to produce too many rules (many are uninteresting or redundant)
  • Interestingness depends on the application area
  • Interestingness measures can be used to prune or rank rules (originally, support and confidence were the only interestingness measures)
35
Q

In which case can confidence be misleading when judging rules?

A
  • The confidence of a rule X -> Y can be high
  • But the support of the consequent s(Y) can be even higher
  • -> The rule is then misleading because the condition X actually reduces the probability of Y
  • -> We want c(X -> Y) > s(Y)
36
Q

What is the lift measure?

A

Lift = c(X -> Y) / s(Y)
  • confidence normalized by the support of the consequent

Interpretation:
  • Lift > 1: X and Y are positively correlated
  • Lift = 1: X and Y are independent
  • Lift < 1: X and Y are negatively correlated
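A tiny worked example with invented numbers, showing how a high-confidence rule can still have lift below 1:

    confidence_xy = 0.75               # c(Tea -> Coffee), looks high
    support_y     = 0.90               # s(Coffee), even higher
    lift = confidence_xy / support_y   # ~0.83 < 1
    # Lift < 1: buying tea is negatively correlated with buying coffee,
    # although the confidence of the rule alone looked convincing.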
37
Q

Which task always remains with the human when doing association analysis?

A
  • The human needs to interpret the results based on knowledge about the application domain
  • Which patterns are meaningful / surprising?