Association Analysis Flashcards

1
Q

Which application areas can benefit from detecting co-occurrence relationships?

A

Marketing: Identify items that are bought together for marketing purposes

Inventory Management: Identify parts that are often needed together for repairs, so that the repair vehicle can be equipped accordingly

Usage Mining: Identify words that frequently appear together in search queries to offer auto-completion

2
Q

Describe correlation analysis. Which technique can be used for continuous variables and which for binary variables?

A
  • Measures the degree of dependency between two variables

Continuous variables: Pearson's correlation coefficient
Binary variables: Phi coefficient

Value range:

1: positive correlation
0: independent (no correlation)
-1: negative correlation
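A minimal sketch of both measures in Python, assuming NumPy is available; the two toy variable pairs are invented for illustration:

```python
import numpy as np

# Continuous variables: Pearson's correlation coefficient
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
pearson = np.corrcoef(x, y)[0, 1]  # close to +1: strong positive correlation

# Binary variables: phi coefficient from the 2x2 contingency table
#            b=1   b=0
#   a=1      n11   n10
#   a=0      n01   n00
a = np.array([1, 1, 0, 0, 1, 0, 1, 0])
b = np.array([1, 1, 0, 1, 1, 0, 0, 0])
n11 = np.sum((a == 1) & (b == 1))
n10 = np.sum((a == 1) & (b == 0))
n01 = np.sum((a == 0) & (b == 1))
n00 = np.sum((a == 0) & (b == 0))
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)
print(pearson, phi)  # phi = 0.5 for this toy data
```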

3
Q

What is the shortcoming with correlations between products in shopping baskets?

A
  • Correlation can only find relationships between two items, not between multiple items
4
Q

What is the benefit of association analysis compared to correlations?

A
  • Association analysis can find co-occurrence relationships between multiple items (descriptive method)
  • focuses on items that actually occur together
5
Q

What can association analysis not find?

A
  • Causal relationships
6
Q

What is an itemset?

A
  • collection of one or more items (e.g. in a transaction)

- k-itemset: An itemset that contains k items

7
Q

Define Support count

A
  • frequency of occurrence of an itemset
8
Q

Define Support

A
  • fraction of transactions that contain an itemset
9
Q

Define frequent itemset

A
  • an itemset whose support is >= minimal support (minsup)
10
Q

What is the difference between the rule evaluation metrics Support and Confidence?

A

X (Condition) -> Y (Consequent)
Support:
- fraction of transactions that contain both X and Y

Confidence:
- how often items in Y appear in transactions that contain X
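A small worked example tying the last few definitions together (support count, support, confidence); the five toy transactions are invented for illustration:

```python
# Toy transaction database (invented for illustration)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
N = len(transactions)

X = {"Milk", "Diapers"}  # condition
Y = {"Beer"}             # consequent

sigma_X = sum(X <= t for t in transactions)        # support count of X
sigma_XY = sum(X | Y <= t for t in transactions)   # support count of X and Y together

support = sigma_XY / N           # fraction of transactions containing both X and Y
confidence = sigma_XY / sigma_X  # how often Y appears in transactions that contain X
print(sigma_XY, support, confidence)   # 2, 0.4, ~0.67
```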

11
Q

What are the main challenges of association analysis?

A

1) Mining associations from large amounts of data can be computationally expensive (need to apply smart pruning strategies)
2) Algorithms often discover a large number of associations (many are irrelevant or redundant, user needs to select the relevant subset)

12
Q

What is the goal of association rule mining?

A
  • Find all rules that have support >= the minsup threshold and confidence >= the minconf threshold
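In practice this goal maps directly onto library calls. A sketch using the mlxtend package (the one-hot DataFrame and both thresholds are assumptions chosen for illustration; exact signatures may vary between versions):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions (True = item present), invented for illustration
df = pd.DataFrame(
    [[True, True, False], [True, True, True], [False, True, True], [True, False, True]],
    columns=["Bread", "Milk", "Beer"],
)

# Step 1: all itemsets whose support >= minsup
frequent = apriori(df, min_support=0.5, use_colnames=True)

# Step 2: all rules built from those itemsets whose confidence >= minconf
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```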
13
Q

Explain the Brute Force Approach for Association Rule mining

A

1) List all possible association rules
2) Compute support and confidence for each rule
3) Remove rules that fail the minsup and minconf thresholds

Attention: computationally prohibitive due to the large number of candidates!

14
Q

What happens with rules that originate from the same itemset?

A
  • The rules have the same support but can have different confidence
15
Q

Explain the two-step approach of rule generation; is this approach computationally better compared to the brute force approach?

A

1) Frequent itemset generation (all itemsets whose support >= minsup)
2) Rule generation (high confidence rules from each frequent itemset, each rule is a binary partitioning of a frequent itemset)

-> Frequent itemset generation is still computationally expensive

16
Q

What is the difference between a frequent itemset and a candidate itemset?

A
  • Frequent itemset has support >= minsup

- Candidate itemset is potentially a frequent itemset (it counts as frequent only if its support turns out to be >= minsup)

17
Q

Explain the Brute Force Approach for frequent itemset generation

A
  • Each itemset is a candidate frequent itemset
  • Count the support of each candidate by scanning the database (match each transaction against every candidate)

Complexity is O(N·M·w) with N = number of transactions, M = number of candidate itemsets, w = maximum transaction width
-> expensive because M = 2^d (d = number of items)
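A sketch of this brute-force enumeration in Python; the toy transactions are invented, and the nested candidate/transaction matching makes the O(N·M·w) cost visible:

```python
from itertools import combinations

transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]  # N transactions
items = sorted(set().union(*transactions))                            # d distinct items

# M = 2^d - 1 candidate itemsets (every non-empty subset of the items)
candidates = [
    frozenset(c) for k in range(1, len(items) + 1) for c in combinations(items, k)
]

# Match every transaction (width <= w) against every candidate: O(N * M * w)
support_count = {c: sum(c <= t for t in transactions) for c in candidates}
print(support_count[frozenset({"A", "B"})])  # 2
```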

18
Q

Explain why the brute force approach is not feasible (using the example of Amazon)

A
  • Amazon has 10 million books
  • 2^10,000,000 possible itemsets
  • Most itemsets will not be relevant (e.g. books on Inuit cooking and data mining bought together)
  • Intuition needed for the algorithm: all itemsets containing the Inuit cooking book are likely infrequent
19
Q

Explain the Apriori principle in regard to frequent itemsets

A

If an itemset is frequent, then all of its subsets must also be frequent

20
Q

What is the anti-monotone property of support?

A
  • Support of an itemset never exceeds the support of its subsets
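Stated formally (writing s(·) for support): for all itemsets X ⊆ Y, it holds that s(X) ≥ s(Y). Adding items to an itemset can therefore never increase its support, which is exactly what the Apriori pruning in the following cards exploits.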
21
Q

How can you use the apriori principle for pruning?

A
  • If an itemset is infrequent, then all of its supersets must also be infrequent
  • Prune all supersets of an itemset that is infrequent
22
Q

Explain the Apriori Algorithm for frequent itemset generation

A
  1. k = 1
  2. Generate frequent itemsets of length 1
  3. Repeat until no new frequent itemsets are identified:
     3.1 Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
     3.2 Prune candidate itemsets according to the Apriori principle
     3.3 Count the support of each candidate (scan the DB)
     3.4 Eliminate candidates that are infrequent

(A runnable sketch of these steps follows below.)
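A compact sketch, assuming transactions are given as Python sets and ignoring efficiency tricks such as hash trees; the function name `apriori` and the toy database are illustrative only:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    n_required = minsup * len(transactions)

    def support_count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Steps 1-2: frequent itemsets of length k = 1
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in support_count(items).items() if s >= n_required}
    result = dict(frequent)

    # Step 3: repeat until no new frequent itemsets are identified
    k = 1
    while frequent:
        # 3.1 generate length-(k+1) candidates from the length-k frequent itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # 3.2 prune candidates that have an infrequent length-k subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # 3.3 count support by scanning the DB, 3.4 eliminate infrequent candidates
        frequent = {c: s for c, s in support_count(candidates).items() if s >= n_required}
        result.update(frequent)
        k += 1
    return result

# Toy usage (invented data), minsup = 0.5
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]
print(apriori(transactions, minsup=0.5))
```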
23
Q

Briefly state what the FP-Growth frequent itemset generation method is

A

An itemset generation algorithm that compresses the data into a tree structure (FP-tree) held in memory

24
Q

How can you apply frequent itemsets in e-commerce?

A
  • Take the top-k frequent itemsets of size 2 that contain item A
  • Rank the second item according to criteria such as the profit made by selling it, whether you want to reduce the stock of item B, or knowledge about customer preferences
  • Offer a special price for the combination with the top-ranked second item
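A minimal sketch of this selection step, assuming the frequent 2-itemsets and their support counts are already available; the counts are invented, and the ranking score here is simply the support (a profit- or stock-based score would replace it):

```python
# Frequent 2-itemsets with support counts, as produced by itemset generation (invented)
frequent_pairs = {
    frozenset({"A", "B"}): 120,
    frozenset({"A", "C"}): 95,
    frozenset({"A", "D"}): 60,
    frozenset({"B", "C"}): 80,
}

def top_k_partners(item, k, score=lambda pair, count: count):
    """Top-k items that co-occur with `item`, ranked by a pluggable score."""
    pairs = [(p, c) for p, c in frequent_pairs.items() if item in p]
    pairs.sort(key=lambda pc: score(*pc), reverse=True)
    return [next(iter(p - {item})) for p, _ in pairs[:k]]

print(top_k_partners("A", k=2))   # ['B', 'C'] -> offer a bundle price with 'B'
```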
25
Q

Briefly explain how rules are generated

A
  • Given a frequent itemset L, find all non-empty subsets X ⊂ L such that the rule X -> (L - X) satisfies the minimum confidence requirement
26
Q

Explain the challenge with the large number of candidate rules

A
  • Frequent itemset {A,B,C,D}
  • Candidate association rules: 2^k - 2 (here 2^4 - 2 = 14)
  • -> Large number of candidate rules for large frequent itemsets
  • (this count ignores the rules with an empty condition or an empty consequent side)
27
Q

Explain Anti-monotone confidence

A

Anti-monotone confidence = moving items from the condition to the consequent side cannot increase confidence

  • In general, confidence does not have an anti-monotone property (the confidence of a subset can be larger or smaller than the confidence of the frequent itemset)
  • Confidence of rules generated from the same itemset does have an anti-monotone property

(If a customer bought three of the books, it is likely they will buy the fourth book; if they only bought one book, it is unlikely that they will buy the three other books from the frequent itemset.)

28
Q

How can you efficiently prune candidate rules?

A
  • By using the anti-monotone property of confidence we know that moving items from the left to the right side cannot increase the confidence
  • If a rule with many items on the left side and one item on the right side is a low-confidence rule, you can prune all rules that are sub-rules of that rule
29
Q

How can you generate candidate rules within the apriori algorithm?

A

1) Merge two rules that share the same prefix in the rule consequent (CD->AB and BD-> AC results in D->ABC)
2) Prune rule D->ABC if one of its parent rules does not have high confidence (e.g. AD->BC)

Possible because:

  • all information required for the confidence computation has already been created during itemset generation
  • Thus, there is no need to scan the transaction data T anymore
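A simplified sketch of rule generation from a single frequent itemset that reuses the support counts from itemset generation; the `support_counts` numbers and the threshold are invented, and the level-wise merging of consequents described above is omitted for brevity (every non-empty subset is simply tried as the condition):

```python
from itertools import combinations

# Support counts as they would come out of frequent itemset generation (invented numbers)
support_counts = {
    frozenset("ABCD"): 10,
    frozenset("ABC"): 12, frozenset("ABD"): 14, frozenset("ACD"): 13, frozenset("BCD"): 11,
    frozenset("AB"): 20, frozenset("AC"): 18, frozenset("AD"): 16,
    frozenset("BC"): 15, frozenset("BD"): 19, frozenset("CD"): 17,
    frozenset("A"): 30, frozenset("B"): 28, frozenset("C"): 25, frozenset("D"): 26,
}

def rules_from_itemset(itemset, minconf):
    """All rules X -> (itemset - X) whose confidence is >= minconf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(len(itemset) - 1, 0, -1):        # shrink the condition side
        for condition in map(frozenset, combinations(itemset, size)):
            conf = support_counts[itemset] / support_counts[condition]
            if conf >= minconf:
                rules.append((set(condition), set(itemset - condition), conf))
    return rules

for cond, cons, conf in rules_from_itemset("ABCD", minconf=0.6):
    print(cond, "->", cons, round(conf, 2))
```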
30
Q

How can you handle categorical attributes in association analysis?

A
  • Transform categorical attributes into asymmetric binary variables (one new "item" for each distinct attribute=value pair, e.g. Color=green, Color=blue, ...)
31
Q

What is the issue with the transformation of categorical attributes into asymmetric binary variables?

A
  • If an attribute has many possible values, many of them will have very low support
    (potential solution: aggregate low-support attribute values)
  • If the distribution of attribute values is highly skewed, drop the highly frequent item (e.g. 95% of visitors have Buy = No, so most items would be associated with Buy = No)
32
Q

How can you handle continuous attributes for association analysis?

A
  • Transform continuous attributes into binary variables with discretization (equal-width / equal-frequency binning)
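A short pandas sketch of both transformations from the last few cards; the column names, data, and bin counts are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["green", "blue", "green", "red"],
    "age":   [23, 45, 31, 62],
})

# Categorical attribute -> one asymmetric binary "item" per attribute=value pair
color_items = pd.get_dummies(df["color"], prefix="color")   # color_green, color_blue, ...

# Continuous attribute -> discretize, then binarize the resulting intervals
age_bins_width = pd.cut(df["age"], bins=3)   # equal-width binning
age_bins_freq = pd.qcut(df["age"], q=2)      # equal-frequency binning
age_items = pd.get_dummies(age_bins_width, prefix="age")

basket = pd.concat([color_items, age_items], axis=1)
print(basket)
```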
33
Q

What is the issue with the discretization of continuous attributes?

A
  • Size of the discretization intervals affects support & confidence
  • If intervals are too small -> may not have enough support
  • If intervals are too large -> rules may not have enough confidence
34
Q

Why do you need further interestingness measures to improve the results of association rule mining?

A
  • Association rule algorithms tend to produce too many rules (many are uninteresting or redundant)
  • Interestingness depends on application area
  • Interestingness measures can be used to prune or rank rules

(Originally, support and confidence were the only interestingness measures.)

35
Q

In which case can confidence be misleading when judging rules?

A
  • The confidence of a rule can be high
  • But the support of the consequent Y can be even higher
  • -> The rule is then misleading because the condition X actually reduces the probability of Y

We want c(X->Y) > s(Y)

36
Q

What is the lift measure?

A

Lift = c(X->Y) / s(Y)

  • Confidence normalized by support of consequent

Interpretation:

  • Lift > 1: X and Y are positively correlated
  • Lift = 1: X and Y are independent
  • Lift < 1: X and Y are negatively correlated
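A tiny numeric check of these cases, with invented counts (1000 transactions, 600 containing Y, 200 containing X, 150 containing both):

```python
N, n_Y, n_X, n_XY = 1000, 600, 200, 150   # invented counts

confidence = n_XY / n_X        # c(X->Y) = 0.75, looks like a strong rule
support_Y = n_Y / N            # s(Y)    = 0.60
lift = confidence / support_Y  # 1.25 > 1: X and Y really are positively correlated
print(confidence, support_Y, lift)
```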
37
Q

For which task is the human always responsible when doing association analysis?

A
  • Human needs to interpret the results based on the knowledge about the application domain
  • Which patterns are meaningful / surprising?