2. Association Rule Mining Flashcards
What is association rule mining?
Given a set of transactions, it’s the task of finding rules that predict the occurrence of an item based on the occurrences of other items in the transaction.
Example of a rule:
{Milk, Bread} –> {Beer}
–> means “implies”
This rule is taken from a list of 5 transactions that contain milk, bread, beer, and other items.
What is an itemset?
A collection of one or more items
Ex: {Milk, Bread, Diaper}
What is the support count of an itemset?
The number of transactions that contain that itemset.
What is the support of an itemset?
The fraction of transactions that contain an itemset.
IOW: # transactions containing itemset / total # transactions
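To make both definitions concrete, here is a minimal Python sketch (not part of the original cards) over an assumed five-transaction toy basket like the one alluded to above; the transaction contents are illustrative:

```python
# Support count and support over an assumed five-transaction basket.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # Number of transactions containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # Fraction of transactions containing the itemset.
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread"}, transactions))  # 3
print(support({"Milk", "Bread"}, transactions))        # 0.6
```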
What is a frequent itemset?
An itemset whose support is greater than or equal to some specified minsup (minimum support) threshold
What is an association rule?
An implication expression of the form X –> Y, where X and Y are itemsets.
Example: {Milk, Diaper} –> {Beer}
What is the confidence of an association rule?
For the rule X–>Y:
It measures how often items in Y appear in transactions that contain X
Ex: {Milk, Diaper} –> {Beer}
c = #(Milk, Diaper, Beer) / #(Milk, Diaper)
What is the support of an association rule?
For the rule X–>Y:
It is the fraction of transactions that contain both X and Y
Ex: {Milk, Diaper} –> {Beer}
s = #(Milk, Diaper, Beer) / #Transactions
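Here’s a short sketch computing both rule metrics for the example rule {Milk, Diaper} –> {Beer}, again over an assumed toy transaction list:

```python
# Support and confidence of the rule {Milk, Diaper} -> {Beer},
# using the same assumed toy transactions as above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
X, Y = {"Milk", "Diaper"}, {"Beer"}

count_X  = sum(1 for t in transactions if X <= t)      # 3 transactions contain X
count_XY = sum(1 for t in transactions if X | Y <= t)  # 2 contain both X and Y

s = count_XY / len(transactions)  # support    = 2/5 = 0.4
c = count_XY / count_X            # confidence = 2/3 ≈ 0.67
```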
What are the two main tasks of association rule mining?
Given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minSup threshold
- confidence >= minConf threshold
Describe the brute-force approach to association rule mining
- List out all possible association rules (every combination of items)
- Compute the support and confidence for each rule
- Prune (remove) rules that fail to meet the minSup and minConf thresholds
*This takes WAY too long to compute for large datasets*
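A minimal sketch of the brute-force approach on the same assumed toy data; it enumerates every binary partition of every itemset, which is exactly why it blows up exponentially:

```python
from itertools import combinations

# Brute force: enumerate every rule X -> Y from every itemset and keep
# the ones that clear both thresholds. Exponential in the number of
# items -- illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup, minconf = 0.4, 0.6
items = sorted(set().union(*transactions))

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

rules = []
for k in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, k)):
        s = sup(itemset)  # identical for every rule from this itemset
        if s < minsup:
            continue
        for j in range(1, k):  # every binary partition X -> (itemset - X)
            for X in map(frozenset, combinations(itemset, j)):
                c = s / sup(X)
                if c >= minconf:
                    rules.append((set(X), set(itemset - X), s, c))
```

Note that sup(itemset) is computed once and shared by every rule split from that itemset, which previews the observation in the next card.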
What is interesting about the support of rules that derive from the same itemset?
(Ex: A –> BC, B –> AC, …)
Rules originating from the same itemset have identical support (though they can have different confidence)
This allows us to reduce the number of times we have to calculate support.
What is the two-step approach to mining association rules?
- Generate all itemsets whose support is >= minSup (frequent itemsets)
- Generate high confidence rules from each frequent itemset, where each rule is a binary partition of a frequent itemset
*Note: frequent itemset generation is still computationally very expensive*
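A sketch of the two-step approach under the same toy-data assumptions, with Apriori-style level-wise generation in step 1 (the pruning it relies on is explained in the next two cards) and binary-partition rule generation in step 2:

```python
from itertools import combinations

# Step 1 grows frequent itemsets level by level (a candidate survives
# only if all of its k-subsets are frequent); step 2 splits each
# frequent itemset into every binary partition and keeps the
# high-confidence rules.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup, minconf = 0.4, 0.6

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 1: frequent itemset generation.
frequent = {}
level = [frozenset([i]) for i in set().union(*transactions)]
k = 1
while level:
    level = [c for c in level if sup(c) >= minsup]
    frequent.update({c: sup(c) for c in level})
    # Candidate (k+1)-itemsets whose k-subsets are all frequent.
    level = list({
        a | b for a in level for b in level
        if len(a | b) == k + 1
        and all(frozenset(sub) in frequent for sub in combinations(a | b, k))
    })
    k += 1

# Step 2: rule generation from binary partitions of frequent itemsets.
rules = []
for itemset, s in frequent.items():
    for j in range(1, len(itemset)):
        for X in map(frozenset, combinations(itemset, j)):
            if s / frequent[X] >= minconf:  # subsets of a frequent set are frequent
                rules.append((set(X), set(itemset - X), s, s / frequent[X]))
```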
If a set AB is infrequent, what other information do we know (a priori)?
We know that any superset of AB (i.e., any set that contains AB, such as ABCDEFG) is also infrequent: every transaction that contains the superset also contains AB, so the superset’s support can never exceed AB’s, which is already below minsup.
Describe the strategy of reducing the number of candidates (M) to generate frequent itemsets.
A complete search would have to check all M = 2^d candidate itemsets (where d is the number of items).
Pruning techniques (such as the Apriori principle above) eliminate candidates without ever counting their support, reducing M.
Describe the strategy of reducing the number of comparisons in generating frequent itemsets.
Use efficient data structures to store the candidates or transactions to minimize the number of comparisons.
There is no need to match every candidate against every transaction.
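One common trick, sketched below with an assumed candidate list on the same toy data: store the candidate k-itemsets in a hash set (a simple stand-in for the classic hash tree) and enumerate each transaction’s k-subsets, so each subset costs one O(1) lookup instead of a scan over every candidate:

```python
from collections import Counter
from itertools import combinations

# Count candidate supports with fewer comparisons: hash the candidates,
# then probe with each transaction's k-subsets.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
k = 2
candidates = {frozenset(c) for c in [("Milk", "Diaper"), ("Milk", "Beer"), ("Diaper", "Beer")]}

counts = Counter()
for t in transactions:
    for subset in map(frozenset, combinations(sorted(t), k)):
        if subset in candidates:  # O(1) hash lookup, not a pass over all candidates
            counts[subset] += 1

print(counts[frozenset({"Milk", "Diaper"})])  # 3
```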