Chapter 3 Flashcards
What is association rule mining?
An unsupervised data mining technique that finds interesting associations and/or correlation relationships among a large set of data items.
What do association rules show?
Attribute-value conditions that occur frequently together in a given dataset
What is a use of association rule mining?
Market basket analysis
What is a transaction?
A single vist
Each transaction is associated with a purchase date and items purchased.
What does basket data refer to?
Transaction data
What does the market-basket problem assume?
We have some large number of items (eg all items available in store). Customers fill their baskets with some subset of the items and we get to know what items people buy together, even if we don’t know who they are.
What can retail organisations gain by analysing basket data?
They can extract information to drive their market strategy. They would would want to put effort into items that are frequently purchased.
- Targeted marketing
- Plan store layouts - use the information to position items and control the way a typical customer traverses the store. May place commonly purchased together items in close proximity to encourage the sale of such items together, or may place them on opposite ends of the store to entice customers to pick up other items on the way.
- Help retailers plan which items to put on sale at reduced prices
What is a boolean vector in market basket analysis?
Each basket can be replaced by a Boolean vector if we considered that each item has a Boolean variable representing the presence or absence of the item.
The boolean vector can be analysed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules.
Aside from the marketing application, where else can Association rules be used?
- Baskets = documents, items = words
Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering, predictive text etc. - Baskets = documents, items = sentences
Two documents with many of the same sentences could represent plagiarism or mirror sites on the web
What form do association rules take?
Association rules are statements of the form
{X1, X2, …, Xn} => Y
Meaning if we find all of X1, X2, …, Xn in the market basket, then we have. good chance of finding Y.
What is the confidence of the association rule and how can it be determined?
The strength of implication in the rule
Determine the probability of finding Y given X1, X2, …, Xn occur.
Which rules are we interested in?
Rules with a confidence above a certain threshold.
We may also ask that the confidence be significantly higher than it would be if items were placed at random into baskets.
eg we may find a rule like {milk, butter} => bread - simply because a lot of people buy bread
What are the properties of market-basket analysis?
- Association rules
- Causality
- Frequent itemsets
Why do we want to consider causality?
Correlation does not equal causation. From the marketing strategy perspective, it is important to understand where the causation is coming from.
We want to know that in an association rule, the presence of X1, X2, …, Xn actually “causes” Y to be bought. eg diapers and beer
Why do we want to consider frequent itemsets?
In most situations, we only care about association rules or causalities involving sets of items that appear frequently in baskets.
We can’t run a good marketing strategy involving items that no one buys.
Data mining starts with the assumption that we only care about sets of items with high support ie they appear together in many baskets. Sets of items must appear in at least a certain percent of the baskets called the support support threshold.
What is the support threshold?
A certain percentage of baskets that an item must appear in to be considered.
What is the symbolised version of “transactions can be considered to be a subset of the set of all possible items”?
T ⊆ I
What is the formula for the support s?
[See flashcard]
What is the formula for the confidence c?
[See flashcard]
What does a higher confidence of the association rule A => B mean?
The greater the probability that if a customer buys product A, they will also buy product B.
What way do we write support and confidence values?
To occur between 0% and 100% (rather than 0 and 1)
What are confidence and support values compared to when association rules are created?
Confidence and support thresholds
What is an itemset?
A set of items
What is a k-itemset?
An items containing k items
What is the occurrence frequency or support count of an items?
The number of transactions that contain the itemset
What is a frequent items?
A set of items that appears in at least fraction s of the baskets, where s, the support, is some pre-defied chose constant.
What does Lk denote?
The set of frequent k-itemsets
What can we reduce the problem of mining association rules to?
That of mining frequent itemsets.
What is the two-step process of association rule mining?
- Find all frequent itemsets
Each of these itemsets will occur at least as frequently as a predetermined support count, min_sup - Generate strong association rules from the frequent itemsets
If all association rules were generated from frequent itemsets, only those above a minimum level of confidence are retained as strong association rules.
What is a strong association rule?
One that satisfies minimum support and minimum confidence.
Which of the two steps is less costly?
Generating strong association rules is much less costly than finding all frequent itemsets. Hence, the overall performance of mining association rules is determined by the first step.
How can frequent pattern mining can be classified?
There are three criteria
- Based on levels of abstraction
- Based on the number of data dimensions involved in the rule
- Based on the type of values handled in the rule
What are rules at different levels of abstraction?
The items bought are referenced at different levels of abstraction (eg computer is a higher-level abstraction than laptop).
We refer to the rule set mined as consisting of multi-level association rules.
What are rules based on the number of data dimensions involved in the rule?
If the items or variables in an association rule reference only one dimension, then it is a single dimensional association rule.
A rule with dimensions such as age and income, is a multi-dimensional association rule.
What are rules based on the types of values handled in the rule?
If a rule involves association between presence or absence items, it is a Boolean association rule.
If the rule involves associations between quantitative items of variables, these are discretised and the rule is referred to as a quantitative association rule.
Considering mining single dimensional Boolean itemsets, what is the first step in the frequent items mining process?
Finding all frequent itemsets
What is the Naive algorithm?
In this algorithm, one considers all possible subsets of I (items in the shop) and in each case, the support is calculated. Only subsets above the minimum support threshold are considered to be frequent itemsets.
What is the issue with the Naive algorithm?
This algorithm requires the evaluation of all subsets of the items I - a large number. There are 2^m subsets of I and to calculate the support of each subset, there are (2^m * n) order of operations - the computational effort grows exponentially with the number of items m.
What algorithm is used to reduce the computational effort seen in the naive algorithm?
The Apriori algorithm which utilises the Priori property.
What does the Apriori property state?
That if a (k-1) items A is not a frequent items, then any superset of A, B (ie A is a subset of B), will also not be a frequent itemset.
It also indicates that all non-empty subsets of a frequent items must also be frequent.
What kind of approach does the Apriori algorithm employ?
An iterative approach, known as level-wise search, where frequent (k-1)-itemsets are used to obtain potential (or candidate) frequent (k)-itemsets.
Briefly describe the apriori algorithm
- Given a specified support threshold s, the first pass finds the items that appear in at least fraction s of the baskets. This is all frequent 1-itemsets called L1
- Pairs of items in L1 become the candidate pairs C2 for the second pass. The pairs in C2 whose count reaches s are the frequent pairs, L2. This is all frequent 2-itemsets called L2. Candidate pairs who do not meet the minimum support requirement are pruned.
- The items in L2 are then used to create C3, which then form L3 etc.
- This iterative process continues until no more frequent itemsets can be found
What is the general action of the Apriori method?
L(k-1) is used to define Lk for k >= 2
What are the two steps of the Apriori method?
Join and Prune action
Define the join step of the apriori method
- To find Lk of a set of candidate k-itemsets is generated by joining Lk-1 with itself.
- This set of candidates is denoted Ck
- Members of Lk-1 are joinable if their first (k-2) items are in common
Define the prune step of the prior method
- Ck is a superset of Lk - its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck
- A scan of the database to determine the count of each candidate in Ck results in the determination of Lk
- We can compare members of Ck to those in Ck-1 to determine Lk by removing members straight away - if all items within an infrequent member Ck-1 are also within a member of Ck, then that member of Ck can be eliminated straight away from Lk
When does the join/prune cycle stop?
When Lk = Null
What can you do once the frequent itemsets from transactions in a database have been found?
Generate strong association rules from them
Strong association rules satisfy minimum support and minimum confidence.
What are the equations associated with generating association rules from frequent itemsets?
[See flashcard]
Why does each rule automatically satisfy minimum support?
The rules are generated from frequent itesmets
What are the advantages of association rules?
- Algorithm is very scalable ie capable of working with large amounts of transactional data
- Result sin rules are very easy to understand
- Useful for data mining and discovering unexpected knowledge in databases
What are the disadvantages of association rules?
- Not very helpful for small datasets
- Requires effort to separate the true insight from common sense
- Easy to draw spurious conclusions from random partners