05. Association Rules Flashcards
Association Rules (Market Basket Analysis) is what
Association Rules is an unsupervised, descriptive method for discovering interesting relationships in data. The discovered relationships can be represented as rules or as frequent itemsets. It is commonly used for mining transaction databases.
The Association Rules approach (if “X” is observed, then “Y” has a high probability of being observed) can be applied to which questions
Which products tend to be purchased together?
Of those customers who are similar to this person, what products do they tend to buy?
Of those customers who have purchased this product, what other similar products do they tend to view or purchase?
In the rule “when item X is observed, then item Y is also observed”, what are X and Y
X is called antecedent or left-hand-side (LHS)
Y is called consequent or right-hand-side (RHS)
What is the notation and meaning of a k-itemset
In a k-itemset, k refers to the total number of items in that itemset: {item 1, item 2, …, item k}
What is the underpinning idea of the Apriori algorithm
It is a method of “pruning” the otherwise exponential associations by considering the “downward closure property” which is to say that if an itemset is considered frequent, then any subset of the frequent itemset must also be frequent.
What is a frequent itemset
A frequent itemset has items that appear together often enough. The term “often enough” is formally defined with a minimum support criterion. If the minimum support is set at 0.5, any itemset can be considered a frequent itemset if at least 50% of the transactions contain this itemset. In other words, the support of a frequent itemset should be greater than or equal to the minimum support.
What is the Apriori algorithm method
The Apriori algorithm takes a bottom-up iterative approach to uncovering the frequent itemsets by first determining all the possible items (or 1-itemsets, for example {bread}, {eggs}, {milk}, …) and then identifying which among them are frequent. Assuming the minimum support threshold (or the minimum support criterion) is set at 0.5, the algorithm identifies and retains those itemsets that appear in at least 50% of all transactions and discards the itemsets that have a support less than 0.5 (appear in fewer than 50% of the transactions). It then repeats the process with 2-itemsets, 3-itemsets, and so on, until no larger frequent itemsets are found.
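The iterative passes above can be sketched in Python (the document's own example uses R's arules package; this is a toy illustration with hypothetical baskets, not that library):

```python
from itertools import combinations

# Toy transaction database (hypothetical baskets, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "eggs", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread"},
]
min_support = 0.5  # minimum support criterion

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Pass 1: keep only the frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Later passes: join frequent (k-1)-itemsets into k-itemset candidates and
# keep only those meeting the minimum support; the rest are pruned.
all_frequent = list(frequent)
k = 2
while frequent:
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1
```

With these baskets, {bread}, {eggs}, and {milk} survive pass 1, while pass 2 keeps {bread, milk} and {eggs, milk} and prunes {bread, eggs} (support 0.4 < 0.5).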
In Association Rules what is Support
Support (X => Y) =
( Number of transactions with both X and Y ) /
The total number of transactions
Support is an indication of how frequently the itemset appears in the dataset - this is just the probability of that combination appearing!
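The support calculation can be shown with a minimal Python sketch over hypothetical baskets (toy data, not from the document):

```python
# Hypothetical transactions; support({X, Y}) is the fraction of baskets
# that contain both X and Y.
transactions = [
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    # Count transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# {bread, milk} appears in 2 of the 4 transactions.
print(support({"bread", "milk"}))  # 0.5
```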
In Association Rules what is Confidence
Confidence (X => Y) =
( Number of transactions with both X and Y ) /
The total number of transactions containing X
Confidence is an indication of how often the rule has been found to be true
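A small Python sketch of the confidence ratio, using the same hypothetical baskets as above (toy data, not from the document):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """How often transactions containing X also contain Y."""
    return support(X | Y) / support(X)

# bread appears in 3 of 4 baskets, bread+milk in 2 of 4: confidence = 2/3.
print(confidence({"bread"}, {"milk"}))
```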
In Association Rules what is Lift
Lift (X => Y) =
(Support (X and Y)) /
((Support of X)*(Support of Y))
Lift (X => Y) =
P(X,Y) / (P(X)*P(Y))
Lift indicates how much more likely itemset Y is to be picked along with itemset X than by itself, expressed as a ratio
It is a multiplier of the normal chance
(Support of X) * (Support of Y) is the probability of seeing X and Y together if they were entirely independent, i.e. P(X) * P(Y), like independent dice rolls
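A minimal Python sketch of the lift ratio (hypothetical baskets chosen so that bread and butter co-occur more than independence would predict):

```python
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(X, Y):
    """Joint support divided by the support expected under independence."""
    return support(X | Y) / (support(X) * support(Y))

# support(bread) = support(butter) = 0.75, joint support = 0.75,
# so lift = 0.75 / (0.75 * 0.75) = 4/3, i.e. greater than 1.
print(lift({"bread"}, {"butter"}))
```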
In Association Rules what is Leverage
Leverage (X => Y) =
(Support (X and Y)) - ((Support of X)*(Support of Y))
Leverage (X=>Y) =
P(X,Y)-(P(X)*P(Y))
Leverage indicates how much more likely itemset Y is to be picked along with itemset X than by itself, expressed as a difference
(Support of X) * (Support of Y) is the probability of seeing X and Y together if they were entirely independent, i.e. P(X) * P(Y), like independent dice rolls
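The leverage difference can be sketched in Python on the same hypothetical baskets used for lift (toy data, not from the document):

```python
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def leverage(X, Y):
    """Joint support minus the support expected under independence."""
    return support(X | Y) - support(X) * support(Y)

# 0.75 - (0.75 * 0.75) = 0.1875: a positive difference, so the pair
# co-occurs more often than independence would predict.
print(leverage({"bread"}, {"butter"}))
```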
Explain the benefit of knowing Lift
If X occurs independently of Y, then Lift = 1. When two events are independent of each other, no rule can be drawn involving those two events.
If Lift > 1 (greater than 1), the two occurrences are positively dependent on one another, which makes those rules potentially useful for predicting the consequent in future data sets.
If Lift < 1 (less than 1), there is a negative association: purchasing one item reduces the probability of buying the other.
Note that if the lift is zero, the itemsets are mutually exclusive: buying one means not buying the other.
The first iteration of the Apriori algorithm does what
It looks at the support of the itemsets that contain only one item. Since support(X => Y) is (transactions with both X and Y) / (all transactions), and we are looking at X in isolation, the first support calculation is just (transactions with X) / (all transactions), i.e. the percentage of transactions containing X. So if the minimum support is set at 2%, only items that appear in at least 2% of the transactions will be taken to the next level. The rest are “pruned”.
What is the syntax in R for applying the Apriori association algorithm
itemsets = apriori(Groceries, parameter = list(minlen = 1, maxlen = 1, support = 0.02, target = "frequent itemsets"))
What happens in the lead into step two of applying the Apriori association algorithm in R
All of the items that survived the first round are joined into pairs, i.e. if items {1}, {3}, and {7} were considered frequent (had high enough support), the candidate 2-itemsets {1,3}, {1,7}, and {3,7} will now be assessed.
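This pairwise join can be sketched in Python (the survivors below are the hypothetical items from the flashcard, not real data):

```python
from itertools import combinations

# Hypothetical survivors of the first pass (frequent 1-itemsets).
frequent_1 = [{1}, {3}, {7}]

# Join them pairwise into candidate 2-itemsets for the next pass;
# each candidate's support would then be checked against the threshold.
candidates_2 = [a | b for a, b in combinations(frequent_1, 2)]
print(candidates_2)  # [{1, 3}, {1, 7}, {3, 7}]
```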