Part 2: BI association rules Flashcards
Objective of association
The objective is to find interesting associations (relationships) between attributes in a data set. There is no classification, because there is no class variable; this is why association belongs to unsupervised learning.
Rules of association
Antecedent -> consequent
LHS -> RHS
#antecedent (or #LHS) = the number of records in the database that match the antecedent.
Indices for rules
- Support (Coverage) = #(LHS and RHS) / #DB = #(antecedent and consequent) / #DB
- Accuracy (Confidence) = #(LHS and RHS) / #LHS = #(antecedent and consequent) / #antecedent
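A minimal Python sketch computing both indices for one rule over a toy transaction database (the transactions and item names are invented for illustration):

```python
# Toy database: one set of items per record.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

lhs = {"bread"}   # antecedent
rhs = {"milk"}    # consequent

n_db = len(transactions)                              # #DB
n_lhs = sum(lhs <= t for t in transactions)           # #LHS
n_both = sum((lhs | rhs) <= t for t in transactions)  # #(LHS and RHS)

support = n_both / n_db      # Support (Coverage) = 3/5 = 0.60
confidence = n_both / n_lhs  # Accuracy (Confidence) = 3/4 = 0.75
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```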
Frequent itemsets
- Item: one attribute - value pair
- Itemset: all items occurring in a transaction or record
- Frequent itemset: an itemset with minimal support k, i.e., an itemset whose support is at least a threshold k predefined by the user.
Itemsets association rules
- Association rule: IF - THEN format.
+ LHS, RHS: one item (attribute-value pair) or a conjunction of items.
- One itemset -> many association rules.
Apriori property
e.g. itemset (A, B, C)
support(A, B, C) ≥ k -> support(A, B) ≥ k, support(B, C) ≥ k, etc. for all subsets.
Note: the opposite may not be true. The property holds because every record containing {A, B, C} also contains each of its subsets, so a subset's support count can only be equal or larger.
N-itemsets with minimal support
- Find all 1-itemsets with minimum support.
- Store them in file1.
- Compute all 2-itemsets by combining 1-itemsets.
- Store 2-itemsets with minimum support in file2.
- Compute all 3-itemsets by combining 2-itemsets.
- Store 3-itemsets with minimum support in file3.
- etc. (see the sketch below).
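A minimal Python sketch of this level-wise procedure, holding each level in memory rather than in file1, file2, ... (the transaction format is an assumption for illustration):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support_count):
    """Level-wise search for all itemsets with support >= min_support_count."""
    items = {i for t in transactions for i in t}
    # Level 1: frequent 1-itemsets (the in-memory "file1").
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_support_count}
    all_frequent = set(level)
    while level:
        # Combine n-itemsets pairwise into candidate (n+1)-itemsets.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Apriori pruning: every n-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, len(c) - 1))}
        # Keep candidates meeting the threshold ("file2", "file3", ...).
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support_count}
        all_frequent |= level
    return all_frequent

txns = [{"bread", "milk"}, {"bread", "butter"},
        {"bread", "milk", "butter"}, {"milk"}, {"bread", "milk"}]
print(frequent_itemsets(txns, 3))
# {frozenset({'bread'}), frozenset({'milk'}), frozenset({'bread', 'milk'})}
```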
Finding association rules
- A typical question: "find all association rules with support ≥ s and confidence ≥ c."
Note: the "support" of an association rule is the support of the set of items it mentions.
- Hard part: finding the high-support (frequent) itemsets.
+ Checking the confidence of association rules involving those sets is relatively easy (see the sketch below).
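A sketch of that easy part: given one frequent itemset, enumerate every LHS -> RHS split and keep the rules whose confidence meets the threshold (illustrative code, not a particular library's API):

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_confidence):
    itemset = frozenset(itemset)
    # The support of every rule built from this itemset is the
    # support of the itemset itself.
    n_itemset = sum(itemset <= t for t in transactions)
    rules = []
    for r in range(1, len(itemset)):  # every non-empty proper subset as LHS
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            confidence = n_itemset / sum(lhs <= t for t in transactions)
            if confidence >= min_confidence:
                rules.append((set(lhs), set(rhs), confidence))
    return rules

txns = [{"bread", "milk"}, {"bread", "butter"},
        {"bread", "milk", "butter"}, {"milk"}, {"bread", "milk"}]
for lhs, rhs, conf in rules_from_itemset({"bread", "milk"}, txns, 0.7):
    print(lhs, "->", rhs, f"(confidence {conf:.2f})")  # both splits: 3/4 = 0.75
```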
Apriori algorithm
- Definition = algorithm for finding association rules.
- Description =
Step 1: find all frequent itemsets with minimal support k.
Step 2: from all frequent itemsets found in step 1, find the association rules with minimal accuracy m.
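For reference, the same two steps are available off the shelf in the mlxtend library (an assumption here, not part of the course material; exact signatures may vary between versions). It expects the database one-hot encoded, one boolean column per item:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk"], ["bread", "milk"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Step 1: frequent itemsets with minimal support k (here k = 0.4).
frequent = apriori(df, min_support=0.4, use_colnames=True)
# Step 2: rules with minimal accuracy (confidence) m (here m = 0.7).
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```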
Rule interestingness measures
- Objective measures:
+ support
+ confidence
+ lift
- Subjective measures: a rule (pattern) is interesting if
+ it is unexpected (surprising to the user)
+ it is actionable (the user can do something with it)
Benchmark confidence
- Confidence = #(antecedent and consequent) / #antecedent.
- Assume the antecedent and consequent are independent.
- Then: confidence = PriorProb(consequent), since under independence P(consequent | antecedent) = P(consequent).
Note: the probability of an event can be estimated by the fraction of records in the database in which the event occurs. E.g., if the consequent occurs in 40 of 100 records, the benchmark confidence is 0.40, whatever the antecedent.
Lift measure
Tells us how strong the relation is between the antecedent and the consequent.
Rule: LHS -> RHS or antecedent -> consequent
Lift = Confidence / Prob(RHS) = Prob(LHS and RHS) / (Prob(LHS) * Prob(RHS))
Prob(RHS) = benchmark confidence
We assume that fractions in database are good approximations for probability.
- Lift = 0 -> means that fr(LHS and RHS) = 0.
- Lift = 1 -> means that LHS and RHS are independent.
- Lift >> 1 -> most interesting rule; LHS is a strong indicator for RHS. Sometimes Lift << 1 is also interesting.
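A minimal sketch estimating lift from database fractions, reusing the toy transactions from above:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
lhs, rhs = {"bread"}, {"milk"}
n = len(transactions)

p_lhs = sum(lhs <= t for t in transactions) / n           # Prob(LHS) = 0.8
p_rhs = sum(rhs <= t for t in transactions) / n           # Prob(RHS) = 0.8 (benchmark confidence)
p_both = sum((lhs | rhs) <= t for t in transactions) / n  # Prob(LHS and RHS) = 0.6

lift = p_both / (p_lhs * p_rhs)
print(f"lift = {lift:.3f}")  # 0.6 / 0.64 = 0.938, just under 1: near-independent
```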
Summary
- Association belongs to unsupervised learning.
- Association rules vs. classification (decision) rules:
+ classification rules predict only one attribute, called the class, whereas association rules find associations between any attributes without distinction.
+ the RHS of an association rule may contain a conjunction of attribute-value pairs, whereas the RHS of a classification rule contains only the class value.
+ association rules are not intended to be used together as a set, whereas classification rules are.