Weeks 8-9: Association Rules & Temporal Data Mining Flashcards
Association Rule
An expression in if-then format.
The if part is the precondition or antecedent; it consists of a series of attribute tests.
The then part is the conclusion or consequent; it assigns values to one or more attributes.
Association rules may incorporate any attribute, or combination of attributes, as the target. For example: if temperature = cool and humidity = normal then play = yes.
Bottom-up Generate-and-test
A procedure for generating association rules by building candidate item sets from smaller ones (bottom-up) and testing each candidate against the data to check that it occurs frequently enough; rules are then read off the surviving item sets.
Coverage/Support Count
The number of instances covered by the rule.
Support
Proportion of instances covered by the rule.
Confidence/accuracy
Proportion of instances that the rule predicts correctly out of all the instances that satisfy the precondition.
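A minimal sketch of these three measures in Python, using a made-up transaction list and rule; here coverage counts the transactions matching both sides of the rule, in line with the support definition above.

```python
# Made-up transactions; each is a set of items (for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

antecedent = {"bread"}   # left-hand side (precondition)
consequent = {"milk"}    # right-hand side (conclusion)

# Coverage / support count: transactions containing antecedent and consequent.
support_count = sum(1 for t in transactions
                    if antecedent <= t and consequent <= t)

# Support: proportion of all transactions covered by the rule.
support = support_count / len(transactions)

# Confidence: covered transactions over transactions satisfying the precondition.
antecedent_count = sum(1 for t in transactions if antecedent <= t)
confidence = support_count / antecedent_count

print(support_count, support, confidence)   # 3, 0.6, 0.75
```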
Apriori Algorithm
This is an algorithm for generating association rules.
- Find all 1-item sets, calculate their coverage, and discard any 1-item sets whose coverage is below the predefined minimum coverage value.
- Find all 2-item sets by making pairs of 1-item sets and retain only 2-item sets with coverage above the minimum coverage value.
- Similarly, find all 3-item sets, 4-item sets, and so on.
- Repeat until no larger item sets are possible.
- For each item set, produce as many association rules as the number of ways to split the relevant items between the precondition and the conclusion.
- For each rule, evaluate the confidence and retain if the rule meets the minimum requirement.
Note that the algorithm exploits the property that if an item set is not frequent, then none of its supersets can be frequent. Candidate (k+1)-item sets are therefore generated only from frequent k-item sets. The worst-case running time of the algorithm is exponential in the number of attributes.
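A rough pure-Python sketch of the level-wise item-set generation described above; the transactions and the minimum count are invented, and turning the surviving item sets into rules is omitted for brevity.

```python
def apriori_itemsets(transactions, min_count=2):
    """Level-wise (bottom-up) generation of frequent item sets."""
    # Frequent 1-item sets: start from all individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while current:
        # Count the coverage of each candidate and keep those above the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Build (k+1)-item candidates only from frequent k-item sets
        # (if an item set is infrequent, so are all of its supersets).
        current = {a | b for a in level for b in level if len(a | b) == k + 1}
        k += 1
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, count in apriori_itemsets(transactions, min_count=3).items():
    print(set(itemset), count)
```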
FP-Growth Algorithm
- Count each item frequency and sort in descending order.
- Perform a second pass of the data to create the FP-tree.
- Recursively process the tree to identify frequent item sets.
The tree structure allows frequent item sets, and hence association rules, to be extracted from a dataset more efficiently. It usually has a better run time than the Apriori algorithm.
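A brief sketch assuming the mlxtend library is available; the transactions are invented and the exact API may differ slightly between library versions.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Made-up transactions for illustration.
transactions = [["bread", "milk"], ["bread", "butter", "milk"],
                ["bread", "butter"], ["milk", "butter"],
                ["bread", "milk", "butter"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent item sets via the FP-tree, then rules filtered by confidence.
itemsets = fpgrowth(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```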
Header Table
It is kept alongside the FP-tree, usually drawn on its left-hand side. It lists the frequencies of the individual items that satisfy the minimum coverage threshold, sorted in descending order.
Lift
This metric measures the usefulness of a rule by comparing it with random guessing. It compares the confidence of the rule to the confidence of the null rule, which has the same right-hand side but an empty left-hand side.
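For a rule with antecedent A and consequent C this can be written as:
\text{lift}(A \Rightarrow C) = \frac{\text{confidence}(A \Rightarrow C)}{\text{support}(C)} = \frac{\text{support}(A \cup C)}{\text{support}(A) \times \text{support}(C)}
A lift greater than 1 indicates a positive association; a lift of 1 means the rule is no better than the null rule (the antecedent and consequent are independent).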
Contingency Table
The table has two dimensions:
- Does the antecedent of the association rule appear in the transaction? (absent or present)
- Does the consequent of the association rule appear in the transaction? (absent or present)
The number of transactions is equal to the sum of the four resulting cells. If the cell where both the antecedent and the consequent are present is clearly larger than the other cells, this is a good indication that the association rule is a good one.
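A small sketch that tallies the four cells from a made-up transaction list and rule.

```python
# Made-up transactions and rule, for illustration only.
transactions = [{"bread", "milk"}, {"bread"}, {"milk"},
                {"bread", "milk"}, {"butter"}]
antecedent, consequent = {"bread"}, {"milk"}

# 2x2 contingency table: (antecedent present?, consequent present?) -> count.
table = {(a, c): 0 for a in (True, False) for c in (True, False)}
for t in transactions:
    table[(antecedent <= t, consequent <= t)] += 1

print(table)                # the four cells sum to len(transactions)
print(table[(True, True)])  # both present: evidence in favour of the rule
```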
Sequential Analysis/Sequential Pattern Analysis
This method adds the element of time to association analysis. It considers sequences of items, not just associations, so the order of the items in time matters.
As such, the support of a pattern also depends on the order of its items. The Generalised Sequential Pattern (GSP) algorithm, which is based on the Apriori algorithm, is a suitable algorithm for sequence mining.
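GSP itself is more involved, but its core primitive is testing whether a candidate sequence is contained, in the right order, in a data sequence. A minimal sketch of that test and of sequence support, with invented sequences:

```python
def contains(sequence, candidate):
    """True if the candidate occurs in the sequence in the same order.
    Each element of a sequence is a set of items occurring together."""
    i = 0
    for itemset in sequence:
        if i < len(candidate) and candidate[i] <= itemset:
            i += 1
    return i == len(candidate)

# Made-up customer sequences: lists of item sets ordered in time.
sequences = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "d"}, {"c"}, {"b"}],
    [{"b"}, {"a"}, {"c"}],
]

candidate = [{"a"}, {"b"}]   # "a" followed later by "b"
support = sum(contains(s, candidate) for s in sequences) / len(sequences)
print(support)  # 2/3: the third sequence has both items but not in this order
```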
Link Analysis
An approach for analysing and exploiting relationships and connections between items. It uses graphs, with items as nodes and relationships as edges, and can be applied in both supervised and unsupervised settings.
Graphs can be directed or undirected, a multigraph (when multiple edges between the same pair of nodes are allowed), and connected (if there is a path between every pair of nodes).
Degree Centrality
This metric evaluates the relative importance of a node to the graph by its degree, i.e. the number of edges attached to it, usually normalised by the maximum possible degree, n - 1.
Closeness Centrality
This metric measures how close a node is to all the other nodes in the graph, typically computed from the reciprocal of the sum (or the average) of its shortest-path distances to every other node.
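A brief sketch of both centrality measures, assuming the networkx library and a made-up edge list.

```python
import networkx as nx

# Made-up undirected graph; nodes are items, edges are relationships.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")])

# Degree centrality: a node's degree normalised by (n - 1).
print(nx.degree_centrality(G))

# Closeness centrality: (n - 1) divided by the sum of the shortest-path
# distances from the node to all other nodes.
print(nx.closeness_centrality(G))
```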
Time Series
An ordered set of values obtained from measurements over time. A dataset may contain a single time series or multiple time series.
T = (x_1,…,x_n)
x_t = time series value at time t
Decomposition
Traditional approaches decompose a time series into additive components:
x_t = T_t + C_t + S_t + \epsilon_t
T_t: Trend, variations of low frequency
C_t: Cycle, periodic variations
S_t: Seasonality, regular, predictable changes
\epsilon_t: Error, captures the uncertainty of the model
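A short sketch of an additive decomposition using statsmodels, assuming the library is available; the series is synthetic and the cycle is absorbed into the trend component here.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: trend + seasonality + noise, for illustration.
t = np.arange(120)
values = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 1, 120)
series = pd.Series(values,
                   index=pd.date_range("2015-01-01", periods=120, freq="MS"))

# Additive model: x_t = T_t + S_t + eps_t (no separate cycle component).
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # T_t
print(result.seasonal.head())           # S_t
print(result.resid.dropna().head())     # eps_t
```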