Terms Flashcards
Girvan-Newman
Detect communities by repeatedly removing the edge with the highest betweenness; betweenness is calculated with the leaf / parent credit-scoring scheme
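A minimal Python sketch of the leaf / parent credit scoring for a single BFS root (the function name and the adjacency-dict representation are my own; the real betweenness sums these credits over all roots and, for an undirected graph, halves the totals):

```python
from collections import deque

def edge_credit_from(root, adj):
    # Phase 1: BFS to find each node's level and shortest-path count.
    level = {root: 0}
    paths = {root: 1}          # number of shortest paths from root
    order = [root]
    q = deque([root])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in level:
                level[w] = level[v] + 1
                paths[w] = 0
                order.append(w)
                q.append(w)
            if level[w] == level[v] + 1:
                paths[w] += paths[v]
    # Phase 2: walk back up from the leaves. Every node gets 1 credit
    # plus the credit of the edges below it, split among its parents
    # in proportion to their shortest-path counts.
    node_credit = {v: 1.0 for v in order}
    edge_credit = {}
    for w in reversed(order):
        for v in adj[w]:
            if level.get(v) == level[w] - 1:     # v is a parent of w
                c = node_credit[w] * paths[v] / paths[w]
                edge_credit[frozenset((v, w))] = c
                node_credit[v] += c
    return edge_credit
```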
Betweenness
The number of shortest paths between pairs of nodes that run over the edge
Support (itemsets)
Number of baskets for which an itemset is a subset
Confidence (if I then j)
Support for I and j together divided by the support for I (e.g. if diapers then beer). If it is 1, then j appears in every basket that contains I
Interest (If I then j)
How interesting is a rule? "If diapers then beer" is not interesting if beer is in every basket. Confidence of I to j minus the fraction of baskets that contain j
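A toy Python illustration of support, confidence, and interest (the baskets are made up for the example):

```python
baskets = [
    {'diapers', 'beer', 'milk'},
    {'diapers', 'beer'},
    {'diapers', 'bread'},
    {'beer', 'bread'},
]

def support(itemset):
    # Number of baskets for which itemset is a subset.
    return sum(1 for b in baskets if itemset <= b)

# Confidence of the rule {diapers} -> beer: support of both / support of I.
conf = support({'diapers', 'beer'}) / support({'diapers'})    # 2/3
# Interest: confidence minus the fraction of baskets that contain beer.
interest = conf - support({'beer'}) / len(baskets)            # 2/3 - 3/4
```

Here beer occurs in most baskets anyway, so the interest is close to zero (slightly negative): the rule tells us little.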
Support threshold
The minimum number of baskets s an itemset must appear in to count as frequent
What is a triangular matrix used for?
Space-efficient way to store pair counts in a one-dimensional array: since the count of pair (2,1) equals the count of pair (1,2), only one slot per unordered pair is needed
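A minimal Python sketch of the one-dimensional triangular-matrix layout (0-based item numbers are my assumption; the function name is made up):

```python
def pair_index(i, j, n):
    # Map unordered pair {i, j} of items 0..n-1 to a slot in an
    # array of length n*(n-1)/2. (2,1) and (1,2) share a slot.
    if i > j:
        i, j = j, i
    return i * (n - 1) - i * (i - 1) // 2 + (j - i - 1)

n = 4
counts = [0] * (n * (n - 1) // 2)
counts[pair_index(2, 1, n)] += 1
counts[pair_index(1, 2, n)] += 1   # same slot: count is now 2
```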
What does the A-priori algorithm do?
Find frequent k-itemsets that appear in at least s baskets. First count single items and filter out the infrequent ones. Then construct candidate (k + 1)-itemsets from the frequent k-itemsets, count them, filter, and repeat until the desired k is reached.
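A minimal in-memory Python sketch of A-priori (the function name and signature are my own; the real algorithm streams each pass over the baskets):

```python
from itertools import combinations

def apriori(baskets, s, max_k):
    # freq[k] holds the frequent k-itemsets as frozensets.
    freq = [set()]
    # Pass 1: count single items.
    counts = {}
    for b in baskets:
        for item in b:
            counts[item] = counts.get(item, 0) + 1
    freq.append({frozenset([i]) for i, c in counts.items() if c >= s})
    # Passes 2..max_k: candidates built from frequent (k-1)-itemsets.
    for k in range(2, max_k + 1):
        items = set().union(*freq[k - 1]) if freq[k - 1] else set()
        candidates = {frozenset(c)
                      for c in combinations(sorted(items), k)
                      if all(frozenset(sub) in freq[k - 1]
                             for sub in combinations(c, k - 1))}
        counts = {c: 0 for c in candidates}
        for b in baskets:
            bs = set(b)
            for c in candidates:
                if c <= bs:
                    counts[c] += 1
        freq.append({c for c, n in counts.items() if n >= s})
    return freq
```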
What does the PCY algorithm do?
Like A-priori, but we reduce the number of pairs that need to be counted: during pass 1 (while counting single items) we also hash every pair to a bucket and count bucket totals. When constructing a candidate pair, only make it a candidate if both its items are frequent and the pair hashes to a 'frequent bucket'.
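A minimal Python sketch of PCY's two passes for pairs (the function names, the bucket count, and the use of Python's built-in `hash` are all assumptions for illustration):

```python
from itertools import combinations

def pcy_pass1(baskets, s, n_buckets=101):
    # Count single items AND hash every pair into a bucket.
    item_counts = {}
    bucket_counts = [0] * n_buckets
    for b in baskets:
        for item in b:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(b), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    frequent_buckets = {k for k, c in enumerate(bucket_counts) if c >= s}
    return frequent_items, frequent_buckets

def pcy_pass2(baskets, s, frequent_items, frequent_buckets, n_buckets=101):
    # A pair is a candidate only if both items are frequent AND it
    # hashes to a frequent bucket.
    counts = {}
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            if (set(pair) <= frequent_items
                    and hash(pair) % n_buckets in frequent_buckets):
                counts[pair] = counts.get(pair, 0) + 1
    return {p for p, c in counts.items() if c >= s}
```

Note that a truly frequent pair always lands in a frequent bucket (the bucket's count is at least the pair's own count), so the filter can never lose a real answer.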
What does the multistage algorithm do?
Like PCY, but with a second hash round. The first hash table admits many false positives (multiple pairs can hash to the same bucket and reach a combined count >= s). In an extra pass we generate each pair again and, iff it is in a frequent bucket of hash 1, hash it into a second table. Then we count candidates as in PCY, requiring both tables to agree.
What does the multihash do?
Like multistage, but doesn't require the extra pass through the baskets. Uses multiple hash tables on the same pass: each pair is hashed into all tables and every matching counter is incremented. A pair only becomes a candidate if it is in a frequent bucket in ALL tables (fewer false positives).
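A minimal Python sketch of Multihash with two tables (the two hash functions and the bucket count are made up; in the real algorithm the available memory is split between the tables, so each has fewer buckets than PCY's single table):

```python
from itertools import combinations

def multihash_pass1(baskets, s, n_buckets=101):
    # One pass: every pair is hashed into BOTH tables.
    h1 = [0] * n_buckets
    h2 = [0] * n_buckets
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            h1[hash(pair) % n_buckets] += 1
            h2[hash(pair[::-1]) % n_buckets] += 1
    freq1 = {k for k, c in enumerate(h1) if c >= s}
    freq2 = {k for k, c in enumerate(h2) if c >= s}
    return freq1, freq2

def is_candidate(pair, freq1, freq2, n_buckets=101):
    # Candidate only if it lands in a frequent bucket in ALL tables.
    return (hash(pair) % n_buckets in freq1
            and hash(pair[::-1]) % n_buckets in freq2)
```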
Random Sampling (itemsets)
If there is too much data, we can pick a sample and run A-priori / PCY on it with a proportionally lowered threshold. We then validate the result against the entire dataset.
SON algorithm (itemsets)
We divide the data into chunks. For each chunk, run A-priori / PCY with a proportionally lowered threshold. Then merge the candidate sets and validate by counting over the whole dataset.
What is the monotonicity property?
An itemset cannot be frequent in the entire dataset unless it is frequent in at least one of the chunks (with the proportionally lowered threshold), so SON's phase 1 produces no false negatives
How does hierarchical clustering work?
Start with every point as its own cluster, then iteratively merge the two clusters with the closest centroids. Stop when K clusters remain.
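A minimal Python sketch of agglomerative clustering on 2-D points with centroid distance (the function name is my own; a real implementation would use a priority queue instead of rescanning all pairs):

```python
def hierarchical(points, K):
    # Every point starts as its own cluster.
    clusters = [[p] for p in points]

    def centroid(c):
        return (sum(x for x, _ in c) / len(c),
                sum(y for _, y in c) / len(c))

    while len(clusters) > K:
        # Find the two clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (xi, yi), (xj, yj) = centroid(clusters[i]), centroid(clusters[j])
                d = (xi - xj) ** 2 + (yi - yj) ** 2
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge j into i
    return clusters
```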