C5 Flashcards
Jaccard similarity
Given two sets C1 and C2 we define the Jaccard similarity of C1 and C2 , Sim (C1,C2), as the ratio of the sizes of the intersection and union of C1 and C2:
Sim (C1,C2) = |C1 u C2|/|C1 n C2|
applications Jaccards similarity
- Similarity of Users (U1,U2) = Jaccard Similarity of sets of
movies watched by both users - Similarity of Movies (M1,M2) = Jaccard similarity of sets of users that watched both movies
- Given a huge number (millions) of documents, e.g., the Web, quickly find pairs of docs that have a lot of text in common (plagiarism, similar news articles)
3 steps for similarity testing
- Shingling: convert documents to sets
- Minhashing: convert large sets to short signatures, while preserving similarity
- Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
shingles
- a k-shingle (or k-gram) for a document is a sequence of k consecutive characters that appear in the document.
- represent a doc by its set of k-shingles.
- documents that have lots of shingles in common have similar text, even if the text appears in different order => careful: you must pick k large enough, or most documents will have most shingles
- to compress long shingles, we can hash them to eg. 4 bytes => uint32, easy to compare and store => represent a doc by the set of hash values of its k-shingles
documents in matrix form
rows: (hashes of) shingles
columns: documents
- 1 in row r, column c, iff c has shingle r
- expect matrices to be sparse
minhashing
- imagine the rows permuted randomly
- define “hash” function h(C) = the position of the first (in the permuted order) row in which column C has 1 => store these positions in signature matrix
The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2)
why? => Prob(h(C1) = h (C2)) and Sim(C1, C2) = B/(B+L+R)
Thus, the test if h(C1) = h(C2) will return “yes” with probability
Sim (C1,C2). Therefore, repeating the test 100 times with various permutations h will lead to
a good estimate of Sim (C1, C2)
Locally Sensitive Hashing (LSH)
- split columns of signature matrix M into several blocks of the same size
- if two columns are very similar then it is very likely that at least at one block they will be identical. Instead of comparing blocks, send them to buckets with help of hash functions
- candidate pairs are those that hash at least once to the same bucket. Such pairs are often similar to each other, but rarely they might be not similar (“false positives”). Therefore, check if candidate pairs are really similar!
partition into bands
- divide matrix M into b bands of r rows
- for each band, hash its portion of each column to a hash table with k buckets
- candidate column pairs are those that hash to the same bucket for ≥ 1 band
- tune b and r to catch most similar pairs