C5 Flashcards

Question 1

Q

Jaccard similarity

Answer

A

Given two sets C1 and C2 we define the Jaccard similarity of C1 and C2 , Sim (C1,C2), as the ratio of the sizes of the intersection and union of C1 and C2:
Sim (C1,C2) = |C1 u C2|/|C1 n C2|

Question 2

Q

applications Jaccards similarity

Answer

A

Similarity of Users (U1,U2) = Jaccard Similarity of sets of
movies watched by both users
Similarity of Movies (M1,M2) = Jaccard similarity of sets of users that watched both movies
Given a huge number (millions) of documents, e.g., the Web, quickly find pairs of docs that have a lot of text in common (plagiarism, similar news articles)

Question 3

Q

3 steps for similarity testing

Answer

A

Shingling: convert documents to sets
Minhashing: convert large sets to short signatures, while preserving similarity
Locality-sensitive hashing: focus on pairs of signatures likely to be similar.

Question 4

Q

shingles

Answer

A

a k-shingle (or k-gram) for a document is a sequence of k consecutive characters that appear in the document.
represent a doc by its set of k-shingles.
documents that have lots of shingles in common have similar text, even if the text appears in different order => careful: you must pick k large enough, or most documents will have most shingles
to compress long shingles, we can hash them to eg. 4 bytes => uint32, easy to compare and store => represent a doc by the set of hash values of its k-shingles

Question 5

Q

documents in matrix form

Answer

A

rows: (hashes of) shingles
columns: documents
- 1 in row r, column c, iff c has shingle r
- expect matrices to be sparse

Question 6

Q

minhashing

Answer

A

imagine the rows permuted randomly
define “hash” function h(C) = the position of the first (in the permuted order) row in which column C has 1 => store these positions in signature matrix

The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2)
why? => Prob(h(C1) = h (C2)) and Sim(C1, C2) = B/(B+L+R)

Thus, the test if h(C1) = h(C2) will return “yes” with probability
Sim (C1,C2). Therefore, repeating the test 100 times with various permutations h will lead to
a good estimate of Sim (C1, C2)

Question 7

Q

Locally Sensitive Hashing (LSH)

Answer

A

split columns of signature matrix M into several blocks of the same size
if two columns are very similar then it is very likely that at least at one block they will be identical. Instead of comparing blocks, send them to buckets with help of hash functions
candidate pairs are those that hash at least once to the same bucket. Such pairs are often similar to each other, but rarely they might be not similar (“false positives”). Therefore, check if candidate pairs are really similar!

Question 8

Q

partition into bands

Answer

A

divide matrix M into b bands of r rows
for each band, hash its portion of each column to a hash table with k buckets
candidate column pairs are those that hash to the same bucket for ≥ 1 band
tune b and r to catch most similar pairs

C5 Flashcards

(8 cards)