C5 Flashcards

1
Q

Jaccard similarity

A

Given two sets C1 and C2 we define the Jaccard similarity of C1 and C2 , Sim (C1,C2), as the ratio of the sizes of the intersection and union of C1 and C2:
Sim (C1,C2) = |C1 u C2|/|C1 n C2|

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

applications Jaccards similarity

A
  • Similarity of Users (U1,U2) = Jaccard Similarity of sets of
    movies watched by both users
  • Similarity of Movies (M1,M2) = Jaccard similarity of sets of users that watched both movies
  • Given a huge number (millions) of documents, e.g., the Web, quickly find pairs of docs that have a lot of text in common (plagiarism, similar news articles)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

3 steps for similarity testing

A
  1. Shingling: convert documents to sets
  2. Minhashing: convert large sets to short signatures, while preserving similarity
  3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

shingles

A
  • a k-shingle (or k-gram) for a document is a sequence of k consecutive characters that appear in the document.
  • represent a doc by its set of k-shingles.
  • documents that have lots of shingles in common have similar text, even if the text appears in different order => careful: you must pick k large enough, or most documents will have most shingles
  • to compress long shingles, we can hash them to eg. 4 bytes => uint32, easy to compare and store => represent a doc by the set of hash values of its k-shingles
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

documents in matrix form

A

rows: (hashes of) shingles
columns: documents
- 1 in row r, column c, iff c has shingle r
- expect matrices to be sparse

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

minhashing

A
  • imagine the rows permuted randomly
  • define “hash” function h(C) = the position of the first (in the permuted order) row in which column C has 1 => store these positions in signature matrix

The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2)
why? => Prob(h(C1) = h (C2)) and Sim(C1, C2) = B/(B+L+R)

Thus, the test if h(C1) = h(C2) will return “yes” with probability
Sim (C1,C2). Therefore, repeating the test 100 times with various permutations h will lead to
a good estimate of Sim (C1, C2)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Locally Sensitive Hashing (LSH)

A
  • split columns of signature matrix M into several blocks of the same size
  • if two columns are very similar then it is very likely that at least at one block they will be identical. Instead of comparing blocks, send them to buckets with help of hash functions
  • candidate pairs are those that hash at least once to the same bucket. Such pairs are often similar to each other, but rarely they might be not similar (“false positives”). Therefore, check if candidate pairs are really similar!
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

partition into bands

A
  • divide matrix M into b bands of r rows
  • for each band, hash its portion of each column to a hash table with k buckets
  • candidate column pairs are those that hash to the same bucket for ≥ 1 band
  • tune b and r to catch most similar pairs
How well did you know this?
1
Not at all
2
3
4
5
Perfectly