3.Similarity Flashcards

Question 1

Q

What are some example applications of similarity search?

Answer

A

Plagiarism search, mirror pages, same source articles, collaborative filtering

Question 2

Q

What is a k-shingle?

Answer

A

Define a k-shingle for a document to be any substring of length k found within the document. Then, we may associate with each document the set of k-shingles that appear one or more times within that document.

Question 3

Q

How is k chosen in k-shingling?

Answer

A

k should be picked large enough that the probability of any given shingle appearing in any given document is low. A good rule of thumb is to imagine that there are only 20 characters and estimate the number of k-shingles as 20^k. For large documents, such as research articles, choice k = 9 is considered safe.

Question 4

Q

What is a stop-word-based shingle?

Answer

A

A shingle which is determined by the use of stop words (For example “for”, “you”, “A”, “of” a.s.o). The shingle is then the stop word followed by the next two words.

Question 5

Q

How much memory does a shingle take after hashing?

Question 6

Q

What is the probability that the minhash function for a random permutation of rows produces the same value for two sets?

Answer

A

The Jaccard similarity of those sets.

Question 7

Q

How are minhashes computed in practice?

Answer

A

Many matrices are too large to do a random permutation of the rows and compute a signature matrix (on the order of millions of elements). Therefore we create some hasfunctions on the rowindices which can then be used to compute a matrix with rows = hashes and columns = sets. So we go from Rows and sets to hashes and sets. We then go row by row in the row,set-matrix and see which sets has a 1. We then set the corresponding hash-set index in the signature to the hash if the set has a 1 and the hash is smaller than the element already in place in the signature. The elements of the signature are all initiated as infinity. The jaccard similarity of the sets in the signature are then approximations of the true jaccard similarity.

Question 8

Q

How can minhashing be sped up?

Answer

A

We only minhash the first m rows. As long as each column has at leastn one 1 in the first m rows in permuted order, the rows after the m:th will have no effect on any minhash value.

Question 9

Q

How is LSH applied for minhash signatures?

Answer

A

The signature matrix is divided up into b number of bands. A band is a collection of rows. We then use some hashfunction for a band to hash the signatures into a number of buckets. Each band has it’s own array of buckets. If two signatures in a band are hashed to the same bucket, they will be a candidate pair. There is a higher likelihood that that pair will be checked if they are hashed to the same bucket in multiple bands since they are more similar then. The general idea here is to not check every pair of columns for similarity and that the non-similar pair will never be checked since they won’t hash to the same bucket in any band.

Question 10

Q

What is the probability that the signatures agree in all rows of one particular band?

What is the probability that the signatures disagree in at least one row of one particular band?

What is the probability that the signatures disagree in at least one row of each of the bands?

What is the probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair?

Answer

A

s^r

1-s^r

(1-s^r)^b

1-(1-s^r)^b

Question 11

Q

What is the formula for euclidean distance (L2 Norm)?

Answer

A

sqrt(sum(x_i-y_i)^2)

Question 12

Q

What is the formula for jaccard distance?

Answer

A

d = 1 - sim(x,y) = XintersectY/(XunionY)

Question 13

Q

What is the formula for cosine distance?

Answer

A

d(angle) = XdotY/(||X||*||Y||) so the arccos of the dot product divided by the product of the L2 norms.

Question 14

Q

What is the hamming distance?

Answer

A

The number of positions where the arrays compared are different. For example the number of letters in a string that differs d(“Hamming”,”Hamnign”) = 3, d(011,100) = 3

Question 15

Q

How is LSH found for hamming distances?

Answer

A

Suppose we have a space of d-dimensional vectors, and h(x, y) denotes the Hamming distance between vectors x and y. If we take any one position of the vectors, say the ith position, we can define the function fi(x) to be the ith bit of vector x. Then fi(x) = fi(y) if and only if vectors x and y agree in the ith position. Then the probability that fi(x) = fi(y) for a randomly chosen i is exactly 1 − h(x, y)/d; i.e., it is the fraction of positions in
which x and y agree

Question 16

Q

How is LSH found for cosine distance?

Answer

Study These Flashcards

A

We start by picking a random hyperplane that crosses the plane made up by x and y in a line. To pick the hyperplane, we define a normal vector v_f and find the set of points that give a 0 dot product with the normal. Each hash function f in our locality-sensitive family F is built from a randomly chosen vector vf . Given two vectors x and y, say f(x) = f(y) if and only if the dot products vf.x and vf.y have the same sign. Then F is a locality-sensitive family for the cosine distance.

Question 17

Q

How is the sketch of a vector calculated? What is the sketch used for?

Answer

Study These Flashcards

A

Instead of choosing a random vector from all possible vectors, it turns out to be sufficiently random if we restrict our choice to vectors whose components are
+1 and −1. If we pick a collection of random vectors, say v1, v2, . . . , vn, then we can apply them to an arbitrary vector x by computing v1.x, v2.x, . . . , vn.x and then
replacing any positive value by +1 and any negative value by −1. The result is called the sketch of x. It can be used to estimate the angle between two vectors.

3.Similarity Flashcards

(17 cards)