3.Similarity Flashcards
What are some example applications of similarity search?
Plagiarism search, mirror pages, same source articles, collaborative filtering
What is a k-shingle?
Define a k-shingle for a document to be any substring of length k found within the document. Then, we may associate with each document the set of k-shingles that appear one or more times within that document.
How is k chosen in k-shingling?
k should be picked large enough that the probability of any given shingle appearing in any given document is low. A good rule of thumb is to imagine that there are only 20 characters and estimate the number of k-shingles as 20^k. For large documents, such as research articles, choice k = 9 is considered safe.
What is a stop-word-based shingle?
A shingle which is determined by the use of stop words (For example “for”, “you”, “A”, “of” a.s.o). The shingle is then the stop word followed by the next two words.
How much memory does a shingle take after hashing?
4 bytes
What is the probability that the minhash function for a random permutation of rows produces the same value for two sets?
The Jaccard similarity of those sets.
How are minhashes computed in practice?
Many matrices are too large to do a random permutation of the rows and compute a signature matrix (on the order of millions of elements). Therefore we create some hasfunctions on the rowindices which can then be used to compute a matrix with rows = hashes and columns = sets. So we go from Rows and sets to hashes and sets. We then go row by row in the row,set-matrix and see which sets has a 1. We then set the corresponding hash-set index in the signature to the hash if the set has a 1 and the hash is smaller than the element already in place in the signature. The elements of the signature are all initiated as infinity. The jaccard similarity of the sets in the signature are then approximations of the true jaccard similarity.
How can minhashing be sped up?
We only minhash the first m rows. As long as each column has at leastn one 1 in the first m rows in permuted order, the rows after the m:th will have no effect on any minhash value.
How is LSH applied for minhash signatures?
The signature matrix is divided up into b number of bands. A band is a collection of rows. We then use some hashfunction for a band to hash the signatures into a number of buckets. Each band has it’s own array of buckets. If two signatures in a band are hashed to the same bucket, they will be a candidate pair. There is a higher likelihood that that pair will be checked if they are hashed to the same bucket in multiple bands since they are more similar then. The general idea here is to not check every pair of columns for similarity and that the non-similar pair will never be checked since they won’t hash to the same bucket in any band.
What is the probability that the signatures agree in all rows of one particular band?
What is the probability that the signatures disagree in at least one row of one particular band?
What is the probability that the signatures disagree in at least one row of each of the bands?
What is the probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair?
s^r
1-s^r
(1-s^r)^b
1-(1-s^r)^b
What is the formula for euclidean distance (L2 Norm)?
sqrt(sum(x_i-y_i)^2)
What is the formula for jaccard distance?
d = 1 - sim(x,y) = XintersectY/(XunionY)
What is the formula for cosine distance?
d(angle) = XdotY/(||X||*||Y||) so the arccos of the dot product divided by the product of the L2 norms.
What is the hamming distance?
The number of positions where the arrays compared are different. For example the number of letters in a string that differs d(“Hamming”,”Hamnign”) = 3, d(011,100) = 3
How is LSH found for hamming distances?
Suppose we have a space of d-dimensional vectors, and h(x, y) denotes the Hamming distance between vectors x and y. If we take any one position of the vectors, say the ith position, we can define the function fi(x) to be the ith bit of vector x. Then fi(x) = fi(y) if and only if vectors x and y agree in the ith position. Then the probability that fi(x) = fi(y) for a randomly chosen i is exactly 1 − h(x, y)/d; i.e., it is the fraction of positions in
which x and y agree