Embeddings & Locality sensitive hashing Flashcards
What is the key idea of embedding?
Convert objects to vectors, based on some notion of similarity.
What is Jaccard similarity?
It is the ratio of elements in the intersection of two sets and the union of two sets.
What is the surprising property of the jaccard similarity?
The probability over all permutations of the rows that h(c1) = h(c2) is the same as Sim(c1,c2)
What is min hashing?
It is an algorithm that creates a signature for a matrix based on some hash functions such that the similarity between columns is maintained
What is the pseudocode for min hashing?
for each row r
for each column c
if c has 1 in row r
for each hash h_i
if h(r) is smaller than M(i,c) then M(i,c) = h_i(r)
What is locality sensitive hashing?
It is using hashing functions that take locality into consideration. For example, projecting elements on the x-axis is a hashing method that keeps them ordered in order of x value.
How to find nearest neighbors using LSH?
Hash all data points using locality-sensitive hash
All points in the same bucket as query point are candidate near neighbors
Compute distances only to candidate points to find true NN
How to resolve collisions and splits when using LSH?
To resolve same bucket problem -> compute LSH from different directions. Points are candidate if they are candidate in all hash tables
To resolve split problem -> points are candidate if they are candidates in any of the hash tables
What is the MAJORITY algorithm?
Just count each item and find if there is one that appears more than half of the time.
What is the FREQUENT algorithm?
Find all elements in a sequence whose frequency exceeds 1/k fraction of the total count.
What is sketching?
A linear transform of the input.
What is a bloom filter?
Given a set of hash functions that map all elements of the universe to a predefined range and a bit vector with the size of the range, to add an element to W, compute the hashes of the element, and then set the corresponding bits to 1.
To test whether an element is in W, compute all the hashes, and sum up the returned bits.
What is the probability of a false positive when using a bloom filter?
(1 - (1 - 1/n)^km )^k, where k is the number of hashes, and m is the number of inserts.
What is count-min sketching?
Creates a small summary of a stream as an array of size wxd, where d is the number of hash functions that map values to [1, w]