Vector Search Flashcards
What does FAISS stand for?
Facebook AI Similarity Search
Describe the vector search problem
We have a set of vectors, and for some query vector, we want to find the vectors that are most similar.
What is a Voronoi diagram?
It’s a division of the plane into cells, where each cell is all the space closer to one particular point than any others. It’s that point’s “territory.”
Conceptually, if some query point falls in a Voronoi cell, what does that mean?
It means the point that generated that cell – the one whose “territory” it is – is the closest point to your query.
What does “probing” mean?
It means searching nearby cells, not just the one you landed on.
What does IVF stand for?
InVerted File
What is the IVF strategy?
It’s a strategy for vector search. You pick some set of key points, make a Voronoi diagram of them, and then find the Voronoi cell for your query vector and search all the points in that cell – plus maybe probe some nearby ones.
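To make the strategy concrete, here’s a toy NumPy sketch. The “key points” here are just sampled data points standing in for trained k-means centroids, and all the sizes are made up for illustration – real libraries like FAISS do this far more efficiently:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 32))                    # database vectors
centroids = data[rng.choice(1000, 8, replace=False)]  # toy "key points"

# Assign every vector to its nearest centroid -> one inverted list per cell
assign = np.argmin(np.linalg.norm(data[:, None] - centroids, axis=2), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(8)}

def ivf_search(query, nprobe=2, k=5):
    # Probe the nprobe nearest cells, then brute-force search only those lists
    cell_order = np.argsort(np.linalg.norm(centroids - query, axis=1))
    cand = np.concatenate([lists[c] for c in cell_order[:nprobe]])
    dists = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

print(ivf_search(data[0]))  # querying with a stored vector should return it
```

With `nprobe=1` you search only the query’s own cell; raising it probes more neighboring cells, trading speed for recall.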
What does PQ stand for?
Product Quantization
What is the point of PQ?
It’s to compress the vectors themselves, so they take far less memory and distance calculations against them become much cheaper.
How do you do Product Quantization?
First, chunk each vector into subvectors. Then, for each subvector position (subspace), run clustering over all the subvectors in that position to get centroids. Finally, replace each subvector with the ID of its nearest centroid.
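A minimal sketch of those three steps in NumPy, assuming a hand-rolled Lloyd’s k-means (real PQ trainers do the same per-subspace clustering, just faster; all sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))   # 16-dim vectors
m, ksub = 4, 32                  # 4 subvectors, 32 centroids per subspace

def kmeans(pts, k, iters=10):
    # Minimal Lloyd's algorithm: assign to nearest centroid, recompute means
    C = pts[rng.choice(len(pts), k, replace=False)]
    for _ in range(iters):
        a = np.argmin(np.linalg.norm(pts[:, None] - C, axis=2), axis=1)
        C = np.array([pts[a == j].mean(axis=0) if np.any(a == j) else C[j]
                      for j in range(k)])
    return C

# Step 1: chunk into subvectors; Step 2: one codebook per subspace
subs = np.split(X, m, axis=1)                 # each chunk is (500, 4)
codebooks = [kmeans(s, ksub) for s in subs]

# Step 3: replace each subvector with the ID of its nearest centroid
codes = np.stack([np.argmin(np.linalg.norm(s[:, None] - C, axis=2), axis=1)
                  for s, C in zip(subs, codebooks)], axis=1)
print(codes.shape)   # (500, 4): each vector is now just 4 small integers
```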
PQ maps from what space size to what space size?
It maps from the original vector space, like R^100, down to a short code with one centroid ID per subvector – e.g. just 4 small integers if you break each vector into 4 subvectors.
What are codewords?
They’re the centroids from Product Quantization.
What is a codebook?
The set of all centroids for a particular subvector subspace in Product Quantization.
What is Indexing?
It’s the strategy you use to store your vectors in your vector space – you put them in an index. And then you need to be able to search within that index.
What is a “flat index”?
It’s one where you don’t modify the vectors at all and just store them directly.
How would you look up in a flat index?
You would literally just do KNN with the entire index.
What does KNN stand for?
K Nearest Neighbors.
How do you do KNN?
Calculate distance between the query vector and every single other vector. Then return the top K closest.
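That’s short enough to write out directly – a toy sketch with made-up 2-D vectors:

```python
import numpy as np

def knn(query, index, k):
    # Brute force: L2 distance to every vector, return the k closest ids
    dists = np.linalg.norm(index - query, axis=1)
    return np.argsort(dists)[:k]

index = np.array([[0., 0.], [1., 1.], [5., 5.], [0.9, 1.1]])
print(knn(np.array([1., 1.]), index, 2))  # ids of the two nearest vectors
```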
What are the pros and cons of a flat index?
Pros: Search quality is maximized, since you are literally just doing KNN. Cons: search is slow, because every query computes a distance to every single vector in the index.
What does ANN stand for?
Approximate Nearest Neighbors
What does LSH stand for?
Locality Sensitive Hashing
How do you do LSH?
Write some hash function that puts similar vectors in the same hash bucket. Then at lookup time, hash the query vector to get its group of similar vectors.
What does it mean “maximize collisions” in the context of LSH?
It means you want the hash function to put similar vectors in the same bucket (collide them).
What are the pros and cons of LSH?
Pros: It’s a middle ground on everything – decent performance but also decent speed and storage. Cons are that it’s not amazing at any one thing, and also susceptible to the curse of dimensionality.
What does NSW stand for?
Navigable Small World
What is an NSW graph?
It’s a graph where each node is connected to its nearest neighbors, plus a few longer-range links, so that a greedy traversal can hop from any node to any other in just a few steps.
Where does the SW in NSW come from?
It’s “Small World,” and it comes from the small-world phenomenon: a 2016 Facebook study found that any two Facebook users are separated by an average of just 3.57 connections.
What does the H in HNSW mean?
“Hierarchical” – it means you break the graph up into different layers of granularity and go from one to the next.
What are the pros and cons of HNSW?
Pros are that it is both high-quality and pretty fast. Cons are it eats up tons of memory.
What are the pros and cons of IVF?
Pros are it balances all three of quality, memory, and speed well. Cons: it’s still approximate relative to the exhaustive KNN lookup of a flat index, and recall depends on how many nearby cells you probe.
What is Jaccard Similarity?
It’s a measure of the similarity of two sets – the cardinality of their intersection divided by the cardinality of their union.
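In code:

```python
def jaccard(a, b):
    # |A intersect B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
```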
What is k-shingling? Give an example.
It’s when you slide a window of size k along a string to create tokens of size k. Example: Andrew w/ k=3 is And, ndr, dre, rew.
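In code, using the card’s own example:

```python
def shingles(s, k=3):
    # Slide a window of size k along the string
    return {s[i:i + k] for i in range(len(s) - k + 1)}

print(shingles("Andrew"))  # {'And', 'ndr', 'dre', 'rew'}
```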
What is MinHashing related to?
It’s a step in LSH: it converts sparse set representations (like shingle sets) into short, dense signatures, where two signatures agree in a given position with probability equal to the sets’ Jaccard similarity.
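A toy sketch of MinHash signatures. Salting Python’s built-in `hash` with a random seed stands in for a family of independent hash functions – a real implementation would use proper hashes:

```python
import random

def minhash_sig(shingle_set, n_hashes=100, seed=0):
    # One signature entry per hash function: the minimum hash value over the
    # set. Similar sets share minima with probability ~= Jaccard similarity.
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(32) for _ in range(n_hashes)]
    return [min(hash((salt, item)) for item in shingle_set) for salt in salts]

a = {"And", "ndr", "dre", "rew"}
b = {"And", "ndr", "dre", "res"}
sa, sb = minhash_sig(a), minhash_sig(b)
est = sum(x == y for x, y in zip(sa, sb)) / len(sa)
print(est)  # estimate of jaccard(a, b) = 3/5; varies a little run to run
```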
What is random projection?
It’s when you create a bunch of random hyperplanes through the space. Each hyperplane splits the space into a 1 side and a 0 side, and the representation of a vector is the string of 1s and 0s saying which side of each hyperplane it falls on, which you find using the dot product.
How can you use the dot product to find which side of a hyperplane a vector is on?
Let the normal vector be the vector orthogonal to the hyperplane. Then take the dot product between it and your vector. The dot product is positive when the angle between them is less than 90°, and negative when it’s greater than 90°. Since an angle of exactly 90° means lying in the hyperplane itself, the sign tells you which side of the hyperplane the vector is on.
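Putting the last two cards together, a toy random-projection sketch (the hyperplane count and dimensions are made up; each hyperplane is represented by its normal vector):

```python
import numpy as np

rng = np.random.default_rng(1)
normals = rng.normal(size=(8, 4))  # normals of 8 random hyperplanes in R^4

def rp_hash(v):
    # Dot product with each normal: positive -> same side as normal -> bit 1
    return (normals @ v > 0).astype(int)

v = rng.normal(size=4)
near = v + 0.01 * rng.normal(size=4)  # a tiny perturbation of v
far = -v                              # the exact opposite direction
print(rp_hash(v), rp_hash(near), rp_hash(far))
```

A nearby vector lands on the same side of most hyperplanes (similar code), while the opposite vector flips every bit.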
What is the Hamming distance?
It’s the number of mismatches between two vectors, e.g. 1110 and 0110 have one mismatch.
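In code:

```python
def hamming(a, b):
    # Count positions where the two codes disagree
    return sum(x != y for x, y in zip(a, b))

print(hamming("1110", "0110"))  # 1
```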
What is Quantization?
It’s a generic term referring to compressing data into a smaller space.
What is the difference between dimensionality reduction and quantization?
Dimensionality reduction is trying to get fewer dimensions. Quantization is trying to reduce the scope of each dimension, like from 32 bits required to describe a particular dimension down to 4.
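A tiny example of quantization in that sense – squeezing 32-bit floats into 4-bit codes. (This is scalar quantization, the simplest case, not PQ; the data is made up.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000).astype(np.float32)  # 32 bits per value

# Scalar quantization to 4 bits: map each value onto 16 evenly spaced levels
lo, hi = x.min(), x.max()
codes = np.round((x - lo) / (hi - lo) * 15).astype(np.uint8)  # values 0..15
decoded = lo + codes / 15 * (hi - lo)
print(codes.max(), np.abs(x - decoded).max())  # small reconstruction error
```

Same number of dimensions, but each one now needs 4 bits instead of 32.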
What is IVFPQ?
It’s IVF (Inverted File, the Voronoi-cells indexing strategy) combined with PQ: each vector is assigned to an IVF cell, and its residual (the vector minus the cell’s centroid) is then compressed with PQ, so the inverted lists store compact PQ codes.
What is recall, in the context of vector search?
It’s the number of true positives divided by all the real positives. So it’s the % of real positives that you were able to identify as positive. In the context of vector search this means the % of the K-nearest-neighbors that you were able to correctly identify with your ANN approach.
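In code:

```python
def recall_at_k(true_neighbors, approx_neighbors):
    # Fraction of the true k nearest neighbors the ANN search recovered
    return len(set(true_neighbors) & set(approx_neighbors)) / len(true_neighbors)

print(recall_at_k([1, 2, 3, 4], [1, 2, 9, 4]))  # 3 of 4 found -> 0.75
```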
What are the pros and cons of PQ?
Pros are that it takes like 97% less memory and makes ANN 5-10x faster. Cons are it lowers your recall rate.
In IVFPQ what do you do first, the IVF or the PQ?
The IVF: each vector is first assigned to a cell by the coarse quantizer, and then PQ encodes its residual relative to that cell’s centroid.
What does OPQ stand for?
Optimized Product Quantization
What are you doing in OPQ?
It’s where you apply a learned rotation to the vectors before running PQ, to balance the variance across the subvector subspaces and reduce quantization error.