VectorSearch Flashcards
What is K-Approximate Nearest Neighbor (K-ANN)?
Algorithm to find K nearest elements to a query vector in a large dataset
Why are approximate algorithms (ANNS) used instead of exact methods?
Because exact methods become computationally expensive as dataset size and vector dimensionality increase
What is HNSW?
Hierarchical Navigable Small World - an ANNS algorithm offering logarithmic search complexity based on navigable small-world graphs
What is a skip list?
A probabilistic data structure using additional levels of pointers to speed up search insertion and deletion in an ordered list
What are the three main advantages of skip lists?
1) Logarithmic search time 2) Efficient insertion/deletion 3) Low average memory usage per node
What are the two main disadvantages of skip lists?
1) Higher memory overhead from additional pointers 2) Complex theoretical analysis
How is the maximum level assigned to elements in HNSW?
Using an exponentially decaying probability distribution: ⌊−ln(unif(0 1)) ∗ mL)⌋
What are the two phases of HNSW insertion?
1) Zoom-out: traverse graph to find nearest element 2) Zoom-in: expand search to find more nearest neighbors
How does K-ANN search work in HNSW?
Start from top layer find closest neighbor pass to lower layer and at layer 0 find ef candidates
What are the main limitations of HNSW?
1) Requires processing entire dataset before queries 2) High RAM usage 3) Difficult to implement distributed search
What is SPANN?
Scalable Partitioned Approximate Nearest Neighbor - a hybrid memory-disk indexing system for ANNS
What is the basic structure of SPANN?
Data vectors divided into posting lists each associated with a centroid centroids stored in memory as coarse-grained index
What are the three steps in SPANN’s balanced hierarchical clustering?
1) Initial clustering 2) Iterative subdivision 3) Final assignment
Why does SPANN use multi-cluster assignment?
To address the boundary issue where nearby vectors may not be effectively represented by a single centroid
What are the two main costs of duplication in SPANN?
1) Higher disk usage 2) Potential disk read redundancy