VectorSearch Flashcards
What is K-Approximate Nearest Neighbor (K-ANN)?
Algorithm to find K nearest elements to a query vector in a large dataset
Why are approximate algorithms (ANNS) used instead of exact methods?
Because exact methods become computationally expensive as dataset size and vector dimensionality increase
What is HNSW?
Hierarchical Navigable Small World - an ANNS algorithm offering logarithmic search complexity based on navigable small-world graphs
What is a skip list?
A probabilistic data structure using additional levels of pointers to speed up search insertion and deletion in an ordered list
What are the three main advantages of skip lists?
1) Logarithmic search time 2) Efficient insertion/deletion 3) Low average memory usage per node
What are the two main disadvantages of skip lists?
1) Higher memory overhead from additional pointers 2) Complex theoretical analysis
How is the maximum level assigned to elements in HNSW?
Using an exponentially decaying probability distribution: ⌊−ln(unif(0 1)) ∗ mL)⌋
What are the two phases of HNSW insertion?
1) Zoom-out: traverse graph to find nearest element 2) Zoom-in: expand search to find more nearest neighbors
How does K-ANN search work in HNSW?
Start from top layer find closest neighbor pass to lower layer and at layer 0 find ef candidates
What are the main limitations of HNSW?
1) Requires processing entire dataset before queries 2) High RAM usage 3) Difficult to implement distributed search
What is SPANN?
Scalable Partitioned Approximate Nearest Neighbor - a hybrid memory-disk indexing system for ANNS
What is the basic structure of SPANN?
Data vectors divided into posting lists each associated with a centroid centroids stored in memory as coarse-grained index
What are the three steps in SPANN’s balanced hierarchical clustering?
1) Initial clustering 2) Iterative subdivision 3) Final assignment
Why does SPANN use multi-cluster assignment?
To address the boundary issue where nearby vectors may not be effectively represented by a single centroid
What are the two main costs of duplication in SPANN?
1) Higher disk usage 2) Potential disk read redundancy
What are the main advantages of SPANN?
High scalability low query latency reduced memory cost
What are the main limitations of SPANN?
High disk cost due to data replication data freshness not guaranteed by default
How does SPANN determine which posting lists to fetch during search?
It finds the closest centroid to query and selects lists whose centroids are almost as close
What is the purpose of the RNG rule in SPANN?
To reduce similarity between neighboring posting lists ensuring duplicated vectors are distributed uniformly
What is the boundary issue in SPANN?
When nearby vectors may not be effectively represented by the centroid of a single list