VectorSearch Flashcards

Question 1

Q

What is K-Approximate Nearest Neighbor (K-ANN)?

Answer

A

An algorithm that finds the K elements approximately nearest to a query vector in a large dataset, trading a small loss in accuracy for faster search

Question 2

Q

Why are approximate algorithms (ANNS) used instead of exact methods?

Answer

A

Because exact methods become computationally expensive as dataset size and vector dimensionality increase
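To make the cost concrete, here is a minimal sketch of the exact baseline that ANNS methods try to avoid. The function name `exact_knn` and the sample data are illustrative assumptions, not part of the cards.

```python
import heapq
import math

def exact_knn(query, vectors, k):
    """Return indices of the k vectors closest to query (Euclidean), by full scan."""
    # Every query touches all n vectors and all d coordinates: O(n * d) work,
    # which is what becomes expensive as n and d grow.
    dists = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, dists)]

data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.2, 0.1)]
nearest = exact_knn((0.0, 0.0), data, k=2)
```

An approximate index replaces the full scan with a search over a small part of the dataset, accepting that a true neighbor is occasionally missed.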

Question 3

Q

What is HNSW?

Answer

A

Hierarchical Navigable Small World - an ANNS algorithm offering logarithmic search complexity based on navigable small-world graphs

Question 4

Q

What is a skip list?

Answer

A

A probabilistic data structure using additional levels of pointers to speed up search, insertion, and deletion in an ordered list
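A minimal skip list sketch may help here. The class names, the promotion probability `p=0.5`, and the cap `max_level=8` are assumptions chosen for illustration; deletion is omitted to keep it short.

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # one next-pointer per level

class SkipList:
    def __init__(self, max_level=8, p=0.5, seed=None):
        self.max_level = max_level
        self.p = p
        self.rng = random.Random(seed)
        self.head = Node(None, max_level)  # sentinel with a pointer on every level
        self.level = 0                     # highest level currently in use

    def _random_level(self):
        # Each extra level is granted with probability p (coin flips).
        lvl = 0
        while self.rng.random() < self.p and lvl < self.max_level:
            lvl += 1
        return lvl

    def insert(self, key):
        # update[i] is the last node on level i whose key is < key.
        update = [self.head] * (self.max_level + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        # Move right while possible, then drop one level; finish on level 0.
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

sl = SkipList(seed=0)
for key in [5, 1, 9, 3]:
    sl.insert(key)
```

The upper levels act as express lanes: a search skips over long runs of nodes on high levels and only walks node by node on level 0.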

Question 5

Q

What are the three main advantages of skip lists?

Answer

A

1) Logarithmic search time 2) Efficient insertion/deletion 3) Low average memory usage per node

Question 6

Q

What are the two main disadvantages of skip lists?

Answer

A

1) Higher memory overhead from additional pointers 2) Complex theoretical analysis

Question 7

Q

How is the maximum level assigned to elements in HNSW?

Answer

A

Using an exponentially decaying probability distribution: l = ⌊−ln(unif(0, 1)) · mL⌋, where mL is a level normalization factor
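The level formula can be sketched in a few lines. The function name `random_level` and the choice `m_l = 1 / ln(M)` with `M = 16` are assumptions for illustration; the card itself only gives the formula.

```python
import math
import random

def random_level(m_l, rng=random):
    """Draw a maximum layer as floor(-ln(u) * m_l) with u uniform on (0, 1]."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1] and
    # the logarithm is always defined.
    u = 1.0 - rng.random()
    return math.floor(-math.log(u) * m_l)

rng = random.Random(0)
m_l = 1.0 / math.log(16)  # assumed normalization factor for M = 16
levels = [random_level(m_l, rng) for _ in range(10000)]
```

With this setting most elements land on layer 0 and each higher layer holds a rapidly shrinking fraction of the elements, which mirrors how a skip list thins out its upper levels.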

Question 8

Q

What are the two phases of HNSW insertion?

Answer

A

1) Zoom-out: traverse the graph greedily from the top layer to find the nearest element 2) Zoom-in: expand the search on the lower layers to find more nearest neighbors to connect to

Question 9

Q

How does K-ANN search work in HNSW?

Answer

A

Start from the top layer, greedily find the closest neighbor, pass it to the next lower layer as the entry point, and at layer 0 collect ef candidates, from which the K closest are returned
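The descent described above can be sketched as follows. This is a simplified illustration, not a full HNSW implementation: the graph is hand-built, each layer is a plain dict of adjacency lists, and the names `search_layer` and `knn_search` are assumptions.

```python
import heapq
import math

def search_layer(query, entry, ef, neighbors, vectors):
    """Best-first search in one layer; returns up to ef closest node ids found."""
    d0 = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexplored node first
    best = [(-d0, entry)]       # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # nothing left that can improve the result set
        for nb in neighbors.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(query, vectors[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst
    return [n for _, n in sorted((-nd, n) for nd, n in best)]

def knn_search(query, k, ef, layers, vectors, entry):
    # layers[0] is the bottom layer; upper layers are searched with ef = 1.
    for graph in reversed(layers[1:]):
        entry = search_layer(query, entry, 1, graph, vectors)[0]
    return search_layer(query, entry, ef, layers[0], vectors)[:k]

# Six points on a line; layer 0 is a chain, layer 1 links only nodes 0 and 3.
vectors = [(float(i), 0.0) for i in range(6)]
layer0 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
layer1 = {0: [3], 3: [0]}
result = knn_search((4.2, 0.0), k=2, ef=3, layers=[layer0, layer1],
                    vectors=vectors, entry=0)
```

In this toy graph the upper layer jumps from node 0 to node 3 in one hop, and the layer 0 search then only explores the neighborhood around node 3.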

Question 10

Q

What are the main limitations of HNSW?

Answer

A

1) Requires processing entire dataset before queries 2) High RAM usage 3) Difficult to implement distributed search

Question 11

Q

What is SPANN?

Answer

A

Scalable Partitioned Approximate Nearest Neighbor - a hybrid memory-disk indexing system for ANNS

Question 12

Q

What is the basic structure of SPANN?

Answer

A

Data vectors are divided into posting lists, each associated with a centroid; the centroids are stored in memory as a coarse-grained index, while the posting lists are kept on disk

Question 13

Q

What are the three steps in SPANN’s balanced hierarchical clustering?

Answer

A

1) Initial clustering 2) Iterative subdivision 3) Final assignment

Question 14

Q

Why does SPANN use multi-cluster assignment?

Answer

A

To address the boundary issue where nearby vectors may not be effectively represented by a single centroid
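A small sketch of multi-cluster assignment, under the assumption that a vector is replicated into every cluster whose centroid is within a factor (1 + eps) of its nearest centroid distance. The names `closure_assign`, `eps`, and `max_replicas` are hypothetical and chosen for illustration.

```python
import math

def closure_assign(x, centroids, eps, max_replicas=8):
    """Return the centroid indices whose posting lists should receive x."""
    dists = sorted((math.dist(x, c), j) for j, c in enumerate(centroids))
    d_min = dists[0][0]
    # Keep every centroid that is almost as close as the nearest one,
    # capped so a single vector is not copied into too many lists.
    return [j for d, j in dists if d <= (1.0 + eps) * d_min][:max_replicas]

centroids = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
boundary = closure_assign((0.9, 0.0), centroids, eps=0.3)
interior = closure_assign((0.1, 0.0), centroids, eps=0.3)
```

A vector near the boundary between two clusters ends up in both lists, while a vector deep inside a cluster is stored once, so replication is spent only where a single centroid represents the vector poorly.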

Question 15

Q

What are the two main costs of duplication in SPANN?

Answer

A

1) Higher disk usage 2) Potential disk read redundancy

Question 16

Q

What are the main advantages of SPANN?

Answer

A

High scalability, low query latency, reduced memory cost

Question 17

Q

What are the main limitations of SPANN?

Answer

A

High disk cost due to data replication; data freshness not guaranteed by default

Question 18

Q

How does SPANN determine which posting lists to fetch during search?

Answer

A

It finds the centroid closest to the query and selects only the posting lists whose centroids are almost as close
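The selection step can be sketched as below, assuming "almost as close" means within a factor (1 + eps) of the nearest centroid distance. The function `spann_search`, the parameters `eps` and `max_lists`, and the in-memory dict standing in for on-disk posting lists are all illustrative assumptions.

```python
import heapq
import math

def spann_search(query, k, centroids, posting_lists, vectors, eps, max_lists=4):
    """Pick posting lists by centroid distance, then rank their vectors exactly."""
    ranked = sorted((math.dist(query, c), j) for j, c in enumerate(centroids))
    ranked = ranked[:max_lists]
    d_min = ranked[0][0]
    # Only fetch lists whose centroid is almost as close as the nearest one.
    selected = [j for d, j in ranked if d <= (1.0 + eps) * d_min]
    candidates = set()  # a set removes vectors duplicated across lists
    for j in selected:
        candidates.update(posting_lists[j])  # stands in for a disk read
    top = heapq.nsmallest(k, candidates,
                          key=lambda i: math.dist(query, vectors[i]))
    return selected, top

vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
centroids = [(0.5, 0.0), (2.0, 0.0), (9.5, 0.0)]
posting_lists = {0: [0, 1], 1: [1, 2], 2: [3, 4]}
selected, top = spann_search((1.4, 0.0), 2, centroids, posting_lists,
                             vectors, eps=0.6)
```

Here the far-away list 2 is never read, which is the point of the pruning: queries that sit clearly inside one cluster touch fewer lists than queries near a boundary.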

Question 19

Q

What is the purpose of the RNG rule in SPANN?

Answer

A

To reduce similarity between neighboring posting lists, ensuring duplicated vectors are distributed uniformly
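One common form of relative neighborhood graph (RNG) pruning is sketched below. It is an assumption that this matches the exact comparison SPANN uses, since the card does not state it: a candidate centroid is skipped when the vector is farther from it than an already chosen centroid is. The name `rng_filter` is hypothetical.

```python
import math

def rng_filter(x, centroids, candidate_ids):
    """Keep candidate centroids that are not redundant with ones already kept."""
    ordered = sorted(candidate_ids, key=lambda j: math.dist(x, centroids[j]))
    kept = []
    for j in ordered:
        d_x = math.dist(x, centroids[j])
        # Skip j if some kept centroid lies closer to it than x does:
        # the two lists would then cover nearly the same region.
        if any(d_x > math.dist(centroids[s], centroids[j]) for s in kept):
            continue
        kept.append(j)
    return kept

centroids = [(1.0, 0.0), (1.2, 0.0), (-1.1, 0.0)]
kept = rng_filter((0.0, 0.0), centroids, [0, 1, 2])
```

In this example centroid 1 sits right next to centroid 0 and is dropped, while centroid 2 on the opposite side is kept, so the replicas of a boundary vector are spread across different directions instead of piling into adjacent lists.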

Question 20

Q

What is the boundary issue in SPANN?

Answer

A

When nearby vectors may not be effectively represented by the centroid of a single list

(20 cards)