VectorSearch Flashcards

1
Q

What is K-Approximate Nearest Neighbor (K-ANN)?

A

Algorithm to find K nearest elements to a query vector in a large dataset

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Why are approximate algorithms (ANNS) used instead of exact methods?

A

Because exact methods become computationally expensive as dataset size and vector dimensionality increase

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is HNSW?

A

Hierarchical Navigable Small World - an ANNS algorithm offering logarithmic search complexity based on navigable small-world graphs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is a skip list?

A

A probabilistic data structure using additional levels of pointers to speed up search insertion and deletion in an ordered list

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the three main advantages of skip lists?

A

1) Logarithmic search time 2) Efficient insertion/deletion 3) Low average memory usage per node

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are the two main disadvantages of skip lists?

A

1) Higher memory overhead from additional pointers 2) Complex theoretical analysis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How is the maximum level assigned to elements in HNSW?

A

Using an exponentially decaying probability distribution: ⌊−ln(unif(0 1)) ∗ mL)⌋

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are the two phases of HNSW insertion?

A

1) Zoom-out: traverse graph to find nearest element 2) Zoom-in: expand search to find more nearest neighbors

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How does K-ANN search work in HNSW?

A

Start from top layer find closest neighbor pass to lower layer and at layer 0 find ef candidates

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are the main limitations of HNSW?

A

1) Requires processing entire dataset before queries 2) High RAM usage 3) Difficult to implement distributed search

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is SPANN?

A

Scalable Partitioned Approximate Nearest Neighbor - a hybrid memory-disk indexing system for ANNS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the basic structure of SPANN?

A

Data vectors divided into posting lists each associated with a centroid centroids stored in memory as coarse-grained index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the three steps in SPANN’s balanced hierarchical clustering?

A

1) Initial clustering 2) Iterative subdivision 3) Final assignment

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Why does SPANN use multi-cluster assignment?

A

To address the boundary issue where nearby vectors may not be effectively represented by a single centroid

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the two main costs of duplication in SPANN?

A

1) Higher disk usage 2) Potential disk read redundancy

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are the main advantages of SPANN?

A

High scalability low query latency reduced memory cost

17
Q

What are the main limitations of SPANN?

A

High disk cost due to data replication data freshness not guaranteed by default

18
Q

How does SPANN determine which posting lists to fetch during search?

A

It finds the closest centroid to query and selects lists whose centroids are almost as close

19
Q

What is the purpose of the RNG rule in SPANN?

A

To reduce similarity between neighboring posting lists ensuring duplicated vectors are distributed uniformly

20
Q

What is the boundary issue in SPANN?

A

When nearby vectors may not be effectively represented by the centroid of a single list