Vector Queries Flashcards
Vector Database
A vector database is a type of database specifically designed to store, manage, and retrieve data represented as vectors, which are numerical arrays capturing information in multi-dimensional space.
Key Characteristics of Vector DBs
Data (such as text, images, or audio) is encoded into vectors, typically by machine learning models: for example, embeddings produced by NLP models or feature vectors produced by image recognition models.
Each vector represents a “point” in multi-dimensional space, and similar data points (e.g., similar sentences, images) are often close to each other in this space.
Vector databases perform similarity searches using similarity or distance measures, such as cosine similarity, Euclidean distance, or dot product.
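A minimal sketch of the three measures named above, in plain Python. The function names and the sample vectors are illustrative assumptions, not part of any vector database API.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


# Made-up sample vectors: b points in the same direction as a but is twice as long.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print(dot_product(a, b))         # 28.0
print(euclidean_distance(a, b))  # equals sqrt(14)
print(cosine_similarity(a, b))   # 1.0 up to floating-point rounding
```

Note how the two vectors are far apart by Euclidean distance yet have a cosine similarity of about 1, because cosine similarity only compares direction.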
Approximate Nearest Neighbor (ANN)
Vector databases use specialized indexing structures, such as Approximate Nearest Neighbor (ANN) indexing techniques, to efficiently handle high-dimensional data and speed up similarity searches.
Examples of ANN algorithms include HNSW (Hierarchical Navigable Small World) and LSH (Locality-Sensitive Hashing).
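To make the idea of LSH concrete, here is a small sketch of one common variant, random-hyperplane hashing for cosine similarity. It is a teaching sketch under assumptions (all function names are hypothetical, and real vector databases use far more elaborate, tuned implementations): each vector gets one bit per random hyperplane, and vectors that share a signature land in the same bucket.

```python
import random


def make_hyperplanes(num_planes, dims, seed=0):
    """Draw random hyperplane normals; the seed keeps the sketch repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        side = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if side >= 0 else 0)
    return tuple(bits)


def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their signature."""
    buckets = {}
    for vec_id, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(vec_id)
    return buckets


def candidates(query, buckets, hyperplanes):
    """Return only the ids in the query's own bucket.

    Near neighbors that fall in a different bucket are missed,
    which is what makes the search approximate.
    """
    return buckets.get(lsh_signature(query, hyperplanes), [])
```

The speed-up comes from comparing the query against one bucket instead of every stored vector; the cost is that some true neighbors may be skipped.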
Syntactical vs. Semantic Search
Take the query “apple alcoholic beverage”. In a syntactical search, the engine looks for documents containing that exact phrase. If a document doesn’t have the words “apple”, “alcoholic”, and “beverage” in close proximity or in that specific order, it may be ranked low or not shown at all. This method is limited because it is tied strictly to the syntax of the query and can miss contextually relevant documents.
In a semantic search, querying for “apple alcoholic beverage” doesn’t only return documents containing that exact phrase. The engine interprets the meaning of the query and fetches documents related to “appletini”, “apple brandy”, “apple bourbon”, and more.
Why is Vector Search Crucial for Semantic Search?
Words, phrases, or even entire sentences can be represented as vectors in a high-dimensional space. In this vector space, the “distance” between vectors indicates semantic similarity. Words or phrases with similar meanings will have vectors closer to each other.
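The sketch below illustrates ranking by vector closeness. The three-dimensional embeddings are invented by hand for illustration (real embeddings come from a trained model and have hundreds of dimensions), and `rank_by_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, embeddings):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vector, vec))
              for text, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hand-made toy vectors; the axes loosely stand for
# (apple-related, alcohol-related, machinery-related).
toy_embeddings = {
    "appletini": [0.9, 0.8, 0.1],
    "apple brandy": [0.8, 0.9, 0.2],
    "apple pie recipe": [0.9, 0.1, 0.1],
    "car engine repair": [0.0, 0.1, 0.9],
}

# Assumed embedding of the query "apple alcoholic beverage".
query_vector = [0.85, 0.85, 0.0]

for text, score in rank_by_similarity(query_vector, toy_embeddings):
    print(f"{score:.3f}  {text}")
```

None of the stored texts contains the phrase "apple alcoholic beverage", yet the two apple drinks rank first because their vectors point in nearly the same direction as the query vector.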
dense_vector
Elasticsearch’s dense_vector datatype is designed to store vectors of float values. These vectors are often employed in machine learning, especially for embeddings where items are represented as vectors in high-dimensional space.
To store a vector, you can define a mapping like:
{
  "properties": {
    "text-vector": {
      "type": "dense_vector",
      "dims": 512
    }
  }
}
Here, dims denotes the number of dimensions in the vector.
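As a rough sketch, the same mapping can be built programmatically and used for a client-side sanity check before indexing. The helper names `dense_vector_mapping` and `check_dims` are hypothetical; sending the mapping to a cluster is left out.

```python
import json


def dense_vector_mapping(field_name, dims):
    """Build a mapping body with a single dense_vector field."""
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {"properties": {field_name: {"type": "dense_vector", "dims": dims}}}


def check_dims(vector, mapping, field_name):
    """Client-side check that a vector has the length declared in the mapping."""
    expected = mapping["properties"][field_name]["dims"]
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return True


mapping = dense_vector_mapping("text-vector", 512)
print(json.dumps(mapping, indent=2))

# A vector of the declared length passes the check.
check_dims([0.0] * 512, mapping, "text-vector")
```

Every vector stored in the field must have exactly the declared number of dimensions, so dims has to match the output size of the embedding model.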
script_score
To perform vector similarity searches, we need to measure how close a given vector is to other vectors in the database. A common method for this is to compute the dot product between vectors. The script_score query in Elasticsearch lets us compute a custom score for each document with a script. Using it, we can compute the dot product between our query vector and the vectors stored in our database.
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "dotProduct(params.queryVector, 'text-vector') + 1.0",
        "params": {
          "queryVector": […]
        }
      }
    }
  }
}
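Here, params.queryVector is the vector you’re searching with, and 'text-vector' refers to the field in which the vectors are stored. The + 1.0 offset keeps the score non-negative when the vectors are normalized to unit length, since Elasticsearch does not accept negative scores.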
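A small sketch of how the request body might be assembled in Python, together with a local re-computation of the score for intuition. Both helper names are hypothetical, and `local_score` only imitates the arithmetic of the script; it does not call Elasticsearch.

```python
def script_score_query(query_vector, field_name="text-vector"):
    """Build the script_score request body for a given query vector."""
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"dotProduct(params.queryVector, '{field_name}') + 1.0",
                    "params": {"queryVector": list(query_vector)},
                },
            }
        }
    }


def local_score(query_vector, stored_vector):
    """Imitate the script's arithmetic: dot product plus 1.0."""
    return sum(q * s for q, s in zip(query_vector, stored_vector)) + 1.0


body = script_score_query([0.6, 0.8])
print(body["query"]["script_score"]["script"]["source"])

# For unit-length vectors the dot product lies in [-1, 1],
# so the shifted score lies in [0, 2].
print(local_score([0.6, 0.8], [0.6, 0.8]))    # close to 2.0 (same direction)
print(local_score([1.0, 0.0], [-1.0, 0.0]))   # 0.0 (opposite direction)
```

Because match_all selects every document, the script runs against each stored vector, which is an exact (brute-force) comparison rather than an ANN lookup.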
Important Link
https://www.elastic.co/search-labs/blog/elastic-vector-database-practical-example