Elasticsearch Flashcards
What is a document
The individual unit of data being searched over. A document is just a JSON object.
{ "id": "XYZ123", "title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "price": 10.99, "createdAt": "2024-01-01T00:00:00.000Z" }
What are indices
An index is a collection of documents. Each document is associated with a unique ID and a set of fields, which are key-value pairs that contain the data you’re searching over.
What are Mappings and fields
Mappings are the schema of the index. Mappings define the fields that the index will have, the data type of each field, and any other properties like how a field is indexed.
An example of a mapping:
{ "properties": { "id": { "type": "keyword" }, "title": { "type": "text" }, "author": { "type": "text" }, "price": { "type": "float" }, "createdAt": { "type": "date" } } }
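As a rough illustration of how a mapping constrains documents, the sketch below checks a document against the example mapping in plain Python. The `check_document` helper and the `VALIDATORS` table are hypothetical and heavily simplified; Elasticsearch's own type coercion and date parsing are much richer than this.

```python
from datetime import datetime

# The example mapping from above, as a Python dict.
MAPPING = {
    "properties": {
        "id": {"type": "keyword"},
        "title": {"type": "text"},
        "author": {"type": "text"},
        "price": {"type": "float"},
        "createdAt": {"type": "date"},
    }
}

def _is_iso_date(value):
    """Return True if value is an ISO 8601 string (simplified check)."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False

# Hypothetical, simplified validators per mapped field type.
VALIDATORS = {
    "keyword": lambda v: isinstance(v, str),
    "text": lambda v: isinstance(v, str),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": _is_iso_date,
}

def check_document(doc, mapping=MAPPING):
    """Return the names of fields whose values do not fit the mapped type."""
    bad = []
    for field, value in doc.items():
        spec = mapping["properties"].get(field)
        if spec is None:
            continue  # unmapped fields are ignored in this sketch
        if not VALIDATORS[spec["type"]](value):
            bad.append(field)
    return bad
```

Calling `check_document` on the Great Gatsby document from the first card returns an empty list, while a document with `"price": "cheap"` is reported as having a bad `price` field.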
What is a shard
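A shard is a piece of an index. An index is split into shards so that its data can be spread across nodes. Each shard is 1:1 with a Lucene index.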
What is a replica
A replica is an exact copy of a shard. Elasticsearch allows zero or more replica copies of each primary shard.
What is TF-IDF (Term Frequency-Inverse Document Frequency)
It is a measure of the importance of a word in a document. The term frequency (TF) is the number of times a term occurs in a document. The document frequency (DF) is the number of documents the term occurs in; the inverse document frequency (IDF) is its inverse, so it shrinks as the term appears in more documents.
A term with a high DF is common, so it is considered less important. Conversely, a term with a low DF is considered more important.
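A minimal sketch of the idea, using one common textbook formulation (IDF = log(N / DF), with N the number of documents). This is an assumption chosen for illustration, not the exact scoring formula Elasticsearch uses; the `tf_idf` function and the tiny corpus are hypothetical.

```python
import math

def tf_idf(term, doc, corpus):
    """Score `term` for `doc` (a list of tokens) against `corpus` (a list of token lists)."""
    tf = doc.count(term)                      # term frequency in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)          # rarer terms get a larger IDF
    return tf * idf

corpus = [
    "the great gatsby".split(),
    "the old man and the sea".split(),
    "the sun also rises".split(),
]

common = tf_idf("the", corpus[0], corpus)     # "the" is in every document
rare = tf_idf("gatsby", corpus[0], corpus)    # "gatsby" is in one document
```

Here `common` comes out as zero because "the" appears in every document, while `rare` is positive because "gatsby" appears in only one.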
How is pagination handled
From/Size Pagination
- from: the starting index of results
- size: the number of results to return
- This can be very inefficient for deep pagination (e.g. beyond 10k results) as the first from + size results have to be sorted again on every request
Search After Pagination
- search_after: use the sort values of the last result on the previous page as the starting point for the next page
- Ensures you don’t miss any documents you haven’t yet seen
- You must keep state client side and you could miss results that were inserted after your search for previous pages
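To make the difference concrete, here is a toy in-memory sketch of both styles. It does not talk to Elasticsearch; `DOCS`, `page_from_size`, `page_search_after`, and `sort_values` are hypothetical names, and the sort key (price, then id as a tiebreaker) is an assumption.

```python
# Toy "index": documents sorted by (price, id); id acts as the tiebreaker.
DOCS = sorted(
    [{"id": f"doc{i}", "price": float(i % 5)} for i in range(12)],
    key=lambda d: (d["price"], d["id"]),
)

def page_from_size(docs, from_, size):
    """from/size: skip `from_` hits, then return the next `size` hits."""
    return docs[from_:from_ + size]

def page_search_after(docs, size, search_after=None):
    """search_after: return `size` hits whose sort key is strictly greater."""
    if search_after is not None:
        docs = [d for d in docs if (d["price"], d["id"]) > tuple(search_after)]
    return docs[:size]

def sort_values(hit):
    """The state a client keeps in order to request the next page."""
    return [hit["price"], hit["id"]]

page1 = page_search_after(DOCS, 5)
page2 = page_search_after(DOCS, 5, sort_values(page1[-1]))
```

On this static data set, `page2` holds the same hits as `page_from_size(DOCS, 5, 5)`. The difference is that the client carries the last sort values forward instead of an offset.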
Cursors
- Create a point in time (PIT), use the PIT in your search query, close the PIT
- This will ensure data is consistent for your query
- Subsequent requests will not have to re-sort data
- This does use more memory
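The PIT steps above can be sketched as request bodies. The flow assumed here is: open a PIT with `POST /books/_pit?keep_alive=1m`, send searches to `POST /_search` with the returned id in the body, then close it with `DELETE /_pit` and a body of `{"id": "<pit id>"}`. The index name `books`, the `pit_search_body` helper, and the placeholder id are hypothetical; the code only builds the JSON and sends nothing.

```python
import json

def pit_search_body(pit_id, size, search_after=None, keep_alive="1m"):
    """Build a search body that reads from a point in time (PIT)."""
    body = {
        "size": size,
        "query": {"match_all": {}},
        # The PIT id takes the place of an index name in the request path.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        "sort": [{"price": "asc"}, {"id": "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = search_after
    return body

first_page = pit_search_body("PLACEHOLDER_PIT_ID", size=10)
next_page = pit_search_body("PLACEHOLDER_PIT_ID", size=10, search_after=[10.99, "XYZ123"])
print(json.dumps(next_page, indent=2))
```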
What are the different node types
- Master Node
- Data Node
- Coordinating Node
- Ingest Node
- Machine Learning Node
What is a master node
- A node responsible for coordinating the cluster
- Can add and remove nodes
- Can create and remove indices
What is a data node
- A data node is responsible for storing the data
- Large clusters will have many data nodes
- Data nodes house indices, which are composed of shards and their replicas
- Each shard is a Lucene index
What is a coordinating node
- A node that is responsible for coordinating search requests across the cluster
- It receives the search request from the client, performs query optimization, sends it to the appropriate nodes, and merges their results into the final response
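A toy sketch of the gather step: each shard returns its own hits sorted by descending score, and the coordinating node merges them into one global top-N list. The `coordinate_search` function, the hit shape, and the scores are hypothetical.

```python
import heapq
from itertools import islice

def coordinate_search(shard_results, size):
    """Merge per-shard hits (each list sorted by descending score) into a global top `size`."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit["score"], reverse=True)
    return list(islice(merged, size))

# Hypothetical responses from three shards.
shard_results = [
    [{"id": "a", "score": 9.1}, {"id": "b", "score": 3.2}],
    [{"id": "c", "score": 7.5}, {"id": "d", "score": 1.0}],
    [{"id": "e", "score": 8.4}],
]

top3 = coordinate_search(shard_results, size=3)
```

Because every shard's list is already sorted, the merge only needs to look at the head of each list, and it can stop as soon as `size` hits have been collected.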
What is an ingest node
- A node responsible for ingestion of data
- The data is transformed and prepared for indexing
What is a machine learning node
- A node responsible for machine learning tasks
What is a Lucene index
- Lucene indexes are made up of segments
- Segments are immutable
- CRUD
- Writes are batched to create new segments
- When segments get too numerous, a merge occurs to merge segments
- Deletions are handled by delete identifiers: entries with a delete identifier are skipped when reading and fully removed on the next merge
- Updates are handled like deletes: the old record is marked as deleted and the record is re-written with the new data
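The segment lifecycle above can be modelled with a small toy class. This is a sketch of the idea only, not how Lucene is implemented: `ToyIndex`, its methods, and the `max_segments` threshold are all hypothetical.

```python
class ToyIndex:
    """Toy model: batched writes, delete markers, and merges."""

    def __init__(self, max_segments=3):
        self.buffer = {}      # pending writes, not yet in any segment
        self.segments = []    # list of (docs, deleted_ids); docs is never edited after flush
        self.max_segments = max_segments

    def write(self, doc_id, doc):
        self.delete(doc_id)            # an update is a delete plus a re-write
        self.buffer[doc_id] = doc

    def delete(self, doc_id):
        self.buffer.pop(doc_id, None)
        for docs, deleted in self.segments:
            if doc_id in docs:
                deleted.add(doc_id)    # mark only; the data stays in the segment

    def flush(self):
        """Turn the batched writes into a new segment."""
        if self.buffer:
            self.segments.append((dict(self.buffer), set()))
            self.buffer = {}
        if len(self.segments) > self.max_segments:
            self.merge()

    def merge(self):
        """Rewrite all live docs into one segment, dropping marked entries."""
        live = {}
        for docs, deleted in self.segments:
            for doc_id, doc in docs.items():
                if doc_id not in deleted:
                    live[doc_id] = doc
        self.segments = [(live, set())]

    def get(self, doc_id):
        """Read a doc, skipping entries that carry a delete marker."""
        for docs, deleted in reversed(self.segments):
            if doc_id in docs and doc_id not in deleted:
                return docs[doc_id]
        return None
```

For example, writing a document, flushing, then writing it again and flushing leaves two segments: the old copy is still stored in the first segment but marked as deleted, and it only disappears once a merge runs.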
What are inverted indexes
A type of index that maps content to its locations. In the example of a document store, it maps each word to the documents that contain it, so finding the documents for a word is a direct lookup (roughly O(1)) rather than a scan of every document.
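A minimal sketch of the structure, using a plain dict from token to a set of document ids. The whitespace tokenizer, the `build_inverted_index` and `search` helpers, and the sample titles are assumptions for illustration; real analyzers and Lucene's term dictionary are more involved.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of docs containing every token in the query (AND semantics)."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    "1": "The Great Gatsby",
    "2": "The Great Escape",
    "3": "Tender Is the Night",
}
index = build_inverted_index(docs)
hits = search(index, "great gatsby")
```

Here `index["great"]` holds documents 1 and 2, and the query "great gatsby" narrows that down to document 1 by intersecting the two posting sets.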
What are doc values
A columnar, contiguous representation of a single field for all documents in a segment.
If you wanted to sort by price, the doc value structure can be used after finding the matching documents with the inverted index.
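A toy sketch of the columnar idea: the prices of all documents in a segment sit in one array indexed by document position, so sorting a set of matches only touches that array. The `docs` list, `price_column`, and `sort_by_price` are hypothetical names.

```python
# Row-oriented view: one JSON-like object per document.
docs = [
    {"id": "A", "price": 10.99},
    {"id": "B", "price": 4.50},
    {"id": "C", "price": 7.25},
    {"id": "D", "price": 12.00},
]

# Column-oriented view: one contiguous array for the price field,
# indexed by the document's position in the segment.
price_column = [d["price"] for d in docs]

def sort_by_price(matching_doc_nums):
    """Order matching doc positions by price using only the column."""
    return sorted(matching_doc_nums, key=lambda n: price_column[n])

# Suppose the inverted index matched documents 0, 2 and 3.
ordered = sort_by_price([0, 2, 3])
```

The result lists position 2 first (price 7.25), then 0 (10.99), then 3 (12.00), without reading any other field of the documents.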