Elasticsearch Flashcards

1
Q

What is a document

A

A document is the individual unit of data being searched over. It is just a JSON object, for example:

{
  "id": "XYZ123",
  "title": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "price": 10.99,
  "createdAt": "2024-01-01T00:00:00.000Z"
}
2
Q

What are indices

A

A collection of documents. Each document is associated with a unique ID and a set of fields, which are key-value pairs that contain the data you’re searching over.

3
Q

What are Mappings and fields

A

Mappings are the schema of the index. Mappings define the fields that the index will have, the data type of each field, and any other properties like how a field is indexed.

An example of a mapping:

{
  "properties": {
    "id": { "type": "keyword" },
    "title": { "type": "text" },
    "author": { "type": "text" },
    "price": { "type": "float" },
    "createdAt": { "type": "date" }
  }
}
4
Q

What is a shard

A

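A shard is a slice of an index's documents that can be placed on a different node from the index's other shards. Each shard maps 1:1 to a Lucene index.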

5
Q

What is a replica

A

A replica is an exact copy of a shard. Elasticsearch allows one or more replicas of each shard, which provide redundancy if a node fails and can also serve search requests.

6
Q

What is TF-IDF (Term Frequency-Inverse Document Frequency)

A

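It is a measure of the importance of a word in a document. The term frequency (TF) is the number of times a term occurs in a document. The document frequency (DF) is the number of documents the term occurs in, and the inverse document frequency (IDF) shrinks as DF grows, commonly computed as log(N / DF) for a corpus of N documents.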

A term with a high DF might be considered not important, or common. Conversely, a term with a low DF might be considered more important.
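
To make the definition concrete, here is a minimal sketch of the textbook formula TF × log(N / DF). The function name, the whitespace tokenization, and the three-document corpus are assumptions made for illustration; this is not necessarily the exact scoring Elasticsearch applies.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one tokenized document within `corpus` (a list of token lists)."""
    tf = Counter(doc_tokens)[term]                       # times the term occurs in this document
    df = sum(1 for tokens in corpus if term in tokens)   # documents that contain the term
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)                     # rarer terms get a larger IDF
    return tf * idf

# Hypothetical corpus, tokenized by splitting on whitespace.
corpus = [
    "the great gatsby".split(),
    "the old man and the sea".split(),
    "the sun also rises".split(),
]

print(tf_idf("the", corpus[0], corpus))     # appears in every document, so IDF is log(3/3) = 0
print(tf_idf("gatsby", corpus[0], corpus))  # appears in one document, so IDF is log(3/1)
```

A term that appears in every document scores zero no matter how often it occurs, which is the "high DF means common" intuition from the card.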

7
Q

How is pagination handled

A

From/Size Pagination

  • from: the starting index of results
  • size: the number of results to return
  • This can be very inefficient for deep pagination (e.g. beyond 10k results), because every request has to collect and sort all from + size results before discarding the first "from" of them

Search After Pagination

  • search_after: use the sort values of the previous result as a starting point for the next page
  • Ensures you don’t miss any documents you haven’t yet seen
  • You must keep state on the client side, and you could miss results that were inserted after you fetched the earlier pages

Cursors

  • Create a point in time (PIT), use the PIT in your search query, close the PIT
  • This will ensure data is consistent for your query
  • Subsequent requests will not have to re-sort the data
  • This does use more memory
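
The three approaches differ mainly in the search request body. The sketch below only builds those bodies as Python dictionaries and does not talk to a cluster. The helper names, the sort fields (taken from the earlier mapping example), and the one-minute keep-alive are assumptions for illustration.

```python
# Sort on a field plus a unique tiebreaker so every hit has a distinct sort key.
SORT = [{"createdAt": "asc"}, {"id": "asc"}]

def from_size_body(page, size):
    """from/size: skip `page * size` hits, then return the next `size` hits."""
    return {"from": page * size, "size": size, "sort": SORT}

def search_after_body(size, last_sort_values=None, pit_id=None):
    """search_after: resume after the sort values of the last hit already seen."""
    body = {"size": size, "sort": SORT}
    if last_sort_values is not None:
        body["search_after"] = last_sort_values            # omitted on the first page
    if pit_id is not None:
        body["pit"] = {"id": pit_id, "keep_alive": "1m"}   # search against a point in time
    return body

print(from_size_body(page=2, size=10))
# The sort values below stand in for the "sort" array of the previous page's last hit.
print(search_after_body(10, last_sort_values=[1704067200000, "XYZ123"]))
```

The client-side state mentioned above is exactly the last_sort_values list: it has to be carried from one request to the next.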
8
Q

What are the different node types

A
  • Master Node
  • Data Node
  • Coordinating Node
  • Ingest Node
  • Machine Learning Node
9
Q

What is a master node

A
  • A node responsible for coordinating the cluster
  • Can add and remove nodes
  • Can create and remove indices
10
Q

What is a data node

A
  • A data node is responsible for storing the data
  • Large clusters will have many data nodes
  • Data nodes house indices, which are composed of shards and their replicas
  • Each shard is a Lucene index
11
Q

What is a coordinating node

A
  • A node that is responsible for coordinating search requests across the cluster
  • It receives the search request from the client, performs query optimization, sends it to the appropriate nodes, and merges their responses into a single result for the client
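
The coordinating role can be pictured as a scatter-gather step: each shard returns its own best hits, and the coordinator merges them into one global ranking. The sketch below models only the merge, with hypothetical (score, doc_id) tuples standing in for shard responses; it is an illustration, not Elasticsearch's actual implementation.

```python
import heapq
from itertools import islice

def coordinate(shard_results, size):
    """Merge per-shard hit lists (each sorted by descending score) into a global top `size`."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))

# Hypothetical responses from two shards: (score, doc_id), best hit first.
shard_results = [
    [(0.9, "doc1"), (0.4, "doc7")],
    [(0.8, "doc3"), (0.1, "doc2")],
]

print(coordinate(shard_results, size=3))
# [(0.9, 'doc1'), (0.8, 'doc3'), (0.4, 'doc7')]
```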
12
Q

What is an ingest node

A
  • A node responsible for ingestion of data
  • The data is transformed and prepared for indexing
13
Q

What is a machine learning node

A
  • A node responsible for machine learning tasks
14
Q

What is a lucene index

A
  • Lucene indexes are made up of segments
  • Segments are immutable
  • CRUD
    • Writes are batched to create new segments
    • When segments get too numerous, a merge occurs to merge segments
    • Deletions are handled by delete identifiers: entries with a delete identifier are skipped when reading and fully removed on the next merge
    • Updates are handled like deletes: the old record is marked as deleted and the record is re-written with the new data
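
The write, delete, update, and merge behavior above can be sketched with a toy in-memory model. The class and method names are hypothetical and the structures are far simpler than Lucene's; the point is only that flushed segments are never modified, deletes are markers, and merges drop the marked entries.

```python
class ToyIndex:
    """Toy model of segment-based storage; not Lucene's actual data structures."""

    def __init__(self):
        self.segments = []   # list of (docs, deleted_ids); docs is never modified after a flush
        self.buffer = {}     # writes batched in memory until the next flush

    def write(self, doc_id, doc):
        self.delete(doc_id)           # an update is a delete of the old copy plus a new write
        self.buffer[doc_id] = doc

    def delete(self, doc_id):
        self.buffer.pop(doc_id, None)
        for docs, deleted in self.segments:
            if doc_id in docs:
                deleted.add(doc_id)   # mark only; the segment's docs stay untouched

    def flush(self):
        """Turn the batched writes into one new segment."""
        if self.buffer:
            self.segments.append((dict(self.buffer), set()))
            self.buffer = {}

    def get(self, doc_id):
        if doc_id in self.buffer:
            return self.buffer[doc_id]
        for docs, deleted in self.segments:
            if doc_id in docs and doc_id not in deleted:   # skip entries marked as deleted
                return docs[doc_id]
        return None

    def merge(self):
        """Combine all segments into one, dropping entries marked as deleted."""
        live = {}
        for docs, deleted in self.segments:
            live.update({k: v for k, v in docs.items() if k not in deleted})
        self.segments = [(live, set())] if live else []

index = ToyIndex()
index.write("a", {"price": 10.99}); index.flush()
index.write("a", {"price": 12.50}); index.flush()   # update: old copy is marked deleted
index.write("b", {"price": 5.00}); index.flush()
index.delete("b")
print(len(index.segments))                           # three segments before the merge
index.merge()
print(len(index.segments), index.get("a"), index.get("b"))
```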
15
Q

What are inverted indexes

A

A type of index that maps content to its locations. In a document store, it maps each word to the documents that contain it, so looking up a word takes roughly O(1) time instead of requiring a scan over every document.
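
As a minimal sketch of the idea, the following builds a word-to-documents map with a Python dictionary, which is where the constant-time lookup comes from in this toy version. The function name, the lowercase whitespace tokenization, and the sample documents are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical documents keyed by ID.
docs = {
    "XYZ123": "The Great Gatsby",
    "ABC456": "Great Expectations",
    "DEF789": "The Old Man and the Sea",
}

index = build_inverted_index(docs)
print(sorted(index["great"]))   # ['ABC456', 'XYZ123']
print(sorted(index["the"]))     # ['DEF789', 'XYZ123']
```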

16
Q

What are doc values

A

A columnar, contiguous representation of a single field for all documents across the segment.

If you want to sort by price, the doc values structure can be used to read each matching document's price once the matching documents have been found with the inverted index.
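
A rough sketch of that sort path: the price field is stored as one contiguous column indexed by document number, and the sort reads only that column. The documents, the document numbers, and the helper name are hypothetical, and a real segment stores doc values on disk in a more compact encoding.

```python
from array import array

# Row-oriented documents; the list position plays the role of a per-segment document number.
docs = [
    {"title": "The Great Gatsby", "price": 10.99},
    {"title": "Book B", "price": 7.50},
    {"title": "Book C", "price": 12.00},
]

# Doc values: one contiguous column holding a single field for every document.
price_column = array("d", (doc["price"] for doc in docs))

def sort_by_price(matching_doc_nums, column):
    """Order matching document numbers by their value in the column."""
    return sorted(matching_doc_nums, key=lambda n: column[n])

matches = [0, 1, 2]   # pretend these came back from an inverted-index lookup
print(sort_by_price(matches, price_column))   # [1, 0, 2]
```

Sorting never touches the full JSON documents, only the one column it needs.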