Mappings and text analysis Flashcards
How do you set the mappings of an index and what is the structure?
PUT <index> { "mappings": { "properties": { "<field>": { "type": "<type>" } } } }
What’s the difference between keyword and text types?
- Text is analysed: it is broken down into individual tokens.
- Keyword is stored as is, as a single token. It’s not analysed.
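A minimal Python sketch of the difference, assuming a toy tokenizer (lowercase, split on non-alphanumerics) rather than Elasticsearch's real analysis chain. The function names are hypothetical:

```python
import re

def index_as_text(value: str) -> list[str]:
    # Toy stand-in for analysis: lowercase, then split on non-alphanumerics.
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]

def index_as_keyword(value: str) -> list[str]:
    # A keyword field indexes the whole value as one unmodified term.
    return [value]

title = "Quick Brown Fox"
text_terms = index_as_text(title)        # ['quick', 'brown', 'fox']
keyword_terms = index_as_keyword(title)  # ['Quick Brown Fox']

# A term lookup for "brown" hits the text terms but not the keyword term.
print("brown" in text_terms, "brown" in keyword_terms)
```

This is why text suits full-text search while keyword suits exact matching.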
Name the main Elastic data types and their applications.
Numerical
integer, short, long
Floating point
float, double, scaled_float
Text
text, keyword
Specific purpose
geo_point
ip
date (stored in UTC)
Other
boolean
How are date types stored in elastic?
Dates are always stored in UTC. Internally they are kept as a long number of milliseconds since the epoch.
What is a text analyser in elastic?
It’s a way to process a string and break it down into tokens that are used for indexing and searching.
How do you use the analyze API and what is it for?
It is used for testing an analyser’s output for a given input.
POST _analyze { "analyzer": "standard", "text": "The 3 QUICK BRown-fox jumped" }
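The same request can be assembled from a script. A small sketch with a hypothetical helper name; it only builds the request and leaves sending it to whichever HTTP client you use:

```python
import json

def build_analyze_request(analyzer: str, text: str) -> tuple[str, str, str]:
    # Returns (HTTP method, path, JSON body) for the _analyze endpoint.
    body = json.dumps({"analyzer": analyzer, "text": text})
    return "POST", "/_analyze", body

method, path, body = build_analyze_request("standard", "The 3 QUICK BRown-fox jumped")
print(method, path)
print(body)
```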
How does the “standard” analyser work?
- splits on word boundaries, including hyphens
- keeps apostrophes (doesn’t assume the text is of any specific language)
- lowercases the tokens
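The three points above can be imitated with a regular expression. This is only a rough approximation in Python: the real analyser follows Unicode text segmentation rules, and the function name is hypothetical.

```python
import re

def approx_standard_analyse(text: str) -> list[str]:
    # Words are runs of letters/digits/underscores; an apostrophe is kept
    # only when it sits between two word characters (e.g. "don't").
    tokens = re.findall(r"\w+(?:'\w+)*", text)
    return [t.lower() for t in tokens]

print(approx_standard_analyse("The 3 QUICK BRown-fox jumped"))
# ['the', '3', 'quick', 'brown', 'fox', 'jumped']
print(approx_standard_analyse("Don't stop"))
# ["don't", 'stop']
```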
How does the “english” analyser work?
- lowercases the tokens
- removes English stop words (the, of, etc.)
- converts words into their base form (stemming)
What is a STOP WORD?
Common words that are not relevant for searching like “the”, “of”, etc.
What is stemming?
The process of converting a word into its base form, example: “jumped” -> “jump”.
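Stop word removal and stemming can be illustrated with a toy Python pipeline. The stop word list is a tiny hand-picked subset and the suffix stripping is deliberately naive; neither matches what the “english” analyser really uses.

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "and"}  # tiny illustrative subset
SUFFIXES = ("ing", "ed", "s")                 # naive; not a real stemmer

def toy_stem(token: str) -> str:
    for suffix in SUFFIXES:
        # Keep at least three characters of stem to avoid mangling short words.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def toy_english_analyse(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(toy_english_analyse("The fox jumped over the fences"))
# ['fox', 'jump', 'over', 'fence']
```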
What is the “simple” analyser?
- splits on any non-letter character (digits, spaces, -, ‘, etc.) and discards those characters
- lowercases the tokens
What is the “whitespace” analyser?
- Doesn’t lowercase: keeps the case.
- Only splits on whitespace.
- Keeps punctuation.
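To compare the “simple” and “whitespace” analysers side by side, here is an approximate Python imitation of each (hypothetical function names, not the Elasticsearch implementation):

```python
import re

def approx_simple_analyse(text: str) -> list[str]:
    # Split at every non-letter character (digits included) and lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def approx_whitespace_analyse(text: str) -> list[str]:
    # Split on whitespace only; case and punctuation are untouched.
    return text.split()

sample = "The 3 QUICK BRown-fox jumped."
print(approx_simple_analyse(sample))
# ['the', 'quick', 'brown', 'fox', 'jumped']
print(approx_whitespace_analyse(sample))
# ['The', '3', 'QUICK', 'BRown-fox', 'jumped.']
```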
What are the 3 components of an analyser?
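- Character filters (zero or more, applied first to the raw text)
- Tokenizer (exactly one, splits the text into tokens)
- Token filters (zero or more, applied last to the tokens)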
How do you define an analyser?
In the settings section of the index:
PUT <index> { "settings": { "analysis": { "analyzer": { "<new analyser name>": { "type": "...", "tokenizer": "<tokenizer>", "filter": ["<token filter name>"], "char_filter": ["<character filter>"] } } } } }
Note: the request key is spelled "analyzer".
TODO: what is the type in the analyser?
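A filled-in sketch of such a settings body as a Python dict. It assumes the type is "custom", the value used for analysers assembled from named parts. The analyser name is hypothetical; html_strip, standard, lowercase and stop are built-in components.

```python
import json

settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyser": {  # hypothetical analyser name
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    }
}

# This JSON would be the body of PUT <index>.
print(json.dumps(settings, indent=2))
```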
What are some of the most common tokenizers for elastic?
TODO: Check out the documentation.
How do you specify the analyser of a field?
In the mappings properties:
"<field>": { "type": "....", "analyzer": "<analyser name>" }
How do you define a token filter?
In the settings section of the index:
PUT <index> { "settings": { "analysis": { "filter": { "<token filter name>": { "type": "stop", "stopwords": "_english_" } } } } }
TODO: check the documentation on this.
How do you define a new character filter?
In the settings section of the index:
PUT <index> { "settings": { "analysis": { "char_filter": { "<character filter name>": { "type": "mapping", "mappings": [":) => happy", ":( => sad"] } } } } }
TODO: check the documentation on this.
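What a mapping character filter does to the text before tokenization can be sketched in Python. This simplified version applies the rules one after another with plain string replacement, which is an assumption and not how Elasticsearch matches them internally; the function name is hypothetical.

```python
def apply_mapping_char_filter(text: str, mappings: list[str]) -> str:
    # Each rule has the form "<from> => <to>".
    for rule in mappings:
        source, target = (part.strip() for part in rule.split("=>"))
        text = text.replace(source, target)
    return text

rules = [":) => happy", ":( => sad"]
print(apply_mapping_char_filter("good day :) bad day :(", rules))
# good day happy bad day sad
```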
What is a multi field?
It’s a way to index the same field in different ways, using different analysers.
How do you define a multi field?
{ "properties": { "<field>": { "type": "....", "fields": { "<multi field name>": { "type": "...." } } } } }
How do you reference a multi field in a query?
Use a dot between the field and the multi field name, for example:
field.subfield
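A concrete sketch of a multi field mapping and the queries that address its sub-fields, written as Python dicts. The field name title and the sub-field names raw and english are hypothetical.

```python
mapping = {
    "properties": {
        "title": {  # hypothetical field name
            "type": "text",
            "fields": {
                "raw": {"type": "keyword"},  # hypothetical sub-field name
                "english": {"type": "text", "analyzer": "english"},
            },
        }
    }
}

# Sub-fields are addressed as <field>.<sub-field> in queries.
exact_query = {"query": {"term": {"title.raw": "Quick Brown Fox"}}}
stemmed_query = {"query": {"match": {"title.english": "jumping foxes"}}}

sub_fields = [f"title.{name}" for name in mapping["properties"]["title"]["fields"]]
print(sub_fields)
# ['title.raw', 'title.english']
```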
How do you set up a field holding a nested array (an array of objects)?
In the mappings:
{
"<field>": {
"type": "nested"
}
}
What is the problem of not specifying nested arrays?
Arrays of objects are flattened by default.
For example: field: [ {a: 1, b: 2}, {a: 10, b: 20} ]
Effectively becomes:
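field.a: [1, 10]
field.b: [2, 20]
So the link between values of the same object is lost and searches may return confusing results. For example, a search for a = 1 and b = 20 matches, although no single object has both values.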
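The false match can be reproduced with a small Python simulation of the two behaviours. This models the idea only, not Elasticsearch internals, and the function names are hypothetical.

```python
docs = [{"a": 1, "b": 2}, {"a": 10, "b": 20}]

def flatten(objects: list[dict]) -> dict:
    # Default object behaviour: one multi-valued list per key.
    flat: dict = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"field.{key}", []).append(value)
    return flat

def matches_flattened(flat: dict, a, b) -> bool:
    # Each condition is checked against the whole list independently.
    return a in flat["field.a"] and b in flat["field.b"]

def matches_nested(objects: list[dict], a, b) -> bool:
    # Both conditions must hold within the same object.
    return any(o["a"] == a and o["b"] == b for o in objects)

flat = flatten(docs)
print(flat)                            # {'field.a': [1, 10], 'field.b': [2, 20]}
print(matches_flattened(flat, 1, 20))  # True  (false positive)
print(matches_nested(docs, 1, 20))     # False (correct)
```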
How do you search nested objects?
"query": { "nested": { "path": "<field>", "query": { ..... <actual query> .... } } }
How do you specify a relationship between objects (like a join table)?
- Use sparingly as it’s not very performant
- Use the “join” type
{ "type": "join", "relations": { "<parent name>": "<child name>" } }
What’s the limitation of using join fields?
Connected documents need to be indexed in the same shard, so when indexing a child you need to specify “?routing=<id of the parent>” so that it is routed to the same shard as its parent.
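Why the routing parameter keeps parent and child together can be sketched in Python. The shard is derived from a hash of the routing value, which defaults to the document id. CRC32 and the shard count here are stand-ins chosen for illustration; Elasticsearch uses its own hash function.

```python
import zlib

NUM_SHARDS = 5  # hypothetical primary shard count

def shard_for(routing_value: str) -> int:
    # Stand-in hash; Elasticsearch uses its own hash function, not CRC32.
    return zlib.crc32(routing_value.encode("utf-8")) % NUM_SHARDS

parent_id, child_id = "1", "2"

# The parent is routed by its own id (the default routing value).
parent_shard = shard_for(parent_id)
# The child is routed by the parent's id, via ?routing=<parent id>.
child_shard = shard_for(parent_id)
# Without ?routing, the child would be routed by its own id instead,
# which may or may not land on the parent's shard.
unrouted_child_shard = shard_for(child_id)

print(parent_shard == child_shard)  # True
```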
How do you index the parent and child object of a join document?
{
"<field>": {
"name": "<relation name>"
}
}
A child document also sets "parent" to the id of its parent.
example:
# parent
PUT <index>/_doc/<parent id> { "qa": { "name": "question" } }
# child
PUT <index>/_doc/<child id>?routing=<parent id> { "qa": { "name": "answer", "parent": "<parent id>" } }