Mappings and text analysis Flashcards
How do you set the mappings of an index and what is the structure?
PUT <index> { "mappings": { "properties": { "<field>": { "type": "<type>" } } } }
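To make the nesting concrete, here is a minimal Python sketch that builds such a mappings body and prints it as JSON. The index name and field names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical index and field names, used only for illustration.
index_name = "products"

mapping_body = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},         # analysed full-text field
            "sku": {"type": "keyword"},       # exact-value field
            "price": {"type": "float"},
            "created_at": {"type": "date"},
        }
    }
}

# Print the request line and the JSON body that would be sent with it.
print(f"PUT /{index_name}")
print(json.dumps(mapping_body, indent=2))
```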
What’s the difference between keyword and text types?
- Text is analysed: the value is broken down into individual tokens.
- Keyword is stored as is, as a single token. It is not analysed.
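The difference can be sketched in Python. The helper names below are made up for this card, and the regex is only a rough stand-in for real analysis.

```python
import re

def index_as_text(value: str) -> list[str]:
    """Rough stand-in for a `text` field: split into lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", value)]

def index_as_keyword(value: str) -> list[str]:
    """A `keyword` field keeps the whole value as one unmodified token."""
    return [value]

value = "New York City"
text_tokens = index_as_text(value)
keyword_tokens = index_as_keyword(value)

print(text_tokens)
print(keyword_tokens)

# A search for the single word "york" can match the text tokens,
# but the keyword field only matches the exact full value.
print("york" in text_tokens, "york" in keyword_tokens)
```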
Name the main Elastic data types and their applications.
- Numerical: integer, short, long
- Floating point: float, double, scaled_float
- Text: text, keyword
- Specific purpose: geo_point, ip, date (stored in UTC)
- Other: boolean
How are date types stored in elastic?
Dates are always stored in UTC.
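A small Python sketch of what storing in UTC means for a value that arrives with a time-zone offset. The sample timestamp is made up. The milliseconds-since-epoch step reflects how the Elasticsearch documentation describes the internal representation of dates; it is not something you set yourself.

```python
from datetime import datetime, timezone

# Made-up input value with a +02:00 offset.
local = datetime.fromisoformat("2024-01-15T10:30:00+02:00")

# Normalise to UTC, as happens when a date is indexed.
utc = local.astimezone(timezone.utc)

# Elasticsearch documents dates as being kept internally as a long:
# the number of milliseconds since the Unix epoch.
epoch_millis = int(utc.timestamp() * 1000)

print(utc.isoformat())
print(epoch_millis)
```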
What is a text analyser in elastic?
It’s a way to process a string and break it down into tokens that are used for indexing and searching.
How do you use the analyze API and what is it for?
It is for testing the output of an analyser for a given input.
POST _analyze { "analyzer": "standard", "text": "The 3 QUICK BRown-fox jumped" }
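The same request can be prepared from Python with only the standard library. This sketch assumes an unsecured cluster at http://localhost:9200, which is an assumption and not part of the card. It builds the request but does not send it, because sending needs a running cluster.

```python
import json
import urllib.request

# Assumption: an unsecured local cluster; adjust the URL for a real setup.
ES_URL = "http://localhost:9200"

body = {"analyzer": "standard", "text": "The 3 QUICK BRown-fox jumped"}

request = urllib.request.Request(
    url=f"{ES_URL}/_analyze",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending requires a running cluster, so the call is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
print(request.get_method(), request.full_url)
```

The response should contain a "tokens" list, with one entry per token produced by the analyser.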
How does the “standard” analyser work?
- splits hyphenated words into separate tokens
- keeps apostrophes (doesn’t assume the text is of any specific language)
- lowercases the tokens
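These three behaviours can be approximated in a few lines of Python. This is only a regex sketch; the real standard analyser uses Unicode text segmentation rules and handles many more cases. The function name is made up for this card.

```python
import re

def standard_like(text: str) -> list[str]:
    """Approximate the standard analyser: split on non-word characters
    (so hyphens break tokens), keep apostrophes inside words, lowercase."""
    tokens = re.findall(r"\w+(?:['’]\w+)*", text)
    return [token.lower() for token in tokens]

print(standard_like("The 3 QUICK BRown-fox jumped"))
print(standard_like("don't stop"))
```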
How does the “english” analyser work?
- lowercases the tokens
- removes English stop words (“the”, “of”, etc.)
- converts words into their base form (stemming)
What is a STOP WORD?
Common words that are not relevant for searching like “the”, “of”, etc.
What is stemming?
The process of converting a word into its base form, for example “jumped” -> “jump”.
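A toy Python sketch of the three “english” steps (lowercase, drop stop words, stem). The stop-word set is a tiny subset and the suffix stripping is deliberately crude. It is not the stemming algorithm Elasticsearch uses, so real output can differ.

```python
import re

# Tiny illustrative subset of English stop words.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in"}

def toy_stem(token: str) -> str:
    """Strip a few common suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def english_like(text: str) -> list[str]:
    """Lowercase, remove stop words, then apply the toy stemmer."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]

print(english_like("The fox jumped over the fences"))
```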
What is the “simple” analyser?
- splits on any non-letter character (digits, spaces, -, ‘, etc.) and discards those characters
- lowercases the tokens
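A rough Python approximation of the “simple” analyser, assuming it keeps only runs of letters. The function name is made up for this card.

```python
import re

def simple_like(text: str) -> list[str]:
    """Keep only runs of letters (digits and punctuation split and vanish),
    then lowercase each token."""
    # [^\W\d_] matches a word character that is neither a digit nor "_",
    # which leaves letters only.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

print(simple_like("The 3 QUICK BRown-fox jumped"))
print(simple_like("don't"))
```

Note that the digit 3 disappears and the apostrophe splits “don't” in two, unlike with the “standard” analyser.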
What is the “whitespace” analyser?
- DOESN’T lowercase: it keeps the original case.
- Only splits on whitespace.
- Keeps punctuation.
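In Python terms, the “whitespace” analyser behaves much like a plain str.split(). This is a sketch, not the actual implementation.

```python
def whitespace_like(text: str) -> list[str]:
    """Split on runs of whitespace only; case and punctuation are untouched."""
    return text.split()

print(whitespace_like("The 3 QUICK BRown-fox jumped."))
```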
What are the 3 components of an analyser?
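- Character filters (zero or more, applied to the raw text first)
- Tokenizer (exactly one, splits the text into tokens)
- Token filters (zero or more, applied to the tokens last)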
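How the three components fit together can be sketched as a small Python pipeline: character filters run on the raw string, one tokenizer splits it, and token filters then run on the token list. All function names here are made up for illustration.

```python
import re
from typing import Callable

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], list[str]]
TokenFilter = Callable[[list[str]], list[str]]

def analyze(text: str, char_filters: list[CharFilter],
            tokenizer: Tokenizer, token_filters: list[TokenFilter]) -> list[str]:
    """Run char filters on the text, tokenize once, then run token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

def strip_html(text: str) -> str:
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def lowercase(tokens: list[str]) -> list[str]:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]

result = analyze("<b>Quick</b> BROWN fox", [strip_html], str.split, [lowercase])
print(result)
```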
How do you define an analyser?
In the settings section of the index:
PUT <index> { "settings": { "analysis": { "analyzer": { "<new analyser name>": { "type": "...", "tokenizer": "<tokenizer>", "filter": ["<token filter name>"], "char_filter": ["<character filter>"] } } } } }
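A filled-in version of that structure, built as a Python dict and printed as JSON. The index name and analyser name are hypothetical. The tokenizer and filter names used here (standard, lowercase, asciifolding, html_strip) are built-in Elasticsearch components, and "custom" is the type used when combining them into your own analyser.

```python
import json

# Hypothetical names, used only for illustration.
index_name = "my-index"
analyzer_name = "my_custom_analyzer"

settings_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                analyzer_name: {
                    "type": "custom",                         # a user-assembled analyser
                    "char_filter": ["html_strip"],            # runs first, on the raw text
                    "tokenizer": "standard",                  # exactly one tokenizer
                    "filter": ["lowercase", "asciifolding"],  # token filters, in order
                }
            }
        }
    }
}

# Print the request line and the JSON body that would be sent with it.
print(f"PUT /{index_name}")
print(json.dumps(settings_body, indent=2))
```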
TODO: what is the type in the analyser?