Analysis Flashcards
Analyzer
When we insert a text document into Elasticsearch, it won’t store the text as-is. The text goes through an analysis process performed by an analyzer: the analyzer transforms the text and splits it into tokens before saving them to the inverted index.
Tokenization
This process is carried out by a component called a tokenizer, whose sole job is to chop the content into individual words called tokens.
Normalization
Normalization is a process where the tokens (words) are transformed, modified, and enriched through stemming, synonyms, stop words, and other features.
Stemming
Stemming is an operation where words are reduced (stemmed) to their root form: for example, “game” is the root word for “gaming” and “gamer”, as well as the plural form “gamers”.
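In Elasticsearch, stemming is applied through token filters such as the built-in stemmer filter. A minimal sketch of trying it with the _analyze API, assuming a cluster on localhost:9200 (the exact tokens returned depend on the stemming algorithm and version, so treat the output as illustrative):

# Apply the built-in stemmer token filter to a test string
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "filter": ["lowercase", "stemmer"],
    "text": "gamers love gaming"
  }'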
Analyzer
An analyzer module consists of essentially three components: character filters, a tokenizer, and token filters.
These three components form a pipeline that each text field passes through for text processing.
See the diagram: https://lh6.googleusercontent.com/0jGKbAGn59wry9DUzgXBxOxwdeBu577SkLpa7xYiKMY6mgTv5ZYg9ZZEunTzb-xwXwUEyg-yIfg0qyd7tmH4NZo_e26fLT0G-mhJh14upMUf7KnN-NJr9xECoNl_vS9PcxQRrq6_
Character Filter
The character filter’s job is to remove unwanted characters from the input text string.
Elasticsearch provides three character filters out of the box: html_strip, mapping, and pattern_replace.
Tokenizer
A tokenizer splits the body of text into words by using a delimiter such as whitespace, punctuation, or some other form of word boundary. For example, if the phrase “Opster ops is AWESOME!!” is passed through a tokenizer (assume the standard tokenizer for now), the result is a set of tokens: “Opster”, “ops”, “is”, “AWESOME” (lowercasing happens later, in the token filter stage).
The tokenizer is a mandatory component of the pipeline – so every analyzer must have one, and only one, tokenizer.
In addition to the standard tokenizer, there are several off-the-shelf tokenizers: keyword, N-gram, pattern, whitespace, lowercase, and a handful of others.
Token Filters
Token filters are optional: an analyzer can have zero or many token filters associated with it.
Analyzer Process Example
For example, inserting “Let’s build an Autocomplete!” into Elasticsearch will transform the text into four terms: “let’s,” “build,” “an,” and “autocomplete.”
Check the diagram here: https://miro.medium.com/v2/resize:fit:720/format:webp/0*ylGIph81hHBUVrrW.png
The analyzer will affect how we search the text, but it won’t affect the content of the text itself. With the previous example, if we search for “let”, Elasticsearch will still return the full text “Let’s build an autocomplete!” instead of only “let.”
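You can reproduce this with the _analyze API; a minimal sketch, assuming Elasticsearch is running on localhost:9200:

# Analyze the example sentence with the default standard analyzer
# (the apostrophe in "Let's" is shell-escaped as '\'')
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "standard",
    "text": "Let'\''s build an Autocomplete!"
  }'
# Expected tokens: let's, build, an, autocomplete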
Elasticsearch’s analyzer has three components you can modify depending on your use case:
Character filters
Tokenizer
Token filters
Character Filter
The first step in the analysis process is character filtering, which removes, adds, and replaces characters in the text.
There are three built-in character filters in Elasticsearch:
HTML strip character filter: strips out HTML tags and entities like <b>, <i>, <div>, and <br>.
Mapping character filter: lets you map one term to another. For example, if you want users to be able to search with an emoticon, you can map “:)” to “smile” (see the sketch after this list).
Pattern replace character filter: replaces characters matching a regular expression with another string. Be careful, though: a pattern replace character filter will slow down your document indexing process.
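As an illustration, here is a sketch of an index that wires a mapping character filter into a custom analyzer so that “:)” becomes “smile” before tokenization. The index name emoji-demo, the filter name emoticon_map, and the analyzer name emoticon_analyzer are made up for this example:

# Create an index whose analyzer maps ":)" to "smile" before tokenizing
curl --request PUT \
  --url http://localhost:9200/emoji-demo \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "char_filter": {
          "emoticon_map": {
            "type": "mapping",
            "mappings": [":) => smile"]
          }
        },
        "analyzer": {
          "emoticon_analyzer": {
            "type": "custom",
            "char_filter": ["emoticon_map"],
            "tokenizer": "standard",
            "filter": ["lowercase"]
          }
        }
      }
    }
  }'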
Tokenizer
Tokenization splits your text into tokens. For example, we previously transformed “Let’s build an autocomplete” to “let’s,” “build,” “an,” and “autocomplete.”
Commonly used tokenizers
Standard tokenizer: Elasticsearch’s default tokenizer. It will split the text by whitespace and punctuation.
Whitespace tokenizer: A tokenizer that splits the text by only whitespace.
Edge N-Gram tokenizer: really useful for building an autocomplete. It splits the text into words and then emits N-grams anchored at the start of each word (e.g. “Hello” -> “H,” “He,” “Hel,” “Hell,” “Hello”).
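The built-in edge_ngram tokenizer emits very short grams by default, so autocomplete setups usually configure it explicitly. A sketch of testing one inline with the _analyze API (the min_gram/max_gram values here are just illustrative):

# Tokenize "Hello" with an explicitly configured edge_ngram tokenizer
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 5,
      "token_chars": ["letter"]
    },
    "text": "Hello"
  }'
# Expected tokens: H, He, Hel, Hell, Hello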
Token Filter
Token filtering is the third and final step in the analysis process. It transforms the tokens depending on the token filters we use: we can lowercase tokens, remove stop words, add synonyms, and more.
The most commonly used token filter is the lowercase token filter, which lowercases all your tokens.
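A sketch of chaining the lowercase and stop token filters through the _analyze API, assuming a local cluster (the stop filter uses its default English stop word list):

# Lowercase the tokens, then drop English stop words
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],
    "text": "Build An Autocomplete"
  }'
# "An" is lowercased and then removed as a stop word, leaving: build, autocomplete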
Standard Analyzer
The standard analyzer uses:
A standard tokenizer
A lowercase token filter
A stop token filter (disabled by default)
So with those components, it basically does the following:
Tokenizes the text into tokens by whitespace and punctuation.
Lowercases the tokens.
If you enable the stop token filter, it will remove stop words.
Example: “Let’s learn about Analyzer!”
would get tokenized into:
“let’s”, “learn”, “about”, “analyzer”
We can see that the standard analyzer splits the text into tokens by whitespace. It also removes the punctuation mark “!”.
We can see that all the tokens are lowercased because the standard analyzer uses a lowercase token filter.
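If you do want stop words removed, the standard analyzer can be configured with a stop word list. A minimal sketch, using made-up index (standard-with-stopwords) and analyzer (std_english) names:

# Define a standard analyzer with the English stop word list enabled
curl --request PUT \
  --url http://localhost:9200/standard-with-stopwords \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "analyzer": {
          "std_english": {
            "type": "standard",
            "stopwords": "_english_"
          }
        }
      }
    }
  }'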
Custom Analyzer
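# Create an index with a custom analyzer: html_strip char filter, whitespace tokenizer, lowercase token filter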
curl --request PUT \
  --url http://localhost:9200/autocomplete-custom-analyzer \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "analyzer": {
          "cust_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter": [
              "html_strip"
            ],
            "filter": [
              "lowercase"
            ]
          }
        }
      }
    }
  }'
Standard Analyzer vs Our Custom Analyzer
Input text
<b>Let’s build an autocomplete!</b>
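To reproduce the comparison, you can run this input through both analyzers with the _analyze API; a minimal sketch, assuming the autocomplete-custom-analyzer index created above exists:

# Standard analyzer (no index required)
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "standard",
    "text": "<b>Let'\''s build an autocomplete!</b>"
  }'

# Custom analyzer defined on the index created earlier
curl --request POST \
  --url http://localhost:9200/autocomplete-custom-analyzer/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "cust_analyzer",
    "text": "<b>Let'\''s build an autocomplete!</b>"
  }'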
We can see some differences between them:
The results of the standard analyzer have two b tokens, while the cust_analyzer does not. This happens because the cust_analyzer strips away the HTML tag completely.
The standard analyzer splits the text by whitespace or special characters like <, >, and !, while the cust_analyzer only splits the text by whitespace.
The standard analyzer strips away special characters, while the cust_analyzer does not. We can see the difference in the “autocomplete!” token.