Analysis Flashcards

1
Q

Analyzer

A

When we insert a text document into Elasticsearch, it won't save the text as-is. The text goes through an analysis process performed by an analyzer: the analyzer transforms and splits the text into tokens before saving them to the inverted index.
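
A quick way to watch this happen is the _analyze API. Here is a minimal sketch, assuming a node running on localhost:9200 (the same host used by the custom-analyzer example later in this deck); the sample sentence is made up for illustration:

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "standard",
    "text": "Quick Foxes are RUNNING"
  }'

The response lists the tokens that would land in the inverted index: "quick", "foxes", "are", "running".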

2
Q

Tokenization

A

This process is carried out by a component called a tokenizer, whose sole job is to chop the content into individual words called tokens.

3
Q

Normalization

A

Normalization is a process where the tokens (words) are transformed, modified, and enriched through stemming, synonym expansion, stop-word removal, and other operations.
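
As a sketch of normalization via token filters, the _analyze API lets you chain filters after a tokenizer (localhost:9200 assumed, as elsewhere in this deck):

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],
    "text": "The Quick Brown Fox"
  }'

The tokens come back as "quick", "brown", "fox": everything is lowercased and "The" is dropped as a stop word.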

4
Q

Stemming

A

Stemming is an operation where words are reduced (stemmed) to their root word: for example, "game" is the root word for "gaming" and "gamer", as well as the plural form "gamers".
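
A minimal sketch using the built-in stemmer token filter (which defaults to an English stemmer); the exact stems depend on the algorithm, but inflected forms are cut back toward a common root:

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "filter": ["lowercase", "stemmer"],
    "text": "gamers love gaming"
  }'

With the default English stemmer this yields roughly "gamer", "love", "game".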

5
Q

Analyzer

A

An analyzer module essentially consists of three components: character filters, a tokenizer, and token filters.

These three components form a pipeline that every text field passes through for text processing.

Diagram of the analyzer pipeline: https://lh6.googleusercontent.com/0jGKbAGn59wry9DUzgXBxOxwdeBu577SkLpa7xYiKMY6mgTv5ZYg9ZZEunTzb-xwXwUEyg-yIfg0qyd7tmH4NZo_e26fLT0G-mhJh14upMUf7KnN-NJr9xECoNl_vS9PcxQRrq6_

6
Q

Character Filter

A

The character filter’s job is to remove unwanted characters from the input text string.

Elasticsearch provides three character filters out of the box: html_strip, mapping, and pattern_replace.
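
For illustration, a character filter can be applied in an _analyze call before tokenization; this sketch uses html_strip with the keyword tokenizer (which keeps the whole input as a single token, so the filter's effect is easy to see):

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "text": "<p>Opster <b>ops</b></p>"
  }'

The HTML tags are removed from the text before it ever reaches the tokenizer.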

7
Q

Tokenizer

A

The tokenizers split the body of text into words by using a delimiter such as whitespace, punctuation, or some form of word boundary. For example, if the phrase "Opster ops is AWESOME!!" is passed through a tokenizer (we consider a standard tokenizer for now), the result will be a set of tokens: "Opster", "ops", "is", "AWESOME". The punctuation is dropped, while lowercasing is left to a token filter later in the pipeline.

The tokenizer is a mandatory component of the pipeline – so every analyzer must have one, and only one, tokenizer.

In addition to the standard tokenizer, there are several off-the-shelf tokenizers: keyword, N-gram, edge N-gram, pattern, whitespace, lowercase, and a handful of others.
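
To reproduce the example above, here is a sketch that calls _analyze with only a tokenizer and no token filters (so nothing gets lowercased):

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "text": "Opster ops is AWESOME!!"
  }'

The response contains the tokens "Opster", "ops", "is", "AWESOME": the punctuation is gone, and lowercasing would be handled by a token filter.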

8
Q

Token Filters

A

Token filters are optional: an analyzer module can have zero or more of them.

9
Q

Analyzer Process Example

A

For example, inserting "Let's build an Autocomplete!" into Elasticsearch will transform the text into four terms: "let's," "build," "an," and "autocomplete."

See the diagram here:
https://miro.medium.com/v2/resize:fit:720/format:webp/0*ylGIph81hHBUVrrW.png

The analyzer will affect how we search the text, but it won’t affect the content of the text itself. With the previous example, if we search for “let”, Elasticsearch will still return the full text “Let’s build an autocomplete!” instead of only “let.”

Elasticsearch’s analyzer has three components you can modify depending on your use case:

Character filters
Tokenizer
Token filters

10
Q

Character Filter

A

The first step in the analysis process is character filtering, which removes, adds, or replaces characters in the text.

There are three built-in character filters in Elasticsearch:

HTML strip character filter: Will strip out HTML tags and characters like <b>, <i>, <div>, <br>, etc.

Mapping character filter: This filter lets you map one term to another. For example, if you want users to be able to search with an emoji, you can map ":)" to "smile" (see the sketch after this list).

Pattern replace character filter: Replaces text that matches a regular expression pattern with another term. Be careful, though: using a pattern replace character filter will slow down your document indexing process.
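
A minimal sketch of the emoji mapping above, defining the mapping character filter inline in an _analyze call (the ":)" to "smile" mapping and the sample text are just for illustration):

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "mapping",
        "mappings": [":) => smile"]
      }
    ],
    "text": "great support :)"
  }'

Because the character filter runs before the tokenizer, the resulting tokens are "great", "support", "smile".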

11
Q

Tokenizer

A

Tokenization splits your text into tokens. For example, we previously transformed “Let’s build an autocomplete” to “let’s,” “build,” “an,” and “autocomplete.”

Commonly used tokenizers:

Standard tokenizer: Elasticsearch’s default tokenizer. It will split the text by whitespace and punctuation.

Whitespace tokenizer: A tokenizer that splits the text by only whitespace.

Edge N-Gram tokenizer: Really useful for creating an autocomplete. It emits growing prefixes of each word, one character at a time (e.g. Hello -> "H," "He," "Hel," "Hell," "Hello"), as shown in the sketch after this list.
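
A sketch of the Hello example, configuring the edge N-gram tokenizer inline (the min_gram and max_gram values are assumptions chosen so that the full word is reached):

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 5,
      "token_chars": ["letter"]
    },
    "text": "Hello"
  }'

The response contains the prefixes "H", "He", "Hel", "Hell", "Hello".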

12
Q

Token Filter

A

Token filtering is the third and final step in the analysis process. It transforms the tokens depending on the token filters we use: for example, we can lowercase tokens, remove stop words, and add synonyms.

The most commonly used token filter is the lowercase token filter, which will lowercase all of your tokens.
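
A sketch of the lowercase token filter on its own, applied after a whitespace tokenizer (localhost:9200 assumed as before); stop-word and synonym filters are chained in the same "filter" array:

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "whitespace",
    "filter": ["lowercase"],
    "text": "Let us BUILD an Autocomplete"
  }'

The tokens come back as "let", "us", "build", "an", "autocomplete".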

13
Q

Standard Analyzer

A

The standard analyzer uses:

A standard tokenizer
A lowercase token filter
A stop token filter (disabled by default)
So with those components, it basically does the following:

Tokenizes the text into tokens by whitespace and punctuation.
Lowercases the tokens.
If you enable the stop token filter, it will remove stop words.

Example: "Let's learn about Analyzer!"

would get tokenized into:
"let's", "learn", "about", "analyzer"

We can see that the standard analyzer splits the text into tokens by whitespace. It also removes the punctuation mark "!".

We can see that all the tokens are lowercased because the standard analyzer uses a lowercase token filter.
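
Since the stop token filter is disabled by default, enabling it means configuring an analyzer of type "standard" with a stop-word list in the index settings. A minimal sketch (the index name and analyzer name here are made up for illustration):

curl --request PUT \
  --url http://localhost:9200/standard-with-stopwords \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "analyzer": {
          "std_english": {
            "type": "standard",
            "stopwords": "_english_"
          }
        }
      }
    }
  }'

With this setting, common English stop words such as "the" and "an" are dropped from the token stream.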

14
Q

Custom Analyzer

A

curl --request PUT \
  --url http://localhost:9200/autocomplete-custom-analyzer \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "analyzer": {
          "cust_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter": [
              "html_strip"
            ],
            "filter": [
              "lowercase"
            ]
          }
        }
      }
    }
  }'
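
Once the index exists, the custom analyzer can be exercised through the index-scoped _analyze endpoint. A sketch using a variant of the sample text from the comparison card below (the apostrophe is dropped to keep the shell quoting simple):

curl --request POST \
  --url http://localhost:9200/autocomplete-custom-analyzer/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "cust_analyzer",
    "text": "<b>Build an Autocomplete!</b>"
  }'

The HTML tags are stripped, the text is split only on whitespace, and everything is lowercased, so the tokens are "build", "an", "autocomplete!" (the "!" survives because the whitespace tokenizer does not remove punctuation).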

15
Q

Standard Analyzer vs Our Custom Analyzer

A

Input text:
<b>Let's build an autocomplete!</b>

We can see some differences between them:

The results of the standard analyzer have two b tokens, while the cust_analyzer does not. This happens because the cust_analyzer strips away the HTML tag completely.
The standard analyzer splits the text by whitespace or special characters like <, >, and !, while the cust_analyzer only splits the text by whitespace.
The standard analyzer strips away special characters, while the cust_analyzer does not. We can see the difference in the autocomplete! token.
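
To reproduce the standard-analyzer side of this comparison (the cust_analyzer call was sketched on the previous card), here is a sketch; the apostrophe in "Let's" is escaped for the shell:

curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "standard",
    "text": "<b>Let'\''s build an autocomplete!</b>"
  }'

The response shows the two "b" tokens and a bare "autocomplete" token, matching the differences listed above.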
