Analysis Flashcards
Analyzer
When we insert a text document into Elasticsearch, it won’t store the text as-is. The text goes through an analysis process performed by an analyzer: the analyzer transforms the text and splits it into tokens before saving them to the inverted index.
Tokenization
This process is carried out by a component called a tokenizer, whose sole job is to chop the content into individual words called tokens.
Normalization
Normalization is a process where the tokens (words) are transformed, modified, and enriched through stemming, synonyms, stop words, and other features.
Stemming
Stemming is an operation where words are reduced (stemmed) to their root form: for example, “game” is the root word for “gaming” and “gamer”, as well as the plural form “gamers”.
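In Elasticsearch, stemming is applied through token filters such as the built-in stemmer filter. A minimal sketch of trying it with the _analyze API, assuming a cluster on localhost:9200 (the exact tokens returned depend on the stemming algorithm and version, so treat the output as illustrative):

# Apply the built-in stemmer token filter to a test string
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "filter": ["lowercase", "stemmer"],
    "text": "gamers love gaming"
  }'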
Analyzer
An analyzer module consists of essentially three components: character filters, a tokenizer, and token filters.
These three components form a pipeline that each text field passes through for text processing.
See the diagram: https://lh6.googleusercontent.com/0jGKbAGn59wry9DUzgXBxOxwdeBu577SkLpa7xYiKMY6mgTv5ZYg9ZZEunTzb-xwXwUEyg-yIfg0qyd7tmH4NZo_e26fLT0G-mhJh14upMUf7KnN-NJr9xECoNl_vS9PcxQRrq6_
Character Filter
The character filter’s job is to remove unwanted characters from the input text string.
Elasticsearch provides three character filters out of the box: html_strip, mapping, and pattern_replace.
Tokenizer
A tokenizer splits the body of text into words by using a delimiter such as whitespace, punctuation, or some other form of word boundary. For example, if the phrase “Opster ops is AWESOME!!” is passed through a tokenizer (assume the standard tokenizer for now), the result is a set of tokens: “Opster”, “ops”, “is”, “AWESOME” (lowercasing happens later, in the token filter stage).
The tokenizer is a mandatory component of the pipeline – so every analyzer must have one, and only one, tokenizer.
In addition to the standard tokenizer, there are several off-the-shelf tokenizers: keyword, N-gram, pattern, whitespace, lowercase, and a handful of others.
Token Filters
Token filters are optional: an analyzer can have zero or many token filters associated with it.
Analyzer Process Example
For example, inserting “Let’s build an Autocomplete!” into Elasticsearch will transform the text into four terms: “let’s,” “build,” “an,” and “autocomplete.”
Check the diagram here: https://miro.medium.com/v2/resize:fit:720/format:webp/0*ylGIph81hHBUVrrW.png
The analyzer will affect how we search the text, but it won’t affect the content of the text itself. With the previous example, if we search for “let”, Elasticsearch will still return the full text “Let’s build an autocomplete!” instead of only “let.”
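You can reproduce this with the _analyze API; a minimal sketch, assuming Elasticsearch is running on localhost:9200:

# Analyze the example sentence with the default standard analyzer
# (the apostrophe in "Let's" is shell-escaped as '\'')
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "standard",
    "text": "Let'\''s build an Autocomplete!"
  }'
# Expected tokens: let's, build, an, autocomplete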
Elasticsearch’s analyzer has three components you can modify depending on your use case:
Character filters
Tokenizer
Token filters
Character Filter
The first step in the analysis process is character filtering, which removes, adds, and replaces characters in the text.
There are three built-in character filters in Elasticsearch:
HTML strip character filter: strips out HTML tags and entities like <b>, <i>, <div>, and <br>.
Mapping character filter: lets you map one term to another. For example, if you want users to be able to search with an emoticon, you can map “:)” to “smile” (see the sketch after this list).
Pattern replace character filter: replaces characters matching a regular expression with another string. Be careful, though: a pattern replace character filter will slow down your document indexing process.
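As an illustration, here is a sketch of an index that wires a mapping character filter into a custom analyzer so that “:)” becomes “smile” before tokenization. The index name emoji-demo, the filter name emoticon_map, and the analyzer name emoticon_analyzer are made up for this example:

# Create an index whose analyzer maps ":)" to "smile" before tokenizing
curl --request PUT \
  --url http://localhost:9200/emoji-demo \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "char_filter": {
          "emoticon_map": {
            "type": "mapping",
            "mappings": [":) => smile"]
          }
        },
        "analyzer": {
          "emoticon_analyzer": {
            "type": "custom",
            "char_filter": ["emoticon_map"],
            "tokenizer": "standard",
            "filter": ["lowercase"]
          }
        }
      }
    }
  }'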
Tokenizer
Tokenization splits your text into tokens. For example, we previously transformed “Let’s build an autocomplete” to “let’s,” “build,” “an,” and “autocomplete.”
Commonly used tokenizers
Standard tokenizer: Elasticsearch’s default tokenizer. It will split the text by whitespace and punctuation.
Whitespace tokenizer: A tokenizer that splits the text by only whitespace.
Edge N-Gram tokenizer: really useful for building an autocomplete. It splits the text into words and then emits N-grams anchored at the start of each word (e.g. “Hello” -> “H,” “He,” “Hel,” “Hell,” “Hello”).
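The built-in edge_ngram tokenizer emits very short grams by default, so autocomplete setups usually configure it explicitly. A sketch of testing one inline with the _analyze API (the min_gram/max_gram values here are just illustrative):

# Tokenize "Hello" with an explicitly configured edge_ngram tokenizer
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 5,
      "token_chars": ["letter"]
    },
    "text": "Hello"
  }'
# Expected tokens: H, He, Hel, Hell, Hello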
Token Filter
Token filtering is the third and final step in the analysis process. It transforms the tokens depending on the token filters we use: we can lowercase tokens, remove stop words, add synonyms, and more.
The most commonly used token filter is the lowercase token filter, which lowercases all your tokens.
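A sketch of chaining the lowercase and stop token filters through the _analyze API, assuming a local cluster (the stop filter uses its default English stop word list):

# Lowercase the tokens, then drop English stop words
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "tokenizer": "standard",
    "filter": ["lowercase", "stop"],
    "text": "Build An Autocomplete"
  }'
# "An" is lowercased and then removed as a stop word, leaving: build, autocomplete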
Standard Analyzer
The standard analyzer uses:
A standard tokenizer
A lowercase token filter
A stop token filter (disabled by default)
So with those components, it basically does the following:
Tokenizes the text into tokens by whitespace and punctuation.
Lowercases the tokens.
If you enable the stop token filter, it will remove stop words.
Example: “Let’s learn about Analyzer!”
would get tokenized into:
“let’s”, “learn”, “about”, “analyzer”
We can see that the standard analyzer splits the text into tokens by whitespace. It also removes the punctuation mark “!”.
We can see that all the tokens are lowercased because the standard analyzer uses a lowercase token filter.
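If you do want stop words removed, the standard analyzer can be configured with a stop word list. A minimal sketch, using made-up index (standard-with-stopwords) and analyzer (std_english) names:

# Define a standard analyzer with the English stop word list enabled
curl --request PUT \
  --url http://localhost:9200/standard-with-stopwords \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "analyzer": {
          "std_english": {
            "type": "standard",
            "stopwords": "_english_"
          }
        }
      }
    }
  }'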
Custom Analyzer
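# Create an index with a custom analyzer: html_strip char filter, whitespace tokenizer, lowercase token filter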
curl --request PUT \
  --url http://localhost:9200/autocomplete-custom-analyzer \
  --header 'Content-Type: application/json' \
  --data '{
    "settings": {
      "analysis": {
        "analyzer": {
          "cust_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter": [
              "html_strip"
            ],
            "filter": [
              "lowercase"
            ]
          }
        }
      }
    }
  }'
Standard Analyzer vs Our Custom Analyzer
Input text
<b>Let’s build an autocomplete!</b>
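To reproduce the comparison, you can run this input through both analyzers with the _analyze API; a minimal sketch, assuming the autocomplete-custom-analyzer index created above exists:

# Standard analyzer (no index required)
curl --request POST \
  --url http://localhost:9200/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "standard",
    "text": "<b>Let'\''s build an autocomplete!</b>"
  }'

# Custom analyzer defined on the index created earlier
curl --request POST \
  --url http://localhost:9200/autocomplete-custom-analyzer/_analyze \
  --header 'Content-Type: application/json' \
  --data '{
    "analyzer": "cust_analyzer",
    "text": "<b>Let'\''s build an autocomplete!</b>"
  }'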
We can see some differences between them:
The results of the standard analyzer have two b tokens, while the cust_analyzer does not. This happens because the cust_analyzer strips away the HTML tag completely.
The standard analyzer splits the text by whitespace or special characters like <, >, and !, while the cust_analyzer only splits the text by whitespace.
The standard analyzer strips away special characters, while the cust_analyzer does not. We can see the difference in the “autocomplete!” token.