Elasticsearch Flashcards
What is search
- find the most relevant documents that contain your search terms
Process
- know of the existence of the document
  i.e. a web crawler fetches documents and creates a massive corpus of docs
- index the document
  i.e. every document is parsed and tokenized and then indexed;
  individual terms are extracted and stored in a data structure called an inverted index
- know how relevant the document is compared to the search terms
  i.e. based on the search terms, every document gets a relevance score
- ability to retrieve the document
Inverted Index
data structure that holds a mapping from each term to the documents in which it appears
Parsing Steps
step #1
- split words on whitespace
- normalize all words
- lowercase everything
- remove punctuation
step #2
- calculate the frequency of each word in the corpus i.e. multiple docs make a corpus
- you can use different analyzers to parse documents
Creating Inverted Index (Postings list - in search jargon)
docs
  doc1 => {id: 1, words: "winter is coming"}
  doc2 => {id: 2, words: "it snows in winter"}
  doc3 => {id: 3, words: "I love hot chocolate in winter"}

word        frequency   documents
chocolate   1           3
coming      1           1
hot         1           3
i           1           3
in          2           2, 3
is          1           1
it          1           2
love        1           3
snows       1           2
winter      3           1, 2, 3
- index is sorted on words
Search results
search          doc ids
==========================
winter          1, 2, 3
hot             3
snows || hot    2, 3
snows && hot    none
if you want to search for words ending with LATE, reverse all words in the index and prefix-match the reversed query (ES has a built-in reverse token filter for this)
i.e. chocolate => ETALOCOHC
and the query LATE becomes ETAL, which prefix-matches ETALOCOHC
search based on substrings => use ngram analysis
Ngram analysis
YOURS => yo, you, your, ou, our, ours, ur, urs, rs
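a minimal sketch of an ngram analyzer matching the YOURS example above (min_gram=2, max_gram=4; the index, tokenizer and analyzer names here are made up for illustration):
curl -H "Content-Type: application/json" -XPUT localhost:9200/myindex -d '
{ "settings": { "analysis": { "tokenizer": { "my_ngram": { "type": "ngram", "min_gram": 2, "max_gram": 4 } }, "analyzer": { "ngram_analyzer": { "type": "custom", "tokenizer": "my_ngram", "filter": ["lowercase"] } } } } }'
fields analyzed with ngram_analyzer can then be matched on any 2-4 character substring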
What is Elasticsearch
- uses Apache Lucene under the hood
- provides search capabilities
- provides analytics capabilities such as aggregations
- distributed i.e. can scale to thousands of nodes
- highly available and fault tolerant i.e. multiple copies of your data are stored in the cluster (every index is replicated)
- simple CRUD via a REST API
- powerful query DSL
- schemaless - no schema needed for docs (mappings are inferred dynamically)
Commands
./bin/elasticsearch -Ecluster.name=foo -Enode.name=node1
Schema
types => logical groupings of documents within an index
index => made up of different document types
example
1. blog engine
   type blog_post => {title, content, date}
   type comment => {user, content, date}
2. ecommerce site
   indexes => catalog, customers, inventory
Sharding and Replication
- one node might not be able to hold all the data of an index
- sharding - process of splitting the index across multiple nodes or physical machines i.e. every node holds only a subset of your data
- searches on a sharded index run in parallel
- every shard can have corresponding replicas
- a replica is never placed on the same node as its primary shard
- a shard can have 0 to n replicas
- by default an index in ES has 5 shards and 1 replica i.e. every shard has one backup copy
Healthcheck
Get cluster health
http://localhost:9200/_cat/health?v&pretty
cluster status
GREEN - all primary shards and replicas are available
YELLOW - some replicas are not yet allocated, data is still fully available
RED - some primary shards of certain indexes are not available
Get node health
http://localhost:9200/_cat/nodes?v&pretty
ip, heap.percent, ram.percent, cpu, load_1m, load_5m, load_15m, node.role, master, name
Creating Indexes and documents
create new index
curl -XPUT localhost:9200/products
get index info
http://localhost:9200/_cat/indices?v&pretty
health, status, index, pri, rep, docs.count, store.size
- if running on a single node, index health will be YELLOW since replicas cannot be allocated on the same node as their primary
Add new product
curl -H "Content-Type: application/json" -XPUT localhost:9200/products/mobiles/1 -d '{ "name": "", "storage": "", "reviews": ["foo", "bar"] }'
response { "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1, "_shards": { "total": 2 } }
to add laptops
curl -H "Content-Type: application/json" -XPOST localhost:9200/products/laptops -d '
{
  "name": "",
  "storage": "",
  "reviews": ["foo", "bar"]
}'
here ES will autogenerate an ID since none is passed (note: POST, not PUT, when you omit the ID).
Retrieving documents
http://localhost:9200/products/mobiles/1
response { "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1, "_source": { "name": "", "storage": "", "reviews": ["foo", "bar"] } }
localhost:9200/products/mobiles/1?_source=false
{ "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1 }
localhost:9200/products/mobiles/1?_source=name
{ "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1, "_source": { "name": "" } }
Updating docs
Full update
curl -H "Content-Type: application/json" -XPUT localhost:9200/products/mobiles/1 -d '{ "name": "", "storage": "", "reviews": ["foo", "bar"] }'
response { "_index": "products", "_type": "mobiles", "_id": "1", "_version": 2 } <<<< _version is incremented on every full update
Delete doc
curl -XDELETE localhost:9200/products/mobiles/1
check if doc exists
curl -i -XHEAD localhost:9200/products/mobiles/1
response: 404 since the doc was deleted above (200 if it exists)
delete index
curl -XDELETE localhost:9200/products
Bulk Operations
Method #1
curl -H "Content-Type: application/json" -XPOST localhost:9200/_bulk -d '
{"index": {"_index": "products", "_type": "mobiles", "_id": 10}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {"_index": "products", "_type": "mobiles", "_id": 11}}
{"name": "bar", "storage": "", "reviews": ""}
'
first line of each pair -> index and id info
second line -> document data
Method #2 == create a .json file with all the data
data.json
{"index": {}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {}}
{"name": "bar", "storage": "", "reviews": ""}
(the file must end with a newline)
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/products/mobiles/_bulk --data-binary @data.json
Search Query
there are two contexts in an ES query
- query context - how relevant are the results (a _score is calculated)
- filter context - do the docs satisfy the criteria, yes or no (no scoring, results can be cached)
===
search all documents in the customers index for foo in any field
localhost:9200/customers/_search?q=foo
===
sort on age desc
localhost:9200/customers/_search?q=foo&sort=age:desc
===
query state:florida, skip the first 10 results and return the next 2
localhost:9200/customers/_search?q=state:florida&from=10&size=2
===
returns all documents; no real relevance ranking happens and every doc gets score=1.0
localhost:9200/customers/_search -d
{
  "query": { "match_all": {} }
}
===
localhost:9200/customers/_search -d { "query": { "match_all": {} }, "sort": { "age": { "order": "desc" } }, "from": 5, "size": 2 }
Term Searches
search for docs that contain foo in the name field
localhost:9200/customers/_search -d { "_source": false, "query": { "term": { "name": "foo" } } }
response:
{
  "hits": { "total": 2, "max_score": 4.1, "hits": [...] }
}
- source filtering does not affect relevance ranking
localhost:9200/customers/_search -d { "_source": { "includes": ["name"], "excludes": ["description"] }, "query": { "term": { "name": "foo" } } }
Full Text Queries
match with options
- match
- match_phrase - match an entire phrase, not just one word (see the sketch after this list)
- match_phrase_prefix - treat the last term of the phrase as a prefix to match
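a minimal match_phrase sketch (the street field and phrase are borrowed from the bool examples further below):
localhost:9200/customers/_search -d { "query": { "match_phrase": { "street": "magnolia bridge" } } }
this only matches docs where magnolia and bridge appear next to each other, in that order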
simple match == search for foo in name - but this is full text search, not an exact term lookup. The query is analyzed the same way the field was, so it takes care of caps etc.
localhost:9200/customers/_search -d { "query": { "match": { "name": "foo" } } }
match name field if either foo or bar exists
localhost:9200/customers/_search -d { "query": { "match": { "name": { "query": "foo bar", "operator": "or" } } } }
- default operator is OR
match with prefix == search for all names starting with f
localhost:9200/customers/_search -d { "query": { "match_phrase_prefix": { "name": "f" } } }
Boolean Query
- must - docs must contain all query words i.e. like AND
- should - docs may or may not contain the query words i.e. like OR
- must_not - docs must not contain any of the query words i.e. like NOT
find docs which MUST have both magnolia and bridge in the street address. This normally gives fewer results.
localhost:9200/customers/_search -d { "query": { "bool": { "must": [ { "match": { "street": "magnolia" } }, { "match": { "street": "bridge" } } ] } } }
find docs which COULD have magnolia or bridge in the street address. Both terms might not be present. This gives more results.
localhost:9200/customers/_search -d { "query": { "bool": { "should": [ { "match": { "street": "magnolia" } }, { "match": { "street": "bridge" } } ] } } }
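for completeness, a must_not sketch (field borrowed from above) - exclude all docs whose street contains bridge:
localhost:9200/customers/_search -d { "query": { "bool": { "must_not": [ { "match": { "street": "bridge" } } ] } } }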
Boosted Term Search
==
find docs which have state CA or FL, but boost all CA docs by a factor of 2 - so they appear higher in the results.
localhost:9200/customers/_search -d { "query": { "bool": { "should": [ { "term": { "state": { "value": "CA", "boost": 2 } } }, { "term": { "state": { "value": "FL" } } } ] } } }
Query with filter + bool
range query ==
localhost:9200/customers/_search -d { "query": { "bool": { "must": { "match_all": {} }, "filter": { "range": { "age": { "gte": 20, "lte": 30 } } } } } }
- the filter contributes nothing to _score; every matching doc just gets the constant score from match_all (1.0)
search query with filter == find females older than 20 in CA
localhost:9200/customers/_search -d { "query": { "bool": { "must": [ { "term": { "state": { "value": "CA" } } } ], "filter": [ { "term": { "gender": "female" } }, { "range": { "age": { "gte": 20 } } } ] } } }
- this query will have a real _score on each document since the must clause runs in query context
Aggregations in ES
Metric
- sum, average, min, max, count etc
Bucketing
- logically group docs into buckets based on field values
Matrix
- operate on multiple fields at once (e.g. matrix_stats)
Pipeline
- aggregations that work on the output of other aggregations
Metric agg - average
combine aggs with query
localhost:9200/customers/_search -d { "size": 0, "aggs": { "average_age": { "avg": { "field": "age" } } } }
- size=0 indicates that we do not want any documents (hits) returned, only the aggregation result
find average age of all residents of CA
localhost:9200/customers/_search -d { "size": 0, "query": { "bool": { "filter": { "match": { "state": "CA" } } } }, "aggs": { "average_age": { "avg": { "field": "age" } } } }
Stats
localhost:9200/customers/_search -d { "size": 0, "aggs": { "age_stats": { "stats": { "field": "age" } } } }
this calculates all the stats for field = age
- count, min, max, avg, sum
Cardinality agg
localhost:9200/customers/_search -d { "size": 0, "aggs": { "age_count": { "cardinality": { "field": "age" } } } }
gives the (approximate) count of unique values of age across the whole index
Bucketing agg
localhost:9200/customers/_search -d { "size": 0, "aggs": { "gender_group": { "terms": { "field": "gender" } } } }
get doc counts per gender i.e. male: 10, female: 30
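bucket and metric aggs can be nested; a sketch of average age per gender (field names as above):
localhost:9200/customers/_search -d { "size": 0, "aggs": { "gender_group": { "terms": { "field": "gender" }, "aggs": { "average_age": { "avg": { "field": "age" } } } } } }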
Configuring ES cluster
- install the JDK on all three machines
- install ES on all machines
elasticsearch.yml
cluster.name = name of the cluster (must match on every node)
node.name = name of this node
node.master = true/false i.e. whether this node is eligible to become master
node.data = true/false i.e. whether this node stores data
network.host = IP address of this node which ES will bind to
discovery.zen.ping.unicast.hosts = [ip1, ip2] => IP addresses of all other nodes in the cluster
discovery.zen.minimum_master_nodes = (master-eligible nodes / 2) + 1, to avoid split brain
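a minimal elasticsearch.yml sketch for one node of a 3-node cluster (all names and IPs here are made up for illustration):
cluster.name: mycluster
node.name: node1
node.master: true
node.data: true
network.host: 192.168.1.10
discovery.zen.ping.unicast.hosts: ["192.168.1.11", "192.168.1.12"]
discovery.zen.minimum_master_nodes: 2   # (3 / 2) + 1 = 2, avoids split brain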
Sharding
- each index by default has 5 shards and 1 replica for each shard
- shards cannot be changed after index creation
- replicas can be changed after index creation
curl -H "Content-Type: application/json" -XPUT localhost:9200/myindex -d '{ "settings": { "number_of_shards": 2, "number_of_replicas": 0 } }'
1 lucene index = 1 ES shard
2 shards + 2 replicas = 6 total shards
i.e. primary shards * (replicas + 1) = 2 * (2 + 1) = 6
5 shards + 1 replica = 5 * (1 + 1) = 10
5 shards + 2 replicas = 5 * (2 + 1) = 15
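you can verify the shard layout with the _cat API (index name assumed from the example above):
curl localhost:9200/_cat/shards/myindex?v
lists every primary (p) and replica (r) shard and the node it is allocated to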
Routing Parameters
curl -H "Content-Type: application/json" -XPUT "localhost:9200/myindex/customers/1?routing=A" -d '
{
  "name": "foo"
}'
this stores the document on the shard that routing value A hashes to; all other docs indexed with the same routing value end up on that same shard
Bulk ==
curl -H "Content-Type: application/json" -XPOST localhost:9200/myindex/customers/_bulk -d '
{"index": {"_id": 1, "_routing": "A"}}
{"name": "foo"}
{"index": {"_id": 2, "_routing": "A"}}
{"name": "bar"}
{"index": {"_id": 3, "_routing": "B"}}
{"name": "baz"}
'
foo and bar will be on the shard routing value A maps to
baz will be on the shard routing value B maps to
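searches can pass the same routing value so that only the matching shard is queried (a minimal sketch):
localhost:9200/myindex/customers/_search?routing=A -d { "query": { "match": { "name": "foo" } } }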
ES Similarity Models
A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.
curl -H "Content-Type: application/json" -XPUT localhost:9200/index/_settings -d '{ "index": { "similarity": { "default": { "type": "BM25" } } } }'
- type can be "BM25" (the default) or "boolean", among others; the index must be closed while changing the default similarity
- you can specify similarity models for each field
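a minimal sketch of per-field similarity in a mapping (index, type and field names are assumptions):
curl -H "Content-Type: application/json" -XPUT localhost:9200/myindex -d '
{ "mappings": { "mytype": { "properties": { "title": { "type": "text", "similarity": "boolean" } } } } }'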
String field configuration
- full text search => map as a text field (analyzed)
- keyword search => map as a keyword field (not analyzed, exact match)
Analyzers
steps
- Tokenize - split text on whitespace, remove punctuation
- Normalize
  - stemming - reduce each word to its root i.e. running/ran => run
  - add synonyms to the inverted index
  - lowercase every word
  - handle variations of each word - spelling mistakes etc
Builtin Analyzers
- standard (default - grammar-based tokenization, lowercases)
- simple (splits on anything that is not a letter, lowercases)
- whitespace (splits on whitespace only, does not lowercase)
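you can inspect what an analyzer emits with the _analyze API, e.g. the standard analyzer over the doc1 text from earlier (capitalization and punctuation added here to show the normalization):
curl -H "Content-Type: application/json" -XPOST localhost:9200/_analyze -d '
{ "analyzer": "standard", "text": "Winter is Coming!" }'
=> tokens: winter, is, coming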
Difference between Term queries and Match queries
Term
- looks for exact term in inverted index
- used for exact match situations i.e. numbers, dates, keywords etc
- mostly returns fewer results, but most of them are relevant
- case sensitive
Match ==
- query is passed through the analyzer first
- mostly used for full text search
- makes use of stemming and normalization
- more likely to match irrelevant docs
- no support for wildcards etc
- case insensitive
Term v/s Match queries
create a .json file with all the data
data.json
{"index": {}}
{"name": "John Smith"}
{"index": {}}
{"name": "Jane Smith"}
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/customers/personal/_bulk --data-binary @data.json
(the index must be customers for the searches below; the type name is arbitrary)
== localhost:9200/customers/_search -d { "query": { "term": { "name": "John Smith" } } }
NO RESULTS
localhost:9200/customers/_search -d { "query": { "term": { "name": "John" } } }
NO RESULTS
note: the term query is not analyzed, but it searches the "name" field which IS analyzed (indexed as the lowercase tokens "john" and "smith"). So the search is effectively case sensitive: "John Smith" matches no single token, and "John" fails on the uppercase J.
localhost:9200/customers/_search -d { "query": { "match": { "name": "John" } } }
WILL GET RESULTS - the match query is analyzed, so it actually searches for "john"
name is a string and hence is mapped to a text field (analyzed) plus a keyword sub-field (name.keyword) by default
to treat the whole field as a keyword you need to specify a new mapping for the "name" field
i.e. "mappings": { "<type>": { "properties": { "name": { "type": "keyword" } } } }
now the name field will not be passed through the analyzer
so the above term queries will fetch results.
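even without remapping, the default keyword sub-field can be term-queried directly (assuming default dynamic mapping):
localhost:9200/customers/_search -d { "query": { "term": { "name.keyword": "John Smith" } } }
WILL GET RESULTS - name.keyword stores the original string unanalyzed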
Mappings for fields
{ "settings": "analysis": {} "mappings": {} }
===
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sample_analyzer": { "type": "custom", "tokenizer": "keyword", "filter": ["lowercase"] }
      }
    }
  },
  "mappings": {
    "foo": {
      "properties": {
        "name": { "type": "text", "analyzer": "sample_analyzer" }   <<< custom analyzer applied to the field
      }
    }
  }
}
- note: analyzers apply to text fields; the keyword tokenizer emits the whole string as a single token, so with the lowercase filter this gives case-insensitive exact matching
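a quick way to sanity-check the custom analyzer above (the index name myindex is an assumption):
curl -H "Content-Type: application/json" -XPOST localhost:9200/myindex/_analyze -d '
{ "analyzer": "sample_analyzer", "text": "John Smith" }'
=> a single token: "john smith" (keyword tokenizer keeps the whole string, lowercase filter normalizes it)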