Elasticsearch Flashcards
What is search
- find the most relevant documents that contain your search terms
Process
- know of the existence of the document
  i.e. a web crawler fetches documents and creates a massive corpus of docs
- index the document
  i.e. every document is parsed and tokenized and then indexed;
  individual terms are extracted and stored in a data structure called an inverted index
- know how relevant the document is compared to the search terms
  i.e. based on the search terms, every document gets a relevance score
- ability to retrieve the document
Inverted Index
data structure that holds a mapping from each term to the documents in which it appears
Parsing Steps
step #1
- split words on whitespace
- normalize all words
- lowercase everything
- remove punctuation
step #2
- calculate the frequency of each word in the corpus i.e. multiple docs make a corpus
- you can use different analyzers to parse documents
Creating Inverted Index (Postings list - in search jargon)
docs
  doc1 => {id: 1, words: "winter is coming"}
  doc2 => {id: 2, words: "it snows in winter"}
  doc3 => {id: 3, words: "I love hot chocolate in winter"}

word        frequency   documents
chocolate   1           3
coming      1           1
hot         1           3
i           1           3
in          2           2, 3
is          1           1
it          1           2
love        1           3
snows       1           2
winter      3           1, 2, 3
- index is sorted on words
Search results
search          doc ids
==========================
winter          1, 2, 3
hot             3
snows || hot    2, 3
snows && hot    none
if you want to search for words ending with LATE, reverse all words in the index and prefix-match the reversed query (ES has a built-in reverse token filter for this)
i.e. chocolate => ETALOCOHC
and the query LATE becomes ETAL, which prefix-matches ETALOCOHC
search based on substrings => use ngram analysis
Ngram analysis
YOURS => yo, you, your, ou, our, ours, ur, urs, rs
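a minimal sketch of an ngram analyzer matching the YOURS example above (min_gram=2, max_gram=4; the index, tokenizer and analyzer names here are made up for illustration):
curl -H "Content-Type: application/json" -XPUT localhost:9200/myindex -d '
{ "settings": { "analysis": { "tokenizer": { "my_ngram": { "type": "ngram", "min_gram": 2, "max_gram": 4 } }, "analyzer": { "ngram_analyzer": { "type": "custom", "tokenizer": "my_ngram", "filter": ["lowercase"] } } } } }'
fields analyzed with ngram_analyzer can then be matched on any 2-4 character substring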
What is Elasticsearch
- uses Apache Lucene under the hood
- provides search capabilities
- provides analytics capabilities such as aggregations
- distributed i.e. can scale to thousands of nodes
- highly available and fault tolerant i.e. multiple copies of your data are stored in the cluster (every index is replicated)
- simple CRUD via a REST API
- powerful query DSL
- schemaless - no schema needed for docs (mappings are inferred dynamically)
Commands
./bin/elasticsearch -Ecluster.name=foo -Enode.name=node1
Schema
types => logical groupings of documents within an index
index => made up of different document types
example
1. blog engine
   type blog_post => {title, content, date}
   type comment => {user, content, date}
2. ecommerce site
   indexes => catalog, customers, inventory
Sharding and Replication
- one node might not be able to hold all the data of an index
- sharding - process of splitting the index across multiple nodes or physical machines i.e. every node holds only a subset of your data
- searches on a sharded index run in parallel
- every shard can have corresponding replicas
- a replica is never placed on the same node as its primary shard
- a shard can have 0 to n replicas
- by default an index in ES has 5 shards and 1 replica i.e. every shard has one backup copy
Healthcheck
Get cluster health
http://localhost:9200/_cat/health?v&pretty
cluster status
GREEN - all primary shards and replicas are available
YELLOW - some replicas are not yet allocated, data is still fully available
RED - some primary shards of certain indexes are not available
Get node health
http://localhost:9200/_cat/nodes?v&pretty
ip, heap.percent, ram.percent, cpu, load_1m, load_5m, load_15m, node.role, master, name
Creating Indexes and documents
create new index
curl -XPUT localhost:9200/products
get index info
http://localhost:9200/_cat/indices?v&pretty
health, status, index, pri, rep, docs.count, store.size
- if running on a single node, index health will be YELLOW since replicas cannot be allocated on the same node as their primary
Add new product
curl -H "Content-Type: application/json" -XPUT localhost:9200/products/mobiles/1 -d '{ "name": "", "storage": "", "reviews": ["foo", "bar"] }'
response { "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1, "_shards": { "total": 2 } }
to add laptops
curl -H "Content-Type: application/json" -XPOST localhost:9200/products/laptops -d '
{
  "name": "",
  "storage": "",
  "reviews": ["foo", "bar"]
}'
here ES will autogenerate an ID since none is passed (note: POST, not PUT, when you omit the ID).
Retrieving documents
http://localhost:9200/products/mobiles/1
response { "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1, "_source": { "name": "", "storage": "", "reviews": ["foo", "bar"] } }
localhost:9200/products/mobiles/1?_source=false
{ "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1 }
localhost:9200/products/mobiles/1?_source=name
{ "_index": "products", "_type": "mobiles", "_id": "1", "_version": 1, "_source": { "name": "" } }
Updating docs
Full update
curl -H "Content-Type: application/json" -XPUT localhost:9200/products/mobiles/1 -d '{ "name": "", "storage": "", "reviews": ["foo", "bar"] }'
response { "_index": "products", "_type": "mobiles", "_id": "1", "_version": 2 } <<<< _version is incremented on every full update
Delete doc
curl -XDELETE localhost:9200/products/mobiles/1
check if doc exists
curl -i -XHEAD localhost:9200/products/mobiles/1
response: 404 since the doc was deleted above (200 if it exists)
delete index
curl -XDELETE localhost:9200/products
Bulk Operations
Method #1
curl -H "Content-Type: application/json" -XPOST localhost:9200/_bulk -d '
{"index": {"_index": "products", "_type": "mobiles", "_id": 10}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {"_index": "products", "_type": "mobiles", "_id": 11}}
{"name": "bar", "storage": "", "reviews": ""}
'
first line of each pair -> index and id info
second line -> document data
Method #2 == create a .json file with all the data
data.json
{"index": {}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {}}
{"name": "bar", "storage": "", "reviews": ""}
(the file must end with a newline)
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/products/mobiles/_bulk --data-binary @data.json
Search Query
there are two contexts in an ES query
- query context - how relevant are the results (a _score is calculated)
- filter context - do the docs satisfy the criteria, yes or no (no scoring, results can be cached)
===
search all documents in the customers index for foo in any field
localhost:9200/customers/_search?q=foo
===
sort on age desc
localhost:9200/customers/_search?q=foo&sort=age:desc
===
query state:florida, skip the first 10 results and return the next 2
localhost:9200/customers/_search?q=state:florida&from=10&size=2
===
returns all documents; no real relevance ranking happens and every doc gets score=1.0
localhost:9200/customers/_search -d
{
  "query": { "match_all": {} }
}
===
localhost:9200/customers/_search -d { "query": { "match_all": {} }, "sort": { "age": { "order": "desc" } }, "from": 5, "size": 2 }
Term Searches
search for docs that contain foo in the name field
localhost:9200/customers/_search -d { "_source": false, "query": { "term": { "name": "foo" } } }
response:
{
  "hits": { "total": 2, "max_score": 4.1, "hits": [...] }
}
- source filtering does not affect relevance ranking
localhost:9200/customers/_search -d { "_source": { "includes": ["name"], "excludes": ["description"] }, "query": { "term": { "name": "foo" } } }
Full Text Queries
match with options
- match
- match_phrase - match an entire phrase, not just one word (see the sketch after this list)
- match_phrase_prefix - treat the last term of the phrase as a prefix to match
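a minimal match_phrase sketch (the street field and phrase are borrowed from the bool examples further below):
localhost:9200/customers/_search -d { "query": { "match_phrase": { "street": "magnolia bridge" } } }
this only matches docs where magnolia and bridge appear next to each other, in that order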
simple match == search for foo in name - but this is full text search, not an exact term lookup. The query is analyzed the same way the field was, so it takes care of caps etc.
localhost:9200/customers/_search -d { "query": { "match": { "name": "foo" } } }
match name field if either foo or bar exists
localhost:9200/customers/_search -d { "query": { "match": { "name": { "query": "foo bar", "operator": "or" } } } }
- default operator is OR
match with prefix == search for all names starting with f
localhost:9200/customers/_search -d { "query": { "match_phrase_prefix": { "name": "f" } } }
Boolean Query
- must - docs must contain all query words i.e. like AND
- should - docs may or may not contain the query words i.e. like OR
- must_not - docs must not contain any of the query words i.e. like NOT
find docs which MUST have both magnolia and bridge in the street address. This normally gives fewer results.
localhost:9200/customers/_search -d { "query": { "bool": { "must": [ { "match": { "street": "magnolia" } }, { "match": { "street": "bridge" } } ] } } }
find docs which COULD have magnolia or bridge in the street address. Both terms might not be present. This gives more results.
localhost:9200/customers/_search -d { "query": { "bool": { "should": [ { "match": { "street": "magnolia" } }, { "match": { "street": "bridge" } } ] } } }
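for completeness, a must_not sketch (field borrowed from above) - exclude all docs whose street contains bridge:
localhost:9200/customers/_search -d { "query": { "bool": { "must_not": [ { "match": { "street": "bridge" } } ] } } }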
Boosted Term Search
==
find docs which have state CA or FL, but boost all CA docs by a factor of 2 - so they appear higher in the results.
localhost:9200/customers/_search -d { "query": { "bool": { "should": [ { "term": { "state": { "value": "CA", "boost": 2 } } }, { "term": { "state": { "value": "FL" } } } ] } } }
Query with filter + bool
range query ==
localhost:9200/customers/_search -d { "query": { "bool": { "must": { "match_all": {} }, "filter": { "range": { "age": { "gte": 20, "lte": 30 } } } } } }
- the filter contributes nothing to _score; every matching doc just gets the constant score from match_all (1.0)
search query with filter == find females older than 20 in CA
localhost:9200/customers/_search -d { "query": { "bool": { "must": [ { "term": { "state": { "value": "CA" } } } ], "filter": [ { "term": { "gender": "female" } }, { "range": { "age": { "gte": 20 } } } ] } } }
- this query will have a real _score on each document since the must clause runs in query context
Aggregations in ES
Metric
- sum, average, min, max, count etc
Bucketing
- logically group docs into buckets based on field values
Matrix
- operate on multiple fields at once (e.g. matrix_stats)
Pipeline
- aggregations that work on the output of other aggregations
Metric agg - average
combine aggs with query
localhost:9200/customers/_search -d { "size": 0, "aggs": { "average_age": { "avg": { "field": "age" } } } }
- size=0 indicates that we do not want any documents (hits) returned, only the aggregation result
find average age of all residents of CA
localhost:9200/customers/_search -d { "size": 0, "query": { "bool": { "filter": { "match": { "state": "CA" } } } }, "aggs": { "average_age": { "avg": { "field": "age" } } } }
Stats
localhost:9200/customers/_search -d { "size": 0, "aggs": { "age_stats": { "stats": { "field": "age" } } } }
this calculates all the stats for field = age
- count, min, max, avg, sum
Cardinality agg
localhost:9200/customers/_search -d { "size": 0, "aggs": { "age_count": { "cardinality": { "field": "age" } } } }
gives the (approximate) count of unique values of age across the whole index
Bucketing agg
localhost:9200/customers/_search -d { "size": 0, "aggs": { "gender_group": { "terms": { "field": "gender" } } } }
get doc counts per gender i.e. male: 10, female: 30
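bucket and metric aggs can be nested; a sketch of average age per gender (field names as above):
localhost:9200/customers/_search -d { "size": 0, "aggs": { "gender_group": { "terms": { "field": "gender" }, "aggs": { "average_age": { "avg": { "field": "age" } } } } } }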
Configuring ES cluster
- install the JDK on all three machines
- install ES on all machines
elasticsearch.yml
cluster.name = name of the cluster (must match on every node)
node.name = name of this node
node.master = true/false i.e. whether this node is eligible to become master
node.data = true/false i.e. whether this node stores data
network.host = IP address of this node which ES will bind to
discovery.zen.ping.unicast.hosts = [ip1, ip2] => IP addresses of all other nodes in the cluster
discovery.zen.minimum_master_nodes = (master-eligible nodes / 2) + 1, to avoid split brain
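a minimal elasticsearch.yml sketch for one node of a 3-node cluster (all names and IPs here are made up for illustration):
cluster.name: mycluster
node.name: node1
node.master: true
node.data: true
network.host: 192.168.1.10
discovery.zen.ping.unicast.hosts: ["192.168.1.11", "192.168.1.12"]
discovery.zen.minimum_master_nodes: 2   # (3 / 2) + 1 = 2, avoids split brain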
Sharding
- each index by default has 5 shards and 1 replica for each shard
- shards cannot be changed after index creation
- replicas can be changed after index creation
curl -H "Content-Type: application/json" -XPUT localhost:9200/myindex -d '{ "settings": { "number_of_shards": 2, "number_of_replicas": 0 } }'
1 lucene index = 1 ES shard
2 shards + 2 replicas = 6 total shards
i.e. primary shards * (replicas + 1) = 2 * (2 + 1) = 6
5 shards + 1 replica = 5 * (1 + 1) = 10
5 shards + 2 replicas = 5 * (2 + 1) = 15
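you can verify the shard layout with the _cat API (index name assumed from the example above):
curl localhost:9200/_cat/shards/myindex?v
lists every primary (p) and replica (r) shard and the node it is allocated to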
Routing Parameters
curl -H "Content-Type: application/json" -XPUT "localhost:9200/myindex/customers/1?routing=A" -d '
{
  "name": "foo"
}'
this stores the document on the shard that routing value A hashes to; all other docs indexed with the same routing value end up on that same shard
Bulk ==
curl -H "Content-Type: application/json" -XPOST localhost:9200/myindex/customers/_bulk -d '
{"index": {"_id": 1, "_routing": "A"}}
{"name": "foo"}
{"index": {"_id": 2, "_routing": "A"}}
{"name": "bar"}
{"index": {"_id": 3, "_routing": "B"}}
{"name": "baz"}
'
foo and bar will be on the shard routing value A maps to
baz will be on the shard routing value B maps to
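searches can pass the same routing value so that only the matching shard is queried (a minimal sketch):
localhost:9200/myindex/customers/_search?routing=A -d { "query": { "match": { "name": "foo" } } }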
ES Similarity Models
A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.
curl -H "Content-Type: application/json" -XPUT localhost:9200/index/_settings -d '{ "index": { "similarity": { "default": { "type": "BM25" } } } }'
- type can be "BM25" (the default) or "boolean", among others; the index must be closed while changing the default similarity
- you can specify similarity models for each field
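a minimal sketch of per-field similarity in a mapping (index, type and field names are assumptions):
curl -H "Content-Type: application/json" -XPUT localhost:9200/myindex -d '
{ "mappings": { "mytype": { "properties": { "title": { "type": "text", "similarity": "boolean" } } } } }'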
String field configuration
- full text search => map as a text field (analyzed)
- keyword search => map as a keyword field (not analyzed, exact match)
Analyzers
steps
- Tokenize - split text on whitespace, remove punctuation
- Normalize
  - stemming - reduce each word to its root i.e. running/ran => run
  - add synonyms to the inverted index
  - lowercase every word
  - handle variations of each word - spelling mistakes etc
Builtin Analyzers
- standard (default - grammar-based tokenization, lowercases)
- simple (splits on anything that is not a letter, lowercases)
- whitespace (splits on whitespace only, does not lowercase)
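you can inspect what an analyzer emits with the _analyze API, e.g. the standard analyzer over the doc1 text from earlier (capitalization and punctuation added here to show the normalization):
curl -H "Content-Type: application/json" -XPOST localhost:9200/_analyze -d '
{ "analyzer": "standard", "text": "Winter is Coming!" }'
=> tokens: winter, is, coming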
Difference between Term queries and Match queries
Term
- looks for exact term in inverted index
- used for exact match situations i.e. numbers, dates, keywords etc
- mostly returns fewer results, but most of them are relevant
- case sensitive
Match ==
- query is passed through the analyzer first
- mostly used for full text search
- makes use of stemming and normalization
- more likely to match irrelevant docs
- no support for wildcards etc
- case insensitive
Term v/s Match queries
create a .json file with all the data
data.json
{"index": {}}
{"name": "John Smith"}
{"index": {}}
{"name": "Jane Smith"}
curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/customers/personal/_bulk --data-binary @data.json
(the index must be customers for the searches below; the type name is arbitrary)
== localhost:9200/customers/_search -d { "query": { "term": { "name": "John Smith" } } }
NO RESULTS
localhost:9200/customers/_search -d { "query": { "term": { "name": "John" } } }
NO RESULTS
note: the term query is not analyzed, but it searches the "name" field which IS analyzed (indexed as the lowercase tokens "john" and "smith"). So the search is effectively case sensitive: "John Smith" matches no single token, and "John" fails on the uppercase J.
localhost:9200/customers/_search -d { "query": { "match": { "name": "John" } } }
WILL GET RESULTS - the match query is analyzed, so it actually searches for "john"
name is a string and hence is mapped to a text field (analyzed) plus a keyword sub-field (name.keyword) by default
to treat the whole field as a keyword you need to specify a new mapping for the "name" field
i.e. "mappings": { "<type>": { "properties": { "name": { "type": "keyword" } } } }
now the name field will not be passed through the analyzer
so the above term queries will fetch results.
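even without remapping, the default keyword sub-field can be term-queried directly (assuming default dynamic mapping):
localhost:9200/customers/_search -d { "query": { "term": { "name.keyword": "John Smith" } } }
WILL GET RESULTS - name.keyword stores the original string unanalyzed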
Mappings for fields
{ "settings": "analysis": {} "mappings": {} }
===
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sample_analyzer": { "type": "custom", "tokenizer": "keyword", "filter": ["lowercase"] }
      }
    }
  },
  "mappings": {
    "foo": {
      "properties": {
        "name": { "type": "text", "analyzer": "sample_analyzer" }   <<< custom analyzer applied to the field
      }
    }
  }
}
- note: analyzers apply to text fields; the keyword tokenizer emits the whole string as a single token, so with the lowercase filter this gives case-insensitive exact matching
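a quick way to sanity-check the custom analyzer above (the index name myindex is an assumption):
curl -H "Content-Type: application/json" -XPOST localhost:9200/myindex/_analyze -d '
{ "analyzer": "sample_analyzer", "text": "John Smith" }'
=> a single token: "john smith" (keyword tokenizer keeps the whole string, lowercase filter normalizes it)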