Elasticsearch Flashcards

1
Q

what is search

A
  • find the most relevant documents that contain your search terms

process

  • know that the document exists
    e.g. a web crawler fetches documents and builds a massive corpus of docs
  • index the document
    i.e. every document is parsed and tokenized and then indexed;
    individual terms are extracted and stored in a data structure called an inverted index
  • know how relevant the document is to the search terms
    i.e. every document gets a relevance score based on the search terms
  • be able to retrieve the document
2
Q

Inverted Index

A

data structure that holds a mapping from each term to the documents in which it is found

3
Q

Parsing Steps

A

step #1

  • split the text into words on whitespace
  • normalize all words
  • lowercase all words
  • remove punctuation

step #2

  • calculate the frequency of each word in the corpus (multiple docs make up a corpus)
  • you can use different analyzers to parse documents
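
The parsing steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how an ES analyzer is implemented; the function name `analyze` and the sample corpus are made up for the example, and punctuation is only stripped from the start and end of each token.

```python
import string
from collections import Counter

def analyze(text):
    """Split on whitespace, lowercase, and strip leading/trailing punctuation."""
    tokens = []
    for raw in text.split():
        token = raw.lower().strip(string.punctuation)
        if token:  # drop tokens that were punctuation only
            tokens.append(token)
    return tokens

# Step 2: count how often each term occurs across the whole corpus.
corpus = ["Winter is coming!", "It snows in winter."]
term_frequency = Counter(term for doc in corpus for term in analyze(doc))

print(analyze("Winter is coming!"))
print(term_frequency["winter"])
```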
4
Q

Creating Inverted Index (Postings list - in search jargon)

A

doc1
{id: 1, words: "winter is coming"}

doc2
{id: 2, words: "it snows in winter"}

doc3
{id: 3, words: "I love hot chocolate in winter"}

word        frequency   documents
=================================
winter      3           1, 2, 3
is          1           1
coming      1           1
it          1           2
snows       1           2
in          2           2, 3
i           1           3
love        1           3
hot         1           3
chocolate   1           3

  • the real index is sorted on words (the rows above are listed in order of first appearance)

Search            Result doc ids
================================
winter            1, 2, 3
hot               3
snows || hot      2, 3
snows && hot      None

to search for words ending with LATE, reverse every word in the index and match the reversed words that start with the reversed suffix
i.e. chocolate => ETALOCOHC
and LATE becomes ETAL, which is a prefix of ETALOCOHC

to search on arbitrary substrings => use ngram analysis
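
A minimal Python sketch of the postings list for the three winter documents, with OR, AND and suffix lookups. The helper names (`build_index`, `search_or`, `search_and`, `search_suffix`) are invented for this example, and the analyzer is just lowercase plus whitespace split.

```python
from collections import defaultdict

docs = {
    1: "winter is coming",
    2: "it snows in winter",
    3: "I love hot chocolate in winter",
}

def build_index(documents):
    """Map each lowercased term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

index = build_index(docs)

def search_or(a, b):
    """Docs containing either term."""
    return sorted(index.get(a, set()) | index.get(b, set()))

def search_and(a, b):
    """Docs containing both terms."""
    return sorted(index.get(a, set()) & index.get(b, set()))

def search_suffix(suffix):
    """Suffix search: reverse the terms and do a prefix match on them."""
    reversed_suffix = suffix.lower()[::-1]
    hits = set()
    for term, doc_ids in index.items():
        if term[::-1].startswith(reversed_suffix):
            hits |= doc_ids
    return sorted(hits)

print(search_or("snows", "hot"))
print(search_and("snows", "hot"))
print(search_suffix("LATE"))
```

A real engine keeps the reversed terms in a sorted structure so that the prefix match does not have to scan every term; the linear scan here just keeps the sketch short.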

5
Q

Ngram analysis

A
YOURS =>
yo
you
your
ou
our
ours
ur
urs
rs
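
The list above corresponds to ngrams of length 2 to 4. A small Python sketch that reproduces it is below; the bounds 2 and 4 are inferred from the list rather than stated on the card, and the parameter names `min_gram` and `max_gram` mirror the settings of the ES ngram tokenizer.

```python
def ngrams(word, min_gram=2, max_gram=4):
    """Return every substring of length min_gram..max_gram, grouped by start position."""
    word = word.lower()
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams

print(ngrams("YOURS"))
# ['yo', 'you', 'your', 'ou', 'our', 'ours', 'ur', 'urs', 'rs']
```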
6
Q

What is Elasticsearch

A
  • uses Apache Lucene under the hood
  • provides search capabilities
  • provides analytics capabilities such as aggregations
  • distributed i.e. can scale to thousands of nodes
  • HA and fault tolerant i.e. multiple copies of your data are stored in the cluster (every index can be replicated)
  • simple CRUD using a REST API
  • powerful query DSL
  • schemaless - no schema needs to be defined up front for docs
7
Q

Commands

A

./bin/elasticsearch -Ecluster.name=foo -Enode.name=node1

8
Q

Schema

A

type => logical grouping of documents
index => made up of different document types

example
1. blog engine
    type blog post => {title, content, date}
    type comment => {user, content, date}

2. ecommerce site
    indexes => catalog, customers, inventory

note: mapping types were deprecated in ES 6.x and removed in 7.x; these cards use the older index/type/id URLs
9
Q

Sharding and Replication

A
  • a single node might not be able to hold all the data of one index

sharding - process of splitting the index onto multiple nodes or physical machines i.e. every node will only have a subset of your data

  • searches on a sharded index run in parallel across the shards
  • every primary shard can have one or more replicas (backup copies)
  • a replica is never placed on the same node as its primary shard
  • a shard can have 0 to n replicas
  • by default an index in ES has 5 shards and 1 replica i.e. every shard has one backup copy (ES 7.0+ defaults to 1 shard)
10
Q

Healthcheck

A

Get cluster health
http://localhost:9200/_cat/health?v

cluster status
GREEN - all primary shards and replicas are available
YELLOW - all primary shards are available but some replicas are not
RED - some primary shards of certain indexes are not available

Get node health
http://localhost:9200/_cat/nodes?v
ip, heap.percent, ram.percent, cpu, load_1m, load_5m, load_15m, node.role, master, name

11
Q

Creating Indexes and documents

A

create new index
curl -XPUT localhost:9200/products

get index info
http://localhost:9200/_cat/indices?v
health, status, index, pri, rep, docs.count, store.size

  • if running on a single machine, health will be Yellow because the replicas cannot be allocated to another node

Add new product

curl -XPUT localhost:9200/products/mobiles/1 -H 'Content-Type: application/json' -d '{
  "name": "",
  "storage": "",
  "reviews": ["foo", "bar"]
}'
response
{
  "_index": "products",
  "_type": "mobiles",
  "_id": "1",
  "_version": 1,
  "_shards": {
    "total": 2
  }
}

to add laptops
curl -XPOST localhost:9200/products/laptops/ -H 'Content-Type: application/json' -d '{
  "name": "",
  "storage": "",
  "reviews": ["foo", "bar"]
}'
here ES will autogenerate an ID since none is passed (this needs POST rather than PUT).

12
Q

Retrieving documents

A

http://localhost:9200/products/mobiles/1

response
{
  "_index": "products",
  "_type": "mobiles",
  "_id": "1",
  "_version": 1,
  "_source": {
    "name": "",
    "storage": "",
    "reviews": ["foo", "bar"]
  }
}

localhost:9200/products/mobiles/1?_source=false

{
  "_index": "products",
  "_type": "mobiles",
  "_id": "1",
  "_version": 1
}

localhost:9200/products/mobiles/1?_source=name

{
  "_index": "products",
  "_type": "mobiles",
  "_id": "1",
  "_version": 1,
  "_source": {
    "name": ""
  }
}
13
Q

Updating docs

A

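Full update

curl -XPUT localhost:9200/products/mobiles/1 -H 'Content-Type: application/json' -d '{
  "name": "",
  "storage": "",
  "reviews": ["foo", "bar"]
}'
response
{
  "_index": "products",
  "_type": "mobiles",
  "_id": "1",
  "_version": 2   <<<< the version is incremented on every update
}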
14
Q

Delete doc

A

curl -XDELETE localhost:9200/products/mobiles/1

check if doc exists
curl -I localhost:9200/products/mobiles/1
response: 404 if the doc does not exist, 200 if it does

delete index
curl -XDELETE localhost:9200/products

15
Q

Bulk Operations

A

Method #1

curl -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' --data-binary '{"index": {"_index": "products", "_type": "mobiles", "_id": 10}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {"_index": "products", "_type": "mobiles", "_id": 11}}
{"name": "bar", "storage": "", "reviews": ""}
'

line 1 -> action plus index, type and id info
line 2 -> document data
the body must end with a newline

Method #2
==
create a .json file with all data
data.json
{"index": {}}
{"name": "foo", "storage": "", "reviews": ""}
{"index": {}}
{"name": "bar", "storage": "", "reviews": ""}

curl -H 'Content-Type: application/x-ndjson' \
  -XPOST \
  localhost:9200/products/mobiles/_bulk \
  --data-binary @data.json
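
The action line / data line pairing can also be generated programmatically. Below is a hedged Python sketch that builds a bulk body as newline-delimited JSON; `bulk_payload` and its parameters are hypothetical names for this example, and sending the result to `_bulk` is left out.

```python
import json

def bulk_payload(docs, index, doc_type, start_id=None):
    """Build a bulk body: one action line followed by one data line per doc."""
    lines = []
    for offset, doc in enumerate(docs):
        meta = {"_index": index, "_type": doc_type}
        if start_id is not None:
            meta["_id"] = start_id + offset  # omit _id to let ES autogenerate it
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"

payload = bulk_payload(
    [{"name": "foo", "storage": "", "reviews": ""},
     {"name": "bar", "storage": "", "reviews": ""}],
    index="products", doc_type="mobiles", start_id=10,
)
print(payload)
```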

16
Q

Search Query

A

there are two contexts in ES query

  1. query context - how relevant is each result (a _score is calculated)
  2. filter context - does the document satisfy a certain criterion or not (yes/no, no score)

===
search all documents in customers index for word=foo in any field
localhost:9200/customers/_search?q=foo

===
sort on age desc
localhost:9200/customers/_search?q=foo&sort=age:desc

====
state is florida; skip the first 10 hits and return the next 2
localhost:9200/customers/_search?q=state:florida&from=10&size=2

===
returns all documents; no relevance score is calculated and all docs will have score=1.0

localhost:9200/customers/_search -d
{
  "query": {"match_all": {}}
}

===
sort on age desc, skip the first 5 hits and return the next 2

localhost:9200/customers/_search -d
{
  "query": {"match_all": {}},
  "sort": {"age": {"order": "desc"}},
  "from": 5,
  "size": 2
}
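
As a rough illustration of how the paging and sorting keys fit together, here is a Python sketch that assembles such a request body as a dict. The helper `search_body` is hypothetical; only the keys `query`, `match_all`, `sort`, `from` and `size` come from the query DSL.

```python
import json

def search_body(sort_field=None, order="desc", offset=0, size=10):
    """Build a match_all search body with optional sorting and paging."""
    body = {"query": {"match_all": {}}, "from": offset, "size": size}
    if sort_field is not None:
        body["sort"] = [{sort_field: {"order": order}}]
    return body

# Skip the first 5 hits and return the next 2, oldest customers first.
body = search_body(sort_field="age", offset=5, size=2)
print(json.dumps(body, indent=2))
```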
17
Q

Term Searches

A

search for docs that contain the exact term foo in the name field

localhost:9200/customers/_search -d
{
  "_source": false,
  "query": {
    "term": {
      "name": "foo"
    }
  }
}

response:

{
  "hits": {"total": 2, "max_score": 4.1, "hits": []}
}

  • source filtering does not affect relevance ranking

localhost:9200/customers/_search -d
{
  "_source": {
    "includes": ["name"],
    "excludes": ["description"]
  },
  "query": {
    "term": {
      "name": "foo"
    }
  }
}
18
Q

Full Text Queries

A

match with options

  1. match
  2. match_phrase - match the entire phrase, not just one word
  3. match_phrase_prefix - match a phrase whose last term is treated as a prefix

simple match
==
search for foo in name - this is a full text search, not an exact term lookup. The query text goes through the analyzer of the field, so differences in case etc. are handled
localhost:9200/customers/_search -d 
{
  "query": {
     "match": {
        "name": "foo"
   }
  }
}

match name field if either foo or bar exists

localhost:9200/customers/_search -d 
{
  "query": {
     "match": {
        "name": {
          "query": "foo bar",
          "operator": "or"
      }
   }
  }
}
  • default operator is OR

match with prefix
==
search for all names starting with F
localhost:9200/customers/_search -d
{
  "query": {
    "match_phrase_prefix": {
      "name": "f"
    }
  }
}
19
Q

Boolean Query

A

clauses

  • must - docs must match every clause i.e. like AND
  • should - docs may match any of the clauses, and matching more of them scores higher i.e. like OR
  • must_not - docs must not match any of the clauses i.e. like NOT

find docs which MUST have street address as magnolia bridge. This normally gives us fewer results.

localhost:9200/customers/_search -d
{
  "query": {
    "bool": {
      "must": [
        {"match": {"street": "magnolia"}},
        {"match": {"street": "bridge"}}
      ]
    }
  }
}

find docs which could have street address as magnolia bridge. Both terms might not be present. This gives us more results.

localhost:9200/customers/_search -d
{
  "query": {
    "bool": {
      "should": [
        {"match": {"street": "magnolia"}},
        {"match": {"street": "bridge"}}
      ]
    }
  }
}
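
To see why must returns fewer docs than should, here is a small Python simulation over made-up street values. It only models which docs match, not how ES scores them; the data and function names are assumptions for the example.

```python
# Hypothetical street values keyed by doc id.
streets = {
    1: "magnolia bridge",
    2: "magnolia avenue",
    3: "bridge street",
    4: "main street",
}

def tokens(text):
    return set(text.lower().split())

def bool_must(terms):
    """Docs that contain every term (AND)."""
    return sorted(doc_id for doc_id, text in streets.items()
                  if all(term in tokens(text) for term in terms))

def bool_should(terms):
    """Docs that contain at least one term (OR)."""
    return sorted(doc_id for doc_id, text in streets.items()
                  if any(term in tokens(text) for term in terms))

print(bool_must(["magnolia", "bridge"]))
print(bool_should(["magnolia", "bridge"]))
```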
20
Q

Boosted Term Search

A

==
find docs which have state as CA or FL but boost all CA docs with a factor of 2 - so they will appear higher in search.

localhost:9200/customers/_search -d
{
  "query": {
    "bool": {
      "should": [
        {"term": {"state": {"value": "CA", "boost": 2}}},
        {"term": {"state": {"value": "FL"}}}
      ]
    }
  }
}
21
Q

Query with filter + bool

A
range query
==
localhost:9200/customers/_search -d
{
  "query": {
    "bool": {
      "must": {"match_all": {}},
      "filter": {
        "range": {"age": {"gte": 20, "lte": 30}}
      }
    }
  }
}
  • the filter does not contribute to _score; since match_all is the only scoring clause, every hit gets the same constant score

search query with filter
==
find females aged 20 or older in CA
localhost:9200/customers/_search -d
{
  "query": {
    "bool": {
      "must": [
        {"term": {"state": {"value": "CA"}}}
      ],
      "filter": [
        {"term": {"gender": "female"}},
        {"range": {"age": {"gte": 20}}}
      ]
    }
  }
}
  • each hit has a relevance _score computed from the must clause only; the filter clauses just include or exclude docs
22
Q

Aggregations in ES

A

Metric
- sum, average, min, max, count etc

Bucketing
- logically group docs based on search

Matrix
Pipeline

23
Q

Metric agg - average

A

combine aggs with query

localhost:9200/customers/_search -d 
{
  "size": 0,
   "aggs": {
     "average_age": {
       "avg": {
          "field": "age"
       }
    }
   }
}
  • size = 0 indicates that we do not want any docs retrieved, only the aggregation result

find average age of all residents of CA

localhost:9200/customers/_search -d
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "match": {
          "state": "CA"
        }
      }
    }
  },
  "aggs": {
    "average_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}
24
Q

Stats

A
localhost:9200/customers/_search -d
{
  "size": 0,
  "aggs": {
    "age_stats": {
      "stats": {
        "field": "age"
      }
    }
  }
}

this calculates all the stats for field = age
- count, min, max, avg, sum
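
For intuition, the five values of a stats aggregation are what you would compute locally as in the Python sketch below. The ages are made-up sample data and `stats` is a hypothetical helper, not an ES client call.

```python
def stats(values):
    """Compute the five numbers reported by a stats aggregation."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

ages = [25, 30, 30, 41]  # hypothetical "age" values
print(stats(ages))
```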

25
Q

Cardinality agg

A
localhost:9200/customers/_search -d
{
  "size": 0,
  "aggs": {
    "age_count": {
      "cardinality": {
        "field": "age"
      }
    }
  }
}

gives the (approximate) number of unique values of age across the whole index

26
Q

Bucketing agg

A
localhost:9200/customers/_search -d
{
  "size": 0,
  "aggs": {
    "gender_group": {
      "terms": {
        "field": "gender"
      }
    }
  }
}

get doc counts per gender i.e. male: 10, female: 30
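
Conceptually, a terms aggregation groups docs by field value and counts each group. A Python sketch of that idea with made-up customers follows; the `key` / `doc_count` bucket shape mirrors the ES response, everything else is an assumption for the example.

```python
from collections import Counter

# Hypothetical customer docs.
customers = [
    {"name": "a", "gender": "female"},
    {"name": "b", "gender": "male"},
    {"name": "c", "gender": "female"},
    {"name": "d", "gender": "female"},
]

def terms_buckets(docs, field):
    """Group docs by field value and count them, largest bucket first."""
    counts = Counter(doc[field] for doc in docs)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]

print(terms_buckets(customers, "gender"))
```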

27
Q

Configuring ES cluster

A
  • install JDK on three machines
  • install ES on all machines

elasticsearch.yml
cluster.name = name of the cluster (the same on every node)
node.name = name of this node

node.master = true/false, whether this node is eligible to be elected master
node.data = true/false, whether this node holds data

network.host = IP address of this node to which ES will bind
discovery.zen.ping.unicast.hosts = [ip1, ip2] => IP addresses of the other nodes in the cluster
discovery.zen.minimum_master_nodes:

28
Q

Sharding

A
  • each index by default has 5 shards and 1 replica for each shard
  • the number of primary shards cannot be changed after index creation
  • the number of replicas can be changed after index creation
curl -XPUT localhost:9200/myindex -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}'

1 lucene index = 1 ES shard

2 shards + 2 replicas = 6 total shards
i.e. primary shard * (replica + 1) = 2 * (2 + 1) = 6

5 shards + 1 replica = 5 * (1 + 1) = 10

5 shards + 2 replica = 5 * (2 + 1) = 15
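
The arithmetic above as a one-line Python helper (the function name is made up for the example):

```python
def total_shards(primaries, replicas):
    """Total shard copies = primary shards * (replicas per primary + 1)."""
    return primaries * (replicas + 1)

for primaries, replicas in [(2, 2), (5, 1), (5, 2)]:
    print(primaries, replicas, total_shards(primaries, replicas))
```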

29
Q

Routing Parameters

A

curl -XPUT 'localhost:9200/myindex/customers/1?routing=A' -H 'Content-Type: application/json' -d '{
  "name": "foo"
}'
the routing value A is hashed to pick the shard that stores this document. All other docs indexed with the same routing value land on the same shard

Bulk
==
curl -XPOST localhost:9200/myindex/customers/_bulk -H 'Content-Type: application/x-ndjson' --data-binary '{"index": {"_id": 1, "_routing": "A"}}
{"name": "foo"}
{"index": {"_id": 2, "_routing": "A"}}
{"name": "bar"}
{"index": {"_id": 3, "_routing": "B"}}
{"name": "baz"}
'

foo and bar will be on the same shard, the one selected by routing value A
baz will be on the shard selected by routing value B (which may or may not be a different shard)
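
Routing boils down to hashing the routing value and taking it modulo the number of primary shards. The Python sketch below shows the idea; `zlib.crc32` is only a stand-in hash chosen for the example (ES uses its own hash function), so the shard numbers it prints will not match a real cluster.

```python
import zlib

def pick_shard(routing, num_primary_shards):
    """Stand-in for: shard = hash(routing) % number_of_primary_shards."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

docs = [("foo", "A"), ("bar", "A"), ("baz", "B")]
for name, routing in docs:
    print(name, "-> shard", pick_shard(routing, 5))
```

This is also why the number of primary shards cannot change after index creation: a different modulus would send existing routing values to different shards.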

30
Q

ES Similarity Models

A

A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.

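curl -XPUT localhost:9200/index/_settings -H 'Content-Type: application/json' -d '
{
  "index": {
    "similarity": {
      "default": {
        "type": "BM25"
      }
    }
  }
}'
  • "type" can be "BM25" or "boolean", among others
  • you can also specify a similarity model for each field in the mapping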
31
Q

String field configuration

A
  1. full text search (text field, analyzed)
  2. keyword search (keyword field, not analyzed)

32
Q

Analyzers

A

steps

  1. Tokenize - split text on whitespace, remove punctuation
  2. Normalize
    - stemming - reduce each word to its root i.e. running/ran => run
    - add synonyms to the inverted index
    - lowercase every word
    - variations of each word - with spelling mistakes etc

Builtin Analyzers

  • standard
  • simple
  • whitespace
33
Q

Difference between Term queries and Match queries

A

Term

  • looks for exact term in inverted index
  • used for exact match situation i.e. numbers, dates etc
  • mostly gets fewer results but most of the results are relevant
  • case sensitive
Match
==
- query is passed through the analyzer first
- mostly used for full text search
- makes use of stemming and normalization
- more likely to match irrelevant docs
- no support for wildcards etc
- case insensitive
34
Q

Term v/s Match queries

A
create a .json file with all data
data.json
{"index": {}}
{"name": "John Smith"}
{"index": {}}
{"name": "Jane Smith"}

curl -H 'Content-Type: application/x-ndjson' \
  -XPOST \
  localhost:9200/products/customers/_bulk \
  --data-binary @data.json

==
localhost:9200/products/customers/_search -d {
  "query": {
    "term": {"name": "John Smith"}
  }
}

NO RESULTS

localhost:9200/products/customers/_search -d {
  "query": {
    "term": {"name": "John"}
  }
}

NO RESULTS

note: the term query is not analyzed, but the "name" field is analyzed at index time, so the index holds the lowercased tokens "john" and "smith". The exact terms "John Smith" and "John" (upper case J) are therefore not found

localhost:9200/products/customers/_search -d {
  "query": {
    "match": {"name": "John"}
  }
}

WILL GET RESULTS - since the match query is analyzed, it will search for "john"

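name is a string, so by default it is mapped as a text field (analyzed) with a keyword sub-field (name.keyword, not analyzed)

for exact keyword searches on "name" itself you need to specify a new mapping for the field

i.e. mappings: { name: {type: keyword}}
now the name field will not be passed through the analyzer,
so the term query for "John Smith" above will fetch results (the term query for "John" still will not, since the whole value is stored as one term).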
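
The difference can be simulated in Python with a toy analyzer (lowercase plus whitespace split, a simplification of the standard analyzer). The function names are invented for the example; the point is only that `term` looks the value up as-is while `match` analyzes it first.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

# Index the analyzed "name" field of both docs.
names = {1: "John Smith", 2: "Jane Smith"}
index = {}
for doc_id, name in names.items():
    for token in analyze(name):
        index.setdefault(token, set()).add(doc_id)

def term_query(value):
    """Exact lookup: the value is NOT analyzed."""
    return sorted(index.get(value, set()))

def match_query(value):
    """The value is analyzed, then the tokens are ORed together."""
    hits = set()
    for token in analyze(value):
        hits |= index.get(token, set())
    return sorted(hits)

print(term_query("John Smith"))   # [] - no such term in the index
print(term_query("John"))         # [] - the index holds "john", not "John"
print(match_query("John"))        # [1]
print(match_query("John Smith"))  # [1, 2] - "smith" also matches Jane Smith
```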

35
Q

Mappings for fields

A
{
  "settings": {
    "analysis": {}
  },
  "mappings": {}
}
===
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sample_analyzer": {"type": "custom", "tokenizer": "keyword", "filter": ["lowercase"]}
      }
    }
  },
  "mappings": {
    "foo": {
      "properties": {
        "name": {
          "type": "text",
              "analyzer": "sample_analyzer" <<<