Elasticsearch Introduction Flashcards

1
Q

Logstash

A

Logstash and Beats facilitate collecting, aggregating, and enriching your data and storing it in Elasticsearch.

2
Q

Kibana

A

Kibana enables you to interactively explore, visualize, and share insights into your data and manage and monitor the stack.

Kibana also serves as a control center for managing an Elasticsearch cluster. Features like data rollups and index lifecycle management help you intelligently manage your data over time.

3
Q

Elasticsearch

A

Elasticsearch is where the indexing, search, and analysis magic happens.

4
Q

Data is stored in a distributed fashion in ES

A

Elasticsearch is a distributed document store.
Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.
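
As a rough illustration of what "serialized as JSON documents" means, the sketch below round-trips a nested structure through JSON using only Python's standard library. The document and its field names are hypothetical, and nothing here talks to a real cluster.

```python
import json

# Hypothetical document; the field names are illustrative only.
doc = {
    "user": {"name": "kim", "roles": ["admin", "editor"]},
    "message": "disk usage above threshold",
    "bytes": 1024,
    "timestamp": "2024-01-15T10:30:00Z",
}

# Serialize the nested structure to a JSON string, then parse it back.
serialized = json.dumps(doc)
restored = json.loads(serialized)
```

The nested object and the list survive the round trip, which is the property a row-and-column layout does not give you directly.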

5
Q

Data structure used in ES

A

Inverted index
An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
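
A minimal sketch of the idea in plain Python, assuming whitespace tokenization and lowercasing (a real analyzer does much more). The function name and sample documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical documents keyed by ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}
index = build_inverted_index(docs)
```

Looking up a word is then a single dictionary access, for example `index["quick"]`, rather than a scan of every document.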

6
Q

What is an index in ES?

A

An index can be thought of as an optimized collection of documents, and each document is a collection of fields, which are the key-value pairs that contain your data.

7
Q

Data Structure of every field in ES

A

By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure. For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees. The ability to use the per-field data structures to assemble and return search results is what makes Elasticsearch so fast.

8
Q

Schemaless ES

A

Documents can be indexed without explicitly specifying how to handle each of the different fields that might occur in a document. When dynamic mapping is enabled, Elasticsearch automatically detects and adds new fields to the index. This default behavior makes it easy to index and explore your data: just start indexing documents and Elasticsearch will detect and map booleans, floating point and integer values, dates, and strings to the appropriate Elasticsearch data types.
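
The detection step can be pictured with a toy function like the one below. It is a simplified stand-in, not Elasticsearch's actual logic: the function name is hypothetical, date detection is approximated with `datetime.fromisoformat`, and strings are reported simply as `text` (the real default also adds a `keyword` sub-field).

```python
from datetime import datetime

def infer_field_type(value):
    """Guess an Elasticsearch-style field type from a parsed JSON value (toy version)."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # simplified date detection (assumption)
            return "date"
        except ValueError:
            return "text"
    if isinstance(value, dict):
        return "object"
    return "unknown"

# Hypothetical incoming document.
doc = {"active": True, "count": 7, "ratio": 0.5, "created": "2024-01-15", "note": "hello world"}
detected = {field: infer_field_type(value) for field, value in doc.items()}
```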

9
Q

Defining your own mappings enables you to:

A

Distinguish between full-text string fields and exact value string fields
Perform language-specific text analysis
Optimize fields for partial matching
Use custom date formats
Use data types such as geo_point and geo_shape that cannot be automatically detected
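
As a sketch of what an explicit mapping might look like, the block below builds a mapping body as a Python dict and serializes it to JSON. The field names are hypothetical; the field types (`text`, `keyword`, `date`, `geo_point`) and the `english` analyzer are standard Elasticsearch names. The body is only constructed here, not sent to a cluster.

```python
import json

# Hypothetical explicit mapping; field names are illustrative only.
mapping = {
    "mappings": {
        "properties": {
            # Full-text string analyzed with a language-specific analyzer.
            "title": {"type": "text", "analyzer": "english"},
            # Exact-value string, suitable for filtering and sorting.
            "status": {"type": "keyword"},
            # Date with a custom format.
            "created": {"type": "date", "format": "yyyy/MM/dd HH:mm:ss"},
            # Type that dynamic mapping cannot detect on its own.
            "location": {"type": "geo_point"},
        }
    }
}

body = json.dumps(mapping, indent=2)
```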

10
Q

Indexing the same field in different ways

A

It’s often useful to index the same field in different ways for different purposes. For example, you might want to index a string field as both a text field for full-text search and as a keyword field for sorting or aggregating your data. Or, you might choose to use more than one language analyzer to process the contents of a string field that contains user input.

The analysis chain that is applied to a full-text field during indexing is also used at search time. When you query a full-text field, the query text undergoes the same analysis before the terms are looked up in the index.
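
A toy model of both points, assuming a very simple analyzer (lowercase, split on non-alphanumeric characters). All function names are hypothetical. The same string is indexed once as analyzed terms and once as an exact value, and the query text passes through the same analyzer before lookup.

```python
import re

def analyze(text):
    """Toy analyzer: lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def index_field(value):
    """Index one string two ways: analyzed terms and the exact value."""
    return {"text": set(analyze(value)), "keyword": value}

def match_text(indexed, query):
    """Full-text match: the query goes through the same analyzer as the field."""
    return all(term in indexed["text"] for term in analyze(query))

def match_keyword(indexed, query):
    """Exact-value match: no analysis on either side."""
    return indexed["keyword"] == query

city = index_field("New York")
```

With this setup, `match_text(city, "YORK")` succeeds because both sides were lowercased, while `match_keyword(city, "york")` fails because the keyword form keeps the original value.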

11
Q

Search Engine Library

A

Apache Lucene

12
Q

Scalability

A

Elasticsearch is built to be always available and to scale with your needs. It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity, and Elasticsearch automatically distributes your data and query load across all of the available nodes. There is no need to overhaul your application; Elasticsearch knows how to balance multi-node clusters to provide scale and high availability. The more nodes, the merrier.

13
Q

How does scalability work?

A

Under the covers, an Elasticsearch index is really just a logical grouping of one or more physical shards, where each shard is actually a self-contained index. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster. As the cluster grows (or shrinks), Elasticsearch automatically migrates shards to rebalance the cluster.
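
One way to picture how documents are spread across shards is hash-based routing: hash a routing key (by default the document ID) and take it modulo the number of primary shards. The sketch below is a simplification under stated assumptions: `pick_shard` is a hypothetical name, and `zlib.crc32` stands in for the hash function Elasticsearch actually uses.

```python
import zlib

def pick_shard(routing_key, num_primary_shards):
    """Choose a primary shard for a routing key (toy version).

    zlib.crc32 is a stand-in hash; it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# Hypothetical document IDs routed across 5 primary shards.
doc_ids = ["order-1", "order-2", "order-3", "order-4"]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
```

Because the result depends only on the key and the shard count, any node can compute where a document lives without consulting a lookup table.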

14
Q

Types of Shards

A

There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time, without interrupting indexing or query operations.
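
A small worked calculation, assuming every primary has the same number of replicas: the total number of shard copies the cluster must hold is the primary count times one plus the replica count. The function name is hypothetical.

```python
def total_shard_copies(primaries, replicas_per_primary):
    """Total shards stored in the cluster: each primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)

# An index with 3 primaries: raising replicas from 1 to 2 adds 3 more shard copies.
with_one_replica = total_shard_copies(3, 1)
with_two_replicas = total_shard_copies(3, 2)
```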

15
Q

How many shards to configure

A

There are a number of performance considerations and tradeoffs with respect to shard size and the number of primary shards configured for an index. The more shards, the more overhead there is simply in maintaining those indices. The larger the shard size, the longer it takes to move shards around when Elasticsearch needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more queries means more overhead, so querying a smaller number of larger shards might be faster. In short, it depends.

As a starting point:

Aim to keep the average shard size between a few GB and a few tens of GB. For use cases with time-based data, it is common to see shards in the 20GB to 40GB range.
Avoid the gazillion shards problem. The number of shards a node can hold is proportional to the available heap space. As a general rule, the number of shards per GB of heap space should be less than 20.
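
The two rules of thumb above can be turned into quick arithmetic. This is a back-of-the-envelope sketch, not a sizing tool: the function names are hypothetical, the 30 GB default target is an assumed midpoint of the 20GB to 40GB range, and the 20-per-GB figure comes from the card.

```python
import math

def primary_shards_for(total_data_gb, target_shard_gb=30):
    """Estimate how many primary shards keep the average shard near the target size."""
    return max(1, math.ceil(total_data_gb / target_shard_gb))

def shard_budget_for_heap(heap_gb, shards_per_gb=20):
    """Upper bound on shards per node; the rule of thumb is to stay below this."""
    return heap_gb * shards_per_gb

# 600 GB of data at roughly 30 GB per shard, on a node with 30 GB of heap.
shards_needed = primary_shards_for(600)
node_budget = shard_budget_for_heap(30)
```

Here the estimate is 20 primary shards against a per-node ceiling of 600, so the shard count is nowhere near the heap-based limit.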

16
Q

In case of disaster

A

For performance reasons, the nodes within a cluster need to be on the same network. Balancing shards in a cluster across nodes in different data centers simply takes too long. But high-availability architectures demand that you avoid putting all of your eggs in one basket. In the event of a major outage in one location, servers in another location need to be able to take over. Seamlessly. The answer? Cross-cluster replication (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster to a secondary remote cluster that can serve as a hot backup. If the primary cluster fails, the secondary cluster can take over. You can also use CCR to create secondary clusters to serve read requests in geo-proximity to your users.

Cross-cluster replication is active-passive. The index on the primary cluster is the active leader index and handles all write requests. Indices replicated to secondary clusters are read-only followers.