Distributed Systems Flashcards
Data Model
In memory data can be accessed using pointers/references.
Over the wire /from file data encoding/decoding has to happen.
Text - JSON, XML, CSV ( http)
Binary - Thrift, ProtoBuf, Avro
Schema Evolution
Avro - Backward/Forward
Why Replication ?
- To improve high availability
- Faster access
- can move the data closer to the user for faster reads
- To improve read/write throughput
- Multiple followers can serve read requests which improves throughput
Replication Algorithms: Leader based: ( Master-slave or Active-Passive replication) Single-leader Multi leader Leaderless replication
Leader based replication
Leader -> followers
- Sync writes
- client response sends back only when data is written into followers also.
- Async writes
- once leader writes data into storage immediately responds back to the client and asynchronously sends data to its followers
Replication lag
- Eventual consistency
- Read-after-write consistency
- Monotonic reads
Append only Log structured format
SS tables
LSM
Update format
B+ tree
Partition Types
Partition by Key -value
Partition by Key Range
Partition by Hash
Skewed or hot spot ?
When data is not partitioned uniformly, single partition
can have more data that other partitions (Skewed). Because of it single partition will receive high load (hot spot).
Disadvantages of Hashed Parition
It naturally supports primary key. But if multiple/columns are to be searched (secondary indexes) , then it might be difficult..
Secondary Indexes - local store (document)
Store secondary index pertaining to that partition also in same partition. (local indexes)
Disadvantages:
Since each partition can have secondary index value, All partitions have to be queried to construct end result.
Scatter/gather approach - send requests to all paritions and merge the results.
there could be tail latency problems with scatter gather.
Secondary index - term (global)
store secondary indexes in global storage. But partition the global storage also by secondary index key.
for example : if color of the car is secondary key… partition by starting letter of the color or alphabetically…
Advantages : read efficient.. no scatte gather..
disadvantages : slow write, as data might have to be written into different partitions ( primary can into 1 partition, secondary index might go into other partitions
Partition rebalancing
moving data to another node ( rebalance)
- to upgrade the node
- node removed…
Rebalancing
-
Service discovery or request routing
3 different ways to achieve is
- Client connects to any server , the server routes the request from right node and replies back
- Routing tier / partition aware load balancer can be used to route the client requests to the right node . This layer is just a routing layer
- Client aware of the server to connect to.
How to Handle partition changes in request routing ??
- can use distributed coordinator systems like zookeeper to keep the upto date information
Node can connect to zk and update the latest Information.
systems like Kafka , Hbase uses zk as metadata coordinator system
- other than external systems , can use discovery protocol like gossip to update the metadata . Cassandra uses gossip protocol
- disadvantage is more work on the node. Advantage is avoiding external system