6. Partitioning Flashcards
Partitioning
Normally, partitions are defined in such a way that each piece of data (each record, row, or document) belongs to exactly one partition.
Purpose of partitioning
The main reason for partition data is scalability. Different partitions can be placed on different nodes in a shard-nothing cluster.
Partitioning and replication
A node may store more than one partition.
Combining replication and partitioning: each node acts as leader for some partitions and follower for other partitions.
Partitioning approaches - 1. Key range partitioning
Keys are sorted. A partition owns all the keys from some minimum up to some maximum.
Pros: Sorting enable efficient range queries
Cons: Risk of hot spots if application often accesses keys that are close together in the sorted order
In this approach, partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big.
Partitioning approaches - 2. Hash partitioning
A hash function is applied to each key, and a partition owns a range of hashes.
Pros: Distribute load more evenly
Cons: Make range queries inefficient
When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed. Dynamic partitioning can slo be used.
Partitioning and secondary indexes - 1. Document-partitioned indexes
Secondary indexes are stored in the same partition as the primary key and value.
Pros: Only a single partition needs to be updated on write
Cons: A read of secondary index requires a scatter/gather across all partitions.
Partitioning and secondary indexes - 2. Term-partitioned indexes
Secondary indexes are partitioned separately, using the indexes values. An entry in the secondary index may include records from all partitions of the primary key.
Pros: A read can be served from a single partition.
Cons: When a document is written, several partitions of the secondary index need to be updated.
Rebalancing partitions
- How not to do it: hash mod N
- Fixed number of partitions
- The number of partitions does not change, nor does the assignment of keys to partitions. The only thing that changes is the assignment of partitions to nodes.
- Create many more partitions than there are nodes - Dynamic partitioning
- Partitioning proportionally to nodes
Request routing
- Allow clients to contact any node (round-robin load balancer)
- Send request to a routing tier first (partition aware load balancer)
- Require clients to be aware of partitioning
Key problem: How does the component making the routing decision and learn the assignments of partition change?
- Zookeeper. Each node register itself in zookeeper.