Data Intensive Ch6 - Partitioning Flashcards
Partitioning synonyms
Shard (ES, MongoDB), region (HBase), tablet (Bigtable), vnode (Cassandra, Riak), vBucket (Couchbase)
Partitioning
Usually each record of data (row, document) belongs to exactly one partition
Each partition acts as a small DB of its own
Some DBs support operations spanning objects from many partitions
Each node can be responsible for serving more than one partition
Why do we need partitioning?
Scaling out
Different partitions spread among nodes
Enables:
A large dataset can be spread across many disks
Query load can be balanced across many processing nodes
What are skewed partitions?
What are hot spots?
How to fix the issues associated with them?
Ideally data and queries should be distributed evenly between partitions
Skewed partition - a partition that holds more data than the others or is overloaded with queries (a hot spot)
Can be avoided by spreading records across partitions randomly
^ but that makes it hard to tell which node a record is stored on at read time (and querying all nodes wastes processing power)
Partitioning by Key Range
Pick boundaries within the overall min-max key range
Each partition gets a contiguous subrange of keys from that range
Encyclopedia like
A-B, C-F, G-J
Subranges are not required to contain an even number of keys
^ boundaries should be chosen based on the actual distribution of the data
Subranges must be even data-wise, not key-wise
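A minimal Python sketch of key-range partitioning, assuming string keys and hand-picked boundaries; `BOUNDARIES`, `partition_for` and `partitions_for_range` are illustrative names, not any DB's API:

```python
import bisect

# Boundaries are the (exclusive) upper keys of every partition except the last,
# chosen from the data distribution - like encyclopedia volumes A-B, C-F, G-J, ...
BOUNDARIES = ["c", "g", "k"]          # 4 partitions: <"c", ["c","g"), ["g","k"), >="k"

def partition_for(key: str) -> int:
    """Index of the partition whose key range contains `key`."""
    return bisect.bisect_right(BOUNDARIES, key)

def partitions_for_range(lo: str, hi: str) -> range:
    """Range queries only need the partitions whose subranges overlap [lo, hi]."""
    return range(partition_for(lo), partition_for(hi) + 1)

assert partition_for("apple") == 0 and partition_for("kiwi") == 3
assert list(partitions_for_range("banana", "fig")) == [0, 1]
```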
Partitioning by Hash of Key
Each record should have a key (like primary key etc)
Compute a hash of the key
Make partition based on HASH subranges
Con: range queries require being sent to all partitions
Compound primary key (Cassandra)
Only the first component of the key is hashed to choose the partition.
The remaining components form a concatenated index used to sort the rows within that partition
No need for cryptographically strong functions
An MD5 hash function is ok
The most important property is that keys are distributed uniformly
Caveat - Java's hashCode and similar functions may return different values for the same key across processes!!!
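A hedged Python sketch of partitioning by hash subranges plus a Cassandra-style compound key; the partition count and function names are made up for illustration:

```python
import hashlib

NUM_PARTITIONS = 8                      # illustrative

def stable_hash(value: str) -> int:
    # MD5 is fine here: no cryptographic strength needed, only a uniform,
    # process-stable distribution. The language's built-in hash() (like Java's
    # hashCode) may differ between processes, so never partition on it.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

def partition_for(key: str) -> int:
    # Each partition owns an equal subrange of the 64-bit hash space.
    return stable_hash(key) * NUM_PARTITIONS // 2**64

def partition_for_compound(key: tuple) -> int:
    # Compound key (sketch): only the first component chooses the partition;
    # the remaining components only sort the rows within that partition.
    return partition_for(key[0])

print(partition_for("user:42"), partition_for_compound(("user:42", "2024-01-01")))
```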
Secondary index and partitioning
Problem - a secondary index value is usually not unique; multiple records fall under the same index value.
Example - car catalogue, search by colour
Options:
- partition secondary index by document
Each partition keeps its own secondary index, covering only the data stored in that partition (a local index).
Disadvantage: when querying by the index, all partitions must be accessed (scatter/gather).
Writes are easier though, as they only need to update the local partition's index
- partition secondary index by term
Global secondary index - covers data from all partitions.
Can be partitioned independently of the primary data (otherwise it becomes a bottleneck!)
Example: colours of cars starting with a through r go to partition 1 and the rest go to partition 2.
Term-partitioned - the partition holding the index entry is determined by the term (the value being searched)
A hash of the term can be used as well.
Advantage: no scatter/gather (see the sketch after this list)
Updates/deletes become complex - different nodes must be updated (those holding the data and those holding the index entries)
That would require a distributed transaction, and those are painful; in practice such index updates are often asynchronous.
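A small sketch of which partitions a colour query must touch under each index layout, assuming a hash-partitioned global index; the names and partition count are illustrative, not any DB's API:

```python
import hashlib

NUM_PARTITIONS = 4                      # illustrative

def term_partition(term: str) -> int:
    # Term-partitioned (global) index: the indexed value itself determines which
    # partition stores its index entries (hash-partitioned here; ranges also work).
    return int(hashlib.md5(term.encode()).hexdigest(), 16) % NUM_PARTITIONS

def partitions_for_colour_query(colour: str, local_index: bool) -> list[int]:
    if local_index:
        # Document-partitioned (local) index: every partition indexes only its own
        # documents, so the query must scatter to all partitions and gather results.
        return list(range(NUM_PARTITIONS))
    # Global (term-partitioned) index: only one partition holds entries for this colour.
    return [term_partition(colour)]

print(partitions_for_colour_query("red", local_index=True))    # [0, 1, 2, 3]
print(partitions_for_colour_query("red", local_index=False))   # a single partition
```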
Scatter/gather
A query approach used when data is partitioned over many nodes and there is no way to know which partition holds the matching data.
The query is scattered to all partitions and the results are then gathered together
Used e.g. in Elasticsearch.
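A minimal scatter/gather sketch in Python; `query_partition` is a hypothetical stand-in for a per-node RPC:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4                      # illustrative

def query_partition(partition_id: int, predicate) -> list[dict]:
    # Stand-in for an RPC to the node owning `partition_id`; a real implementation
    # would evaluate `predicate` against that partition's data.
    return []

def scatter_gather(predicate) -> list[dict]:
    # Scatter: send the query to every partition in parallel.
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        partials = pool.map(lambda p: query_partition(p, predicate), range(NUM_PARTITIONS))
        # Gather: merge (and, in a real system, re-sort/re-rank) the partial results.
        return [row for part in partials for row in part]

print(scatter_gather(lambda row: row.get("colour") == "red"))
```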
Rebalancing
Why:
The dataset grows larger than a node's disk
A node's CPU can no longer handle the query traffic
Nodes can fail and a failover standby needs to take over
What is expected from rebalancing:
Load is evenly spread among nodes
During rebalancing system still accepts reads and writes
Operation should be as fast as possible - minimizing network traffic or disk operations
How to do it:
- hash mod N
The WRONG way - changing N invalidates the partition assignment of most keys, so almost all keys have to be migrated, which is expensive (see the sketch after this list)
- fixed number of partitions
Create more partitions than nodes
Each node takes care of many partitions
When a new node joins the cluster it can “steal” partitions from other nodes
Only whole partitions are moved around the network; the key-to-partition assignment never changes
- Dynamic partitioning
Used e.g. in HBase and RethinkDB
When a partition grows beyond a configurable threshold (e.g. 10 GB) it is split into two halves
If a partition shrinks below a minimum size threshold it can be merged with an adjacent partition (similar to what happens in B-trees)
Pros: the number of partitions adapts to the total data volume
Issue: in the beginning the DB consists of a single partition. Pre-splitting (configuring an initial set of partitions) can be used to avoid the initial single-partition bottleneck
- partitioning proportionally to nodes
Each node hosts a fixed number of partitions
When a new node joins it “steals” RECORDS by splitting some existing partitions and taking half of each
Individual splits can be unfair
But with many partitions per node the load converges to an even average distribution
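A small sketch contrasting the key churn of hash mod N with moving whole partitions under a fixed-partition scheme; the numbers are illustrative:

```python
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(10_000)]

# hash mod N: growing from 4 to 5 nodes changes the assignment of most keys,
# so almost everything would have to be migrated.
moved_keys = sum(stable_hash(k) % 4 != stable_hash(k) % 5 for k in keys)
print(f"hash mod N: {moved_keys / len(keys):.0%} of keys move")        # ~80%

# Fixed number of partitions: a key maps to one of 1024 partitions forever;
# rebalancing only reassigns whole partitions, never individual keys.
NUM_PARTITIONS = 1024
old_assignment = {p: p % 4 for p in range(NUM_PARTITIONS)}             # 4 nodes
new_assignment = dict(old_assignment)
for p in range(0, NUM_PARTITIONS, 5):                                  # node 4 "steals" ~1/5
    new_assignment[p] = 4
moved_parts = sum(old_assignment[p] != new_assignment[p] for p in range(NUM_PARTITIONS))
print(f"fixed partitions: {moved_parts}/{NUM_PARTITIONS} partitions move")
```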
Automatic or manual?
Fully automatic rebalancing is risky as it triggers data migrations and rerouting of requests/tasks, which is expensive.
It can be risky to allow it to happen at any time
If failure detection is automatic (e.g. timeout-based), rebalancing can lead to cascading failures -> an overloaded node is declared dead, other nodes are hit with a spike of its traffic and on top of that participate in rebalancing, so more nodes get overloaded and drop out, etc.
Riak and Voldemort auto-generate rebalancing proposals which must be confirmed manually by an operator
Rebalancing - Request routing
p214
General problem - service discovery
Approaches:
1. Round-robin load balancer - if the node hit does not own the partition, it forwards the request to the partition owner
2. Requests go to a routing tier which forwards them to the partition owner, AKA a partition-aware load balancer
3. Partition-aware client - the client knows which nodes own which partitions and connects directly to the proper node
The routing component and the nodes must agree on the partition assignment - an example of the consensus problem
ZooKeeper - a coordination service that can be used to keep cluster metadata about partitioning (node-to-partition assignments)
Nodes register themselves in ZooKeeper; the routing tier subscribes to changes in this data (see the sketch below)
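A sketch of approach 2 (a partition-aware routing tier). In a real system the assignment map would live in ZooKeeper and the router would refresh it via a watch; here it is just an in-memory dict updated by a hypothetical callback, with illustrative names and addresses:

```python
import hashlib

NUM_PARTITIONS = 16                     # illustrative

# partition id -> node address, as published by the coordination service.
assignment: dict[int, str] = {p: f"node-{p % 3}:9042" for p in range(NUM_PARTITIONS)}

def on_assignment_change(new_assignment: dict[int, str]) -> None:
    # Callback the routing tier would register for metadata changes
    # (e.g. a ZooKeeper watch on the node-to-partition mapping).
    assignment.clear()
    assignment.update(new_assignment)

def route(key: str) -> str:
    """Address of the node that currently owns the partition for `key`."""
    partition = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    return assignment[partition]

print(route("user:42"))                 # forwards go to this node
```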
Partitioning by hash vs by range
Range queries are easier with the latter (range partitioning).
Finding good subrange boundaries is hard, so distributing keys pseudo-randomly with the former (hash partitioning) is easier