System Design: Data Partitioning Flashcards
Data Partitioning
What is data partitioning?
Data partitioning is process of dividing a large database (DB) into smaller, more manageable parts called partitions or shards. Each partition is independent and contains a subset of the overall data.
By what criteria do people typically partition data?
By data range, data size, or data type
What are the three most popular partitioning methods?
Horizontal, vertical, and hybrid partitioning.
What is horizontal partitioning?
Horizontal partitioning (or sharding) is a database design technique where rows of a table are distributed across multiple smaller, separate tables or databases based on a predefined key, such as user ID or geographic region
What are the tradeoffs of horizontal partitioning?
Pros: it’s scalable (adding more shards can increase capacity), performant (scoping requests to a single shard is faster because it has fewer rows), and fault tolerant (failures are isolated to one node).
Cons: it’s complex (it requires additional logic to target the right shard), multiple shard queries are complicated and slow, the design is subject to hotspots (disproportionately large shards) which require effortful rebalancing, there might be data duplication or inconsistencies
What is vertical partitionining?
Sharding data based on columns rather that rows, typically separating our different data types. For example, storing username and email in one shard but storing location and demographic info in another.
This technique can help optimize performance by reducing the amount of data that needs to be scanned, especially when certain columns are accessed more frequently than others.
What is hybrid partitioning?
Hybrid data partitioning combines both horizontal and vertical partitioning techniques to partition data into multiple shards. This technique can help optimize performance by distributing the data evenly across multiple servers, while also minimizing the amount of data that needs to be scanned.
What are some common row based partitioning strategies?
- Key or Hash based partitioning
- List partitioning
- Round Robin partitioning.
- Composite partitioning.
- Range partitioning.
What is hash based partitioning?
When you apply a hash function to some key attributes of the entity you are storing that yields the partition number.
What are the trade offs of key or hash based partitioning?
Pro: It ensures a uniform allocation of data among servers.
Con: It fixes the total number of DB servers, since adding new servers means changing the hash function which would require redistribution of data and downtime for the service.
A workaround for this problem is to use ‘Consistent Hashing’.
What is consistent hashing?
A technique used in distributed systems to efficiently distribute keys across nodes while minimizing key movement when nodes are added or removed. It maps both keys and nodes onto a circular hash ring, assigning each key to the next node in a clockwise direction. This reduces the work necessary to redistribute keys if a node is added or removed.
What is list based partitioning?
It is when each partition is assigned a list of values. Whenever we want to insert a new record, we will see which partition contains our key and then store it there.
For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries.
What is round robin partitioning?
A very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n).
What is composite partitioning?
A combination any known partitioning schemes used to devise a new scheme. For example, first applying a list partitioning scheme and then a hash-based partitioning.
Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key-space to a size that can be listed
What are the problems of using joins across partitions living on a different server?
They’re often not feasible because they’re slow and hard to do across multiple servers.
One solution is to denormalize the data
Why is enforcing referential integrarity accross a partitioned database difficult?
Because most RDBMS do not support foreign key constraints across databases on different servers. This means foreign key constraints must be enforced in code and often cleaned with subsequent sql jobs.
What are some reason s we might have to change our partitioning scheme (rebalance)?
The data distribution is not uniform - e.g. a lot of places with one zip code that won’t fit a partition.
There is a lot of load on a parition - e.g. there are too many requests being handled by the DB partition dedicated to user photos.
How might you rebalance without incurring downtine?
By using directory-based paritioning.
What is directory-based partitioning and what are it’s pros and cons.
Directory-Based Partitioning is a database partitioning technique where a separate mapping directory is used to determine where a specific piece of data should be stored.
Pro: Makes rebalancing possible without downtime
Con: Increases the complexity of the system and creating a single point of failure at the lookup service/database