Data Partitioning Flashcards
What is Data Partitioning?
- Data partitioning (also known as sharding) is a technique to break up a big database (DB) into many smaller parts.
- It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability and load balancing of an application.
Why should we use Data Partitioning?
• The justification for data sharding is that, beyond a certain scale, it is cheaper and more feasible to scale horizontally by adding more machines than to scale vertically by upgrading to beefier servers.
What are 3 partitioning methods?
- Horizontal partitioning
- Vertical Partitioning
- Directory-Based Partitioning
What is Horizontal partitioning? What’s another name for it? What is the problem with it?
In this scheme, we put different rows into different tables.
For example, if we are storing different places in a table, we can decide that places with ZIP codes less than 10000 are stored in one table, and places with ZIP codes of 10000 or greater are stored in a separate table.
This is also called range-based sharding, as we are storing different ranges of data in separate tables.
The key problem with this approach is that if the value whose range is used for sharding isn’t chosen carefully, then the partitioning scheme will lead to unbalanced servers.
In the previous example, splitting places by ZIP code assumes that places are evenly distributed across the ZIP codes. This assumption is not valid, as there will be many more places in a densely populated area like Manhattan than in its suburbs.
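The ZIP code example can be sketched roughly as follows. The table names and the sample data are made up for illustration, and the sketch assumes that the boundary value 10000 itself goes to the second table.

```python
from collections import Counter

# Boundary from the example: ZIP codes below 10000 go to one table,
# everything else to a second table (assumption: 10000 itself goes to the second).
ZIP_BOUNDARY = 10000


def table_for_zip(zip_code: int) -> str:
    """Return the (hypothetical) table name that stores a place with this ZIP code."""
    return "places_low" if zip_code < ZIP_BOUNDARY else "places_high"


def table_load(zip_codes: list[int]) -> Counter:
    """Count how many places land in each table."""
    return Counter(table_for_zip(z) for z in zip_codes)


# Made-up sample: three places with low ZIP codes, one hundred places in ZIP 10001.
sample = [2108, 6510, 7030] + [10001] * 100
print(table_load(sample))
```

With this sample, almost every place lands in the same table, which is the kind of imbalance the card describes.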
What is vertical partitioning? What is its key disadvantage?
In this scheme, we divide our data so that the tables related to a specific feature are stored on their own server.
For example, if we are building an Instagram-like application, where we need to store data related to users, the photos they upload, and the people they follow, we can decide to place user profile information on one DB server, friend lists on another, and photos on a third server.
Vertical partitioning is straightforward to implement and has a low impact on the application.
Disadvantage: If our application experiences additional growth, it may become necessary to further partition a feature-specific DB across several servers (e.g. it would not be possible for a single server to handle all the metadata queries for 10 billion photos by 140 million users).
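A minimal sketch of routing by feature, assuming hypothetical feature and server names. Note that each feature still maps to a single server, which is why further growth forces another round of partitioning.

```python
# Hypothetical names: each feature's tables live on their own DB server.
FEATURE_TO_SERVER = {
    "user_profiles": "db-users",
    "friend_lists": "db-friends",
    "photos": "db-photos",
}


def server_for_feature(feature: str) -> str:
    """Route a query to the server that owns this feature's tables."""
    try:
        return FEATURE_TO_SERVER[feature]
    except KeyError:
        raise ValueError(f"no server configured for feature {feature!r}") from None


print(server_for_feature("photos"))
```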
What is Directory Based Partitioning? What does it solve?
A loosely coupled approach to work around the issues mentioned in the above schemes is to create a lookup service that knows the current partitioning scheme and abstracts it away from the DB access code.
So, to find out where a particular data entity resides, we query the directory server that holds the mapping from each tuple key to its DB server.
This loosely coupled approach means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without impacting the application.
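As a rough illustration, the directory can be thought of as a key-to-server map behind a small interface. The class, key format, and server names below are hypothetical, and a real lookup service would be a separate, replicated component rather than an in-memory dict.

```python
class ShardDirectory:
    """Toy in-memory lookup service mapping each tuple key to a DB server."""

    def __init__(self) -> None:
        self._key_to_server: dict[str, str] = {}

    def assign(self, key: str, server: str) -> None:
        """Record (or change) which server holds the given key."""
        self._key_to_server[key] = server

    def lookup(self, key: str) -> str:
        """Return the server holding the key; raises KeyError if unknown."""
        return self._key_to_server[key]


directory = ShardDirectory()
directory.assign("user:42", "db-1")

# Application code only ever asks the directory where a key lives.
before = directory.lookup("user:42")

# Moving the key to a newly added server changes the directory entry only;
# the lookup call used by the application stays the same.
directory.assign("user:42", "db-2")
after = directory.lookup("user:42")
print(before, after)
```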
What are 4 partition criteria?
- Key or Hash-based partitioning
- List partitioning
- Round-robin partitioning
- Composite partitioning
Partitioning Criteria: Key or hash-based: what do you know?
o Key or Hash-based partitioning
Under this scheme, we apply a hash function to some key attribute of the entity we are storing, which yields the partition number.
Example:
• We have 100 DB servers, and our ID is a numeric value that gets incremented by one each time a new record is inserted.
• The hash function could be ‘ID % 100’, which will give us the server number where we can store/read that record.
• This approach should ensure a uniform allocation of data among servers.
• Disadvantage: It effectively fixes the total number of DB servers, since adding new servers means changing the hash function, which would require redistributing the data and downtime for the service. A workaround for this problem is to use Consistent Hashing.
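The ‘ID % 100’ example and its disadvantage can be sketched as follows. The function names and the sample of 10,000 sequential IDs are illustrative assumptions; the second function simply counts how many IDs would map to a different server if one server were added.

```python
def server_for_id(record_id: int, num_servers: int) -> int:
    """Modulo hash from the example: record ID -> server number."""
    return record_id % num_servers


def moved_fraction(ids: range, old_n: int, new_n: int) -> float:
    """Fraction of IDs whose server changes when the server count changes."""
    moved = sum(
        1 for i in ids if server_for_id(i, old_n) != server_for_id(i, new_n)
    )
    return moved / len(ids)


ids = range(1, 10_001)  # made-up sample of sequential IDs
print(server_for_id(1234, 100))       # server for record 1234 with 100 servers
print(moved_fraction(ids, 100, 101))  # share of records that would have to move
```

For this sample, going from 100 to 101 servers changes the server for about 99% of the records, which is why plain modulo hashing makes adding servers so expensive.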
Partitioning Criteria: List partitioning: what do you know?
o List partitioning
In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we will see which partition contains our key and then store it there.
Example:
• We can decide that all users living in Iceland, Norway, Sweden, Finland or Denmark will be stored in a partition for the Nordic countries.
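A minimal sketch of the Nordic example. The partition names and the catch-all partition for unlisted countries are assumptions added for illustration.

```python
# The Nordic list comes from the example; partition names are hypothetical.
PARTITION_VALUES = {
    "nordic": {"Iceland", "Norway", "Sweden", "Finland", "Denmark"},
}
DEFAULT_PARTITION = "other"  # assumption: a catch-all for unlisted countries


def partition_for_country(country: str) -> str:
    """Return the partition whose value list contains the given country."""
    for name, values in PARTITION_VALUES.items():
        if country in values:
            return name
    return DEFAULT_PARTITION


print(partition_for_country("Norway"))
print(partition_for_country("Brazil"))
```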
Partitioning Criteria: Round-robin partitioning: what do you know?
o Round-robin partitioning
This is a very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the i-th tuple is assigned to partition (i mod n).
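The (i mod n) rule can be sketched in a few lines. The function name is hypothetical, and the sketch assumes tuples are numbered from 0 in arrival order.

```python
def round_robin_assign(tuples: list, n: int) -> list[list]:
    """Assign the i-th tuple (0-based) to partition i mod n."""
    partitions = [[] for _ in range(n)]
    for i, item in enumerate(tuples):
        partitions[i % n].append(item)
    return partitions


parts = round_robin_assign(list("abcdefg"), 3)
print(parts)  # [['a', 'd', 'g'], ['b', 'e'], ['c', 'f']]
```

Partition sizes differ by at most one, but a record's partition depends on its arrival position rather than its key.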
Partitioning Criteria: Composite partitioning: what do you know?
Under this scheme, we combine any of the above partitioning schemes to devise a new scheme.
Example:
• First apply list partitioning, and then hash-based partitioning.
• Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed.
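A rough sketch of the first example, list partitioning followed by hash-based partitioning. The region table, the number of shards per region, and the use of SHA-256 as the hash are all assumptions chosen for illustration.

```python
import hashlib

# Step 1 (list partitioning): hypothetical country -> region lists.
REGION_OF_COUNTRY = {
    "Iceland": "nordic",
    "Norway": "nordic",
    "Sweden": "nordic",
    "Finland": "nordic",
    "Denmark": "nordic",
}
SHARDS_PER_REGION = 4  # assumption: every region has the same number of shards


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() varies between runs for str)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def composite_partition(country: str, user_id: str) -> tuple[str, int]:
    """Pick the region by list lookup, then the shard within it by hashing."""
    region = REGION_OF_COUNTRY.get(country, "other")
    shard = stable_hash(user_id) % SHARDS_PER_REGION  # Step 2 (hash partitioning)
    return region, shard


print(composite_partition("Sweden", "user-1001"))
```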
What is the key reason for partitioning problems?
Operations across multiple tables, or across multiple rows in the same table, will no longer run on the same server.
What are 3 common problems with sharding?
a. Joins and Denormalization
b. Referential integrity
c. Rebalancing