Data Partitioning Flashcards

1
Q

What is Data Partitioning?

A

Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application.

The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow it vertically by adding beefier servers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

(Partitioning Methods)

Horizontal partitioning

A

In this scheme, we put different rows into different tables.

This is also called a range based partitioning as we are storing different ranges of data in separate tables. Horizontal partitioning is also called as Data Sharding.

The key problem with this approach is that if the value whose range is used for partitioning isn’t chosen carefully, then the partitioning scheme will lead to unbalanced servers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

(Partitioning Methods)

Vertical Partitioning

A

In this scheme, we divide our data to store tables related to a specific feature in their own server.

Vertical partitioning is straightforward to implement and has a low impact on the application. The main problem with this approach is that if our application experiences additional growth, then it may be necessary to further partition a feature specific DB across various servers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

(Partitioning Methods)

Directory Based Partitioning

A

A loosely coupled approach to work around issues mentioned in the above schemes is to create a lookup service which knows your current partitioning scheme and abstracts it away from the DB access code. So, to find out where a particular data entity resides, we query the directory server that holds the mapping between each tuple key to its DB server. This loosely coupled approach means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without having an impact on the application.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

(Partitioning Criteria)

Key or Hash-based partitioning

A

Under this scheme, we apply a hash function to some key attributes of the entity we are storing; that yields the partition number.

For example, if we have 100 DB servers and our ID is a numeric value that gets incremented by one each time a new record is inserted. In this example, the hash function could be ‘ID % 100’, which will give us the server number where we can store/read that record. This approach should ensure a uniform allocation of data among servers.

The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function which would require redistribution of data and downtime for the service. A workaround for this problem is to use Consistent Hashing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

(Partitioning Criteria)

List partitioning

A

In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we will see which partition contains our key and then store it there.

For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

(Partitioning Criteria)

Round-robin partitioning

A

This is a very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

(Partitioning Criteria)

Composite partitioning

A

Under this scheme, we combine any of the above partitioning schemes to devise a new scheme. For example, first applying a list partitioning scheme and then a hash based partitioning. Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Name 3 Common Problems of Data Partitioning

A

Joins and Denormalization
Referential integrity
Rebalancing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

(Problems of Data Partitioning)

Joins and Denormalization

A

Performing joins on a database which is running on one server is straightforward, but once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span database partitions. Such joins will not be performance efficient since data has to be compiled from multiple servers. A common workaround for this problem is to denormalize the database so that queries that previously required joins can be performed from a single table. Of course, the service now has to deal with all the perils of denormalization such as data inconsistency.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

(Problems of Data Partitioning)

Referential integrity

A

As we saw that performing a cross-partition query on a partitioned database is not feasible, similarly, trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult.

Most of RDBMS do not support foreign keys constraints across databases on different database servers. Which means that applications that require referential integrity on partitioned databases often have to enforce it in application code. Often in such cases, applications have to run regular SQL jobs to clean up dangling references.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

(Problems of Data Partitioning)

Rebalancing

A

There could be many reasons we have to change our partitioning scheme:

  • The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP code that cannot fit into one database partition.
  • There is a lot of load on a partition, e.g., there are too many requests being handled by the DB partition dedicated to user photos.

In such cases, either we have to create more DB partitions or have to rebalance existing partitions, which means the partitioning scheme changed and all existing data moved to new locations. Doing this without incurring downtime is extremely difficult. Using a scheme like directory based partitioning does make rebalancing a more palatable experience at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly