System Design: Data Partitioning Flashcards

Data Partitioning

1
Q

What is data partitioning?

A

Data partitioning is process of dividing a large database (DB) into smaller, more manageable parts called partitions or shards. Each partition is independent and contains a subset of the overall data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

By what criteria do people typically partition data?

A

By data range, data size, or data type

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are the three most popular partitioning methods?

A

Horizontal, vertical, and hybrid partitioning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is horizontal partitioning?

A

Horizontal partitioning (or sharding) is a database design technique where rows of a table are distributed across multiple smaller, separate tables or databases based on a predefined key, such as user ID or geographic region

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are the tradeoffs of horizontal partitioning?

A

Pros: it’s scalable (adding more shards can increase capacity), performant (scoping requests to a single shard is faster because it has fewer rows), and fault tolerant (failures are isolated to one node).

Cons: it’s complex (it requires additional logic to target the right shard), multiple shard queries are complicated and slow, the design is subject to hotspots (disproportionately large shards) which require effortful rebalancing, there might be data duplication or inconsistencies

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is vertical partitionining?

A

Sharding data based on columns rather that rows, typically separating our different data types. For example, storing username and email in one shard but storing location and demographic info in another.

This technique can help optimize performance by reducing the amount of data that needs to be scanned, especially when certain columns are accessed more frequently than others.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is hybrid partitioning?

A

Hybrid data partitioning combines both horizontal and vertical partitioning techniques to partition data into multiple shards. This technique can help optimize performance by distributing the data evenly across multiple servers, while also minimizing the amount of data that needs to be scanned.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are some common row based partitioning strategies?

A
  1. Key or Hash based partitioning
  2. List partitioning
  3. Round Robin partitioning.
  4. Composite partitioning.
  5. Range partitioning.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is hash based partitioning?

A

When you apply a hash function to some key attributes of the entity you are storing that yields the partition number.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are the trade offs of key or hash based partitioning?

A

Pro: It ensures a uniform allocation of data among servers.

Con: It fixes the total number of DB servers, since adding new servers means changing the hash function which would require redistribution of data and downtime for the service.

A workaround for this problem is to use ‘Consistent Hashing’.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is consistent hashing?

A

A technique used in distributed systems to efficiently distribute keys across nodes while minimizing key movement when nodes are added or removed. It maps both keys and nodes onto a circular hash ring, assigning each key to the next node in a clockwise direction. This reduces the work necessary to redistribute keys if a node is added or removed.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is list based partitioning?

A

It is when each partition is assigned a list of values. Whenever we want to insert a new record, we will see which partition contains our key and then store it there.

For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is round robin partitioning?

A

A very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is composite partitioning?

A

A combination any known partitioning schemes used to devise a new scheme. For example, first applying a list partitioning scheme and then a hash-based partitioning.

Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key-space to a size that can be listed

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the problems of using joins across partitions living on a different server?

A

They’re often not feasible because they’re slow and hard to do across multiple servers.

One solution is to denormalize the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Why is enforcing referential integrarity accross a partitioned database difficult?

A

Because most RDBMS do not support foreign key constraints across databases on different servers. This means foreign key constraints must be enforced in code and often cleaned with subsequent sql jobs.

17
Q

What are some reason s we might have to change our partitioning scheme (rebalance)?

A

The data distribution is not uniform - e.g. a lot of places with one zip code that won’t fit a partition.

There is a lot of load on a parition - e.g. there are too many requests being handled by the DB partition dedicated to user photos.

18
Q

How might you rebalance without incurring downtine?

A

By using directory-based paritioning.

19
Q

What is directory-based partitioning and what are it’s pros and cons.

A

Directory-Based Partitioning is a database partitioning technique where a separate mapping directory is used to determine where a specific piece of data should be stored.

Pro: Makes rebalancing possible without downtime
Con: Increases the complexity of the system and creating a single point of failure at the lookup service/database