Sharding Flashcards

1
Q

Sharding

A

Sharding is the horizontal scaling of a database system, accomplished by breaking the database up into smaller “shards”: separate database servers that each contain a subset of the overall dataset.

Modern software services increasingly collect and use more data than can fit on a single machine. The capacity of a single database server can be increased (vertical scaling), but it eventually runs into a physical limit. The alternative is splitting up data across a cluster of database servers.

Database sharding splits up data in a particular way so that data access patterns can remain as efficient as possible. A shard is a horizontal partition, meaning the database table is split up by drawing a horizontal line between rows. This is in contrast to a vertical partition, where partitions are made between columns.
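
The difference between the two kinds of partition can be sketched in a few lines of Python. This is only an illustration: the users table, its columns, and the two helper functions are hypothetical and not taken from any particular database.

```python
# Hypothetical "users" table: each dict is one row.
users = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "email": "grace@example.com"},
    {"id": 3, "name": "Linus", "email": "linus@example.com"},
    {"id": 4, "name": "Barbara", "email": "barbara@example.com"},
]

def horizontal_partition(rows, num_parts):
    """Split by rows: every partition keeps all columns, but only some rows."""
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[row["id"] % num_parts].append(row)
    return parts

def vertical_partition(rows, key, column_groups):
    """Split by columns: every partition keeps all rows, but only some columns."""
    return [
        [{col: row[col] for col in (key, *group)} for row in rows]
        for group in column_groups
    ]

# Two shards, each holding complete rows for half of the users.
shards = horizontal_partition(users, 2)

# Two column groups, each holding every user but only part of each row.
column_parts = vertical_partition(users, "id", [("name",), ("email",)])
```

Each horizontal partition can answer a query about one of its users on its own, which is why sharding uses this kind of split.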

2
Q

How sharding works

A

Each row is assigned a shard key that maps it to the logical shard it can be found on. More than one logical shard can be located on the same physical shard, but a logical shard can’t be split between physical shards.
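
A minimal sketch of the two-level mapping, assuming eight logical shards spread over three servers. The server names, the placement table, and the use of CRC32 are illustrative assumptions, not part of any specific product.

```python
import zlib

NUM_LOGICAL_SHARDS = 8  # assumed count, kept fixed for the lifetime of the cluster

# Hypothetical placement table: logical shard -> physical server.
# Several logical shards share a server, but each logical shard
# lives on exactly one server.
LOGICAL_TO_PHYSICAL = {
    0: "db-a", 1: "db-a", 2: "db-a",
    3: "db-b", 4: "db-b", 5: "db-b",
    6: "db-c", 7: "db-c",
}

def logical_shard(shard_key: str) -> int:
    # crc32 gives the same value in every process,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(shard_key.encode("utf-8")) % NUM_LOGICAL_SHARDS

def physical_shard(shard_key: str) -> str:
    return LOGICAL_TO_PHYSICAL[logical_shard(shard_key)]
```

With this layout, rebalancing means moving a whole logical shard and changing one entry in the placement table; the shard key of each row does not change.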

When creating a sharded architecture, the goal is to have many small shards to create an even distribution of data across the nodes. This prevents hotspots from overwhelming any one of the nodes and produces fast response times for all nodes.

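Sharding can be implemented either at the application level or the database level. Many databases now support sharded architectures natively. A notable exception is PostgreSQL, one of the top relational databases, whose core distribution has no built-in automatic sharding and typically relies on extensions such as Citus or on application-level sharding.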

3
Q

Denormalization

A

In a sharded database, the process of making sure there aren’t relational constraints between different shards is called denormalization. Denormalization is achieved by duplicating data (which adds write complexity) and grouping data so that it can be accessed in a single row. Notably, denormalized data is different from non-normalized data: in non-normalized data the relationships between data are unknown, whereas in denormalized data they are known and deliberately located on a single table.
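
As a rough illustration, the sketch below copies customer fields into every order row so that an order can be read without a lookup on another shard. The tables, field names, and helper functions are hypothetical.

```python
# Normalized: an order refers to its customer by id, so reading an order
# needs a second lookup that may land on a different shard.
customers = {7: {"name": "Ada", "city": "London"}}
orders_normalized = [
    {"order_id": 100, "customer_id": 7, "total": 25.0},
    {"order_id": 101, "customer_id": 7, "total": 40.0},
]

def denormalize(orders, customers):
    """Copy the customer fields into every order row."""
    return [
        {
            **order,
            "customer_name": customers[order["customer_id"]]["name"],
            "customer_city": customers[order["customer_id"]]["city"],
        }
        for order in orders
    ]

def rename_customer(orders, customer_id, new_name):
    """The write-side cost: every duplicated copy has to be updated."""
    touched = 0
    for order in orders:
        if order["customer_id"] == customer_id:
            order["customer_name"] = new_name
            touched += 1
    return touched

orders_denormalized = denormalize(orders_normalized, customers)
```

Reading one denormalized order now touches a single row, while renaming the customer has to rewrite every order that carries a copy of the name.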

4
Q

When to use sharding

A

The benefits of a sharded architecture, compared with other kinds of database partitioning, are:
-Leveraging average hardware instead of high-end machines
-Scaling quickly by adding more shards
-Better performance, because each machine is under less load

Sharding is particularly useful when a single database server:
-Can’t hold all the data
-Can’t compute all the query responses fast enough
-Can’t handle the number of concurrent connections

You might also need sharding when you need to maintain distinct geographic regions, even if the above constraints haven’t been hit. Either your service will be faster when the data servers are physically closer to the users, or there’s legislation about data location and usage in one of the countries your service operates in.

5
Q

When not to use sharding

A

The disadvantages of database sharding are all about complexity. Queries become more complex because each one has to obtain the correct shard key, and they have to be written to avoid multi-shard queries wherever possible.
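
The difference between a single-shard and a multi-shard query can be sketched with in-memory lists standing in for shards. Everything here (the three shards, the tenant-id shard key, the function names) is an assumption made for illustration.

```python
# Hypothetical in-memory "shards": shard index -> list of order rows.
NUM_SHARDS = 3
shards = {i: [] for i in range(NUM_SHARDS)}

def shard_for(tenant_id: int) -> int:
    return tenant_id % NUM_SHARDS

def insert_order(tenant_id: int, total: float) -> None:
    shards[shard_for(tenant_id)].append({"tenant_id": tenant_id, "total": total})

def orders_for_tenant(tenant_id: int):
    """Single-shard query: the shard key is known, so one shard is read."""
    rows = shards[shard_for(tenant_id)]
    return [r for r in rows if r["tenant_id"] == tenant_id]

def orders_over(min_total: float):
    """Multi-shard query: the filter has no shard key, so every shard is read."""
    return [r for rows in shards.values() for r in rows if r["total"] >= min_total]

for tenant_id, total in [(1, 10.0), (2, 99.0), (4, 120.0), (6, 5.0)]:
    insert_order(tenant_id, total)
```

The second function is the kind of query to avoid: its cost grows with the number of shards, and in a real cluster it also has to merge results and handle shards that are slow or unavailable.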

If the shards can’t be entirely isolated, you need to implement eventual consistency for duplicated data or find a way to uphold relational constraints across shards. The implementation and deployment of your database get a lot more complex, as do failovers, backups, and other kinds of maintenance. Essentially, you should only use database sharding when you absolutely have to.

6
Q

Shard key

A

Each row is assigned a shard key that maps it to the logical shard it can be found on. Shard keys need to be unique across shards. A shard key is derived from some invariant feature of the data, chosen using business logic to optimize for the most common queries. Common choices are tenant IDs, locations, and timestamps.

7
Q

Range sharding approach

A

Data can be assigned to a shard based on what “range” it falls into. For example, a database with sequential time-based data like log history could shard based on month ranges. One big advantage of range-based shard keys is that they make sequential access patterns very fast, because data that is “close” in the given range will be on the same shard.

One downside to ranges is that the balance of data can be unpredictable. For example, an e-commerce company might have a lot more orders in December because of holiday shopping, so the shard with the December range could get overwhelmed while the other shards aren’t doing much.
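
A minimal sketch of range-based routing, assuming one shard per month of 2023. The shard names and the sample order dates are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date

# Assumed layout: one shard per month of 2023. RANGE_STARTS[i] is the first
# day that belongs to SHARD_NAMES[i]; the last range is open-ended.
RANGE_STARTS = [date(2023, month, 1) for month in range(1, 13)]
SHARD_NAMES = [f"orders-2023-{month:02d}" for month in range(1, 13)]

def shard_for(day: date) -> str:
    # Find the last range start that is <= day.
    idx = bisect_right(RANGE_STARTS, day) - 1
    if idx < 0:
        raise ValueError(f"{day} is before the first range")
    return SHARD_NAMES[idx]

# Hypothetical workload: a holiday rush in December, little traffic otherwise.
order_dates = [date(2023, 12, d) for d in range(1, 21)]
order_dates += [date(2023, 3, 5), date(2023, 7, 9)]
load = Counter(shard_for(d) for d in order_dates)
```

Dates that are close together land on the same shard, which makes a scan over one month cheap, and the load counter shows the December shard taking almost all of the writes.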

8
Q

Hash sharding approach

A

To address the issue of imbalanced shards, data can be distributed based on a hash of part of the data. An effective hash function will randomize and distribute incoming data to prevent any access patterns that could overwhelm a node. For example, the profile pages of celebrities get substantially more traffic than the average user, so a hash function can be used to randomly distribute these celebrity users and prevent hotspots.

One major advantage of a hash-based sharded architecture is that the shard for a given key can be computed by any server that knows the hash function, so no central lookup service is needed and there’s no centralized point of failure.
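
A minimal sketch of hash-based routing, assuming four shards and user ids as the hashed field. SHA-256 is used here only because it is stable across processes; real systems often pick a faster non-cryptographic hash.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumed cluster size

def shard_for(user_id: str) -> int:
    """Stable hash of the key, reduced modulo the shard count."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Any server can run shard_for() locally; no lookup service is involved.
# Counting 10,000 hypothetical user ids shows how evenly they spread.
load = Counter(shard_for(f"user-{i}") for i in range(10_000))
```

Sequential ids that would all fall into one range end up scattered across all four shards, at the cost of losing the fast sequential scans that range sharding offers.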

One big downside of hashing is that adding shards can require a lot of overhead, depending on the implementation, because changing the number of shards can change which shard most keys map to. Consistent hashing limits this by guaranteeing that only a minimal amount of data has to be moved when a new node is added.
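
The effect can be sketched by comparing a simple hash ring with plain modulo hashing when a fifth node joins four existing ones. This is an illustrative toy, not a production design: the node names, the 100 virtual nodes per server, and the HashRing class are all assumptions.

```python
import hashlib
from bisect import bisect_right

def h(value: str) -> int:
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` points on the ring to even out load.
        self._ring = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect_right(self._points, h(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user-{i}" for i in range(10_000)]
old_nodes = ["db-0", "db-1", "db-2", "db-3"]
new_nodes = old_nodes + ["db-4"]

# Consistent hashing: count keys whose owner changes when db-4 joins.
old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
moved_ring = sum(old_ring.node_for(k) != new_ring.node_for(k) for k in keys)

# Plain modulo hashing: count keys whose shard index changes from N=4 to N=5.
moved_modulo = sum(h(k) % len(old_nodes) != h(k) % len(new_nodes) for k in keys)
```

With the ring, the only keys that move are the ones the new node takes over, which is roughly one fifth of them here. With modulo hashing, a key keeps its shard only when its hash gives the same remainder for 4 and for 5, so most keys move.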

9
Q

Name Node sharding approach

A