Consistent Hashing Flashcards
When designing a scalable system, what is the most important aspect?
Deciding how the data will be partitioned and replicated across servers.
What was consistent hashing initially used for?
Distributed caching.
Why is consistent hashing necessary?
Because naive/simple hashing (e.g., hash(key) mod N over N servers) breaks down whenever a server is added or removed: N changes, so nearly all keys must be remapped.
What does consistent hashing guarantee?
Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are added or removed.
What is the basic design of consistent hashing?
Consistent Hashing stores the data managed by a distributed system in a ring and each node in the ring is assigned a range of data.
In consistent hashing, what is the start of a range called?
A token.
Given a token, how does a hash ring assign data to a node?
Range start: Token value
Range end: Next token value - 1
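A minimal sketch of the lookup described above, using Python's standard library (node and key names are hypothetical). A key belongs to the node whose token is the largest one less than or equal to the key's hash, wrapping around the ring:

```python
import bisect
import hashlib

def hash_key(key: str) -> int:
    # MD5 digest interpreted as an integer position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Each node's token starts its range; the range ends just before
    the next token on the ring (wrapping around at the top)."""

    def __init__(self, nodes):
        self.node_by_token = {hash_key(n): n for n in nodes}
        self.tokens = sorted(self.node_by_token)

    def get_node(self, key: str) -> str:
        h = hash_key(key)
        # Largest token <= h owns the key; index -1 wraps to the last token.
        i = bisect.bisect_right(self.tokens, h) - 1
        return self.node_by_token[self.tokens[i]]
```

Here each node's own name is hashed to produce its token; real systems often assign tokens explicitly or randomly.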
What is the algorithm used for consistent hashing?
Typically a general-purpose hash function such as MD5, which maps both keys and node tokens onto the ring.
What happens when a node is removed from the hash ring?
The next node becomes responsible for all of the keys stored on the outgoing node. However, this scheme can result in non-uniform data and load distribution. This problem can be solved with the help of Virtual nodes.
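A quick self-contained demonstration of the guarantee above (node and key names are hypothetical; the ownership convention is the largest-token-not-exceeding-the-hash rule from the earlier cards). Removing a node reassigns only that node's keys; every other key keeps its owner:

```python
import bisect
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def owner(nodes, key):
    # Owner = node whose token is the largest one <= hash(key), wrapping around.
    tokens = sorted((hash_key(n), n) for n in nodes)
    positions = [t for t, _ in tokens]
    i = bisect.bisect_right(positions, hash_key(key)) - 1
    return tokens[i][1]

nodes = ["node-a", "node-b", "node-c", "node-d"]
keys = [f"user:{i}" for i in range(1000)]
before = {k: owner(nodes, k) for k in keys}
after = {k: owner([n for n in nodes if n != "node-c"], k) for k in keys}

# Only keys that lived on the removed node change owners.
moved = [k for k in keys if before[k] != after[k]]
assert all(before[k] == "node-c" for k in moved)
```

Under naive hash(key) mod N, by contrast, dropping from 4 to 3 servers would remap roughly three-quarters of the keys.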
What are the issues associated with manual & fixed divisions of the ranges of a hash ring?
Adding or removing nodes requires recomputing ranges, uneven ranges can create hotspots, and rebuilding a failed node puts heavy load on the replica it copies from, degrading that replica's performance.
What are VNodes?
Many small, non-contiguous subdivisions (ranges) of a hash ring assigned to each physical node. When rebalancing, these small ranges can be redistributed across all remaining (or newly added) nodes, instead of one large range having to be reassigned or divided up.
What are the advantages of VNodes?
They speed up the rebalancing process after adding or removing nodes, they make it easier to maintain a cluster containing heterogeneous machines (more powerful machines can be assigned a greater number of ranges), and they decrease the probability of hotspots.
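A sketch of the heterogeneous-cluster point above (node names and the weights mapping are hypothetical): each physical node is assigned several tokens, one per vnode, so a more powerful machine can be given a larger share of the ring.

```python
import bisect
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class VNodeRing:
    """Ring where each physical node holds several tokens (vnodes).
    `weights` maps node name -> vnode count, so more powerful machines
    can be assigned more vnodes and therefore more of the key space."""

    def __init__(self, weights):
        self.node_by_token = {}
        for node, vnode_count in weights.items():
            for i in range(vnode_count):
                # Derive one token per vnode from the node name.
                self.node_by_token[hash_key(f"{node}#vnode-{i}")] = node
        self.tokens = sorted(self.node_by_token)

    def get_node(self, key: str) -> str:
        i = bisect.bisect_right(self.tokens, hash_key(key)) - 1
        return self.node_by_token[self.tokens[i]]
```

With many small ranges per node, removing one machine scatters its load across all the others rather than dumping it on a single neighbor.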
What is replication factor?
The replication factor is the number of nodes that will receive a copy of the same data. For example, a replication factor of two means there are two copies of each data item, where each copy is stored on a different node, determined by walking clockwise from the primary replica’s node.
Ownership of a token range is only used to determine the primary replica.
Replica placement is a separate decision, and replica nodes may not “own” the hash of the data they store.
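The clockwise walk described above can be sketched as follows (node names are hypothetical; skipping duplicates matters once vnodes put one physical node at several ring positions):

```python
import bisect
import hashlib

def hash_key(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas_for(nodes, key, replication_factor):
    """Primary replica plus the next distinct nodes found by walking
    clockwise around the ring, up to `replication_factor` copies."""
    tokens = sorted((hash_key(n), n) for n in nodes)
    positions = [t for t, _ in tokens]
    i = bisect.bisect_right(positions, hash_key(key)) - 1
    replicas = []
    for step in range(len(tokens)):
        node = tokens[(i + step) % len(tokens)][1]
        if node not in replicas:  # skip duplicate physical nodes
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas
```

Note how only the first node in the returned list "owns" the key's token range; the others store copies of data whose hash falls outside their own ranges.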
What does eventual consistency mean in the context of consistent hashing?
In eventually consistent systems, copies of data don’t always have to be identical as long as they are designed to eventually become consistent. In distributed systems, eventual consistency is used to achieve high availability.
Any distributed system that needs to scale up or down or wants to achieve high availability through data replication can utilize Consistent Hashing. A few such examples could be:
- Any system that works with a set of storage (or database) servers and needs to scale up or down based on usage, e.g., the system could need more storage during Christmas because of high traffic.
- Any distributed system that needs dynamic adjustment of its cache usage by adding or removing cache servers based on the traffic load.
- Any system that wants to replicate its data shards to achieve high availability.