Leader election algorithms Flashcards
Leader election algorithm
describes how a cluster of nodes without a leader can communicate with each other to choose exactly one of themselves to become the leader. The algorithm is executed whenever the cluster starts or when the leader node goes down.
Generally, the algorithm works by assigning one of three states to each node: Leader, Follower, or Candidate. Additionally, the leader is required to regularly pass a “healthcheck” or “heartbeat” so follower nodes can tell when the leader has become unavailable or failed and a new one needs to be elected. The kind of leader election algorithm you want depends on whether the cluster is synchronous or asynchronous.
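As a rough illustration, here is a minimal Python sketch of the three states and a follower-side heartbeat timeout. The class, the timeout value, and `start_election` are all hypothetical names for illustration, not from any particular implementation:

```python
import time
from enum import Enum

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class Node:
    HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before suspecting the leader (illustrative)

    def __init__(self):
        self.state = NodeState.FOLLOWER
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # Called whenever a heartbeat message arrives from the current leader.
        self.last_heartbeat = time.monotonic()

    def check_leader(self):
        # If the leader has been silent too long, become a candidate and
        # trigger a new election.
        if time.monotonic() - self.last_heartbeat > self.HEARTBEAT_TIMEOUT:
            self.state = NodeState.CANDIDATE
            self.start_election()

    def start_election(self):
        # Hypothetical: runs whichever election algorithm the cluster uses.
        ...
```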
When to use a leader election algorithm
- when each node is roughly the same and there isn’t a clear candidate for a permanently assigned leader. This means any node can be elected as leader, and there isn’t a single point of failure required to coordinate the system.
- when the cluster is doing particularly complex work that needs good coordination. Coordination can mean anything from decisions about how the work is to be divided, to assigning work to specific nodes, or to synthesizing the results of work from different nodes.
- when a system executes many distributed writes to data and requires strong consistency. You can read more about consistency in our article on Databases, but essentially this means it’s very important that no matter which node handles a request, the user always gets the most up-to-date version of the data. In this situation a leader provides consistency guarantees by being the source of truth on the most recent state of the system (and the leader election algorithm must preserve this guarantee across leadership changes).
Not all applications require strong consistency, but you can imagine how it might be important to a bank to ensure that no matter which server answers a user’s online banking request, their account total will be accurate, and that multiple transactions directed to the same bank account won’t conflict with each other.
Drawbacks
The main downside to leader election is complexity: a bad implementation can end up with “split brain,” where two leaders try to control the cluster at the same time, or with no leader elected at all, leaving the cluster unable to coordinate. As such, leader election should only be used when there is a need for complex coordination or strong consistency and none of the alternatives fit the situation.
A leader is a single node, so it can become a bottleneck or temporary single point of failure. Additionally, if the leader starts making bad decisions (whatever that means in the context of directing work for the service), the followers will just do what they’re assigned, possibly derailing the entire cluster.
The leader/follower model also makes best practices like partial deployments and A/B testing harder, since the whole cluster has to follow the same protocols or respond uniformly to the same leader.
Synchronous cluster/algorithm
In a synchronous cluster, nodes are synchronized to the same clock and send messages in predictable amounts of time and in a predictable order. Synchronous algorithms can guarantee both safety (no more than one leader will be elected) and liveness (every node will finish the election), and are therefore easier to reason about and theoretically preferable. But in practice the big drawback is that synchronizing a cluster requires imposing additional constraints on how it operates, which aren’t always feasible or scalable.
Asynchronous cluster/algorithm
In an asynchronous cluster, messages are not reliably delivered within a certain amount of time or in any particular order. Any number of nodes can lag indefinitely, so the leader election process can’t guarantee both safety (no more than one leader will be elected) and liveness (every node will finish the election). In practice, implementations choose to guarantee safety because it has more critical implications for the service.
Bully Algorithm
a simple synchronous leader election algorithm. The algorithm requires that each node has a unique numeric id, and that nodes know the ids of all other nodes in the cluster.
The election process starts when a node starts up or when the current leader fails the healthcheck. There are two cases:
- if the node has the highest id, it declares itself the winner and sends this message to the rest of the nodes.
- if the node has a lower id, it messages all nodes with higher ids. If it doesn’t get a response, it assumes all of them have failed or are unavailable and declares itself the winner; if a higher-id node does respond, that node takes over the election. (Both cases are sketched below.)
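A minimal sketch of the bully algorithm’s decision logic, assuming a hypothetical `send(node_id, message)` transport that returns True when the target replies within the synchronous message deadline:

```python
class BullyNode:
    def __init__(self, node_id, all_ids):
        self.node_id = node_id
        self.all_ids = all_ids  # ids of every node in the cluster, known up front

    def start_election(self):
        higher_ids = [i for i in self.all_ids if i > self.node_id]
        if not higher_ids:
            # Case 1: highest id in the cluster -- declare victory immediately.
            self.announce_victory()
            return
        # Case 2: ask every higher-id node whether it is alive.
        responses = [self.send(i, "ELECTION") for i in higher_ids]
        if not any(responses):
            # No higher-id node answered: assume they all failed or are unavailable.
            self.announce_victory()
        # Otherwise a higher-id node takes over the election and will
        # eventually broadcast its own victory message.

    def announce_victory(self):
        for i in self.all_ids:
            if i != self.node_id:
                self.send(i, "VICTORY")

    def send(self, node_id, message):
        # Hypothetical transport: returns True if node_id replies in time.
        ...
```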
The main downside of the bully algorithm is that if the highest-ranked node goes down frequently, it will re-claim leadership every time it comes back online, causing unnecessary re-elections. Keeping messages synchronized can also be difficult, especially as the cluster grows larger and more physically distributed.
Paxos Algorithm
a general consensus protocol that can be used for asynchronous leader election. Paxos uses state machine replication to model the distributed system, and then chooses a leader by having some nodes (proposers) propose a leader and other nodes (acceptors) accept proposals. When a quorum (a majority) of the accepting nodes accept the same proposed leader, that proposed leader becomes the actual leader.
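A simplified sketch of the acceptor side of single-decree Paxos, where the “value” being agreed on is a proposed leader id. Message transport and the proposer’s retry loop are omitted, and the names are illustrative rather than from any specific library:

```python
class Acceptor:
    def __init__(self):
        self.promised_ballot = -1    # highest ballot number we've promised
        self.accepted_ballot = -1    # ballot of the last proposal we accepted
        self.accepted_leader = None  # leader id from that proposal

    def on_prepare(self, ballot):
        # Phase 1: promise to ignore lower ballots, and report any previously
        # accepted proposal so the proposer can adopt it instead of its own.
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            return ("PROMISE", self.accepted_ballot, self.accepted_leader)
        return ("REJECT",)

    def on_accept(self, ballot, leader_id):
        # Phase 2: accept the proposal unless we've promised a higher ballot.
        if ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted_ballot = ballot
            self.accepted_leader = leader_id
            return ("ACCEPTED",)
        return ("REJECT",)

# A proposal is chosen once a quorum (majority) of acceptors accept it.
def is_chosen(accept_replies, cluster_size):
    return sum(r == ("ACCEPTED",) for r in accept_replies) > cluster_size // 2
```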
Raft algorithm
an alternative to Paxos that is favored because people tend to find it simpler to understand, and therefore easier to implement and use. Raft is an asynchronous algorithm.
In Raft consensus, each node keeps track of the current “election term”. When leader election starts each node increments its copy of the term number and listens for messages from other nodes. After a random interval, if the node doesn’t hear anything, it will become a candidate leader and ask other nodes for votes.
If the candidate ever reaches a majority of votes, it becomes the leader, and if it ever receives a message from another node with a higher term number, it concedes. The algorithm restarts if the election is split or times out without consensus. Restarts are rare in practice because the random timeouts make it unlikely that multiple nodes become candidates at the same time.
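A sketch of Raft’s candidate transition, assuming a hypothetical `request_vote(peer, term)` RPC and a `peers` list; real Raft also checks log freshness before granting votes, which is omitted here:

```python
import random

class RaftNode:
    def __init__(self, peers):
        self.peers = peers
        self.term = 0
        self.state = "follower"
        # Randomized timeout so nodes rarely become candidates simultaneously.
        self.election_timeout = random.uniform(0.15, 0.30)  # seconds (illustrative)

    def on_election_timeout(self):
        # Haven't heard from a leader within the timeout: stand for election.
        self.term += 1
        self.state = "candidate"
        votes = 1  # vote for ourselves
        for peer in self.peers:
            if self.request_vote(peer, self.term):
                votes += 1
        if votes > (len(self.peers) + 1) // 2:
            self.state = "leader"
        # Otherwise stay a candidate; a new randomized timeout will fire and
        # restart the election with a higher term.

    def on_message(self, sender_term):
        # Seeing a higher term from any node means we're out of date: concede.
        if sender_term > self.term:
            self.term = sender_term
            self.state = "follower"

    def request_vote(self, peer, term):
        # Hypothetical RPC: returns True if peer grants its vote for this term.
        ...
```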
Apache ZooKeeper (ZAB) algorithm
Apache ZooKeeper is a centralized coordination service that is “itself distributed and highly reliable.” The ethos behind Apache ZooKeeper is that coordination in distributed systems is difficult, and it’s better to have a shared open source implementation with all the key elements so that your service doesn’t have to reimplement everything from scratch. This is especially helpful in large distributed systems.
ZAB (ZooKeeper Atomic Broadcast) is the protocol used by Apache ZooKeeper to handle leader election, replication order guarantees, and node recovery. It is called this because the leader “broadcasts” state changes to followers to make sure writes are consistent and propagated to all nodes. ZAB is an asynchronous algorithm.
ZAB is focused on making sure the history of the cluster stays accurate through leadership transitions. The leader is chosen such that it has the most up-to-date history (it has seen the most recent transaction). When enough of the nodes agree that the new leader has the most up-to-date history, it syncs its history with the cluster and finishes the election by recording itself as leader.
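A sketch of the “most up-to-date history wins” rule: each vote carries the node’s epoch and the id of the last transaction it has seen (its zxid), and the highest pair wins, with node id breaking exact ties. The field names here are illustrative:

```python
def pick_leader(candidates):
    # candidates: list of (node_id, epoch, last_zxid) tuples from voting nodes.
    # The node with the highest (epoch, last_zxid) -- i.e. the most recent
    # history -- wins; node id breaks exact ties deterministically.
    return max(candidates, key=lambda c: (c[1], c[2], c[0]))[0]

votes = [("node-a", 3, 120), ("node-b", 3, 125), ("node-c", 2, 300)]
assert pick_leader(votes) == "node-b"  # most recent epoch and transaction
```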
Locking model
An alternative to leader election. A locking model ensures that concurrent operations on a shared resource don’t conflict by only allowing changes from one node at a time. With optimistic locking, a node reads a resource along with its version id, makes changes, and then, before writing the update, checks that the version id is unchanged. If the id is different, the resource has been updated since the node first read it; going forward with the intended changes based on the old version would lose the other changes, so the node needs to try again.
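A minimal compare-and-set sketch of optimistic locking against an in-memory store; a real system would perform the version check atomically in the database (e.g. `UPDATE ... WHERE version = ?`):

```python
class VersionedStore:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, new_value, expected_version):
        # Reject the write if someone else updated the resource since we read it.
        if self.version != expected_version:
            return False  # caller should re-read and retry
        self.value = new_value
        self.version += 1
        return True

store = VersionedStore({"balance": 100})
value, version = store.read()
updated = {"balance": value["balance"] - 25}
if not store.write(updated, version):
    pass  # conflict: re-read the resource and apply the change again
```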
In pessimistic locking a node locks the resource, makes changes, and then unlocks the resource. If another node tries to initiate a change while the resource is locked, it will fail and try again later. Pessimistic locking is more rigorous, but can be hard to implement and bugs can cause deadlocks that stop a system from functioning.
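The same style of update under pessimistic locking: take the lock first so nothing else can touch the resource mid-change. This sketch uses a local mutex for illustration; a distributed system would use a lock service instead:

```python
import threading

lock = threading.Lock()
account = {"balance": 100}

def withdraw(amount):
    # blocking=False models "fail and try again later" instead of waiting.
    if not lock.acquire(blocking=False):
        return False  # resource is locked by someone else; retry later
    try:
        account["balance"] -= amount
        return True
    finally:
        lock.release()  # always unlock, or bugs here can deadlock the system
```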
These locking patterns are named for their use cases: optimistic locking is useful when you can make the “optimistic” assumption that another node won’t change the resource out from under the operation, and pessimistic locking is useful when you can make the “pessimistic” assumption that there will be contention for the resource.
Idempotent APIs
An alternative to leader election. APIs can be made idempotent to ensure consistent interactions with a shared resource. An API is idempotent when sending the same request multiple times has the same effect as sending it once. When reading from a resource, this means the response will always be the same value; when writing, this means the update will only be applied once.
For example, idempotent writes can be implemented by requiring request ids so the system can tell if a request is being retried. Idempotency is also supported by other features we’ve talked about, like locking and database transactions.
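A sketch of idempotent writes via client-supplied request ids: a replay of an already-processed request returns the recorded result instead of applying the change again. The service and field names are illustrative:

```python
class TransferService:
    def __init__(self):
        self.processed = {}  # request_id -> result of the first attempt
        self.balances = {"alice": 100, "bob": 0}

    def transfer(self, request_id, src, dst, amount):
        if request_id in self.processed:
            # Retry of a request we already handled: don't apply it twice.
            return self.processed[request_id]
        self.balances[src] -= amount
        self.balances[dst] += amount
        result = {"ok": True, "balances": dict(self.balances)}
        self.processed[request_id] = result
        return result

svc = TransferService()
svc.transfer("req-42", "alice", "bob", 25)
svc.transfer("req-42", "alice", "bob", 25)  # retried request: no double transfer
assert svc.balances == {"alice": 75, "bob": 25}
```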
An intuitive example of an idempotent API is bank account transfers: if a user initiates an online bank transfer and their internet goes down halfway through processing, you want to make sure the user can initiate the transfer again and your system will correctly only transfer the amount once.
Workflow Engines
An alternative to leader election.