Chapter 5: Coordination & Agreement Flashcards
Synchronous distributed system model
- Each message transmitted over a channel is received within a known bounded time
- Time to execute each step of a process has known lower and upper bounds
- Each process has a local clock whose drift rate from real time has a known bound
Asynchronous distributed system model
- Messages may need an arbitrary time to be transmitted
- Each step of a process may take an arbitrary time to complete
- Clock drift rates are arbitrary
Thought experiment - Some takeaways
- Asynchronous model is representative of the Internet and many practical distributed applications
- Solutions valid for asynchronous distributed systems are also valid for synchronous ones
- Many design problems cannot be solved in an asynchronous world (e.g., when vs. who leads the attack)
- Apply timeouts and timing constraints to reduce uncertainty and to bring the synchronous model into the picture
Failure models
- Processes and communication channels may fail, i.e., depart from correct and expected behaviour
- Failure model offers understanding for effects of failures
- Failure types
- Omission failures (one or more responses fail)
- Arbitrary failures (wrong value)
- Timing failures (outside interval)
Process omission failures
- Fails by halting, i.e., crashes (e.g., bluescreen)
- Fail-stop failure
- Process halts and remains halted
- Others can detect this state
- Crash failure
- Process halts and remains halted
- Others can (only) suspect this state
Communication omission failure
- Receiving process does not receive message
- Typically because of
- Buffer overflow
- Network partition
- Failure types
- Send-omission
- Receive-omission
- Channel-omission
- Process vs. communication failure: Generally, impossible to distinguish
In our thought experiment
- Could Army i detect whether Army j has been defeated? (i.e., process failed?)
- Assuming the enemy could attack
- While undefeated, send periodic message (e.g., heartbeat)
- Assume enemy cannot corrupt messages
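The heartbeat idea above can be sketched as a timeout-based failure detector. This is a minimal sketch (class and method names are my own, not from the lecture); note the caveat in the comments: in an asynchronous system a missed heartbeat only justifies a *suspicion*, not certainty.

```python
class HeartbeatDetector:
    """Suspects a peer that has been silent longer than `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, peer, now):
        # Record the arrival time of the peer's latest heartbeat.
        self.last_seen[peer] = now

    def suspected(self, peer, now):
        # A peer never heard from, or silent beyond the timeout, is suspected.
        # In an asynchronous system this can be a false alarm: the message
        # may merely be slow, not lost, and the peer may still be alive.
        last = self.last_seen.get(peer)
        return last is None or now - last > self.timeout
```

In the thought experiment, Army i would run such a detector on Army j's heartbeats; a timeout lets it suspect, but never prove, that Army j has been defeated.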
Arbitrary failures
- Encompasses prior failure classes
- Process not just stops or crashes but processes requests incorrectly, corrupts state, produces inconsistent & incorrect output (commission failures)
- Does not follow agreed protocol, messages get corrupted, messages introduced
- Result: System behaves unpredictably
- A.k.a. Byzantine failures (process produces wrong output)
Access to shared variables and race conditions
- Imagine a globally shared variable counter in a process accessible to multiple threads:
- counter++
- counter--
- If counter starts at 5, the final value could be 4, 5, or 6, depending on the interleaving
- Race condition:
- Several threads manipulate shared data concurrently. The final value of the data depends upon which thread finishes last.
- To prevent race conditions, concurrent processes must be synchronized
- The statements: counter++; counter‐‐; must each be executed atomically.
- Atomic operation means an operation that completes in its entirety without interruption.
- This is achieved through synchronization primitives
- Shared variable accessed in critical section, protected by synchronization primitives
- Known as the critical section problem or as mutual exclusion
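The counter++/counter-- race and its fix can be sketched in Python with `threading.Lock` as the synchronization primitive (a sketch; variable names are my own):

```python
import threading

counter = 0
lock = threading.Lock()
N = 100_000

def increment():
    global counter
    for _ in range(N):
        with lock:          # critical section: read-modify-write is atomic
            counter += 1

def decrement():
    global counter
    for _ in range(N):
        with lock:          # same lock protects the shared variable
            counter -= 1

t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=decrement)
t1.start(); t2.start()
t1.join(); t2.join()
# With the lock, the interleaving no longer matters: counter ends at 0.
# Without it, a lost update (one thread overwriting the other's write
# between its read and its write) could leave counter anywhere in [-N, N].
```

Removing the `with lock:` lines reintroduces the race condition: each `counter += 1` is a read, an add, and a write, and a thread switch between them loses an update.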
Distributed mutual exclusion
- In distributed systems, mutual exclusion is more complex due to lack of:
- Shared memory
- Timing issues
- Lack of a global clock
- Event ordering
- Applications
- Accessing a shared resource in distributed systems
- Communicating over a wireless channel (e.g., IEEE 802.11)
Critical section (CS) problem: No shared memory
- System with n processes
- Processes access shared resources in CS
- Coordinate access to CS via message passing
- Application‐level protocol for accessing CS
- Enter_CS() – enter CS, block if necessary
- ResourceAccess() – access shared resource in CS
- Exit_CS() – leave CS
Assumptions for CS
- System is asynchronous
- No bound on delays, no bound on clock drift, etc.
- Processes do not fail
- Message delivery is reliable
- Any message sent, is eventually delivered intact and exactly once
- Client processes are well-behaved and spend finite time accessing resources in the CS
Mutual exclusion requirements CS
- Safety
- At most one process in the critical section at a time
- Liveness
- Requests to enter & exit CS eventually succeed
- No deadlock
- Fairness (order & starvation)
- If one request to enter CS happened‐before another one, then entry to CS is granted in that order
- Requests are ordered such that no process enters the critical section twice while another waits to enter (i.e., no starvation)
Deadlock & starvation
- Deadlock: Two or more processes become stuck indefinitely while attempting to enter and exit the CS, by virtue of their mutual dependency
- Starvation: The indefinite postponement of entry to CS for a process that has requested it.
- Can we order entry to the CS by the times processes requested it?

Performance metrics
- Bandwidth: Number of messages sent, received or both
- Synchronization delay: Time between one process exiting the critical section and the next entering
- Client delay: Delay at entry and exit (response time)
- We do not measure client access to resources protected by the critical section (assume finite)
Solution strategies
- Centralized strategy
- Divide processes into master and slaves; master dictates actions of slaves
- Distributed strategy: Each process independently decides actions, based on local knowledge of others' state
- Token-based: A node is allowed in the critical section (CS) if it has a token. Tokens are passed from site to site, in some (priority) order.
- Non-token-based: A node enters CS when an assertion becomes true. A node communicates with other nodes to obtain their states and decides whether the assertion is true or false.
Centralized strategy
- Elect a leader (details, cf. second part of this lecture)
- Meets requirements: Safety, liveness, no starvation
- Does solution meet the ordering requirement?
- Advantages
- Simple to implement
- Disadvantages
- Single point of failure
- Bottleneck, network congestion, timeout
- Deadlock potential for multiple resources with separate servers
- Enter_CS() - Two messages: Request & Grant
- Delays the requesting process by round trip for messages
- Exit_CS() - One message: Release message
- Assuming asynchronous message passing, incurs no delay
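The centralized strategy can be sketched as a coordinator that handles Request and Release messages with a FIFO queue. This is a single-process simulation of the message flow, not a networked implementation; class and method names are my own:

```python
from collections import deque

class LockServer:
    """Central coordinator: grants the CS to one client at a time, FIFO order."""

    def __init__(self):
        self.holder = None        # client currently inside the CS, if any
        self.queue = deque()      # pending requests, oldest first
        self.granted = []         # log of Grant messages, in order sent

    def request(self, client):
        # Enter_CS(): client sends Request; server grants or queues it.
        if self.holder is None:
            self.holder = client
            self.granted.append(client)   # server replies with Grant
        else:
            self.queue.append(client)     # client blocks until granted

    def release(self, client):
        # Exit_CS(): client sends Release; server grants to next in queue.
        assert client == self.holder, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        if self.holder is not None:
            self.granted.append(self.holder)

# Three clients contend; grants come out strictly one at a time, FIFO.
server = LockServer()
for p in ("P1", "P2", "P3"):
    server.request(p)
server.release("P1")
server.release("P2")
server.release("P3")
```

Safety holds because `holder` tracks at most one client; the FIFO queue gives liveness and no starvation, but grant order reflects arrival at the server, which need not match the happened-before order of the requests.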

Distributed strategy
- In a distributed algorithm, the same decision must be made on each node, independent of the other nodes in the system.
- Selected algorithms
- Ring‐based algorithm
- Logical clocks‐based algorithm (Lamport, 1976)
- Ricart & Agrawala, 1981
- Maekawa, 1985
- Many more
Ring‐based algorithm
- Logical ring of processes
- Each Pi knows its successor, P(i+1) mod N
- Logical topology a priori unrelated to physical topology
Analysis:
- Safe: Node enters CS only if it holds the token
- Live: Since finite work is done by each node (can't re-enter), the token eventually gets to each node
- Fair: Ordering is based on ring topology
- Performance
- Constantly consumes network bandwidth, even when no processes seek entry, except while a process is inside the CS
- Synchronization delay: Between 1 and N messages
- Client delay: Between 0 and N messages for entry; 1 message for exit
- Problems
- Lost token
- Duplicated token
- Failed node
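The token circulation can be sketched as a small simulation (function and parameter names are my own; real deployments send the token as a network message). A node appears in the entry log only while holding the token, which is the safety property:

```python
def ring_token_rounds(n, wants, rounds=1):
    """Pass the token around a ring of n nodes for `rounds` full circulations.
    `wants` is the set of nodes that want the CS; a node uses the CS while
    holding the token, then forwards the token to its successor."""
    entered = []
    token = 0                        # node currently holding the token
    for _ in range(rounds * n):      # one token-forwarding message per step
        if token in wants:
            entered.append(token)    # safety: only the token holder enters
            wants.discard(token)
        token = (token + 1) % n      # forward token to successor (i+1) mod n
    return entered
```

Note the bandwidth cost the analysis mentions: the loop forwards the token on every step even when `wants` is empty, i.e., the ring consumes messages whether or not anyone needs the CS.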

Ricart & Agrawala, 1981, algorithm
- Basic idea
- Processes wanting to enter CS, multicast a request to all processes
- Enter CS, once all have granted request
- Use Lamport timestamps (logical clock values) to order requests: <T, Pi>, where T is the timestamp and Pi the process identifier; process identities break ties
- Each process is in one of three states
- Released ‐ Outside the CS
- Wanted ‐ Wanting to enter CS
- Held ‐ Inside CS
- If a process requests to enter CS and all other processes are in the Released state, entry is granted by each process
- If a process, Pi, requests to enter CS and another process, Pk, is inside the CS (Held state), then Pk will not reply, until it is finished with the CS
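The rules above reduce to a single reply decision at each receiver. The sketch below captures that decision (a sketch of the reply rule only, not the full messaging; names are my own). Requests are `(timestamp, pid)` pairs, so Python's tuple comparison gives the Lamport order with pid tie-breaking for free:

```python
def defer_reply(state, own_request, incoming_request):
    """Ricart & Agrawala reply rule, seen from the receiving process.
    `state` is one of "RELEASED", "WANTED", "HELD"; requests are
    (timestamp, pid) pairs. Returns True if the reply must be deferred
    until the receiver leaves the CS."""
    if state == "RELEASED":
        return False                 # not interested: reply at once
    if state == "HELD":
        return True                  # inside the CS: defer until exit
    # state == "WANTED": the earlier (smaller) request wins the tie
    return own_request < incoming_request
```

A requester enters the CS once every other process has replied; deferred replies are released on exit, which gives both safety and happened-before ordering of entries.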

Ricart & Agrawala algorithm (picture only)

Fault‐tolerance aspects
of mutual exclusion algorithms
- None of the algorithms tolerates message loss
- Ring‐based algorithm cannot tolerate crash of single process
- Centralized algorithm can tolerate crash of clients that are neither in the CS, nor have requested entry
- The described Ricart & Agrawala algorithm does not tolerate faults either
Leader election
- Problem: A group of processes, P1, …, Pn, must agree on some unique Pk to be the “leader”
- Often, the leader then coordinates another activity
- Election runs when the leader (a.k.a. coordinator) has failed
- Any process that hasn't heard from the leader in some predefined time interval may call for an election
- False alarm is a possibility (new election initiated, while current leader still alive)
- Several processes may initiate elections concurrently
- Algorithm should allow for process crash during election
Process identifier
- Elected leader must be unique
- The one with the largest identifier
- Identifier could be any "useful value", i.e., unique & totally ordered
- E.g., based on OS process identifiers, IP adr., port
- E.g., based on least computational load
- <1/load, i>, load > 0, i is process ID to break ties
- Each process, Pi, has a variable electedi that holds the value of the leader or is "┴" (undefined)
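The <1/load, i> identifier can be illustrated with tuple comparison (a sketch; function name and sample values are my own). The "largest identifier" rule then elects the least-loaded process, with the process ID breaking ties:

```python
def elect(candidates):
    """Pick the leader as the process with the largest identifier.
    Each candidate is a (load, pid) pair; its identifier is (1/load, pid),
    so lower load wins and pid breaks ties (load > 0 assumed)."""
    return max(candidates, key=lambda c: (1 / c[0], c[1]))

# Three processes with their current loads; P2 and P3 tie on load.
procs = [(4.0, 1), (2.0, 2), (2.0, 3)]
leader = elect(procs)   # the tie between P2 and P3 goes to the larger pid
```

Any value works as long as it is unique and totally ordered, which is exactly what the pair construction guarantees.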
Applications of leader election
- Berkeley clock synchronization algorithm
- Centralized mutual exclusion algorithm
- Primary‐backup replication algorithms
- Two‐phase commit protocol
- Used in Amazon Dynamo KV‐store (replication)
- Used in Google BigTable & GFS
As compared to mutual exclusion
- Losers return to what they were doing, instead of waiting
- Fast election is important, not starvation avoidance
- All processes must know the result, not just the winner
Election algorithm requirements
- Safety
- A participating process, Pi, has variable electedi = "┴" or electedi = P, where P is chosen as the non-crashed process at the end of the election run with the largest identifier.
- Liveness
- All processes participate in the election and eventually either set electedi ≠ "┴" or crash.
- Performance
- Total number of messages sent (bandwidth) …
Ring‐based algorithm, 1978: Overview
- Essentially three phases
- Initialization
- Election phase (concurrent calls for election)
- Determine election winner (voting phase)
- Reach point of message originator
- Announce the leader phase (victory announcement phase)
Ring‐based election algorithm
- Construct a ring (cf. ring‐based mutual exclusion)
- Assume, each P has unique ID
- Assume, no failures and asynchronous system
- Any Pi may begin an election by sending an election message to its successor
- Election message holds Pi’s ID
- Upon receiving an election message, Pk compares its own ID to the ID in the received message
- If message ID is greater: Forward election message
- If … smaller: Forward election message with Pk’s ID, unless Pk has already been participating in the current election run
- If … equal: Pk is now leader. Forward victory message to notify all other processes
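The three rules above can be sketched as a simulation that also counts messages (names are my own; `ids[i]` is the ID of the process at ring position i, and messages travel to successors only):

```python
def ring_election(ids, starter):
    """Ring election started by the process at position `starter`.
    Returns (winning_id, total_messages)."""
    n = len(ids)
    carried = ids[starter]           # starter puts its own ID in the message
    pos = (starter + 1) % n
    msgs = 1                         # ...and sends it to its successor
    while carried != ids[pos]:       # equal ID means the message came home
        # Forward the larger of the carried ID and the receiver's own ID.
        carried = max(carried, ids[pos])
        pos = (pos + 1) % n
        msgs += 1
    msgs += n                        # victory announcement circulates once
    return carried, msgs
```

With `ids = [1, 2, 3]` and `starter = 0`, the highest ID sits at the starter's anti-clockwise neighbour: the worst case, costing 3N - 1 = 8 messages, matching the analysis below.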

Different cases - Ring-based algorithm

Ring‐based algorithm: Example (Picture only)

Ring‐based election algorithm analysis
- Worst case
- Single process calls for election
- Anti-clockwise neighbour has highest identifier
- N‐1 messages to reach this neighbour
- N messages to reach point of origin
- Leader announcement takes another N messages
- For a grand total of 3N – 1 messages
Bully algorithm, Garcia‐Molina, 1982
- Assumes each process has a unique ID, reliable message delivery, and synchronous system with timeouts
- Assumes processes know each others’ IDs and can communicate with one another
- Higher IDs have priority
- Can "bully" lower numbered processes
- Initiated by any process that suspects failure of the leader
- Employs timeouts to detect failure
- Tolerates processes crashing during elections
Bully algorithm messages
- Operates with three types of messages
- Election announces an election
- Answer responds to an election message
- Coordination announces the identity of leader
- Algorithm is triggered by any process that detects (suspects) the leader to have crashed
Pi detects failure of leader
- Pi sends an election message to every process Pk with k > i
- Any Pk receiving the election message responds with an answer message and starts another election
- Processes Pj with j < i are not contacted (they cannot win)
- If Pi does not receive any answer message (timeout), then it broadcasts victory via a coordination message
- If Pi does receive answer message(s), then it waits to receive a coordinator message
- Restarts the election, if no coordination message received
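The steps above can be sketched as a simplified simulation (names are my own). To stay short it collapses the concurrent sub-elections into following only the highest responder, so the message counts are a lower bound on real traffic, not an exact trace:

```python
def bully_election(alive_ids, initiator):
    """One bully-election round among `alive_ids` (crashed processes are
    simply omitted), started by `initiator`. Returns the winner plus
    counts of election, answer, and coordinator messages."""
    election = answer = 0
    p = initiator
    while True:
        higher = [q for q in alive_ids if q > p]
        election += len(higher)      # p challenges every higher ID
        if not higher:               # timeout, no answer: p is the leader
            coordinator = len(alive_ids) - 1   # victory to everyone else
            return p, election, answer, coordinator
        answer += len(higher)        # each live higher process answers...
        p = max(higher)              # ...and the highest carries on alone
```

For example, with processes {1, 2, 3, 5} alive (the old leader 4 crashed) and process 1 initiating, process 5 wins and broadcasts the coordinator message.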

Upon crash of a process (bully algorithm)
- Start of a new process replacing the crashed one, with the crashed one's ID (e.g., read from disk)
- Process may determine that it has the highest identifier and thus pronounce itself leader
- Even though the system may have an existing leader (elected while the process was down)
- New process "bullies" the current leader