Chapter 5: Coordination & Agreement Flashcards

1
Q

Synchronous distributed system model

A
  • Each message transmitted over a channel is received within a known bounded time
  • Time to execute each step of a process has known lower and upper bounds
  • Each process has a local clock whose drift rate from real time has a known bound
2
Q

Asynchronous distributed system model

A
  • Messages may need an arbitrary time to be transmitted
  • Each step of a process may take an arbitrary time to complete
  • Clock drift rates are arbitrary
3
Q

Thought experiment - Some takeaways

A
  • The asynchronous model is representative of the Internet and many practical distributed applications
  • Solutions valid for asynchronous distributed systems are also valid for synchronous ones
  • Many design problems cannot be solved in an asynchronous world (e.g., when vs. who leads the attack)
  • Apply timeouts and timing constraints to reduce uncertainty and to bring the synchronous model into the picture
4
Q

Failure models

A
  • Processes and communication channels may fail, i.e., depart from correct and expected behaviour
  • Failure model offers understanding for effects of failures
  • Failure types
    • Omission failures (one or more responses fail)
    • Arbitrary failures (wrong value)
    • Timing failures (outside interval)
5
Q

Process omission failures

A
  • Fails by halting, i.e., crashes (e.g. bluescreen)
  • Fail‐stop failure
    • Process halts and remains halted
    • Others can detect this state
  • Crash failure
    • Process halts and remains halted
    • Others can (only) suspect this state
6
Q

Communication omission failure

A
  • Receiving process does not receive message
  • Typically because of
    • Buffer overflow
    • Network partition
  • Failure types
    • Send-omission
    • Receive-omission
    • Channel-omission
  • Process vs. communication failure: Generally, impossible to distinguish
7
Q

In our thought experiment

A
  • Could Army i detect whether Army j has been defeated? (i.e., has the process failed?)
  • Assuming the enemy could attack
  • While undefeated, send periodic messages (e.g., heartbeat)
    • Assume the enemy cannot corrupt messages
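The heartbeat idea above can be sketched as a simple timeout-based failure suspector; all names here are illustrative, and note that in an asynchronous system a missing heartbeat can only raise *suspicion* of a crash, never certainty:

```python
import time

class HeartbeatMonitor:
    """Minimal heartbeat-based failure suspector (illustrative sketch)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}        # army_id -> time of last heartbeat

    def heartbeat(self, army_id, now=None):
        # Record that a periodic "still undefeated" message arrived.
        self.last_seen[army_id] = now if now is not None else time.monotonic()

    def suspected(self, army_id, now=None):
        # Suspect a crash if we have not heard anything within the timeout.
        now = now if now is not None else time.monotonic()
        seen = self.last_seen.get(army_id)
        return seen is None or now - seen > self.timeout

m = HeartbeatMonitor(timeout=2.0)
m.heartbeat("army_j", now=100.0)
print(m.suspected("army_j", now=101.0))  # False: heartbeat is recent
print(m.suspected("army_j", now=103.5))  # True: silence exceeds the timeout
```

A late message is indistinguishable from a crashed sender, which is exactly why the slide says others can only *suspect* a crash failure.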
8
Q

Arbitrary failures

A
  • Encompasses prior failure classes
  • The process does not just stop or crash but processes requests incorrectly, corrupts its state, produces inconsistent & incorrect output (commission failures)
  • Does not follow the agreed protocol; messages get corrupted or spurious messages are introduced
  • Result: System behaves unpredictably
  • A.k.a. Byzantine failures (process produces wrong output)
9
Q

Access to shared variables and RACE condition

A
  • Imagine a globally shared variable counter in a process accessible to multiple threads:
    • counter++
    • counter--
    • Starting from 5, the final value could be 4, 5 or 6
  • Race condition:
    • Several threads manipulate shared data concurrently. The final value of the data depends upon which thread finishes last.
    • To prevent race conditions, concurrent processes must be synchronized
  • The statements: counter++; counter‐‐; must each be executed atomically.
  • Atomic operation means an operation that completes in its entirety without interruption.
  • This is achieved through synchronization primitives
  • Shared variable accessed in critical section, protected by synchronization primitives
  • Known as the critical section problem or as mutual exclusion
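The lost-update race and its fix can be demonstrated with a short sketch (thread counts and iteration numbers are arbitrary choices for illustration):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:           # critical section: remove the lock to observe lost updates
            counter += 1     # read-modify-write is NOT atomic without it

def decrement(n):
    global counter
    for _ in range(n):
        with lock:
            counter -= 1

t1 = threading.Thread(target=increment, args=(100_000,))
t2 = threading.Thread(target=decrement, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # 0 with the lock; without it, often a nonzero "lost update" value
```

The `with lock:` block is the synchronization primitive making each `counter += 1` / `counter -= 1` effectively atomic, which is precisely the critical section problem described above.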
10
Q

Distributed mutual exclusion

A
  • In distributed systems, mutual exclusion is more complex due to:
    • No shared memory
    • Timing issues
    • Lack of a global clock
    • No agreed event ordering
  • Applications
    • Accessing a shared resource in distributed systems
    • Communicating over a wireless channel (e.g., 802.11)
11
Q

Critical section (CS) problem: No shared memory

A
  • System with n processes
  • Processes access shared resources in CS
  • Coordinate access to CS via message passing
  • Application‐level protocol for accessing CS
    • Enter_CS() – enter CS, block if necessary
    • ResourceAccess() – access shared resource in CS
    • Exit_CS() – leave CS
12
Q

Assumptions for CS

A
  • System is asynchronous
    • No bound on delays, no bound on clock drift, etc.
  • Processes do not fail
  • Message delivery is reliable
    • Any message sent is eventually delivered intact, exactly once
  • Client processes are well‐behaved and spend finite time accessing resources in the CS
13
Q

Mutual exclusion requirements CS

A
  • Safety
    • At most one process in the critical section at a time
  • Liveness
    • Requests to enter & exit CS eventually succeed
    • No deadlock
  • Fairness (order & starvation)
    • If one request to enter CS happened‐before another one, then entry to CS is granted in that order
    • Requests are ordered such that no process enters the critical section twice while another waits to enter (i.e., no starvation)
14
Q

Deadlock & starvation

A
  • Deadlock: Two or more processes become stuck indefinitely while attempting to enter and exit the CS, by virtue of their mutual dependency
  • Starvation: The indefinite postponement of entry to CS for a process that has requested it.
  • Can we order entry to the CS by the times processes requested it?
15
Q

Performance metrics

A
  • Bandwidth: Number of messages sent, received or both
  • Synchronization delay: Time between one process exiting the critical section and the next entering
  • Client delay: Delay at entry and exit (response time)
  • We do not measure client access to resources protected by the critical section (assume finite)
16
Q

Solution strategies

A
  • Centralized strategy
    • Divide processes into master and slaves, master dictates actions of slaves
  • Distributed strategy: Each process independently decides actions, based on local knowledge of others’ state
    • Token‐based: A node is allowed in the critical section (CS) if it has a token. Tokens are passed from site to site, in some (priority) order.
    • Non‐token‐based: A node enters CS when an assertion becomes true. A node communicates with other nodes to obtain their states and decides whether the assertion is true or false.
17
Q

Centralized strategy

A
  • Elect a leader (details in the second part of this lecture)
  • Meets requirements: Safety, liveness, no starvation
    • Does the solution meet the ordering requirement?
  • Advantages
    • Simple to implement
  • Disadvantages
    • Single point of failure
    • Bottleneck, network congestion, timeouts
    • Deadlock potential for multiple resources with separate servers
  • Enter_CS()
    • Two messages: Request & Grant
    • Delays the requesting process by a message round trip
  • Exit_CS()
    • One message: Release
    • Assuming asynchronous message passing, incurs no delay
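The Request/Grant/Release protocol above can be sketched as a coordinator with a FIFO queue (class and method names are illustrative; real implementations would exchange network messages):

```python
from collections import deque

class Coordinator:
    """Sketch of the centralized mutual-exclusion server."""

    def __init__(self):
        self.holder = None     # process currently granted the CS
        self.queue = deque()   # waiting requesters, FIFO
        self.granted = []      # log of Grant messages, for illustration

    def request(self, pid):
        # Enter_CS(): process sends a Request message.
        if self.holder is None:
            self.holder = pid
            self.granted.append(pid)   # Grant sent immediately
        else:
            self.queue.append(pid)     # requester blocks until granted

    def release(self, pid):
        # Exit_CS(): process sends a Release message.
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        if self.holder is not None:
            self.granted.append(self.holder)

c = Coordinator()
c.request("P1"); c.request("P2"); c.request("P3")
c.release("P1"); c.release("P2"); c.release("P3")
print(c.granted)  # ['P1', 'P2', 'P3']: FIFO service, so no starvation
```

The FIFO queue gives the no-starvation property; note it orders requests by arrival at the server, which is why the happened-before ordering requirement is a separate question.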
18
Q

Distributed strategy

A
  • In a distributed algorithm, the same decision must be made on each node, independent of the other nodes in the system.
  • Selected algorithms
    • Ring‐based algorithm
    • Logical clocks‐based algorithm (Lamport, 1976)
    • Ricart & Agrawala, 1981
    • Maekawa, 1985
    • Many more
19
Q

Ring‐based algorithm

A
  • Logical ring of processes
  • Each Pi knows its successor, P(i+1) mod N
  • Logical topology a priori unrelated to physical topology
  • Analysis
    • Safe: A node enters the CS only if it holds the token
    • Live: Since finite work is done by each node (no re‐entry), the token eventually gets to each node
    • Fair: Ordering is based on ring topology
  • Performance
    • Constantly consumes network bandwidth, even when no process seeks entry (except while a process is inside the CS)
    • Synchronization delay: Between 1 and N messages
    • Client delay: Between 0 and N messages for entry; 1 message for exit
  • Problems
    • Lost token
    • Duplicated token
    • Failed node
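Token circulation can be simulated in a few lines (a synchronous, failure-free sketch; function and parameter names are illustrative):

```python
def circulate(n, wants_cs, rounds=1):
    """Simulate token passing on a ring of n processes.

    wants_cs: set of process indices that want to enter the CS.
    Returns the order in which processes entered the CS.
    """
    entered = []
    token = 0                      # index of the process holding the token
    for _ in range(rounds * n):
        if token in wants_cs:
            entered.append(token)  # holding the token -> enter CS, then exit
            wants_cs.discard(token)
        token = (token + 1) % n    # forward the token to the successor
    return entered

print(circulate(5, {3, 1}))  # [1, 3]: entry order follows ring position, not request time
```

Safety follows from token uniqueness; the output also illustrates the fairness caveat: ordering is by ring position, not by when the processes asked.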
20
Q

Ricart & Agrawala, 1981, algorithm

A
  • Basic idea
    • Processes wanting to enter the CS multicast a request to all processes
    • Enter the CS once all have granted the request
  • Use Lamport timestamps (logical clock values) to order requests: ⟨T, Pi⟩, where T is the timestamp and Pi the process identifier, used to break ties
  • Each process is in one of three states
    • Released: Outside the CS
    • Wanted: Wanting to enter the CS
    • Held: Inside the CS
  • If a process requests to enter the CS and all other processes are in the Released state, entry is granted by each process
  • If a process, Pi, requests to enter the CS and another process, Pk, is inside the CS (Held state), then Pk will not reply until it is finished with the CS
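The reply-or-defer rule at the heart of the algorithm can be sketched as follows (simulated message passing; the class layout is illustrative, not the full algorithm):

```python
RELEASED, WANTED, HELD = "Released", "Wanted", "Held"

class Process:
    """Sketch of the Ricart & Agrawala reply rule."""

    def __init__(self, pid):
        self.pid, self.state = pid, RELEASED
        self.my_request = None   # (timestamp, pid) of own pending request
        self.deferred = []       # requests answered only after Exit_CS

    def on_request(self, ts, pid):
        """Handle an incoming request; True = reply now, False = defer."""
        # Defer if we are in the CS, or we want it and our own request
        # is earlier in the (timestamp, pid) total order.
        if self.state == HELD or (
            self.state == WANTED and self.my_request < (ts, pid)
        ):
            self.deferred.append((ts, pid))
            return False
        return True

p = Process(2)
p.state, p.my_request = WANTED, (5, 2)
print(p.on_request(4, 1))  # True: (4, 1) < (5, 2), the earlier request wins
print(p.on_request(5, 3))  # False: tie on timestamp, pid 2 < 3 breaks it
```

Tuple comparison on ⟨T, Pi⟩ gives exactly the total order the card describes: timestamps first, process identifiers to break ties.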
21
Q

Ricart & Agrawala algorithm

22
Q

Fault‐tolerance aspects
of mutual exclusion algorithms

A
  • None of the algorithms tolerates message loss
  • Ring‐based algorithm cannot tolerate crash of single process
  • Centralized algorithm can tolerate crash of clients that are neither in the CS, nor have requested entry
  • The described Ricart & Agrawala algorithm does not tolerate faults either
23
Q

Leader election

A
  • Problem: A group of processes, P1, …, Pn, must agree on some unique Pk to be the “leader”
  • Often, the leader then coordinates another activity
  • An election runs when the leader (a.k.a. coordinator) has failed
  • Any process that hasn’t heard from the leader within some predefined time interval may call for an election
  • A false alarm is possible (new election initiated while the current leader is still alive)
  • Several processes may initiate elections concurrently
  • The algorithm should allow for process crashes during an election
24
Q

Process identifier

A
  • The elected leader must be unique
  • Convention: the one with the largest identifier
  • The identifier could be any “useful value”, i.e., unique & totally ordered
  • E.g., based on OS process identifiers, IP address, port
  • E.g., based on least computational load
    • ⟨1/load, i⟩, load > 0, where i is the process ID to break ties
  • Each process, Pi, has a variable electedi that holds the value of the leader or is “┴” (undefined)
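The load-based identifier works by plain lexicographic tuple comparison, which can be checked in two lines (function name and load values are illustrative):

```python
def ident(load, pid):
    """Totally ordered identifier: least-loaded process wins, pid breaks ties.
    Assumes load > 0."""
    return (1.0 / load, pid)

# Three candidates: P1 with load 4, P2 and P3 both with load 2.
candidates = [ident(4.0, 1), ident(2.0, 2), ident(2.0, 3)]
print(max(candidates))  # (0.5, 3): lowest load wins, higher pid breaks the tie
```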
25
Applications of leader election
  • **Berkeley clock synchronization algorithm**
  • **Centralized mutual exclusion algorithm**
  • Primary‐backup replication algorithms
  • Two‐phase commit protocol
  • Used in Amazon Dynamo KV‐store (replication)
  • **Used in Google BigTable & GFS**
26
As compared to mutual exclusion
  • **Losers return to what they were doing** … instead of waiting
  • **Fast election** is important … not starvation avoidance
  • **All processes must know the result** … not just the winner
27
Election algorithm requirement
  • **Safety**
    • A participating process, Pi, has variable electedi = “┴” or electedi = P, where P is chosen as the non‐crashed process at the end of the election run with the largest identifier.
  • **Liveness**
    • All processes participate in the election and eventually either set electedi ≠ “┴” or crash.
  • **Performance**
    • Total number of messages sent (bandwidth) …
28
Ring‐based algorithm, 1978: Overview
  • Essentially three phases
    1. **Initialization**
    2. **Election phase** (concurrent calls for election)
      • Determine the election winner (voting phase)
      • Reach the point of the message originator
    3. **Announce the leader phase** (victory announcement phase)
29
Ring‐based election algorithm
  • Construct a ring (cf. ring‐based mutual exclusion)
  • Assume **each P has a unique ID**
  • Assume **no failures and an asynchronous system**
  • **Any Pi may begin an election by sending an election message to its successor**
  • **The election message holds Pi’s ID**
  • **Upon receiving** an election message, **Pk compares its own ID** to the ID in the received message
    • **If the message ID is greater: Forward the _election message_**
    • **If smaller: Forward the _election message_ with Pk’s ID**, unless Pk is already participating in the current election run
    • **If equal: Pk is now the leader. Forward a _victory message_ to notify all other processes**
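The forwarding rules above can be simulated for a single initiator (a synchronous, failure-free sketch; the message count excludes the victory announcement):

```python
def ring_election(ids, starter):
    """Simulate a single-initiator ring election.

    ids: process IDs in ring order (assumed unique).
    Returns (leader_id, number_of_election_messages).
    """
    n = len(ids)
    msgs = 0
    pos = starter
    carried = ids[starter]           # ID carried by the election message
    while True:
        nxt = (pos + 1) % n
        msgs += 1                    # forward the election message
        if ids[nxt] == carried:
            return carried, msgs     # own ID came back: nxt is the leader
        carried = max(carried, ids[nxt])  # greater ID replaces smaller
        pos = nxt

leader, msgs = ring_election([3, 17, 9, 11], starter=0)
print(leader, msgs)  # 17 5: the largest ID wins after 5 election messages
```

Tracing the message count for the worst case (highest ID at the anti-clockwise neighbour) reproduces the 3N − 1 total derived on the next card, once the N announcement messages are added.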
30
Different cases - Ring-based algorithm
31
Ring‐based algorithm: Example (Picture only)
32
Ring‐based election algorithm analysis
  • **Worst case**
    • A single process calls for the election
    • **Anti‐clockwise neighbour has the highest identifier**
    • N‐1 messages to reach this neighbour
    • N messages to reach the point of origin
    • Leader announcement takes another N messages
    • Grand total: 3N – 1 messages
33
Bully algorithm, Garcia‐Molina, 1982
  • Assumes **each process has a unique ID**, **reliable message delivery, and a synchronous system with timeouts**
  • Assumes **processes know each others’ IDs** and can communicate with one another
  • **Higher IDs have priority**
    • Can “bully” lower‐numbered processes
  • Initiated by any process that suspects failure of the leader
  • Employs timeouts to detect failure
  • Tolerates processes crashing during elections
34
Bully algorithm messages
  • Operates with three types of messages
    • **Election: announces an election**
    • **Answer: responds to an election message**
    • **Coordination: announces the identity of the leader**
  • The algorithm is triggered by any process that detects (suspects) that the leader has crashed
35
Pi detects failure of leader
  • For any j < i and any k > i
    1. **Pi broadcasts an election message** to all higher‐ID processes
    2. **Any Pk receiving the election message responds with an answer** message and **starts another election**
    3. Any Pj receiving the election message does not respond (as j < i)
    4. **If Pi does not receive any answer message (timeout), it broadcasts victory via a coordination message**
    5. **If Pi does receive answer message(s), it waits to receive a coordinator message**
    6. Pi restarts the election if no coordination message is received
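The steps above collapse into a simple recursion when simulated synchronously with a known set of alive processes (an illustrative sketch; real bully elections use timeouts rather than a global `alive_ids` set):

```python
def bully_election(initiator, alive_ids):
    """Simulate one bully election round; returns the elected leader.

    initiator: ID of the process that suspects the leader crashed.
    alive_ids: set of IDs of non-crashed processes (global view, for the sketch).
    """
    higher = [p for p in alive_ids if p > initiator]
    if not higher:
        # Timeout with no answer: initiator wins and would broadcast
        # a coordination (victory) message.
        return initiator
    # Any answering higher process takes over and starts its own election;
    # ultimately the highest alive process wins.
    return bully_election(max(higher), alive_ids)

print(bully_election(2, alive_ids={1, 2, 4, 5}))  # 5: highest alive ID wins
print(bully_election(5, alive_ids={1, 2, 5}))     # 5: initiator is already highest
```

The recursion mirrors step 2: every answer message implicitly hands the election to a higher-ID process, so the outcome is always the largest alive identifier.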
36
Bully algorithm: Upon crash of a process

  • **Start** of a **new process replacing the crashed one, with the crashed one’s ID (e.g., read from disk)**
  • **The process may determine that it has the highest identifier and thus pronounce itself leader**
    • Even though the system may have an existing leader (elected during the crash)
  • **The new process “bullies” the current leader**