Data Intensive Ch9 - Consistency and Consensus Flashcards
Context map
Timestamp ordering
Lamport timestamps
Causal ordering
Total order broadcast
Distributed transactions
Linearizability
Compare-and-set
Global constraints
Failure detectors - membership services
Summary
Linearizability - consistency model
Goal - make replicated data appear as though there was only a single copy; make all ops appear as if they acted on that copy atomically
Totally ordered, single timeline
Behaves like a variable in a single-threaded program; slow; susceptible to network delays
Causality - imposes ordering on events - what happened before what; cause and effect
Weaker consistency model - some things can be concurrent
Timeline with branching & merging
Less sensitive to network problems - less coordination overhead
Can be captured using Lamport timestamps
Causality is not sufficient for e.g. ensuring name uniqueness
Consensus required
Achieving consensus - deciding something in a way all nodes agree on what was decided - decision is irrevocable
Many problems are reducible to consensus
- > Linearizable compare-and-set (atomically decide whether to set the value, based on whether its current value equals the expected one)
- > Atomic distributed trx commit
- > Total order broadcast (decide order in which to deliver messages)
- > Lock/lease (decide who acquired one)
- > Membership/coordination service (given a failure detector like a timeout, decide which nodes are alive)
- > Uniqueness constraint (decide who wrote the value first and who should abort)
Consensus is easy in a single-node environment, or when only a single node is allowed to decide. A single-leader database is an example. The leader is a SPOF; its failure can be handled by:
- waiting for the leader to recover (what if it doesn't? termination property violated - system blocked forever)
- Consensus by act of God -> manual fail over - admin chooses new leader and reconfigures the cluster
Humans are SLOW
- Algorithmically auto select new leader - CONSENSUS ALG
Single leader DB provides linearizability for writes without consensus but requires it to maintain its leadership!
Leader “KICKS THE CAN DOWN THE ROAD”
Outsourcing consensus, failure detection & membership service using ZooKeeper and similar
Easier than implementing custom algorithms able to withstand: partial failure (network packet lost), process pauses, clock drifts
Not all systems require consensus - leaderless, multi-leader rely on handling conflicts instead
Often causality is enough; no need for linearizability
Idea behind building fault-tolerant systems
Find general-purpose abstractions with useful guarantees, implement them once and let apps rely on them.
Database transactions as an example - apps can pretend there are no crashes (atomicity), that nobody accesses the db at the same time (isolation), and that storage devices are reliable (durability)
Eventual consistency revisited
For replicated databases - if you stop writing to the DB for some UNSPECIFIED length of time, then eventually all reads will return the same value
Weak guarantee - no guarantees when replicas converge
No guarantees for reading your own writes
Problems appear only when there is high concurrency or network issues.
Distributed consistency vs transaction isolations
Trx isolation is primarily about avoiding race conditions due to concurrently executing trx
DC - coordinating state of replicas in the face of delays and faults
Linearizability
Atomic consistency, strong consistency, immediate consistency, external consistency
Idea: All clients have the same view of the data as if there is only one replica. No worries about replication lag
Constraint - after any read has returned a new value ALL following reads from any replica must also return the new value
Imagine there is some point in time (between start and end of the write) where value of x flips from old to the new value everywhere
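A minimal single-process sketch of these semantics (my own illustration, not from the book): a mutex stands in for the "single copy, atomic operations" behaviour; a real replicated implementation is far more involved.
```python
import threading

class LinearizableRegister:
    """Single copy of the data; every operation takes effect atomically
    at some point between its start and end (here: while holding the lock)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value   # once any reader has seen the new value,
                                 # no later read can return the old one

    def write(self, value):
        with self._lock:
            self._value = value

    def compare_and_set(self, expected, new):
        # Atomically set the value only if it currently equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False
```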
Linearizability vs Serializability
Different guarantees
Serializability - trx isolation property
Every trx may read/write many OBJECTS
Trx behave as though they had executed in SOME serial order. The serial order may not be the order in which trx were actually run.
Linearizability - guarantees on reads and writes of an INDIVIDUAL object
Does not group operations into trx
Does not prevent write skews
2PL, literal serial execution are typically linearizable as well
SSI is not - reads are made from consistent snapshot
Snapshot does not include more recent writes than the snapshot
Where do we need linearizability
Locking/leader election
Only single leader allowed (no split brain)
Once a lock is acquired, all subsequent reads must reflect that acquisition (all nodes must agree on who holds it)
Uniqueness and other constraints
Same as acquiring a lock for given name
Bank account cannot go below zero
Cross-channel timing dependencies
Image resizer - a message queue tells it to fetch an image from storage. The fetch may happen before the new version of the image has been stored, so the resizer sees the old one.
Which replication methods are linearizable?
Single leader - potentially yes - reads made from the leader or from synchronously updated followers. If snapshot isolation is used then no.
If delusional leader continues to serve requests - likely to violate
Async replication, failover - committed writes can be lost - violates durability and linearizability
Consensus - we can implement linearizable storage using it
Multi-leader - nope, concurrently process writes on multiple nodes; async replication to other nodes; this produces conflicts
Leaderless - probably not - quorum reads not really
Strict quorum does not help (w+r>n) because of variable network delays (one reader can see the new value on the replica the write reached first, while a later reader still gets the old value from the 2 replicas the update hasn't reached yet)
Fig 9-6 p334
How to make Dynamo-style quorums linearizable?
Reader must perform a sync read repair
Writer must read the latest state of a quorum of nodes before sending its writes
Cost - reduced performance
p335
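A rough sketch of the synchronous read-repair idea above, assuming hypothetical replica client objects with get(key) -> (version, value) and put(key, version, value):
```python
def quorum_read(replicas, key, r):
    """Read from r replicas; before returning, synchronously copy the newest
    version to any stale replica that answered (read repair)."""
    answers = [(rep, rep.get(key)) for rep in replicas[:r]]  # get -> (version, value)
    _, (newest_version, newest_value) = max(answers, key=lambda a: a[1][0])
    for rep, (version, _) in answers:
        if version < newest_version:
            rep.put(key, newest_version, newest_value)  # repair must finish before we return
    return newest_value
```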
Why is CAP theorem unhelpful?
The definition “Consistency, Availability, Partition (tolerance) - pick 2 out of 3” is misleading
Partitioning is a network fault which is not chosen - it just happens
If the network is working correctly - the system can provide consistency (linearizability) and total availability, but when the network fails - you must choose one of the two.
Recommended to avoid CAP
Why are RAM and multicore CPU not linearizable system?
One thread writes to a variable in memory and another reads it shortly afterwards - the read can be stale due to CPU caches.
Trade-off for LATENCY
Linearizability is dropped for performance, not fault tolerance.
Linearizability is slow always, not only during a network fault.
Using CAP here makes no sense
C is dropped, yet we also don't expect the system to stay available when there is a partition (a CPU disconnected from the rest of the system)
Causal consistency
System obeys the ordering imposed by causality
cause comes before effect, message is sent before it is received, question is asked before it's answered
Causal order - what happened before what
Examples:
Snapshot isolation - a consistent snapshot is causally consistent
Causal order is a partial order
For 2 concurrent events, neither is greater nor lower
Strongest possible consistency model that does not slow down due to network delays and remains available in the face of network failures!
Which kind of order is linearizability?
Total order
IF the system behaves as if there is a single copy of the data and all operations are atomic, then for any 2 operations one must have happened before the other
There is no concurrency in linearizable datastore as there is always a single, linear timeline
Linearizability vs causality
Linearizability implies causality
Capturing causal dependencies
If a replica processes an operation, it must ensure that ALL causally preceding operations (those that happened before) have already been processed. Otherwise processing must be postponed.
The system needs a way to know which values a node had seen when it made a given write.
If a node had seen X when it wrote Y, then X causally precedes Y
Causal dependencies must be tracked across the entire database (all objects), not just for a single object.
Version vectors can be used for this
The DB needs to know which version of the data was read by the application; the snapshot's version number is passed back to the DB on a write.
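A small sketch of how version vectors distinguish "happened before" from "concurrent" (dict of node id -> counter; a simplification of what a database stores per object):
```python
def compare(a, b):
    """Compare two version vectors (dicts mapping node id -> counter).
    Returns 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither includes the other

print(compare({"n1": 2}, {"n1": 2, "n2": 1}))  # before
print(compare({"n1": 3}, {"n2": 1}))           # concurrent
```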
Sequence number ordering
Causality is a THEORETICAL concept
Actually keeping track of all causal dependencies can be overkill
Clients usually read lots of data before writing something -> unclear which data write depends on: all? only some?
Sequence numbers/timestamp (not time-of-day but logical clock ones) provide a better way.
Algorithm to generate sequence of numbers that identify operations (counter incrementing every op executed)
Sequence numbers provide total order (one is always comparable to another)
Sequence numbers are consistent with causality -> if A happened before B then the sequence number of A must be lower than that of B. For concurrent writes the order is arbitrary.
Single-leader DBs - replication log defines total order of writes
When follower applies writes in replication-log order then followers are always CAUSALLY CONSISTENT (even if lagging behind)
How to generate sequence numbers for operations in leaderless or multi-leader environment?
Each node generates sequence numbers independently. Some bits are reserved for a unique node ID (sketch after this list).
Timestamp from time-of-day clock can be attached to each operation. With sufficient resolution they might provide total order of operations.
Preallocation of blocks of sequence numbers. Node A might claim sequence numbers from 1-1000 and B from 1001 to 2000
These perform and scale better than pushing everything through a single-leader bottleneck, BUT they lose consistency with causality
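A tiny sketch of the node-ID-bits option above (the field width is made up for illustration):
```python
NODE_ID_BITS = 10  # illustrative width: supports up to 1024 nodes

def next_sequence_number(local_counter, node_id):
    """Unique across nodes (node id occupies the low bits), but NOT consistent
    with causality: a busy node's counters race ahead of an idle node's."""
    return (local_counter << NODE_ID_BITS) | node_id

# e.g. counter 7 on node 3 -> (7 << 10) | 3 == 7171
```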
Causality problem with distributed sequence generators
Each node may process a different number of operations per second
Imagine 2 nodes - one generates odd and the other even sequence numbers.
Comparing an odd-numbered op with an even-numbered one then says nothing about which happened first (the busier node's counter races ahead).
Timestamps are prone to clock-skew
Block allocator case - an operation may be assigned a number from a higher block than a causally later operation (e.g. one gets a number from 1,001-2,000 while a later one gets a number from 1-1,000).
Lamport timestamps
Purpose: Generating sequence numbers consistent with causality
(node counter, node ID) - node ID guarantees uniqueness
Every node keeps track of max counter value they have seen so far; max value is included on every request
If a node receives a request/response with a greater counter value, it bumps its own counter up to that value
Fig 9-8, p346
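A minimal Lamport clock sketch (illustrative class names; same max-and-increment rule as in Fig 9-8):
```python
class LamportClock:
    """Illustrative Lamport clock: (counter, node_id) pairs, ties broken by node id."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0                     # maximum counter value seen so far

    def now(self):
        """Timestamp a local operation or an outgoing request."""
        self.counter += 1
        return (self.counter, self.node_id)  # comparable as a pair -> total order

    def observe(self, received_counter):
        """On every incoming request/response, fast-forward to the larger counter."""
        self.counter = max(self.counter, received_counter)
```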
Version vectors vs lamport timestamps
First - can distinguish whether 2 operations are concurrent or whether one causally depends on the other
Second - enforce a total ordering from which it's impossible to tell whether 2 ops are concurrent or causally dependent
Why total ordering is not enough to solve fex uniqueness problem?
Total ordering lets you determine the winner only AFTER the fact
Given a request to create a user, a node cannot tell whether there is a concurrent attempt to claim the same username unless it asks every other node - which is obviously not fault tolerant.
Total order broadcast
Protocol for exchanging messages between nodes.
Safety properties required:
- reliable delivery -> no message is lost; if a message is delivered to one node it must be delivered to all
- totally ordered delivery - messages are delivered to every node in the same order
STATE MACHINE REPLICATION
Useful for DB replication - each message = a write to the db
Every replica processes writes in the same order - they remain consistent with each other (aside from replication lag)
It’s a way of creating a log (replication, trx, write ahead log etc). Delivering a message = appending to log
All nodes must deliver the same messages in the same order; all nodes can read the log and see the same sequence of messages
Useful for implementing a lock service that grants fencing tokens
The sequence number of the lock-grant message can serve as the fencing token itself
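A small sketch of the storage side checking fencing tokens (illustrative class, not a real API):
```python
class FencedStore:
    """Illustrative storage service that rejects writes with a stale fencing token,
    e.g. from a client whose lease expired while its process was paused."""

    def __init__(self):
        self.highest_token_seen = -1
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.highest_token_seen:
            raise PermissionError("stale fencing token: a newer lock holder exists")
        self.highest_token_seen = fencing_token
        self.data[key] = value
```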
Linearizable storage with total order broadcast
TOB is async - messages are guaranteed to be delivered reliably in a fixed order, but only EVENTUALLY (no guarantee when)
TOB can be used to implement linearizable WRITE
Unique username example of implementing linearizable compare-and-set using total order broadcast
1. Append a message to the log
This is tentative reservation for the username
2. Read the log until reservation message is delivered back
3. Check for any earlier messages claiming the same username
- if the first claim is my own message - reservation successful
- otherwise - abort
All messages are delivered in the same order so in case of many concurrent writes - all nodes would agree which came first
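A sketch of steps 1-3, assuming a hypothetical total-order log object with append(msg) -> offset and read_from(offset) yielding (offset, msg) pairs in delivery order:
```python
def claim_username(log, username, node_id):
    """Linearizable 'claim if free' built on total order broadcast."""
    # 1. Tentative reservation: append our own claim to the log.
    my_offset = log.append({"type": "claim", "username": username, "node": node_id})
    # 2.+3. Read the log back; the first delivered claim for this username wins,
    # and every node sees the claims in the same order, so all agree on the winner.
    for offset, msg in log.read_from(0):
        if msg.get("type") == "claim" and msg.get("username") == username:
            return offset == my_offset   # our claim came first -> success, else abort
    return False                         # unreachable if our own message is delivered back
```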
How to make linearizable reads using total order broadcast
- Append a “sync” message, wait for it to be delivered back, then perform the read - the read is guaranteed to see the db state as of the “sync” message's position in the log
- If it’s possible to fetch the position of the latest log message (in linearizable way) - query the position and wait for all entries up to that position to be delivered. Then perform the read (ZooKeeper sync() op).
- Make the read from a replica that is synchronously updated on writes (and is therefore guaranteed to be up to date)
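A sketch of the first option above, reusing the same hypothetical log object plus a local state machine with apply()/get() and an applied_up_to offset:
```python
def linearizable_read(log, local_state, key):
    """Append a 'sync' marker, apply every log entry up to and including it,
    then read locally - the result reflects all writes ordered before the marker."""
    sync_offset = log.append({"type": "sync"})
    for offset, msg in log.read_from(local_state.applied_up_to + 1):
        local_state.apply(offset, msg)   # applying the 'sync' marker itself is a no-op
        if offset == sync_offset:
            break
    return local_state.get(key)
```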
Total order broadcast using linearizable storage
Assume linearizable integer register with increment-and-get operation (or atomic CAS)
For every message to be sent through TOB - bump the register (increment-and-get) and attach the returned value as the message's sequence number
For a fault-tolerant system such a register is not trivial to build
Both problems reduce to consensus
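A sketch of that construction, assuming a hypothetical linearizable register with increment_and_get() and a network object with send():
```python
def total_order_send(register, network, nodes, message):
    """Attach a sequence number from a linearizable increment-and-get register,
    then send the message to all nodes; recipients deliver messages strictly in
    sequence-number order, and since the numbers have no gaps a recipient knows
    exactly which message it must wait for next."""
    seq = register.increment_and_get()   # linearizable: every message gets a unique, dense number
    for node in nodes:
        network.send(node, (seq, message))
    return seq
```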
Key difference between total order broadcast and timestamp ordering
Unlike e.g. Lamport timestamps, the sequence numbers for TOB form a CONTIGUOUS sequence (no gaps)
If node delivered msg 4 and received msg 6 then it KNOWS it must wait for msg 5
What is FLP result
Fischer, Lynch & Paterson's proof that consensus is impossible if there is a risk that a node may crash (which is always the case in a distributed system)
Assumption - async system model - no timeouts, no clocks, strictly deterministic algorithm
Why is it not sufficient to send a commit request to all nodes in distributed atomic commit?
- some nodes may detect a constraint violation/conflict - meaning abort is required. Other nodes might have already committed.
- Some commit requests may get lost in the network (those nodes abort after a timeout) while others get through
- Some nodes may crash before they actually handle commit
Trx commit must be irrevocable - after commit it becomes visible to other trx!
Two phase commit
Algorithm for achieving atomic trx commit across multiple nodes - all nodes either commit or abort
Available to apps in the form of XA Transactions (supported e.g. by JTA)
2PC uses coordinator/trx manager component
Can be same process as app requesting trx or separate
Algorithm:
1. App reads and writes data on multiple db nodes as usual (the participants of the trx)
2. When app wants to commit coordinator begins PHASE 1 - sends PREPARE request to each participant
Participants should check whether they are able to commit (constraint violations etc)
3. Coordinator gathers all responses
- if all replied “yes” then coordinator sends PHASE 2 COMMIT request and commit actually takes place
- if any replied “no” then coordinator sends PHASE 2 ABORT request
Also called BLOCKING atomic commit - nodes can become stuck waiting for the coordinator to recover
2PC atomicity
Detail breakdown
1. App starts trx - gets globally unique trx id from coordinator
2. App begins a single-node trx on each participant + attaches the trx id - if anything goes wrong now, aborting is easy
3. When the app is ready to commit, the coord sends prepare tagged with the trx id - if any request fails, abort is sent with the same trx id
4. When participant receives prepare it makes sure that it can DEFINITELY commit trx under ALL CIRCUMSTANCES
Trx data is written to disk (so commit can be done even if power failure/no disk space happens)
Constraints and conflicts are checked
By responding “yes” the participant YIELDS the right to ABORT
5. When the coord has all responses it makes the DEFINITIVE decision whether to commit or abort (see the sketch after this breakdown)
Decision is written to trx log on disk - COMMIT POINT
6. After securing the decision - commit or abort is sent out
If any request fails here - RETRY (forever until success!)
No going back - if participant crashes then after recovery it MUST accept the request from coord
2 points of NO RETURN
- participant says yes in prepare
- coord makes definite decision
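A condensed sketch of the coordinator side (hypothetical participant objects with prepare/commit/abort; the decision is persisted before phase 2, and phase-2 requests are retried forever):
```python
import time

def two_phase_commit(coordinator_log, participants, txid):
    # Phase 1: ask every participant to promise it can commit under ALL circumstances.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txid) is True)  # participant persists its writes before voting
        except Exception:
            votes.append(False)                    # unreachable participant counts as a "no"
    decision = "commit" if all(votes) else "abort"

    # Commit point: the decision goes to the coordinator's durable log BEFORE anyone is told.
    coordinator_log.write(txid, decision)

    # Phase 2: announce the decision; retry forever, since participants that voted "yes"
    # may no longer abort on their own and must eventually hear the outcome.
    for p in participants:
        while True:
            try:
                if decision == "commit":
                    p.commit(txid)
                else:
                    p.abort(txid)
                break
            except Exception:
                time.sleep(1)   # keep retrying until the participant acknowledges
    return decision
```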
What if coordinator fails?
Participant can only safely abort on its own before responding “yes” to prepare request
After that it MUST hear back from the coordinator
If coordinator crashes or network fails - trx is in the state “in doubt” or “uncertain”
Participant CANNOT abort because coordinator might have already committed elsewhere
Participant CANNOT commit because some other participant could say “no”
Participant MUST wait for coord
Coord MUST save its decision to trx log BEFORE sending it out
Fault tolerant consensus
One or more nodes may PROPOSE values
Algorithm DECIDES on ONE of those values
Properties:
Uniform Agreement - no two nodes decide differently
Integrity - no node decides twice
Validity - if node decides value v then v was proposed by SOME node (no algorithms always deciding null)
Termination - every node that does not crash eventually decides some value
If no fault tolerance is needed - one node can be hardcoded as “dictator” (like coord in 2PC)
Hence the termination property - if a node fails, the other nodes are expected to reach a decision anyway
Consensus requires at least a majority of nodes to be running (or no quorum can be formed)
So termination property assumes less than half of nodes can crash
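(Worked example: with n = 5 nodes a majority quorum is 3, so decisions keep being made with up to 2 nodes down; with 3 or more down no quorum forms and the system blocks - but it never decides incorrectly.)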
Examples:
- Viewstamped Replication VSR (TOB)
- Paxos (Multi-Paxos for the TOB version)
- Raft (TOB)
- Zab (TOB)
Most of them decide on a sequence of values (making them total order broadcast - more efficient than doing repeated rounds of one-value-at-a-time consensus)
Total order broadcast viewed as consensus
Repeated rounds of consensus - in each round, nodes propose the message they want to send next and decide on the next message to be delivered in the total order
Each decision = 1 message delivery
How does a consensus algorithm elect a leader when there is no leader?
Every time there seems to be no leader, a vote is started among the nodes
Each election is given an EPOCH NUMBER (totally ordered and monotonically increasing)
If there is a conflict between 2 leaders in 2 different epochs - leader with the higher epoch number prevails
Node that wants to become a leader must collect votes from a quorum of nodes
Node votes in favor of a proposal if it is not aware of any other leader with higher epoch
There are 2 rounds of voting in each epoch:
- Elect a leader
- Vote on the leader's proposal
Key insight: quorums of those 2 votes must overlap - if a vote on a proposal succeeded at least one of the nodes that voted for it must have also participated in the most recent leader election
If the vote on a proposal does not reveal any higher-epoch leader, the current leader can conclude it still holds the leadership and decide the proposed value
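A tiny sketch of the per-node voting rule (a simplification; real algorithms like Raft track more state):
```python
class Voter:
    """Per-node voting state; a simplification of what real consensus algorithms track."""

    def __init__(self):
        self.highest_epoch_seen = 0

    def grant_leader_vote(self, candidate_epoch):
        # Vote for a would-be leader only if its epoch is higher than anything seen so far.
        if candidate_epoch > self.highest_epoch_seen:
            self.highest_epoch_seen = candidate_epoch
            return True
        return False

    def accept_proposal(self, leader_epoch):
        # Accept a value proposed by a leader only if no higher epoch has been seen;
        # because election and proposal quorums overlap, a deposed leader is exposed here.
        return leader_epoch >= self.highest_epoch_seen
```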
Difference between voting in fault-tolerant consensus and in 2PC
Coordinator is not elected in 2PC
Consensus requires votes from a MAJORITY of nodes
2PC requires “yes” from all participants
Consensus defines recovery process (nodes can get into a consistent state after a new leader is elected)
Limitation of consensus
Voting is similar to SYNC db REPLICATION
Requires a STRICT MAJORITY of nodes to operate
Most consensus algorithms assume FIXED set of nodes that vote
Uses TIMEOUTS as a FAILURE DETECTOR - when there is high variability in network delays, consensus can turn into an election-fest
What are ZooKeeper and etcd designed for primarily?
Hold small amounts of data that fit in memory (written to disk for durability)
All the data is replicated using fault-tolerant total order broadcast
Which features does ZooKeeper provide?
Linearizable atomic operations
- can be used to implement lock/lease
Total ordering of operations
- can be used for implementing fencing token (prevent old leases from being used if process is paused)
Failure detection
- clients can maintain long-lived session on ZooKeeper servers. Heartbeats are exchanged periodically. ZooKeeper can auto-release all locks held by a session when it times out (ephemeral nodes)
Change notifications
- clients can watch for changes (like a new node joining the cluster, or node failures). Notifications are pushed to subscribers, no polling required
All in all - ZooKeeper has a useful set of features for distributed coordination
Example of using ZooKeeper in place of implementing in-house consensus
Allocating work to nodes
Partitioned resource (like message stream shard in Kinesis). New node joins the cluster and some work should be moved from existing nodes to the new one - rebalancing partitions.
If node is removed or has failed - same story.
How to do it with ZooKeeper: combine atomic (linearizable) operations, ephemeral nodes (failure detection) and change notifications
Supposedly it's not easy (even when using higher-level APIs like Apache Curator), but still easier than implementing fault-tolerant consensus from scratch (which is said to have a poor success record)
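A hedged sketch of the registration half of that pattern using the kazoo Python client (connection string, paths and the rebalance() hook are placeholders; assumes a running ZooKeeper ensemble):
```python
from kazoo.client import KazooClient

def rebalance(workers):
    """Application-defined hook (placeholder): recompute which partitions this node owns."""
    print("current workers:", workers)

zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder connection string
zk.start()

# Each worker announces itself with an ephemeral, sequential node: if its session
# times out, ZooKeeper deletes the node automatically (failure detection).
zk.ensure_path("/workers")
zk.create("/workers/worker-", b"10.0.0.7", ephemeral=True, sequence=True)

# Every worker watches the membership list; on any change (join, crash) the
# callback fires and the application rebalances partitions - no polling needed.
@zk.ChildrenWatch("/workers")
def on_membership_change(children):
    rebalance(sorted(children))
```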
Key idea: the application may grow to thousands of nodes, so majority voting across all of them would become ineffectual.
ZooKeeper, usually run on a FIXED NUMBER OF NODES (3 or 5), lets you OUTSOURCE the coordination work (consensus, ordering of operations, failure detection).
ZooKeeper is intended for SLOW-CHANGING DATA (which IP a node is running on, which partition is assigned to which node)
Timescale of minutes/hours not millions of times per second