Data Intensive Ch9 - Consistency and Consensus Flashcards
Context map
Timestamp ordering
Lamport timestamps
Causal ordering
Total order broadcast
Distributed transactions
Linearizability
Compare-and-set
Global constraints
Failure detectors - membership services
Summary
Linearizability - consistency model
Goal - make replicated data appear as though there was only a single copy; make all ops appear as if they acted on that copy atomically
Totally ordered, single timeline
Behaves like a variable in a single-threaded program; slow; susceptible to network delays
Causality - imposes ordering on events - what happened before what; cause and effect
Weaker consistency model - some things can be concurrent
Timeline with branching & merging
Less sensitive to network problems - less coordination overhead
Can be captured using Lamport timestamps
Causality is not sufficient for e.g. ensuring name uniqueness
Consensus required
Achieving consensus - deciding something in a way all nodes agree on what was decided - decision is irrevocable
Many problems are reducible to consensus
- > Linearizable compare-and-set (atomically decide whether to set the value, based on whether its current value equals the expected one)
- > Atomic distributed trx commit
- > Total order broadcast (decide order in which to deliver messages)
- > Lock/lease (decide who acquired one)
- > Membership/coordination service (given a failure detector like a timeout, decide which nodes are alive)
- > Uniqueness constraint (decide who wrote the value first and who should abort)
Consensus is easy in a single-node environment, or when only a single node is allowed to decide. A single-leader database is an example. The leader is a SPOF; its failure can be handled by:
- waiting for the leader to recover (what if it doesn't? termination property violated - system blocked forever)
- Consensus by act of God -> manual fail over - admin chooses new leader and reconfigures the cluster
Humans are SLOW
- Algorithmically auto select new leader - CONSENSUS ALG
Single leader DB provides linearizability for writes without consensus but requires it to maintain its leadership!
Leader “KICKS THE CAN DOWN THE ROAD”
Outsourcing consensus, failure detection & membership service using ZooKeeper and similar
Easier than implementing custom algorithms able to withstand: partial failure (network packet lost), process pauses, clock drifts
Not all systems require consensus - leaderless, multi-leader rely on handling conflicts instead
Often causality is enough; no need for linearizability
Idea behind building fault-tolerant systems
Find general-purpose abstractions with useful guarantees, implement them once and let apps rely on them.
Database transactions as an example - apps can pretend there are no crashes (atomicity), that nobody accesses the db at the same time (isolation), and that storage devices are reliable (durability)
Eventual consistency revisited
For replicated databases - if you stop writing to the DB for some UNSPECIFIED length of time, then eventually all reads will return the same value
Weak guarantee - no guarantees when replicas converge
No guarantees for reading your own writes
Problems appear only when there is high concurrency or network issues.
Distributed consistency vs transaction isolations
Trx isolation is primarily about avoiding race conditions due to concurrently executing trx
DC - coordinating state of replicas in the face of delays and faults
Linearizability
Atomic consistency, strong consistency, immediate consistency, external consistency
Idea: All clients have the same view of the data as if there is only one replica. No worries about replication lag
Constraint - after any read has returned a new value ALL following reads from any replica must also return the new value
Imagine there is some point in time (between start and end of the write) where value of x flips from old to the new value everywhere
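A minimal single-process sketch of these semantics (my own illustration, not from the book): a mutex stands in for the "single copy, atomic operations" behaviour; a real replicated implementation is far more involved.
```python
import threading

class LinearizableRegister:
    """Single copy of the data; every operation takes effect atomically
    at some point between its start and end (here: while holding the lock)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value   # once any reader has seen the new value,
                                 # no later read can return the old one

    def write(self, value):
        with self._lock:
            self._value = value

    def compare_and_set(self, expected, new):
        # Atomically set the value only if it currently equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False
```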
Linearizability vs Serializability
Different guarantees
Serializability - trx isolation property
Every trx may read/write many OBJECTS
Trx behave as though they had executed in SOME serial order. The serial order may not be the order in which trx were actually run.
Linearizability - guarantees on reads and writes of an INDIVIDUAL object
Does not group operations into trx
Does not prevent write skews
2PL, literal serial execution are typically linearizable as well
SSI is not - reads are made from consistent snapshot
Snapshot does not include more recent writes than the snapshot
Where do we need linearizability
Locking/leader election
Only single leader allowed (no split brain)
Once a lock is acquired, all subsequent reads must reflect that acquisition (all nodes must agree on who holds it)
Uniqueness and other constraints
Same as acquiring a lock for given name
Bank account cannot go below zero
Cross-channel timing dependencies
Image resizer - a message queue tells it to fetch an image from storage. The fetch may happen before the new version of the image has been stored, so the resizer sees the old one.
Which replication methods are linearizable?
Single leader - potentially yes - reads made from the leader or from synchronously updated followers. If snapshot isolation is used then no.
If delusional leader continues to serve requests - likely to violate
Async replication, failover - committed writes can be lost - violates durability and linearizability
Consensus - we can implement linearizable storage using it
Multi-leader - nope, concurrently process writes on multiple nodes; async replication to other nodes; this produces conflicts
Leaderless - probably not - quorum reads not really
Strict quorum does not help (w+r>n) because of variable network delays (one reader can see the new value on the replica the write reached first, while a later reader still gets the old value from the 2 replicas the update hasn't reached yet)
Fig 9-6 p334
How to make Dynamo-style quorums linearizable?
Reader must perform a sync read repair
Writer must read the latest state of a quorum of nodes before sending its writes
Cost - reduced performance
p335
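A rough sketch of the synchronous read-repair idea above, assuming hypothetical replica client objects with get(key) -> (version, value) and put(key, version, value):
```python
def quorum_read(replicas, key, r):
    """Read from r replicas; before returning, synchronously copy the newest
    version to any stale replica that answered (read repair)."""
    answers = [(rep, rep.get(key)) for rep in replicas[:r]]  # get -> (version, value)
    _, (newest_version, newest_value) = max(answers, key=lambda a: a[1][0])
    for rep, (version, _) in answers:
        if version < newest_version:
            rep.put(key, newest_version, newest_value)  # repair must finish before we return
    return newest_value
```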
Why is CAP theorem unhelpful?
The definition “Consistency, Availability, Partition (tolerance) - pick 2 out of 3” is misleading
Partitioning is a network fault which is not chosen - it just happens
If the network is working correctly - the system can provide consistency (linearizability) and total availability, but when the network fails - you must choose one of the two.
Recommended to avoid CAP
Why are RAM and multicore CPU not linearizable system?
One thread writes to a variable in memory and another reads it shortly afterwards - the read can be stale due to CPU caches.
Trade-off for LATENCY
Linearizability is dropped for performance, not fault tolerance.
Linearizability is slow always, not only during a network fault.
Using CAP here makes no sense
C is dropped, yet we also don't expect the system to stay available when there is a partition (a CPU disconnected from the rest of the system)
Causal consistency
System obeys the ordering imposed by causality
cause comes before effect, message is sent before it is received, question is asked before it's answered
Causal order - what happened before what
Examples:
Snapshot isolation - a consistent snapshot is causally consistent
Causal order is a partial order
For 2 concurrent events, neither is greater nor lower
Strongest possible consistency model that does not slow down due to network delays and remains available in the face of network failures!
Which kind of order is linearizability?
Total order
IF the system behaves as if there is a single copy of the data and all operations are atomic, then for any 2 operations one must have happened before the other
There is no concurrency in linearizable datastore as there is always a single, linear timeline
Linearizability vs causality
Linearizability implies causality
Capturing causal dependencies
If a replica processes an operation, it must ensure that ALL causally preceding operations (those that happened before) have already been processed. Otherwise processing must be postponed.
The system needs a way to know which values a node had seen when it made a given write.
If a node had seen X when it wrote Y, then X causally precedes Y
Causal dependencies must be tracked across the entire database (all objects), not just for a single object.
Version vectors can be used for this
The DB needs to know which version of the data was read by the application; the snapshot's version number is passed back to the DB on a write.
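A small sketch of how version vectors distinguish "happened before" from "concurrent" (dict of node id -> counter; a simplification of what a database stores per object):
```python
def compare(a, b):
    """Compare two version vectors (dicts mapping node id -> counter).
    Returns 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither includes the other

print(compare({"n1": 2}, {"n1": 2, "n2": 1}))  # before
print(compare({"n1": 3}, {"n2": 1}))           # concurrent
```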
Sequence number ordering
Causality is a THEORETICAL concept
Actually keeping track of all causal dependencies can be overkill
Clients usually read lots of data before writing something -> unclear which data write depends on: all? only some?
Sequence numbers/timestamp (not time-of-day but logical clock ones) provide a better way.
Algorithm to generate sequence of numbers that identify operations (counter incrementing every op executed)
Sequence numbers provide total order (one is always comparable to another)
Sequence numbers are consistent with causality -> if A happened before B then the sequence number of A must be lower than that of B. For concurrent writes the order is arbitrary.
Single-leader DBs - replication log defines total order of writes
When follower applies writes in replication-log order then followers are always CAUSALLY CONSISTENT (even if lagging behind)
How to generate sequence numbers for operations in leaderless or multi-leader environment?
Each node generates sequence numbers independently. Some bits are reserved for a unique node ID (sketch after this list).
Timestamp from time-of-day clock can be attached to each operation. With sufficient resolution they might provide total order of operations.
Preallocation of blocks of sequence numbers. Node A might claim sequence numbers from 1-1000 and B from 1001 to 2000
These perform and scale better than pushing everything through a single-leader bottleneck, BUT they lose consistency with causality
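A tiny sketch of the node-ID-bits option above (the field width is made up for illustration):
```python
NODE_ID_BITS = 10  # illustrative width: supports up to 1024 nodes

def next_sequence_number(local_counter, node_id):
    """Unique across nodes (node id occupies the low bits), but NOT consistent
    with causality: a busy node's counters race ahead of an idle node's."""
    return (local_counter << NODE_ID_BITS) | node_id

# e.g. counter 7 on node 3 -> (7 << 10) | 3 == 7171
```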
Causality problem with distributed sequence generators
Each node may process a different number of operations per second
Imagine 2 nodes - one generates odd and the other even sequence numbers.
Comparing an odd-numbered op with an even-numbered one then says nothing about which happened first (the busier node's counter races ahead).
Timestamps are prone to clock-skew
Block allocator case - an operation may be assigned a number from a higher block than a causally later operation (e.g. one gets a number from 1,001-2,000 while a later one gets a number from 1-1,000).
Lamport timestamps
Purpose: Generating sequence numbers consistent with causality
(node counter, node ID) - node ID guarantees uniqueness
Every node keeps track of max counter value they have seen so far; max value is included on every request
If a node receives a request/response with a greater counter value, it bumps its own counter up to that value
Fig 9-8, p346
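A minimal Lamport clock sketch (illustrative class names; same max-and-increment rule as in Fig 9-8):
```python
class LamportClock:
    """Illustrative Lamport clock: (counter, node_id) pairs, ties broken by node id."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0                     # maximum counter value seen so far

    def now(self):
        """Timestamp a local operation or an outgoing request."""
        self.counter += 1
        return (self.counter, self.node_id)  # comparable as a pair -> total order

    def observe(self, received_counter):
        """On every incoming request/response, fast-forward to the larger counter."""
        self.counter = max(self.counter, received_counter)
```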
Version vectors vs lamport timestamps
First - can distinguish whether 2 operations are concurrent or whether one causally depends on the other
Second - enforce a total ordering from which it's impossible to tell whether 2 ops are concurrent or causally dependent
Why total ordering is not enough to solve fex uniqueness problem?
Total ordering lets you determine the winner only AFTER the fact
Given a request to create a user, a node cannot tell whether there is a concurrent attempt to claim the same username unless it asks every other node - which is obviously not fault tolerant.
Total order broadcast
Protocol for exchanging messages between nodes.
Safety properties required:
- reliable delivery -> no message is lost; if a message is delivered to one node it must be delivered to all
- totally ordered delivery - messages are delivered to every node in the same order
STATE MACHINE REPLICATION
Useful for DB replication - each message = a write to the db
Every replica processes writes in the same order - they remain consistent with each other (aside from replication lag)
It’s a way of creating a log (replication, trx, write ahead log etc). Delivering a message = appending to log
All nodes must deliver the same messages in the same order; all nodes can read the log and see the same sequence of messages
Useful for implementing a lock service that grants fencing tokens
The sequence number of the lock-grant message can serve as the fencing token itself
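A small sketch of the storage side checking fencing tokens (illustrative class, not a real API):
```python
class FencedStore:
    """Illustrative storage service that rejects writes with a stale fencing token,
    e.g. from a client whose lease expired while its process was paused."""

    def __init__(self):
        self.highest_token_seen = -1
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.highest_token_seen:
            raise PermissionError("stale fencing token: a newer lock holder exists")
        self.highest_token_seen = fencing_token
        self.data[key] = value
```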
Linearizable storage with total order broadcast
TOB is async - messages are guaranteed to be delivered reliably in a fixed order, but only EVENTUALLY (no guarantee when)
TOB can be used to implement linearizable WRITE
Unique username example of implementing linearizable compare-and-set using total order broadcast
1. Append a message to the log
This is tentative reservation for the username
2. Read the log until reservation message is delivered back
3. Check for any earlier messages claiming the same username
- if the first claim is my own message - reservation successful
- otherwise - abort
All messages are delivered in the same order so in case of many concurrent writes - all nodes would agree which came first
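A sketch of steps 1-3, assuming a hypothetical total-order log object with append(msg) -> offset and read_from(offset) yielding (offset, msg) pairs in delivery order:
```python
def claim_username(log, username, node_id):
    """Linearizable 'claim if free' built on total order broadcast."""
    # 1. Tentative reservation: append our own claim to the log.
    my_offset = log.append({"type": "claim", "username": username, "node": node_id})
    # 2.+3. Read the log back; the first delivered claim for this username wins,
    # and every node sees the claims in the same order, so all agree on the winner.
    for offset, msg in log.read_from(0):
        if msg.get("type") == "claim" and msg.get("username") == username:
            return offset == my_offset   # our claim came first -> success, else abort
    return False                         # unreachable if our own message is delivered back
```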
How to make linearizable reads using total order broadcast
- Append a “sync” message, wait for it to be delivered back, then perform the read - the read is guaranteed to see the db state as of the “sync” message's position in the log
- If it’s possible to fetch the position of the latest log message (in linearizable way) - query the position and wait for all entries up to that position to be delivered. Then perform the read (ZooKeeper sync() op).
- Make the read from a replica that is synchronously updated on writes (and is therefore guaranteed to be up to date)
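A sketch of the first option above, reusing the same hypothetical log object plus a local state machine with apply()/get() and an applied_up_to offset:
```python
def linearizable_read(log, local_state, key):
    """Append a 'sync' marker, apply every log entry up to and including it,
    then read locally - the result reflects all writes ordered before the marker."""
    sync_offset = log.append({"type": "sync"})
    for offset, msg in log.read_from(local_state.applied_up_to + 1):
        local_state.apply(offset, msg)   # applying the 'sync' marker itself is a no-op
        if offset == sync_offset:
            break
    return local_state.get(key)
```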
Total order broadcast using linearizable storage
Assume linearizable integer register with increment-and-get operation (or atomic CAS)
For every message to be sent through TOB - bump the register (increment-and-get) and attach the returned value as the message's sequence number
For a fault-tolerant system such a register is not trivial to build
Both problems reduce to consensus
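A sketch of that construction, assuming a hypothetical linearizable register with increment_and_get() and a network object with send():
```python
def total_order_send(register, network, nodes, message):
    """Attach a sequence number from a linearizable increment-and-get register,
    then send the message to all nodes; recipients deliver messages strictly in
    sequence-number order, and since the numbers have no gaps a recipient knows
    exactly which message it must wait for next."""
    seq = register.increment_and_get()   # linearizable: every message gets a unique, dense number
    for node in nodes:
        network.send(node, (seq, message))
    return seq
```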
Key difference between total order broadcast and timestamp ordering
Unlike e.g. Lamport timestamps, the sequence numbers for TOB form a CONTIGUOUS sequence (no gaps)
If node delivered msg 4 and received msg 6 then it KNOWS it must wait for msg 5
What is FLP result
Fischer, Lynch & Paterson's proof that consensus is impossible if there is a risk that a node may crash (which is always the case in a distributed system)
Assumption - async system model - no timeouts, no clocks, strictly deterministic algorithm
Why is it not sufficient to send a commit request to all nodes in distributed atomic commit?
- some nodes may detect a constraint violation/conflict - meaning abort is required. Other nodes might have already committed.
- Some commit requests may get lost in the network (those nodes abort after a timeout) while others get through
- Some nodes may crash before they actually handle commit
Trx commit must be irrevocable - after commit it becomes visible to other trx!
Two phase commit
Algorithm for achieving atomic trx commit across multiple nodes - all nodes either commit or abort
Available to apps in the form of XA Transactions (supported e.g. by JTA)
2PC uses coordinator/trx manager component
Can be same process as app requesting trx or separate
Algorithm:
1. App reads and writes data on multiple db nodes as usual (the participants of the trx)
2. When app wants to commit coordinator begins PHASE 1 - sends PREPARE request to each participant
Participants should check whether they are able to commit (constraint violations etc)
3. Coordinator gathers all responses
- if all replied “yes” then coordinator sends PHASE 2 COMMIT request and commit actually takes place
- if any replied “no” then coordinator sends PHASE 2 ABORT request
Also called BLOCKING atomic commit - nodes can become stuck waiting for the coordinator to recover
2PC atomicity
Detail breakdown
1. App starts trx - gets globally unique trx id from coordinator
2. App begins a single-node trx on each participant + attaches the trx id - if anything goes wrong now, aborting is easy
3. When the app is ready to commit, the coord sends prepare tagged with the trx id - if any request fails, abort is sent with the same trx id
4. When participant receives prepare it makes sure that it can DEFINITELY commit trx under ALL CIRCUMSTANCES
Trx data is written to disk (so commit can be done even if power failure/no disk space happens)
Constraints and conflicts are checked
By responding “yes” the participant YIELDS the right to ABORT
5. When the coord has all responses it makes the DEFINITIVE decision whether to commit or abort (see the sketch after this breakdown)
Decision is written to trx log on disk - COMMIT POINT
6. After securing the decision - commit or abort is sent out
If any request fails here - RETRY (forever until success!)
No going back - if participant crashes then after recovery it MUST accept the request from coord
2 points of NO RETURN
- participant says yes in prepare
- coord makes definite decision
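A condensed sketch of the coordinator side (hypothetical participant objects with prepare/commit/abort; the decision is persisted before phase 2, and phase-2 requests are retried forever):
```python
import time

def two_phase_commit(coordinator_log, participants, txid):
    # Phase 1: ask every participant to promise it can commit under ALL circumstances.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txid) is True)  # participant persists its writes before voting
        except Exception:
            votes.append(False)                    # unreachable participant counts as a "no"
    decision = "commit" if all(votes) else "abort"

    # Commit point: the decision goes to the coordinator's durable log BEFORE anyone is told.
    coordinator_log.write(txid, decision)

    # Phase 2: announce the decision; retry forever, since participants that voted "yes"
    # may no longer abort on their own and must eventually hear the outcome.
    for p in participants:
        while True:
            try:
                if decision == "commit":
                    p.commit(txid)
                else:
                    p.abort(txid)
                break
            except Exception:
                time.sleep(1)   # keep retrying until the participant acknowledges
    return decision
```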
What if coordinator fails?
Participant can only safely abort on its own before responding “yes” to prepare request
After that it MUST hear back from the coordinator
If coordinator crashes or network fails - trx is in the state “in doubt” or “uncertain”
Participant CANNOT abort because coordinator might have already committed elsewhere
Participant CANNOT commit because some other participant could say “no”
Participant MUST wait for coord
Coord MUST save its decision to trx log BEFORE sending it out
Fault tolerant consensus
One or more nodes may PROPOSE values
Algorithm DECIDES on ONE of those values
Properties:
Uniform Agreement - no two nodes decide differently
Integrity - no node decides twice
Validity - if node decides value v then v was proposed by SOME node (no algorithms always deciding null)
Termination - every node that does not crash eventually decides some value
If no fault tolerance is needed - one node can be hardcoded as “dictator” (like coord in 2PC)
Hence the termination property - if a node fails, the other nodes are expected to reach a decision anyway
Consensus requires at least a majority of nodes to be running (or no quorum can be formed)
So termination property assumes less than half of nodes can crash
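(Worked example: with n = 5 nodes a majority quorum is 3, so decisions keep being made with up to 2 nodes down; with 3 or more down no quorum forms and the system blocks - but it never decides incorrectly.)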
Examples:
- Viewstamped Replication VSR (TOB)
- Paxos (Multi-Paxos for the TOB version)
- Raft (TOB)
- Zab (TOB)
Most of them decide on a sequence of values (making them total order broadcast - more efficient than doing repeated rounds of one-value-at-a-time consensus)
Total order broadcast viewed as consensus
Repeated rounds of consensus - in each round, nodes propose the message they want to send next and decide on the next message to be delivered in the total order
Each decision = 1 message delivery
How does a consensus algorithm elect a leader when there is no leader?
Every time there seems to be no leader, a vote is started among the nodes
Each election is given an EPOCH NUMBER (totally ordered and monotonically increasing)
If there is a conflict between 2 leaders in 2 different epochs - leader with the higher epoch number prevails
Node that wants to become a leader must collect votes from a quorum of nodes
Node votes in favor of a proposal if it is not aware of any other leader with higher epoch
There are 2 rounds of voting in each epoch:
- Elect a leader
- Vote on the leader's proposal
Key insight: quorums of those 2 votes must overlap - if a vote on a proposal succeeded at least one of the nodes that voted for it must have also participated in the most recent leader election
If the vote on a proposal does not reveal any higher-epoch leader, the current leader can conclude it still holds the leadership and decide the proposed value
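A tiny sketch of the per-node voting rule (a simplification; real algorithms like Raft track more state):
```python
class Voter:
    """Per-node voting state; a simplification of what real consensus algorithms track."""

    def __init__(self):
        self.highest_epoch_seen = 0

    def grant_leader_vote(self, candidate_epoch):
        # Vote for a would-be leader only if its epoch is higher than anything seen so far.
        if candidate_epoch > self.highest_epoch_seen:
            self.highest_epoch_seen = candidate_epoch
            return True
        return False

    def accept_proposal(self, leader_epoch):
        # Accept a value proposed by a leader only if no higher epoch has been seen;
        # because election and proposal quorums overlap, a deposed leader is exposed here.
        return leader_epoch >= self.highest_epoch_seen
```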
Difference between voting in fault-tolerant consensus and in 2PC
Coordinator is not elected in 2PC
Consensus requires votes from a MAJORITY of nodes
2PC requires “yes” from all participants
Consensus defines recovery process (nodes can get into a consistent state after a new leader is elected)
Limitation of consensus
Voting is similar to SYNC db REPLICATION
Requires a STRICT MAJORITY of nodes to operate
Most consensus algorithms assume FIXED set of nodes that vote
Uses TIMEOUTS as a FAILURE DETECTOR - when there is high variability in network delays, consensus can turn into an election-fest
What are ZooKeeper and etcd designed for primarily?
Hold small amounts of data that fit in memory (written to disk for durability)
All the data is replicated using fault-tolerant total order broadcast
Which features does ZooKeeper provide?
Linearizable atomic operations
- can be used to implement lock/lease
Total ordering of operations
- can be used for implementing fencing token (prevent old leases from being used if process is paused)
Failure detection
- clients can maintain long-lived session on ZooKeeper servers. Heartbeats are exchanged periodically. ZooKeeper can auto-release all locks held by a session when it times out (ephemeral nodes)
Change notifications
- clients can watch for changes (like a new node joining the cluster, or node failures). Notifications are pushed to subscribers, no polling required
All in all - ZooKeeper has a useful set of features for distributed coordination
Example of using ZooKeeper in place of implementing in-house consensus
Allocating work to nodes
Partitioned resource (like message stream shard in Kinesis). New node joins the cluster and some work should be moved from existing nodes to the new one - rebalancing partitions.
If node is removed or has failed - same story.
How to do it with ZooKeeper: combine atomic (linearizable) operations, ephemeral nodes (failure detection) and change notifications
Supposedly it's not easy (even when using higher-level APIs like Apache Curator), but still easier than implementing fault-tolerant consensus from scratch (which is said to have a poor success record)
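A hedged sketch of the registration half of that pattern using the kazoo Python client (connection string, paths and the rebalance() hook are placeholders; assumes a running ZooKeeper ensemble):
```python
from kazoo.client import KazooClient

def rebalance(workers):
    """Application-defined hook (placeholder): recompute which partitions this node owns."""
    print("current workers:", workers)

zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder connection string
zk.start()

# Each worker announces itself with an ephemeral, sequential node: if its session
# times out, ZooKeeper deletes the node automatically (failure detection).
zk.ensure_path("/workers")
zk.create("/workers/worker-", b"10.0.0.7", ephemeral=True, sequence=True)

# Every worker watches the membership list; on any change (join, crash) the
# callback fires and the application rebalances partitions - no polling needed.
@zk.ChildrenWatch("/workers")
def on_membership_change(children):
    rebalance(sorted(children))
```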
Key idea: the application may grow to thousands of nodes, so majority voting across all of them would become ineffectual.
ZooKeeper, usually run on a FIXED NUMBER OF NODES (3 or 5), lets you OUTSOURCE the coordination work (consensus, ordering of operations, failure detection).
ZooKeeper is intended for SLOW-CHANGING DATA (which IP a node is running on, which partition is assigned to which node)
Timescale of minutes/hours not millions of times per second