8. The Trouble with Distributed Systems Flashcards
Problems of distributed systems - 1. Unreliable network
Whenever you try to send a packet over the network, it may be lost, reordered, duplicated or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether the message got through.
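Detecting faults is hard
Most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, a timeout can’t distinguish between network and node failures: a missing reply may mean the request was lost, the reply was lost or delayed, or the node crashed or is merely slow.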
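A minimal sketch of a timeout-based failure detector, to make the limitation concrete. The class name, the heartbeat model, and the injectable `clock` parameter are assumptions made for illustration, not something the chapter prescribes. Note that the detector can only report "suspected": it has no way to tell a dead node from a slow network.

```python
import time


class TimeoutFailureDetector:
    """Hypothetical detector: suspects a node when no heartbeat arrived in time."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # monotonic, so wall-clock jumps do not matter
        self.last_heard = {}        # node id -> time of last heartbeat

    def record_heartbeat(self, node):
        self.last_heard[node] = self.clock()

    def is_suspected(self, node):
        # True means "no reply within the timeout". The cause is unknown:
        # the node may have crashed, or the heartbeat may be lost or delayed.
        last = self.last_heard.get(node)
        if last is None:
            return True
        return self.clock() - last > self.timeout_s
```

A short timeout detects failures sooner but wrongly suspects slow nodes more often; a long timeout does the opposite.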
Tolerating faults is hard
There is no global variable, no shared memory, no common knowledge or any other kind of shared state between the machines.
Nodes can’t even agree on what time it is, let alone on anything more profound.
The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree.
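The quorum idea can be sketched as a simple majority count. This is an illustrative assumption about how votes might be represented (a dict mapping node id to `True`, `False`, or `None` for no reply), not a full consensus protocol.

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def quorum_agrees(votes, cluster_size):
    """Return True if a majority of the whole cluster voted yes.

    votes: dict of node id -> True, False, or None (None = no reply received).
    Missing replies count against the decision, because a silent node
    cannot be assumed to agree.
    """
    yes_votes = sum(1 for v in votes.values() if v is True)
    return yes_votes >= quorum_size(cluster_size)
```

For example, in a five-node cluster three yes votes are enough even if the other two nodes are unreachable, while two yes votes are not.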
Problems of distributed systems - 2. Unreliable clocks
A node’s clock may be significantly out of sync with other nodes (despite our best efforts to set up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good measure of your clock’s error interval.
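One way to picture the error interval is to treat a clock reading as a range rather than a point. The sketch below assumes each node somehow knows a bound on its clock error, which the card notes is usually not the case; `ClockReading`, `reading`, and `definitely_before` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockReading:
    earliest_ms: int  # lowest plausible true time
    latest_ms: int    # highest plausible true time


def reading(timestamp_ms, max_error_ms):
    """Wrap a raw timestamp with an assumed error bound."""
    return ClockReading(timestamp_ms - max_error_ms, timestamp_ms + max_error_ms)


def definitely_before(a, b):
    """True only if the two intervals do not overlap and a comes first."""
    return a.latest_ms < b.earliest_ms
```

With a 5 ms error bound, readings taken 4 ms apart overlap, so their order cannot be determined from the timestamps alone; readings 20 ms apart can be ordered.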
Problems of distributed systems - 3. Process pause
A process may pause for a substantial amount of time at any point in its execution (perhaps due to a stop-the-world garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was paused.
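The danger can be sketched with a lease that is checked before a write. The `Lease` class, `guarded_write`, and the `pause` hook (which stands in for a garbage-collection pause) are all hypothetical; the point is only that a check made before the pause says nothing about the state after it.

```python
import time


class Lease:
    """Hypothetical lease that expires a fixed duration after creation."""

    def __init__(self, duration_s, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + duration_s

    def is_valid(self):
        return self.clock() < self.expires_at


def guarded_write(lease, do_write, pause=lambda: None):
    """Check the lease, then write. Returns a label describing what happened."""
    if not lease.is_valid():
        return "skipped"
    pause()                          # the process may stop here for any length of time
    still_valid = lease.is_valid()   # observed only for this demonstration
    do_write()                       # real code would not know the lease had expired
    return "safe" if still_valid else "wrote-after-expiry"
```

If the pause outlasts the lease, the write still happens, because the process resumes exactly where it left off and trusts the earlier check.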