Data Intensive Ch8 - Trouble with distributed systems Flashcards
Context map
Process pauses
Unbounded network delays
Clock errors
Chapter Summary
Problems in distributed systems
- send a packet over the network - it may be lost or delayed for an unknown amount of time; the reply may also be lost - you have no idea whether the request was received
- a node's clock may be out of sync with other nodes
- a process can be paused for a significant amount of time at any point of its execution (e.g. a stop-the-world GC); the node may be declared dead and then come back to life later
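A minimal sketch of the kind of bug such a pause can cause (names and numbers are illustrative, not taken from the book's code): the process checks that it still holds a lease, is paused right after the check, and then acts on a check that has become stale.

```python
import time

def handle(request):
    print("handling", request)

# Hypothetical 10-second lease; in a real system it would come from a lock service.
lease_expiry = time.monotonic() + 10.0

def process_request_if_leader(request):
    if time.monotonic() < lease_expiry:    # check: lease still valid
        # A stop-the-world GC pause of tens of seconds can happen right here.
        # The process has no idea it was suspended; by the time it resumes,
        # another node may already have been granted the lease.
        handle(request)                    # acting on a possibly stale check
    else:
        raise RuntimeError("lease expired - no longer the leader")
```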
Partial failures are a defining characteristic of distributed systems.
When network communication is involved, it can:
- occasionally fail
- randomly go slow
- not respond at all
Design for failure - tolerate partial failures
Detecting failure is hard - timeouts are one way, but not ideal. Dealing with a node that is not really dead but has been declared dead is difficult.
After a fault is detected, making decisions is still hard: there is no global variable, no shared memory, no shared state, so nodes must agree via a quorum - and consensus is a hard problem.
Working with distributed systems is fundamentally different from writing software for a single computer - lots of new ways for things to go wrong
Partial failure
In a single-machine environment there is no reason for things to be flaky: either it works or it doesn't.
Operations are deterministic.
Memory corruption leads to total system failure (kernel panic, blue screen of death, failure to boot)
In distributed systems some parts can be broken in unpredictable ways while others are fine.
They are NONDETERMINISTIC
Information traveling over the network takes an UNKNOWN time to reach its destination, so we don't know whether an operation SUCCEEDED or NOT
Difference between supercomputer (HPC) and cloud computing (or distributed computing)
High-performance computing (HPC) systems with thousands of CPUs are typically designed so that if one node fails, the entire cluster job fails.
Failure of a part escalates to a total crash.
Jobs do checkpoints periodically.
If a failure occurred, the job is restarted from the last checkpoint.
In this respect a supercomputer is more similar to a single-node computer than to a distributed system.
For cloud computing:
- workloads are not batch-job-like - they are online, low-latency, and must be served at any time
- high availability is important
- nodes are built from lower cost components
- the cloud network is based on IP and Ethernet; HPC uses specialized network topologies - multi-dimensional meshes and toruses
- the bigger the system (the more parts), the more likely it is that something is broken; it's safe to assume there is always something broken, so a strategy of escalating failures to the whole system won't work
- rolling upgrades and a cattle-like approach to nodes, so the service can keep serving users without interruption
- keeping data geographically close to the users reduces latency, but communication then goes over the unreliable, slow internet; HPC assumes all nodes are close together
Unreliable network - what could go wrong
- request is lost (broken network cable)
- request is waiting in a queue (recipient is overloaded)
- remote node has crashed
- remote node is temporarily not responding (process pause)
- the remote node processed the request, but the response was lost or delayed
Distributed systems are … systems
Shared-nothing - a bunch of machines connected by a network
One machine cannot access another's memory or disk in any way other than by making a network request
Network partition
AKA Netsplit
Condition when one part of the network is cut off from the rest due to network fault
Detecting faults
The only way to know whether a request was successful is to receive a positive response.
If anything goes wrong, we cannot assume we will get a response back.
We can try app-level retries and timeouts - eventually the node must be declared dead (e.g. to stop sending more requests to it)
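A minimal sketch of that idea, assuming a plain TCP health probe (host, port, timeout, and retry count are made-up parameters):

```python
import socket

def node_seems_dead(host: str, port: int,
                    timeout_s: float = 1.0, retries: int = 3) -> bool:
    """Probe a node; declare it dead only after several failed attempts."""
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return False          # got through - the node is reachable
        except OSError:
            continue                  # timed out / refused - try again
    return True                       # every attempt failed - declare it dead

# Example: probe a hypothetical node
# print(node_seems_dead("10.0.0.5", 8080))
```

Note that a "dead" verdict only means we stop sending requests there - the node may in fact be alive but slow.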
Some feedback about node’s failure:
- no process listening on the app port - RST/FIN TCP packets are returned (but if the node crashed while handling a request, there is no way to know how much it processed)
- the process crashed or was killed but the OS is still running - a script can notify other nodes about the crash
- the management interface of a network switch can be queried to detect link failures at the hardware level (e.g. the machine was powered down)
- a router may reply with ICMP Destination Unreachable (but the router is not an oracle either)
How long should the timeout be
No simple answer
Long timeout - long wait until node is declared dead -> users will have poor experience
Short timeout - faster fault detection -> high risk of false positive on temporary slowdowns (load spike)
-> some operation might be retried and done twice;
-> other nodes taking the responsibility might get overloaded (cascading failures)
Assuming a system in which every packet is either delivered within time d or lost, and a non-failed node always handles a request within time r, a reasonable timeout is 2d + r (d for the request to arrive, r for processing, d for the reply).
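A tiny worked example of that bound (the numbers are illustrative, not from the book):

```python
# If every packet is delivered within d seconds or lost, and a non-failed node
# always responds within r seconds, then request (d) + processing (r) + reply (d)
# takes at most 2d + r - anything slower can reasonably be treated as a failure.
d = 0.1   # assumed maximum one-way network delay, seconds
r = 0.3   # assumed maximum request-handling time, seconds

timeout = 2 * d + r
print(f"reasonable timeout: {timeout:.1f}s")   # 0.5s
```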
Problem - async networks have UNBOUNDED DELAYS
Packets are delivered as quickly as possible but no upper limit of delivery
Most common reason of network packet delays
Queueing
As with traffic jams - travel times vary due to traffic congestion
- several nodes try to send packets to the same destination at the same time - the network switch queues them and feeds them into the destination link one by one; if the queue overflows, the packet is dropped and must be resent
- the packet reaches the destination node but all CPU cores are busy - the request waits in a queue until the application can handle it
- a VM is paused (so other VMs can use the CPU) - no data is consumed from the network, so it must be buffered by the VM monitor
- TCP flow control (backpressure) - a node limits its own rate of sending to avoid overloading the network link or the receiver; extra queueing at the sender side
- TCP automatic retransmission - if a packet is not ACKed within a timeout (calculated from observed round-trip times), TCP retransmits it, which adds extra delay
How to choose timeout values?
Experimentally - measure the distribution of network round-trip times over:
- an extended period of time
- different machines
Continually measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution - similar to how TCP retransmission timeouts are chosen.
Phi Accrual failure detector (used in Akka and Cassandra)
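A simplified sketch of the idea behind the Phi Accrual detector (this is not Akka's or Cassandra's actual implementation): keep a window of recent heartbeat inter-arrival times, approximate them with a normal distribution, and report a suspicion level phi = -log10(probability that a heartbeat could still arrive this late). The threshold then adapts automatically to observed jitter.

```python
import math
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals = deque(maxlen=window)   # recent gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)          # avoid division by zero
        t = now - self.last_heartbeat
        # P(inter-arrival time > t) under a normal approximation
        p_later = 0.5 * math.erfc((t - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-10))

# Usage: call heartbeat(now) for every heartbeat received from a node and
# suspect the node once phi(now) exceeds a chosen threshold (e.g. 8).
```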
Why network cannot have predictable delays?
The telephone network is a circuit-switched network:
fixed bandwidth is allocated per call along the entire route between the two callers, for the entire duration of the call.
The bandwidth is small and fixed (e.g. ISDN allocates 16 bits of audio per frame at 4,000 frames per second, i.e. every 250 microseconds).
No queueing means that even though there are many hops, there is a maximum end-to-end latency - a BOUNDED DELAY.
Ethernet, IP are packet-switched network protocols:
TCP uses whatever bandwidth is available.
Idle TCP connections DO NOT USE any BW at all.
TCP allows VARIABLE SIZED blocks of data
The rate of data transfer adapts to the available network capacity
Why - optimized for BURSTY TRAFFIC
Circuits are good for audio/video calls - constant, predictable number of bits per second for the duration of call.
Web pages, emails, downloading files - no fixed bandwidth requirement -> latency is the priority, complete ASAP
TCP way is cheaper - better resource utilization
Ways to emulate circuit switching on packet-switched networks
- Quality of Service (priorities and scheduling of packets)
- Admission control (rate-limiting of senders)
Goal: statistically bounded delay
What are time and clocks used for in apps?
- detect timeout
- measure statistics
- 99th percentile
- user engagement on particular site
- requests per second on average over last X minutes
- publish date of resource
- cache expiry date
- when did an error occur
Measuring duration vs points in time
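This distinction maps onto the two kinds of clocks most platforms expose: a time-of-day (wall) clock for points in time and a monotonic clock for durations. A minimal sketch using Python's standard library:

```python
import time

# Point in time: the time-of-day (wall) clock. NTP can adjust it, so it may
# jump forwards or backwards - good for timestamps, bad for measuring durations.
published_at = time.time()            # seconds since the Unix epoch

# Duration: the monotonic clock. Its absolute value is meaningless, but it
# never jumps backwards - good for elapsed time, e.g. timeouts.
start = time.monotonic()
time.sleep(0.2)                       # the operation being measured
elapsed = time.monotonic() - start

print(f"published at {published_at:.0f}, operation took {elapsed:.3f}s")
```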
Why is time in distributed systems tricky?
Communication is not instantaneous
Message travels across network from one machine to another
Time when message is received is always later than the time it was sent
The delay is variable
Each machine has its own, IMPERFECT clock (a dedicated piece of hardware, typically a quartz crystal oscillator).
Clocks can go slightly faster or slower.
This makes it difficult to:
- order events
- measure durations
based on the time reported by the machines involved.
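A small illustration of the ordering problem (made-up numbers): two nodes timestamp events with their own clocks, one clock runs slightly behind, and sorting by local timestamps reverses the real order of events.

```python
# Node B's event really happens 20 ms after node A's, but B's clock is 50 ms
# behind, so B's event gets the *earlier* timestamp.
true_time_a = 10.000            # when A's event actually happened (seconds)
true_time_b = 10.020            # B's event happens 20 ms later
clock_skew_b = -0.050           # B's clock runs 50 ms behind (assumed)

timestamp_a = true_time_a                     # A's clock is accurate (assumed)
timestamp_b = true_time_b + clock_skew_b      # B stamps with its skewed clock

events = sorted([("A wrote x=1", timestamp_a), ("B wrote x=2", timestamp_b)],
                key=lambda e: e[1])
print(events)   # B's write sorts first even though it happened second
```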