Data Intensive Ch8 - Trouble with distributed systems Flashcards
Context map
Process pauses
Unbounded network delays
Clock errors
Chapter Summary
Problems in distributed systems
- send a packet over the network - it may be lost or delayed for an unknown amount of time; the reply may also be lost - you have no idea whether the request was received
- a node's clock may be out of sync with other nodes
- a process can be paused for a significant amount of time at any point of its execution (e.g. a stop-the-world GC); the node may be declared dead and then come back to life later
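A minimal sketch of the kind of bug such a pause can cause (names and numbers are illustrative, not taken from the book's code): the process checks that it still holds a lease, is paused right after the check, and then acts on a check that has become stale.

```python
import time

def handle(request):
    print("handling", request)

# Hypothetical 10-second lease; in a real system it would come from a lock service.
lease_expiry = time.monotonic() + 10.0

def process_request_if_leader(request):
    if time.monotonic() < lease_expiry:    # check: lease still valid
        # A stop-the-world GC pause of tens of seconds can happen right here.
        # The process has no idea it was suspended; by the time it resumes,
        # another node may already have been granted the lease.
        handle(request)                    # acting on a possibly stale check
    else:
        raise RuntimeError("lease expired - no longer the leader")
```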
Partial failures are a defining characteristic of distributed systems.
When network communication is involved, it can:
- occasionally fail
- randomly go slow
- not respond at all
Design for failure - tolerate partial failures
Detecting failure is hard - timeouts are one way, but not ideal. Dealing with a node that is not really dead but has been declared dead is difficult.
After a fault is detected, making decisions is still hard: there is no global variable, no shared memory, no shared state, so nodes must agree via a quorum - and consensus is a hard problem.
Working with distributed systems is fundamentally different from writing software for a single computer - lots of new ways for things to go wrong
Partial failure
In a single-machine environment there is no reason for things to be flaky: either it works or it doesn't.
Operations are deterministic.
Memory corruption leads to total system failure (kernel panic, blue screen of death, failure to boot)
In distributed systems some parts can be broken in unpredictable ways while others are fine.
They are NONDETERMINISTIC
Information traveling over the network takes an UNKNOWN time to reach its destination, so we don't know whether an operation SUCCEEDED or NOT
Difference between supercomputer (HPC) and cloud computing (or distributed computing)
High-performance computing (HPC) systems with thousands of CPUs are typically designed so that if one node fails, the entire cluster job fails.
Failure of a part escalates to a total crash.
Jobs do checkpoints periodically.
If a failure occurred, the job is restarted from the last checkpoint.
In this respect a supercomputer is more similar to a single-node computer than to a distributed system.
For cloud computing:
- workloads are not batch-job-like - they are online, low-latency, and must be served at any time
- high availability is important
- nodes are built from lower cost components
- the cloud network is based on IP and Ethernet; HPC uses specialized network topologies - multi-dimensional meshes and toruses
- the bigger the system (the more parts), the more likely it is that something is broken; it's safe to assume there is always something broken, so a strategy of escalating failures to the whole system won't work
- rolling upgrades and a cattle-like approach to nodes, so the service can keep serving users without interruption
- keeping data geographically close to the users reduces latency, but communication then goes over the unreliable, slow internet; HPC assumes all nodes are close together
Unreliable network - what could go wrong
- request is lost (broken network cable)
- request is waiting in a queue (recipient is overloaded)
- remote node has crashed
- remote node is temporarily not responding (process pause)
- the remote node processed the request, but the response was lost or delayed
Distributed systems are … systems
Shared-nothing - a bunch of machines connected by a network
One machine cannot access another's memory or disk in any way other than by making a network request
Network partition
AKA Netsplit
Condition when one part of the network is cut off from the rest due to network fault
Detecting faults
The only way to know whether a request was successful is to receive a positive response.
If anything goes wrong, we cannot assume we will get a response back.
We can try app-level retries and timeouts - eventually the node must be declared dead (e.g. to stop sending more requests to it)
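A minimal sketch of that idea, assuming a plain TCP health probe (host, port, timeout, and retry count are made-up parameters):

```python
import socket

def node_seems_dead(host: str, port: int,
                    timeout_s: float = 1.0, retries: int = 3) -> bool:
    """Probe a node; declare it dead only after several failed attempts."""
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return False          # got through - the node is reachable
        except OSError:
            continue                  # timed out / refused - try again
    return True                       # every attempt failed - declare it dead

# Example: probe a hypothetical node
# print(node_seems_dead("10.0.0.5", 8080))
```

Note that a "dead" verdict only means we stop sending requests there - the node may in fact be alive but slow.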
Some feedback about node’s failure:
- no process listening on the app port - RST/FIN TCP packets are returned (but if the node crashed while handling a request, there is no way to know how much it processed)
- the process crashed or was killed but the OS is still running - a script can notify other nodes about the crash
- the management interface of a network switch can be queried to detect link failures at the hardware level (e.g. the machine was powered down)
- a router may reply with ICMP Destination Unreachable (but the router is not an oracle either)
How long should the timeout be
No simple answer
Long timeout - long wait until node is declared dead -> users will have poor experience
Short timeout - faster fault detection -> high risk of false positive on temporary slowdowns (load spike)
-> some operation might be retried and done twice;
-> other nodes taking the responsibility might get overloaded (cascading failures)
Assuming a system in which every packet is either delivered within time d or lost, and a non-failed node always handles a request within time r, a reasonable timeout is 2d + r (d for the request to arrive, r for processing, d for the reply).
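A tiny worked example of that bound (the numbers are illustrative, not from the book):

```python
# If every packet is delivered within d seconds or lost, and a non-failed node
# always responds within r seconds, then request (d) + processing (r) + reply (d)
# takes at most 2d + r - anything slower can reasonably be treated as a failure.
d = 0.1   # assumed maximum one-way network delay, seconds
r = 0.3   # assumed maximum request-handling time, seconds

timeout = 2 * d + r
print(f"reasonable timeout: {timeout:.1f}s")   # 0.5s
```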
Problem - async networks have UNBOUNDED DELAYS
Packets are delivered as quickly as possible but no upper limit of delivery
Most common reason of network packet delays
Queueing
As with traffic jams - travel times vary due to traffic congestion
- several nodes try to send packets to the same destination at the same time - the network switch queues them and feeds them into the destination link one by one; if the queue overflows, the packet is dropped and must be resent
- the packet reaches the destination node but all CPU cores are busy - the request waits in a queue until the application can handle it
- a VM is paused (so other VMs can use the CPU) - no data is consumed from the network, so it must be buffered by the VM monitor
- TCP flow control (backpressure) - a node limits its own rate of sending to avoid overloading the network link or the receiver; extra queueing at the sender side
- TCP automatic retransmission - if a packet is not ACKed within a timeout (calculated from observed round-trip times), TCP retransmits it, which adds extra delay
How to choose timeout values?
Experimentally - measure the distribution of network round-trip times over:
- an extended period of time
- different machines
Continually measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution - similar to how TCP retransmission timeouts are chosen.
Phi Accrual failure detector (used in Akka and Cassandra)
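A simplified sketch of the idea behind the Phi Accrual detector (this is not Akka's or Cassandra's actual implementation): keep a window of recent heartbeat inter-arrival times, approximate them with a normal distribution, and report a suspicion level phi = -log10(probability that a heartbeat could still arrive this late). The threshold then adapts automatically to observed jitter.

```python
import math
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals = deque(maxlen=window)   # recent gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)          # avoid division by zero
        t = now - self.last_heartbeat
        # P(inter-arrival time > t) under a normal approximation
        p_later = 0.5 * math.erfc((t - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-10))

# Usage: call heartbeat(now) for every heartbeat received from a node and
# suspect the node once phi(now) exceeds a chosen threshold (e.g. 8).
```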
Why network cannot have predictable delays?
The telephone network is a circuit-switched network:
fixed bandwidth is allocated per call along the entire route between the two callers, for the entire duration of the call.
The bandwidth is small and fixed (e.g. ISDN allocates 16 bits of audio per frame at 4,000 frames per second, i.e. every 250 microseconds).
No queueing means that even though there are many hops, there is a maximum end-to-end latency - a BOUNDED DELAY.
Ethernet, IP are packet-switched network protocols:
TCP uses whatever bandwidth is available.
Idle TCP connections DO NOT USE any BW at all.
TCP allows VARIABLE SIZED blocks of data
The rate of data transfer adapts to the available network capacity
Why - optimized for BURSTY TRAFFIC
Circuits are good for audio/video calls - constant, predictable number of bits per second for the duration of call.
Web pages, emails, downloading files - no fixed bandwidth requirement -> latency is the priority, complete ASAP
TCP way is cheaper - better resource utilization
Ways to emulate circuit switching on packet-switched networks
- Quality of Service (priorities and scheduling of packets)
- Admission control (rate-limiting of senders)
Goal: statistically bounded delay
What are time and clocks used for in apps?
- detect timeout
- measure statistics
- 99th percentile
- user engagement on particular site
- requests per second on average over last X minutes
- publish date of resource
- cache expiry date
- when did an error occur
Measuring duration vs points in time
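This distinction maps onto the two kinds of clocks most platforms expose: a time-of-day (wall) clock for points in time and a monotonic clock for durations. A minimal sketch using Python's standard library:

```python
import time

# Point in time: the time-of-day (wall) clock. NTP can adjust it, so it may
# jump forwards or backwards - good for timestamps, bad for measuring durations.
published_at = time.time()            # seconds since the Unix epoch

# Duration: the monotonic clock. Its absolute value is meaningless, but it
# never jumps backwards - good for elapsed time, e.g. timeouts.
start = time.monotonic()
time.sleep(0.2)                       # the operation being measured
elapsed = time.monotonic() - start

print(f"published at {published_at:.0f}, operation took {elapsed:.3f}s")
```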
Why is time in distributed systems tricky?
Communication is not instantaneous
Message travels across network from one machine to another
Time when message is received is always later than the time it was sent
The delay is variable
Each machine has its own, IMPERFECT clock (a dedicated piece of hardware, typically a quartz crystal oscillator).
Clocks can go slightly faster or slower.
This makes it difficult to:
- order events
- measure durations
based on the time reported by the machines involved.
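A small illustration of the ordering problem (made-up numbers): two nodes timestamp events with their own clocks, one clock runs slightly behind, and sorting by local timestamps reverses the real order of events.

```python
# Node B's event really happens 20 ms after node A's, but B's clock is 50 ms
# behind, so B's event gets the *earlier* timestamp.
true_time_a = 10.000            # when A's event actually happened (seconds)
true_time_b = 10.020            # B's event happens 20 ms later
clock_skew_b = -0.050           # B's clock runs 50 ms behind (assumed)

timestamp_a = true_time_a                     # A's clock is accurate (assumed)
timestamp_b = true_time_b + clock_skew_b      # B stamps with its skewed clock

events = sorted([("A wrote x=1", timestamp_a), ("B wrote x=2", timestamp_b)],
                key=lambda e: e[1])
print(events)   # B's write sorts first even though it happened second
```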