7 - Fault Tolerance Flashcards
Dependability
Correctness of one component depends on another component
Dependability Example
Web server W requires database D.
D fails, so does W.
Dependability requires
Availability
Reliability
Safety
Maintainability
Availability
Readiness for usage
A(t) = probability that component is immediately usable
Reliability
Continuity of service delivery
R(t) = probability component works over time period [0,t]
Safety
Low risk of component failure leading to system failure.
Maintainability
Easy and quick to repair
When is a system more available but less reliable
more available meaning generally online but less reliable meaning scattered downtime
MTTF
Mean Time to Failure
Average time between start and failure
MTTR
Mean Time to Repair
Average time of repair
MTBF
Mean Time between Failures
MTBF = MTTF + MTTR
Availability (expression in relation to MTTF MTBF etc)
time available/whole time
MTTF/MTBF
Failure Rate z(t) of component C
z(t) is the (conditional) probability that C initially fails at t.
Reliability R(t) of component C
Declines exponentially even if z(t) is constant.
R(t) = e^(-z*t)
Density Function
f(t)
Probability of fail at
CDF
Cumulative Distribution Function
Probability to fail by t
Fault classifications
Permanent: exists until repaired
Intermittent: reappearing
Transient: occurs just once and disappears
Fault
Cause of error
Error
Part of component that can lead to failure
Failure
Component is not running as expected
Dealing with faults
Tolerant system
Good fault removal
Fault forecasting
Are parallel programs fault tolerant?
Often not fault-tolerant.
Wastes resources
Crashes as soon as one component fails
Checkpointing
Save program state in regular intervals and pick up from last one after a crash.
Avoid unneccessary checkpoints - predict crash
Failure Types
(one word names)
Halting
Omission
Timing
Response
Arbitrary/byzantine
Halting failure
Component stops
Omission failure
Failure to respond/receive/send messages
Timing failure
Response outside of specified time interval
Response failures
Incorrect response. Value or state-transition
Arbitrary/byzantine failure
Arbitrary responses at arbitrary times
Async or sync: Reliable crash failure detection
Async - cannot be reliably detected
Sync - can be reliably detected, detect by timeout but NO CERTAINTY
Halt fail types
stop - reliably detectable
noisy - eventually reliably detectable
silent - client cannot distinguish crash/omission
safe - benign
arbitrary - unobservable and harmful
Information Redundancy
Add extra error detection bits, eg checksum
Time redundancy
Designed to repeat action if required
Physical Redundancy
Add replicas of components
Triple Modular Redundancy
Every device replicated 3 times
Each replicate should produce same result.
Majority value is used
Process Groups
Dynamic group of (mostly) identical processes to achieve fault tolerance via physical redundancy.
All members receive all messages
Process Group examples
Replicated data store
Master-worker scheme
Hierarchical Group
One member is coordinator.
If coordinator fails, new one must be elected.
Flat Group
Members coordinate together
Majority decisions
Robust against crashes
Process Groups: Centralised membership
Group server manages.
- Efficient, Simple removal, Message sync easier
- Single point of failure, group must be reconstructed if server ceases
Process Groups: Distributed membership
Processes inform all of join/leave
- More fault tolerant, Many processes must fail before rebuild
- Less efficient, crashed processes must be detected, message sync harder
k-fault-tolerant group
A group can mask any k concurrent member failures
k is called degree of fault tolerance
k-fault-tolerant group needs to be how large for halting failures…?
k+1 because one member is enough
k-fault-tolerant groups needs to be how large for arbitrary/byzantine failures
2*k +1 are needed because a majority vote is required
Consensus
All processes in a process group agree
Flooding based consensus: Algorithm for each process
Pi sends its set of commands to all others.
For each pair (pi,pj), Pi detects either a Pj message or crash.
Pi merges all received commands
If Pi detects a crash.
- Pi does not exec a command.
- Pi runs error routine in next round
Else:
- Pi selects next command and exec
Flooding based consensus: Algorithm after failure
If Pi received all messages, but others did not:
- Pi sends msgs and decision from previous round to other processes
Elseif Pi did not:
- Pi signlas that it did not receive all
- Pi runs command from last round
Byzantine Consensus: Context
Some generals in the Byzantine Empire may be traitors (faulty processes) and deliberately send false messages
Byzantine Consensus: Attack or retreat?
If all (honest) retreat, it is fine.
If all (honest) attack, it will succeed.
If a subset attacks, it fails.
Generally, Byzantine Consensus can be reached when…
if k processes are faulty, then at least n = 3*k+1 are required
Failure Detection Types/Methods
Passive
Active
Gossiping
How is a byzantine failure caused?
By a faulty/malicious process producing a false value
How many byzantine failures can be tolerated in Triple Module Redundancy (3 voting devices)?
1.
If there’s 3 devices and 1 gives false results, the other 2 will still agree and create a majority
How many crash failures can Triple Modular Redundancy tolerate? and why?
It can tolerate 2.
Crashed devices don’t return ANY results so it leaves 1 functioning voter and hence, a majority.
How many byzantine failures can a system of n voters tolerate?
(n-1)/2 failures
How many crash failures can be tolerated by a system of n voters?
n-1 crashes