Fault Tolerance - Week 3 Flashcards
Fault tolerance
The ability of a system to keep operating in an acceptable way when a (partial) failure occurs.
Types of Failure
Omission Failures
Timing Failures
Response Failures
Arbitrary (byzantine) failures
Crashes
Omission Failures
Server fails to respond to incoming messages
Server fails to receive incoming messages
Server fails to send messages
Timing failures
Server fails to respond within a certain time
Response failures
A server’s response is incorrect
Arbitrary (byzantine) failures
A component produces output it should never have produced (and which may not be detected as incorrect): arbitrary responses at arbitrary times
Crashes
Server halts
Fault-tolerance / Failure masking - through redundancy
Physical redundancy
Time redundancy
Information redundancy
Physical redundancy
Having a backup server (no definition given in the slides)
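A minimal sketch of the backup-server idea, assuming each replica is modelled as a plain Python function and a crash shows up as a ConnectionError. The names call_with_failover, crashed_primary and backup are hypothetical and only for illustration.

```python
def call_with_failover(servers, request):
    """Try each replica in order and return the first successful reply."""
    for server in servers:
        try:
            return server(request)
        except ConnectionError:
            # This replica is down; fall through to the next one.
            continue
    raise RuntimeError(f"all {len(servers)} replicas failed")


def crashed_primary(request):
    # Hypothetical primary that has halted.
    raise ConnectionError("server halted")


def backup(request):
    # Hypothetical backup that is still running.
    return f"handled {request}"


print(call_with_failover([crashed_primary, backup], "read x"))
```

The client never sees the crash of the primary as long as at least one replica answers.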
Time redundancy
An action is performed and, if need be, repeated until it succeeds.
Especially helpful when faults are transient or intermittent
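Time redundancy can be sketched as a retry loop. This is an illustration under assumptions, not a definition from the slides: the helper names retry and make_flaky are hypothetical, and a transient fault is modelled as a ConnectionError.

```python
def retry(action, max_attempts=5):
    """Call action() until it succeeds or max_attempts is reached.

    Returns (result, number_of_attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action(), attempt
        except ConnectionError as err:  # treated as a transient fault
            last_error = err
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


def make_flaky(failures_before_success):
    """Hypothetical action that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def action():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("transient fault")
        return "ok"

    return action


result, attempts = retry(make_flaky(2))
print(result, attempts)  # ok 3
```

Retrying only helps if the fault goes away on its own; a permanent fault exhausts the attempts and is reported to the caller.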
Information redundancy
e.g. send extra bits when transmitting information so that the receiver can recover from errors
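One simple way to picture "extra bits" is a triple-repetition code, chosen here only because it is short; the slides do not name a specific code. Each bit is sent three times and the receiver takes a majority vote, which recovers from one flipped bit per group of three.

```python
def encode(bits):
    """Repeat every bit three times (the extra copies are the redundancy)."""
    return [b for b in bits for _ in range(3)]


def decode(coded):
    """Majority-vote each group of three received bits."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]


message = [1, 0, 1, 1]
received = encode(message)
received[4] ^= 1  # one bit is flipped in transit
print(decode(received) == message)  # True
```

The price is bandwidth: three bits are transmitted for every bit of information.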
Two generals problem - unreliable network
The two generals are on separate mountains; if they don’t attack at the same time, they die.
With an unreliable channel:
G1 -> G2: Let’s attack at 9am
G2 -> G1: I received your message to attack
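G2 doesn’t know whether G1 received this acknowledgement, so G1 would have to acknowledge it in turn, and so on
In general, no finite exchange of messages can guarantee that both generals know the other got the message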
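The argument can be illustrated with a small sketch, assuming the generals strictly alternate messages (G1 sends first). The function views is hypothetical: it records which message indexes each general receives. Whoever sends the last message observes exactly the same thing whether that message arrives or is lost, so that general cannot tell the two situations apart.

```python
def views(n_messages, lost_last):
    """Return the message indexes each general receives.

    G1 sends even-indexed messages, G2 sends odd-indexed ones.
    If lost_last is True, the final message never arrives.
    """
    received = {"G1": [], "G2": []}
    for i in range(n_messages):
        receiver = "G2" if i % 2 == 0 else "G1"
        if lost_last and i == n_messages - 1:
            continue  # the channel drops the final message
        received[receiver].append(i)
    return received


for n in range(1, 6):
    last_sender = "G1" if (n - 1) % 2 == 0 else "G2"
    same = views(n, False)[last_sender] == views(n, True)[last_sender]
    print(n, last_sender, same)  # same is True for every n
```

Adding one more acknowledgement only moves the uncertainty to the other general; it never removes it.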
Two generals problem - reliable network
The two generals are on separate mountains; if they don’t attack at the same time, they die.
Assume a reliable communication channel
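If one general is a traitor (the Byzantine generals setting), four generals are enough for the loyal generals to out-vote the traitor and still agree; in general, 3f + 1 generals are needed to tolerate f traitors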
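A simplified sketch of why four generals suffice, assuming one commander and three lieutenants, reliable channels, and a single round in which every lieutenant relays the order it received and then takes a majority vote. The function vote_round and the way the traitor behaves (flipping every order it relays) are assumptions made for illustration.

```python
from collections import Counter

FLIP = {"attack": "retreat", "retreat": "attack"}


def vote_round(commander_orders, traitor_lieutenant=None):
    """Return the decision of every loyal lieutenant.

    commander_orders: dict mapping lieutenant name -> order received
    traitor_lieutenant: a lieutenant that relays flipped orders, or None
    """
    decisions = {}
    for me in commander_orders:
        if me == traitor_lieutenant:
            continue
        seen = [commander_orders[me]]  # what the commander told me
        for other, order in commander_orders.items():
            if other == me:
                continue
            # What 'other' claims the commander said.
            seen.append(FLIP[order] if other == traitor_lieutenant else order)
        decisions[me] = Counter(seen).most_common(1)[0][0]
    return decisions


# Loyal commander, lieutenant L3 is the traitor.
print(vote_round({"L1": "attack", "L2": "attack", "L3": "attack"}, "L3"))
# Traitor commander sends conflicting orders; all lieutenants are loyal.
print(vote_round({"L1": "attack", "L2": "retreat", "L3": "attack"}))
```

In both runs the loyal lieutenants end up with the same decision, because each one votes over three reports and at most one of them is corrupted.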
Redundancy Pros
Helps increase reliability
- increases the probability that the system operates correctly at any given moment
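A back-of-the-envelope sketch of this effect, assuming replicas fail independently and the system works as long as at least one replica works. The function name availability and the 0.9 figure are illustrative assumptions, not values from the slides.

```python
def availability(single, replicas):
    """Probability that at least one of `replicas` independent copies is up,
    where each copy is up with probability `single`."""
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    print(n, round(availability(0.9, n), 6))
```

With these assumptions each extra replica cuts the probability of total failure by a factor of ten; correlated failures (shared power, shared bugs) make the real gain smaller.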
Redundancy Cons
Creates several problems
- consistency of replicas (e.g. data on all replicas need to be updated)
- performance: the added replicas should (somehow) improve system performance rather than slow it down
Has a cost (monetary or other)
Even in the presence of redundancy, we need to make sure that any failure won’t leave our system in an inconsistent (corrupted) state