Fault Tolerance - Week 3 Flashcards
Fault tolerance
The ability of a system to keep operating in an acceptable way when a (partial) failure occurs.
Types of Failure
Omission Failures
Timing Failures
Response Failures
Arbitrary (byzantine) failures
Crashes
Omission Failures
Server fails to respond to incoming messages
Server fails to receive incoming messages
Server fails to send messages
Timing failures
Server fails to respond within a certain time
Response failures
A server’s response is incorrect
Arbitrary (byzantine) failures
A component produces output it should never have produced (and which may not be detected as incorrect): arbitrary responses at arbitrary times
Crashes
Server halts
Fault-tolerance / Failure masking - through redundancy
Physical redundancy
Time redundancy
Information redundancy
Physical redundancy
Having a backup server (no definition given in the slides)
Time redundancy
An action is performed, if need be, again and again.
Especially helpful when faults are transient and intermittent
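A minimal sketch of time redundancy as a retry loop. The helper name `retry`, the `flaky` operation, and the parameter values are all hypothetical and chosen only for illustration; it assumes the fault is transient, so repeating the action eventually succeeds.

```python
import time


def retry(action, attempts=3, delay_s=0.0):
    """Call action() until it succeeds or the attempts run out (attempts >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as err:  # treat any exception as a possibly transient fault
            last_error = err
            time.sleep(delay_s)   # wait before performing the action again
    raise last_error              # the fault was not transient after all


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"


result = retry(flaky, attempts=5)
```

Retrying only masks transient and intermittent faults; a permanent fault simply exhausts the attempts and the last error is re-raised.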
Information redundancy
e.g. Send extra bits when transmitting information to allow recovery
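One simple way to illustrate information redundancy is a 3x repetition code: every bit is sent three times and the receiver takes a majority vote per group. This is only a sketch of the idea (the slides do not name a specific code, and real systems typically use more efficient codes); the function names are made up for the example.

```python
def encode(bits):
    """Repeat each bit three times (a 3x repetition code)."""
    return [b for b in bits for _ in range(3)]


def decode(coded):
    """Majority-vote each group of three received bits."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]


message = [1, 0, 1, 1]
sent = encode(message)

received = sent.copy()
received[4] ^= 1               # one bit is flipped in transit

recovered = decode(received)   # the single flipped bit is outvoted
```

The extra bits triple the message size, and the code can only recover from one flipped bit per group of three.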
Two generals problem - unreliable network
The two generals are on separate mountains. If they don’t attack at the same time, they die.
With an unreliable channel:
G1 -> G2: Let’s attack at 9am
G2 -> G1: I received your message to attack
G2 doesn’t know whether G1 received the acknowledgement
In general, there is no way to guarantee that both generals know the message got through: whoever sent the last message can never be sure it arrived
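The argument can be checked with a small sketch. It assumes a hypothetical protocol of `n_messages` alternating messages (G1 -> G2, G2 -> G1, ...) in which only the last message may be lost, and compares what each general has received in the two possible runs. The function name and the modelling are illustrative, not part of the slides.

```python
def run(n_messages, last_delivered):
    """Simulate alternating messages starting with G1 -> G2.

    Every message arrives except possibly the last one.
    Returns the message indices each general received.
    """
    received = {"G1": [], "G2": []}
    for i in range(n_messages):
        receiver = "G2" if i % 2 == 0 else "G1"
        is_last = i == n_messages - 1
        if not is_last or last_delivered:
            received[receiver].append(i)
    return received


same_view_for_last_sender = []
for n in range(1, 6):
    last_sender = "G1" if (n - 1) % 2 == 0 else "G2"
    delivered = run(n, last_delivered=True)
    lost = run(n, last_delivered=False)
    # The last sender has received exactly the same messages in both runs,
    # so it cannot tell whether its final message got through.
    same_view_for_last_sender.append(delivered[last_sender] == lost[last_sender])
```

However many acknowledgements are added, the sender of the final one is always in this position, which is why extending the protocol does not help.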
Two generals problem - reliable network
The two generals are on separate mountains. If they don’t attack at the same time, they die.
Assume a reliable communication channel
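The remaining problem is a traitor rather than lost messages (the Byzantine generals setting): if one general is a traitor, four generals are enough for the loyal ones to spot the inconsistency and still agree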
Redundancy Pros
Helps increase reliability
- increases the probability that the system operates correctly at any given moment
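As a rough worked example of this claim: if each replica works at a given moment with probability p, and replicas are assumed to fail independently (a strong assumption that the slides do not state), then at least one of n replicas works with probability 1 - (1 - p)^n. The function name below is made up for the example.

```python
def availability(p, n):
    """Probability that at least one of n independent replicas is working."""
    return 1 - (1 - p) ** n


# With p = 0.9 per replica, each extra replica removes most of the remaining risk:
# roughly 0.9, 0.99 and 0.999 for one, two and three replicas.
values = [availability(0.9, n) for n in (1, 2, 3)]
```

Correlated failures (shared power, shared software bugs) make the real figure lower than this estimate.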
Redundancy Cons
Creates several problems
- consistency of replicas (e.g. data on all replicas need to be updated)
- replication should (somehow) still improve system performance, not degrade it.
Has a cost (monetary or other)
Even in the presence of redundancy, we need to make sure that any failure won’t leave our system in an inconsistent (corrupted) state
Triple Modular Redundancy (TMR)
A task is replicated three times and the three outputs are fed to a stage of three voters. Each voter can tell if one of the processes failed, since its output won’t match the other two, and passes the correct (majority) value on to the next process (see image in Week3 OneNote)
Through replication three times, even if one component fails, the output will still be correct.
Used in airplanes. Chance of something going wrong is very low but still not zero.
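A minimal sketch of a single voter, assuming outputs are hashable values that can be compared for equality. The replica functions `square` and `broken_square` are hypothetical stand-ins for the replicated task.

```python
from collections import Counter


def vote(a, b, c):
    """Return the value reported by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica disagrees")
    return value


def square(x):
    return x * x


def broken_square(x):      # hypothetical faulty replica
    return x * x + 1


replicas = [square, broken_square, square]
outputs = [replica(4) for replica in replicas]
result = vote(*outputs)    # the two correct replicas outvote the faulty one
```

If two replicas fail in the same way the voter passes on the wrong value, and if all three disagree there is no majority at all, which is why the chance of failure is low but not zero.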
Replication for performance
Placing a copy of data close to the process using it decreases the time to access the data.
Useful for scalability, e.g.:
- Server needs to handle more requests: replicate the server and divide the work among the replicas.
- Caching: web browsers store a copy of a website to avoid the latency of fetching it from the originating server again.
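The caching case can be sketched as a client that keeps a local copy of each page. `CachingClient` and `fake_origin` are hypothetical names for this example, and the sketch deliberately ignores invalidation, i.e. it never notices when the origin copy changes.

```python
class CachingClient:
    """Keeps a local copy of each page so repeat requests skip the origin server."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # slow call to the originating server
        self._cache = {}                 # url -> locally stored copy
        self.origin_requests = 0

    def get(self, url):
        if url not in self._cache:       # miss: pay the full latency once
            self.origin_requests += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]          # hit: served from the local copy


# Hypothetical origin server, for illustration only.
def fake_origin(url):
    return f"<html>content of {url}</html>"


client = CachingClient(fake_origin)
first = client.get("example.org/index")
second = client.get("example.org/index")   # served from the cache
```

Only the first request reaches the origin; the price is that the cached copy can become stale.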
Capacity Planning
The process of determining the necessary capacity to meet a certain level of demand - extends beyond distributed computing.
E.g. a Monte Carlo simulation to estimate the number of servers needed
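A rough sketch of such a simulation. Everything here is an assumption made for illustration: demand is modelled as normally distributed requests per second, each server handles a fixed number of requests per second, and the goal is to cover demand in a target fraction of simulated moments.

```python
import math
import random


def servers_needed(mean_demand, sd_demand, per_server_capacity,
                   target=0.99, trials=10_000, seed=0):
    """Estimate how many servers cover demand in `target` of simulated trials."""
    rng = random.Random(seed)
    # Draw one random demand level per trial (clipped at zero).
    demands = sorted(max(0.0, rng.gauss(mean_demand, sd_demand))
                     for _ in range(trials))
    # Demand level that is not exceeded in at least `target` of the trials.
    index = min(trials - 1, math.ceil(target * trials) - 1)
    peak = demands[index]
    return max(1, math.ceil(peak / per_server_capacity))


# Hypothetical workload: about 1000 requests/s on average, 100 requests/s per server.
estimate = servers_needed(mean_demand=1000, sd_demand=100, per_server_capacity=100)
```

Planning for the average alone would give 10 servers here; the simulation asks for more because it plans for the high-demand tail.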
Replication Cons
Cost of maintaining replicas
Consistency problems
To ensure consistency, all modifications have to be applied to all copies. When and where they are applied determines the price of replication
Replica management
- Where to place replica servers to minimize overall data transfer?
- In general this is a classic optimisation problem, but in practice it is often a management/commercial issue
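Viewed purely as an optimisation problem, replica placement can be sketched as: pick k of the candidate sites so that the total cost of every client reaching its cheapest chosen replica is minimal. The brute-force search and the cost matrix below are illustrative assumptions; exhaustive search is only feasible for small inputs.

```python
from itertools import combinations


def total_cost(cost, chosen):
    """Sum over clients of the cost to their cheapest chosen site."""
    return sum(min(row[s] for s in chosen) for row in cost)


def place_replicas(cost, k):
    """cost[c][s] is the transfer cost from client c to candidate site s.

    Tries every set of k sites and returns (best_sites, best_total_cost).
    """
    sites = range(len(cost[0]))
    best = min(combinations(sites, k), key=lambda chosen: total_cost(cost, chosen))
    return best, total_cost(cost, best)


# Hypothetical costs for 4 clients and 3 candidate sites.
cost = [
    [1, 5, 9],
    [4, 1, 6],
    [8, 5, 1],
    [2, 2, 7],
]
best_sites, best_total = place_replicas(cost, k=2)   # sites 0 and 2, total cost 8
```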