Module 10a - Fault Tolerance Flashcards
Describe the concept of Availability for fault tolerance in distributed computing
The system should operate correctly at any given instant in time.
Ex: a real system may be 99% available
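To make the 99% figure concrete, the sketch below converts an availability fraction into the downtime it allows per year. The function name `downtime_hours_per_year` and the 365-day year are assumptions made for illustration, not part of the original card.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Return the hours per year a system may be down at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# A 99% available system may still be down for roughly 87.6 hours a year.
print(round(downtime_hours_per_year(0.99), 1))
```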
Describe the concept of Reliability in distributed computing
The system should run continuously without interruption.
Ex: A real system may have a mean time between failures (MTBF) of one month
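One simple way to estimate MTBF is to average the gaps between consecutive failure timestamps. The sketch below assumes a hypothetical failure log measured in days; the function name and the sample data are invented for illustration.

```python
def mean_time_between_failures(failure_times: list[float]) -> float:
    """Average gap between consecutive failure timestamps (same unit as the input)."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical log: days (since deployment) on which the system failed.
print(mean_time_between_failures([10, 40, 70, 100]))  # 30.0, i.e. about one month
```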
Describe the concept of Safety in distributed computing
Failure of the system should not have catastrophic consequences.
Ex: your car can still come to a complete stop if the ABS fails
Describe the concept of Maintainability in distributed computing
A failed system should be easy to repair.
Ex: disks can be replaced easily in a RAID
Define the term “error” in distributed computing
Error: A part of a system’s state that might lead to a failure.
Ex: dropped or damaged network packet
A ____ may lead to an _____ which may lead to a _____
fault
error
failure
Define the term “fault” in distributed computing
Fault: The cause of an error
Ex: a person talking on a mobile phone walks into an elevator, which blocks the signal
What are the 3 types of faults?
Transient faults
Intermittent faults
Permanent faults
What is a transient fault?
Transient faults occur once and then disappear.
Ex: a bird flies in front of a microwave receiver
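Because a transient fault disappears on its own, repeating the operation is often enough to get past it. The sketch below illustrates this under stated assumptions: `FlakyLink` is a hypothetical stand-in for the microwave link that fails exactly once, and `retry` is an illustrative helper, not a library function.

```python
def retry(operation, attempts: int = 3):
    """Call operation until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try again
    raise last_error


class FlakyLink:
    """Hypothetical link that fails on the first send (the bird), then works."""

    def __init__(self) -> None:
        self.calls = 0

    def send(self) -> str:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("signal blocked")
        return "delivered"


link = FlakyLink()
print(retry(link.send))  # delivered
```

A retry like this does not help with a permanent fault, where every attempt fails until the faulty component is replaced.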
What is an intermittent fault?
Intermittent faults occur, vanish, then reappear. They are difficult to debug.
What is a permanent fault?
Permanent faults occur and will continue to exist until a faulty component is replaced.
Ex: burnt out power supply in a server
What are the 5 types of failures in distributed systems?
Crash failure
Omission failure
Timing failure
Response failure
Arbitrary failure
What is a Crash failure?
A server halts, but works correctly until it halts.
What is an Omission failure?
A server fails to respond to incoming requests
What is a Timing failure?
A server’s response lies outside the specified time interval
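The difference between an omission failure and a timing failure can be sketched as a small classifier. This is only an illustration: the function name, the use of `None` for "no reply was observed", and the interval bounds are assumptions, not part of the original cards.

```python
from typing import Optional


def classify_response(elapsed_s: Optional[float], min_s: float, max_s: float) -> str:
    """Classify one request by how long the reply took.

    elapsed_s is None when no reply was observed at all (assumption for this sketch).
    """
    if elapsed_s is None:
        return "omission failure"  # the server never responded
    if elapsed_s < min_s or elapsed_s > max_s:
        return "timing failure"  # a reply arrived, but outside the specified interval
    return "ok"


print(classify_response(0.2, 0.0, 1.0))   # ok
print(classify_response(1.5, 0.0, 1.0))   # timing failure
print(classify_response(None, 0.0, 1.0))  # omission failure
```

Note that a reply that arrives too early also counts as a timing failure here, since it too lies outside the specified interval.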