LU7 Fault Tolerance and Recovery Flashcards
What is dependability in distributed systems?
Dependability refers to the ability of a system to deliver service that can be justifiably trusted.
What are the properties of dependability?
Availability, Reliability, Safety, and Maintainability.
What does availability mean in the context of distributed systems?
Availability is the readiness of a system for usage when needed.
What does reliability mean in distributed systems?
Reliability is the continuity of service delivery without interruptions.
What does safety refer to in distributed systems?
Safety refers to the low probability of catastrophic failures in the system.
What is maintainability in distributed systems?
Maintainability is the ease with which a failed system can be repaired and restored to service.
What is a failure in distributed systems?
A failure occurs when a component does not behave according to its specification.
What is an error in distributed systems?
An error is a part of a component’s state that can lead to a failure.
What is a fault in distributed systems?
A fault is the cause of an error within the system.
What is fault prevention?
Fault prevention means keeping faults from occurring in the system in the first place.
What is fault tolerance?
Fault tolerance means building a system so that it still meets its specification in the presence of faults, that is, it masks them.
What is fault removal?
Fault removal involves reducing the presence, number, or severity of faults in the system.
What is fault forecasting?
Fault forecasting estimates the current number, future incidence, and consequences of faults.
What is process resilience?
Process resilience involves protecting against faulty processes by replicating and distributing computations in a group.
What are flat groups in process resilience?
In a flat group, all members are peers and exchange information directly with every other member. This is good for fault tolerance, but fully distributed control adds overhead.
What are hierarchical groups in process resilience?
In a hierarchical group, all communication goes through a single coordinator. This is easier to implement but less fault tolerant, because the coordinator is a single point of failure.
What is k-fault tolerance?
A group is k-fault tolerant if it can mask any k concurrent member failures.
How many members are needed for k-fault tolerance under crash semantics?
k + 1 members are needed: if k members crash, one correct member still remains to provide the service.
How many members are needed for k-fault tolerance under arbitrary failure semantics?
2k + 1 members are needed if the group output is defined by voting, so that the k + 1 correct members always outvote the k faulty ones.
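A minimal sketch of the voting rule on this card, assuming each reply is a plain comparable value and at most k replies are wrong. The function name `vote` and the sample replies are illustrative only.

```python
from collections import Counter

def vote(replies, k):
    """Return the value reported by at least k + 1 members, or None if no value gets that many votes."""
    if len(replies) < 2 * k + 1:
        raise ValueError("need at least 2k + 1 replies to mask k arbitrary faults")
    value, count = Counter(replies).most_common(1)[0]
    # With at most k faulty members, any value seen k + 1 times
    # must have been reported by at least one correct member.
    return value if count >= k + 1 else None

# Three members, one of which (hypothetically) returns a corrupted result.
print(vote([42, 42, 7], k=1))  # 42
```

With only 2k members, k faulty replies could tie with k correct ones, which is why the extra member is needed.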
What is Byzantine failure?
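A Byzantine failure is an arbitrary failure: a component may behave in any way at any time, including maliciously or inconsistently.
How many members are needed to reach agreement despite Byzantine failures?
3k + 1 members are needed for the correct members to reach agreement among themselves when up to k members are Byzantine. (The 2k + 1 bound applies only when the group output is decided by voting on the members' replies.)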
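As a quick worked example, the three group-size bounds from the cards above can be tabulated for small k. The function name `min_group_size` and the failure-model labels are hypothetical, chosen only for this sketch.

```python
def min_group_size(k, failure_model):
    """Smallest group size that tolerates k faulty members under the given failure model."""
    sizes = {
        "crash": k + 1,          # one surviving member is enough
        "voting": 2 * k + 1,     # correct members must outvote k faulty ones
        "byzantine": 3 * k + 1,  # bound stated for Byzantine failures
    }
    return sizes[failure_model]

for k in (1, 2, 3):
    row = [min_group_size(k, m) for m in ("crash", "voting", "byzantine")]
    print(k, row)
# 1 [2, 3, 4]
# 2 [3, 5, 7]
# 3 [4, 7, 10]
```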
What is failure detection in distributed systems?
Failure detection involves identifying failed components using mechanisms like timeouts.
Why is setting timeouts challenging in failure detection?
Timeouts are difficult to set correctly and depend on application-specific requirements. A timeout also cannot distinguish a crashed process from a slow process or a network failure.
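The trade-off can be seen in a small heartbeat-based detector. This is a sketch under assumptions: the class name `TimeoutFailureDetector`, the node names, and the injected fake clock are all illustrative, and a real detector would receive heartbeats over the network.

```python
import time

class TimeoutFailureDetector:
    """Suspects any node whose last heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock       # injectable so the example can use a fake clock
        self.last_seen = {}

    def heartbeat(self, node):
        self.last_seen[node] = self.clock()

    def suspected(self):
        now = self.clock()
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}

def run(timeout):
    now = [0.0]                  # fake clock, advanced by hand
    detector = TimeoutFailureDetector(timeout, clock=lambda: now[0])
    detector.heartbeat("A")
    detector.heartbeat("B")
    now[0] = 2.0
    detector.heartbeat("A")      # A is slow but alive; B has gone silent
    now[0] = 4.0
    return detector.suspected()

print(sorted(run(timeout=3.0)))  # ['B']
print(sorted(run(timeout=1.0)))  # ['A', 'B']
```

With the shorter timeout the live but slow node A is wrongly suspected. A longer timeout avoids that at the cost of noticing B's silence later.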
What is gossiping in failure detection?
With gossiping, nodes regularly exchange information with their neighbors, so failure detection information is disseminated proactively throughout the system.
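One common way to realize this is to gossip heartbeat counters. The sketch below is a simplified simulation, not a specific protocol from the course: the function names, the node names, the push-to-one-random-peer rule, and the `lag` threshold are all assumptions made for illustration.

```python
import random

def merge(local, remote):
    """Keep the highest heartbeat counter seen so far for every node."""
    for node, counter in remote.items():
        if counter > local.get(node, -1):
            local[node] = counter

def gossip_round(tables, alive, rng):
    """Each live node bumps its own counter and pushes its table to one random live peer."""
    for node in alive:
        tables[node][node] += 1
        peer = rng.choice([p for p in alive if p != node])
        merge(tables[peer], tables[node])

def stale_nodes(table, lag):
    """Nodes whose counter trails the freshest counter in this table by more than `lag`."""
    freshest = max(table.values())
    return {n for n, c in table.items() if freshest - c > lag}

nodes = ["A", "B", "C", "D"]
tables = {n: {m: 0 for m in nodes} for n in nodes}
rng = random.Random(0)               # fixed seed only to make the run repeatable

gossip_round(tables, nodes, rng)     # all four nodes are alive
for _ in range(20):                  # D has crashed and no longer gossips
    gossip_round(tables, ["A", "B", "C"], rng)

print("D" in stale_nodes(tables["A"], lag=10))  # True
```

D's counter stops advancing once it crashes, so every live node eventually sees it fall behind without ever probing D directly.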