LU7 Fault Tolerance and Recovery Flashcards

1
Q

What is dependability in distributed systems?

A

Dependability refers to the ability of a system to deliver service that can be justifiably trusted.

2
Q

What are the properties of dependability?

A

Availability, Reliability, Safety, and Maintainability.

3
Q

What does availability mean in the context of distributed systems?

A

Availability is the readiness of a system for usage when needed.

4
Q

What does reliability mean in distributed systems?

A

Reliability is the continuity of service delivery without interruptions.

5
Q

What does safety refer to in distributed systems?

A

Safety refers to the low probability of catastrophic failures in the system.

6
Q

What is maintainability in distributed systems?

A

Maintainability is the ease with which a failed system can be repaired and restored to service.

7
Q

What is a failure in distributed systems?

A

A failure occurs when a component does not meet its specified behavior.

8
Q

What is an error in distributed systems?

A

An error is a part of a component’s state that can lead to a failure.

9
Q

What is a fault in distributed systems?

A

A fault is the cause of an error within the system.

10
Q

What is fault prevention?

A

Fault prevention involves techniques to prevent faults from occurring in the system.

11
Q

What is fault tolerance?

A

Fault tolerance is designing a system to meet specifications even in the presence of faults.

12
Q

What is fault removal?

A

Fault removal involves reducing the presence, number, or severity of faults in the system.

13
Q

What is fault forecasting?

A

Fault forecasting estimates the current number, future incidence, and consequences of faults.

14
Q

What is process resilience?

A

Process resilience involves protecting against faulty processes by replicating and distributing computations in a group.

15
Q

What are flat groups in process resilience?

A

Flat groups allow immediate information exchange with all members, enhancing fault tolerance but increasing overhead.

16
Q

What are hierarchical groups in process resilience?

A

Hierarchical groups have communication through a single coordinator, which is easier to implement but less fault-tolerant.

17
Q

What is k-fault tolerance?

A

A group is k-fault tolerant if it can mask any k concurrent member failures.

18
Q

How many members are needed for k-fault tolerance under crash semantics?

A

k + 1 members are needed to survive k member failures.

19
Q

How many members are needed for k-fault tolerance under arbitrary failure semantics?

A

2k + 1 members are needed if group output is defined by voting.
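
The voting rule can be illustrated with a minimal sketch. It assumes that all correct members produce the same output and that at most k members are faulty; the function name `vote` is hypothetical and only serves this example.

```python
from collections import Counter

def vote(outputs, k):
    """Return the value reported by at least k + 1 members, or None if no value qualifies."""
    if len(outputs) < 2 * k + 1:
        raise ValueError("need at least 2k + 1 members to mask k arbitrary failures")
    value, count = Counter(outputs).most_common(1)[0]
    # At most k members are faulty, so k + 1 matching outputs include a correct one.
    return value if count >= k + 1 else None

# Three members, one of which reports a wrong value (k = 1).
result = vote([42, 42, 7], k=1)
print(result)
```

With fewer than 2k + 1 members, the k faulty ones could outnumber or tie the correct ones, so the sketch refuses to vote.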

20
Q

What is Byzantine failure?

A

Byzantine failure refers to arbitrary failures where components may act maliciously or unpredictably.

21
Q

How many members are needed to handle Byzantine failures?

A

3k + 1 members are needed to tolerate k Byzantine failures.
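
As a quick reference, the group sizes from the cards on crash, arbitrary, and Byzantine failures can be collected in one small helper. The function name and the semantics labels are assumptions made for this sketch.

```python
def min_group_size(k, semantics):
    """Smallest group that tolerates k faulty members under the given failure semantics."""
    sizes = {
        "crash": k + 1,          # one surviving member is enough
        "arbitrary": 2 * k + 1,  # correct members must outvote the faulty ones
        "byzantine": 3 * k + 1,  # agreement despite k Byzantine members
    }
    return sizes[semantics]

for semantics in ("crash", "arbitrary", "byzantine"):
    print(semantics, min_group_size(2, semantics))
```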

22
Q

What is failure detection in distributed systems?

A

Failure detection involves identifying failed components using mechanisms like timeouts.

23
Q

Why is setting timeouts challenging in failure detection?

A

Timeouts are difficult to set correctly and depend on application-specific requirements.
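
A minimal sketch of a timeout-based detector shows why the timeout value matters. The class `HeartbeatDetector` and its methods are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

strict = HeartbeatDetector(timeout=1.0)
lenient = HeartbeatDetector(timeout=5.0)
for detector in (strict, lenient):
    detector.heartbeat("a", now=0.0)
    detector.heartbeat("b", now=2.0)

# Node "a" may only be slow: the strict detector suspects it, the lenient one does not.
print(strict.suspected(now=2.5), lenient.suspected(now=2.5))
```

A short timeout reports slow nodes as failed, while a long one delays detection of real crashes, which is why the right value depends on the application.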

24
Q

What is gossiping in failure detection?

A

Gossiping is the proactive dissemination of failure detection information throughout the system.
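
One possible way to gossip failure information, sketched here as an assumption rather than a prescribed protocol, is for each node to keep a heartbeat counter per known node and merge views with its peers. The names `merge_views` and `stale_nodes` are hypothetical.

```python
def merge_views(local, remote):
    """Keep, for every known node, the highest heartbeat counter seen so far."""
    return {node: max(local.get(node, 0), remote.get(node, 0))
            for node in local.keys() | remote.keys()}

def stale_nodes(view, previous_view):
    """Nodes whose counter did not advance since the previous gossip round."""
    return sorted(n for n in view if view[n] <= previous_view.get(n, -1))

previous = {"a": 5, "b": 2, "c": 0}
merged = merge_views({"a": 5, "b": 2}, {"b": 3, "c": 1})
print(merged == {"a": 5, "b": 3, "c": 1}, stale_nodes(merged, previous))
```

A node whose counter stays stale over several rounds becomes a candidate for suspicion, and that suspicion spreads with the next exchange.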

25
What is reliable communication in distributed systems?
Reliable communication ensures messages are delivered correctly and in order despite failures.
26
What are common issues in reliable client-server communication?
Client cannot locate server, request loss, server crashes, response loss, and client crashes.
27
What is at-least-once semantics in reliable RPC?
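The server guarantees it will carry out an operation at least once, no matter what. Because a request may be retried after a crash or a lost reply, the operation may be executed more than once.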
28
What is at-most-once semantics in reliable RPC?
The server guarantees it will carry out an operation at most once, avoiding duplicate executions.
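
At-most-once behavior is often approximated by having the server remember which requests it has already executed. The sketch below assumes clients attach a unique request ID and that the reply cache is kept in memory, so it would not survive a server crash; `AtMostOnceServer` and `withdraw` are hypothetical names.

```python
class AtMostOnceServer:
    """Executes each request ID once and replays the cached reply for retransmissions."""

    def __init__(self, handler):
        self.handler = handler
        self.replies = {}    # request ID -> reply already sent
        self.executions = 0

    def handle(self, request_id, payload):
        if request_id in self.replies:
            return self.replies[request_id]  # duplicate: do not execute again
        self.executions += 1
        reply = self.handler(payload)
        self.replies[request_id] = reply
        return reply

balance = {"acct": 100}

def withdraw(amount):
    balance["acct"] -= amount
    return balance["acct"]

server = AtMostOnceServer(withdraw)
first = server.handle("req-1", 30)
retry = server.handle("req-1", 30)  # retransmission after a lost reply
print(first, retry, server.executions)
```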
29
What is an orphan computation?
An orphan computation occurs when a server continues processing after the client has crashed.
30
How are orphan computations handled?
Orphans are killed by the client upon recovery or removed after a timeout.
31
What is reliable multicasting?
Reliable multicasting ensures messages sent to a group are delivered to all intended recipients.
32
What is atomic multicast?
Atomic multicast ensures a message is delivered either to all nonfaulty members of a group or to none of them.
33
What is feedback suppression in scalable reliable multicasting?
Feedback suppression reduces redundant retransmission requests by suppressing duplicate feedback from receivers.
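
The suppression idea can be sketched as follows, assuming each receiver that misses a message waits a random backoff before multicasting a retransmission request (NACK) and cancels its own request if it hears one first. The function name, the timer values, and the fixed propagation `latency` are assumptions made for this example.

```python
def nacks_sent(timers, latency):
    """Return the receivers that actually send a NACK.

    timers: receiver name -> backoff delay before it would send its NACK.
    latency: time a multicast NACK needs to reach the other receivers.
    """
    if not timers:
        return []
    first = min(timers.values())
    # A receiver stays silent if the earliest NACK reaches it before its timer fires.
    return sorted(r for r, t in timers.items() if t == first or t - first < latency)

senders = nacks_sent({"r1": 0.8, "r2": 0.2, "r3": 0.25, "r4": 0.6}, latency=0.1)
print(senders)
```

Receivers whose timers fire within one propagation delay of the first NACK still send theirs, which is why suppression reduces duplicate feedback without eliminating it.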
34
What is hierarchical reliable multicasting?
A hierarchical structure aggregates feedback through intermediate nodes to improve scalability.
35
What is distributed commit?
Distributed commit ensures that all processes in a distributed transaction either commit or abort together.
36
What is the two-phase commit (2PC) protocol?
2PC is a protocol where the coordinator collects votes from participants to commit or abort a transaction.
37
What is the three-phase commit (3PC) protocol?
3PC adds an additional phase to 2PC to prevent participants from blocking indefinitely.
38
What are the phases of 2PC?
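A voting phase, in which the coordinator sends VOTE_REQUEST and each participant replies VOTE_COMMIT or VOTE_ABORT, and a decision phase, in which the coordinator sends GLOBAL_COMMIT or GLOBAL_ABORT and the participants act on it.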
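
The message flow of 2PC can be sketched in a few lines. This is a failure-free illustration only: it has no timeouts, logging, or recovery, and the `Participant` class and `two_phase_commit` function are hypothetical.

```python
class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def vote(self):
        # Reply to the coordinator's vote request.
        self.state = "READY" if self.can_commit else "ABORT"
        return "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"

    def decide(self, decision):
        # Act on the coordinator's global decision.
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"

def two_phase_commit(participants):
    # Phase 1: collect votes from all participants.
    votes = [p.vote() for p in participants]
    decision = "GLOBAL_COMMIT" if all(v == "VOTE_COMMIT" for v in votes) else "GLOBAL_ABORT"
    # Phase 2: broadcast the decision.
    for p in participants:
        p.decide(decision)
    return decision

ok = [Participant("p1", True), Participant("p2", True)]
mixed = [Participant("p1", True), Participant("p2", False)]
print(two_phase_commit(ok), two_phase_commit(mixed))
```

A single abort vote is enough to abort the whole transaction, which is what gives the protocol its all-or-nothing outcome.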
39
What are the phases of 3PC?
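A voting phase (VOTE_REQUEST and the participants' votes), a pre-commit phase (PREPARE_COMMIT, or GLOBAL_ABORT if any participant voted to abort), and a commit phase (GLOBAL_COMMIT).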
40
What happens if a participant crashes during 2PC?
The participant recovers its state from logs or queries other participants for the coordinator's decision.
41
What happens if the coordinator crashes during 2PC?
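A participant that has voted to commit can ask the other participants for the decision. If all of them are also waiting in the ready state, they remain blocked until the coordinator recovers and provides the decision.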
42
How does 3PC address coordinator failure issues?
3PC allows participants to proceed without blocking by adding a pre-commit phase.
43
What is forward error recovery?
Forward error recovery finds a new state from which the system can continue after a failure.
44
What is backward error recovery?
Backward error recovery brings the system back to a previous error-free state.
45
What is checkpointing in recovery?
Checkpointing saves the system state at intervals to enable recovery to a known good state.
46
What is message logging in recovery?
Message logging stores communication events to replay and recover system state after a failure.
47
What is a consistent recovery state?
A global state in which every message recorded as received is also recorded as sent.
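
The consistency condition can be checked mechanically. The sketch below assumes each process checkpoint records the IDs of the messages it has sent and received; the dictionary layout and the name `is_consistent` are hypothetical.

```python
def is_consistent(checkpoints):
    """True if every message recorded as received is also recorded as sent."""
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    return received <= sent

# m1 is recorded as sent and received; m2 is still in transit, which is allowed.
good = {
    "P1": {"sent": {"m1", "m2"}, "received": set()},
    "P2": {"sent": set(), "received": {"m1"}},
}
# P2 recorded the receipt of m3, but P1's checkpoint predates sending it.
bad = {
    "P1": {"sent": {"m1"}, "received": set()},
    "P2": {"sent": set(), "received": {"m1", "m3"}},
}
print(is_consistent(good), is_consistent(bad))
```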
48
What is a recovery line?
The most recent consistent global checkpoint across all processes in a distributed system.
49
What is a cascaded rollback?
A rollback that propagates through the system, potentially reverting to the initial state due to inconsistent checkpoints.
50
What is the domino effect in recovery?
A situation where checkpoints lead to cascading rollbacks to the system's start, complicating recovery.
51
What is the piecewise deterministic execution model?
A model where process execution is deterministic between nondeterministic events like message receipts.
52
Why is avoiding orphans important in message logging?
Orphans lead to inconsistent states that cannot be correctly replayed during recovery.
53
What is the role of nondeterministic events in message logging?
Recording nondeterministic events ensures deterministic replay during system recovery.
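
A minimal sketch of replay under the piecewise deterministic model: message receipts are the only nondeterministic events, so logging them is enough to rebuild the state from the last checkpoint. The `LoggedProcess` class is hypothetical, and the log is assumed to sit on stable storage that survives the crash.

```python
def apply(state, message):
    """Deterministic handler: the same state and message always give the same result."""
    return state + message

class LoggedProcess:
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []  # messages received since the last checkpoint

    def receive(self, message):
        self.log.append(message)  # record the nondeterministic event first
        self.state = apply(self.state, message)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()

    def recover(self):
        # Restart from the checkpoint and replay the logged messages in order.
        state = self.checkpoint
        for message in self.log:
            state = apply(state, message)
        self.state = state
        return state

p = LoggedProcess()
p.receive(5)
p.take_checkpoint()
p.receive(2)
p.receive(3)
before_crash = p.state
p.state = None  # simulate losing the volatile state in a crash
print(p.recover() == before_crash)
```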
54
How does reliable communication differ from process resilience?
Reliable communication ensures message delivery, while process resilience handles faulty processes through replication.
55
What is the difference between forward and backward recovery?
Forward recovery moves the system to a new correct state from which it can continue, while backward recovery reverts to a previous correct state.
56
What is the main challenge in setting timeouts for failure detection?
Differentiating between process and network failures makes setting appropriate timeouts difficult.
57
What is the advantage of hierarchical reliable multicasting?
It reduces feedback overhead and improves scalability by aggregating retransmission requests.
58
What is the purpose of the pre-commit phase in 3PC?
To ensure participants do not block indefinitely if the coordinator fails.
59
What happens during the vote-request phase in 2PC?
The coordinator asks participants to vote on whether to commit or abort the transaction.
60
How do participants respond in the prepare-commit phase of 3PC?
Participants acknowledge the PREPARE_COMMIT message to indicate they are ready to commit, then wait for the coordinator's GLOBAL_COMMIT.
61
What is the role of the coordinator in distributed commit protocols?
The coordinator manages the commit or abort decision process and ensures consistency.
62
How do checkpoints help in backward error recovery?
Checkpoints provide a reference point to revert to in case of system failures.
63
What is the significance of consistent cuts in recovery?
Consistent cuts ensure all processes have a coherent view of message exchanges for accurate recovery.
64
What is the importance of logging nondeterministic events?
Logging ensures the system can accurately replay and recover its state after a failure.
65
How does scalable reliable multicasting handle feedback suppression?
By allowing processes to suppress feedback if another process has already requested retransmission.
66
What is the consequence of incorrect checkpoint timing?
Incorrect timing can lead to cascaded rollbacks or the domino effect, complicating recovery.
67
How does 2PC handle participant failures during the ready state?
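A participant logs its vote before replying, so after recovery it knows it was in the ready state and obtains the decision from the coordinator or another participant. Once it learns the decision, it logs it so it can recover to the correct state.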
68
What is the essence of process resilience in distributed systems?
Ensuring system functionality despite faulty processes through replication and distributed computations.
69
Why is reliable communication critical in distributed systems?
It ensures data consistency and system coordination despite potential failures.
70
What is the role of temporary workspaces in distributed commit?
They allow simple recovery by storing intermediate results that can be committed or discarded.
71
What are the challenges of reliable multicasting in wide-area networks?
Ensuring message delivery and consistency across diverse and potentially unreliable network paths.
72
What is the difference between crash and arbitrary failure semantics?
Crash failures stop operations, while arbitrary failures may produce incorrect or unpredictable behavior.
73
How does the system ensure atomicity in distributed commit protocols?
By ensuring all participants agree to commit or abort the transaction together.
74
What is the impact of coordinator failure in 2PC?
It can block participants until the coordinator recovers and provides a decision.
75
How does 3PC improve over 2PC in handling failures?
By introducing a pre-commit phase that prevents participants from blocking indefinitely.
76
What is the role of gossiping in failure detection?
It proactively spreads failure information to ensure all nodes are aware of failures.
77
How does message logging contribute to fault tolerance?
It allows the system to replay messages and recover to a consistent state after a failure.
78
What is the importance of maintainability in dependability?
It determines how easily a system can be repaired and restored to service after a failure.
79
What is the purpose of framing in error detection?
Framing allows for detecting bit errors in transmitted packets.
80
What is the significance of idempotent operations in reliable RPC?
Idempotent operations can be safely retried without adverse effects, aiding in fault tolerance.
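
The difference shows up as soon as a request is retried. In this small illustration the two functions and the account dictionary are hypothetical.

```python
def set_balance(account, value):
    """Idempotent: repeating the call leaves the same result."""
    account["balance"] = value

def add_to_balance(account, delta):
    """Not idempotent: every repetition changes the result again."""
    account["balance"] += delta

a = {"balance": 100}
b = {"balance": 100}
for _ in range(2):  # the original request plus one retry
    set_balance(a, 150)
    add_to_balance(b, 50)
print(a["balance"], b["balance"])
```

The retried `set_balance` is harmless, while the retried `add_to_balance` applies the change twice, so it needs duplicate filtering on the server.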
81
What is the advantage of using hierarchical feedback channels in multicasting?
They aggregate feedback to reduce overhead and improve scalability.
82
How does the system handle lost server responses in reliable RPC?
By making operations idempotent or using retransmission strategies.
83
What is the role of the recovery line in consistent recovery?
It marks the latest point where the system state is consistent across all processes.
84
What is the significance of the domino effect in distributed recovery?
It illustrates how improper checkpointing can lead to extensive rollbacks, complicating recovery.
85
How does scalable reliable multicasting ensure efficiency?
By using feedback suppression and hierarchical structures to manage retransmissions.
86
What is the purpose of logging the coordinator's decision in 2PC?
To allow participants to recover to the correct state after a failure.
87
What is the consequence of a failed coordinator in distributed commit?
Participants may block until a new coordinator is elected or the original recovers.
88
What is the main challenge of achieving reliable communication in distributed systems?
Ensuring message delivery and order despite network failures and delays.
89
What is the role of temporary workspaces in fault-tolerant computing?
They store intermediate results, simplifying recovery and rollback processes.