Lesson 7: Fault Tolerance Flashcards

Question 1

Q

What is a fault-tolerant system?

Answer

A

A system where we can detect a fault, remove its effect, and proceed normally.

Question 2

Q

What is the rollback-recovery technique?

Answer

A

When a failure is detected, we roll back to a previous state (a consistent cut) that we know is correct, then continue from there.

Question 3

Q

What is the checkpointing mechanism?

Answer

A

Save the state of the process (or entire node) to persistent storage. If there is a failure, the checkpoint can be used to restore the system to the state it had when the checkpoint was taken.

+ restart is fast (load the saved state; nothing to replay)
- lots of I/O on each checkpoint (can be reduced by saving only the deltas since the previous checkpoint)
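
A minimal sketch of the save/restore cycle described above, assuming the process state fits in a JSON-serializable dict. The function names save_checkpoint and load_checkpoint and the file layout are illustrative, not part of the lesson. It writes to a temporary file first so that a crash during the write cannot damage the previous checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Persist `state` to `path`, replacing any older checkpoint atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to persistent storage
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Rebuild the state from the last checkpoint written to `path`."""
    with open(path) as f:
        return json.load(f)
```

On restart, a process would call load_checkpoint and resume from the returned state; any work done after the last save_checkpoint call is lost.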

Question 4

Q

What is the logging mechanism?

Answer

A

Log information about the operations performed. Record the original value (so we can UNDO) or the new value (so we can REDO).

+ less I/O to write to disk (only the changes, not the full state)
- recovery takes longer (the log must be replayed)
- regular operations may take longer (a read may have to search the log for the latest value)
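
A small in-memory sketch of UNDO and REDO logging, assuming a simple key-value store. The class LoggedStore, the helpers redo and undo, and the _MISSING sentinel are hypothetical names chosen for illustration. Each log entry keeps both the old and the new value so that either direction is possible.

```python
_MISSING = object()  # marks "key did not exist before this write"


class LoggedStore:
    def __init__(self):
        self.data = {}
        self.log = []  # entries are (key, old_value, new_value)

    def write(self, key, value):
        # Append the log entry before applying the change.
        self.log.append((key, self.data.get(key, _MISSING), value))
        self.data[key] = value


def redo(log):
    """Rebuild the state from scratch by re-applying every new value."""
    state = {}
    for key, _old, new in log:
        state[key] = new
    return state


def undo(state, log, n):
    """Return a copy of `state` with the last `n` logged writes reversed."""
    state = dict(state)
    start = max(0, len(log) - n)
    for key, old, _new in reversed(log[start:]):
        if old is _MISSING:
            state.pop(key, None)
        else:
            state[key] = old
    return state
```

Note how redo walks the entire log, which is why recovery time grows with log length.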

Question 5

Q

What is the checkpointing + logging mechanism?

Answer

A

Combines checkpointing and logging: take a checkpoint to move the recovery line to a more recent consistent cut, then log from that point on.

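+ limits the duration of recovery (only the log written since the checkpoint is replayed)
+ limits the space needed to store the log (entries older than the checkpoint can be discarded)
- must detect a stable consistent cut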
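
A sketch of how the two mechanisms fit together, assuming each log entry carries a monotonically increasing sequence number and the checkpoint records the last sequence number it covers. The names recover, truncate_log, and the "lsn" field are assumptions made for this example.

```python
def recover(checkpoint, log):
    """Start from the checkpoint, then redo only the entries logged after it.

    checkpoint: {"state": dict, "lsn": int}
    log: list of (lsn, key, value) in increasing lsn order
    """
    state = dict(checkpoint["state"])
    for lsn, key, value in log:
        if lsn > checkpoint["lsn"]:
            state[key] = value
    return state


def truncate_log(log, checkpoint_lsn):
    """Drop entries already covered by the checkpoint to bound log size."""
    return [entry for entry in log if entry[0] > checkpoint_lsn]
```

Recovery cost is bounded by the number of entries written since the last checkpoint, and truncate_log shows why the log no longer grows without limit.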

Question 6

Q

What is uncoordinated checkpointing?

Answer

A

Processes take checkpoints independently. On failure, we need to construct a consistent cut from whatever checkpoints exist.

Problems
- Domino effect: rollbacks can cascade all the way to the initial state, so all work could be lost
- Useless checkpoints: some checkpoints can never be part of a globally consistent state
- Multiple checkpoints per process: each process may need to keep more than its most recent checkpoint
- Garbage collection: needed to identify and discard obsolete checkpoints
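
The domino effect can be illustrated with a simplified rollback-propagation sketch. It assumes a single global timeline, that every process has an initial checkpoint at time 0, and that all processes restart from checkpoints. The function recovery_line and the data layout are hypothetical. A message is an orphan when its receive is recorded in the receiver's checkpoint but its send is not recorded in the sender's; each orphan forces the receiver further back.

```python
def recovery_line(checkpoints, messages):
    """Find the most recent consistent set of checkpoints.

    checkpoints: {process: sorted checkpoint times, starting with 0}
    messages: list of (sender, send_time, receiver, recv_time), recv_time > 0
    Returns {process: chosen checkpoint time}.
    """
    # Start optimistically from everyone's latest checkpoint.
    line = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            # Orphan: the send was undone but the receive is still recorded.
            if send_t > line[sender] and recv_t <= line[receiver]:
                # Roll the receiver back to before it received the message.
                line[receiver] = max(
                    t for t in checkpoints[receiver] if t < recv_t
                )
                changed = True
    return line


checkpoints = {"P": [0, 40, 80], "Q": [0, 60, 100]}
messages = [
    ("Q", 10, "P", 20),
    ("P", 45, "Q", 50),
    ("Q", 70, "P", 75),
    ("P", 85, "Q", 90),
]
line = recovery_line(checkpoints, messages)
```

In this example every checkpoint is crossed by a message, so each rollback triggers another and both processes end up at their initial state.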

Question 7

Q

What is coordinated checkpointing?

Answer

A

Processes coordinate their checkpoints so that together they form a consistent state.

Pros:
+ recovery no longer requires a dependency graph to calculate a recovery line; the latest checkpoint can be used
+ no domino effect; the coordination guarantees that the checkpoints taken are part of a consistent cut
+ single checkpoint per process
+ no garbage collection

Challenges
- how to coordinate?
- no guarantee of synchronized clocks
- is message delivery reliable and bounded in time?
- are all checkpoints needed?
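
One possible answer to "how to coordinate?" is a blocking two-phase protocol, sketched below as a single-process simulation. This is an illustration under strong assumptions, not the method the lesson prescribes: the coordinator's requests are reliable, and processes send no application messages between the two phases. The class Process, its can_checkpoint flag, and coordinated_checkpoint are hypothetical names.

```python
class Process:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint  # simulates a process that cannot ack
        self.tentative = None
        self.committed = None

    def take_tentative(self):
        """Phase 1: save a tentative checkpoint and acknowledge."""
        if not self.can_checkpoint:
            return False
        self.tentative = dict(self.state)
        return True

    def commit(self):
        """Phase 2 (success): the tentative checkpoint becomes permanent."""
        self.committed = self.tentative
        self.tentative = None

    def abort(self):
        """Phase 2 (failure): discard the tentative checkpoint."""
        self.tentative = None


def coordinated_checkpoint(processes):
    """Commit a new checkpoint everywhere, or nowhere."""
    acks = [p.take_tentative() for p in processes]
    if all(acks):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a round either commits on every process or on none, each process only ever needs its single latest committed checkpoint.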

Question 8

Q

What is communication-induced checkpointing?

Answer

A

We use the global snapshot algorithm, but rather than sending a separate marker message (which requires FIFO channels), we piggyback the marker on normal application messages.

Nodes that aren’t communicating with other nodes can take periodic snapshots.
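
A sketch of the piggybacking idea, assuming each node tracks the id of the latest snapshot it has joined and attaches that id to every outgoing message. The class Node, the snapshot_id field, and the message layout are illustrative assumptions. A receiver that sees a higher id than its own takes a forced checkpoint before delivering the message, which plays the role of receiving the marker.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.state = []            # payloads delivered so far
        self.snapshot_id = 0
        self.checkpoints = {0: []}

    def start_snapshot(self):
        """Take a local (for example, periodic) checkpoint and open a new snapshot."""
        self.snapshot_id += 1
        self.checkpoints[self.snapshot_id] = list(self.state)

    def send(self, payload):
        # The "marker" rides along on the normal message.
        return {"snapshot_id": self.snapshot_id, "payload": payload}

    def receive(self, message):
        if message["snapshot_id"] > self.snapshot_id:
            # Forced checkpoint: record state BEFORE delivering the message,
            # so the checkpoint never contains a message sent after the
            # sender's checkpoint.
            self.snapshot_id = message["snapshot_id"]
            self.checkpoints[self.snapshot_id] = list(self.state)
        self.state.append(message["payload"])


a, b = Node("a"), Node("b")
b.receive(a.send("m1"))   # same snapshot id: delivered, no checkpoint
a.start_snapshot()
b.receive(a.send("m2"))   # higher id: b checkpoints first, then delivers
```

After the second receive, b's checkpoint for snapshot 1 contains m1 but not m2, matching a's checkpoint, which was taken before m2 was sent.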