Lesson 7: Fault Tolerance Flashcards
What is a fault-tolerant system?
A system where we can detect a fault, remove its effect, and proceed normally.
What is the rollback-recovery technique?
When a failure is detected, we roll back to a previous state (a consistent cut) that we know is correct, then continue from there.
What is the checkpointing mechanism?
Save the state of the process (or entire node) to persistent storage. If there is a failure, the checkpoint can be used to rebuild the state of the system before the failure.
+ restart is fast: just reload the saved state
- lots of I/O on checkpoint (can be improved by only saving the deltas)
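A minimal sketch of the idea, assuming the whole process state fits in one picklable Python object. The function names save_checkpoint and load_checkpoint are hypothetical, chosen only for illustration. The state is written to a temporary file and then renamed, so a crash in the middle of a checkpoint does not destroy the previous one.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write `state` to a temp file, then atomically rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to persistent storage
    # os.replace is atomic: readers see either the old or the new checkpoint
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Rebuild the state saved by the most recent successful checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Every call rewrites the full state, which is the I/O cost noted above; saving only deltas would need extra bookkeeping that this sketch leaves out.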
What is the logging mechanism?
Log information about the operations performed: record the original value (so we can UNDO) or the new value (so we can REDO).
+ smaller amount of I/O to write to disk
- recovery takes longer
- regular operations may take longer (a read may have to search the log for the latest value)
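The UNDO and REDO ideas can be sketched with a tiny in-memory key-value store. This is an illustration only: LoggedStore and redo are hypothetical names, the log lives in a Python list rather than on disk, and the sketch assumes None is never stored as a real value.

```python
class LoggedStore:
    def __init__(self):
        self.data = {}
        self.log = []  # entries: (key, old_value, new_value)

    def write(self, key, value):
        # Log first, then apply. Keeping both values allows UNDO and REDO.
        self.log.append((key, self.data.get(key), value))
        self.data[key] = value

    def undo_last(self):
        """UNDO: restore the original value recorded in the last log entry."""
        key, old, _new = self.log.pop()
        if old is None:  # assumption: None means "key did not exist"
            self.data.pop(key, None)
        else:
            self.data[key] = old

def redo(log):
    """REDO: rebuild the state from scratch by replaying the new values."""
    data = {}
    for key, _old, new in log:
        data[key] = new
    return data
```

Each write appends only one small entry, but redo has to walk the whole log, which is why recovery takes longer than reloading a checkpoint.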
What is the checkpointing + logging mechanism?
Combines both mechanisms: take a checkpoint to move the recovery line to a more recent consistent cut, then log from that point on.
+ limit duration of recovery
+ limit space needed to store log
- must detect stable consistent cut
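A rough sketch of how the two mechanisms fit together on a single node. The Node class is hypothetical, and plain attributes stand in for persistent storage. Taking a checkpoint lets the log be truncated, and recovery replays only the entries written after the checkpoint.

```python
import copy

class Node:
    def __init__(self):
        self.state = {}
        self.checkpoint = {}  # stand-in for a checkpoint on persistent storage
        self.log = []         # REDO entries written since the last checkpoint

    def write(self, key, value):
        self.log.append((key, value))
        self.state[key] = value

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()  # older entries are now covered by the checkpoint

    def recover(self):
        """Reload the checkpoint, then replay the short log on top of it."""
        state = copy.deepcopy(self.checkpoint)
        for key, value in self.log:
            state[key] = value
        self.state = state
        return state
```

This single-node version sidesteps the hard part listed above: in a distributed system the checkpoint must belong to a stable consistent cut before the log can safely be discarded.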
What is uncoordinated checkpointing?
Processes take checkpoints independently. On failure, we need to construct a consistent cut.
Problems:
- Domino effect: could lose all your work
- Useless checkpoints: some checkpoints may never be part of any globally consistent state
- Multiple checkpoints per process: each process may need to keep more than its most recent checkpoint
- Garbage collection: needed to identify obsolete checkpoints
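The domino effect can be reproduced with a small rollback computation. This is a sketch under an assumed model, not a standard API: checkpoint 0 is each process's initial state, "interval i" means the time after checkpoint i and before checkpoint i+1, and each message is a hypothetical tuple (sender, send_interval, receiver, receive_interval). A set of checkpoints is inconsistent if some checkpoint records a receive whose send is not recorded by the sender's checkpoint.

```python
def recovery_line(latest, messages):
    """Roll checkpoints back until no checkpoint records an orphan message.

    latest:   {process: index of its most recent checkpoint}
    messages: iterable of (sender, send_interval, receiver, recv_interval)
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, s_int, receiver, r_int in messages:
            sent_recorded = s_int < line[sender]
            recv_recorded = r_int < line[receiver]
            if recv_recorded and not sent_recorded:
                # Orphan message: roll the receiver back to before the receive.
                line[receiver] = r_int
                changed = True
    return line

# Two processes exchanging messages between every pair of checkpoints.
messages = [
    ("P", 0, "Q", 0),
    ("Q", 1, "P", 0),
    ("P", 1, "Q", 1),
    ("Q", 2, "P", 1),
]
print(recovery_line({"P": 2, "Q": 2}, messages))
```

With this message pattern each rollback invalidates a checkpoint on the other process, so both end up at checkpoint 0 and all work is lost. Checkpoints 1 and 2 on both processes were useless in this run.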
What is coordinated checkpointing?
Processes coordinate their checkpoints so that, taken together, they form a consistent state.
Pros:
+ recovery no longer requires a dependency graph to calculate a recovery line; the latest checkpoint can be used
+ no domino effect: the coordination guarantees that the checkpoints taken are part of a consistent cut
+ single checkpoint per process
+ no garbage collection
Challenges:
- how to coordinate?
- no guarantee of synchronized clocks
- is message delivery reliable and in bounded time?
- are all checkpoints needed?
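One possible answer to "how to coordinate?" is a blocking two-phase protocol, sketched below as a single-threaded simulation. The Process class and coordinated_checkpoint function are hypothetical, and the sketch assumes processes send no application messages between the prepare and commit steps, which is what makes the set of checkpoints consistent.

```python
import copy

class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.tentative = None
        self.permanent = copy.deepcopy(state)  # last committed checkpoint

    def prepare(self):
        """Phase 1: take a tentative checkpoint and acknowledge."""
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        """Phase 2: the tentative checkpoint replaces the old one."""
        self.permanent, self.tentative = self.tentative, None

    def abort(self):
        self.tentative = None

def coordinated_checkpoint(processes):
    """Coordinator: commit only if every process acknowledged phase 1."""
    acks = [p.prepare() for p in processes]
    if all(acks):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because each process keeps exactly one committed checkpoint, there is nothing to garbage-collect, and recovery simply reloads permanent on every process.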
What is communication-induced checkpointing?
We use the global snapshot algorithm, but rather than sending a separate marker message (which requires FIFO channels), we piggyback the marker on normal application messages.
Nodes that aren’t communicating with other nodes can take periodic snapshots.
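A minimal sketch of the piggybacking idea, assuming the piggybacked marker is a checkpoint number (called epoch here). The CicProcess class and the message layout are hypothetical. A receiver that sees a higher epoch than its own takes a forced checkpoint before delivering the message, so the receive is never recorded in a checkpoint older than the send.

```python
import copy

class CicProcess:
    def __init__(self, name):
        self.name = name
        self.state = []             # delivered payloads
        self.epoch = 0              # number of the latest checkpoint
        self.checkpoints = {0: []}  # epoch -> saved state

    def take_checkpoint(self, epoch=None):
        """Periodic checkpoint (epoch=None) or forced checkpoint (epoch given)."""
        self.epoch = self.epoch + 1 if epoch is None else epoch
        self.checkpoints[self.epoch] = copy.deepcopy(self.state)

    def send(self, payload):
        # Piggyback the marker (current epoch) on the application message.
        return {"epoch": self.epoch, "payload": payload}

    def receive(self, message):
        if message["epoch"] > self.epoch:
            # Forced checkpoint, taken before the message is delivered.
            self.take_checkpoint(message["epoch"])
        self.state.append(message["payload"])
```

A process that never receives messages gets no forced checkpoints, which is why idle nodes still need to call take_checkpoint() periodically on their own.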