Software-Based Resilience Flashcards
At the SW level, what can a fault cause?
Alter the control flow of a program.
Modify the data and the output of a program.
What is the motivation for software-based resilience?
Many faults affect the execution of the software and, thus, can be detected at the software level.
What is the control flow graph of a program?
A representation of all possible execution paths of a program. The nodes represent basic blocks, and the edges represent the possible transitions (control transfers) between blocks.
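A small sketch of the idea, assuming a hypothetical if/else program whose blocks are named B0 to B3: the CFG is stored as a successor list, and every execution path from entry to exit is enumerated. The names `CFG` and `all_paths` are illustrative only.

```python
from typing import Dict, List

# Hypothetical CFG of an if/else: B0 ends with a conditional branch to B1 or B2,
# and both B1 and B2 fall through to the exit block B3.
CFG: Dict[str, List[str]] = {
    "B0": ["B1", "B2"],
    "B1": ["B3"],
    "B2": ["B3"],
    "B3": [],
}


def all_paths(cfg: Dict[str, List[str]], node: str, exit_node: str) -> List[List[str]]:
    """Enumerate every execution path from `node` to `exit_node` (acyclic CFGs only)."""
    if node == exit_node:
        return [[node]]
    paths: List[List[str]] = []
    for succ in cfg[node]:
        for rest in all_paths(cfg, succ, exit_node):
            paths.append([node] + rest)
    return paths


print(all_paths(CFG, "B0", "B3"))
```

Running it prints the two possible executions, [['B0', 'B1', 'B3'], ['B0', 'B2', 'B3']]. Loops would need cycle handling, which this sketch leaves out.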
What is a basic block?
A sequence of instructions with no branch instructions (except possibly the last one in the block) and no branch targets (except possibly the first instruction).
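The definition leads directly to the usual "leader" rule for splitting code into basic blocks: the first instruction, every branch target, and every instruction right after a branch start a new block. The sketch below assumes a hypothetical toy instruction format of (label, opcode, jump target) with branch opcodes `jmp` and `beq`.

```python
from typing import List, Optional, Tuple

# Toy instruction: (label or None, opcode, jump target label or None).
# This ISA is hypothetical and only illustrates basic-block splitting.
Instr = Tuple[Optional[str], str, Optional[str]]

BRANCHES = {"jmp", "beq"}  # "beq" is a conditional branch that falls through if not taken


def basic_blocks(prog: List[Instr]) -> List[List[Instr]]:
    """Split a program into basic blocks using the leader rule."""
    targets = {t for _, op, t in prog if op in BRANCHES}
    leaders = {0}                    # the first instruction is a leader
    for i, (label, op, _) in enumerate(prog):
        if label in targets:
            leaders.add(i)           # a branch target may only start a block
        if op in BRANCHES and i + 1 < len(prog):
            leaders.add(i + 1)       # a branch may only end a block
    starts = sorted(leaders)
    ends = starts[1:] + [len(prog)]
    return [prog[s:e] for s, e in zip(starts, ends)]


prog: List[Instr] = [
    (None, "load", None),
    (None, "beq", "L1"),
    (None, "add", None),
    (None, "jmp", "L2"),
    ("L1", "sub", None),
    ("L2", "store", None),
]

for block in basic_blocks(prog):
    print([op for _, op, _ in block])
```

This prints the four blocks ['load', 'beq'], ['add', 'jmp'], ['sub'] and ['store']: each branch ends a block and each labeled target starts one.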
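When and where should the SW state be saved in checkpointing?
The checkpoint interval is a trade-off between checkpoint overhead (frequent checkpoints cost time) and recovery latency (infrequent checkpoints mean more work has to be redone after a rollback).
The storage holding the checkpoints must be sufficiently protected against faults so that recovery is likely to succeed.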
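To make the interval trade-off concrete, here is a minimal simulation, assuming a toy computation (summing 0..n-1), transient faults that are detected at hypothetical step indices, and a checkpoint variable that stands in for fault-protected storage. The function name and parameters are illustrative, not taken from any specific system.

```python
import copy
from typing import Dict, Set, Tuple


def run_with_checkpoints(n_steps: int, interval: int,
                         faults: Set[int]) -> Tuple[Dict[str, int], int, int]:
    """Sum 0..n_steps-1 step by step, checkpointing every `interval` steps.

    `faults` holds hypothetical step indices at which a transient fault is
    detected (each fires once). On detection, the state is rolled back to the
    last checkpoint and the lost steps are re-executed.
    Returns (final state, checkpoints taken, steps re-executed).
    """
    state = {"i": 0, "acc": 0}
    checkpoint = copy.deepcopy(state)   # assumed to sit in fault-protected storage
    n_ckpt, redone = 1, 0
    pending = set(faults)
    while state["i"] < n_steps:
        if state["i"] in pending:       # fault detected at this step
            pending.discard(state["i"])
            redone += state["i"] - checkpoint["i"]   # work lost since the checkpoint
            state = copy.deepcopy(checkpoint)         # rollback
            continue
        state["acc"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            checkpoint = copy.deepcopy(state)
            n_ckpt += 1
    return state, n_ckpt, redone


print(run_with_checkpoints(100, 10, {45}))
print(run_with_checkpoints(100, 50, {45}))
```

Both runs end with the correct sum (4950), but an interval of 10 takes 11 checkpoints and redoes 5 steps, while an interval of 50 takes only 3 checkpoints and redoes 45 steps.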
• EDDI (Error Detection by Duplicated Instructions)
• CFCSS (Control Flow Checking by Software Signatures)
• ED4I (Error Detection by Diverse Data and Duplicated Instructions)
are examples of…
Software-implemented hardware fault tolerance (SIHFT).
They are purely software-based approaches.
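What is the idea behind EDDI?
EDDI (Error Detection by Duplicated Instructions): duplicate every instruction, using different registers and variables for the copies. The results of the original and duplicated instructions are compared; if they differ, the program jumps to an error-handler routine.
What can be used to reduce EDDI's overhead?
Instruction-level parallelism: the original and duplicated instructions are independent, so the processor can execute them in parallel.
Overhead: between 13% and 111%.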
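The duplicate-and-compare idea can be mimicked at source level. Real EDDI transforms machine instructions at compile time; this Python sketch only imitates it, assuming a hypothetical computation a*a + b*b, shadow variables `s1..s3` standing in for the "different registers", an exception standing in for the jump to the error handler, and an optional `flip_bit` parameter that injects a transient fault into the master copy.

```python
from typing import Dict, Optional


class EDDIError(Exception):
    """Stands in for EDDI's jump to the error-handler routine."""


def eddi_sum_of_squares(a: int, b: int, memory: Dict[str, int],
                        flip_bit: Optional[int] = None) -> None:
    """Compute a*a + b*b with every operation duplicated on shadow variables."""
    r1, s1 = a, a              # master copy in r*, duplicate in s*
    r2, s2 = b, b
    r1 = r1 * r1
    s1 = s1 * s1               # duplicated instruction
    r2 = r2 * r2
    s2 = s2 * s2               # duplicated instruction
    r3 = r1 + r2
    s3 = s1 + s2               # duplicated instruction
    if flip_bit is not None:
        r3 ^= 1 << flip_bit    # hypothetical transient fault hitting only the master copy
    if r3 != s3:               # compare before the store
        raise EDDIError("master and shadow results differ")
    memory["result"] = r3      # only checked values are written to memory
```

With a fault-free call, `eddi_sum_of_squares(3, 4, mem)` stores 25 in `mem["result"]`; with `flip_bit=2` the comparison fails, `EDDIError` is raised, and nothing is written to memory.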
What is a Storeless Basic Block (SBB) in EDDI?
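A sequence of instructions that contains no stores, no branches and no branch targets.
SBBs determine where the checking (comparison) instructions go: a check is placed at the end of each SBB, i.e. just before a store or branch.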
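The compile-time side of this can be sketched as a transformation over a list of opcodes, assuming hypothetical opcode names: each computational instruction gets a shadow copy (marked with `'`), and a `cmp` is emitted right before every store or branch, which is exactly where an SBB ends.

```python
from typing import List

CHECKED_OPS = {"store", "jmp", "beq"}  # hypothetical opcodes that end an SBB


def insert_checks(ops: List[str]) -> List[str]:
    """Duplicate computational ops and insert a compare before every store/branch."""
    out: List[str] = []
    for op in ops:
        if op in CHECKED_OPS:
            out.append("cmp")      # end of an SBB: verify master == shadow values
            out.append(op)         # stores and branches are not duplicated
        else:
            out.append(op)
            out.append(op + "'")   # shadow copy using different registers
    return out


print(insert_checks(["load", "add", "store", "mul", "beq"]))
```

This prints ['load', "load'", 'add', "add'", 'cmp', 'store', 'mul', "mul'", 'cmp', 'beq']. Checking only at SBB boundaries, instead of after every instruction, keeps the number of comparisons low while still verifying every value before it leaves the registers.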
CFCSS:
Control Flow Checking by Software Signatures
A signature is assigned to each basic block and to each transition.
The run-time signature G is kept in a general-purpose register. Checking instructions placed at the beginning of each basic block update G and compare it with that block's signature; a mismatch indicates an illegal control-flow transition.
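A minimal sketch of the basic CFCSS check, assuming a hypothetical CFG in which every block has exactly one predecessor (blocks with several predecessors need an extra run-time adjusting signature, which is omitted here). Each block B has a signature SIG[B] and a compile-time difference DIFF[B] = SIG[pred] XOR SIG[B]; at the start of B, G is XORed with DIFF[B] and must then equal SIG[B].

```python
from typing import Dict, List

# Hypothetical CFG: B0 -> B1, B0 -> B2, B1 -> B3 (one predecessor per block).
PRED: Dict[str, str] = {"B1": "B0", "B2": "B0", "B3": "B1"}
SIG: Dict[str, int] = {"B0": 0b0001, "B1": 0b0010, "B2": 0b0100, "B3": 0b1000}
# Compile-time signature difference for every non-entry block.
DIFF: Dict[str, int] = {b: SIG[p] ^ SIG[b] for b, p in PRED.items()}


def run_path(path: List[str]) -> bool:
    """Return True if every transition in `path` passes the CFCSS check.

    `path` is assumed to start at the entry block B0 and not re-enter it.
    """
    g = SIG[path[0]]          # run-time signature register G, set in the entry block
    for block in path[1:]:
        g ^= DIFF[block]      # checking code at the beginning of `block`
        if g != SIG[block]:
            return False      # would branch to the error handler
    return True


print(run_path(["B0", "B1", "B3"]))  # legal path
print(run_path(["B0", "B3"]))        # illegal jump from B0 to B3
```

The legal path prints True; the illegal jump prints False, because 0b0001 XOR DIFF["B3"] gives 0b1011 instead of SIG["B3"] = 0b1000.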
ED4I
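Error Detection by Diverse Data and Duplicated Instructions.
Motivation: EDDI cannot detect permanent faults, because both copies run on the same faulty hardware and can produce the same wrong result.
A diverse copy of the program is created in which all constants and variables are multiplied by a diversity factor k. k is chosen so that a fault is likely to affect the original and the transformed program differently. In a fault-free run the transformed result equals k times the original result, so the two outputs can be compared.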
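The following sketch contrasts plain duplication with ED4I under a permanent fault. It assumes a hypothetical program y = x1 + x2 + 3 that uses only additions (so scaling every input and constant by k scales the result by k), a simulated adder whose output bit 1 is stuck at 1, and k = -2 as an illustrative diversity factor. All function names are made up for the example.

```python
from typing import Callable

Adder = Callable[[int, int], int]


def good_add(a: int, b: int) -> int:
    return a + b


def stuck_at_1_add(a: int, b: int, bit: int = 1) -> int:
    """Hypothetical permanent fault: output bit `bit` of the adder is stuck at 1."""
    return (a + b) | (1 << bit)


def program(x1: int, x2: int, c: int, add: Adder) -> int:
    """y = x1 + x2 + c, with every addition done by the (possibly faulty) adder."""
    return add(add(x1, x2), c)


def duplicates_agree(x1: int, x2: int, add: Adder) -> bool:
    """Plain duplication: run the same program twice on the same hardware."""
    return program(x1, x2, 3, add) == program(x1, x2, 3, add)


def ed4i_agrees(x1: int, x2: int, add: Adder, k: int = -2) -> bool:
    """ED4I: run a copy with all data and constants scaled by k, expect k * y."""
    y = program(x1, x2, 3, add)
    y_k = program(k * x1, k * x2, k * 3, add)
    return y_k == k * y


print(ed4i_agrees(4, 8, good_add))             # fault-free: results agree
print(duplicates_agree(4, 8, stuck_at_1_add))  # permanent fault not detected
print(ed4i_agrees(4, 8, stuck_at_1_add))       # permanent fault detected
```

This prints True, True, False: the duplicated runs hit the stuck bit identically (both return 19), whereas the scaled copy exercises different bit patterns, returns -26 instead of the expected -38, and exposes the fault.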