Architectural styles for Fault Tolerance Flashcards
What is meant by system fault tolerance?
Meaning: Fault tolerance means that the system can continue operating in spite of a software fault, i.e. the fault does not lead to a system failure.
Note: Fault tolerance is required when there are high availability requirements, no ‘fail safe’ state or where system failure costs are very high.
Note: This is important even if the system has been proven to conform to its specification as there may be specification errors or the validation may be incorrect/incomplete.
What is meant by Dependable system architecture?
Meaning: A method for integrating fault tolerance techniques into a system to make it dependable.
Note: Needed when fault tolerance is essential. Generally based on redundancy and diversity.
Examples of situations where dependable architectures are used:
- Flight control systems, where system failure could threaten the safety of passengers
- Reactor systems where failure of a control system could lead to a chemical or nuclear emergency
- Telecommunication systems, where there is a need for 24/7 availability.
What is meant by a protection system?
Meaning: A specialised system that is associated with some other control system, which can take emergency action if a failure occurs.
Note: Protection systems independently monitor the controlled system and the environment.
Note: If a problem is detected, the protection system issues commands to take emergency action to shut down the system and avoid a catastrophe.
Example:
- System to stop a train if it passes a red light.
- System to shut down a reactor if temperature/pressure is too high.
Describe the functionality of protection systems.
Protection systems:
- Are redundant because they include monitoring and control capabilities that replicate those in the control software.
- Are diverse and use different technology from the control software.
- Are simpler than the control system so more effort can be expended in validation and dependability assurance.
Note: The aim is to ensure that the protection system has a low probability of failure on demand.
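The reactor example can be sketched in a few lines of Python. This is a minimal illustration only, not a real safety design: the class names, units and threshold values are hypothetical. It shows the protection system checking readings on its own, using logic much simpler than a control system would need, and latching an emergency shutdown once a limit is exceeded.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sample of the controlled process (hypothetical units)."""
    temperature_c: float
    pressure_kpa: float


class ProtectionSystem:
    """Deliberately simple monitor, independent of the control software."""

    def __init__(self, max_temperature_c: float, max_pressure_kpa: float):
        self.max_temperature_c = max_temperature_c
        self.max_pressure_kpa = max_pressure_kpa
        self.shutdown_issued = False

    def check(self, reading: Reading) -> bool:
        """Return True if emergency shutdown is (or already was) commanded."""
        unsafe = (
            reading.temperature_c > self.max_temperature_c
            or reading.pressure_kpa > self.max_pressure_kpa
        )
        if unsafe:
            # Latch the shutdown: once tripped it stays tripped.
            # Resetting is not modelled in this sketch.
            self.shutdown_issued = True
        return self.shutdown_issued


# Hypothetical limits and readings, chosen only for illustration.
protection = ProtectionSystem(max_temperature_c=350.0, max_pressure_kpa=15000.0)
readings = [
    Reading(300.0, 12000.0),  # within limits
    Reading(360.0, 12000.0),  # temperature too high: trips the shutdown
    Reading(310.0, 12000.0),  # back within limits, but shutdown stays latched
]
decisions = [protection.check(r) for r in readings]
print(decisions)  # [False, True, True]
```

Keeping the check this small is the point of the third property above: there is little code to validate, so more assurance effort can go into each line.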
What is meant by Self-monitoring architectures?
Meaning: Multi-channel architectures where the system monitors its own operations and takes action if inconsistencies are detected.
Note: The same computation is carried out on each channel and the results are compared. If the results are identical and are produced at the same time, then it is assumed that the system is operating correctly.
Note: If the results are different, then a failure is assumed and a failure exception is raised.
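The compare-and-raise behaviour can be sketched as follows. This is an illustrative assumption, not how any real system is built: the two channels are plain Python functions computing the sum 1..n with different algorithms, `faulty_channel_b` is a simulated fault, and all names are hypothetical. Timing ("produced at the same time") is not modelled.

```python
from typing import Callable


class ChannelMismatchError(Exception):
    """Raised when the two channels disagree, so a failure is assumed."""


def channel_a(n: int) -> int:
    # Channel A: iterative summation of 1..n.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def channel_b(n: int) -> int:
    # Channel B: closed-form formula, a deliberately different algorithm.
    return n * (n + 1) // 2


def faulty_channel_b(n: int) -> int:
    # Simulated fault (wrong sign in the formula), used only to show detection.
    return n * (n - 1) // 2


def self_checked(
    n: int,
    first: Callable[[int], int],
    second: Callable[[int], int],
) -> int:
    """Run the same computation on both channels and compare the results."""
    a, b = first(n), second(n)
    if a != b:
        raise ChannelMismatchError(f"channels disagree: {a} != {b}")
    return a


print(self_checked(10, channel_a, channel_b))  # 55

try:
    self_checked(10, channel_a, faulty_channel_b)
except ChannelMismatchError as error:
    print("failure detected:", error)
```

Note that the comparison only detects the failure; it cannot tell which channel is wrong, which is why a failure exception is raised rather than a result chosen.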
Describe the functionality of Self-monitoring systems.
Self-monitoring systems have the following requirements:
- Hardware in each channel has to be diverse, so that a common-mode hardware failure does not lead to each channel producing the same (incorrect) results.
- Software in each channel must also be diverse, otherwise the same software error would affect each channel.
Note: If high availability is required, you may use several self-checking systems in parallel.
Key Note: This is the approach used in the Airbus family of aircraft for their flight control systems.
What is meant by N-version programming?
Meaning: A method in which multiple versions of a software system carry out computations at the same time.
Note: There should be an odd number of computers involved, typically 3.
Note: The results are compared using a voting system and the majority result is taken to be the correct one.
Note: Approach derived from the notion of triple-modular redundancy, as used in hardware systems.
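The voting step can be sketched as below. This is a minimal illustration under stated assumptions: the three "versions" are hypothetical stand-ins for independently developed implementations of an absolute-value function, `version_3` contains a simulated fault, and the voter requires a strict majority.

```python
from collections import Counter
from typing import Callable, Sequence


class NoMajorityError(Exception):
    """Raised when no result is returned by a strict majority of versions."""


def vote(results: Sequence[int]) -> int:
    """Return the result produced by more than half of the versions."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise NoMajorityError(f"no majority among {list(results)}")
    return value


# Three hypothetical versions of "absolute value".
def version_1(x: int) -> int:
    return abs(x)


def version_2(x: int) -> int:
    return x if x >= 0 else -x


def version_3(x: int) -> int:
    # Simulated fault: this version forgets to negate negative inputs.
    return x


def n_version_abs(x: int, versions: Sequence[Callable[[int], int]]) -> int:
    """Run every version on the same input and return the majority result."""
    return vote([version(x) for version in versions])


versions = [version_1, version_2, version_3]
print(n_version_abs(-7, versions))  # 7: the faulty version is outvoted
```

With three versions, a single faulty version is outvoted by the other two. If two versions fail in the same way, the voter returns their wrong answer, which is why independence between versions matters.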
How is the N-version programming process done in practice?
The different system versions are designed and implemented by different teams. It is assumed that there is a low probability that they will make the same mistakes. The algorithms used should be different, but in practice they may not be.
Note: There is some evidence that teams commonly misinterpret specifications in the same way and choose the same algorithms in their systems.
The key is to ensure that specs are in fact what you mean them to be!
Note on Software diversity
Note: Approaches to software fault tolerance embedded in the system architecture depend on software diversity, where it is assumed that different implementations of the same software specification will fail in different ways.
Note: It is assumed that implementations are (a) independent and (b) do not include common errors.
Note: Strategies to achieve diversity:
* Different programming languages
* Different design methods and tools
* Explicit specification of different algorithms
What are the problems with design diversity?
- Teams are not diverse in their thinking, so they tend to tackle problems in the same way.
- Characteristic errors:
  - Different teams make the same mistakes. Some parts of an implementation are more difficult than others, so all teams tend to make mistakes in the same place.
- Specification errors:
  - If there is an error in the specification, then this is reflected in all implementations.
  - This can be addressed to some extent by using multiple specification representations.
Note on Specification dependency
- Both approaches to software redundancy are susceptible to specification errors. If the specification is incorrect, the system could fail.
- This is also a problem with hardware, but software specifications are usually more complex than hardware specifications and harder to validate.
- This has been addressed in some cases by developing separate software specifications from the same user specification.
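A small sketch shows why voting cannot mask a specification error. It assumes a deliberately flawed, hypothetical specification for leap years (the real Gregorian rule also treats years divisible by 400 as leap years). Three hypothetical teams implement it faithfully with different code, so the versions agree with each other and are all wrong for the year 2000.

```python
from collections import Counter

# Hypothetical flawed specification given to every team:
#   "A year is a leap year if it is divisible by 4 but not by 100."
# The missing "unless divisible by 400" clause is the specification error.


def team_a(year: int) -> bool:
    return year % 4 == 0 and year % 100 != 0


def team_b(year: int) -> bool:
    if year % 100 == 0:
        return False
    return year % 4 == 0


def team_c(year: int) -> bool:
    # Same rule expressed through range membership (years 0..9999 only).
    return year in range(0, 10000, 4) and year not in range(0, 10000, 100)


def majority(results):
    """Return the strict-majority value, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None


def voted_is_leap(year: int) -> bool:
    # With three boolean votes there is always a strict majority.
    return majority([team(year) for team in (team_a, team_b, team_c)])


print(voted_is_leap(2024))  # True, and correct
print(voted_is_leap(2000))  # False: unanimous but wrong, 2000 was a leap year
```

The implementations are diverse, yet the voter sees no disagreement, because the fault lies in the specification that all three share.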
Summary of Topic (Architectural Styles for Fault Tolerance)
- Dependable system architectures are system architectures that are designed for fault tolerance.
- Architectural styles that support fault tolerance include protection systems, self-monitoring architectures and N-version programming.
- Software diversity is difficult to achieve because it is practically impossible to ensure that each version of the software is truly independent.