Safety-critical systems Flashcards
Critical system
A system where failure or lack of availability has a serious human, environmental or economic effect
Critical system essentials
1) Safety: the system should not harm people or the system’s environment
2) Reliability: the system must operate without serious failures
3) Availability: the system must be available to deliver services when requested to do so
4) Security: the system must be able to protect itself and its data from malicious use
Reliability regime
- A description of how the system should behave when it fails
- There are five key regimes into which many safety-critical systems fall
1) Fail-operational system
2) Fail-safe system
3) Fail-secure system
4) Fail-passive system
5) Fault-tolerant systems
Fail-operational system
- Fail-operational systems continue to operate when their control systems fail
- This failure mode is sometimes unsafe, and is hence not applicable for all systems
- Sometimes, such systems are ‘fail deadly’
Fail-safe system
- Fail-safe systems become safe when they are unable to operate
- This often involves disabling functionality and alerting operating staff
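The fail-safe behaviour above can be sketched in code. This is a minimal illustration, not a real controller; `Heater`, `read_temperature` and `alert_operators` are hypothetical names invented for the example:

```python
# Minimal fail-safe sketch: on any fault, disable functionality (drive the
# device to a known safe state) and alert operating staff.

class Heater:
    """Hypothetical controlled device."""
    def __init__(self):
        self.power_on = False

    def set_power(self, on):
        self.power_on = on

def read_temperature():
    # Stand-in for a real sensor read; raising models a sensor fault.
    raise IOError("sensor fault")

def alert_operators(message):
    print("ALERT:", message)

def control_step(heater):
    try:
        temp = read_temperature()
        heater.set_power(temp < 60.0)
    except Exception as exc:
        # Fail safe: disable functionality and alert operating staff.
        heater.set_power(False)
        alert_operators("entering safe state: %s" % exc)

heater = Heater()
heater.set_power(True)
control_step(heater)   # the sensor fault leaves the heater switched off
```

The key design choice is that the exception handler never tries to keep delivering service: it always moves to the safe state first.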
Fail-secure system
- Fail-secure systems maintain maximum security when they are unable to operate correctly
Fail-passive system
- Fail-passive systems continue to operate in the event of a system failure
- They will alert the operator to allow manual control to resume safely
Fault-tolerant systems
- Fault-tolerant systems avoid service failure when faults are introduced to the system
- This can often involve redundancy and hot-failover, but may also be used to describe systems that operate correctly at reduced capacity
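Redundancy with failover can be sketched as follows. The replica functions are made up for the example; the point is only that a fault in one replica is masked so the service as a whole avoids failure:

```python
# Sketch of fault tolerance through redundancy with failover: a failed
# replica is masked and the next redundant replica is tried.

def primary():
    raise RuntimeError("primary replica failed")

def standby():
    return "service result"

def call_with_failover(replicas):
    last_error = None
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            last_error = exc   # mask the fault and try the next replica
    # Only if every redundant replica fails does the service itself fail.
    raise RuntimeError("all replicas failed") from last_error

print(call_with_failover([primary, standby]))  # prints "service result"
```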
Risk factors in technological societies
1) Increasing Complexity
2) Increasing Exposure
3) Increasing Automation
4) Increasing Centralisation and Scale
5) Increasing pace of technological change
Increasing complexity
- The first risk factor in technological societies
- High tech systems are often made up of networks of closely related subsystems
- A problem in one subsystem may cause problems in other subsystems
- Analyses of major industrial accidents invariably reveal highly complex sequences of events leading up to accidents rather than single component failures
- More subcomponents = More complexity
Increasing exposure
- The second risk factor in technological societies
- More people today may be exposed to a given hazard than in the past
Increasing automation
- The third risk factor in technological societies
- Automation might appear to reduce the risk of human operator error
- However, it merely moves humans to other functions: maintenance, supervision, higher-level decision-making, etc.
- The effects of human decisions and actions can then be extremely serious
- Automation has not removed human error but moved it elsewhere - someone will have designed the automated system (or the system which designed the system etc.)
Increasing centralisation and scale
- The fourth risk factor in technological societies
- Increasing automation has led to centralisation of industrial production in very large plants, giving the potential for great loss and damage in the event of an accident
Increasing pace of technological change
- The fifth risk factor in technological societies
- The average time to translate a basic technical discovery into a commercial product was, in the early part of the twentieth century, 30 years.
- Nowadays it takes less than 5 years
- Economic pressures to change with technology
- May lead to less extensive testing
- Lessens the opportunity to learn from experience
- The impetus is placed on shipping and selling, leaving less time to iron out bugs or learn from experience
Common oversimplifications
- Interpreting statements about the cause of an accident requires care (the cause may be oversimplified or biased)
- Out of a large number of necessary conditions, one is often chosen as the cause, even though all factors were indispensable
There are three types:
1) Assuming human error
2) Assuming technical failures
3) Ignoring organisational factors
Assuming human error
- Often means that “the operator failed to step in and prevent the accident”, which is not a helpful finding when investigating the accident
- This explanation is used far too often; in most accident investigations it is unhelpful to blame the human controller
- E.g. Tesla driver died in first fatal autonomous car crash in May 2016
Assuming technical failure
- Don’t concentrate on only immediate physical factors such as component failure
- E.g. the Flixborough Chemical Works explosion, 1974 (28 deaths): a pipe installed by management ruptured after errors in design and modification. Other factors were involved: there was no qualified engineer on site, and far more chemicals were stored on site than the licence allowed
Ignoring organisational factors
- Accidents are often blamed on computer/operator/equipment error, ignoring the underlying factors which make such accidents inevitable
- Accident causes are very frequently rooted in the organisation - its culture, management and structure
- E.g. the Three Mile Island nuclear power plant accident: the investigation produced 19 pages of recommendations, of which only 2 were technical; the other 17 were organisational
Causality
- The agency or efficacy that connects one process (the cause) with another process or state (the effect) where the first state is understood to be at least partially responsible for the second
- A given process may have many causes that lie in its past, known as causal factors
Cause and effect
- A cause must precede a related effect, but problems finding causes arise because of two factors
1) A condition or event may precede another event without causing it
2) A condition may be considered to cause an event without the event occurring every time the condition holds
- The cause of an event is composed of a set of conditions, each of which is necessary, and which together are sufficient for the event to occur
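The definition of a cause as a set of conditions, each necessary and jointly sufficient, can be written out directly. The condition names below are invented purely for illustration:

```python
# Illustrative model: the "cause" of an event is a set of conditions,
# each necessary and together sufficient for the event to occur.

causal_conditions = {"fuel_present", "ignition_source", "oxygen_present"}

def event_occurs(conditions):
    # The event occurs exactly when every necessary condition holds.
    return causal_conditions <= set(conditions)

# All conditions hold: jointly sufficient, so the event occurs.
print(event_occurs({"fuel_present", "ignition_source", "oxygen_present"}))  # True
# One necessary condition is absent, so the event does not occur.
print(event_occurs({"fuel_present", "oxygen_present"}))                     # False
```

This also shows why singling out one condition as "the cause" is an oversimplification: removing any one of the conditions prevents the event.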
General classification of accident causes
- Accidents, in general, are the result of a complex process that causes system behaviour to violate safety constraints
- These constraints were put in place during the design, development, manufacturing and operation of a particular system
- For a failure to occur, one or more of the following must have happened:
1) Safety constraints not enforced
2) Appropriate control actions provided but not followed
Safety constraints are not enforced
- The necessary control actions to enforce the safety constraint at each level of the control structure were not provided
- The necessary actions were provided at the wrong time or not allowed to complete
- Unsafe control actions were provided, causing a violation of the safety constraints
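Enforcing a safety constraint through control actions can be sketched with a toy controller. The tank example, `MAX_LEVEL` and `safe_inflow` are hypothetical, chosen only to illustrate a constraint being enforced at one level of the control structure:

```python
# Toy sketch: the safety constraint "tank level must stay at or below
# MAX_LEVEL" is enforced by clamping unsafe control actions before they
# reach the controlled process.

MAX_LEVEL = 100  # safety constraint on the controlled process

def safe_inflow(current_level, requested_inflow):
    """Clamp the requested control action so the constraint cannot be violated."""
    allowed = max(0, MAX_LEVEL - current_level)
    return min(requested_inflow, allowed)

print(safe_inflow(90, 30))   # unsafe request is reduced to 10
print(safe_inflow(100, 5))   # at the limit, no inflow is allowed: 0
```

If this clamp were missing, provided too late, or bypassed, the system would exhibit exactly the failure modes listed above: the constraint would not be enforced.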
Appropriate actions provided but not followed
- The control system/structure provided the correct control actions to rectify the situation, but they were not followed in the system’s context
Causal factors in accidents
1) Controller operation
2) Behaviour of actuators and controlled processes
3) Communication and coordination