Safety-critical systems Flashcards
Critical system
A system where failure or lack of availability has a serious human, environmental or economic effect
Critical system essentials
1) Safety: the system should not harm people or the system’s environment
2) Reliability: the system must operate without serious failures
3) Availability: the system must be available to deliver services when requested to do so
4) Security: the system must be able to protect itself and its data from malicious use
Reliability regime
- A description of how the system should operate when it fails
- There are five key regimes that many safety-critical systems satisfy
1) Fail-operational system
2) Fail-safe system
3) Fail-secure system
4) Fail-passive system
5) Fault-tolerant systems
Fail-operational system
- Fail-operational systems continue to operate when their control systems fail
- This failure mode is sometimes unsafe, and is hence not applicable for all systems
- Sometimes, such systems are ‘fail deadly’
Fail-safe system
- Fail-safe systems become safe when they are unable to operate
- This often involves disabling functionality and alerting operating staff
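A minimal sketch of the fail-safe reaction described above, in Python. The heater/sensor setup, function names and threshold are illustrative assumptions, not material from the source.

```python
# Sketch of a fail-safe reaction: when the controller can no longer operate
# correctly (here, its sensor cannot be read), it drives the output to the
# safe, de-energised state and alerts operating staff instead of acting on
# bad data.

def fail_safe_step(read_temperature, set_heater, alert_staff):
    try:
        temperature = read_temperature()
    except Exception as fault:
        set_heater(False)                       # safe state: heater off
        alert_staff(f"sensor fault: {fault}")   # disable functionality + alert
        return
    set_heater(temperature < 60.0)              # normal control action

def broken_sensor():
    raise IOError("sensor offline")

# Example wiring with stand-in callables:
fail_safe_step(
    read_temperature=broken_sensor,
    set_heater=lambda on: print("heater on" if on else "heater off"),
    alert_staff=print,
)
# prints: heater off / sensor fault: sensor offline
```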
Fail-secure system
- Fail-secure systems maintain maximum security when they are unable to operate correctly
Fail-passive system
- Fail-passive systems continue to operate in the event of a system failure
- They will alert the operator to allow manual control to resume safely
Fault-tolerant systems
- Fault-tolerant systems avoid service failure when faults are introduced to the system
- This can often involve redundancy and hot-failover, but may also be used to describe systems that operate correctly at reduced capacity
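A minimal sketch of the redundancy idea above, in Python. The sensor setup, median vote and exception handling are illustrative assumptions rather than material from the source.

```python
# Sketch of fault tolerance through redundancy: three redundant sensors are
# polled, failed units are ignored, and the median of the surviving readings
# is used, so a single sensor fault does not become a service failure.
from statistics import median

def read_sensor(sensor):
    # Return a reading, or None if this (hypothetical) sensor has failed.
    try:
        return sensor()          # each 'sensor' is modelled as a callable
    except Exception:
        return None              # treat any exception as a failed unit

def redundant_reading(sensors):
    readings = [r for r in (read_sensor(s) for s in sensors) if r is not None]
    if not readings:
        # every redundant unit is down, e.g. a common-cause failure
        raise RuntimeError("all redundant sensors failed")
    return median(readings)

def dead_sensor():
    raise IOError("sensor offline")

# One unit has failed, but the service still delivers a value.
print(redundant_reading([lambda: 20.0, dead_sensor, lambda: 21.0]))  # 20.5
```

Note that the sketch still fails outright if all the units share a common cause, which is exactly the over-reliance-on-redundancy point made later in these cards.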
Risk factors in technological societies
1) Increasing Complexity
2) Increasing Exposure
3) Increasing Automation
4) Increasing Centralisation and Scale
5) Increasing pace of technological change
Increasing complexity
- The first risk factor in technological societies
- High tech systems are often made up of networks of closely related subsystems
- A problem in one subsystem may cause problems in other subsystems
- Analyses of major industrial accidents invariably reveal highly complex sequences of events leading up to accidents rather than single component failures
- More subcomponents = More complexity
Increasing exposure
- The second risk factor in technological societies
- More people today may be exposed to a given hazard than in the past
Increasing automation
- The third risk factor in technological societies
- Automation might appear to reduce the risk of human operator error
- However, it just moves humans to other functions: maintenance, supervision, higher-level decision-making, etc.
- The effects of human decisions and actions can then be extremely serious
- Automation has not removed human error but moved it elsewhere - someone will have designed the automated system (or the system which designed the system etc.)
Increasing centralisation and scale
- The fourth risk factor in technological societies
- Increasing automation has led to centralisation of industrial production in very large plants, giving the potential for great loss and damage in the event of an accident
Increasing pace of technological change
- The fifth risk factor in technological societies
- The average time to translate a basic technical discovery into a commercial product was, in the early part of the twentieth century, 30 years.
- Nowadays it takes less than 5 years
- Economic pressures to change with technology
- May lead to less extensive testing
- Lessens the opportunity to learn from experience
- The emphasis is placed on shipping and selling, so there is less time to iron out bugs or learn from experience
Common oversimplifications
- Interpreting statements about the cause of an accident requires care (the cause may be oversimplified or biased)
- Out of a large number of necessary conditions, one is often chosen as the cause, even though all factors were indispensable
There are three types:
1) Assuming human error
2) Assuming technical failures
3) Ignoring organisational factors
Assuming human error
- Often means that "the operator failed to step in and prevent the accident", which is not helpful when investigating the accident
- This explanation is used far too often; in most accident investigations it is unhelpful to blame the human controller
- E.g. a Tesla driver died in the first fatal autonomous car crash in May 2016
Assuming technical failure
- Don’t concentrate on only immediate physical factors such as component failure
- E.g. Flixborough Chemical Works explosion, 1974 (28 deaths). There were errors in design and modification: a pipe that had been put in by management ruptured. Other factors were involved - there was no qualified engineer on site, and far more chemicals were on site than the licence allowed
Ignoring organisational factors
- Accidents are often blamed on computer/operator/equipment error, ignoring the underlying factors which make such accidents inevitable
- Accident causes are very frequently rooted in the organisation - its culture, management and structure
- E.g. Three Mile Island nuclear power plant: the investigation produced 19 pages of recommendations, of which only 2 were technical; the other 17 were organisational
Causality
- The agency or efficacy that connects one process (the cause) with another process or state (the effect) where the first state is understood to be at least partially responsible for the second
- A given process may have many causes that lie in its past, known as causal factors
Cause and effect
- A cause must precede a related effect, but problems finding causes arise because of two factors
1) A condition or event may precede another event without causing it
2) A condition may be considered to cause an event without the event occurring every time the condition holds
- The cause of an event is composed of a set of conditions, each of which is necessary and which together are sufficient for the event to occur
General classification of accident causes
- Accidents, in general, are the result of a complex process that causes system behaviour to violate safety constraints
- These constraints were put in place during the design, development, manufacturing and operation of a particular system
- For a failure to occur, one or more of the following must have happened:
1) Safety constraints not enforced
2) Appropriate control actions provided but not followed
Safety constraints are not enforced
- The necessary control actions to enforce the safety constraint at each level of the control structure were not provided
- The necessary actions were provided at the wrong time or not allowed to complete
- Unsafe control actions were provided, causing a violation of the safety constraints
Appropriate control actions provided but not followed
- The control system/structure provided the correct control actions to rectify the situation, but they were not followed in the system's context
Causal factors in accidents
1) Controller operation
2) Behaviour of actuators and controlled processes
3) Communication and coordination
Controller operation
- Controller operation has three main parts, each of which may contribute to inadequate control actions
1) Control Inputs and Other External Information
- Control actions flow through a system, so there is a risk of incorrect information being provided by another level or component of the system
- In these cases, the controller may act correctly on the incorrect information and still produce an unsafe outcome
2) Control Algorithms
- Algorithms may not enforce safety constraints, either because the initial design was inadequate or because a safe design was modified unsafely
- Time delays and lag must be taken into account when designing control routines, and sometimes these delays need to be inferred
3) The Process Model
- The model used by the controller must be consistent with the actual process state, otherwise discrepancies between the two may contribute to an accident through erroneous actions being taken (see the sketch below)
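A minimal sketch of the process-model point above (names, values and the pressure scenario are assumptions for illustration, not from the source): the control algorithm is correct with respect to the controller's internal model, but the model has drifted from the real process state, so an unsafe command is issued.

```python
# Sketch: a controller acting on its process model rather than the real state.
# If the model is stale (e.g. a lost sensor update), a control algorithm that
# is correct w.r.t. the model can still violate the safety constraint.

SAFETY_LIMIT = 100.0   # illustrative constraint: pressure must stay below this

class Controller:
    def __init__(self):
        self.model_pressure = 50.0         # controller's belief about the process

    def update_model(self, sensor_reading):
        if sensor_reading is not None:     # None models a lost/late update
            self.model_pressure = sensor_reading

    def control_action(self):
        # Only open the inlet when the *modelled* pressure is well below the limit
        return "open_inlet" if self.model_pressure < SAFETY_LIMIT - 20 else "hold"

ctrl = Controller()
actual_pressure = 95.0
ctrl.update_model(None)             # sensor update lost: model stays at 50.0
action = ctrl.control_action()      # "open_inlet" - unsafe given the real state
print(action, "while actual pressure is", actual_pressure)
```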
Hierarchical model of accident causes
Level 1: Mechanisms
- The succession of events
Level 2: Conditions
- Conditions (or lack of conditions) that allowed the events at Level 1 to occur
Level 3: Constraints
- Constraints (or lack of constraints) that allowed conditions to cause the events, or that allowed conditions to exist
Behaviour of actuators and controlled processes
- It is sometimes the case that, even though the controller issues commands that maintain the constraints, the controlled process is unable to act on those commands
- This can stem from multiple causes, including:
1) Communication channel failure
2) Mechanical failure
3) Correct execution of safety inputs may depend on input from other system components
- These kinds of flaws arise from system design and development
Communication and coordination
- When there are multiple sources of control it can be the case that control actions are not properly coordinated. This may result in unexpected side-effects or conflicts between control actions. This usually arises from communication flaws.
- Accidents appear to be more likely in boundary areas where multiple controllers control the same process (or processes with common boundaries). This is due to the potential for ambiguity and conflicts between decisions and often occurs due to poorly defined boundaries.
- Overlap areas occur where a function is achieved through the cooperation of multiple controllers or where multiple controllers influence the same object. This also creates the potential for conflicting control actions.
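A minimal sketch of the coordination problem above (the valve, the two controllers and the "last command wins" behaviour are illustrative assumptions): two controllers share authority over the same component without a defined boundary, so their control actions conflict and one silently overrides the other.

```python
# Sketch of a coordination flaw: two controllers command the same valve with
# no agreed boundary or protocol, so the later command silently overrides the
# earlier one and the controllers' goals conflict.
class Valve:
    def __init__(self):
        self.open = False

    def command(self, should_open, source):
        self.open = should_open
        print(f"{source} set valve open={should_open}")

valve = Valve()
valve.command(True, source="pressure_controller")       # open to relieve pressure
valve.command(False, source="temperature_controller")   # close to conserve heat
print("final state: open =", valve.open)                # pressure goal defeated
```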
Root causes of accidents
1) Deficiencies in the safety culture of the industry or organisation
2) Flawed organisational structures
3) Superficial or ineffective technical activities
Deficiencies in the safety culture of the industry or organisation
- Safety culture: the general attitude and approach to safety reflected by those who participate in that industry. e.g. management, industry regulators, government regulators
- In an ideal world all participants are equally concerned about safety, both in the processes they use and in the final product. But this is not always the case; that’s why there is a requirement to have industry and government regulators
Deficiencies in the safety culture
Major accidents often stem from flaws in this culture, especially:
• Overconfidence and complacency
a) Discounting risk.
b) Over-reliance on redundancy.
c) Unrealistic risk assessment.
d) Ignoring high-consequence, low probability events.
e) Assuming risk decreases over time.
f) Underestimating software related risks.
g) Ignoring warning signs.
• Disregard or low priority for safety
• Flawed resolution of conflicting goals
Discounting risks
Major accidents are often preceded by the belief that they cannot happen
Over-reliance on redundancy
- Redundant systems use extra components to ensure that failure of one component doesn’t result in failure of the whole system.
- Many accidents can be traced back to common-cause failures in redundant systems.
- Common-cause failure happens when multiple redundant components fail at the same time for the same reason (e.g. fire, or electric outage)
- Providing redundancy may help if a component fails, BUT we must be aware that all redundant components may fail
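A worked example of the common-cause point above, with assumed (illustrative) probabilities: multiplying per-channel failure probabilities is only valid for independent failures, and a shared cause puts a floor under the combined figure.

```python
# Worked example (illustrative numbers): why common-cause failures undermine
# redundancy. Independent channel failures multiply; a shared cause such as a
# fire or power outage dominates the combined probability.

p_single = 1e-3    # assumed probability that one channel fails on demand
p_common = 1e-4    # assumed probability of a cause that takes out both channels

p_independent       = p_single ** 2                              # both fail independently
p_with_common_cause = p_common + (1 - p_common) * p_single ** 2

print(f"independent channels only: {p_independent:.1e}")        # 1.0e-06
print(f"with a common cause:       {p_with_common_cause:.1e}")  # ~1.0e-04
```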
Unrealistic risk assessment
It is quite common for developers to state that the probability of a fault occurring in a piece of software is 10⁻⁴, usually with little or no justification
Example:
Therac-25 software risk assessment was 10⁻¹¹ for the event "computer selects wrong energy level"
Instead of launching an investigation when informed about possible overdoses, the manufacturer of the Therac-25 responded that the risk assessment showed the accidents were impossible
Ignoring high-consequence, low-probability events
- A common discovery after accidents is that the events involved were recognized as being very hazardous before the accident, but were dismissed as incredible
Assuming risk decreases over time
- A common thread in accidents is the belief that a system must be safe because it has operated without any accidents for many years
- Risk may decrease, remain constant or increase over time
- It can increase due to operators becoming over-familiar with safety procedures and hence becoming lax or even skipping them
Underestimating software related risks
- There is a pervasive belief that software cannot fail, and that all errors will be removed by testing
Ignoring warning signs
- Accidents are frequently preceded by public warnings or a series of minor occurrences
- Many basic mechanical safety devices are well tested, cheap, reliable and failsafe (based on physical principles to fail in a safe state)
Disregard or low priority for safety
- Problems will occur if management is not interested in safety, because the workers will not be encouraged to think about safety
- The Government may disregard safety and ignore the need for government/industry watchdogs and standards committees
- In fact these often only appear after major accidents
• The entire organisation must have a high level of commitment to safety in order to prevent accident.
• The lead must come from the top and permeate every organizational level
Flawed resolution of conflicting goals
- The most common conflict is the cost-safety trade-off: safety costs, or appears to cost, more money at the time of development
- Often cost becomes more important and safety may therefore be compromised in the rush for greater profits
Flawed organisational structures
Many accident investigations uncover a sincere concern for safety in the organisation, but find organisational structures in place that were ineffective in implementing this concern.
1) Diffusion of responsibility and authority
2) Lack of independence and low status of safety personnel
3) Poor and limited communications channels
Diffusion of responsibility and authority
- Accidents are often associated with ill-defined responsibility and authority for safety matters
- There should be at least one person with overall responsibility for safety, and they must have real power within the company
Lack of independence and low status of safety personnel
- This leads to their inability or unwillingness to bring up safety issues - e.g. Safety officers should not be under the supervision of the groups whose activities they must check
- Low status means no involvement in decision making
Poor and limited communication channels
- In some industries, strict line management means that workers report only to their direct superiors
- Problems with safety may not be reported to interested parties and as a result, safety decisions may not be reported back to the workers
- All staff should have direct access to safety personnel and vice versa
Superficial or Ineffective Technical Activities
This is concerned with poor implementation of all the activities necessary to achieve an acceptable level of safety
1) Superficial safety efforts
2) Ineffective risk control
3) Failure to evaluate changes
4) Information deficiencies
Superficial safety efforts
- Any efforts to ensure safety only take place at a superficial level, with no substantive action taken about any issues discovered and recorded
Example:
- Hazard logs kept but no description of design decisions taken or trade-offs made to mitigate/control the recognised hazards
- No follow-ups to ensure hazards have ever been controlled
- No follow-ups to ensure safety devices are kept in working order
Ineffective risk control
- Risks are known, but little effort is put into controlling them
- The majority of accidents do not result from a lack of knowledge about how to prevent them
- They result from a failure to use that knowledge effectively when trying to fix the problem
Failure to evaluate changes
- Accidents often involve a failure to re-evaluate safety after changes are made
- Any changes in hardware or software must be re-evaluated to determine whether safety has been compromised
- Quick fixes often affect safety because they are not evaluated properly
- For software, this would comprise a regression test plus a system and software safety analysis
Information deficiencies
- Feedback of operational experience is one of the most important sources of information for designing, maintaining and improving safety, but is often overlooked
- Case studies are valuable for assessing hypotheses and forming intuition for mistakes
- There are 2 types of data that are important:
• Information about accidents/incidents for the system itself
• Information about accidents/incidents for similar systems
ADS
- Stands for Automated Driving System
Modelling accidents
- Accident models attempt to reduce an accident description to a series of events and conditions that account for the outcome
- Such models are used to:
1) Understand past accidents (a way to organise data and set priorities)
2) Learn how to prevent future accidents (predictive modeling)
Model
A representation of a system, made from the composition of concepts, that is used to help people understand and simulate the subject represented by the model
Domino model
- The general accident sequence is mapped onto five “dominoes” in the following order:
1) Ancestry or, social environment
2) Fault of person
3) Unsafe act or condition
4) Accident
5) Injury
- Once one domino "falls", it causes a chain of falling dominoes until the accident occurs
- Removing any of the dominoes will break the sequence
- Although removing any domino will prevent the accident, it is generally considered that the easiest and most effective domino to remove is domino 3: unsafe act or condition
- This model has been very influential in accident investigations, but has often been wrongly used to look for a single unsafe act or condition, when causes were actually more complex
Chain of events model
- Organise causal factors into chains of events
- Events are chained in chronological order, but there is often no obvious stopping point when tracing back from the cause of an accident
- This model is very close to our view of accidents, where we often try to rationalise it into a series of events
- As with the domino model, if the chain is broken, the accident won’t happen
- Thus, accident prevention measures concentrate on either:
1) Eliminating certain events or conditions
2) Intervening between events of the chain
3) Adding enough AND gates, so that several independent events must all occur before the chain can continue (worked example below)
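A worked example of the AND-gate point above, with assumed probabilities and assuming the two events are independent: requiring both events to occur makes the accident far less likely than when either event alone can continue the chain.

```python
# Worked example (illustrative numbers): the effect of an AND gate in an event
# chain. If either of two independent events lets the chain continue, their
# probabilities roughly add; if both must occur, they multiply.
p_a, p_b = 1e-2, 1e-3

p_or  = p_a + p_b - p_a * p_b   # either event continues the chain
p_and = p_a * p_b               # AND gate: both events are required

print(f"OR  of the two events: {p_or:.2e}")   # 1.10e-02
print(f"AND of the two events: {p_and:.2e}")  # 1.00e-05
```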
Failure
- Failure is the inability of a system or component to fulfil its operational requirement, i.e., to perform its intended function for a specified time under specified environmental conditions.
- Failure is an event or behaviour which occurs at a particular instant in time
Error
An error is a design flaw or deviation from a desired or intended state
- An error is a static condition, a state, which remains until it is removed (usually through human intervention)
- An error may lead to an operational failure
Fault
- A fault is a hardware or software defect which resides temporarily or permanently in the system
- Faults are higher order events
- All failures are faults, but not all faults are failures
Example:
- If a relay fails to close properly when a voltage is impressed across its terminals, then this event is a relay failure
- If the relay closes at the wrong time due to the improper functioning of some upstream component, then the relay has not failed, but untimely relay operation may well cause the entire circuit to enter an unsatisfactory state - this event is called a fault
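A software analogue of the relay example above (the code and the empty-list defect are illustrative assumptions): the defect is a static error residing in the code, and the failure is the event that occurs at the instant the defective path is exercised.

```python
# Illustrative fault/error/failure distinction in software. The defect below is
# a static condition (an error) residing in the code; nothing observable happens
# until the faulty path is exercised, at which point the service deviates from
# its intended function - that event is the failure.

def average(values):
    # Error: the empty-list case was never handled at design time
    return sum(values) / len(values)

print(average([4, 8, 6]))    # defect present, yet no failure is observed

try:
    average([])              # the failure occurs here, at a particular instant
except ZeroDivisionError as failure:
    print("failure event:", failure)
```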
Accident
- An undesired and unplanned (but not necessarily unexpected) event that results in (at least) a specified level of loss
- An accident for any particular system must be defined by level of loss involved
- Loss: life, property or environment; immediate or long-term
- Example: two aircraft coming within a pre-defined distance of each other and colliding
Near-miss or incident
- An event which involves no loss (or only minor loss) but with the potential for loss under different circumstances
- An incident for a particular system depends on how accidents are defined for the system
- Example: two aircraft coming within a pre-defined distance of each other but not colliding
Hazard
- A state of the system that may give rise to an accident
- Hazard is a situation in which there is actual or potential danger to people or to the environment
- Hazard is specific to a particular system and is defined with respect to the environment of the system/object
Risk
A combination of the likelihood of an accident and the severity of the potential consequences
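A minimal sketch of combining likelihood and severity into a risk figure. The ordinal scales, the multiplication and the acceptability threshold are assumptions for illustration; real schemes define their own categories and criteria.

```python
# Sketch: risk as a combination of likelihood and severity, using assumed
# ordinal scales and an assumed acceptability threshold.

LIKELIHOOD = {"improbable": 1, "remote": 2, "occasional": 3, "frequent": 4}
SEVERITY   = {"negligible": 1, "marginal": 2, "critical": 3, "catastrophic": 4}

def risk_score(likelihood, severity):
    return LIKELIHOOD[likelihood] * SEVERITY[severity]

def acceptable(likelihood, severity, threshold=6):
    return risk_score(likelihood, severity) < threshold

print(risk_score("remote", "catastrophic"))   # 8
print(acceptable("remote", "catastrophic"))   # False -> risk must be reduced
```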
Safety
- Freedom from accidents or losses
- It is often argued that there is no such thing as absolute safety, but instead a thing is safe if the attendant risks are judged acceptable