Safety-critical systems Flashcards

1
Q

Critical system

A

A system where failure or lack of availability has a serious human, environmental or economic effect

2
Q

Critical system essentials

A

1) Safety: the system should not harm people or the system’s environment
2) Reliability: the system must operate without serious failures
3) Availability: the system must be available to deliver services when requested to do so
4) Security: the system must be able to protect itself and its data from malicious use

3
Q

Reliability regime

A
  • A description of how the system should operate when it fails
  • There are five key regimes that many safety-critical systems satisfy

1) Fail-operational system
2) Fail-safe system
3) Fail-secure system
4) Fail-passive system
5) Fault-tolerant systems

4
Q

Fail-operational system

A
  • Fail-operational systems continue to operate when their control systems fail
  • This failure mode is sometimes unsafe, and is hence not applicable for all systems
  • Sometimes, such systems are ‘fail deadly’
5
Q

Fail-safe system

A
  • Fail-safe systems become safe when they are unable to operate
  • This often involves disabling functionality and alerting operating staff
6
Q

Fail-secure system

A
  • Fail-secure systems maintain maximum security when they are unable to operate correctly
7
Q

Fail-passive system

A
  • Fail-passive systems continue to operate in the event of a system failure
  • They will alert the operator to allow manual control to resume safely
8
Q

Fault-tolerant systems

A
  • Fault-tolerant systems avoid service failure when faults are introduced to the system
  • This can often involve redundancy and hot-failover, but may also be used to describe systems that operate correctly at reduced capacity
9
Q

Risk factors in technological societies

A

1) Increasing Complexity
2) Increasing Exposure
3) Increasing Automation
4) Increasing Centralisation and Scale
5) Increasing pace of technological change

10
Q

Increasing complexity

A
  • The first risk factor in technological societies
  • High tech systems are often made up of networks of closely related subsystems
  • A problem in one subsystem may cause problems in other subsystems
  • Analyses of major industrial accidents invariably reveal highly complex sequences of events leading up to accidents rather than single component failures
  • More subcomponents = More complexity
11
Q

Increasing exposure

A
  • The second risk factor in technological societies
  • More people today may be exposed to a given hazard than in the past
12
Q

Increasing automation

A
  • The third risk factor in technological societies
  • Automation might appear to reduce the risk of human operator error
  • However, it just moves the humans to other functions - maintenance, supervisory, higher-level decision making etc.
  • The effects of human decisions and actions can then be extremely serious
  • Automation has not removed human error but moved it elsewhere - someone will have designed the automated system (or the system which designed the system etc.)
13
Q

Increasing centralisation and scale

A
  • The fourth risk factor in technological societies
  • Increasing automation has led to centralisation of industrial production in very large plants, giving the potential for great loss and damage in the event of an accident
14
Q

Increasing pace of technological scale

A
  • The fifth risk factor in technological societies
  • The average time to translate a basic technical discovery into a commercial product was, in the early part of the twentieth century, 30 years.
  • Nowadays it takes less than 5 years
  • Economic pressures to change with technology
  • May lead to less extensive testing
  • Lessens the opportunity to learn from experience
  • Emphasis is placed on shipping and selling, so there is less time to iron out bugs or learn from experience
15
Q

Common oversimplifications

A
  • Interpreting statements about the cause of an accident requires care (the cause may be oversimplified or biased)
  • Out of a large number of necessary conditions, one is often chosen as the cause, even though all factors were indispensable

There are three types:

1) Assuming human error
2) Assuming technical failures
3) Ignoring organisational factors

16
Q

Assuming human error

A
  • Often means that “the operator failed to step in and prevent the accident” which is not helpful when investigating the accident
  • Is used far too often and it is unhelpful to blame the human controller in most accident investigations.
  • E.g. Tesla driver died in first fatal autonomous car crash in May 2016
17
Q

Assuming technical failure

A
  • Don’t concentrate on only immediate physical factors such as component failure
  • E.g. Flixborough Chemical Works explosion, 1974 (28 deaths). Errors in design and modification: a pipe that had been put in by management ruptured. Other factors were involved - no qualified engineer on site, and far more chemicals were on site than the licence allowed
18
Q

Ignoring organisational factors

A
  • Accidents are often blamed on computer/operator/equipment error, ignoring the underlying factors which make such accidents inevitable
  • Accident causes are very frequently rooted in the organisation - its culture, management and structure
  • E.g. Three Mile Island nuclear power plant: the investigation produced 19 pages of recommendations. Only 2 pages were technical; the other 17 were organisational
19
Q

Causality

A
  • The agency or efficacy that connects one process (the cause) with another process or state (the effect) where the first state is understood to be at least partially responsible for the second
  • A given process may have many causes that lie in its past, known as causal factors.
20
Q

Cause and effect

A
  • A cause must precede a related effect, but problems finding causes arise because of two factors
    1) A condition or event may precede another event without causing it
    2) A condition may be considered to cause an event without the event occurring every time the condition holds
  • The cause of an event is composed of a set of conditions, each of which is necessary, and which together are sufficient for the event to occur
21
Q

General classification of accident causes

A
  • Accidents, in general, are the result of a complex process that causes system behaviour to violate safety constraints
  • These constraints were put in place during the design, development, manufacturing and operation of a particular system
  • For a failure to occur, one or more of the following must have happened:
    1) Safety constraints not enforced
    2) Appropriate control actions provided but not followed
22
Q

Safety constraints are not enforced

A
  • The necessary control actions to enforce the safety constraint at each level of the control structure were not provided
  • The necessary actions were provided at the wrong time or not allowed to complete
  • Unsafe control actions were provided, causing a violation of the safety constraints
23
Q

Appropriate control actions provided but not followed

A
  • The control system/structure provides the correct control actions to rectify the situation, but they were not followed in the system's context
24
Q

Causal factors in accidents

A

1) Controller operation
2) Behaviour of actuators and controlled processes
3) Communication and coordination

25
Q

Controller operation

A
  • Controller operation has three main parts, each of which may contribute to the lack of control actions
    1) Control Inputs and Other External Information
  • Control actions flow through a system, so there is a risk of incorrect information being provided by another level or component of the system
  • In these cases, the incorrect information may be acted upon correctly.
    2) Control Algorithms
  • Algorithms may not enforce safety constraints due to inadequate design initially or unsafe modification to safe designs
  • Time delays and lag must be taken into account when designing control routines, and sometimes these delays need to be inferred
    3) The Process Model
  • The model used by the controllers must be consistent with the actual process state, otherwise these discrepancies may contribute to an accident through erroneous actions being taken
26
Q

Hierarchical model of accident causes

A

Level 1: Mechanisms
- The succession of events
Level 2: Conditions
- Conditions (or lack of conditions) that allowed the events at Level 1 to occur
Level 3: Constraints
- Constraints (or lack of constraints) that allowed conditions to cause the events, or that allowed conditions to exist

27
Q

Behaviour of actuators and controlled processes

A
  • It is sometimes the case that, while the controller maintains the constraints, the controlled process may be unable to act on the commands
  • This can stem from multiple causes, including:
    1) Communication channel failure
    2) Mechanical failure
    3) Correct execution of safety inputs may depend on input from other system components
  • These kinds of flaws arise from system design and development
28
Q

Communication and coordination

A
  • When there are multiple sources of control it can be the case that control actions are not properly coordinated. This may result in unexpected side-effects or conflicts between control actions. This usually arises from communication flaws.
  • Accidents appear to be more likely in boundary areas where multiple controllers control the same process (or processes with common boundaries). This is due to the potential for ambiguity and conflicts between decisions and often occurs due to poorly defined boundaries.
  • Overlap areas occur where a function is achieved through the cooperation of multiple controllers or where multiple controllers influence the same object. This also creates the potential for conflicting control actions.
29
Q

Root causes of accidents

A

1) Deficiencies in the safety culture of the industry or organisation
2) Flawed organisational structures
3) Superficial or ineffective technical activities

30
Q

Deficiencies in the safety culture of the industry or organisation

A
  • Safety culture: the general attitude and approach to safety reflected by those who participate in that industry. e.g. management, industry regulators, government regulators
  • In an ideal world all participants are equally concerned about safety, both in the processes they use and in the final product. But this is not always the case; that’s why there is a requirement to have industry and government regulators
31
Q

Deficiencies in the safety culture

A

Major accidents often stem from flaws in this culture, especially:
• Overconfidence and complacency
a) Discounting risk.
b) Over-reliance on redundancy.
c) Unrealistic risk assessment.
d) Ignoring high-consequence, low probability events.
e) Assuming risk decreases over time.
f) Underestimating software related risks.
g) Ignoring warning signs.
• Disregard or low priority for safety
• Flawed resolution of conflicting goals

32
Q

Discounting risks

A

Major accidents are often preceded by the belief that they cannot happen

33
Q

Over-reliance on redundancy

A
  • Redundant systems use extra components to ensure that failure of one component doesn’t result in failure of the whole system.
  • Many accidents can be traced back to common-cause failures in redundant systems.
  • Common-cause failure happens when multiple redundant components fail at the same time for the same reason (e.g. fire, or electric outage)
  • Providing redundancy may help if a component fails, BUT we must be aware that all redundant components may fail
34
Q

Unrealistic risk assessment

A

It is quite common for developers to state that the probability of a software fault occurring in a piece of software is 10⁻⁴, usually with little or no justification

Example:
Therac-25 software risk assessment was 10⁻¹¹ for the event “computer selects wrong energy level”
Instead of launching an investigation when informed about possible overdoses, the manufacturer of the Therac-25 responded that the risk assessment showed the accidents were impossible

35
Q

Ignoring high-consequence, low-probability events

A
  • A common discovery after accidents is that the events involved were recognized as being very hazardous before the accident, but were dismissed as incredible
36
Q

Assuming risk decreases over time

A
  • A common thread in accidents is the belief that a system must be safe because it has operated without any accidents for many years
  • Risk may decrease, remain constant or increase over time
  • It can increase due to operators becoming over-familiar with safety procedures and hence become lax or even miss them out
37
Q

Underestimating software related risks

A
  • There is a pervading belief that software cannot fail, and that all errors will be removed by testing
38
Q

Ignoring warning signs

A
  • Accidents are frequently preceded by public warnings or a series of minor occurrences
  • Many basic mechanical safety devices are well tested, cheap, reliable and failsafe (based on physical principles to fail in a safe state)
39
Q

Disregard or low priority for safety

A
  • Problems will occur if management is not interested in safety, because the workers will not be encouraged to think about safety
  • The Government may disregard safety and ignore the need for government/industry watchdogs and standards committees
  • In fact these often only appear after major accidents
  • The entire organisation must have a high level of commitment to safety in order to prevent accidents
  • The lead must come from the top and permeate every organisational level
40
Q

Flawed resolution of conflicting goals

A
  • The most common conflict is the cost-safety trade-off: safety appears to cost more money at the time of development
  • Often cost becomes more important and safety may therefore be compromised in the rush for greater profits
41
Q

Flawed organisational structures

A

Many accident investigations uncover a sincere concern for safety in the organisation, but find organisational structures in place that were ineffective in implementing this concern.

1) Diffusion of responsibility and authority
2) Lack of independence and low status of safety personnel
3) Poor and limited communications channels

42
Q

Diffusion of responsibility and authority

A
  • Accidents are often associated with ill-defined responsibility and authority for safety matters
  • There should be at least one person with overall responsibility for safety, and they must have real power within the company
43
Q

Lack of independence and low status of safety personnel

A
  • This leads to their inability or unwillingness to bring up safety issues - e.g. Safety officers should not be under the supervision of the groups whose activities they must check
  • Low status means no involvement in decision making
44
Q

Poor and limited communication channels

A
  • In some industries, strict line management means that workers report only to their direct superiors
  • Problems with safety may not be reported to interested parties and as a result, safety decisions may not be reported back to the workers
  • All staff should have direct access to safety personnel and vice versa
45
Q

Superficial or Ineffective Technical Activities

A

This is concerned with poor implementation of all the activities necessary to achieve an acceptable level of safety

1) Superficial safety efforts
2) Ineffective risk control
3) Failure to evaluate changes
4) Information deficiencies

46
Q

Superficial safety efforts

A
  • Any efforts to ensure safety only take place at a superficial level, with no substantive action taken about any issues discovered and recorded

Example:

  • Hazard logs kept but no description of design decisions taken or trade-offs made to mitigate/control the recognised hazards
  • No follow-ups to ensure hazards have ever been controlled
  • No follow-ups to ensure safety devices are kept in working order
47
Q

Ineffective risk control

A
  • Risks are known, but little effort is put into controlling them
  • The majority of accidents do not result from a lack of knowledge about how to prevent them
  • They result from a failure to use that knowledge effectively when trying to fix the problem
48
Q

Failure to evaluate changes

A
  • Accidents often involve a failure to re-evaluate safety after changes are made
  • Any changes in hardware or software must be re-evaluated to determine whether safety has been compromised
  • Quick fixes often affect safety because they are not evaluated properly
  • For software, this would comprise a regression test plus a system and software safety analysis
49
Q

Information deficiencies

A
  • Feedback of operational experience is one of the most important sources of information for designing, maintaining and improving safety, but is often overlooked
  • Case studies are valuable for assessing hypotheses and forming intuition for mistakes
  • There are 2 types of data that are important:
    • Information about accidents/incidents for the system itself
    • Information about accidents/incidents for similar systems
50
Q

ADS

A
  • Stands for Automated Driving System
51
Q

Modelling accidents

A
  • Accident models attempt to reduce an accident description to a series of events and conditions that account for the outcome
  • Such models are used to:
    1) Understand past accidents (a way to organise data and set priorities)
    2) Learn how to prevent future accidents (predictive modeling)
52
Q

Model

A

A representation of a system, made from the composition of concepts, that is used to help people understand and simulate the subject represented by the model

53
Q

Domino model

A
  • The general accident sequence is mapped onto five “dominoes” in the following order:
    1) Ancestry or social environment
    2) Fault of person
    3) Unsafe act or condition
    4) Accident
    5) Injury
  • Once one domino “falls”, it causes a chain of falling dominoes until the accident occurs
  • Removing any of the dominoes will break the sequence
  • Although removing any domino will prevent the accident, it is generally considered that the easiest and most effective domino to remove is domino 3: unsafe act or condition
  • This model has been very influential in accident investigations, but has often been wrongly used to look for a single unsafe act or condition, when causes were actually more complex
54
Q

Chain of events model

A
  • Organise causal factors into chains of events
  • Events are chained in chronological order, but there is often no obvious stopping point when tracing back from the cause of an accident
  • This model is very close to our view of accidents, where we often try to rationalise them into a series of events
  • As with the domino model, if the chain is broken, the accident won’t happen
  • Thus, accident prevention measures concentrate on either:
    1) Eliminating certain events or conditions
    2) Intervening between events of the chain
    3) Adding enough AND gates
55
Q

Failure

A
  • Failure is the inability of a system or component to fulfil its operational requirement, i.e., to perform its intended function for a specified time under specified environmental conditions.
  • Failure is an event or behaviour which occurs at a particular instant in time
56
Q

Error

A

An error is a design flaw or deviation from a desired or intended state

  • An error is a static condition, a state, which remains until it is removed (usually through human intervention)
  • An error may lead to an operational failure
57
Q

Fault

A
  • A fault is a hardware or software defect which resides temporarily or permanently in the system
  • Faults are higher order events
  • All failures are faults, but not all faults are failures

Example:

  • If a relay fails to close properly when a voltage is impressed across its terminals, then this event is a relay failure
  • If the relay closes at the wrong time due to the improper functioning of some upstream component, then the relay has not failed, but untimely relay operation may well cause the entire circuit to enter an unsatisfactory state - this event is called a fault
58
Q

Accident

A
  • An undesired and unplanned (but not necessarily unexpected) event that results in (at least) a specified level of loss
  • An accident for any particular system must be defined by level of loss involved
  • Loss: life, property or environment; immediate or long-term
  • Example: two aircraft coming within a pre-defined distance from each other and colliding
59
Q

Near-miss or incident

A
  • An event which involves no loss (or only minor loss) but with the potential for loss under different circumstances
  • An incident for a particular system depends on how accidents are defined for the system
  • Example: two aircraft coming within a pre-defined distance from each other but not colliding
60
Q

Hazard

A
  • A state of the system that may give rise to an accident
  • Hazard is a situation in which there is actual or potential danger to people or to the environment
  • Hazard is specific to a particular system and is defined w.r.t the environment of the system/object
61
Q

Risk

A

A combination of the likelihood of an accident and the severity of the potential consequences

62
Q

Safety

A
  • Freedom from accidents or losses
  • It is often argued that there is no such thing as absolute safety, but instead a thing is safe if the attendant risks are judged acceptable
63
Q

Risk in terms of hazard

A
  • Risk is the hazard level (severity + likelihood of occurring) combined with likelihood of hazard leading to an accident (danger) and hazard exposure or duration (latency)
  • Exposure of hazard: the longer the hazardous state exists, the greater the chance that other prerequisite conditions occur
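
A minimal sketch of how these factors might be combined into a single risk figure; the numeric scales, units and the multiplicative combination are illustrative assumptions rather than a prescribed formula:

```python
# Illustrative only: scales, units and the multiplicative combination are assumptions.

def hazard_level(severity: float, likelihood: float) -> float:
    """Hazard level from severity (assumed 1-4 scale) and likelihood of the
    hazardous state arising (assumed probability per operating hour)."""
    return severity * likelihood

def risk(severity: float, likelihood: float, danger: float, latency_hours: float) -> float:
    """Combine hazard level with the likelihood of the hazard leading to an
    accident (danger) and the hazard exposure or duration (latency)."""
    return hazard_level(severity, likelihood) * danger * latency_hours

# Example: a severe hazard (4) arising 1e-4 times per hour, with a 10% chance of
# leading to an accident while present, persisting for 2 hours on average.
print(risk(severity=4, likelihood=1e-4, danger=0.1, latency_hours=2.0))
```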
64
Q

The Safety Lifecycle

A
  • Identify the hazards and the accidents they may lead to
  • Assess the risk of these accidents
  • Reduce the risk, where possible and/or appropriate by eliminating/controlling the hazards that lead to the accidents
  • Set safety requirements for the system, and define how they will be met in the system design and implementation
  • Show that the safety requirements have been met in the system
  • Independent professional review
  • Provide a safety case for the system, usually for certification purposes
  • Seek safety approval
65
Q

Basic concepts of the safety lifecycle

A

1) Build in Safety rather than adding it onto a complex design
- Considerations of safety should be made part of the initial stages of conceptual development and requirements engineering
2) Deal with systems in their entirety
- Safety issues commonly arise at component boundaries as system components interact.
3) Analyse the current system rather than relying on past experience and standards
- Sometimes the pace of change doesn’t allow for experience to accumulate or for proven designs to be used
- Aim to anticipate safety issues before they occur
4) Examine hazards more widely than just component failures
- Hazards can arise in the requirements and design stages and hence be present even if the system components were operating as they should
5) Recognise the importance of tradeoffs and conflicts in system design
- Nothing is absolutely safe, and safety is not the only goal in building a system
- Safety should act as a constraint on possible system designs and interact with other system constraints
- This poses a Multi-Objective Optimisation problem

66
Q

SAE Levels of Automation

A

0: No Driving Automation
1: Driver Assistance
2: Partial Driving Automation
3: Conditional Driving Automation
4: High Driving Automation
5: Full Driving Automation

67
Q

No Automation

A
  • Zero autonomy; the driver performs all driving tasks
  • The human driver does all the driving
68
Q

Driver Assistance

A
  • Vehicle is controlled by the driver, but some driving assist features may be included in the vehicle design
  • An advanced driver assistance system (ADAS) on the vehicle can sometimes assist the human driver with either steering or braking/accelerating, but not both simultaneously
69
Q

Partial Automation

A
  • Vehicle has combined automated functions, like acceleration and steering, but the driver must remain engaged with the driving task and monitor the environment at all times
  • An ADAS on the vehicle can itself actually control both steering and braking/accelerating simultaneously under some circumstances. The human driver must continue to pay full attention (“monitor the driving environment”) at all times and perform the rest of the driving task
70
Q

Conditional Automation

A
  • Driver is a necessity, but is not required to monitor the environment. The driver must be ready to take control of the vehicle at all times with notice
  • An ADS on the vehicle can itself perform all aspects of the driving task under some circumstances. In those circumstances, the human driver must be ready to take back control at any time when the ADS requests the human driver to do so. In all other circumstances, the human driver performs the driving task.
  • In a Level 3 vehicle, the driver always must be receptive to a request by the system to take back driving responsibilities. However, a driver’s ability to do so is limited by their capacity to stay alert to the driving task and thus capable of quickly taking over control, while at the same time not performing the actual driving task until prompted by the vehicle.
71
Q

High Automation

A
  • The vehicle is capable of performing all driving functions under certain conditions. The driver may have the option to control the vehicle.
  • An ADS on the vehicle can itself perform all driving tasks and monitor the driving environment - essentially, do all the driving - in certain circumstances. The human need not pay attention in those circumstances
72
Q

Full Automation

A
  • The vehicle is capable of performing all driving functions under all conditions. The driver may have the option to control the vehicle.
  • An ADS on the vehicle can do all the driving in all circumstances. The human occupants are just passengers and need never be involved in driving
73
Q

Computer and Software Myths

A

1) The cost of computers is lower than that of analogue or electromechanical devices
2) Software is easy to change
3) Computers provide greater reliability than the devices they replace
4) Increasing software reliability will increase safety
5) Testing and/or proving software correct (using formal verification techniques) can remove all the errors
6) Reusing software increases safety.
7) Computers reduce risk over mechanical systems

74
Q

Myth 1

A
  • The cost of computers is lower than that of analogue or electromechanical devices
  • There is a little truth in this as computer hardware is often cheaper than analogue/electromechanical devices
  • But the cost of designing, writing and certifying reliable and safe software + software maintenance costs can be enormous
75
Q

Myth 2

A
  • Software is easy to change
  • Superficially true: making changes to software is easy
  • However, there are three limitations:
    1) Making changes without introducing errors is very hard
    2) Any changes mean the software must be re-tested, re-verified (to check that it still meets its requirements) and re-certified for safety (as far as possible)
    3) Software tends to become more “brittle” as changes are made: i.e. the difficulty of making a change without introducing errors may actually increase over the lifetime of the software
76
Q

Myth 3

A
  • Computers provide greater reliability than the devices they replace
  • In theory this is true
  • Computer hardware is highly reliable and computer software doesn’t “fail” in the normal engineering sense
  • However, errors in software are (design) errors in the logic of what the software does, and they will affect system behaviour
77
Q

Myth 4

A
  • Increasing software reliability will increase safety
  • There are two views of reliability:
    1) Reliability in software is frequently defined as “compliance with the original requirements specification”
  • The myth is untrue because many safety-critical software errors can be traced back to errors in the requirements
  • It’s also untrue for the fact that many software-related accidents have occurred whilst the software was doing exactly what the requirements specification said it should
  • So compliance with the specification might increase, but that doesn’t do anything for safety if the specification is wrong

2) Reliability can also be viewed as “running without errors.”
- This myth is untrue because errors can be removed that have no effect on safety, so the system is more reliable but not any safer
- Safety and reliability, while partially overlapping, are not the same thing
- Increasing computer and software reliability does not necessarily result in increased system safety

78
Q

Myth 5

A
  • Testing and/or proving software correct (using formal verification techniques) can remove all the errors
  • This is untrue
  • Testing tools and verification of software are getting better and better
  • Testing only shows presence of bugs and not their absence

However:

1) The large number of possible paths through most realistically sized programs makes exhaustive testing impossible
2) It’s possible to try verifying that software meets its original specification, but this won’t tell us if the specification itself is wrong. There are no formal techniques for checking the correctness of requirements specifications
3) It’s possible to do some formal verification of code - mostly mathematical proofs, but can only do this for small pieces of software

79
Q

Myth 6

A
  • Reusing software increases safety
  • Could be true for reliability, but untrue for safety
  • Unfortunately, reusing software may actually reduce safety because:
    1) It gives rise to complacency - “it worked in the previous system so it’ll work in this new system”
    2) Specific hazards of the new system cannot be considered in the original design and development of the reused software
80
Q

Myth 7

A
  • Computers reduce risk over mechanical systems
  • This is partially true
  • Computers have the potential to reduce risk, particularly where they are used to automate hazardous/tedious jobs
81
Q

Arguments regarding myth 7

A

1) They allow finer, more accurate control of a process/machine
- Yes, they can check process parameters more often, perform calculations quicker
- But this could lead to processes running with reduced safety margins, closer to their optimum

2) Automated systems allow operators to work farther away from hazardous areas
- Yes, in normal running of the process they do allow this
- But operators end up having to enter unfamiliar hazardous areas

3) Eliminating operators eliminates human errors
- Yes, operator errors are reduced
- But there are still human design and maintenance errors
- Humans are not eliminated but instead only shifted to different jobs

82
Q

Safety-critical software

A
  • Performs or controls functions which, if executed erroneously or if they fail to execute, could directly inflict serious injury to people and/or the environment and cause loss of life
  • Performs or controls functions which are activated to prevent or minimise the effect of a failure of a safety-critical system (this is sometimes called safety-related software)
83
Q

Software safety

A
  • Features which ensure that the software will execute within a system without contributing to hazards
  • A product performs predictably under normal and abnormal conditions
  • The likelihood of an unplanned event occurring is minimised and its consequences controlled and contained thereby preventing accidental injury and death
  • Must be able to deal with unexpected events and perform in the right way even if given wrong data
84
Q

Software safety features and procedures

A
  • Features are things that are designed into a product
  • Procedures must ensure that the system is used in an operational environment for which it was intended, and for the task for which it was intended
85
Q

Predictable software behaviour

A
  • Products are usually designed to perform under certain “normal” conditions
  • Safety critical systems must also be designed to take into account performance under abnormal conditions
  • If a product has been designed to perform predictably under normal and abnormal conditions, in theory, no unplanned events will occur
  • But loopholes that designers have not spotted can cause accidents for two reasons
    1: Designers are not guaranteed to have considered all normal and abnormal situations
    2: Human operators/users of systems tend to make mistakes, particularly when they are tired, bored, in a hurry or under pressure
86
Q

Considerations about software integration

A

Software safety is one of many components which determine overall system safety.
• How software contributes to or detracts from the safety of this system
• What safety role is performed by the software
• How the software reduces the likelihood/severity of potential hazards

87
Q

Computer-based control systems

A
  • Computer-based control systems are obvious ways in which computers are used
  • However there are many circumstances where computers and software are involved in a much less obvious way
    • Software-generated data used to make safety-critical decisions
    • Software used for design
    • Database software that stores and retrieves information
88
Q

The usage of computers in safety-critical loops

A

1) Providing information or advice to a human controller upon request, perhaps by reading sensors directly
2) Interpreting data and displaying it to the controller, who makes the control decisions
3) Issuing commands directly, but with a human monitor of the computer’s actions providing varying levels of input
4) Eliminating the human from the control loop completely

89
Q

Providing information or advice to a human controller upon request, perhaps by reading sensors directly

A
  • If the computer gives incorrect process-control information, it may not be a big issue because the operator has direct access to controls and displays
  • However, a knowledgeable and experienced operator is needed
90
Q

Interpreting data and displaying it to the controller, who makes the control decisions

A
  • The computer reads the sensor outputs, interprets them and displays its interpretation to the operator
  • The operator no longer has direct access to the sensor output. The operator only has access to what that computer displays
  • Now, any misinterpretation by the computer will affect how the operator controls the process
  • The operator has lost direct feedback on the state of the process
91
Q

Issuing commands directly, but with a human monitor of the computer’s actions providing varying levels of input

A
  • Here the operator controls the process and receives sensor information only via the computer
  • The computer is interpreting both sides of the control loop
  • Thus it is even more important that the computer works correctly, since the operator is totally isolated from the process (e.g. fly-by-wire aircraft)
92
Q

Eliminating the human from the control loop completely

A
  • Computer assumes complete control of the process
  • The operator may provide advice or high-level direction
  • The operator only knows what the computer reveals about the state of the process, and has no access to the controls or sensors
  • In such automated systems the operator usually has to set the initial parameters for the process and starts it running, but the computer then takes over control
  • The implications of computer error or failure are now very serious
  • Upon error/failure, to control the process the operator may have to step in with little or no information to deal with the situation
93
Q

Phases of Hazard Management

A

1) Hazard identification
2) Hazard causal analysis
3) Hazard resolution and control
4) Hazard verification

94
Q

Hazard identification

A
  • Use: checklists, hazard indexes, event trees, HAZOPS
  • Aims to identify hazards that singly or in combination could lead to an accident
  • Check what hazards exist
  • Check what effects these hazards have
  • Check the complexity of these hazards
95
Q

Hazard (causal) analysis

A
  • Use: reliability block diagrams, failure modes and effects analysis, fault trees
  • Look at how these hazards occur
  • Evaluate the causal factors of the hazards
  • Determine if a particular causal factor is responsible for several hazards
96
Q

Hazard resolution and control

A
  • Use: many, varied techniques, which, of course, depend on the type of hazard
  • Identify how hazards can be controlled and eliminated
  • Identify general design criteria that the design must meet
  • Identify safety devices and procedures
  • Identify specific design methods for reducing, controlling or indeed eliminating hazards
97
Q

Hazard verification

A
  • Use: many techniques such as documentation, testing, operational experience
  • Check whether identified hazards have been appropriately controlled for
  • Identify the remaining hazards
  • Consider their probability of occurrence, potential losses, cost of removing them
98
Q

Hazard analysis techniques

A

1) Brainstorming
2) Checklists
3) Event Tree
4) HAZOPS (HAzards and OPerability Studies)
5) Design Criteria
6) Fault Tree Analysis
7) RBD (Reliability Block Diagrams)

99
Q

Brainstorming

A
  • One of the most widely used methods of hazard analysis
  • Users must have background knowledge
  • This is an effective technique for hazard identification, even if it appears very simple
  • It is, however, essential to assemble a team with sufficient breadth of expertise.

Advantages:

  • Very simple
  • Can be applied to software

Disadvantages:

  • Needs a suitable team of experts
  • Can be time-consuming
100
Q

Checklists

A
  • Basic checklists are simple lists of hazards or specific design features
  • More complex ones involve open-ended questions

Advantages:

  • Embody a tremendous amount of existing experience which can be passed from project to project
  • Can be tailored to local practices and projects
  • Can be derived from standards and codes of practice, allowing them to test compliance
  • Guides thinking about the hazards in a system
  • Good for analysing well-understood systems with standard design features

Disadvantages:

  • Users may rely on them too much and ignore hazards not on the list
  • Can become very large, making them difficult to use and giving a false sense of security
  • Really only useful for simple systems
101
Q

Event trees

A
  • Given some initiating event, identify all possible outcomes by determining all sequences of events that could follow it
  • The event tree is drawn left to right, with branches under each component heading corresponding to two alternatives:
    (1) Component operates
    (2) Component fails

Advantages:

  • Useful for identifying the protection system features that contribute most to the probability of an accident
  • Identify top events for subsequent fault tree analysis
  • Display various accident scenarios that may result from a single initiating event
  • Can handle sequencing of events

Disadvantages:

  • Can become very complex, especially when a number of time-ordered interactions are involved
  • A separate tree is needed for each initiating event, thus it’s not feasible to consider multiple initiating events or interactions between initiating events
  • Doesn’t help in determining whether the paths lead to system failure. This must be determined by the operator of the tree using their knowledge of the system components
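
A minimal sketch of enumerating event-tree paths from an initiating event, assuming independent failures; the initiating event, component names and failure probabilities are invented for illustration, and deciding which paths amount to system failure is still left to the analyst:

```python
from itertools import product

def event_tree(initiating_event, components):
    """components maps component name -> probability of failing on demand (assumed independent)."""
    names = list(components)
    for outcomes in product(("operates", "fails"), repeat=len(names)):
        path_probability = 1.0
        for name, outcome in zip(names, outcomes):
            q = components[name]
            path_probability *= q if outcome == "fails" else (1.0 - q)
        yield list(zip(names, outcomes)), path_probability

# Example: an initiating event followed by two protection components.
for path, probability in event_tree("loss of coolant", {"alarm": 0.01, "relief valve": 0.05}):
    print(path, f"{probability:.4f}")
```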
102
Q

HAZOPS

A
  • A structured and systematic examination of a complex planned or existing process or operation in order to identify and evaluate problems that may represent risks to personnel or equipment
  • A qualitative technique whose purpose is to identify all possible deviations from the design’s expected operation, and hazards associated with these deviations
  • HAZOPS analyses aim to consider:
    1) The design intention of the system
    2) Potential deviations from the design intention
    3) The causes of these deviations from the design intention
    4) The consequences of such deviations
  • The HAZOPS process operates as follows:
    1) Identification of entities in the system
  • An entity is an interconnection between components and the corresponding interactions
  • These encompass flows of physical material or signals and data
    2) Description of the Attributes for the Entities in the system
  • An attribute is a property of an entity which will help to determine the correctness of the system’s operation
    3) Each line in the system drawing is examined, and guidewords are applied to generate questions
    4) Each deviation generated is considered by the team to assess every potential cause and the effect on the system as a whole

Advantages:

  • Simple, easy to use, identifies problems at the design stage
  • Is focused on finding hazards
  • Does not concentrate on failures, but can uncover more complex types of hazardous events and causes
  • The basic idea is applicable to new designs, complex systems of which there is little experience, and procedures that occur infrequently

Disadvantages:

  • Time and effort required — it is very labour-intensive
  • Relies heavily on the judgment of those performing HAZOP analysis
  • Rarely considers organisational factors as it only observes the physical aspects of the system, without its surrounding context
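
A minimal sketch of step 3, applying guidewords to each entity and attribute to generate deviation questions for the study team; the entities, attributes and the trimmed guideword list are assumptions for illustration:

```python
# Illustrative entities (interconnections) and the attributes examined for each.
entities = {
    "coolant flow to reactor": ["flow rate", "temperature"],
    "sensor data to controller": ["value", "timing"],
}
guidewords = ["NO", "MORE", "LESS", "AS WELL AS", "PART OF", "REVERSE", "OTHER THAN", "EARLY", "LATE"]

# Systematically generate deviations; the team then assesses causes and consequences of each.
for entity, attributes in entities.items():
    for attribute in attributes:
        for guideword in guidewords:
            print(f"Deviation: {guideword} {attribute} on '{entity}' - possible causes? consequences?")
```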
103
Q

HAZOPS Guidewords

A
  • NO, NOT, NONE: The intended result is not achieved, but nothing else happens
  • MORE: More of any relevant physical property than there should be
  • LESS: Less of a relevant physical property than there should be
  • AS WELL AS: An activity occurs in addition to what was intended, or too many components are present
  • PART OF: Only some of the design intentions are achieved
  • REVERSE: The logical opposite of what was intended occurs
  • OTHER THAN: No part of the intended result is achieved, and something completely different happens
  • EARLY: Signal arrives before expected clock time
  • LATE: Signal arrives after expected clock time
  • BEFORE: Signal arrives earlier than intended within a sequence of signals
  • AFTER: Signal arrives later than intended within a sequence of signals
104
Q

Design Criteria

A
  • A particular hazard will usually have a related design criterion which is needed to avoid or mitigate the hazard
  • It is common for these to be given as part of the hazard analysis process
  • A design criterion is a statement of what is to be achieved, not how it is achieved
  • That is, it states the goal that the designer must achieve
  • Can be used to derive requirements
105
Q

Fault Tree Analysis

A
  • A graphical method for analysing the underlying causes of hazards:
    1) Assume a particular system state
    2) Assume the required top event
    3) Write down the causal events related to the top event, and the logical relationships between them, going down the tree until the basic/primary events are reached
    4) Intermediate events are combined with logical operations (AND, OR, …)

Advantages:

  • Developing the tree forces a system-level examination, beyond the context of single components or subsystems
  • Graphical format provides a pictorial representation of event relationships. Humans are good at seeing patterns in graphics
  • Can assist in identifying scenarios leading to hazards and suggest possibilities for hazard elimination or control

Disadvantages:

  • Shows cause and effect relationships but very little else
  • A fault tree is a simplification of some complex process, and may leave out important information
  • Time and rate-dependent events are hard to represent
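
A minimal sketch of evaluating a small fault tree, assuming independent basic events; the tree structure, event names and probabilities are invented purely for illustration:

```python
class Event:
    """A fault-tree node: a basic event with a probability, or a gate (AND/OR) over children."""
    def __init__(self, name, prob=None, gate=None, children=()):
        self.name, self.prob, self.gate, self.children = name, prob, gate, list(children)

    def probability(self):
        if self.prob is not None:            # basic/primary event
            return self.prob
        child_probs = [c.probability() for c in self.children]
        if self.gate == "AND":               # all causal events must occur
            p = 1.0
            for cp in child_probs:
                p *= cp
            return p
        if self.gate == "OR":                # any causal event is sufficient
            q = 1.0
            for cp in child_probs:
                q *= (1.0 - cp)
            return 1.0 - q
        raise ValueError(f"unknown gate for {self.name}")

# Top event: untimely relay operation, caused by a spurious upstream command OR
# (a stuck contact AND a failed interlock) - example structure and values assumed.
top = Event("untimely relay operation", gate="OR", children=[
    Event("spurious upstream command", prob=1e-3),
    Event("mechanical causes", gate="AND", children=[
        Event("stuck contact", prob=1e-2),
        Event("interlock failed", prob=5e-3),
    ]),
])
print(top.probability())
```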
106
Q

Fault Tree Notation

A
  • The top or root node is called the “top event” — the hazard whose cause is to be analysed
  • Work backwards to determine its cause
107
Q

RBD (Reliability Block Diagram)

A
  • A technique for showing the subsystems which contribute to a potential hazard, so that only those contributing to a hazard need to be analysed in detail
  • RBDs can be used to model both what happens when components work and what happens when components fail

RBD process:

  1. Construct an appropriate block diagram for the system
  2. Define the system’s failure modes
  3. Construct the RBDs by connecting the blocks identified in step 1 into “success paths”
  4. Analyse the RBDs to determine which blocks of the system contribute to particular failure modes
108
Q

Converting an RBD

A
  • An RBD can be converted to a success tree by replacing series paths with AND gates and parallel paths with OR gates
  • This can then be converted to a fault tree by applying De Morgan’s theorem:
    1) ¬(A Λ B) = ¬A V ¬B
    2) ¬(A V B) = ¬A Λ ¬B
  • In general, a fault tree can be converted to an RBD, but it is generally more difficult to convert an RBD into a fault tree, especially if the RBD uses highly complex configurations
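
A brief worked example (the configuration - components A and B in series, in parallel with a backup C - is assumed purely for illustration):
  • Success tree: system works = (A works Λ B works) V (C works)
  • Applying De Morgan’s theorem gives the fault tree: system fails = (¬A V ¬B) Λ ¬C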
109
Q

RBD representation for failure

A
  • If failure is being modelled (for a serial system) then the diagram represents failure if any component in the system fails
  • Failure rate will be the sum of the individual components’ failure rates
  • Equation: λ = λ₁ + λ₂ + … + λₙ
110
Q

RBD representation for success

A
  • If success is being modelled (for a serial system) then the diagram represents success only if all components in the system work
  • Reliability will be the product of each component’s reliability
  • Equation: R(t) = R₁(t) × R₂(t) × … × Rₙ(t)
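
A minimal numeric sketch of the two serial-system formulas above; the component failure rates and mission time are assumed values for illustration:

```python
import math

rates = [2e-6, 5e-6, 1e-6]                   # assumed component failure rates, per hour
t = 1000.0                                   # assumed mission time, hours

lam = sum(rates)                             # λ = λ₁ + λ₂ + ... + λₙ (series failure rate)
component_reliability = [math.exp(-r * t) for r in rates]
r_series = math.prod(component_reliability)  # R(t) = R₁(t) × R₂(t) × ... × Rₙ(t)

print(lam, r_series, math.exp(-lam * t))     # the last two agree for constant failure rates
```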
111
Q

RBD representation for a parallel system

A

Failure:

  • If failure is being modelled (for a parallel system) then failure rate will be the product of the individual components’ failure rates
  • Equation: λ = λ₁ × λ₂ × … × λₙ

Success:
- If success is being modelled (for a parallel system) then reliability can only be calculated indirectly
- This starts by calculating the probability that an individual component will fail
- The probability of component i failing to function correctly (Q = unreliability) is: Qᵢ(t) = 1 - Rᵢ(t)
- The unreliability of the complete system must be the probability of all components failing independently
- If there are N components, with reliabilities R₁(t) … Rₙ(t), then for the complete system: Q(t) = (1 - R₁(t))(1 - R₂(t)) … (1 - Rₙ(t))
- The reliability of the system is then, with Q representing unreliability:
• R(t) = 1 - Q(t)
• = 1 - [(1 - R₁(t))(1 - R₂(t)) … (1 - Rₙ(t))]
- If all N components have the same reliability Rc(t), this simplifies to R(t) = 1 - [1 - Rc(t)]ᴺ

Overall:
- The diagram represents success if at least one of the components in the system works, and failure only if all components in the system fail
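
A minimal numeric sketch of the parallel-system calculation above; the component reliabilities are assumed values for illustration:

```python
reliabilities = [0.95, 0.90, 0.99]   # assumed component reliabilities at time t

# Qi(t) = 1 - Ri(t); the system fails only if every component fails independently.
q_system = 1.0
for r in reliabilities:
    q_system *= (1.0 - r)

r_system = 1.0 - q_system            # R(t) = 1 - Q(t)
print(q_system, r_system)            # 5e-05 unreliability, 0.99995 reliability
```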

112
Q

Event tree vs fault tree

A

Event Tree:

  • Initiating event (may/may not lead to a hazard)
  • Work forward in time to determine all possible sequences of events arising from the initiating event

Fault Tree:

  • Identified Hazard
  • Work backwards in time to determine the intermediate events (causes) leading to the hazard
  • Event trees are very good at identifying possible hazards because we can inspect the outcome sequences they show us
  • Fault trees allow us to work out how the hazards occurred
113
Q

Bowtie technique

A

Combines fault tree with event tree

114
Q

Failure rate

A
  • The number of failures that occur in a given period of time
  • Denoted as λ
115
Q

Reliability and unreliability

A

Reliability:

  • Denoted as R(t), reliability is the probability of a device functioning correctly over a given period of time under a given set of operating conditions
  • Formula: R(t) = e^(-λt)

Exponential failure law:
- States that for a constant failure rate, reliability falls exponentially with time

Unreliability:

  • Denoted as Q(t), unreliability is the probability of a device failing to function correctly over a given period of time
  • Formula: Q(t) = 1 - R(t)
116
Q

MTBF

A
  • Stands for Mean Time Between Failures
  • Denoted as: 1/λ
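
A minimal sketch tying the failure rate, the exponential failure law and MTBF together; the failure rate and operating period are assumed values for illustration:

```python
import math

lam = 1e-5                  # assumed constant failure rate, failures per hour
t = 10_000.0                # assumed operating period, hours

r = math.exp(-lam * t)      # exponential failure law: R(t) = e^(-λt)
q = 1.0 - r                 # unreliability: Q(t) = 1 - R(t)
mtbf = 1.0 / lam            # MTBF = 1/λ

print(f"R(t)={r:.4f}  Q(t)={q:.4f}  MTBF={mtbf:.0f} hours")
```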
117
Q

Cognition

A
  • The mental action or process of acquiring knowledge and understanding through thought, experience and the senses
118
Q

Attention

A
  • The behavioural and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information
119
Q

Perception

A
  • The process of organising, identifying and interpreting sensory information in order to represent and understand the environment
120
Q

Working Memory

A
  • The memory that provides a space for combining sensory and memory information to support conscious cognition
  • Roughly speaking it can handle 7 ± 2 chunks of information with around 70ms rapid access times
  • It doesn’t persist beyond 200ms unless actively maintained
121
Q

Long-Term Memory

A
  • The repository of all knowledge gathered throughout life
  • It has long access times ranging from 100ms to days, and seems to exhibit cache-like behaviour, with recent information being easier to access
122
Q

Types of Long-Term Memory

A

1) Semantic
- Knowledge-based memory that stores information about the world
- It stores meaning such as facts, as well as information about the relationships between certain things
- Hence, it contains conceptual connections between huge numbers of items
- This idea is what spawned the concept of a neural network
- It is built up by association via experience and exposure

2) Skill
- Memory that retains information on how to perform certain tasks
- This tends to consist of semantic scripts/frames for performing certain tasks - template-like knowledge for certain situations
- These provide us with expectations and allow us to predict outcomes
- They tend to consist of a generalized outline of a particular situation

3) Episodic
- Memory that retains information about what has happened, either factual or fictitious
- This memory usually consists of memory of key objects, actions and agents
- It is usually remembering through reconstruction more than simple playback of mental recordings
- The gaps are filled in with the help of frame semantics

123
Q

Reasoning

A
  • The use of domains of knowledge and understanding to interpret and formulate theories about the functioning of the world around the individual
  • Reasoning can take place within domains of knowledge, using familiar structures and common concepts, or can be via the application of pseudo-logic to combine rules and experience to be rational within sets of knowledge
124
Q

Main processes of reasoning

A

1) Intuitive
- This type of reasoning is parallel, rapid and implicit and is provided by default heuristic processes for everyday cognitive tasks
2) Deliberative
- These are singular and conscious reasoning processes that are often strategic and abstract for demanding cognitive problems

125
Q

HCI

A
  • A discipline concerned with the design, evaluation and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them
    1) The scientific study and formulation of engineering principles for interactions with computer systems
    2) The matching of computational tools to human needs and aspirations
    3) The amplification of human potential
126
Q

Prototyping Types

A

Low-Fidelity Prototypes:

  • Loose exemplifiers of the proposed system that are quick and cheap to build and modify
  • Unfinished UX lets users suggest alterations and interact creatively
  • Examples include sketches (static pictures and text), storyboards (dynamic interactions as a series of images), index cards (cards for parts of interactions to be stepped through), wizard-of-oz (faking functionality with a human operator)

High-Fidelity Prototypes:

  • These aim to represent the final prototype as accurately as possible, both in materials and design
  • Useful for exploring dynamic navigation flow and technical feasibility, as well as obtaining user feedback on dynamic interactions
  • Complete functionality with a clearly defined navigational schema; great for exploration and testing
  • Can be expensive
  • They are usually made in rapid-prototyping languages

Issues with prototyping:

  • Paper prototypes make it hard to capture the dynamics of software
  • Software prototyping can result in slow programs with bad icons and limited functionality
  • Horizontal Prototypes: A wide range of functionality but with little detail
  • Vertical Prototypes: A depth of detail but with few functions
  • Evolutionary vs. Throw-Away Prototypes: There could be concerns over whether the prototype can really be used as a stepping stone to the final product
127
Q

Implementation

A

Elements that should be considered:

1) Visual and Auditory Feedback:
- The use of sound, video, animation (2D and 3D) for feedback to the users
- Care must be taken to ensure that such displays convey information in a clear and unambiguous fashion

2) Interaction Schemes
- Consideration of what interaction schemes are most appropriate for the system to prevent user error and reduce user cognitive load as much as possible
- Use of hardware controls, touch, mouse and keyboard, and other interaction paradigms

3) Device Paradigm:
- Size and portability of the device should be appropriate for the task at hand
- This can also consider whether the interaction paradigm should be ubiquitous or wearable

4) Input Methodologies
- Implementing appropriate input and control schemes for the task at hand
- This can use common schemes, but can also involve gestures, speech recognition and other more advanced interaction techniques
- This should also consider system accessibility

5) System Context:
- The system should be aware of its context and of how it interacts with its environment and the wider system
- This will reduce the cognitive overhead for users in interacting with the system

6) Personalisation:
- Under some circumstances it is appropriate to let users customise their interaction with the system, as this sometimes has the potential to further reduce cognitive load

128
Q

Evaluation

A
  • A set of techniques that aim to assess the usability of a system
  • It involves an examination of the interactive system to reason about the effectiveness and efficiency of the system
  • It is essential to evaluate the design of the safety-critical system’s interaction to ensure that it reduces the scope for users to make errors
129
Q

Heuristic Evaluation

A
  • The process of applying a set of ‘rules’ for interaction design to an interface to determine which elements follow or violate the rules
  • It can be applied at all stages of system design
130
Q

Nielsen’s Usability Heuristics

A

1) Visibility of System Status
- The system should always keep users informed about what is going on through appropriate feedback in a reasonable time-frame

2) Match between the system and the real world
- The system should speak the domain language with words, phrases and concepts familiar to the user
- Information should appear in a natural and logical order

3) User control and freedom
- Users often choose system functions by mistake and will need a clearly marked exit to leave any unwanted states
- Undo and redo support where relevant

4) Flexibility and efficiency of use
- Accelerators may speed up interaction for the expert user, allowing the system to cater for both novice and expert users

5) Help and documentation
- Any documentation should be easy to search and focused on the user’s task
- It should not be too cumbersome and should list concrete steps

6) Recognition rather than recall
- Minimise user memory load by making objects, actions and options visible
- The user should not have to remember information while moving around the interface

7) Error prevention
- Even better than good error messages is a design that prevents error states
- Either eliminate error-prone conditions or check for them and use confirmation dialogues

8) Help users to recognise, diagnose and recover from errors
- Error messages should be in plain language and could also suggest a solution

9) Aesthetic and minimalist design
- Dialogues and the interface should not contain information that is irrelevant or rarely needed
- Extra information decreases the relative visibility of other information

10) Consistency and standards
- Users should not have to wonder whether different words, actions or situations mean the same thing
- Follow platform conventions

131
Q

Usability Testing

A
  • Observation of users interacting with the system with the intention of uncovering problems with the system that compromise the interaction and safety or impose additional cognitive load on users
132
Q

Elements of a usability test

A

1) Make Participants Comfortable
- They may feel that they are being tested rather than the system

2) Think-Aloud Protocols
- Verbal reports from the person completing the task
- They can interfere with task performance or rely on memory being accurate, but they provide detailed information on goals, planning and actions

3) Task-Oriented
- Provide the users with a set of tasks to complete

4) Probing
- Interaction with the user, whether during or after the test, should probe into the interaction to uncover usability issues

5) Neutrality
- Participants tend to try to please the testers, so this should be accounted for in the assessment

133
Q

Model Human Processor

A
  • An engineering model of human information processing that allows the calculation of basic performance parameters and consists of the following subsystems:

1) Perception
- The information received from the various senses about the world around us and how we respond to change. Perception is driven by change: receptors respond to change, and attentional filtering in the brain allows changes to be spotted. It is a cognitive process that allows an understanding of what is happening in the surrounding world.

2) Working Memory
- The memory used for conscious cognition

3) Long-Term Memory
- The memory used for the storage and recall of information

4) Action
- The results of cognition and reasoning about the surrounding world

134
Q

Distributed Cognition

A
  • A subset of external cognition that involves the interactions between individuals and artefacts, using internal and external representations
  • It focuses on information propagation as information is processed by an entire socio-cognitive system and is transformed through different media
  • It takes a distributed view of problem solving
135
Q

Situational Awareness

A
  • The perception of elements in the environment within a volume of time and space
  • The comprehension of their meaning
  • The projection of their status in the near future
  • Is highly important in work domains with high information flow and where poor decisions may lead to serious consequences
  • It provides a critical foundation for successful decision making in complex and dynamic systems
136
Q

Team Situational Awareness

A
  • The degree to which every team member possesses the SA required for their responsibilities
137
Q

SA Error Taxonomy Levels

A

Level 1: Failure to correctly perceive the information
Level 2: Failure to correctly integrate or comprehend the information
Level 3: Failure to project the future actions or state of the system

138
Q

SA Error Taxonomy Factor Table

A

Level 1

1) Data not available
2) Data difficult to detect/perceive
3) Failure to scan or observe data
a) Omission
b) Attentional narrowing/distraction
c) High taskload
4) Misperception of data
5) Memory failure

Level 2

1) Lack of/poor mental model
2) Use of incorrect mental model
3) Over-reliance on default values in model
4) Memory failure
5) Other

Level 3

1) Lack of/poor mental model
2) Other

Other

1) Maintaining Multiple Goals
2) Habitual Schema
3) General

139
Q

Verification of Safety

A
  • The process of determining whether a component or the output of a development phase meets its specification, as well as whether the system safety constraints and the safety-related functional requirements are correct
  • It includes both whether the specification is correct and whether the system meets its specification
  • Verification of safety goes beyond software verification, including:
    1) System Safety Constraints
    2) Safety-related Functional requirements
140
Q

Dynamic Analysis

A
  • The execution of a system or component to investigate its characteristics and behaviour
  • It involves testing for hazardous outputs and direct testing of safety design features (those that are included to eliminate or control hazards)
  • Dynamic testing is often success-oriented, focusing on what the software is supposed to do, rather than what it is not supposed to do
141
Q

Static Analysis

A
  • The evaluation of a system or component to investigate its characteristics without executing the system
  • This is often more time-consuming than dynamic analysis as the techniques utilised are complex
142
Q

Black-box and white-box testing

A
  • Black-box testing treats the component as a black box, without access to the source code
  • In this context, the only knowledge of the component comes from its specification
  • The aim is to design tests (with expected results) based only on the specification of the system, and to check the results and behaviours that the system produces against the expected results
  • It is impossible to test every possible input, so focus is placed on common modes of operation and critical modes of operation
  • In white-box testing, the code is visible to those performing the tests
  • In this kind of testing, knowledge about the system can come from both its specification and the code itself
  • The aim is to design tests (with expected results) to force execution through all possible paths in the code, with a particular focus on critical paths and boundary conditions
  • It is almost invariably not complete, but having access to the source allows the use of coverage metrics and other testing metrics
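
  • A minimal Python sketch of black-box test design (illustrative only: the component "saturate" and its specification, "clamp a sensor reading to the range 0 to 100", are assumptions, not from the source). Tests are derived purely from that specification, focusing on common values and boundary conditions:

    import unittest

    def saturate(x):
        """Component under test: clamp a reading to the range [0, 100]."""
        return max(0, min(100, x))

    class SaturateBlackBoxTests(unittest.TestCase):
        def test_typical_value_passes_through(self):
            self.assertEqual(saturate(42), 42)       # common mode of operation

        def test_lower_and_upper_boundaries(self):
            self.assertEqual(saturate(0), 0)         # boundary condition
            self.assertEqual(saturate(100), 100)     # boundary condition

        def test_out_of_range_values_are_clamped(self):
            self.assertEqual(saturate(-5), 0)        # below range
            self.assertEqual(saturate(250), 100)     # above range (critical mode)

    if __name__ == "__main__":
        unittest.main()

  • In a white-box setting, the same suite could additionally be run under a coverage tool to measure how many statements and branches of the implementation the tests exercise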
143
Q

Principles of Safety Management

A

1) Use the Safety Lifecycle
2) Have a Safety Culture
3) Ensure Competence and Training Principle
4) Ensure Accountability Principle
5) Have a Safety Management System

144
Q

Design Review

A
  • A milestone during product development where a design is evaluated against its requirements in order to verify the outcomes of previous activities
  • They are key parts of design controls for Safety-Critical Systems
145
Q

Formal Proof

A
  • Proving that a program satisfies a formal specification of its behaviour
  • This can be via deductive verification, abstract interpretation, automated theorem proving, type systems and lightweight formal methods
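
  • A minimal Python sketch of the automated-theorem-proving route (an assumption for illustration: it uses the z3-solver SMT package, and the clamping expression and safety property are made up, not from the source). The solver is asked whether any input violates the property; an unsatisfiable result constitutes a proof:

    from z3 import Int, If, Solver, Or, unsat

    x = Int("x")
    # Model of the program's behaviour: clamp x to the range [0, 100]
    clamped = If(x < 0, 0, If(x > 100, 100, x))

    s = Solver()
    # Assert the negation of the desired property 0 <= clamped <= 100
    s.add(Or(clamped < 0, clamped > 100))

    if s.check() == unsat:
        print("Proved: the output always lies within [0, 100]")
    else:
        print("Counterexample:", s.model())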
146
Q

Data Flow Analysis

A
  • A technique for gathering information about the possible set of values calculated at various points in a program
  • A program’s control-flow graph is used to determine the parts of a program to which a particular value might propagate
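
  • A minimal Python sketch of one such analysis, reaching definitions (the tiny program and node names are illustrative, not from the source). Each control-flow graph node carries GEN (definitions it creates) and KILL (definitions it overwrites), and the analysis iterates to a fixed point to find which definitions may propagate to each program point:

    # Tiny program:  n1: x = 1;  n2: if (...);  n3: x = 2;  n4: use(x)
    succs = {"n1": ["n2"], "n2": ["n3", "n4"], "n3": ["n4"], "n4": []}
    preds = {n: [m for m in succs if n in succs[m]] for n in succs}

    gen  = {"n1": {"x@n1"}, "n2": set(), "n3": {"x@n3"}, "n4": set()}
    kill = {"n1": {"x@n3"}, "n2": set(), "n3": {"x@n1"}, "n4": set()}

    in_ = {n: set() for n in succs}
    out = {n: set() for n in succs}

    changed = True
    while changed:                      # iterate until the IN/OUT sets stabilise
        changed = False
        for n in succs:
            in_[n] = set().union(*(out[p] for p in preds[n])) if preds[n] else set()
            new_out = gen[n] | (in_[n] - kill[n])
            if new_out != out[n]:
                out[n], changed = new_out, True

    print(in_["n4"])   # both definitions of x (x@n1 and x@n3) may reach the use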
147
Q

Modelling

A
  • The creation of models in order to help people understand and simulate the subject of the model

Common modelling techniques:

1) Formal Methods:
- Use of mathematical languages to model system components at the specification level
- This allows the specification to be checked for correctness and consistency through formal proof

2) Software Prototyping:
- Partial implementations of the system used to investigate key system features
- Often used in UI design or component conceptualisation

3) Performance Modelling:
- Modelling the processes in the system and the ways in which they interact, adding appropriate input data
- This aims to check the working capacity of the system

4) State-Transition Diagrams:
- Validation of the state transition requirements in the specification
- Ensures that all states are reachable and that appropriate intermediate states are used where necessary, helping to build a comprehensive model of the system’s states (see the sketch after this list)

5) Environmental Modelling:
- Use of a simulation environment that mimics the environment in which the system will be used, in order to test the system
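
  • A minimal Python sketch of checking a state-transition model, as referenced in point 4 (the controller states and transitions are illustrative, not from the source): a breadth-first search from the initial state reveals any states in the specification that can never be reached:

    from collections import deque

    # Hypothetical controller model: state -> {event: next_state}
    transitions = {
        "Idle":         {"start": "Running"},
        "Running":      {"fault": "SafeShutdown", "stop": "Idle"},
        "SafeShutdown": {"reset": "Idle"},
        "Maintenance":  {"done": "Idle"},   # nothing leads here: a specification defect
    }

    def reachable(initial):
        seen, queue = {initial}, deque([initial])
        while queue:                        # breadth-first search over the transitions
            state = queue.popleft()
            for nxt in transitions.get(state, {}).values():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    print("Unreachable states:", set(transitions) - reachable("Idle"))   # {'Maintenance'}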

148
Q

Validation

A
  • Validation aims to ensure that the specification for the software system is complete and correct
  • It cannot be assumed that these properties hold, and the specification should be rigorously examined to ensure that it encompasses all elements of the software’s functionality and all edge cases
149
Q

Verification

A
  • Once the specification has been validated it can be used to compose a set of test cases
  • These should cover all elements of the specification, including: boundary values, expected values, extreme values
150
Q

Determining test coverage

A
  • In general, it is impossible to have tested enough except on the most trivial of programs
  • There is always a possibility of bugs, and this possibility should be dealt with at an organisational level
  • In general, there are certain things that companies can do to ensure better testing:
    1) Test coverage metrics (for white-box testing) to determine how much of the codebase has been tested
    2) Testing by experts, who know what to look for and can break systems more efficiently than novices
    3) Manual introduction of bugs (error seeding): this allows an estimate of how many real bugs remain, using SD/S = UD/U, where SD = seeded bugs detected, S = seeded bugs, UD = unseeded (real) bugs detected, U = total unseeded (real) bugs
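  • A minimal worked sketch of the seeding estimate (the numbers are illustrative, not from the source): rearranging SD/S = UD/U gives U = UD * S / SD, so the number of real bugs still undetected can be estimated as U - UD:

    def estimate_remaining_bugs(seeded, seeded_detected, unseeded_detected):
        # SD/S = UD/U  =>  U = UD * S / SD (assumes seeded and real bugs are equally easy to find)
        estimated_total_unseeded = unseeded_detected * seeded / seeded_detected
        return estimated_total_unseeded - unseeded_detected   # real bugs estimated to remain

    # e.g. 50 bugs seeded, 40 of them found, and 20 real bugs found during the same testing:
    # U ~= 20 * 50 / 40 = 25, so about 5 real bugs are estimated to remain undetected
    print(estimate_remaining_bugs(seeded=50, seeded_detected=40, unseeded_detected=20))   # 5.0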
151
Q

Use the Safety Lifecycle

A

1) Safety Plan
• The Safety Plan is a description of how the project will meet the safety requirements
• Must be submitted to the appropriate Safety Authority who will check that the course of action being proposed is safe
• Approval by the Safety Authority provides assurance that the approach documented in the Safety Plan is appropriate

2) Documentation and Records
• Records must be kept to prove that the Safety Plan has been followed, including: records of design activities, tests, meetings etc.
• Particularly important is the Hazard Log describing:
1) Hazards/accidents identified
2) Action to remove them or reduce them

3) Independent Professional Review
• Involves checks by independent engineers providing assurance and supporting the Safety Case
• Frequency and depth of each type of review depends on the risks involved
• Two types: Audits (examinations of the process) and Assessments (examinations of the product or system)

4) Safety Case
• A document that provides an argument for the safety of the project
• It is a compilation of evidence:
1) Records
2) Independent review
• The objective is usually to gain certification

5) Safety Approval
• Approval from the Safety Authority (usually some standards body), based on the Safety Case
• Issue of Safety Certificate upon approval

152
Q

Have a Safety Culture

A
  • The difficulties of management are primarily related to people, and the management of safety is no exception
  • However:
    • Competent staff alone cannot ensure safety
    • Flaws in the safety culture of an organisation are a common cause of accidents
    • Organisations must be run so that safety is seen as the primary goal, with consciousness of safety evident in every activity
153
Q

Ensure Competency and Training Principle

A
  • All staff involved in safety-related activities must be competent in terms of training, technical knowledge, experience and qualifications
  • Inter-personal skills are also very important such as teamwork and acceptance of constructive criticism
  • The level required of all the above will depend upon the particular task being undertaken
154
Q

Ensure Accountability and Responsibility

A
  • Accountability: assessment and measurement of results
    • The person ultimately accountable for safety is the Safety Authority.
  • Responsibility for safety can be delegated, accountability cannot
155
Q

Have a Safety Management System

A
  • Part of achieving a safety culture is getting every project to use the same best practice processes and engineering techniques
  • Best management practice to achieve this is:
    • Establish a Safety Management System — a documented process that embodies all the principles we have discussed for enhancing safety
    • Ensure that the Safety Management System is auditable and can be easily changed as best practice evolves
156
Q

Final Considerations

A
  1. Safety is a System Issue
    Safety of system components cannot be considered in isolation. Software safety can only be evaluated in the context of the system in which it operates.
  2. Build Safety Into Systems:
    It is much more effective to build safety into a system than to add protection systems to a completed design. The earlier safety is considered in development, the better the results will be.
  3. Simplicity:
    Systems that are simple and intellectually manageable are more likely to be safe. Simple systems have only the features required, and should not be confused with simplistic solutions that do not adequately account for the complexities of an issue. This contrasts with complex systems that have unneeded features and are over-engineered (more complex than necessary for their application).
  4. Avoid Complacency:
    This is the most important risk factor in the development and use of safety-critical systems, and a safety culture must be established to minimise it.
  5. Safety and Reliability:
    These are different things and should not be confused. While safe systems are generally reliable, reliable systems may be unsafe.
  6. Hazard Identification and Analysis:
    This is essential for designing safety into a system, as hazards evolve over time.
  7. Do not Over-Simplify:
    Accidents are complex and caused by many factors. Fixing symptoms and ignoring root-level causes will not prevent the repetition of most accidents. Concentrating on a subset of the aspects of the problem will not result in an effective safety program (e.g. not accounting for cultural issues).
  8. Learn:
    From past accidents and avoid repeating past mistakes.
  9. Humans and Computers:
    Replacing humans with computers will not necessarily make systems safer.
  10. Computer and Software Use:
    Only use computers and software if their advantages outweigh their disadvantages
157
Q

Other types of critical system

A
  • Mission critical
  • Business critical
  • Security critical