Chapter 11 Flashcards

1
Q

What is a human error or mistake?

A

Human behavior that results in the introduction of faults into a system.

2
Q

What is a system fault?

A

A characteristic of a software system that can lead to a system error.

3
Q

What is a system error?

A

An erroneous system state that can lead to system behavior that is unexpected by system users.

4
Q

What is a system failure?

A

An event that occurs at some point in time when the system does not deliver a service as expected by its users.

5
Q

Fill in the gaps:
_ are usually a result of system _ that are derived from _ in the system

A

Failures, errors, faults

6
Q

Do faults necessarily result in system errors?

A

No. The erroneous system state resulting from the fault may be transient and ‘corrected’ before an error arises, or the faulty code may never be executed.

7
Q

Do errors necessarily lead to system failures?

A

No. An error can be corrected by built-in error detection and recovery mechanisms before it causes a failure.

8
Q

What are the fault management strategies used to achieve reliability?

A

Fault avoidance, Fault detection and removal, Fault tolerance

9
Q

What is fault avoidance?

A

Development techniques used that either minimize the possibility of mistakes or trap mistakes before they result in the introduction of system faults.

10
Q

What is fault detection and removal?

A

Verification and validation techniques that increase the probability of detecting and correcting errors before the system goes into service

11
Q

What is fault tolerance?

A

Run-time techniques used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures

12
Q

What is reliability?

A

The probability of failure-free system operation over a specified time in a given environment for a given purpose. It can be expressed quantitatively.

13
Q

What is availability?

A

The probability that a system, at a point in time, will be operational and able to deliver the requested services. It can be expressed quantitatively.

14
Q

Does the formal definition of reliability always reflect the user’s perception of a system’s reliability?

A

No. Reliability can only be defined formally with respect to a system specification, i.e. a failure is a deviation from a specification. Users don’t read specifications and don’t know how the system is supposed to behave; therefore, perceived reliability is more important in practice.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Availability is usually expressed as a percentage of time that the system is available to deliver services. What are the drawbacks of this?

A

It doesn’t take into account 2 factors:
The number of users affected by the service outage. Loss of service in the middle of the night is less important for many systems than loss of service during peak usage periods.
The length of the outage. The longer the outage, the more the disruption. Several short outages are less likely to be disruptive than 1 long outage. Long repair times are a particular problem.
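
As a rough illustration of why a single percentage hides these factors, the sketch below computes availability for two made-up outage logs. The function name `availability_percent` and all outage figures are assumptions invented for this example, not values from the chapter.

```python
def availability_percent(total_minutes, outage_minutes):
    """Availability as the percentage of time the system was up."""
    downtime = sum(outage_minutes)
    return 100.0 * (total_minutes - downtime) / total_minutes

WEEK_MINUTES = 7 * 24 * 60  # minutes in one week

# Hypothetical logs: six 10-minute outages versus one 60-minute outage.
short_outages = [10] * 6
long_outage = [60]

a_short = availability_percent(WEEK_MINUTES, short_outages)
a_long = availability_percent(WEEK_MINUTES, long_outage)

# Both logs give the same figure, although the single long outage
# is likely to be more disruptive for users.
print(a_short == a_long)
```

The metric also says nothing about when the outages happened, so an outage at peak time and one in the middle of the night count the same.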

16
Q

Will removing X% of the faults in a system necessarily improve the reliability by X%?

A

No. Program defects may be in rarely executed sections of the code and so may never be encountered by users. Removing these does not affect the perceived reliability.

17
Q

Can a program with known faults still be perceived as reliable by its users?

A

Yes. Program defects may be in rarely executed sections of the code so may never be encountered by users. Removing these does not affect the perceived reliability. Users adapt their behavior to avoid system features that may fail for them.

18
Q

System reliability is measured by…

A

counting the number of operational failures and, where appropriate, relating these to the demands made on the system and the time that the system has been operational

19
Q

Reliability metrics include:

A

Probability of failure on demand (POFOD), Rate of occurrence of failures (ROCOF), Availability (AVAIL)

20
Q

What is the probability of failure on demand (POFOD)?

A

The probability that the system will fail when a service request is made. Useful when demands for service are intermittent and relatively infrequent.
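
One simple way to estimate POFOD from observations is to divide the number of failed demands by the total number of demands. The sketch below assumes that estimate; the function name `pofod` and the counts are hypothetical.

```python
def pofod(failed_demands, total_demands):
    """Estimate the probability of failure on demand from observed counts."""
    if total_demands <= 0:
        raise ValueError("at least one demand is needed for an estimate")
    if not 0 <= failed_demands <= total_demands:
        raise ValueError("failed_demands must be between 0 and total_demands")
    return failed_demands / total_demands

# Hypothetical record: 2 failed service requests out of 1000 demands.
estimate = pofod(2, 1000)
print(estimate)
```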

21
Q

When is the rate of occurrence of failures (ROCOF) relevant?

A

When the system has to process a large number of similar requests in a short time.

22
Q

What is the reciprocal of ROCOF?

A

Mean time to failure (MTTF)
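
A small worked sketch of the reciprocal relationship, assuming ROCOF is estimated as observed failures per unit of operational time. The function names and the figures (2 failures in 1000 time units) are illustrative assumptions.

```python
def rocof(failures, operational_time):
    """Failures per unit of operational time."""
    if operational_time <= 0:
        raise ValueError("operational_time must be positive")
    return failures / operational_time

def mttf(failures, operational_time):
    """Mean time to failure: operational time per observed failure."""
    if failures <= 0:
        raise ValueError("MTTF is undefined when no failures were observed")
    return operational_time / failures

rate = rocof(2, 1000)      # failures per time unit
mean_time = mttf(2, 1000)  # time units between failures

# MTTF is the reciprocal of ROCOF (up to floating-point rounding).
print(rate, mean_time, abs(1 / rate - mean_time) < 1e-9)
```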

23
Q

Non-functional reliability requirements are…

A

specifications of the required reliability and availability of a system using one of the reliability metrics (POFOD, ROCOF or AVAIL)

24
Q

Functional reliability requirements specify…

A

the faults to be detected and the actions to be taken to ensure that these faults do not lead to system failures.

25
Q

Functional reliability requirements include:

A

Checking requirements that identify checks to ensure that incorrect data is detected before it leads to a failure.
Recovery requirements that are geared to help the system recover after a failure has occurred.
Redundancy requirements that specify redundant features of the system to be included.
Process requirements for reliability which specify the development process to be used may also be included.

26
Q

Fault tolerance is required where…

A

there are high availability requirements
or
where system failure costs are very high

27
Q

Fault tolerance means…

A

that the system can continue in operation in spite of software failure

28
Q

Fault-tolerant system architectures are based on…

A

redundancy and diversity

29
Q

The protection system is…

A

a specialized system that is associated with some other control system, which can take emergency action if a failure occurs, e.g. a system to stop a train if it passes a red light, or a system to shut down a reactor if temperature/pressure are too high

30
Q

Self-monitoring architecture is…

A

a multi-channel architecture in which the system monitors its own operations and takes action if inconsistencies are detected. The same computation is carried out on each channel and the results are compared. If the results are identical and are produced at the same time, then it is assumed that the system is operating correctly. If the results are different, then a failure is assumed and a failure exception is raised.
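
The comparison step can be sketched in a few lines. This is a single-process toy, not a real multi-channel system: the two channel functions, the `ChannelMismatch` exception, and `self_monitored` are all hypothetical names, and the timing check is left out.

```python
class ChannelMismatch(Exception):
    """Raised when the channels produce different results."""

def channel_a(values):
    # Version 1: uses the built-in sum.
    return sum(values)

def channel_b(values):
    # Version 2: a deliberately different implementation of the same computation.
    total = 0
    for v in values:
        total += v
    return total

def self_monitored(values, channels=(channel_a, channel_b)):
    """Run the same computation on every channel and compare the results."""
    results = [channel(values) for channel in channels]
    if any(r != results[0] for r in results[1:]):
        raise ChannelMismatch(f"channels disagree: {results}")
    return results[0]

print(self_monitored([1, 2, 3]))
```

Note that a mismatch only shows that something is wrong; with two channels there is no way to tell which one is faulty.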

31
Q

In self-monitoring architecture, hardware in each channel should be…

A

diverse so that common mode hardware failure will not lead to each channel producing the same results.

32
Q

In self-monitoring architecture, software in each channel should be…

A

diverse, otherwise the same software error would affect each channel

33
Q

When should you use several self-checking systems in parallel?

A

If high availability is required. This is the approach used in the Airbus family of aircraft for their flight control systems.

34
Q

N-version programming involves…

A

multiple versions of a software system carrying out computations at the same time. The approach is derived from the notion of triple-modular redundancy, as used in hardware systems.

35
Q

Fill in the gaps:
In N-version programming, there should be an _ number of computers involved, typically _

A

odd, 3

36
Q

Fill in the gaps:
In N-version programming, the results are compared using a _ and the _ result is taken to be the correct result.

A

voting system, majority
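
A minimal sketch of such a voter, assuming each version returns a hashable result. The function name `majority_vote` and the choice to raise an error when no strict majority exists are assumptions for this example.

```python
from collections import Counter

def majority_vote(results):
    """Return the result produced by more than half of the versions."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results: {results}")
    return value

# Three hypothetical versions, one of which disagrees.
print(majority_vote([4, 4, 5]))
```

With an odd number of versions and only two possible outcomes, a tie cannot occur, which is one reason three versions is the typical choice.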

37
Q

Hardware fault tolerance depends on …

A

triple-modular redundancy (TMR). There are three replicated identical components that receive the same input and whose outputs are compared. If one output is different, it is ignored and component failure is assumed.

38
Q

What are good programming practices that help reduce the incidence of program faults?

A

Limit the visibility of information in a program
Check all inputs for validity
Provide a handler for all exceptions
Minimize the use of error-prone constructs
Provide restart capabilities
Check array bounds
Include timeouts when calling external components
Name all constants that represent real-world values
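
Two of these practices, checking inputs for validity and naming constants that represent real-world values, can be sketched together. The temperature limits and the function name `parse_temperature` are hypothetical and chosen only for illustration.

```python
# Named constants for real-world limits (hypothetical sensor range).
MIN_TEMP_C = -50.0
MAX_TEMP_C = 150.0

def parse_temperature(raw):
    """Convert raw input to a temperature, rejecting anything invalid."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"not a number: {raw!r}") from None
    # NaN fails this range check too, because every comparison with NaN is False.
    if not (MIN_TEMP_C <= value <= MAX_TEMP_C):
        raise ValueError(f"temperature out of range: {value}")
    return value

print(parse_temperature("21.5"))
```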

39
Q

Error-prone constructs:

A

Unconditional branch (goto) statements
Floating-point numbers (inherently imprecise, which may lead to invalid comparisons)
Pointers
Dynamic memory allocation
Parallelism (can result in subtle timing errors because of unforeseen interaction between parallel processes)
Recursion (can cause memory overflow as the program stack fills up)
Interrupts (can cause a critical operation to be terminated and make a program difficult to understand)
Inheritance (code is not localized, which may result in unexpected behavior when changes are made and problems of understanding the code)
Aliasing (using more than 1 name to refer to the same state variable)
Unbounded arrays (may result in buffer overflow)
Default input processing (if the default action is to transfer control elsewhere in the program, incorrect or deliberately malicious input can then trigger a program failure)
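
The floating-point item is easy to demonstrate. The sketch below shows an exact comparison failing and a tolerance-based comparison succeeding; the variable names are arbitrary, and `math.isclose` is used with its default tolerances.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary representation.
exact_equal = (total == 0.3)

# Comparing within a tolerance avoids the invalid comparison.
close_enough = math.isclose(total, 0.3)

print(exact_equal, close_enough)  # prints: False True
```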