Chapter 5 - Availability Flashcards

1
Q

What is availability?

A

a property of software that it is there and ready to carry out its task when you need it to be

2
Q

How is availability different from reliability?

A

it builds on reliability by adding the notion of recovery and repair

3
Q

Availability general scenario: what are 5 possible values for “Source”

A

internal/external:

  • people
  • hardware
  • software
  • physical infrastructure
  • physical environment
4
Q

Availability general scenario: what are 4 possible values for “Stimulus”

A
  • omission
  • crash
  • incorrect timing
  • incorrect response
5
Q

Availability general scenario: what are 4 possible values for “Artifact”

A
  • system’s processors
  • communication channels
  • persistent storage
  • processes
6
Q

Availability general scenario: what are 6 possible values for “Environment”

A
  • normal operation
  • startup
  • shutdown
  • repair mode
  • degraded operation
  • overloaded operation
7
Q

Availability general scenario: what are 3 possible values for “Response”

A
  • prevent the fault from becoming a failure
  • detect fault
  • recover from fault
8
Q

Availability general scenario: what are 6 possible values for “Response Measure”

A
  • time interval when the system must be available
  • availability percentage
  • time to detect the fault
  • time to repair the fault
  • time interval in which system can be in degraded mode
  • the rate of a certain class of faults that the system prevents
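
Worked sketch for the "availability percentage" measure. It assumes the common steady-state formula availability = MTBF / (MTBF + MTTR), which the card itself does not state; the function names and the sample numbers are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1.

    mtbf_hours: mean time between failures
    mttr_hours: mean time to repair
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_hours(avail: float, period_hours: float = 365 * 24) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - avail) * period_hours
```

For example, a component that fails on average every 999 hours and takes 1 hour to repair is available 99.9% of the time, which leaves a budget of roughly 8.76 hours of downtime per year.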
9
Q

2 system actions that are done in order to “detect the fault”

A
  • log the fault
  • notify appropriate entities

10
Q

4 possible system actions that can be done in order to “recover from fault”

A
  • disable the source of events causing faults
  • be temporarily unavailable while the repair is being effected
  • fix/mask the fault or contain damage it causes
  • operate in degraded mode while repair in progress
11
Q

Definition of availability tactics?

A

they enable a system to endure faults so that services remain compliant with their specifications

12
Q

The main goal of availability tactics?

A

to keep faults from becoming failures or at least bound the effects of the fault and make repair possible

13
Q

9 tactics for detecting faults

A
  • ping/echo
  • monitor
  • heartbeat
  • timestamp
  • sanity checking
  • condition monitoring
  • voting
  • exception detection
  • self-test
14
Q

Tactic for detecting faults: What is ping/echo?

A

an asynchronous request/response message pair exchanged between nodes, used to determine reachability and the round-trip delay through the associated network path
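
A simplified sketch of ping/echo. For brevity it blocks on the reply rather than being asynchronous, and the `echo` callable is a hypothetical stand-in for whatever transport actually carries the message.

```python
import time
from typing import Callable, Optional


def ping(echo: Callable[[bytes], bytes],
         payload: bytes = b"ping",
         timeout_s: float = 1.0) -> Optional[float]:
    """Return the round-trip time in seconds, or None if the node looks unreachable.

    `echo` (assumed) sends the payload to the remote node and returns its reply,
    raising OSError (or a subclass such as TimeoutError) on failure.
    """
    start = time.monotonic()
    try:
        reply = echo(payload)
    except OSError:
        return None
    rtt = time.monotonic() - start
    # A wrong reply or a reply that arrives too late both count as a failed ping.
    if reply != payload or rtt > timeout_s:
        return None
    return rtt
```

The caller learns two things from one exchange: whether the node is reachable, and how long the round trip took.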

15
Q

Tactic for detecting faults: What is a monitor?

A

a component used to monitor the state of health of other parts of the system

16
Q

Tactic for detecting faults: What is a heartbeat?

A

a periodic message exchange between a system monitor and a process being monitored
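
A minimal sketch of the monitor side of a heartbeat. The class name, the timeout, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat from each process and flags the silent ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # process id -> time of last heartbeat

    def beat(self, process_id):
        """Called whenever a heartbeat message arrives from a process."""
        self.last_seen[process_id] = self.clock()

    def suspected_failed(self):
        """Return the processes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(pid for pid, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

Unlike ping/echo, the monitored process initiates the message; the monitor only notices when the messages stop.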

17
Q

Tactic for detecting faults: What is a timestamp?

A

used to detect incorrect sequences of events, primarily in distributed message-passing systems
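
A small sketch of the idea, assuming each message carries a per-sender timestamp (or sequence number) that should strictly increase. The tuple layout `(sender, timestamp, body)` is hypothetical.

```python
def out_of_order(messages):
    """Return the messages that arrive with a timestamp no newer than
    the latest one already seen from the same sender."""
    latest = {}     # sender -> highest timestamp seen so far
    flagged = []
    for sender, ts, body in messages:
        if sender in latest and ts <= latest[sender]:
            flagged.append((sender, ts, body))
        else:
            latest[sender] = ts
    return flagged
```

Given messages from sender "a" stamped 1, 3, 2, the third is flagged because it is older than one already processed.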

18
Q

Tactic for detecting faults: What is sanity checking?

A

checks the validity or reasonableness of a component’s operations or outputs
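
An illustrative sanity check on a sensor-style output. The range limits and the maximum step between readings are made-up values, not taken from the card.

```python
def sane_reading(value, previous, low=-40.0, high=125.0, max_step=10.0):
    """Return True if a reading is within range and did not jump implausibly.

    previous: the last accepted reading, or None if there is none yet.
    """
    if not (low <= value <= high):
        return False    # out of range (this also rejects NaN)
    if previous is not None and abs(value - previous) > max_step:
        return False    # changed faster than the assumed physical limit
    return True
```

The check needs knowledge of what a reasonable output looks like, which is why sanity checks are usually specific to one component.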

19
Q

Tactic for detecting faults: What is condition monitoring?

A

checking conditions in a process or device, or validating assumptions made during the design

20
Q

Tactic for detecting faults: What is voting?

A

to check that replicated components are producing the same results
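
A sketch of majority voting over replica outputs, assuming the outputs are hashable and directly comparable. The function name and return shape are illustrative.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (agreed_value, indices_of_dissenting_replicas).

    Raises RuntimeError when no value is produced by a strict majority.
    """
    if not outputs:
        raise ValueError("need at least one replica output")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replicas")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

With three replicas returning 42, 42, and 41, the voter accepts 42 and reports replica 2 as the one that disagreed.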

21
Q

Tactic for detecting faults: What is exception detection?

A

detection of a system condition that alters the normal flow of execution

22
Q

Tactic for detecting faults: What is self-test?

A

a procedure for a component to test itself for correct operation

23
Q

What are the 3 main categories of availability tactics?

A
  • detect faults
  • recover from faults
  • prevent faults
24
Q

What are the 2 sub-categories of availability tactics under recover from faults?

A
  • preparation and repair
  • reintroduction

25
Q

10 tactics for preparation and repair for recovering from faults

A
  • active redundancy
  • passive redundancy
  • spare
  • exception handling
  • rollback
  • software upgrade
  • retry
  • ignore faulty behavior
  • degradation
  • reconfiguration
26
Q

Tactic for preparation and repair: what is active redundancy?

A

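hot spares: every node in the protection group receives and processes the same inputs in parallel, so a spare stays in sync with the active node and can take over almost immediately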

27
Q

Tactic for preparation and repair: what is passive redundancy?

A

warm spares: only the active node processes inputs; the spares are not hot and are kept current through periodic state updates from the active node

28
Q

Tactic for preparation and repair: what is spare?

A

a completely offline (or cold) version that undergoes a power-on-reset procedure when a fail-over occurs before it goes into service

29
Q

Tactic for preparation and repair: what is software upgrade?

A

in-service upgrades to executable code images in a non-service-affecting manner

think iOS update!!

30
Q

Tactic for preparation and repair: what is retry?

A

trying an operation again may lead to success if the failure is transient
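
A minimal retry sketch with exponential backoff. Which exceptions count as transient, the attempt limit, and the delays are all assumptions chosen for illustration.

```python
import time


def retry(operation, attempts=3, base_delay_s=0.1,
          transient=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Only exceptions listed in `transient` are retried; anything else
    propagates immediately. The last transient error is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == attempts:
                raise
            # Wait longer after each failure: 0.1 s, 0.2 s, 0.4 s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Bounding the number of attempts matters: retrying forever turns a permanent fault into a hang.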

31
Q

Tactic for preparation and repair: what is reconfiguration?

A

reassigning responsibilities to the resources left functioning while maintaining as much functionality as possible
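
One possible sketch of reconfiguration: responsibilities held by a failed node are handed to the least-loaded survivors. The node names, the dictionary layout, and the least-loaded policy are assumptions.

```python
def reconfigure(assignments, failed):
    """Return a new node -> responsibilities mapping without the failed node.

    assignments: dict mapping node name to a list of responsibilities.
    The input mapping is left unchanged.
    """
    survivors = sorted(n for n in assignments if n != failed)
    if not survivors:
        raise RuntimeError("no functioning nodes left")
    new = {n: list(assignments[n]) for n in survivors}
    for task in assignments.get(failed, []):
        # Give each orphaned responsibility to the currently least-loaded node.
        target = min(survivors, key=lambda n: len(new[n]))
        new[target].append(task)
    return new
```

A real system would also weigh capacity and priority, and might drop low-priority responsibilities instead of reassigning them all.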

32
Q

What are the 4 tactics for reintroduction for recovery from fault

A
  • shadow
  • state resynchronization
  • escalating restart
  • non-stop forwarding
33
Q

Tactic for reintroduction: what is shadow?

A

operating a previously failed or in-service upgraded component in a “shadow mode” for a predefined time prior to reverting the component back to an active role

34
Q

Tactic for reintroduction: what is state resynchronization?

A

partner to active redundancy and passive redundancy where state information is sent from active to standby components

35
Q

Tactic for reintroduction: what is escalating restart?

A

recover from faults by varying the granularity of the component(s) restarted and minimizing the level of service affected
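
A sketch of the escalation loop, assuming the restart levels are supplied from narrowest to broadest scope. The level names and the `is_healthy` probe are hypothetical.

```python
def escalating_restart(levels, is_healthy):
    """Restart progressively larger scopes until the system is healthy.

    levels: ordered list of (name, restart_function), narrowest scope first.
    Returns the name of the level that fixed the problem, or None.
    """
    for name, restart in levels:
        restart()
        if is_healthy():
            return name     # stop here so wider scopes are not disturbed
    return None
```

Trying the narrowest restart first keeps the affected level of service as small as possible; wider restarts happen only when the cheaper ones fail.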

36
Q

Tactic for reintroduction: what is non-stop forwarding?

A

functionality is split into supervisory and data. If a supervisor fails, a router continues forwarding packets along known routes while protocol information is recovered and validated.

37
Q

5 tactics for preventing faults

A
  • removal from service
  • transactions
  • predictive model
  • exception prevention
  • increase competence set
38
Q

Tactic for preventing faults: what are transactions?

A

bundling state updates so that asynchronous messages exchanged between distributed components are atomic, consistent, isolated, and durable
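
A local illustration of the atomicity part, using Python's built-in `sqlite3` module on a single database rather than distributed components. The `accounts` table and the `transfer` function are made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one all-or-nothing update."""
    # Using the connection as a context manager commits on success and
    # rolls back if an exception escapes the block.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst))
```

If the second half of the update cannot happen, the first half is undone as well, so no observer ever sees money debited but not credited.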

39
Q

Tactic for preventing faults: what is a predictive model?

A

monitor the state of health of a process to ensure that the system is operating within nominal parameters

take some action if a dangerous state is near

think banker's algorithm!!

40
Q

Tactic for preventing faults: what is exception prevention?

A

preventing system exceptions from occurring by masking a fault, or preventing it via smart pointers, abstract data types, or wrappers.

41
Q

Tactic for preventing faults: what does it mean to increase the competence set?

A

designing a component to handle more cases/faults as part of its normal operation.

42
Q

3 important things for availability in terms of allocation of responsibilities

A
  • determining system responsibilities that need to be highly available
  • allocate responsibilities for detecting 4 possible stimuli
  • allocate responsibilities for performing some combination of the 6 possible responses
43
Q

4 important things for availability in terms of the coordination model

A
  • ensure that coordination mechanisms can detect the 4 possible stimuli
  • ensure the coordination mechanisms enable the 6 responses
  • ensure the coordination model supports the replacement of any of the 4 artifacts
  • determine if the coordination model will work under any of the 6 environments
44
Q

2 important things for availability in terms of the data model

A
  • determine which data abstractions could cause a stimulus
  • ensure that the 4 recover-from-fault actions can be used on the data abstractions

45
Q

2 important things for availability in terms of the mapping among architectural elements

A
  • determine which artifacts may produce a stimulus
  • ensure that mapping/re-mapping of architectural elements is flexible enough to permit recovery from a fault

46
Q

4 important things for availability in terms of resource management

A
  • determine what critical resources are necessary to continue operating in the presence of one of the 4 stimuli
  • ensure there are sufficient resources after a fault to perform any of the 6 responses
  • determine availability time for critical resources
  • specify time intervals in which critical resources must be available in any of the system environments
47
Q

1 important thing for availability in terms of binding time

A

ensure the availability strategy is sufficient to cover faults introduced by late binding

48
Q

3 important things for availability in terms of choice of technology

A
  • determine if available technologies can detect faults, recover, and reintroduce failed components
  • determine what technologies can help the response to a fault
  • determine availability characteristics of chosen technologies themselves