Chapter 5 - Availability Flashcards

1
Q

What is availability?

A

a property of software that it is there and ready to carry out its task when you need it to be

2
Q

How is availability different from reliability?

A

it builds on reliability by adding the notion of recovery and repair

3
Q

Availability general scenario: what are 5 possible values for “Source”?

A

internal/external:

  • people
  • hardware
  • software
  • physical infrastructure
  • physical environment
4
Q

Availability general scenario: what are 4 possible values for “Stimulus”?

A
  • omission
  • crash
  • incorrect timing
  • incorrect response
5
Q

Availability general scenario: what are 4 possible values for “Artifact”?

A
  • system’s processors
  • communication channels
  • persistent storage
  • processes
6
Q

Availability general scenario: what are 6 possible values for “Environment”?

A
  • normal operation
  • startup
  • shutdown
  • repair mode
  • degraded operation
  • overloaded operation
7
Q

Availability general scenario: what are 3 possible values for “Response”?

A
  • prevent the fault from becoming a failure
  • detect fault
  • recover from fault
8
Q

Availability general scenario: what are 6 possible values for “Response Measure”?

A
  • time interval when the system must be available
  • availability percentage
  • time to detect the fault
  • time to repair the fault
  • time interval in which system can be in degraded mode
  • the rate of a certain class of faults that the system prevents
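
The "availability percentage" measure is commonly computed as steady-state availability, MTBF / (MTBF + MTTR). The sketch below illustrates that arithmetic; the function names and the 8,760-hour year are assumptions made for this example, not part of the flashcards.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up.

    mtbf_hours: mean time between failures
    mttr_hours: mean time to repair
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail, hours_per_year=365 * 24):
    """Expected downtime per (non-leap) year for a given availability."""
    return (1 - avail) * hours_per_year


# Example: a system that fails on average every 999 hours and takes
# 1 hour to repair is available 99.9% of the time.
a = availability(999, 1)
print(f"{a:.3%} available, about {downtime_per_year_hours(a):.2f} h down per year")
```

Shortening the time to detect or repair a fault lowers MTTR, which is why those two measures appear in the same list.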
9
Q

What are 2 system actions taken in order to “detect the fault”?

A
  • log the fault
  • notify appropriate entities

10
Q

4 possible system actions that can be done in order to “recover from fault”

A
  • disable the source of events causing faults
  • be temporarily unavailable while the repair is being effected
  • fix/mask the fault or contain damage it causes
  • operate in degraded mode while repair in progress
11
Q

Definition of availability tactics?

A

they enable a system to endure faults so that services remain compliant with their specifications

12
Q

The main goal of availability tactics?

A

to keep faults from becoming failures or at least bound the effects of the fault and make repair possible

13
Q

9 tactics for detecting faults

A
  • ping/echo
  • monitor
  • heartbeat
  • timestamp
  • sanity checking
  • condition monitoring
  • voting
  • exception detection
  • self-test
14
Q

Tactic for detecting faults: What is ping/echo?

A

an asynchronous request/response message pair exchanged between nodes, used to determine reachability and the round-trip delay through the associated network path

15
Q

Tactic for detecting faults: What is a monitor?

A

a component used to monitor the state of health of other parts of the system

16
Q

Tactic for detecting faults: What is a heartbeat?

A

a periodic message exchange between a system monitor and a process being monitored
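
A minimal sketch of the monitor side of a heartbeat, assuming each monitored process calls `beat()` periodically and that a process is suspected once it has been silent longer than a timeout. The class name, method names, and the injectable clock are hypothetical choices for this illustration.

```python
import time


class HeartbeatMonitor:
    """Records the last heartbeat from each process and flags silent ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the example can be tested
        self.last_seen = {}         # process id -> time of last heartbeat

    def beat(self, process_id):
        """Called whenever a heartbeat message arrives from process_id."""
        self.last_seen[process_id] = self.clock()

    def suspected_failures(self):
        """Return ids of processes silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A missed heartbeat only means the process is suspected of having failed; a slow network can produce the same symptom, so the timeout is a tuning decision.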

17
Q

Tactic for detecting faults: What is a timestamp?

A

used to detect incorrect sequences of events, primarily in distributed message-passing systems
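
One way to picture the timestamp tactic: each event carries a timestamp (or sequence number) assigned by its sender, and the receiver flags any event that arrives after a newer one. The function below is an illustrative sketch under that assumption; the `(name, timestamp)` tuple layout is made up for the example.

```python
def out_of_order(events):
    """Return names of events whose timestamp is older than one already seen.

    events: iterable of (name, timestamp) pairs in arrival order.
    """
    late = []
    newest = None
    for name, ts in events:
        if newest is not None and ts < newest:
            late.append(name)       # arrived after a newer event
        else:
            newest = ts
    return late


arrivals = [("a", 1), ("b", 3), ("c", 2), ("d", 4)]
print(out_of_order(arrivals))
```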

18
Q

Tactic for detecting faults: What is sanity checking?

A

checks the validity or reasonableness of a component’s operations or outputs

19
Q

Tactic for detecting faults: What is condition monitoring?

A

checking conditions in a process or device, or validating assumptions made during the design

20
Q

Tactic for detecting faults: What is voting?

A

to check that replicated components are producing the same results
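
A simple majority voter can illustrate the idea. This sketch assumes the replicas' outputs are hashable values collected into a list; the function name and the return shape are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Compare replica outputs.

    Returns (winner, dissenters): the value produced by a strict majority
    and the indexes of replicas that disagreed with it. If no strict
    majority exists, winner is None and every replica is listed.
    """
    if not outputs:
        raise ValueError("need at least one replica output")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]


# Three replicas, one of which produced a different result.
print(majority_vote([42, 42, 41]))
```

The dissenting indexes point at the replicas to investigate or take out of service.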

21
Q

Tactic for detecting faults: What is exception detection?

A

detection of a system condition that alters the normal flow of execution

22
Q

Tactic for detecting faults: What is self-test?

A

a procedure for a component to test itself for correct operation

23
Q

What are the 3 main categories of availability tactics?

A
  • detect faults
  • recover from faults
  • prevent faults
24
Q

What are the 2 sub-categories of availability tactics under recover from faults?

A
  • preparation and repair
  • reintroduction

25
What are the 10 tactics for preparation and repair when recovering from faults?
  • active redundancy
  • passive redundancy
  • spare
  • exception handling
  • rollback
  • software upgrade
  • retry
  • ignore faulty behavior
  • degradation
  • reconfiguration
26
Tactic for preparation and repair: what is active redundancy?
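hot spares: all nodes in a protection group receive and process identical inputs in parallel, so the spares stay in sync with the active node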
27
Tactic for preparation and repair: what is passive redundancy?
warm spares: only the active members process inputs, and they send periodic state updates to the standby members
28
Tactic for preparation and repair: what is spare?
a completely offline (or cold) version that undergoes a power-on-reset procedure when a fail-over occurs before it goes into service
29
Tactic for preparation and repair: what is software upgrade?
in-service upgrades to executable code images in a non-service-affecting manner (think: an iOS update)
30
Tactic for preparation and repair: what is retry?
trying an operation again may lead to success if the failure is transient
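
A bounded retry loop is one way to sketch this tactic. Everything named here (`TransientError`, `retry`, and its parameters) is a hypothetical illustration, and the sketch assumes the caller can tell transient failures apart from permanent ones.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that may succeed on a later attempt."""


def retry(operation, attempts=3, delay_s=0.0, sleep=time.sleep):
    """Call operation() up to `attempts` times, retrying on TransientError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(delay_s)      # wait before the next attempt
    raise last_error                # every attempt failed
```

Bounding the number of attempts matters: retrying forever against a permanent fault turns a failure into a hang.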
31
Tactic for preparation and repair: what is reconfiguration?
reassigning responsibilities to the resources left functioning while maintaining as much functionality as possible
32
What are the 4 tactics for reintroduction when recovering from faults?
  • shadow
  • state resynchronization
  • escalating restart
  • non-stop forwarding
33
Tactic for reintroduction: what is shadow?
operating a previously failed or in-service upgraded component in a “shadow mode” for a predefined time prior to reverting the component back to an active role
34
Tactic for reintroduction: what is state resynchronization?
partner to active redundancy and passive redundancy where state information is sent from active to standby components
35
Tactic for reintroduction: what is escalating restart?
recover from faults by varying the granularity of the component(s) restarted and minimizing the level of service affected
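
The escalation idea can be sketched as a loop over restart actions ordered from finest to coarsest granularity, stopping at the first level that restores health. The level names, the `restart` callables, and the `is_healthy` check are all assumptions for this illustration.

```python
def escalating_restart(levels, is_healthy):
    """Try restarts from least to most disruptive.

    levels: list of (name, restart_fn) ordered finest to coarsest.
    is_healthy: callable returning True once the fault has cleared.
    Returns the name of the level that fixed the fault, or None.
    """
    for name, restart in levels:
        restart()
        if is_healthy():
            return name             # stop before affecting more of the system
    return None
```

Trying the smallest restart first keeps as much of the service running as possible.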
36
Tactic for reintroduction: what is non-stop forwarding?
functionality is split into supervisory and data. If a supervisor fails, a router continues forwarding packets along known routes while protocol information is recovered and validated.
37
What are the 5 tactics for preventing faults?
  • removal from service
  • transactions
  • predictive model
  • exception prevention
  • increase competence set
38
Tactic for preventing faults: what are transactions?
bundling state updates so that asynchronous messages exchanged between distributed components are atomic, consistent, isolated, and durable
39
Tactic for preventing faults: what is a predictive model?
monitor the state of health of a process to ensure that the system is operating within nominal parameters, and take corrective action if a dangerous state is near (think: banker's algorithm)
40
Tactic for preventing faults: what is exception prevention?
preventing system exceptions from occurring by masking a fault, or preventing it via smart pointers, abstract data types, or wrappers.
41
Tactic for preventing faults: what does it mean to increase the competence set?
designing a component to handle more cases/faults as part of its normal operation.
42
3 important things for availability in terms of allocation of responsibilities
  • determine the system responsibilities that need to be highly available
  • allocate responsibilities for detecting the 4 possible stimuli
  • allocate responsibilities for performing some combination of the 6 possible responses
43
4 important things for availability in terms of the coordination model
  • ensure that the coordination mechanisms can detect the 4 possible stimuli
  • ensure that the coordination mechanisms enable the 6 responses
  • ensure that the coordination model supports the replacement of any of the 4 artifacts
  • determine if the coordination model will work under any of the 6 environments
44
2 important things for availability in terms of the data model
  • determine which data abstractions could cause a stimulus
  • ensure that the 4 recover-from-fault actions can be used on the data abstractions
45
2 important things for availability in terms of the mapping among architectural elements
  • determine which artifacts may produce a stimulus
  • ensure that mapping/re-mapping of architectural elements is flexible enough to permit recovery from a fault
46
5 important things for availability in terms of resource management
  • determine what critical resources are necessary to continue operating in the presence of one of the 4 stimuli
  • ensure there are sufficient resources after a fault to perform any of the 6 responses
  • determine availability time for critical resources
  • specify time intervals in which critical resources must be available in any of the system environments
47
1 important thing for availability in terms of binding time
  • ensure the availability strategy is sufficient to cover faults introduced by late bindings
48
3 important things for availability in terms of choice of technology
  • determine if available technologies can detect faults, recover, and reintroduce failed components
  • determine what technologies can help the response to a fault
  • determine availability characteristics of the chosen technologies themselves