Chapter 5 - Availability Flashcards
What is availability?
a property of software that it is there and ready to carry out its task when you need it to be
How is availability different from reliability?
it builds on reliability by adding the notion of recovery and repair
Availability general scenario: what are 5 possible values for “Source”?
internal/external:
- people
- hardware
- software
- physical infrastructure
- physical environment
Availability general scenario: what are 4 possible values for “Stimulus”?
- omission
- crash
- incorrect timing
- incorrect response
Availability general scenario: what are 4 possible values for “Artifact”?
- system’s processors
- communication channels
- persistent storage
- processes
Availability general scenario: what are 6 possible values for “Environment”?
- normal operation
- startup
- shutdown
- repair mode
- degraded operation
- overloaded operation
Availability general scenario: what are 3 possible values for “Response”?
- prevent the fault from becoming a failure
- detect fault
- recover from fault
Availability general scenario: what are 6 possible values for “Response Measure”?
- time interval when the system must be available
- availability percentage
- time to detect the fault
- time to repair the fault
- time interval in which system can be in degraded mode
- the rate of a certain class of faults that the system prevents
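Worked example for the "availability percentage" measure: steady-state availability is commonly computed as MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to repair. The sketch below is illustrative only; the function names and the sample numbers are assumptions, not part of the cards.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    # Hypothetical system: fails about every 999 hours, takes 1 hour to repair.
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {downtime_hours_per_year(a):.2f} hours")
```

Shortening the repair time raises availability just as lengthening the time between failures does, which is why "time to detect" and "time to repair" appear as response measures.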
2 system actions that are done in order to “detect the fault”
- log the fault
- notify appropriate entities
4 possible system actions that can be done in order to “recover from fault”
- disable the source of events causing faults
- be temporarily unavailable while the repair is being effected
- fix/mask the fault or contain damage it causes
- operate in degraded mode while repair in progress
Definition of availability tactics?
they enable a system to endure faults so that services remain compliant with their specifications
The main goal of availability tactics?
to keep faults from becoming failures or at least bound the effects of the fault and make repair possible
9 tactics for detecting faults
- ping/echo
- monitor
- heartbeat
- timestamp
- sanity checking
- condition monitoring
- voting
- exception detection
- self-test
Tactic for detecting faults: What is ping/echo?
an asynchronous request/response message pair exchanged between nodes, used to determine reachability and the round-trip delay through the associated network path
Tactic for detecting faults: What is a monitor?
a component used to monitor the state of health of other parts of the system
Tactic for detecting faults: What is a heartbeat?
a periodic message exchange between a system monitor and a process being monitored
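A minimal sketch of the heartbeat idea, assuming the monitored processes call `beat` periodically and the monitor flags any process whose last beat is older than a timeout. The class name, method names, and timeout value are hypothetical; time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each monitored process."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # process id -> time of most recent heartbeat

    def beat(self, process_id: str, now: float) -> None:
        """Record a heartbeat received from process_id at time now."""
        self.last_seen[process_id] = now

    def suspected_failed(self, now: float) -> list:
        """Return ids whose last heartbeat is older than the timeout."""
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=3.0)
    monitor.beat("worker-a", now=0.0)
    monitor.beat("worker-b", now=0.0)
    monitor.beat("worker-a", now=2.5)  # worker-b stays silent
    print(monitor.suspected_failed(now=4.0))
```

In a real system the `now` argument would typically come from a monotonic clock such as `time.monotonic()`.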
Tactic for detecting faults: What is a timestamp?
used to detect incorrect sequences of events, primarily in distributed message-passing systems
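One way to picture this tactic: each message carries a sequence number (or timestamp) assigned by the sender, and the receiver flags any message that arrives after a later-numbered one. The sketch below is an assumption-laden illustration; the function name and the `(sequence_number, payload)` tuple layout are hypothetical.

```python
def find_out_of_order(events):
    """Return the events that arrived after an event with a higher sequence number.

    events: iterable of (sequence_number, payload) in arrival order.
    """
    highest_seen = None
    late = []
    for seq, payload in events:
        if highest_seen is not None and seq < highest_seen:
            late.append((seq, payload))  # arrived after a later-numbered event
        else:
            highest_seen = seq
    return late


if __name__ == "__main__":
    arrivals = [(1, "open"), (2, "write"), (4, "close"), (3, "write")]
    print(find_out_of_order(arrivals))
```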
Tactic for detecting faults: What is sanity checking?
checks the validity or reasonableness of a component’s operations or outputs
Tactic for detecting faults: What is condition monitoring?
checking conditions in a process or device, or validating assumptions made during the design
Tactic for detecting faults: What is voting?
to check that replicated components are producing the same results
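A small sketch of a majority voter, assuming each replica's output is a hashable value and that a strict majority is required. The function name and error handling are illustrative choices, not a prescribed design.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (agreed_value, indexes_of_dissenting_replicas).

    Raises RuntimeError when no value is produced by a strict majority.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


if __name__ == "__main__":
    # Three replicas; the third one disagrees and is flagged as suspect.
    print(majority_vote([42, 42, 41]))
```

The dissenting indexes give the fault detector something to report, which ties voting back to the "log the fault" and "notify appropriate entities" actions.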
Tactic for detecting faults: What is exception detection?
detection of a system condition that alters the normal flow of execution
Tactic for detecting faults: What is self-test?
a procedure for a component to test itself for correct operation
What are the 3 main categories of availability tactics?
- detect faults
- recover from faults
- prevent faults
What are the 2 sub-categories of availability tactics under recover from faults?
- preparation and repair
- reintroduction
10 tactics for preparation and repair for recovering from faults
- active redundancy
- passive redundancy
- spare
- exception handling
- rollback
- software upgrade
- retry
- ignore faulty behavior
- degradation
- reconfiguration
Tactic for preparation and repair: what is active redundancy?
basically having hot backups
Tactic for preparation and repair: what is passive redundancy?
the backups are not hot and get fed information during periodic updates
Tactic for preparation and repair: what is spare?
a completely offline (or cold) version that undergoes a power-on-reset procedure when a fail-over occurs before it goes into service
Tactic for preparation and repair: what is software upgrade?
in-service upgrades to executable code images in a non-service-affecting manner
think iOS update!!
Tactic for preparation and repair: what is retry?
trying an operation again may lead to success if the failure is transient
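A minimal retry sketch, assuming the caller can tell transient failures apart from permanent ones. `TransientError`, the attempt limit, and the doubling delay are all hypothetical choices for illustration; the `sleep` parameter is injectable so the example can run without real waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that may succeed if tried again."""


def retry(operation, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the delay between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the failure surface
            sleep(base_delay_s * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("temporary glitch")
        return "ok"

    print(retry(flaky, attempts=5, base_delay_s=0.01))
```

Bounding the number of attempts matters: unlimited retries against a permanent fault only delay the moment the failure is reported.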
Tactic for preparation and repair: what is reconfiguration?
reassigning responsibilities to the resources left functioning while maintaining as much functionality as possible
What are the 4 tactics for reintroduction for recovery from fault?
- shadow
- state resynchronization
- escalating restart
- non-stop forwarding
Tactic for reintroduction: what is shadow?
operating a previously failed or in-service upgraded component in a “shadow mode” for a predefined time prior to reverting the component back to an active role
Tactic for reintroduction: what is state resynchronization?
partner to active redundancy and passive redundancy where state information is sent from active to standby components
Tactic for reintroduction: what is escalating restart?
recover from faults by varying the granularity of the component(s) restarted and minimizing the level of service affected
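A sketch of the escalation idea, assuming restart actions are ordered from the narrowest scope to the widest and that a health check can be run after each one. The function name, the level names, and the health-check callback are hypothetical.

```python
def escalating_restart(levels, is_healthy):
    """Try restarts from narrowest to widest scope until the system is healthy.

    levels: list of (name, restart_fn) ordered from least to most disruptive.
    Returns the name of the level that restored health, or None if none did.
    """
    for name, restart in levels:
        restart()
        if is_healthy():
            return name
    return None


if __name__ == "__main__":
    state = {"healthy": False}

    def restart_thread():
        pass  # too narrow to clear this particular fault

    def restart_process():
        state["healthy"] = True

    def restart_node():
        state["healthy"] = True

    levels = [
        ("thread", restart_thread),
        ("process", restart_process),
        ("node", restart_node),
    ]
    print(escalating_restart(levels, lambda: state["healthy"]))
```

Stopping at the first level that works is what keeps the affected service as small as possible.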
Tactic for reintroduction: what is non-stop forwarding?
functionality is split into supervisory and data. If a supervisor fails, a router continues forwarding packets along known routes while protocol information is recovered and validated.
5 tactics for preventing faults
- removal from service
- transactions
- predictive model
- exception prevention
- increase competence set
Tactic for preventing faults: what are transactions?
bundling state updates so that asynchronous messages exchanged between distributed components are atomic, consistent, isolated, and durable
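To make the "all or nothing" part concrete, here is a single-node sketch using Python's standard `sqlite3` module. It illustrates atomic bundling of two updates only, not a distributed transaction protocol; the `accounts` table, the `transfer` function, and the sample balances are assumptions.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    transfer(conn, "a", "b", 30)
    try:
        transfer(conn, "a", "b", 500)  # fails, so the debit is rolled back
    except ValueError:
        pass
    print(balances(conn))
```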
Tactic for preventing faults: what is a predictive model?
monitor the state of health of a process to ensure that the system is operating within nominal parameters
take some action if a dangerous state is near
THINK BANKER'S ALGORITHM
Tactic for preventing faults: what is exception prevention?
preventing system exceptions from occurring by masking a fault, or preventing it via smart pointers, abstract data types, or wrappers.
Tactic for preventing faults: what does it mean to increase the competence set?
designing a component to handle more cases/faults as part of its normal operation.
3 important things for availability in terms of allocation of responsibilities
- determining system responsibilities that need to be highly available
- allocate responsibilities for detecting 4 possible stimuli
- allocate responsibilities for performing some combination of the 6 possible responses
4 important things for availability in terms of the coordination model
- ensure that coordination mechanisms can detect the 4 possible stimuli
- ensure the coordination mechanisms enable the 6 responses
- ensure the coordination model supports the replacement of any of the 4 artifacts
- determine if the coordination model will work under any of the 6 environments
2 important things for availability in terms of the data model
- determine which data abstractions could cause a stimulus
- ensure that the 4 recover-from-fault actions can be used on the data abstractions
2 important things for availability in terms of the mapping among architectural elements
- determine which artifacts may produce a stimulus
- ensure that mapping/re-mapping of architectural elements is flexible enough to permit recovery from a fault
4 important things for availability in terms of resource management
- determine what critical resources are necessary to continue operating in the presence of one of the 4 stimuli
- ensure there are sufficient resources after a fault to perform any of the 6 responses
- determine availability time for critical resources
- specify time intervals in which critical resources must be available in any of the system environments
1 important thing for availability in terms of binding time
- ensure availability strategy is sufficient to cover introduced faults caused by late bindings
3 important things for availability in terms of choice of technology
- determine if available technologies can detect faults, recover, and reintroduce failed components
- determine what technologies can help the response to a fault
- determine availability characteristics of chosen technologies themselves