Lesson 18: Explaining Disaster Recovery and High Availability Concepts Flashcards
Define availability
The percentage of time that the system is online, measured over a certain period, typically one year.
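As a rough illustration of the definition above, availability can be computed from the length of the measurement period and the downtime within it. The function name and the sample figures are hypothetical, chosen only for this sketch.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def availability_percent(period_hours: float, downtime_hours: float) -> float:
    """Return the percentage of the period during which the system was online."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


# Illustrative figures: 8.76 hours of downtime in one year is roughly 99.9%.
yearly = availability_percent(HOURS_PER_YEAR, 8.76)
```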
Describe high availability and its goal
Metric that defines how closely systems approach the goal of providing data availability 100 percent of the time while maintaining a high level of system performance.
Define Maximum Tolerable Downtime (MTD)
Longest period that a process can be inoperable without causing irrevocable business failure.
How is downtime calculated?
Calculated from the sum of scheduled service intervals (Agreed Service Time) plus unplanned outages over the period.
For critical systems, what is the suggested availability?
99% (two nines) to 99.9999% (six nines)
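To make the "nines" concrete, an availability target can be converted into the downtime it permits per year. This is a minimal sketch that assumes an 8,760-hour year; the function name is hypothetical.

```python
def max_downtime_hours_per_year(availability_percent: float,
                                hours_per_year: float = 8760.0) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return hours_per_year * (1 - availability_percent / 100)


two_nines = max_downtime_hours_per_year(99.0)      # roughly 87.6 hours
six_nines = max_downtime_hours_per_year(99.9999)   # roughly 0.00876 hours
six_nines_seconds = six_nines * 3600               # about half a minute
```

The gap between the two ends of the range is large: two nines allows several days of downtime per year, while six nines allows only seconds.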
Define Recovery time objective (RTO)
Maximum time allowed to restore a system after a failure event; maximum amount of time allowed to identify that there is a problem and then perform recovery.
Define Work Recovery Time (WRT)
Time spent performing reintegration and testing of a restored or upgraded system following an event.
What two factors are considered in Maximum tolerable downtime (MTD)?
- RTO: Recovery time objective
- WRT: Work recovery time
Combined, they must not exceed the MTD.
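The relationship between the three values can be sketched as a simple check. The function name and the hour values are hypothetical and used only for illustration.

```python
def within_mtd(rto_hours: float, wrt_hours: float, mtd_hours: float) -> bool:
    """Return True when RTO plus WRT fits inside the maximum tolerable downtime."""
    return rto_hours + wrt_hours <= mtd_hours


# 4 hours to restore plus 2 hours to reintegrate and test fits an 8-hour MTD.
plan_ok = within_mtd(rto_hours=4, wrt_hours=2, mtd_hours=8)

# 6 hours plus 3 hours exceeds the same MTD, so the plan needs rework.
plan_too_slow = within_mtd(rto_hours=6, wrt_hours=3, mtd_hours=8)
```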
Define Recovery Point Objective (RPO)
Longest period that an organization can tolerate lost data being unrecoverable.
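One way to reason about RPO is to compare it with the time elapsed since the last good backup when a failure occurs. This sketch assumes that everything written after the last backup is lost, which is a simplification; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return the span of data at risk, assuming all changes since the backup are lost."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """Return True when the data loss window does not exceed the RPO."""
    return data_loss_window(last_backup, failure) <= rpo


# Illustrative timestamps: a failure three hours after the last backup.
backup_time = datetime(2024, 1, 1, 0, 0)
failure_time = datetime(2024, 1, 1, 3, 0)
ok_with_4h_rpo = meets_rpo(backup_time, failure_time, timedelta(hours=4))
```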
Define a fault
An event that causes a service or asset to become unavailable. Servers, disk arrays, switches, routers, and other components can all have faults.
What is a KPI?
Key performance indicator - used to determine the reliability of each asset and assess whether goals for MTD, RTO, and RPO can be met.
Define Mean Time Between Failures (MTBF)
Metric for a device or component that predicts the expected time between failures
How is Mean Time Between Failures (MTBF) calculated?
Total operational time divided by the number of failures
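A minimal sketch of the MTBF formula above. The function name and the sample numbers (10 devices running 50 hours each, with 2 failures) are illustrative assumptions.

```python
def mtbf_hours(total_operational_hours: float, failures: int) -> float:
    """Return mean time between failures: operational time divided by failure count."""
    if failures <= 0:
        raise ValueError("failures must be a positive integer")
    return total_operational_hours / failures


# 10 devices x 50 hours = 500 operational hours; 2 failures gives 250 hours per failure.
example_mtbf = mtbf_hours(total_operational_hours=10 * 50, failures=2)
```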
Define Mean Time to Failure (MTTF)
Metric indicating average time a non-repairable component is expected to be in operation
What non-repairable components would be measured with Mean Time to Failure (MTTF)?
HDDs, SSDs
How is Mean Time to Failure (MTTF) calculated?
Total operational time divided by the number of devices.
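A minimal sketch of the MTTF formula above, taking the hours each device ran before it failed. The function name and the three sample lifetimes are hypothetical.

```python
from typing import Sequence


def mttf_hours(hours_until_failure: Sequence[float]) -> float:
    """Return mean time to failure: total operating time divided by device count."""
    if not hours_until_failure:
        raise ValueError("at least one device lifetime is required")
    return sum(hours_until_failure) / len(hours_until_failure)


# Three drives that failed after 10, 15, and 20 hours: 45 / 3 = 15 hours.
example_mttf = mttf_hours([10, 15, 20])
```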
When is Mean Time to Failure (MTTF) used in comparison to Mean Time Between Failures (MTBF)?
A hard drive may be described with an MTTF, while a server, which could be repaired by replacing the hard drive, would be described with an MTBF.
Define Mean Time to Repair (MTTR)
Metric representing the average time taken for a device or component to be repaired or replaced, or to recover from a failure.
How is Mean Time to Repair (MTTR) calculated?
Total number of hours of unplanned maintenance divided by the number of failure incidents.
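A minimal sketch of the MTTR formula above. The function name and the sample figures are assumptions made for illustration.

```python
def mttr_hours(unplanned_maintenance_hours: float, failure_incidents: int) -> float:
    """Return mean time to repair: unplanned maintenance hours per failure incident."""
    if failure_incidents <= 0:
        raise ValueError("failure_incidents must be a positive integer")
    return unplanned_maintenance_hours / failure_incidents


# 30 hours of unplanned maintenance across 4 incidents averages 7.5 hours each.
example_mttr = mttr_hours(unplanned_maintenance_hours=30, failure_incidents=4)
```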
How is Mean Time to Repair (MTTR) used in a recovery effort?
Used to estimate whether a recovery time objective (RTO) is achievable.
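The comparison can be sketched as follows. Because MTTR is an average, individual repairs may take longer, so this hypothetical check is only a first estimate and not a guarantee that the RTO will be met.

```python
def rto_looks_achievable(mttr_hours: float, rto_hours: float) -> bool:
    """Return True when the average repair time fits within the RTO."""
    return mttr_hours <= rto_hours


# An average repair of 7.5 hours does not fit a 4-hour RTO, but fits an 8-hour one.
tight_rto = rto_looks_achievable(mttr_hours=7.5, rto_hours=4)
relaxed_rto = rto_looks_achievable(mttr_hours=7.5, rto_hours=8)
```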
Define fault tolerance
The ability of a system to experience failures in individual components and subsystems while continuing to provide the same (or nearly the same) level of service.
How is fault tolerance achieved?
By provisioning redundancy for critical components to eliminate single points of failure.