Domain 7. Chapter 18 Flashcards
Chapter 18 Disaster Recovery Planning
- The Nature of Disaster
1.1 Natural Disasters
- Earthquakes
- Floods
- Fires
- Pandemics
- Other Natural Events
1.2 Human-Made Disasters
- Fires
- Acts of Terrorism
- Bombings/Explosions
- Power Outages
- Network, Utility, and Infrastructure Failures
- Hardware/Software Failures
- Strikes/Picketing
- Theft/Vandalism
Natural disasters reflect the occasional fury of our habitat: violent occurrences that result from changes in the earth's surface or atmosphere and are beyond human control.
During the BCP/DRP process, your assessment team should analyze all of your organization’s operating locations and gauge the impact that such events might have on your business.
If your business is geographically diverse, it is prudent to include local emergency response experts on your planning team.
- Understand System Resilience, High Availability, and Fault Tolerance
A primary goal of system resilience and fault tolerance is to eliminate single points of failure in critical business systems.
A single point of failure (SPOF) is any component that can cause an entire system to fail. If a database-dependent website includes multiple web servers all served by a single database server, the database server is a single point of failure.
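The web server and database example can be sketched in a few lines. This is a minimal illustration, not an audit tool: the tier names and the `single_points_of_failure` helper are hypothetical, and it assumes redundancy can be reduced to a simple count of instances per tier.

```python
def single_points_of_failure(tiers: dict[str, int]) -> list[str]:
    """Return the tiers that run on exactly one instance."""
    return [name for name, count in tiers.items() if count == 1]


# Hypothetical deployment: several web servers, one database server.
website = {"web_server": 3, "database_server": 1}
print(single_points_of_failure(website))  # ['database_server']
```

Adding a second database server in a failover configuration would empty the list for this model.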
System resilience refers to the ability of a system to maintain an acceptable level of service during an adverse event.
Fault tolerance is the ability of a system to suffer a fault but continue to operate. Fault tolerance is achieved by adding redundant components, such as additional disks within a properly configured RAID array or additional servers within a failover clustered configuration.
High availability is the use of redundant technology components to allow a system to quickly recover from a failure after experiencing a brief disruption. High availability is often achieved through the use of load balancing and failover servers.
Technology professionals express the target for these controls, and measure their effectiveness, as the percentage of time that a system is available. For example, a fairly low availability threshold would be to specify that a system must be available 99.9 percent of the time (or “three nines” of availability). This means that the system may experience at most 0.1 percent downtime during whatever period is measured. Applied to a 30-day month of system operation (43,200 minutes), 99.9 percent availability allows at most 43.2 minutes of downtime. If you move to a 99.999 percent (or “five nines”) requirement, the system may be down for only about 26 seconds in that same month.
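The downtime budget for any availability target follows from simple arithmetic: multiply the length of the measurement period by the unavailable fraction. The sketch below assumes a 30-day month; the function name is illustrative only.

```python
def allowed_downtime_seconds(availability_percent: float, period_days: float = 30) -> float:
    """Maximum downtime, in seconds, that still meets the availability target."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target}% -> {seconds / 60:.2f} minutes ({seconds:.0f} seconds)")
```

For a 30-day month this works out to roughly 43.2 minutes at three nines and roughly 26 seconds at five nines.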
2.1 Protecting Hard Drives
A RAID array includes two or more disks, and most RAID configurations will continue to operate even after one of the disks fails. Some of the common RAID configurations are as follows:
- RAID-0 This is also called striping. It uses two or more disks and improves the disk subsystem performance, but it does not provide fault tolerance.
- RAID-1 This is also called mirroring. It uses two disks, which both hold the same data. If one disk fails, the other disk includes the data so that a system can continue to operate after a single disk fails.
- RAID-5 This is also called striping with parity. It uses three or more disks with the equivalent of one disk holding parity information. This parity information allows the reconstruction of data through mathematical calculations if a single disk is lost. If any single disk fails, the RAID array will continue to operate, though it will be slower.
- RAID-6 This offers an alternative approach to disk striping with parity. It functions in the same manner as RAID-5 but stores parity information on two disks, protecting against the failure of two separate disks but requiring a minimum of four disks to implement.
- RAID-10 This is also known as RAID 1 + 0 or a stripe of mirrors. It is configured as two or more mirrors (RAID-1), with the mirrors combined in a striped (RAID-0) configuration. It uses at least four disks but can support more as long as disks are added in pairs. It will continue to operate even if multiple disks fail, as long as at least one drive in each mirror continues to function. However, if both drives in any one mirror fail (for example, both drives in the first mirror), the entire array fails.
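The "mathematical calculations" behind RAID-5 parity are commonly a bytewise XOR. The sketch below shows only that idea: it treats three short byte strings as the data blocks of one stripe and ignores how a real array rotates parity across disks. The block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus the parity computed from them.
data_blocks = [b"DISK", b"FAIL", b"SAFE"]
parity = xor_blocks(data_blocks)

# Simulate losing the second block and rebuild it from the survivors plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'FAIL'
```

Because every read of a missing block requires this recomputation from all surviving disks, a degraded RAID-5 array keeps working but runs slower.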
2.2 Protecting Servers
Fault tolerance can be added for critical servers with failover clusters. A failover cluster includes two or more servers, and if one of the servers fails, another server in the cluster can take over its load in an automatic process called failover. Failover clusters can include multiple servers (not just two), and they can also provide fault tolerance for multiple services or applications.
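A toy model of failover, assuming a simple ordered list of nodes where the first healthy node carries the load. The `FailoverCluster` class and node names are hypothetical; real clustering software also handles health checks, shared storage, and quorum, none of which is modeled here.

```python
class FailoverCluster:
    """Tracks healthy nodes in preference order; the first one is active."""

    def __init__(self, nodes: list[str]) -> None:
        self.healthy = list(nodes)

    @property
    def active(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy nodes remain")
        return self.healthy[0]

    def mark_failed(self, node: str) -> None:
        # Removing the active node promotes the next healthy node automatically.
        if node in self.healthy:
            self.healthy.remove(node)


cluster = FailoverCluster(["node-a", "node-b", "node-c"])
before = cluster.active
cluster.mark_failed("node-a")
after = cluster.active
print(before, "->", after)  # node-a -> node-b
```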
2.3 Protecting Power Sources
Fault tolerance can be added for power sources with a UPS, a generator, or both. In general, a UPS provides battery-supplied power for a short period of time, between 5 and 30 minutes, and a generator provides long-term power. The goal of a UPS is to provide power long enough to complete a logical shutdown of a system, or until a generator is powered on and providing stable power.
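The sizing rule in this paragraph can be written as a quick check. This is a rough sketch under stated assumptions: the function name and the minute values are illustrative, and it ignores battery aging and load variation.

```python
from typing import Optional


def ups_bridges_outage(ups_runtime_min: float,
                       shutdown_min: float,
                       generator_start_min: Optional[float] = None) -> bool:
    """Return True if the UPS battery lasts long enough to be useful.

    With a generator, the UPS only has to last until the generator supplies
    stable power. Without one, it has to last through a logical shutdown.
    """
    required = shutdown_min if generator_start_min is None else generator_start_min
    return ups_runtime_min >= required


print(ups_bridges_outage(10, shutdown_min=15))                          # False
print(ups_bridges_outage(10, shutdown_min=15, generator_start_min=2))   # True
```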
2.4 Trusted Recovery
Trusted recovery provides assurances that after a failure or crash, the system is just as secure as it was before the failure or crash occurred. Depending on the failure, the recovery may be automated or require manual intervention by an administrator.
Systems can be designed so that they fail in a fail-secure state or a fail-open state. A fail-secure system will default to a secure state in the event of a failure, blocking all access. A fail-open system will fail in an open state, granting all access.
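The difference between the two failure modes shows up in how a control behaves when the control itself breaks. The sketch below is one possible illustration: `check_access`, `broken_lookup`, and the user name are hypothetical, and the "failure" is simulated by raising an exception.

```python
from typing import Callable


def check_access(policy_lookup: Callable[[str], bool], user: str,
                 fail_secure: bool = True) -> bool:
    """Ask the policy source for a decision; apply the failure mode if it breaks."""
    try:
        return policy_lookup(user)
    except Exception:
        # The control failed. Fail-secure blocks access; fail-open grants it.
        return not fail_secure


def broken_lookup(user: str) -> bool:
    """Stand-in for a policy server that cannot be reached."""
    raise ConnectionError("policy server unreachable")


print(check_access(broken_lookup, "alice", fail_secure=True))   # False
print(check_access(broken_lookup, "alice", fail_secure=False))  # True
```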
The Common Criteria defines four types of trusted recovery:
- Manual Recovery If a system fails, it does not fail in a secure state. Instead, an administrator is required to manually perform the actions necessary to implement a secured or trusted recovery after a failure or system crash.
- Automated Recovery The system is able to perform trusted recovery activities to restore itself against at least one type of failure.
- Automated Recovery without Undue Loss This is similar to automated recovery in that a system can restore itself against at least one type of failure. However, it includes mechanisms to ensure that specific objects are protected to prevent their loss. A method of automated recovery that protects against undue loss would include steps to restore data or other objects.
- Function Recovery Systems that support function recovery are able to automatically recover specific functions. This state ensures that the system is able to successfully complete the recovery for the functions, or that the system will be able to roll back the changes to return to a secure state.
2.5 Quality of Service
Quality of service (QoS) controls protect the availability of data networks under load. Many different factors contribute to the quality of the end-user experience, and QoS attempts to manage all of those factors to create an experience that meets business requirements.
Some of the factors contributing to QoS are as follows:
- Bandwidth The network capacity available to carry communications.
- Latency The time it takes a packet to travel from source to destination.
- Jitter The variation in latency between different packets.
- Packet Loss Some packets may be lost between source and destination, requiring retransmission.
- Interference Electrical noise, faulty equipment, and other factors may corrupt the contents of packets.
In addition to controlling these factors, QoS systems often prioritize certain traffic types that have low tolerance for interference and/or have high business requirements.
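Several of the factors listed above can be estimated from a series of per-packet measurements. The sketch below assumes a list of one-way latencies in milliseconds, with `None` marking a lost packet, and computes jitter as the mean absolute difference between consecutive received packets, which is one common simplification. The sample values are invented, and the function expects at least two received packets.

```python
from statistics import mean
from typing import Optional


def summarize(latencies_ms: list[Optional[float]]) -> dict[str, float]:
    """Summarize latency, jitter, and packet loss from per-packet samples."""
    received = [value for value in latencies_ms if value is not None]
    lost = len(latencies_ms) - len(received)
    # Jitter: average change in latency from one received packet to the next.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "mean_latency_ms": mean(received),
        "jitter_ms": mean(deltas),
        "packet_loss_pct": 100 * lost / len(latencies_ms),
    }


samples = [20.0, 22.0, None, 30.0, 24.0]
stats = summarize(samples)
print(stats)
```

With these samples the mean latency is 24 ms, the jitter is about 5.3 ms, and the packet loss is 20 percent.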
- Recovery Strategy
When a disaster interrupts your business, your disaster recovery plan should kick in nearly automatically and begin providing support for recovery operations. The disaster recovery plan should be designed so that the first employees on the scene can immediately begin the recovery effort in an organized fashion, even if members of the official DRP team have not yet arrived on site.
If your property insurance includes an actual cash value (ACV) clause, then your damaged property will be compensated based on the fair market value of the items on the date of loss, less all accumulated depreciation since the time of their purchase.
3.1 Business Unit and Functional Priorities
To recover your business operations with the greatest possible efficiency, you must engineer your disaster recovery plan so that those business units with the highest priority are recovered first. You must identify and prioritize critical business functions as well so that you can define which functions you want to restore after a disaster or failure and in what order. The business impact analysis (BIA) you developed during your business continuity work is an excellent resource when performing this task.
The output from this task should be a simple listing of business units in priority order.
However, a more detailed list, broken down into specific business processes listed in order of priority, would be a much more useful deliverable.
The final result should be a checklist of items in priority order, each with its own risk and cost assessment, and a corresponding set of recovery objectives and milestones. As discussed in Chapter 3, these include the mean time to repair (MTTR), maximum tolerable downtime (MTD), recovery time objective (RTO), and recovery point objective (RPO).
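One way to picture the deliverable is as a small data structure sorted by priority. The sketch below is illustrative only: the process names and hour values are invented, the risk and cost assessment and MTTR fields are left out for brevity, and the sanity check relies on the general rule that a recovery time objective should not exceed the maximum tolerable downtime.

```python
from dataclasses import dataclass


@dataclass
class RecoveryItem:
    process: str
    priority: int      # 1 is restored first
    mtd_hours: float   # maximum tolerable downtime
    rto_hours: float   # recovery time objective
    rpo_hours: float   # recovery point objective


def build_checklist(items: list[RecoveryItem]) -> list[RecoveryItem]:
    """Validate each item, then return the items in recovery order."""
    for item in items:
        if item.rto_hours > item.mtd_hours:
            raise ValueError(f"{item.process}: RTO exceeds MTD")
    return sorted(items, key=lambda item: item.priority)


# Hypothetical business processes and objectives.
items = [
    RecoveryItem("payroll", 3, mtd_hours=72, rto_hours=48, rpo_hours=24),
    RecoveryItem("order processing", 1, mtd_hours=8, rto_hours=4, rpo_hours=1),
    RecoveryItem("customer support", 2, mtd_hours=24, rto_hours=12, rpo_hours=4),
]
checklist = build_checklist(items)
for item in checklist:
    print(item.priority, item.process, f"RTO {item.rto_hours}h", f"RPO {item.rpo_hours}h")
```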
3.2 Crisis Management
If a disaster strikes your organization, panic is likely to set in. The best way to combat this is with an organized disaster recovery plan. The individuals in your business who are most likely to first notice an emergency situation (such as security guards and technical personnel) should be fully trained in disaster recovery procedures and know the proper notification procedures and immediate response mechanisms.
3.3 Emergency Communications
When a disaster strikes, it is important that the organization be able to communicate internally as well as with the outside world. A disaster of any significance is easily noticed, but if an organization is unable to keep the outside world informed of its recovery status, the public is apt to fear the worst and assume that the organization is unable to recover. It is also essential that the organization be able to communicate internally during a disaster so that employees know what is expected of them—whether they are to return to work or report to another location, for instance.
3.4 Workgroup Recovery
When designing a disaster recovery plan, it’s important to keep your goal in mind—the restoration of workgroups to the point that they can resume their activities in their usual work locations. It’s easy to get sidetracked and think of disaster recovery as purely an IT effort focused on restoring systems and processes to working order.
3.5 Alternate Processing Sites
One of the most important elements of the disaster recovery plan is the selection of alternate processing sites to be used when the primary sites are unavailable.
3.5.1 Cold Sites
Cold sites are standby facilities large enough to handle the processing load of an organization and equipped with appropriate electrical and environmental support systems. The major advantage of a cold site is its relatively low cost: there is no computing base to maintain and no monthly telecommunications bill when the site is idle.