Domain 7. Chapter 18 Flashcards by Elena Demina

Chapter 18 Disaster Recovery Planning

How well did you know this?

Not at all

Perfectly

The Nature of Disaster
1.1 Natural Disasters
- Earthquakes
- Floods
- Fires
- Pandemics
- Other Natural Events
1.2 Human-Made Disasters
- Fires
- Acts of Terrorism
- Bombings/Explosions
- Power Outages Отключения питания
- Network, Utility, and Infrastructure Failures
- Hardware/Software Failures
- Strikes/Picketing
- Theft/Vandalism

How well did you know this?

Not at all

Perfectly

Natural disasters reflect the occasional fury of our habitat—violent occurrences that result from changes in the earth’s surface or atmosphere that are beyond human control.
During the BCP/DRP process, your assessment team should analyze all of your organization’s operating locations and gauge the impact that such events might have on your business.
If your business is geographically diverse, it is prudent to include local emergency response experts on your planning team.

How well did you know this?

Not at all

Perfectly

Understand System Resilience, High Availability, and Fault Tolerance
Понимание устойчивости системы, высокой доступности и отказоустойчивости
A primary goal of system resilience and fault tolerance is to eliminate single points of failure in critical business systems.

A single point of failure (SPOF) is any component that can cause an entire system to fail. If a database-dependent website includes multiple web servers all served by a single database server, the database server is a single point of failure.

System resilience refers to the ability of a system to maintain an acceptable level of service during an adverse event.

Fault tolerance is the ability of a system to suffer a fault but continue to operate. Fault tolerance is achieved by adding redundant components, such as additional disks within a properly configured RAID array or additional servers within a failover clustered configuration.

High availability is the use of redundant technology components to allow a system to quickly recover from a failure after experiencing a brief disruption. High availability is often achieved through the use of load balancing and failover servers. (серверы балансировки нагрузки и аварийного переключения.)

Technology professionals measure the objective and effectiveness of these controls by the percentage of the time that a system is available. For example, a fairly low availability threshold would be to specify that a system must be available 99.9 percent of the time (or “three nines” of availability). This means that the system may only experience 0.1 percent of downtime during whatever period is measured. If you apply this metric to a 30-day month of system operation, 99.9 percent availability would require less than 44 minutes of downtime. If you move to a 99.999 percent (or “five nines”) requirement, the system

How well did you know this?

Not at all

Perfectly

2.1 Protecting Hard Drives
A RAID array includes two or more disks, and most RAID configurations will continue to operate even after one of the disks fails. Some of the common RAID configurations are as follows:
- RAID-0 This is also called striping. It uses two or more disks and improves the disk subsystem performance, but it does not provide fault tolerance.
- RAID-1 This is also called mirroring. It uses two disks, which both hold the same data. If one disk fails, the other disk includes the data so that a system can continue to operate after a single disk fails.
- RAID-5 This is also called striping with parity. (чередованием с четностью) It uses three or more disks with the equivalent of one disk holding parity information. This parity information allows the reconstruction of data through mathematical calculations if a single disk is lost. If any single disk fails, the RAID array will continue to operate, though it will be slower.
- RAID-6 This offers an alternative approach to disk striping with parity. It functions in the same manner as RAID-5 but stores parity information on two disks, protecting against the failure of two separate disks but requiring a minimum of four disks to implement.
- RAID-10 This is also known as RAID 1 + 0 or a stripe of mirrors, and it is configured as two or more mirrors (RAID-1), with each mirror configured in a striped (RAID-0) configuration. It uses at least four disks but can support more as long as an even number of disks are added. It will continue to operate even if multiple disks fail, as long as at least one drive in each mirror continues to function. However, if two drives in any of the mirrors failed, such as both drives in M1, the entire array would fail.

How well did you know this?

Not at all

Perfectly

2.2 Protecting Servers
Fault tolerance can be added for critical servers with failover clusters. A failover cluster includes two or more servers, and if one of the servers fails, another server in the cluster can take over its load in an automatic process called failover. Failover clusters can include multiple servers (not just two), and they can also provide fault tolerance for multiple services or applications.

How well did you know this?

Not at all

Perfectly

2.3 Protecting Power Sources
Fault tolerance can be added for power sources with a UPS, a generator, or both. In general, a UPS provides battery-supplied power for a short period of time, between 5 and 30 minutes, and a generator provides long-term power. The goal of a UPS is to provide power long enough to complete a logical shutdown of a system, or until a generator is powered on and providing stable power.

How well did you know this?

Not at all

Perfectly

2.4 Trusted Recovery
Trusted recovery provides assurances that after a failure or crash, the system is just as secure as it was before the failure or crash occurred. Depending on the failure, the recovery may be automated or require manual intervention by an administrator.
Systems can be designed so that they fail in a fail-secure state or a fail-open state. A fail-secure system will default to a secure state in the event of a failure, blocking all access. A fail-open system will fail in an open state, granting all access.
Specifically, it defines four types of trusted recovery:

Manual Recovery If a system fails, it does not fail in a secure state. Instead, an administrator is required to manually perform the actions necessary to implement a secured or trusted recovery after a failure or system crash.
Automated Recovery The system is able to perform trusted recovery activities to restore itself against at least one type of failure.
Automated Recovery without Undue Loss This is similar to automated recovery in that a system can restore itself against at least one type of failure. However, it includes mechanisms to ensure that specific objects are protected to prevent their loss. A method of automated recovery that protects against undue loss would include steps to restore data or other objects.
Function Recovery Systems that support function recovery are able to automatically recover specific functions. This state ensures that the system is able to successfully complete the recovery for the functions, or that the system will be able to roll back the changes to return to a secure state.

How well did you know this?

Not at all

Perfectly

2.5 Quality of Service
Quality of service (QoS) controls protect the availability of data networks under load. Many different factors contribute to the quality of the end-user experience, and QoS attempts to manage all of those factors to create an experience that meets business requirements.

Some of the factors contributing to QoS are as follows:
- Bandwidth The network capacity available to carry communications.
- Latency The time it takes a packet to travel from source to destination.
- Jitter The variation in latency between different packets.
- Packet Loss Some packets may be lost between source and destination, requiring retransmission.
- Interference Electrical noise, faulty equipment, and other factors may corrupt the contents of packets.

In addition to controlling these factors, QoS systems often prioritize certain traffic types that have low tolerance for interference and/or have high business requirements.

How well did you know this?

Not at all

Perfectly

Recovery Strategy
When a disaster interrupts your business, your disaster recovery plan should kick in nearly automatically and begin providing support for recovery operations. The disaster recovery plan should be designed so that the first employees on the scene can immediately begin the recovery effort in an organized fashion, even if members of the official DRP team have not yet arrived on site.

If your property insurance includes an actual cash value (ACV) clause, then your damaged property will be compensated based on the fair market value of the items on the date of loss, less all accumulated depreciation since the time of their purchase.

How well did you know this?

Not at all

Perfectly

3.1 Business Unit and Functional Priorities
To recover your business operations with the greatest possible efficiency, you must engineer your disaster recovery plan so that those business units with the highest priority are recovered first. You must identify and prioritize critical business functions as well so that you can define which functions you want to restore after a disaster or failure and in what order. The business impact analysis (BIA) you developed during your business continuity work is an excellent resource when performing this task.
The output from this task should be a simple listing of business units in priority order.
However, a more detailed list, broken down into specific business processes listed in order of priority, would be a much more useful deliverable.
The final result should be a checklist of items in priority order, each with its own risk and cost assessment, and a corresponding set of recovery objectives and milestones. As discussed in Chapter 3, these include the mean time to repair (MTTR), maximum tolerable downtime (MTD), recovery time objective (RTO), and recovery point objective (RPO).

How well did you know this?

Not at all

Perfectly

3.2 Crisis Management
If a disaster strikes your organization, panic is likely to set in. The best way to combat this is with an organized disaster recovery plan. The individuals in your business who are most likely to first notice an emergency situation (such as security guards and technical personnel) should be fully trained in disaster recovery procedures and know the proper notification procedures and immediate response mechanisms.

How well did you know this?

Not at all

Perfectly

3.3 Emergency Communications
When a disaster strikes, it is important that the organization be able to communicate internally as well as with the outside world. A disaster of any significance is easily noticed, but if an organization is unable to keep the outside world informed of its recovery status, the public is apt to fear the worst and assume that the organization is unable to recover. It is also essential that the organization be able to communicate internally during a disaster so that employees know what is expected of them—whether they are to return to work or report to another location, for instance.

How well did you know this?

Not at all

Perfectly

3.4 Workgroup Recovery
When designing a disaster recovery plan, it’s important to keep your goal in mind—the restoration of workgroups to the point that they can resume their activities in their usual work locations. It’s easy to get sidetracked and think of disaster recovery as purely an IT effort focused on restoring systems and processes to working order.

How well did you know this?

Not at all

Perfectly

3.5 Alternate Processing Sites
One of the most important elements of the disaster recovery plan is the selection of alternate processing sites to be used when the primary sites are unavailable.

How well did you know this?

Not at all

Perfectly

3.5.1 Cold sites are standby facilities large enough to handle the processing load of an organization and equipped with appropriate electrical and environmental support systems. The major advantage of a cold site is its relatively low cost—there’s no computing base to maintain and no monthly telecommunications bill when the site is idle.

How well did you know this?

Not at all

Perfectly

3.5.2 A hot site is the exact opposite of the cold site. In this configuration, a backup facility is maintained in constant working order, with a full complement of servers, workstations, and communications links ready to assume primary operations responsibilities. The servers and workstations are all preconfigured and loaded with appropriate operating system and application software.

If it’s not the case, disaster recovery managers have three options to activate the hot site:

If there is sufficient time before the primary site must be shut down, they can force replication between the two sites right before the transition of operational control.
If replication is impossible, managers may carry backup tapes of the transaction logs from the primary site to the hot site and manually reapply any transactions that took place since the last replication.
If there are no available backups and it isn’t possible to force replication, the disaster recovery team may simply accept the loss of some portion of the data. This should only be done when the loss is within the organization’s recovery point objective (RPO).

3.5.3 Warm Sites
Warm sites occupy the middle ground between hot and cold sites for disaster recovery specialists. They always contain the equipment and data circuits necessary to rapidly establish operations. As with hot sites, this equipment is usually preconfigured and ready to run appropriate applications to support an organization’s operations. Unlike hot sites, however, warm sites do not typically contain copies of the client’s data. The main requirement in bringing a warm site to full operational status is the transportation of appropriate backup media to the site and restoration of critical data on the standby servers.

3.5.4 Mobile Sites
Mobile sites are non-mainstream alternatives to traditional recovery sites. They typically consist of self-contained trailers or other easily relocated units. These sites include all the environmental control systems necessary to maintain a safe computing environment. Larger corporations sometimes maintain these sites on a “fly-away” basis, ready to deploy them to any operating location around the world via air, rail, sea, or surface transportation. Smaller firms might contract with a mobile site vendor in their local area to provide these services on an as-needed basis.

3.5.5 Cloud Computing
Many organizations now turn to cloud computing as their preferred disaster recovery option. Infrastructure-as-a-service (IaaS) providers, such as Amazon Web Services (AWS), Microsoft Azure, and Google Compute Engine, offer on-demand service at low cost.

3.5.6 Mutual Assistance Agreements Соглашения о взаимной помощи (MAA)
Mutual assistance agreements (MAAs), also called reciprocal agreements, are popular in disaster recovery literature but are rarely implemented in real-world practice. Under an MAA, two organizations pledge обязуются to assist each other in the event of a disaster by sharing computing facilities or other technological resources.

3.6 Database Recovery
3.6.1 Electronic Vaulting
In an electronic vaulting scenario, database backups are moved to a remote site using bulk transfers. В сценарии электронного хранилища резервные копии базы данных перемещаются на удаленный сайт с помощью массовой передачи. The remote location may be a dedicated alternative recovery site (such as a hot site) or simply an offsite location managed within the company or by a contractor for the purpose of maintaining backup data.
If you use electronic vaulting, remember that there may be a significant delay between the time you declare a disaster and the time your database is ready for operation with current data. If you decide to activate a recovery site, technicians will need to retrieve the appropriate backups from the electronic vault and apply them to the soon-to-be production servers at the recovery site.

3.6.2 Remote Journaling
With remote journaling, data transfers are performed in a more expeditious manner. Data transfers still occur in a bulk transfer mode, but they occur on a more frequent basis, usually once every hour and sometimes more frequently. Unlike electronic vaulting scenarios, where entire database backup files are transferred, remote journaling setups transfer copies of the database transaction logs containing the transactions that occurred since the previous bulk transfer.

3.6.3 Remote Mirroring
Remote mirroring is the most advanced database backup solution. Not surprisingly, it’s also the most expensive! Remote mirroring goes beyond the technology used by remote journaling and electronic vaulting; with remote mirroring, a live database server is maintained at the backup site. The remote server receives copies of the database modifications at the same time they are applied to the production server at the primary site. Therefore, the mirrored server is ready to take over an operational role at a moment’s notice.

Recovery Plan Development
Once you’ve established your business unit priorities and have a good idea of the appropriate alternative recovery sites for your organization, it’s time to put pen to paper and begin drafting a true disaster recovery plan.

The following list includes various types of documents worth considering:
- Executive summary providing a high-level overview of the plan
Managers and public relations personnel will have a simple document that walks them through a high-level view of the coordinated symphony that is an active disaster recovery effort without requiring interpretation from team members busy with tasks directly related to that effort.

Department-specific plans
Personnel who need to refresh themselves on the disaster recovery procedures that affect various parts of the organization will be able to refer to their department-specific plans.
Technical guides for IT personnel responsible for implementing and maintaining critical backup systems
Critical disaster recovery team members will have checklists to help guide their actions amid the chaotic atmosphere of a disaster.
Checklists for individuals on the disaster recovery team
Critical disaster recovery team members will have checklists to help guide their actions amid the chaotic atmosphere of a disaster.
Full copies of the plan for critical disaster recovery team members

4.1 Emergency Response
A disaster recovery plan should contain simple yet comprehensive instructions for essential personnel to follow immediately upon recognizing that a disaster is in progress or is imminent.
Emergency-response plans are often put together in the form of checklists provided to responders. When designing such checklists, keep one essential design principle in mind: arrange the checklist tasks in order of priority, with the most important task first!
Among these essential tasks is the formal declaration of a disaster. The response plan should include clear criteria for activation of the disaster recovery plan, define who has the authority to declare a disaster, and then discuss notification procedures, as discussed in the next section.

4.2 Personnel and Communications
A disaster recovery plan should also contain a list of personnel to contact in the event of a disaster. Usually, this includes key members of the DRP team as well as personnel who execute critical disaster recovery tasks throughout the organization.
This response checklist should include alternate means of contact (that is, pager numbers, mobile phone numbers, and so on) as well as backup contacts for each role should the primary contact be incommunicado or unable to reach the recovery site for one reason or another.

Many firms organize their notification checklists in a “telephone tree” style: each member of the tree contacts the person below them, spreading the notification burden among members of the team instead of relying on one person to make lots of telephone calls.

4.3 Assessment
When the disaster recovery team arrives on site, one of their first tasks is to assess the situation. This normally occurs in a rolling fashion, with the first responders performing a simple assessment to triage activity and get the disaster response under way.
Когда группа аварийного восстановления прибывает на место, одной из их первых задач является оценка ситуации. Обычно это происходит поочередно, когда службы экстренного реагирования проводят простую оценку для сортировки действий и начала реагирования на стихийное бедствие.

4.4 Backups and Off-site Storage
Backups play an important role in the disaster recovery plan.
Many system administrators are already familiar with various types of backups, so you’ll benefit by bringing one or more individuals with specific technical expertise in this area onto the BCP/DRP team to provide expert guidance. There are three main types of backups:
- Full Backups
- Incremental Backups Incremental backups store only those files that have been modified since the time of the most recent full or incremental backup.
- Differential Backups Differential backups store all files that have been modified since the time of the most recent full backup.

The most important difference between incremental and differential backups is the time needed to restore data in the event of an emergency. If you use a combination of full and differential backups, you will need to restore only two backups—the most recent full backup and the most recent differential backup. On the other hand, if your strategy combines full backups with incremental backups, you will need to restore the most recent full backup as well as all incremental backups performed since that full backup. The trade-off is the time required to create the backups—differential backups don’t take as long to restore, but they take longer to create than incremental ones.

In case of system failure, many companies use one of two common methods to restore data from backups. In the first situation, they run a full backup on Monday night and then run differential backups every other night of the week. If a failure occurs Saturday morning, they restore Monday’s full backup and then restore only Friday’s differential backup. In the second situation, they run a full backup on Monday night and run incremental backups every other night of the week. If a failure occurs Saturday morning, they restore Monday’s full backup and then restore each incremental backup in original chronological order (that is, Wednesday’s, then Friday’s, and so on).

4.5 Disk-to-Disk Backup
Many enterprises now use disk-to-disk (D2D) backup solutions for some portion of their disaster recovery strategy.

Many backup technologies are designed around the tape paradigm. Virtual tape libraries (VTL) support the use of disks with this model by using software to make disk storage appear as tapes to backup software.

4.6 Backup Best Practices
Murphy’s law dictates that a server never crashes immediately after a successful backup. Instead, it is always just before the next backup begins. To avoid the problem with periods, you may deploy some form of real-time continuous backup, such as RAID, clustering, or server mirroring.

4.8 Software Escrow Arrangements
A software escrow arrangement is a unique tool used to protect a company against the failure of a software developer to provide adequate support for its products or against the possibility that the developer will go out of business and no technical support will be available for the product.

4.7 Tape Rotation
There are several commonly used tape rotation strategies for backups: the Grandfather-Father-Son (GFS) strategy, the Tower of Hanoi strategy, and the Six Cartridge Weekly Backup strategy. These strategies can be fairly complex, especially with large tape sets. They can be implemented manually using a pencil and a calendar or automatically by using either commercial backup software or a fully automated hierarchical storage management (HSM) system. An HSM system is an automated robotic backup jukebox consisting of 32 or 64 optical or tape backup devices. All the drive elements within an HSM system are configured as a single drive array (a bit like RAID).

4.8 Utilities
As discussed in previous sections of this chapter, your organization is reliant on several utilities to provide critical elements of your infrastructure—electric power, water, natural gas, sewer service, and so on. Your disaster recovery plan should contain contact information and procedures to troubleshoot these services if problems arise during a disaster.

Logistics and Supplies
The logistical problems surrounding a disaster recovery operation are immense. You will suddenly face the problem of moving large numbers of people, equipment, and supplies to alternate recovery sites. It’s also possible that the people will be living at those sites for an extended period of time and that the disaster recovery team will be responsible for providing them with food, water, shelter, and appropriate facilities. Your disaster recovery plan should contain provisions for this type of operation if it falls within the scope of your expected operational needs.

4.9 Logistics and Supplies
The logistical problems surrounding a disaster recovery operation are immense. You will suddenly face the problem of moving large numbers of people, equipment, and supplies to alternate recovery sites. It’s also possible that the people will be living at those sites for an extended period of time and that the disaster recovery team will be responsible for providing them with food, water, shelter, and appropriate facilities. Your disaster recovery plan should contain provisions for this type of operation if it falls within the scope of your expected operational needs.

4.10 Recovery vs. Restoration
It is sometimes useful to separate disaster recovery tasks from disaster restoration tasks. This is especially true when a recovery effort is expected to take a significant amount of time. A disaster recovery team команда аварийного восстановления may be assigned to implement and maintain operations at the recovery site, and a salvage team спасательная команда is assigned to restore the primary site to operational capacity.
Recovery and restoration are separate concepts. In this context, recovery involves bringing business operations and processes back to a working state. Restoration involves bringing a business facility and environment back to a workable state.

Training, Awareness, and Documentation
When designing a training plan, consider including the following elements:
- Orientation training for all new employees
- Initial training for employees taking on a new disaster recovery role for the first time
- Detailed refresher training for disaster recovery team members
- Brief awareness refreshers for all other employees (can be accomplished as part of other meetings and through a medium like email newsletters sent to all employees)

Testing and Maintenance
Every disaster recovery plan must be tested on a periodic basis to ensure that the plan’s provisions are viable and that it meets an organization’s changing needs. The types of tests that you conduct will depend on the types of recovery facilities available to you, the culture of your organization, and the availability of disaster recovery team members.

The five main test types
- checklist tests,
- structured walk-throughs,
- simulation tests,
-parallel tests, and
- full-interruption tests.

NIST Special Publication 800-84, Guide to Test, Training, and Exercise Programs for IT Plans and Capabilities Recommendations

6.1 Read-Through Test тест на чтение
The read-through test is one of the simplest tests to conduct, but it’s also one of the most critical. In this test, you distribute copies of disaster recovery plans to the members of the disaster recovery team for review.

This lets you accomplish three goals simultaneously:
- It ensures that key personnel are aware of their responsibilities and have that knowledge refreshed periodically.
- It provides individuals with an opportunity to review the plans for obsolete information and update any items that require modification because of changes within the organization.
- In large organizations, it helps identify situations in which key personnel have left the company and nobody bothered to reassign their disaster recovery responsibilities. This is also a good reason why disaster recovery responsibilities should be included in job descriptions.

6.2 Structured Walk-Through структурированное пошаговое руководство
A structured walk-through takes testing one step further. In this type of test, often referred to as a tabletop exercise, members of the disaster recovery team gather in a large conference room and role-play a disaster scenario. Usually, the exact scenario is known only to the test moderator, who presents the details to the team at the meeting. The team members then refer to their copies of the disaster recovery plan and discuss the appropriate responses to that particular type of disaster.

6.3 Simulation Test
Simulation tests are similar to the structured walk-throughs. In simulation tests, disaster recovery team members are presented with a scenario and asked to develop an appropriate response. Unlike with the tests previously discussed, some of these response measures are then tested. This may involve the interruption of noncritical business activities and the use of some operational personnel.

6.4 Parallel tests represent the next level in testing and involve relocating personnel to the alternate recovery site and implementing site activation procedures. The employees relocated to the site perform their disaster recovery responsibilities just as they would for an actual disaster. The only difference is that operations at the main facility are not interrupted. That site retains full responsibility for conducting the day-to-day business of the organization.

6.5 Full-Interruption Test
Full-interruption tests operate like parallel tests, but they involve actually shutting down operations at the primary site and shifting them to the recovery site. These tests involve a significant risk, since they require the operational shutdown of the primary site and transfer to the recovery site, followed by the reverse process to restore operations at the primary site. For this reason, full-interruption tests are extremely difficult to arrange, and you often encounter resistance from management.

6.6 Lessons Learned
At the conclusion of any disaster recovery operation or other security incident, the organization should conduct a lessons learned session. The lessons learned process is designed to provide everyone involved with the incident response effort an opportunity to reflect on their individual roles in the incident and the team’s response overall. It is an opportunity to improve the processes and technologies used in incident response to better respond to future security crises.

In SP 800-61, NIST offers a series of questions to use in the lessons learned process. They include the following:

Exactly what happened and at what times?
How well did staff and management perform in dealing with the incident?
Were documented procedures followed?
Were the procedures adequate?
Were any steps or actions taken that might have inhibited the recovery?
What would the staff and management do differently the next time a similar incident occurs?
How could information sharing with other organizations have been improved?
What corrective actions can prevent similar incidents in the future?
What precursors or indicators should be watched for in the future to detect similar incidents?
What additional tools or resources are needed to detect, analyze, and mitigate future incidents?

6.7 Maintenance
Remember that a disaster recovery plan is a living document. As your organization’s needs change, you must adapt the disaster recovery plan to meet those changed needs to follow suit. You will discover many necessary modifications by using a well organized and coordinated testing plan. Minor changes may often be made through a series of telephone conversations or emails, whereas major changes may require one or more meetings of the full disaster recovery team.