Lesson 14: Explaining Risk Management and Disaster Recovery Concepts Flashcards

1
Q

vulnerable business processes

A

If a company operates with one or more vulnerable business processes, it could result in disclosure, modification, loss, destruction, or interruption of critical data, or it could lead to loss of service to customers. Quite apart from the immediate financial losses arising from such security incidents, either outcome will damage a company's reputation. If a bank lost its trading floor link to its partners, even for an hour, huge losses could result because the organization's primary function (trading) would be impossible. Consequently, when planning a network or other IT system, you must consider the impact of data loss and service unavailability on the organization.

2
Q

Risk management

A

Process for identifying, assessing, and mitigating vulnerabilities and threats to the essential functions that a business must perform to serve its customers.

3
Q

Risk management performed over five phases:

A
  1. Identify mission essential functions—mitigating risk can involve a large amount of expenditure, so it is important to focus efforts. Part of risk management is to analyze workflows and identify the mission essential functions that could cause the whole business to fail if they are not performed. Part of this process also involves identifying critical systems and assets that support these functions.
  2. Identify vulnerabilities—for each function or workflow (starting with the most critical), analyze systems and assets to discover and list any vulnerabilities or weaknesses to which they may be susceptible. Vulnerability refers to a specific flaw or weakness that could be exploited to overcome a security system.
  3. Identify threats—for each function or workflow, identify the threats that may exploit or accidentally trigger vulnerabilities. Threat refers to the sources or motivations of people and things that could cause loss or damage.
  4. Analyze business impacts—the likelihood of a vulnerability being activated as a security incident by a threat and the impact of that incident on critical systems give factors for evaluating risks. There are quantitative and qualitative methods of analyzing impacts.
  5. Identify risk response—for each risk, identify possible countermeasures and assess the cost of deploying additional security controls. Most risks require some sort of mitigation, but other types of response might be more appropriate for certain types and levels of risk.
4
Q

mission essential function (MEF)

A

one that cannot be deferred. This means that the organization must be able to perform the function as close to continually as possible, and if there is any service disruption, the mission essential functions must be restored first.

5
Q

Analysis of mission essential functions is generally governed by four main metrics:

A
  • Maximum tolerable downtime (MTD) is the longest period of time that a business function can remain unavailable without causing irrecoverable business failure. Each business process can have its own MTD, such as a range of minutes to hours for critical functions, 24 hours for urgent functions, 7 days for normal functions, and so on. MTDs vary by company and event. Each function may be supported by multiple systems and assets. The MTD sets the upper limit on the amount of recovery time that system and asset owners have to resume operations. For example, an organization specializing in medical equipment may be able to exist without incoming manufacturing supplies for three months because it has stockpiled a sizeable inventory. After three months, the organization will not have sufficient supplies and may not be able to manufacture additional products, therefore leading to failure. In this case, the MTD is three months.
  • Recovery time objective (RTO) is the period following a disaster that an individual IT system may remain offline. This represents the amount of time it takes to identify that there is a problem and then perform recovery (restore from backup or switch in an alternative system, for instance).
  • Work Recovery Time (WRT). Following systems recovery, there may be additional work to reintegrate different systems, test overall functionality, and brief system users on any changes or different working practices so that the business function is again fully supported.

  • Recovery Point Objective (RPO) is the amount of data loss that a system can sustain, measured in time. That is, if a database is destroyed by a virus, an RPO of 24 hours means that the data can be recovered (from a backup copy) to a point not more than 24 hours before the database was infected.

Note: RTO+WRT must not exceed MTD!

For example, a customer leads database might be able to sustain the loss of a few hours’ or days’ worth of data (the salespeople will generally be able to remember who they have contacted and re-key the data manually). Conversely, order processing may be considered more critical, as any loss will represent lost orders and it may be impossible to recapture web orders or other processes initiated only through the computer system, such as linked records to accounting and fulfilment.

MTD and RPO help to determine which business functions are critical and also to specify appropriate risk countermeasures. For example, if your RPO is measured in days, then a simple tape backup system should suffice; if RPO is zero or measured in minutes or seconds, a more expensive server cluster backup and redundancy solution will be required.
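To make the relationship between these metrics concrete, here is a minimal sketch in Python (the function name and the MTD/RTO/WRT values are hypothetical, not taken from the text) that checks the "RTO+WRT must not exceed MTD" constraint:

```python
from datetime import timedelta

def plan_is_viable(mtd: timedelta, rto: timedelta, wrt: timedelta) -> bool:
    """A recovery plan only works if system recovery (RTO) plus
    reintegration and testing (WRT) fit within the MTD."""
    return rto + wrt <= mtd

# Hypothetical values for an order-processing function.
mtd = timedelta(hours=24)  # maximum tolerable downtime
rto = timedelta(hours=16)  # time to restore the IT systems
wrt = timedelta(hours=6)   # time to reintegrate, test, and brief users

print(plan_is_viable(mtd, rto, wrt))  # True: 16 + 6 = 22 hours <= 24 hours
```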

For most businesses, the most critical functions will be those that enable customers to find them and for the business to interact with those customers. In practical terms, this means telecoms and web presence. Following that is probably the capability to fulfil products and services. Back-office functions such as accounting, HR, and marketing are probably necessary rather than critical.

6
Q

identification of critical systems

A

To support the resiliency of mission essential and primary business functions, it is crucial for an organization to perform the identification of critical systems. This means compiling an inventory of its business processes and its tangible and intangible assets and resources. These could include:

  • People (employees, visitors, and suppliers).
  • Tangible assets (buildings, furniture, equipment and machinery (plant), ICT equipment, electronic data files, and paper documents).
  • Intangible assets (ideas, commercial reputation, brand, and so on).
  • Procedures (supply chains, critical procedures, standard operating procedures).

It is important to be up to date with best practice and standards relevant to the type of business or organization. This can help to identify procedures or standards that are not currently being implemented but should be. Make sure that the asset identification process captures system architecture as well as individual assets (that is, understand and document the way assets are deployed, utilized, and how they work together).

7
Q

business process analysis (BPA)

A

For mission essential functions, it is important to reduce the number of dependencies between components. Dependencies are identified by performing a business process analysis (BPA) for each function.

8
Q

The BPA should identify the following factors:

A
  • Inputs—the sources of information for performing the function (including the impact if these are delayed or out of sequence).
  • Hardware—the particular server or data center that performs the processing.
  • Staff and other resources supporting the function.
  • Outputs—the data or resources produced by the function.
  • Process flow—a step-by-step description of how the function is performed.

Reducing dependencies makes it easier to provision redundant systems to allow the function to failover to a backup system smoothly. This means the system design can more easily eliminate the sort of weakness that comes from having single points of failure (SPoF) that can disrupt the function.

9
Q

Key performance indicators (KPI)

A

Each IT system will be supported by assets, such as servers, disk arrays, switches, routers, and so on. Key performance indicators (KPI) can be used to determine the reliability of each asset.

10
Q

Some of the main KPIs relating to service availability are as follows:

A
  • Mean Time to Failure (MTTF) and Mean Time Between Failures (MTBF) represent the expected lifetime of a product. MTTF should be used for non-repairable assets. For example, a hard drive may be described with an MTTF, while a server (which could be repaired by replacing the hard drive) would be described with an MTBF. You will often see MTBF used indiscriminately, however. For most devices, failure is more likely early and late in life, producing the so-called “bathtub curve.”
  • The calculation for MTBF is the total operating time divided by the number of failures. For example, if you have 10 devices that run for 50 hours and two of them fail, the MTBF is (10*50)/2 = 250 hours per failure.
  • The calculation for MTTF for the same test is the total time divided by the number of devices: (10*50)/10 = 50 hours per failure (see the sketch after this list).

MTTF/MTBF can be used to determine the amount of asset redundancy a system should have. A redundant system can failover to another asset if there is a fault and continue to operate normally. It can also be used to work out how likely failures are to occur.

  • Mean Time to Repair (MTTR) is a measure of the time taken to correct a fault so that the system is restored to full operation. This can also be described as mean time to “replace” or “recover.” This metric is important in determining the overall Recovery Time Objective (RTO).
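A minimal sketch of the MTBF and MTTF calculations above, reusing the same test figures (10 devices running for 50 hours, with two failures); the helper names are illustrative only:

```python
def mtbf(total_hours: float, failures: int) -> float:
    """Mean Time Between Failures: total operating time / number of failures."""
    return total_hours / failures

def mttf(total_hours: float, devices: int) -> float:
    """Mean Time To Failure (non-repairable assets): total time / number of devices."""
    return total_hours / devices

total = 10 * 50  # 10 devices * 50 hours each = 500 device-hours
print(mtbf(total, failures=2))   # 250.0 hours per failure
print(mttf(total, devices=10))   # 50.0 hours per failure
```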

11
Q

asset management process

A

An asset management process takes inventory of and tracks all the organization's critical systems, components, devices, and other objects of value. It also involves collecting and analyzing information about these assets so that personnel can make more informed changes or otherwise work with assets to achieve business goals. There are many software suites and associated hardware solutions available for tracking and managing assets (or inventory). An asset management database can be configured to store as much or as little information as is deemed necessary, though typical data would be type, model, serial number, asset ID, location, user(s), value, and service information. Tangible assets can be identified using a barcode label or Radio Frequency ID (RFID) tag attached to the device (or more simply, using an identification number). An RFID tag is a chip programmed with asset data. When in range of a scanner, the chip activates and signals the scanner, which alerts the management software to update the device's location. As well as enabling asset tracking, this makes theft more difficult.

Within the inventory of assets and business processes, it is important to assess their relative importance. In the event of a disaster that requires that recovery processes take place over an extended period, critical systems must be prioritized over merely necessary ones.

12
Q

It is also important to realize that asset management procedures can easily go astray—assets get mislabeled, new assets are not recorded, and so on. In these cases, some troubleshooting tactics can include:

A
  • Ensure that all relevant assets are participating in a tracking system like barcodes or passive radio frequency IDs (RFIDs).
  • Ensure that there is a process in place for tagging newly acquired or developed assets.
  • Ensure that there is a process in place for removing obsolete assets from the system.
  • Check to see if any assets have conflicting IDs.
  • Check to see if any assets have inaccurate metadata.
  • Ensure that asset management software can correctly read and interpret tracking tags.
  • Update asset management software to fix any bugs or security issues.
13
Q

Threat assessment

A

means compiling a prioritized list of probable and possible threats. Some of these can be derived from the list of assets (that is, threats that are specific to your organization); others may be non-specific to your particular organization.

14
Q

It is important to note that threats can arise from something the organization is not doing, or from an asset it does not own, just as much as from things it is doing or assets it does own. Consider (for instance) the impact on business processes of the following:

A
  • Public infrastructure (transport, utilities, law and order).
  • Supplier contracts (security of supply chain).
  • Customer’s security (the sudden failure of important customers due to their own security vulnerabilities can be as damaging as an attack on your own organization).
  • Epidemic disease.

A large part of threat assessment will identify human threat actors, both internal and external to the organization, so try to understand their motives to assess the level of risk that each type of threat actor poses. Threat actors discussed earlier—such as hackers, organized crime, nation state actors, and insider threats—can all be described as working with some sort of intent. Another threat source is the all-too-human propensity for carelessness and, consequently, accidental damage. A naïve user who misuses a system may intend no harm but can nonetheless cause widespread disruption. Misconfiguration of a system can create vulnerabilities that might be exploited by other threat agents. Threat actors also need not be human.

15
Q

Threat awareness must consider threats posed by events such as natural disasters, accidents, and by legal liabilities:

A
  • Natural disaster—threat sources such as river or sea floods, earthquakes, storms, and so on. Natural disasters may be quite predictable (as is the case with areas prone to flooding or storm damage) or unexpected, and therefore difficult to plan for.
  • Manmade disaster—intentional man-made threats such as terrorism, war, or vandalism/arson or unintentional threats, such as user error or information disclosure through social media platforms.
  • Environmental—those caused by some sort of failure in the surrounding environment. These could include power or telecoms failure, pollution, or accidental damage (including fire).
  • Legal and commercial—some examples include:
    • Downloading or distributing obscene material.
    • Defamatory comments published on social networking sites.
    • Hijacked mail or web servers used for spam or phishing attacks.
    • Third-party liability for theft or damage of personal data.
    • Accounting and regulatory liability to preserve accurate records.

These cases are often complex, but even if there is no legal liability, the damage done to the organization’s reputation could be just as serious.

16
Q

supply chain

A

Threat assessment should not be confined to analyzing your own business. You must also consider critical suppliers. A supply chain is a series of companies involved in fulfilling a product. Assessing a supply chain involves determining whether each link in the chain is sufficiently robust. Each supplier in the chain may have their own suppliers, and assessing “robustness” means obtaining extremely privileged company information. Consequently, assessing the whole chain is an extremely complex process and is an option only available to the largest companies. Most businesses will try to identify alternative sources for supplies so that the disruption to a primary supplier does not represent a single point of failure.

17
Q

For each business process and each threat, you must assess the degree of risk that exists. Calculating risk is complex, but the two main variables are likelihood and impact:

A
  • Likelihood is the probability of the threat being realized.
  • Impact is the severity of the risk if realized as a security incident. This may be determined by factors such as the value of the asset or the cost of disruption if the asset is compromised.
18
Q

Business impact analysis (BIA)

A

process of assessing what losses might occur for each threat scenario. For instance, if a roadway bridge crossing a local river is washed out by a flood and employees are unable to reach a business facility for five days, estimated costs to the organization need to be assessed for lost manpower and production. Impacts can be categorized in several ways.

19
Q

impacts on life and safety

A

The most critical type of impact is one that could lead to loss of life or critical injury. The most obvious risks to life and safety come from natural disasters, man-made disasters, and accidents (such as fire). Sometimes industries have to consider life and safety impacts in terms of the security of their products, however. For example, a company makes wireless adapters, originally for use with laptops. The security of the firmware upgrade process is important, but it has no impact on life or safety. The company, however, earns a new contract to supply the adapters to provide connectivity for in-vehicle electronics systems. Unknown to the company, a weakness in the design of the in-vehicle system allows an adversary to use compromised wireless adapter firmware to affect the car’s control systems (braking, acceleration, and steering). The integrity of the upgrade process now has an impact on safety.

20
Q

impacts on property

A

Again, risks whose impacts affect property (premises) mostly arise due to natural disaster, war/terrorism, and fire.

21
Q

impacts on finance and reputation

A

It is important to realize that the value of an asset does not refer solely to its material value. The two principal additional considerations are direct costs associated with the asset being compromised (downtime) and consequent costs to intangible assets, such as the company's reputation. For example, a server may have a material cost of a few hundred dollars. If the server were stolen, the costs incurred from not being able to do business until it can be recovered or replaced could run to thousands of dollars. In addition, that period of interruption where orders cannot be taken or go unfulfilled leads customers to look at alternative suppliers, resulting in perhaps thousands more in lost sales and goodwill.

22
Q

impacts on privacy

A

Another important source of risk is the unauthorized disclosure of personally identifiable information (PII). The theft or loss of PII can have an enormous impact on an individual because of the risk of identity theft and because once disclosed, the PII cannot easily be changed or recovered.

23
Q

Organizations should perform regular audits to assess whether PII is processed securely. These may be modelled on formal audit documents mandated by US laws, notably The Privacy Act and the Federal Information Security Management Act (FISMA):

A
  • Privacy Threshold Analysis (PTA)—An initial audit to determine whether a computer system or workflow collects, stores, or processes PII to a degree where a PIA must be performed. PTAs must be repeated every three years.
  • Privacy Impact Assessment (PIA)—A detailed study to assess the risks associated with storing, processing, and disclosing PII. The study should identify vulnerabilities that may lead to data breach and evaluate controls mitigating those risks.
  • System of Records Notice (SORN)—A formal document listing PII maintained by a federal agency of the US government.
24
Q

There are two methods of assessing likelihood and risk:

A

quantitative and qualitative

25
Q

Quantitative risk assessment aims to assign concrete values to each risk factor.

A
  • Single Loss Expectancy (SLE)—The amount that would be lost in a single occurrence of the risk factor. This is determined by multiplying the value of the asset by an Exposure Factor (EF). EF is the percentage of the asset value that would be lost.
  • Annual Loss Expectancy (ALE)—The amount that would be lost over the course of a year. This is determined by multiplying the SLE by the Annual Rate of Occurrence (ARO).

The problem with quantitative risk assessment is that the process of determining and assigning these values is complex and time consuming. The accuracy of the values assigned is also difficult to determine without historical data (often, it has to be based on subjective guesswork). However, over time and with experience, this approach can yield a detailed and sophisticated description of assets and risks and provide a sound basis for justifying and prioritizing security expenditure.
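As a worked illustration of these formulas (the asset value, exposure factor, and ARO below are hypothetical figures, not prescribed values):

```python
def single_loss_expectancy(asset_value: float, exposure_factor: float) -> float:
    """SLE = asset value * exposure factor (fraction of value lost per incident)."""
    return asset_value * exposure_factor

def annual_loss_expectancy(sle: float, aro: float) -> float:
    """ALE = SLE * Annual Rate of Occurrence."""
    return sle * aro

# A $20,000 asset expected to lose 25% of its value per incident,
# with an incident expected once every two years (ARO = 0.5).
sle = single_loss_expectancy(asset_value=20_000, exposure_factor=0.25)
ale = annual_loss_expectancy(sle, aro=0.5)
print(sle, ale)  # 5000.0 2500.0
```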

26
Q

Qualitative risk assessment

A

avoids the complexity of the quantitative approach and is focused on identifying significant risk factors. The qualitative approach seeks out people’s opinions of which risk factors are significant. Assets and risks may be placed in simple categories. For example, assets could be categorized as Irreplaceable, High Value, Medium Value, and Low Value; risks could be categorized as one-off or recurring and as Critical, High, Medium, and Low probability.

Another simple approach is the “Traffic Light” impact grid. For each risk, a simple Red, Yellow, or Green indicator can be put into each column to represent the severity of the risk, its likelihood, cost of controls, and so on. This approach is simplistic but does give an immediate impression of where efforts should be concentrated to improve security.

27
Q

FIPS 199 (https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.199.pdf) discusses how to apply Security Categorizations (SC) to information systems based on the impact that a breach of confidentiality, integrity, or availability would have on the organization as a whole. Potential impacts can be classified as:

A
  • Low—minor damage or loss to an asset or loss of performance (though essential functions remain operational).
  • Moderate—significant damage or loss to assets or performance.
  • High—major damage or loss or the inability to perform one or more essential functions.
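When a single overall categorization is needed, a common convention used with FIPS 199 impact levels is the "high-water mark": take the most severe of the three per-objective ratings. A minimal sketch of that idea (the function and dictionary names are illustrative):

```python
# Impact levels ordered so that max() selects the most severe rating.
SEVERITY = {"low": 1, "moderate": 2, "high": 3}

def overall_categorization(confidentiality: str, integrity: str, availability: str) -> str:
    """High-water mark: the system categorization is the highest of the
    per-objective impact ratings for confidentiality, integrity, and availability."""
    return max((confidentiality, integrity, availability), key=SEVERITY.get)

print(overall_categorization("low", "moderate", "high"))  # 'high'
```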
28
Q

Having performed the asset and threat identification and completed a risk assessment, risk response options can be identified and prioritized. For example, you might focus on the following systems:

A
  • High value assets, regardless of the likelihood of the threat(s).
  • Threats with high likelihood (that is, high ARO).
  • Procedures, equipment, or software that increase the likelihood of threats (for example, legacy applications, lack of user training, old software versions, unpatched software, running unnecessary services, not having auditing procedures in place, and so on).

In theory, security controls or countermeasures could be introduced to address every vulnerability. The difficulty is that security controls can be expensive, so you must balance the cost of the control with the cost associated with the risk.

29
Q

Risk mitigation (or remediation)

A

It is not often possible to eliminate risk; rather, the aim is to mitigate risk factors to the point where the organization is exposed only to a level of risk that it can afford (residual risk). Risk mitigation (or remediation) is the overall process of reducing exposure to or the effects of risk factors. There are several ways of mitigating risk. If you deploy a countermeasure that reduces exposure to a threat or vulnerability, that is risk deterrence (or reduction). Risk reduction refers to controls that can make a risk incident either less likely or less costly (or perhaps both). For example, if fire is a threat, a policy strictly controlling the use of flammable materials on site reduces likelihood, while a system of alarms and sprinklers reduces impact by (hopefully) containing any incident to a small area. Another example is offsite data backup, which provides a remediation option in the event of servers being destroyed by fire.

30
Q

Other risk response strategies are as follows:

A

• Avoidance means that you stop doing the activity that is risk-bearing.

For example, a company may develop an in-house application for managing inventory and then try to sell it. If, while selling it, the application is discovered to have numerous security vulnerabilities that generate complaints and threats of legal action, the company may decide that the cost of maintaining the security of the software is not worth the revenue and withdraw it from sale.

Obviously, this would generate considerable bad feeling amongst existing customers. Avoidance is not often a credible option.

• Transference (or sharing) means assigning risk to a third party (such as an insurance company or a contract with a supplier that defines liabilities). For example, a company could stop in-house maintenance of an e‑commerce site and contract the services to a third party, who would be liable for any fraud or data theft.

Note: In this sort of case, it is relatively simple to transfer the obvious risks, but risks to the company's reputation remain. If a customer's credit card details are stolen because they used your insecure e‑commerce application, the customer won't care whether you or a third party were nominally responsible for security. It is also unlikely that legal liabilities could be completely transferred in this way.

• Acceptance (or retention) means that no countermeasures are put in place either because the level of risk does not justify the cost or because there will be unavoidable delay before the countermeasures are deployed. In this case, you should continue to monitor the risk (as opposed to ignoring it).

31
Q

risk register

A

document showing the results of risk assessments in a comprehensible format. The register may resemble the “traffic light” grid shown earlier with columns for impact and likelihood ratings, date of identification, description, countermeasures, owner/route for escalation, and status. Risk registers are also commonly depicted as scatterplot graphs, where impact and likelihood are each an axis, and the plot point is associated with a legend that includes more information about the nature of the plotted risk. A risk register should be shared between stakeholders (executives, department managers, and senior technicians) so that they understand the risks associated with the workflows that they manage.
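A risk register row can be modeled very simply in code. The sketch below is a hypothetical structure based on the columns described above, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RiskRegisterEntry:
    description: str
    impact: str                 # e.g., Red/Yellow/Green or Critical/High/Medium/Low
    likelihood: str
    identified: date            # date of identification
    countermeasures: list[str] = field(default_factory=list)
    owner: str = ""             # owner/route for escalation
    status: str = "Open"

entry = RiskRegisterEntry(
    description="Loss of trading floor link to partners",
    impact="Red",
    likelihood="Yellow",
    identified=date(2023, 1, 15),
    countermeasures=["Redundant WAN link", "Regular failover testing"],
    owner="Network manager",
)
print(entry)
```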

32
Q

The need to change is often described either as reactive or proactive

A

In order to reduce the risk that changes to configuration items will cause service disruption, a documented change management process can be used to implement changes in a planned and controlled way. The need to change is often described either as reactive, where the change is forced on the organization, or as proactive, where the need for change is initiated internally. Changes can also be categorized according to their impact and level of risk (major, significant, minor, or normal, for instance).

33
Q

Request for Change (RFC) document

A

In a formal change management process, the need for change and the procedure for implementing the change is captured in a Request for Change (RFC) document and submitted for approval. The RFC will then be considered at the appropriate level. This might be a supervisor or department manager if the change is normal or minor. Major or significant changes might be managed as a separate project and require approval through a Change Advisory Board (CAB).

34
Q

Follow these guidelines when putting risk management processes in place:

A
  • Identify mission-essential functions and the critical systems within each function.
  • Identify those assets supporting business functions and critical systems, and determine their values.
  • Calculate MTD, RPO, RTO, MTTF, MTTR, and MTBF for functions and assets.
  • Look for possible vulnerabilities that, if exploited, could adversely affect each function or system.
  • Determine potential threats to functions and systems.
  • Determine the probability or likelihood of a threat exploiting a vulnerability.
  • Determine the impact of the potential threat, whether it be recovery from a failed system or the implementation of security controls that will reduce or eliminate risk.
  • Identify impact scenarios that put your business operations at risk.
  • Identify the risk analysis method that is most appropriate for your organization. For quantitative and semi-quantitative risk analysis, calculate SLE and ARO for each threat, and then calculate the ALE.
  • Identify potential countermeasures, ensuring that they are cost-effective and perform as expected. For example, identify single points of failure and, where possible, establish redundant or alternative systems and solutions.
  • Clearly document all findings discovered and decisions made during the assessment in a risk register.
35
Q

Continuity of Operations Planning (COOP), sometimes referred to as a business continuity plan (BCP)

A

a collection of processes that enable an organization to maintain normal business operations in the face of some adverse event. There are numerous types of events, both natural and man-made, that could disrupt the business and require a continuity effort to be put in place. They may be instigated by a malicious party, or they may come about due to carelessness or negligence on the part of non-malicious personnel. The organization may suffer loss or leakage of data; damage to or destruction of hardware and other physical property; impairment of communications infrastructure; loss of or harm done to personnel; and more. When these negative events become a reality, the organization will need to rely on resiliency and automation strategies to mitigate their effect on day-to-day operations.

36
Q

single points of failure

A

Computer systems require protection from hardware failure, software failure, and system failure (failure of network connectivity devices, for instance).

When implementing a network, the goal will always be to minimize the single points of failure and to allow ongoing service provision despite a disaster. To perform IT Contingency Planning (ITCP), think of all the things that could fail, determine whether the result would be a critical loss of service, and whether this is unacceptable. Then identify strategies to make the system resilient. How resilient a system is can be determined by measuring or evaluating several properties.

37
Q

high availability

A

One of the key properties of a resilient system is high availability. Availability is the percentage of time that the system is online, measured over a defined period (typically one year). The corollary of availability is downtime (that is, the percentage or amount of time during which the system is unavailable). The maximum tolerable downtime (MTD) metric states the requirement for a particular business function. High availability is usually loosely described as 24x7 (24 hours per day, 7 days per week) or 24x365 (24 hours per day, 365 days per year). For a critical system, availability will be described as anything from “two-nines” (99%) up to five-nines (99.999%) or six-nines (99.9999%).
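Each availability figure translates directly into an annual downtime allowance. A quick calculation, assuming a 365-day year:

```python
def max_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year (in minutes) for a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability_percent / 100)

for nines in (99.0, 99.9, 99.99, 99.999, 99.9999):
    print(f"{nines}% -> {max_downtime_minutes(nines):,.2f} minutes/year")
# 99% allows ~5,256 minutes (about 3.7 days) of downtime per year,
# while 99.999% allows only ~5.26 minutes.
```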

38
Q

fault tolerant

A

A system that can experience failures and continue to provide the same (or nearly the same) level of service is said to be fault tolerant. Fault tolerance is often achieved by provisioning redundancy for critical components and single points of failure. A redundant component is one that is not essential to the normal function of a system but that allows the system to recover from the failure of another component.

39
Q

Examples of devices and solutions that provide fault tolerance include the following:

A
  • Redundant components (power supplies, network cards, drives (RAID), and cooling fans) provide protection against hardware failures. Hot swappable components allow for easy replacement (without having to shut down the server).
  • Uninterruptible Power Supplies (UPS) and Standby Power Supplies.
  • Backup strategies—provide protection for data.
  • Cluster services are a means of ensuring that the total failure of a server does not disrupt services generally.

While these computer systems are important, thought also needs to be given to how to make a business "fault tolerant" in terms of staffing, utilities (heat, power, communications, transport), customers, and suppliers.

40
Q

Scalability

A

A resilient system must be able to cope not just with faults and outages but also with changing demand levels. These properties are measured as scalability and elasticity.

Scalability means that the costs involved in supplying the service to more users are linear. For example, if the number of users doubles in a scalable system, the costs to maintain the same level of service would also double (or less than double). If costs more than double, the system is less scalable.

To scale out is to add more resources in parallel with existing resources. To scale up is to increase the power of existing resources.

41
Q

Elasticity

A

A resilient system must be able to cope not just with faults and outages but also with changing demand levels. These properties are measured as scalability and elasticity.

Elasticity refers to the system's ability to handle changes in demand in real time. A system with high elasticity will not experience loss of service or performance if demand suddenly doubles (or triples, or quadruples). Conversely, it may be important for the system to be able to reduce costs when demand is low. Elasticity is a common selling point for cloud services. Instead of running a cloud resource for 24 hours a day, 7 days a week, that resource can diminish in power or shut down completely when demand for that resource is low. When demand picks up again, the resource will grow in power to the level required. This results in cost-effective operations.

42
Q

Distributive allocation

A

refers to the ability to switch between available processing and data resources to meet service requests. This is typically achieved using load balancing services during normal operations or automated failover during a disaster.

43
Q

Redundant Array of Independent Disks (RAID)

A

many disks can act as backups for each other to increase reliability and fault tolerance. If one disk fails, the data is not lost, and the server can keep functioning. The RAID Advisory Board defines RAID levels, numbered from 0 to 6, where each level corresponds to a specific type of fault tolerance. There are also proprietary and nested RAID solutions. Some of the most commonly implemented types of RAID are listed in the following cards.

44
Q

RAID Level 0

A

Striping without parity (no fault tolerance). This means that data is written in blocks across several disks simultaneously. This can improve performance, but if one disk fails, so does the whole volume and data on it will be corrupted.

45
Q

RAID Level 1

A

Mirroring—Data is written to two disks simultaneously, providing redundancy (if one disk fails, there is a copy of data on the other). The main drawback is that storage efficiency is only 50%.

46
Q

RAID Level 5

A

Striping with parity—Data is written across three or more disks, but additional information (parity) is calculated. This allows the volume to continue if one disk is lost. This solution has better storage efficiency than RAID 1.

47
Q

RAID Level 6

A

Double parity or level 5 with an additional parity stripe. This allows the volume to continue when two disks have been lost.

48
Q

Nested (0+1, 1+0, 5+0)

A

Nesting RAID sets generally improves performance or redundancy (for example, some nested RAID solutions can support the failure of more than one disk).
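One way to compare these levels is to compute storage efficiency (usable capacity as a fraction of raw capacity) for an array of n equal-sized disks. A sketch based on the standard formulas (the function name is illustrative):

```python
def storage_efficiency(level: int, disks: int) -> float:
    """Usable fraction of raw capacity for common RAID levels."""
    if level == 0:                    # striping, no redundancy
        return 1.0
    if level == 1 and disks == 2:     # mirroring
        return 0.5
    if level == 5 and disks >= 3:     # one disk's worth of parity
        return (disks - 1) / disks
    if level == 6 and disks >= 4:     # two disks' worth of parity
        return (disks - 2) / disks
    raise ValueError("unsupported level/disk combination")

print(storage_efficiency(5, disks=4))  # 0.75 -> 75% usable, survives 1 disk failure
print(storage_efficiency(6, disks=4))  # 0.50 -> 50% usable, survives 2 disk failures
```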

49
Q

multiple paths

A

Network cabling should be designed to allow for multiple paths between the various servers, so that during a failure of one part of the network, the rest remains operational (redundant connections). Routers make good fault tolerance devices because they can communicate system failures to one another and route IP packets via an alternate path.

50
Q

automated courses of action

A

There are very few parts of IT infrastructure that cannot be automated through some sort of code (either a program or a script). Technologies such as Software Defined Networking (SDN), virtualization, and DevOps make it possible to provision network links and server systems through programming and scripting. This means that a resiliency strategy can specify automated courses of action that can work to maintain or to restore services with minimal human intervention or even no intervention at all.

51
Q

continuous monitoring

A

An automation solution will have a system of continuous monitoring to detect service failures and security incidents. Continuous monitoring might use a locally installed agent or heartbeat protocol or may involve checking availability remotely. As well as monitoring the primary site, it is important to observe the failover components to ensure that they are recovery ready. You can also automate the courses of action that a monitoring system takes, like configuring an IPS to automatically block traffic that it deems suspicious.
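One very simple form of remote availability check is a TCP connection test against a service port. A minimal sketch (the host, port, and polling interval are placeholders; real monitoring tools add scheduling, alerting, and automated failover on top of checks like this):

```python
import socket
import time

def service_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:  # rudimentary monitoring loop
    if not service_is_up("app.example.com", 443):
        print("Service check failed; raise an alert or trigger failover")
    time.sleep(30)
```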

52
Q

When provisioning a new or replacement instance automatically, the automation system may use one of two types of mastering instructions:

A
  • Master image—this is the “gold” copy of a server instance, with the OS, applications, and patches all installed and configured. This is faster than using a template, but keeping the image up to date can involve more work than updating a template.
  • Template—similar to a master image, this is the build instructions for an instance. Rather than storing a master image, the software may build and provision an instance according to the template instructions.

Another important process in automating resiliency strategies is to provide configuration validation. This process ensures that a recovery solution is working at each layer (hardware, network connectivity, data replication, and application). An automation solution for incident and disaster recovery will have a dashboard of key indicators and may be able to evaluate metrics such as compliance with RPO and RTO from observed data.

53
Q

non-persistence

A

When recovering systems, it may be necessary to ensure that any artifacts from the disaster, such as malware or backdoors, are removed when reconstituting the production environment. This can be facilitated in an environment designed for non-persistence. Non-persistence means that any given instance is completely static in terms of processing function. Data is separated from the instance so that it can be swapped out for an “as new” copy without suffering any configuration problems.

54
Q

There are various mechanisms for ensuring non-persistence:

A
  • Snapshot/revert to known state—This is a saved system state that can be reapplied to the instance.
  • Rollback to known configuration—A physical instance might not support snapshots but has an “internal” mechanism for restoring the baseline system configuration, such as Windows System Restore.
  • Live boot media—Another option is to use an instance that boots from read-only storage to memory rather than being installed on a local read/write hard disk.
55
Q

Follow these guidelines when developing a Continuity of Operations Plan (COOP):

A
  • Be aware of the different ways your business could be threatened.
  • Implement an overall business continuity process in response to real events.
  • Ensure the continuity planning is comprehensive and addresses all critical dimensions of the organization.
  • Draft an IT contingency plan to ensure that IT procedures continue after an adverse event.
  • Ensure that IT personnel are trained on this plan.
  • Incorporate failover techniques into continuity planning.
  • Ensure that systems are highly available and meet an adequate level of performance.
  • Ensure that critical systems have redundancy to mitigate loss of data and resources due to adverse events.
  • Ensure that critical systems are fault tolerant so that service disruption is minimized in the event of failure or compromise.
  • Ensure that systems are adequately scalable and can meet the long-term increase in demand as the business grows.
  • Ensure that systems are elastic and can meet the short-term increase and decrease in resource demands.
  • Consider consolidating multiple storage devices in a RAID for redundancy and fault tolerance.
  • Choose the RAID level that provides the appropriate level of redundancy and fault tolerance for your business needs.
  • Supplement manual security processes with automated processes in order to increase efficiency and accuracy.
  • Consider incorporating non-persistent virtual infrastructure to more easily maintain baseline security.
56
Q

Continuity of Operation Planning (COOP)

A

As you have seen, part of Continuity of Operation Planning (COOP) is to provision fault tolerant systems that provide high availability through redundancy and failover. This sort of well-engineered system will hopefully be resilient to most types of fault and allow any recovery or maintenance operations to be performed in the background.

57
Q

alternate processing sites or recovery sites

A

Providing redundant devices and spares or configuring a server cluster on the local network allows the redundant systems to be swapped in if existing systems fail. Enterprise-level networks often also provide for alternate processing sites or recovery sites. A site is another location that can provide the same (or similar) level of service. An alternate processing site might always be available and in use, while a recovery site might take longer to set up or only be used in an emergency.

58
Q

failover

A

Operations are designed to failover to the new site until the previous site can be brought back online. Failover is a technique that ensures a redundant component, device, application, or site can quickly and efficiently take over the functionality of an asset that has failed. For example, load balancers provide failover in the event that one or more servers or sites behind the load balancer are down or are taking too long to respond. Once the load balancer detects this, it will redirect inbound traffic to an alternate processing server or site. Thus, redundant servers in the load balancer pool ensure there is no interruption of service.

59
Q

hot site

A

Recovery sites are referred to as being hot, warm, or cold. A hot site can failover almost immediately. It generally means that the site is already within the organization’s ownership and is ready to deploy.

60
Q

cold site

A

A cold site takes longer to set up (up to a week), and a warm site is something between the two.

61
Q

warm site

A

A warm site could be similar to a hot site, but with the requirement that the latest data set will need to be loaded.

62
Q

subscription service

A

Clearly, providing redundancy on this scale can be very expensive. Sites are often leased from service providers, such as Comdisco or IBM (a subscription service).

63
Q

reciprocal arrangements

A

Another option is for businesses to enter into reciprocal arrangements to provide mutual support. This is cost effective but complex to plan and set up.

Another issue is that creating a duplicate of anything doubles the complexity of securing that resource properly. The same security procedures must apply to redundant sites, spare systems, and backup data as apply to the main copy.

64
Q

In terms of choosing the location of an alternate processing or recovery site, the following factors are important geographic considerations:

A
  • location selection
  • distance and replication
  • legal implications/data sovereignty
65
Q

location selection

A

Choosing the location for a processing facility or data center requires considering multiple factors. A geographically remote site has advantages in terms of deterring and detecting intruders. It is much easier to detect suspicious activity in a quiet, remote environment than it is in a busy, urban one. On the other hand, a remote location carries risks. Infrastructure (electricity, heating, water, telecommunications, and transport links) may not be as reliable and require longer to repair. Recruitment and retention of skilled employees can also be more difficult.

In many locations, flooding is the most commonly encountered natural disaster hazard. Rising sea levels and changing rainfall patterns mean that previously safe areas can become subject to flood risks within just a few years. Without spending a lot of money on a solution, common-sense measures can be taken to minimize the impact of flood. If possible, the computer equipment and cabling should be positioned above the ground floor and away from major plumbing.

Certain local areas may also be subject to specific known hazards, such as earthquakes, volcanoes, and storms. If there is no other choice as to location, natural disaster risks such as this can often be mitigated by building designs that have been developed to cope with local conditions.

66
Q

distance and replication

A

As well as being a suitable location for a data processing center, you must also consider the distance between the primary site and the secondary (alternate or recovery) site. Determining the optimum distance between two replicating sites depends on evaluating competing factors:

  • Locating the alternate site a short distance from the primary site—in the same city, for example—makes it easier for personnel at the primary site to resume operations at the recovery site, or to physically transfer data from the backup site to the primary site.
  • If the sites are too close together (within about 500km), they could both be affected by the same disaster. For example, the entire Southeastern United States is susceptible to hurricane season. To avoid a disaster resulting from a hurricane, an organization with a primary site in Florida may choose to keep a recovery site in a different part of the country.
  • The farther apart the sites are, the costlier replication will be. Replication is the process of duplicating data between different servers or sites. RAID mirroring and server clustering are examples of disk-to-disk and server-to-server replication. Replication can be either synchronous or asynchronous. Synchronous replication means that the data must be written at both sites before it can be considered committed. Asynchronous replication means that data is committed at the primary site first and then mirrored to the secondary site. Disk-to-disk and server-to-server replication are relatively simple to accomplish, as they can use direct access RAID or local network technologies. Site-to-site replication is considerably harder and more expensive, as it relies on Wide Area Network technologies. Synchronous replication is particularly sensitive to distance, as the longer the communications pathway, the greater the latency of the link. Latency can be mitigated by provisioning fiber optic links.
67
Q

legal implications/data sovereignty

A

For an organization handling cross-border transactions, there is the need to respect the national laws affecting privacy and data processing in the country in which a site is located. A different state or country will likely have its own specific laws and regulations that your data will be subject to. You may be forced to apply different data retention practices than what you're used to at your primary site or other local alternate sites. Aside from the direct legal implications, you must also consider the concept of data sovereignty. Data sovereignty describes the sociopolitical outlook of a nation concerning computing technology and information. Some nations may respect data privacy more or less than others; and likewise, some nations may disapprove of the nature and content of certain data. They may even be suspicious of security measures such as encryption. There might be data sovereignty implications for cloud services, for replicating sites, and for data backups and archiving, if data is copied from one country to another.

68
Q

order of restoration

A

If a site suffers an uncontrolled outage, in ideal circumstances processing will be switched to the alternate site and the outage can be resolved without any service interruption. If an alternate processing site is not available, then the main site must be brought back online as quickly as possible to minimize service disruption. This does not mean that the process can be rushed, however. A complex facility such as a data center or campus network must be reconstituted according to a carefully designed order of restoration. If systems are brought back online in an uncontrolled way, there is the serious risk of causing additional power problems or of causing problems in the network, OS, or application layers because dependencies between different appliances and servers have not been met.

69
Q

In very general terms, the order of restoration will be as follows:

A
  1. Enable and test power delivery systems (grid power, Power Distribution Units (PDUs), UPS, secondary generators, and so on).
  2. Enable and test switch infrastructure, then routing appliances and systems.
  3. Enable and test network security appliances (firewalls, IDS, proxies).
  4. Enable and test critical network servers (DHCP, DNS, NTP, and directory services).
  5. Enable and test backend and middleware (databases and business logic). Verify data integrity.
  6. Enable and test front-end applications.
  7. Enable client workstations and devices and client browser access.
70
Q

alternate business practice

A

An alternate business practice will allow the information flow to resume to at least some extent. A typical fallback plan is to handle transactions using pen-and-paper systems. This type of fallback can work only if it is well planned, though. Staff must know how to use the alternate system—what information must be captured (supply standard forms) and to whom it should be submitted (and how, if there are no means of electronic delivery). Alternate business practices can only work if the information flow is well-documented and there are not too many complex dependencies on gathering and processing the data.

71
Q

Succession planning

A

As well as risks to systems, a COOP has to take on the macabre issue of human capital resilience. Put bluntly, this means “Is someone else available to fulfill the same role if an employee is incapacitated?” Succession planning targets the specific issue of leadership and senior management. Most business continuity and DR plans are heavily dependent on a few key people to take charge during the disaster and ensure that the plan is put into effect. Succession planning ensures that these sorts of competencies are widely available to an organization.

72
Q

All COOP and DR planning makes use of backups, of one type or another. The execution and frequency of backups must be carefully planned and guided by policies. Data retention needs to be considered in the short and long term:

A
  • In the short term, files that change frequently might need retaining for version control. Short-term retention is also important in recovering from malware infection. Consider the scenario where a backup is made on Monday, a file is infected with a virus on Tuesday, and when that file is backed up later on Tuesday, the copy made on Monday is overwritten. This means that there is no good means of restoring the uninfected version of the file. Short-term retention is determined by how often the youngest media sets are overwritten.
  • In the long term, data may need to be stored to meet legal requirements or to comply with company policies or industry standards. Any data that must be retained in a particular version past the oldest sets should be moved to archive storage.
73
Q

recovery window

A

For these reasons, backups are kept back to certain points in time. As backups take up a lot of space, and there is never limitless storage capacity, this introduces the need for storage management routines and techniques to reduce the amount of data occupying backup storage media while giving adequate coverage of the required recovery window. The recovery window is determined by the Recovery Point Objective (RPO), which is determined through business continuity planning. Advanced backup software can prevent media sets from being overwritten in line with the specified retention policy.

74
Q

three different backup types.

A

Full

  • Data Selection: All selected data regardless of when it was previously backed up
  • Backup/Restore Time: High/low (one tape set)
  • Archive Attribute: Cleared

Incremental

  • Data Selection: New files and files modified since the last backup
  • Backup/Restore Time: Low/high (multiple tape sets)
  • Archive Attribute: Cleared

Differential

  • Data Selection: All data modified since the last full backup
  • Backup/Restore Time: Moderate/moderate (no more than two sets)
  • Archive Attribute: Not cleared

75
Q

Backup/Restore

A

The criteria for determining which method to use is based on the time it takes to restore versus the time it takes to back up. Assuming a backup is performed every working day, an incremental backup only includes files changed during that day, while a differential backup includes all files changed since the last full backup. Incremental backups save backup time but can be more time-consuming when the system must be restored. The system must be restored from the last full backup set and then from each incremental backup that has subsequently occurred. A differential backup system only involves two tape sets when restoration is required. Doing a full backup on a large network every day takes a long time. A typical strategy for a complex network would be a full weekly backup followed by an incremental or differential backup at the end of each day.
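The trade-off can be made concrete by counting the media sets a restore needs. A sketch assuming a weekly full backup followed by daily backups (the scheme names and the four-days-after-full scenario are illustrative):

```python
def sets_needed(scheme: str, days_since_full: int) -> int:
    """Number of media sets needed for a full restore under each scheme."""
    if scheme == "incremental":
        # The full set plus every daily incremental made since it.
        return 1 + days_since_full
    if scheme == "differential":
        # The full set plus only the most recent differential.
        return 2 if days_since_full else 1
    raise ValueError("unknown scheme")

# Failure on Thursday, four daily backups after Sunday's full backup:
print(sets_needed("incremental", days_since_full=4))   # 5 sets to restore
print(sets_needed("differential", days_since_full=4))  # 2 sets to restore
```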

76
Q

Snapshots

A

means of getting around the problem of open files. If the data that you’re considering backing up is part of a database, such as SQL data or a messaging system, such as Exchange, then the data is probably being used all the time. Often copy-based mechanisms will be unable to back up open files. Short of closing the files, and so too the database, a copy-based system will not work.

A snapshot is a point-in-time copy of data maintained by the file system. A backup program can use the snapshot rather than the live data to perform the backup. In Windows, snapshots are provided for on NTFS volumes by the Volume Shadow Copy Service (VSS). They are also supported on Sun’s ZFS file system, and under some enterprise distributions of Linux.

77
Q

backup storage issues

A

Backed up and archived data need to be stored as securely as “live” data. A data backup has the same confidentiality and integrity requirements as its source. Typically, backup media is physically secured against theft or snooping by keeping it in a restricted part of the building, with other server and network equipment. Many backup solutions also use encryption to ensure data confidentiality should the media be stolen.

Additionally, you must plan for events that could compromise both the live data and the backup set. Natural disasters, such as fires, earthquakes, and floods, could leave an organization without a data backup unless it has kept a copy offsite. Offsite storage is obviously difficult to keep up to date.

Without a network that can support the required bandwidth, the offsite media must be physically brought onsite (and if there is no second set of offsite media, the data is at substantial risk during this time), the latest backup performed, and the media then removed to offsite storage again. Quite apart from the difficulty and expense of doing this, there are data confidentiality and security issues in transporting the data.

78
Q

disaster recovery plans (DRPs)

A

Within the scope of business continuity planning, disaster recovery plans (DRPs) describe the specific procedures to follow to recover a system or site to a working state. A disaster could be anything from a loss of power or failure of a minor component to man-made or natural disasters, such as fires, earthquakes, or acts of terrorism.

79
Q

The DRP should accomplish the following:

A
  • Identify scenarios for natural and non-natural disasters and options for protecting systems. Plans need to account for risk (a combination of the likelihood that the disaster will occur and its possible impact on the organization) and cost. There is no point implementing disaster recovery plans that financially cripple the organization. The business case is made by comparing the cost of recovery measures against the cost of downtime, which is calculated from lost revenues and ongoing costs (principally salary). The recovery spend should not generally exceed the downtime cost. Of course, downtime will also involve costs that are hard to quantify, such as loss of customer goodwill, restitution for not meeting service contracts, and so on.
  • Identify tasks, resources, and responsibilities for responding to a disaster.
  • Who is responsible for doing what? How can they be contacted? What happens if they are not available?
  • Which functions are most critical? Where should effort first be concentrated?
  • What resources are available? Should they be pre-purchased and held in stock? Will the disaster affect availability of supplies?
  • What are the timescales for resumption of normal operations?
  • Train staff in the disaster planning procedures and how to react well to change.

As well as restoring systems, the disaster recovery plan should identify stakeholders who need to be informed about any security incidents. There may be a legal requirement to inform the police, fire service, or building inspectors about any safety-related or criminal incidents. If third-party or personal data is lost or stolen, the data subjects may need to be informed. If the disaster affects services, customers need to be informed about the time-to-fix and any alternative arrangements that can be made.

80
Q

It is necessary to test disaster recovery procedures. There are four means of doing this:

A
  • Walkthroughs, workshops, and orientation seminars—often used to provide basic awareness and training for disaster recovery team members, these exercises describe the contents of DRPs, and other plans, and the roles and responsibilities outlined in those plans.
  • Tabletop exercises—staff “ghost” the same procedures as they would in a disaster, without actually creating disaster conditions or applying or changing anything. These are simple to set up but do not provide any sort of practical evidence of things that could go wrong, time to complete, and so on.
  • Functional exercises—action-based sessions where employees can validate DRPs by performing scenario-based activities in a simulated environment.
  • Full-scale exercises—action-based sessions that reflect real situations, these exercises are held onsite and use real equipment and real personnel as much as possible. Full-scale exercises are often conducted by public agencies, but local organizations might be asked to participate.
81
Q

After-Action Report (AAR)

A

Also identify timescales for disaster plans to be reviewed, to take account of changing circumstances and business needs. Following an incident, it is vital to hold a review meeting to analyze why the incident occurred, what could have been done to prevent it, and how effective the response was.

An After-Action Report (AAR), or "lessons learned" report, documents how effective COOP (Continuity of Operations) and DR planning and resources were. An AAR would be commissioned after DR exercises or after an actual incident. Ideally, someone will be delegated the task of recording actions taken and making notes about the progress of the exercise or incident. This is obviously easier in an exercise than in a real-life incident, though.

The next phase would be to have a post-incident or exercise meeting to discuss implementation of the lessons learned. It is vital that all staff are able to contribute freely and openly to the discussion, so these meetings must avoid apportioning blame and focus on improving procedures. If there are disciplinary concerns in terms of not following procedure, those should be dealt with separately.

The delegated person (or persons) will then complete a report containing a history of the incident, impact assessment, and recommendations for upgrading resources or procedures.

82
Q

Follow these guidelines when selecting business continuity and disaster recovery processes:

A
  • Implement disaster recovery to restore IT operations after a major adverse event.
  • Form a recovery team with multiple job roles and responsibilities.
  • Follow a disaster recovery process from notifying stakeholders to actually beginning recovery.
  • Ensure the DRP includes alternate sites, asset inventory, backup procedures, and other critical information.
  • Ensure that recovery processes are secure from attack or other compromise.
  • Consider maintaining alternate recovery sites to quickly restore operations when the main site is compromised.
  • Choose between a hot, warm, and cold site depending on your business needs and means.
  • Determine an order of restoration to get business-critical systems back online first.
  • Incorporate alternate business practices into the BCP if necessary.
  • Draft a succession plan in case personnel are not available to put the DRP into effect.
  • Choose a data backup type that meets your speed, reliability, and storage needs.
  • Ensure that backups are stored in a secure location.
  • Consider the security implications of maintaining multiple backups.
  • Regularly test the integrity of your backups.
  • Consider placing backups offsite to mitigate damage to a particular location.
  • Be aware of the advantages and disadvantages of close vs. distant backup sites.
  • Research the legal and data sovereignty issues affecting regions where your backup sites are located.
  • Conduct testing exercises to prepare personnel for executing the DRP.
  • Draft AARs to learn from your successes and mistakes.
  • Ask yourself key questions about the event to identify areas for improvement.
  • Modify the DRP as needed in response to lessons learned.
83
Q

forensics

A

Computer forensics is the practice of collecting evidence from computer systems to a standard that will be accepted in a court of law. It is unlikely that a computer forensic professional will be retained by an organization, so such investigations are normally handled by law enforcement agencies. In some cases, however, an organization may conduct a forensic investigation without the expectation of legal action.

Law enforcement agencies will prioritize the investigation of the crime over business continuity. This can greatly compromise the recovery process, especially in smaller businesses, as an organization’s key assets may be taken as evidence.

84
Q

electronically stored information (ESI)

A

Like DNA or fingerprints, digital evidence—often referred to as electronically stored information (ESI)—is mostly latent. Latent means that the evidence cannot be seen with the naked eye; rather, it must be interpreted using a machine or process. Forensic investigations are most likely to be launched against crimes arising from insider threats, notably fraud or misuse of equipment (to download or store obscene material, for instance). Prosecuting external threat sources is often extremely difficult, as the attacker may well be in a different country or have taken effective steps to disguise his or her location and identity. Such prosecutions are normally initiated by law enforcement agencies, where the threat is directed against military or governmental agencies or is linked to organized crime. Cases can take years to come to trial.

85
Q

Due process

A

term used in US and UK common law to require that people only be convicted of crimes following the fair application of the laws of the land. More generally, due process can be understood to mean having a set of procedural safeguards to ensure fairness. This principle is central to forensic investigation. If a forensic investigation is launched (or if one is a possibility), it is important that technicians and managers are aware of the processes that the investigation will use. It is vital that they are able to assist the investigator and that they not do anything to compromise the investigation. In a trial, defense counsel will try to exploit any uncertainty or mistake regarding the integrity of evidence or the process of collecting it.

The first response period following detection and notification is often critical. To gather evidence successfully, it is vital that staff do not panic or act without thinking.

86
Q

Legal hold

A

refers to the requirement that information that may be relevant to a court case be preserved. Information subject to legal hold might be defined by regulators or industry best practice, or there may be a litigation notice from law enforcement or from lawyers pursuing a civil action. This means that computer systems may be taken as evidence, with all the obvious disruption to a network that this entails.

87
Q

eDiscovery

A

a means of filtering the relevant evidence produced from all the data gathered by a forensic examination and storing it in a database in a format such that it can be used as evidence in a trial. eDiscovery software tools have been produced to assist this process.

88
Q

Some of the functions of eDiscovery suites are:

A
  • Identify and de-duplicate files and metadata—many files on a computer system are "standard" installed files or copies of the same file. eDiscovery filters these types of files, reducing the volume of data that must be analyzed (see the hash-based sketch after this list).
  • Search—allow investigators to locate files of interest to the case. As well as keyword search, software might support semantic search. Semantic search matches keywords if they correspond to a particular context.
  • Security—at all points evidence must be shown to have been stored, transmitted, and analyzed without tampering.
  • Disclosure—an important part of trial procedure is that the same evidence be made available to both plaintiff and defendant. eDiscovery can fulfill this requirement. Recent court cases have required parties to a court case to provide searchable ESI rather than paper records.
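
As a minimal sketch of the de-duplication step, the Python fragment below hashes file contents so that identical copies collapse to a single entry; the evidence path is an assumption, and real eDiscovery suites also compare metadata and known-file hash sets.

import hashlib
import os

def file_digest(path):
    # hash the file in chunks so large evidence files fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(root):
    seen = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            # identical content produces the same digest, so only the
            # first copy of each file is kept for review
            seen.setdefault(file_digest(path), path)
    return list(seen.values())

unique_files = deduplicate("/evidence/image_mount")  # assumed mount point
print(len(unique_files), "unique files to review")
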
89
Q

document the scene

A

The first phase of a forensic investigation is to document the scene. The crime scene must be thoroughly documented using photographs and ideally audio and video. Investigators must record every action they take in identifying, collecting, and handling evidence.

Note: Remember that if the matter comes to trial, the trial could take place months or years after the event. It is vital to record impressions and actions in notes.

If possible, evidence is gathered from the live system (including screenshots of display screens and the contents of cache and system memory) using forensic software tools. It is vital that these tools do nothing to modify the digital data that they capture.

Note: Also consider that in-place CCTV systems or webcams might have captured valuable evidence.

As well as digital evidence, an investigator should interview witnesses to establish what they were doing at the scene, whether they observed any suspicious behavior or activity, and also to gather information about the computer system. An investigator might ask questions informally and record the answers as notes to gain an initial understanding of the circumstances surrounding an incident. An investigator must ask questions carefully, to ensure that the witness is giving reliable information and to avoid leading the witness to a particular conclusion. Making an audio or video recording of witness statements produces a more reliable record but may make witnesses less willing to make a statement. If a witness needs to be compelled to make a statement, there will be legal issues around employment contracts (if the witness is an employee) and right to legal representation.

90
Q

The general principle is to capture evidence in the order of volatility, from more volatile to less volatile. RFC 3227 sets out the general order as follows:

A
  • CPU registers and cache memory (including cache on disk controllers, GPUs, and so on).
  • Routing table, ARP cache, process table, kernel statistics.
  • Memory (RAM).
  • Temporary file systems.
  • Disk.
  • Remote logging and monitoring data.
  • Physical configuration and network topology.
  • Archival media.
91
Q

Coordinated Universal Time (UTC)

A

Different operating systems and file systems use different methods to identify the time at which something occurred. The benchmark time is Coordinated Universal Time (UTC), which is essentially the time at the Greenwich meridian. Local time is the time within a particular time zone, which will be offset from UTC by a number of hours (or, in some cases, half hours). The local time offset may also vary if seasonal daylight saving time is in place.

NTFS uses UTC internally, but many operating systems and file systems record timestamps as the local system time. When collecting evidence, it is vital to establish how a timestamp is calculated and to note the offset between the local system time and UTC.

Forensics also needs to consider that a computer’s system clock may not be properly synchronized to a valid time source or may have been tampered with. Most computers are configured to synchronize the clock to a Network Time Protocol (NTP) server. Closely synchronized time is important for authentication and audit systems to work properly. The right to modify a computer’s time would normally be restricted to administrator-level accounts (on enterprise networks) and time change events should be logged.
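
A minimal Python sketch of normalizing a local timestamp to UTC; the UTC-5 offset is an assumption for the example, and in practice the investigator must establish the actual offset and daylight saving rules in force on the evidence system.

from datetime import datetime, timezone, timedelta

# Example local timestamp recorded at an assumed UTC-5 offset.
local_zone = timezone(timedelta(hours=-5))
local_stamp = datetime(2023, 5, 4, 14, 30, tzinfo=local_zone)

# Normalize to UTC so timestamps from different systems can be compared.
utc_stamp = local_stamp.astimezone(timezone.utc)
print(utc_stamp.isoformat())  # 2023-05-04T19:30:00+00:00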

92
Q

Retrospective Network Analysis (RNA) solution

A

On a typical network, sensor and logging systems are not configured to record all network traffic, as this would generate a very considerable amount of data. There are certainly protocol analyzers that can do this job, but few organizations would deploy them continually. Most network appliances, such as firewalls and IDS, do log events, and these are likely to be valuable evidence of an intrusion or security breach. On the other hand, an organization with sufficient IT resources could choose to preserve a huge amount of data. A Retrospective Network Analysis (RNA) solution provides the means to record network events at either a packet header or payload level.

As well as being used in a legal process, forensics has a role to play in cybersecurity. It enables the detection of past intrusions or ongoing but unknown intrusions by close examination of available digital evidence. A famous quote attributed to former Cisco CEO John Chambers illustrates the point: "There are two types of companies: those that have been hacked, and those who don't know they have been hacked." Counterintelligence is the process of information gathering to protect against espionage and hacking. In terms of cybersecurity, most counterintelligence information comes from activity and audit logs generated by network appliances and server file systems. Analysis of adversary Tactics, Techniques, and Procedures (TTPs) provides information about how to configure and audit active logging systems so that they are most likely to capture evidence of attempted and successful intrusions.

93
Q

Image acquisition

A

the process of obtaining a forensically clean copy of data from a device held as evidence. An image can be acquired from either volatile or non-volatile storage.

94
Q

write blocker

A

To obtain a forensically sound image from non-volatile storage, you need to ensure that nothing you do alters data or metadata (properties) on the source disk or file system. A write blocker assures this process by preventing any data on the disk or volume from being changed by filtering write commands at the driver and OS level. Mounting a drive as read-only is insufficient.

A write blocker can be implemented as a hardware device or as software running on the forensics workstation. For example, the CRU Forensic UltraDock write blocker appliance supports ports for all main host and drive adapter types. It can securely interrogate hard disks to recover file system data, firmware status information, and data written to Host Protected Areas (HPA) and Device Configuration Overlay (DCO) areas. HPA is used legitimately with boot and diagnostic utilities. A DCO is normally used with RAID systems to make different drive models expose the same number of sectors to the OS. Both these areas can be misused to conceal data or malware.

95
Q

cryptographic hash or fingerprint

A

A critical step in the presentation of evidence will be to demonstrate that analysis has been performed on an image of the data that is identical to the data present on the disk and that neither data set has been tampered with. The standard means of proving this is to create a cryptographic hash or fingerprint of the disk contents and of the image subsequently made of it.
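
A minimal sketch of creating such a fingerprint with Python's standard hashlib module; the image filename is an assumption. The same digest would be computed for the source media (behind a write blocker) and for the image, and the two values must match.

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # read in chunks so multi-gigabyte images do not exhaust memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("evidence_disk.img"))  # record this value in the case notes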

96
Q

imaging

A

Once the target disk has been safely attached to the forensics workstation and verified by generating a cryptographic hash of the contents, the next task is to use an imaging utility to obtain a sector-by-sector copy of the disk contents (a forensic duplicate).
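
In outline, a sector-by-sector copy is a straight block copy of the raw device. The Python sketch below is illustrative only: the device path is an assumption, raw access requires privileges, and a hardware write blocker should sit in front of the source; dedicated imaging utilities also handle bad sectors and produce audit logs.

import hashlib

def image_disk(source, target, block_size=512 * 1024):
    # copy the raw device block by block, fingerprinting the source
    # as it is read so the copy can be verified afterwards
    h = hashlib.sha256()
    with open(source, "rb") as src, open(target, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            h.update(block)
            dst.write(block)
    return h.hexdigest()

source_hash = image_disk("/dev/sdb", "evidence_disk.img")
# a hash of evidence_disk.img must now match source_hash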

97
Q

Forensic Toolkit (FTK)

A

Forensic procedures are assisted by having an appropriate software toolkit. These are programs that provide secure drive imaging, encryption, and data analysis. There are commercial toolkits, such as EnCase (https://www.guidancesoftware.com/encase-forensic) and AccessData’s Forensic Toolkit (FTK) (https://accessdata.com/products-services/forensic-toolkit-ftk), plus free software, such as Autopsy/The Sleuth Kit (https://www.sleuthkit.org/autopsy).

98
Q

timeline

A

It is vital that the evidence collected at the crime scene conform to a valid timeline. Digital information is susceptible to tampering, so access to the evidence must be tightly controlled.

Depending on the strength of evidence required, physical drives taken from the crime scene can be identified, bagged, sealed, and labeled (using tamper-evident bags). It is also appropriate to ensure that the bags have anti-static shielding to reduce the possibility that data on the electronic media will be damaged or corrupted by electrostatic discharge (ESD). Any other physical evidence deemed necessary is also "bagged and tagged."

99
Q

a chain of custody

A

form records who collected the evidence, when and where it was collected, who subsequently handled it, and where it was stored. The chain of custody must show access to, plus storage and transportation of, the evidence at every point from the crime scene to the courtroom. Anyone handling the evidence must sign the chain of custody and indicate what they were doing with it.
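
As an illustration of the information the form captures, the sketch below models one custody entry as a Python record; the field names and values are invented for the example.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class CustodyEntry:
    evidence_id: str   # e.g., the label on the tamper-evident bag
    handler: str       # who signed for the evidence
    action: str        # "collected", "transported", "analyzed", ...
    location: str
    timestamp: datetime

log = [
    CustodyEntry("HDD-001", "J. Smith", "collected", "Server room 2",
                 datetime(2023, 5, 4, 10, 15)),
    CustodyEntry("HDD-001", "J. Smith", "transported", "Evidence locker",
                 datetime(2023, 5, 4, 11, 40)),
]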

The evidence should be stored in a secure facility; this not only means access control, but also environmental control, so that the electronic systems are not damaged by condensation, ESD, fire, and other hazards. Similarly, if the evidence is transported, the transport must also be secure.

100
Q

forensics report

A

The purpose of a forensic investigation is to produce a forensics report detailing any matters of interest or potential evidence discovered. All analysis should be performed on a copy of the evidence rather than on the original devices or the secure image created at the crime scene. When analyzing information from hard drives taken as evidence (data recovery), one of the most significant challenges is dealing with the sheer volume of information captured. Within the thousands of files and hundreds of gigabytes there may only be a few items that provide incriminating evidence. Forensic analysis tools help to identify what could be of interest to the forensic examiner.

101
Q

Big Data

A

analysis techniques can assist in this process. Big data refers to large stores of unstructured information. Big data analysis tools use search-query-like functions to identify patterns and information of interest within unstructured files, such as documents and spreadsheets.

The contents of the file, plus analysis of the file metadata, including time stamps, can reveal useful information. As well as examining the information on hard drives, big data techniques can also be used to analyze network traffic. Big data analysis tools oriented towards security and computer intrusion detection and forensics will certainly become more widely available over the next few years.

102
Q

data visualization

A

Big data analysis software often includes data visualization tools. Visualization is a very powerful analysis technique for identifying trends or unusual activity. For example, a graph of network activity will reveal unusually high activity from a particular host much more easily than analysis of the raw data packets. A “tag cloud” (a visual representation of how frequently words or phrases appear in a data store) of the information on a hard drive might reveal clues about malicious behavior that could not be found by examining each file individually.
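
The counting step behind a tag cloud is simple word-frequency analysis. A toy Python sketch, where the sample strings stand in for text extracted from a drive:

from collections import Counter
import re

documents = [
    "transfer the funds to the offshore account",
    "delete the transfer records before the audit",
]

words = Counter()
for text in documents:
    # lowercase and split on non-letter characters before counting
    words.update(re.findall(r"[a-z']+", text.lower()))

print(words.most_common(5))  # the most frequent terms across the data set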

Third-party investigators need to keep track of the man-hours spent on the investigation and note incidental expenses as part of the billing process. It is important to establish the overall cost of an incident and its investigation to feed back into risk assessment: it provides quantitative information about the impact of security incidents and the value of security controls. Establishing the true cost of an incident may also be required in a subsequent claim for compensation against the attacker.

103
Q

Follow these guidelines for investigating security incidents:

A
  • Develop or adopt a consistent process for handling and preserving forensic data.
  • Determine if outside expertise is needed, such as a consultant firm.
  • Notify local law enforcement, if needed.
  • Secure the scene, so that the hardware is contained.
  • Collect all the necessary evidence, which may be electronic data, hardware components, or telephony system components.
  • Observe the order of volatility as you gather electronic data from various media.
  • Interview personnel to collect additional information pertaining to the crime.
  • Report the investigation’s findings to the required people.