Lesson 14: Explaining Risk Management and Disaster Recovery Concepts Flashcards
vulnerable business processes
If a company operates with one or more vulnerable business processes, it could result in disclosure, modification, loss, destruction, or interruption of critical data or it could lead to loss of service to customers. Quite apart from immediate financial losses arising from such security incidents, either outcome will damage a company’s reputation. If a bank lost its trading floor link to its partners, even for an hour, the organization’s primary function (trading) would be impossible and huge losses could result. Consequently, when planning a network or other IT system, you must consider the impact of data loss and service unavailability on the organization.
Risk management
process for identifying, assessing, and mitigating vulnerabilities and threats to the essential functions that a business must perform to serve its customers.
Risk management is performed over five phases:
- Identify mission essential functions—mitigating risk can involve a large amount of expenditure, so it is important to focus efforts. Part of risk management is to analyze workflows and identify the mission essential functions that could cause the whole business to fail if they are not performed. Part of this process also involves identifying critical systems and assets that support these functions.
- Identify vulnerabilities—for each function or workflow (starting with the most critical), analyze systems and assets to discover and list any vulnerabilities or weaknesses to which they may be susceptible. Vulnerability refers to a specific flaw or weakness that could be exploited to overcome a security system.
- Identify threats—for each function or workflow, identify the threats that may take advantage of or exploit or accidentally trigger vulnerabilities. Threat refers to the sources or motivations of people and things that could cause loss or damage.
- Analyze business impacts—the likelihood of a vulnerability being activated as a security incident by a threat and the impact of that incident on critical systems give factors for evaluating risks. There are quantitative and qualitative methods of analyzing impacts.
- Identify risk response—for each risk, identify possible countermeasures and assess the cost of deploying additional security controls. Most risks require some sort of mitigation, but other types of response might be more appropriate for certain types and level of risks.
mission essential function (MEF)
one that cannot be deferred. This means that the organization must be able to perform the function as close to continually as possible, and if there is any service disruption, the mission essential functions must be restored first.
Analysis of mission essential functions is generally governed by four main metrics:
- Maximum tolerable downtime (MTD) is the longest period of time that a business function outage may last without causing irrecoverable business failure. Each business process can have its own MTD, such as a range of minutes to hours for critical functions, 24 hours for urgent functions, 7 days for normal functions, and so on. MTDs vary by company and event. Each function may be supported by multiple systems and assets. The MTD sets the upper limit on the amount of recovery time that system and asset owners have to resume operations. For example, an organization specializing in medical equipment may be able to exist without incoming manufacturing supplies for three months because it has stockpiled a sizeable inventory. After three months, the organization will not have sufficient supplies and may not be able to manufacture additional products, therefore leading to failure. In this case, the MTD is three months.
- Recovery time objective (RTO) is the period following a disaster that an individual IT system may remain offline. This represents the amount of time it takes to identify that there is a problem and then perform recovery (restore from backup or switch in an alternative system, for instance).
- Work recovery time (WRT) is the additional time needed following systems recovery to reintegrate different systems, test overall functionality, and brief system users on any changes or different working practices so that the business function is again fully supported.
Note: RTO+WRT must not exceed MTD!
- Recovery point objective (RPO) is the amount of data loss that a system can sustain, measured in time. That is, if a database is destroyed by a virus, an RPO of 24 hours means that the data can be recovered (from a backup copy) to a point not more than 24 hours before the database was infected.
For example, a customer leads database might be able to sustain the loss of a few hours’ or days’ worth of data (the salespeople will generally be able to remember who they have contacted and re-key the data manually). Conversely, order processing may be considered more critical, as any loss will represent lost orders and it may be impossible to recapture web orders or other processes initiated only through the computer system, such as linked records to accounting and fulfilment.
MTD and RPO help to determine which business functions are critical and also to specify appropriate risk countermeasures. For example, if your RPO is measured in days, then a simple tape backup system should suffice; if RPO is zero or measured in minutes or seconds, a more expensive server cluster backup and redundancy solution will be required.
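The relationships between these metrics can be checked mechanically. The following is a minimal sketch, assuming all values are expressed in hours; the class name, function names, and figures are hypothetical and chosen only for illustration. It tests the rule that RTO plus WRT must not exceed MTD, and whether a given backup interval could satisfy an RPO (the worst-case data loss is the gap between backups).

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    """Recovery metrics for one business function, all in hours."""
    mtd: float  # maximum tolerable downtime
    rto: float  # recovery time objective
    wrt: float  # work recovery time
    rpo: float  # recovery point objective (maximum data loss, as time)


def fits_within_mtd(targets: RecoveryTargets) -> bool:
    """True when RTO + WRT does not exceed MTD."""
    return targets.rto + targets.wrt <= targets.mtd


def backup_interval_meets_rpo(targets: RecoveryTargets, interval_hours: float) -> bool:
    """True when the gap between backups is no longer than the RPO."""
    return interval_hours <= targets.rpo


# Hypothetical figures for an order-processing function.
orders = RecoveryTargets(mtd=24, rto=8, wrt=4, rpo=1)
print(fits_within_mtd(orders))                # True: 8 + 4 <= 24
print(backup_interval_meets_rpo(orders, 24))  # False: a daily backup can lose up to 24 hours
```

A failed second check is the signal described above: a short RPO calls for something more frequent than a daily tape backup.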
For most businesses, the most critical functions will be those that enable customers to find them and for the business to interact with those customers. In practical terms, this means telecoms and web presence. Following that is probably the capability to fulfil products and services. Back-office functions such as accounting, HR, and marketing are probably necessary rather than critical.
identification of critical systems
To support the resiliency of mission essential and primary business functions, it is crucial for an organization to perform the identification of critical systems. This means compiling an inventory of its business processes and its tangible and intangible assets and resources. These could include:
- People (employees, visitors, and suppliers).
- Tangible assets (buildings, furniture, equipment and machinery (plant), ICT equipment, electronic data files, and paper documents).
- Intangible assets (ideas, commercial reputation, brand, and so on).
- Procedures (supply chains, critical procedures, standard operating procedures).
It is important to be up to date with best practice and standards relevant to the type of business or organization. This can help to identify procedures or standards that are not currently being implemented but should be. Make sure that the asset identification process captures system architecture as well as individual assets (that is, understand and document how assets are deployed, how they are utilized, and how they work together).
business process analysis (BPA)
For mission essential functions, it is important to reduce the number of dependencies between components. Dependencies are identified by performing a business process analysis (BPA) for each function.
The BPA should identify the following factors:
- Inputs—the sources of information for performing the function (including the impact if these are delayed or out of sequence).
- Hardware—the particular server or data center that performs the processing.
- Staff and other resources supporting the function.
- Outputs—the data or resources produced by the function.
- Process flow—a step-by-step description of how the function is performed.
Reducing dependencies makes it easier to provision redundant systems to allow the function to failover to a backup system smoothly. This means the system design can more easily eliminate the sort of weakness that comes from having single points of failure (SPoF) that can disrupt the function.
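As a rough illustration of looking for single points of failure, the sketch below flags any component supporting a function that has fewer than two provisioned instances. The component names and counts are hypothetical, and a real BPA would also record inputs, staff, outputs, and process flow.

```python
def single_points_of_failure(components: dict[str, int]) -> list[str]:
    """Return the components that have no redundant instance (fewer than two)."""
    return sorted(name for name, count in components.items() if count < 2)


# Hypothetical inventory for an order-processing function:
# component name -> number of instances provisioned.
order_processing = {
    "web server": 2,
    "database server": 1,
    "core switch": 1,
    "UPS": 2,
}
print(single_points_of_failure(order_processing))  # ['core switch', 'database server']
```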
Key performance indicators (KPI)
Each IT system will be supported by assets, such as servers, disk arrays, switches, routers, and so on. Key performance indicators (KPI) can be used to determine the reliability of each asset.
Some of the main KPIs relating to service availability are as follows:
- Mean Time to Failure (MTTF) and Mean Time Between Failures (MTBF) represent the expected lifetime of a product. MTTF should be used for non-repairable assets. For example, a hard drive may be described with an MTTF, while a server (which could be repaired by replacing the hard drive) would be described with an MTBF. You will often see MTBF used indiscriminately, however. For most devices, failure is more likely early and late in life, producing the so-called “bathtub curve.”
- The calculation for MTBF is the total time divided by the number of failures. For example, if you have 10 devices that run for 50 hours and two of them fail, the MTBF is 250 hours/failure (10*50)/2.
- The calculation for MTTF for the same test is the total time divided by the number of devices, so (10*50)/10, with the result being 50 hours/failure.
MTTF/MTBF can be used to determine the amount of asset redundancy a system should have. A redundant system can failover to another asset if there is a fault and continue to operate normally. It can also be used to work out how likely failures are to occur.
- Mean Time to Repair (MTTR) is a measure of the time taken to correct a fault so that the system is restored to full operation. This can also be described as mean time to “replace” or “recover.” This metric is important in determining the overall Recovery Time Objective (RTO).
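The MTBF and MTTF arithmetic from the worked example can be reproduced in a few lines. This is a minimal sketch that follows the calculations exactly as the text defines them; the function names are hypothetical.

```python
def mtbf(total_hours: float, failures: int) -> float:
    """Total operating time divided by the number of failures."""
    if failures <= 0:
        raise ValueError("MTBF needs at least one observed failure")
    return total_hours / failures


def mttf(total_hours: float, devices: int) -> float:
    """Total operating time divided by the number of devices, as in the text."""
    if devices <= 0:
        raise ValueError("at least one device is required")
    return total_hours / devices


# 10 devices run for 50 hours each and two of them fail.
total = 10 * 50
print(mtbf(total, failures=2))   # 250.0 hours/failure
print(mttf(total, devices=10))   # 50.0 hours/failure
```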
asset management process
An asset management process takes inventory of and tracks all the organization’s critical systems, components, devices, and other objects of value. It also involves collecting and analyzing information about these assets so that personnel can make more informed changes or otherwise work with assets to achieve business goals. There are many software suites and associated hardware solutions available for tracking and managing assets (or inventory). An asset management database can be configured to store as much or as little information as is deemed necessary, though typical data would be type, model, serial number, asset ID, location, user(s), value, and service information. Tangible assets can be identified using a barcode label or Radio Frequency ID (RFID) tag attached to the device (or more simply, using an identification number). An RFID tag is a chip programmed with asset data. When in range of a scanner, the chip activates and signals the scanner. The scanner alerts management software to update the device’s location. As well as asset tracking, this allows the management software to track the location of the device, making theft more difficult.
Within the inventory of assets and business processes, it is important to assess their relative importance. In the event of a disaster that requires that recovery processes take place over an extended period, critical systems must be prioritized over merely necessary ones.
It is also important to realize that asset management procedures can easily go astray—assets get mislabeled, new assets are not recorded, and so on. In these cases, some troubleshooting tactics can include:
- Ensure that all relevant assets are participating in a tracking system like barcodes or passive radio frequency IDs (RFIDs).
- Ensure that there is a process in place for tagging newly acquired or developed assets.
- Ensure that there is a process in place for removing obsolete assets from the system.
- Check to see if any assets have conflicting IDs.
- Check to see if any assets have inaccurate metadata.
- Ensure that asset management software can correctly read and interpret tracking tags.
- Update asset management software to fix any bugs or security issues.
Threat assessment
means compiling a prioritized list of probable and possible threats. Some of these can be derived from the list of assets (that is, threats that are specific to your organization); others may be non-specific to your particular organization.
It is important to note that threats can arise from something that the organization is not doing or an asset that it does not own as much as from things that it is doing or assets that it does own. Consider (for instance) the impact on business processes of the following:
- Public infrastructure (transport, utilities, law and order).
- Supplier contracts (security of supply chain).
- Customer’s security (the sudden failure of important customers due to their own security vulnerabilities can be as damaging as an attack on your own organization).
- Epidemic disease.
A large part of threat assessment will identify human threat actors, both internal and external to the organization, so try to understand their motives to assess the level of risk that each type of threat actor poses. Threat actors discussed earlier—such as hackers, organized crime, nation state actors, and insider threat—can all be described as working with some sort of intent. Another threat source is the all-too-human propensity for carelessness and, consequently, accidental damage. Misuse of a system by a naïve user may not intend harm but can nonetheless cause widespread disruption. Misconfiguration of a system can create vulnerabilities that might be exploited by other threat agents. Threat actors also need not be human.
Threat awareness must consider threats posed by events such as natural disasters, accidents, and by legal liabilities:
- Natural disaster—threat sources such as river or sea floods, earthquakes, storms, and so on. Natural disasters may be quite predictable (as is the case with areas prone to flooding or storm damage) or unexpected, and therefore difficult to plan for.
- Manmade disaster—intentional man-made threats such as terrorism, war, or vandalism/arson or unintentional threats, such as user error or information disclosure through social media platforms.
- Environmental—those caused by some sort of failure in the surrounding environment. These could include power or telecoms failure, pollution, or accidental damage (including fire).
- Legal and commercial—some examples include:
- Downloading or distributing obscene material.
- Defamatory comments published on social networking sites.
- Hijacked mail or web servers used for spam or phishing attacks.
- Third-party liability for theft or damage of personal data.
- Accounting and regulatory liability to preserve accurate records.
These cases are often complex, but even if there is no legal liability, the damage done to the organization’s reputation could be just as serious.
supply chain
Threat assessment should not be confined to analyzing your own business. You must also consider critical suppliers. A supply chain is a series of companies involved in fulfilling a product. Assessing a supply chain involves determining whether each link in the chain is sufficiently robust. Each supplier in the chain may have their own suppliers, and assessing “robustness” means obtaining extremely privileged company information. Consequently, assessing the whole chain is an extremely complex process and is an option only available to the largest companies. Most businesses will try to identify alternative sources for supplies so that the disruption to a primary supplier does not represent a single point of failure.
For each business process and each threat, you must assess the degree of risk that exists. Calculating risk is complex, but the two main variables are likelihood and impact:
- Likelihood is the probability of the threat being realized.
- Impact is the severity of the risk if realized as a security incident. This may be determined by factors such as the value of the asset or the cost of disruption if the asset is compromised.
Business impact analysis (BIA)
process of assessing what losses might occur for each threat scenario. For instance, if a roadway bridge crossing a local river is washed out by a flood and employees are unable to reach a business facility for five days, estimated costs to the organization need to be assessed for lost manpower and production. Impacts can be categorized in several ways.
impacts on life and safety
The most critical type of impact is one that could lead to loss of life or critical injury. The most obvious risks to life and safety come from natural disasters, man-made disasters, and accidents (such as fire). Sometimes industries have to consider life and safety impacts in terms of the security of their products, however. For example, a company makes wireless adapters, originally for use with laptops. The security of the firmware upgrade process is important, but it has no impact on life or safety. The company, however, earns a new contract to supply the adapters to provide connectivity for in-vehicle electronics systems. Unknown to the company, a weakness in the design of the in-vehicle system allows an adversary to use compromised wireless adapter firmware to affect the car’s control systems (braking, acceleration, and steering). The integrity of the upgrade process now has an impact on safety.
impacts on property
Again, risks whose impacts affect property (premises) mostly arise due to natural disaster, war/terrorism, and fire.
impacts on finance and reputation
It is important to realize that the value of an asset does not refer solely to its material value. The two principal additional considerations are direct costs associated with the asset being compromised (downtime) and consequent costs to intangible assets, such as the company’s reputation. For example, a server may have a material cost of a few hundred dollars. If the server were stolen, the costs incurred from not being able to do business until it can be recovered or replaced could run to thousands of dollars. In addition, that period of interruption where orders cannot be taken or go unfulfilled leads customers to look at alternative suppliers, resulting in perhaps thousands more in lost sales and goodwill.
impacts on privacy
Another important source of risk is the unauthorized disclosure of personally identifiable information (PII). The theft or loss of PII can have an enormous impact on an individual because of the risk of identity theft and because once disclosed, the PII cannot easily be changed or recovered.
Organizations should perform regular audits to assess whether PII is processed securely. These may be modelled on formal audit documents mandated by US laws, notably The Privacy Act and the Federal Information Security Management Act (FISMA):
- Privacy Threshold Analysis (PTA)—An initial audit to determine whether a computer system or workflow collects, stores, or processes PII to a degree where a PIA must be performed. PTAs must be repeated every three years.
- Privacy Impact Assessment (PIA)—A detailed study to assess the risks associated with storing, processing, and disclosing PII. The study should identify vulnerabilities that may lead to data breach and evaluate controls mitigating those risks.
- System of Records Notice (SORN)—A formal document listing PII maintained by a federal agency of the US government.
There are two methods of assessing likelihood and risk:
quantitative and qualitative
Quantitative risk assessment aims to assign concrete values to each risk factor.
- Single Loss Expectancy (SLE)—The amount that would be lost in a single occurrence of the risk factor. This is determined by multiplying the value of the asset by an Exposure Factor (EF). EF is the percentage of the asset value that would be lost.
- Annual Loss Expectancy (ALE)—The amount that would be lost over the course of a year. This is determined by multiplying the SLE by the Annual Rate of Occurrence (ARO).
The problem with quantitative risk assessment is that the process of determining and assigning these values is complex and time consuming. The accuracy of the values assigned is also difficult to determine without historical data (often, it has to be based on subjective guesswork). However, over time and with experience, this approach can yield a detailed and sophisticated description of assets and risks and provide a sound basis for justifying and prioritizing security expenditure.
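The SLE and ALE formulas above amount to two multiplications. The sketch below works through them with hypothetical figures (an asset worth 100,000, an exposure factor of 25%, and an incident expected once every two years); the function names are illustrative only.

```python
def single_loss_expectancy(asset_value: float, exposure_factor: float) -> float:
    """SLE = asset value x exposure factor (EF given as a fraction from 0 to 1)."""
    if not 0 <= exposure_factor <= 1:
        raise ValueError("exposure factor must be between 0 and 1")
    return asset_value * exposure_factor


def annual_loss_expectancy(sle: float, aro: float) -> float:
    """ALE = SLE x annual rate of occurrence (ARO)."""
    return sle * aro


# Hypothetical values, not taken from any real assessment.
sle = single_loss_expectancy(asset_value=100_000, exposure_factor=0.25)
ale = annual_loss_expectancy(sle, aro=0.5)  # once every two years
print(sle)  # 25000.0
print(ale)  # 12500.0
```

As the text notes, the arithmetic is the easy part; the reliability of the result depends on how well the asset value, EF, and ARO were estimated.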
Qualitative risk assessment
avoids the complexity of the quantitative approach and is focused on identifying significant risk factors. The qualitative approach seeks out people’s opinions of which risk factors are significant. Assets and risks may be placed in simple categories. For example, assets could be categorized as Irreplaceable, High Value, Medium Value, and Low Value; risks could be categorized as one-off or recurring and as Critical, High, Medium, and Low probability.
Another simple approach is the “Traffic Light” impact grid. For each risk, a simple Red, Yellow, or Green indicator can be put into each column to represent the severity of the risk, its likelihood, cost of controls, and so on. This approach is simplistic but does give an immediate impression of where efforts should be concentrated to improve security.
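A traffic light grid is easy to mock up. The sketch below assumes each column is rated low, medium, or high and translates each rating into a color; the risks, column names, and ratings are hypothetical, and sorting by the number of Red cells is just one possible way to surface where effort should go first.

```python
# Assumed three-level rating scale mapped to traffic-light colors.
COLORS = {"low": "Green", "medium": "Yellow", "high": "Red"}


def grid_row(ratings: dict[str, str]) -> dict[str, str]:
    """Translate each column's low/medium/high rating into a color."""
    return {column: COLORS[rating] for column, rating in ratings.items()}


def red_count(row: dict[str, str]) -> int:
    """Number of Red cells in one row of the grid."""
    return sum(1 for color in row.values() if color == "Red")


# Hypothetical risks and ratings.
risks = {
    "Server room fire": {"severity": "high", "likelihood": "low", "cost of controls": "medium"},
    "Unpatched web server": {"severity": "high", "likelihood": "high", "cost of controls": "low"},
}
grid = {name: grid_row(ratings) for name, ratings in risks.items()}

# Show the rows with the most Red cells first.
for name in sorted(grid, key=lambda n: red_count(grid[n]), reverse=True):
    print(name, grid[name])
```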
FIPS 199 (https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.199.pdf) discusses how to apply Security Categorizations (SC) to information systems based on the impact that a breach of confidentiality, integrity, or availability would have on the organization as a whole. Potential impacts can be classified as:
- Low—minor damage or loss to an asset or loss of performance (though essential functions remain operational).
- Moderate—significant damage or loss to assets or performance.
- High—major damage or loss or the inability to perform one or more essential functions.
Having performed the asset and threat identification and completed a risk assessment, risk response options can be identified and prioritized. For example, you might focus on the following systems:
- High value assets, regardless of the likelihood of the threat(s).
- Threats with high likelihood (that is, high ARO).
- Procedures, equipment, or software that increase the likelihood of threats (for example, legacy applications, lack of user training, old software versions, unpatched software, running unnecessary services, not having auditing procedures in place, and so on).
In theory, security controls or countermeasures could be introduced to address every vulnerability. The difficulty is that security controls can be expensive, so you must balance the cost of the control with the cost associated with the risk.
Risk mitigation (or remediation)
It is rarely possible to eliminate risk; rather, the aim is to mitigate risk factors to the point where the organization is exposed only to a level of risk that it can afford (residual risk). Risk mitigation (or remediation) is the overall process of reducing exposure to or the effects of risk factors. There are several ways of mitigating risk. If you deploy a countermeasure that reduces exposure to a threat or vulnerability, that is risk deterrence (or reduction). Risk reduction refers to controls that can either make a risk incident less likely or less costly (or perhaps both). For example, if fire is a threat, a policy strictly controlling the use of flammable materials on site reduces likelihood while a system of alarms and sprinklers reduces impact by (hopefully) containing any incident to a small area. Another example is offsite data backup, which provides a remediation option in the event of servers being destroyed by fire.
Other risk response strategies are as follows:
- Avoidance means that you stop doing the activity that is risk-bearing.
For example, a company may develop an in-house application for managing inventory and then try to sell it. If while selling it, the application is discovered to have numerous security vulnerabilities that generate complaints and threats of legal action, the company may make the decision that the cost of maintaining the security of the software is not worth the revenue and withdraw it from sale.
Obviously, this would generate considerable bad feeling amongst existing customers. Avoidance is not often a credible option.
- Transference (or sharing) means assigning risk to a third party (such as an insurance company or a contract with a supplier that defines liabilities). For example, a company could stop in-house maintenance of an e‑commerce site and contract the services to a third party, who would be liable for any fraud or data theft.
Note: In this sort of case, it is relatively simple to transfer the obvious risks, but risks to the company’s reputation remain. If a customer’s credit card details are stolen because they used your insecure e‑commerce application, the customer won’t care whether you or a third party was nominally responsible for security. It is also unlikely that legal liabilities could be completely transferred in this way.
- Acceptance (or retention) means that no countermeasures are put in place either because the level of risk does not justify the cost or because there will be unavoidable delay before the countermeasures are deployed. In this case, you should continue to monitor the risk (as opposed to ignoring it).
risk register
document showing the results of risk assessments in a comprehensible format. The register may resemble the “traffic light” grid shown earlier with columns for impact and likelihood ratings, date of identification, description, countermeasures, owner/route for escalation, and status. Risk registers are also commonly depicted as scatterplot graphs, where impact and likelihood are each an axis, and the plot point is associated with a legend that includes more information about the nature of the plotted risk. A risk register should be shared between stakeholders (executives, department managers, and senior technicians) so that they understand the risks associated with the workflows that they manage.
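One way to picture a risk register is as a list of records with the columns described above. The following is a minimal sketch, assuming a 1 to 3 scale for impact and likelihood and using their product as a sort key; the field names, scale, and sample entries are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RiskEntry:
    """One row of a risk register."""
    description: str
    impact: int       # assumed scale: 1 (low) to 3 (high)
    likelihood: int   # assumed scale: 1 (low) to 3 (high)
    identified: date
    countermeasures: list[str] = field(default_factory=list)
    owner: str = ""
    status: str = "open"


def prioritized(register: list[RiskEntry]) -> list[RiskEntry]:
    """Order entries so the highest impact x likelihood scores come first."""
    return sorted(register, key=lambda r: r.impact * r.likelihood, reverse=True)


# Hypothetical entries.
register = [
    RiskEntry("Office flood", impact=3, likelihood=1, identified=date(2024, 1, 10),
              countermeasures=["offsite backup"], owner="Facilities"),
    RiskEntry("Phishing of finance staff", impact=2, likelihood=3,
              identified=date(2024, 2, 5), owner="IT security"),
]
for entry in prioritized(register):
    print(entry.description, entry.impact * entry.likelihood, entry.status)
```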
The need to change is often described either as reactive or proactive
In order to reduce the risk that changes to configuration items will cause service disruption, a documented change management process can be used to implement changes in a planned and controlled way. The need to change is often described either as reactive, where the change is forced on the organization, or as proactive, where the need for change is initiated internally. Changes can also be categorized according to their impact and level of risk (major, significant, minor, or normal, for instance).
Request for Change (RFC) document
In a formal change management process, the need for change and the procedure for implementing the change are captured in a Request for Change (RFC) document and submitted for approval. The RFC will then be considered at the appropriate level. This might be a supervisor or department manager if the change is normal or minor. Major or significant changes might be managed as a separate project and require approval through a Change Advisory Board (CAB).
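The approval routing described here can be expressed as a simple lookup. This is only a sketch of the rule as stated in the text; the function name and category strings are hypothetical, and a real organization would define its own categories and approvers.

```python
def approver_for(change_category: str) -> str:
    """Return who would consider an RFC, based on the change category."""
    category = change_category.strip().lower()
    if category in {"normal", "minor"}:
        return "supervisor or department manager"
    if category in {"major", "significant"}:
        return "Change Advisory Board (CAB)"
    raise ValueError(f"unknown change category: {change_category!r}")


print(approver_for("minor"))  # supervisor or department manager
print(approver_for("Major"))  # Change Advisory Board (CAB)
```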
Follow these guidelines when putting risk management processes in place:
- Identify mission-essential functions and the critical systems within each function.
- Identify those assets supporting business functions and critical systems, and determine their values.
- Calculate MTD, RPO, RTO, MTTF, MTTR, and MTBF for functions and assets.
- Look for possible vulnerabilities that, if exploited, could adversely affect each function or system.
- Determine potential threats to functions and systems.
- Determine the probability or likelihood of a threat exploiting a vulnerability.
- Determine the impact of the potential threat, whether it be recovery from a failed system or the implementation of security controls that will reduce or eliminate risk.
- Identify impact scenarios that put your business operations at risk.
- Identify the risk analysis method that is most appropriate for your organization. For quantitative and semi-quantitative risk analysis, calculate SLE and ARO for each threat, and then calculate the ALE.
- Identify potential countermeasures, ensuring that they are cost-effective and perform as expected. For example, identify single points of failure and, where possible, establish redundant or alternative systems and solutions.
- Clearly document all findings discovered and decisions made during the assessment in a risk register.
Continuity of Operations Planning (COOP), sometimes referred to as a business continuity plan (BCP)
a collection of processes that enable an organization to maintain normal business operations in the face of some adverse event. There are numerous types of events, both natural and man-made, that could disrupt the business and require a continuity effort to be put in place. They may be instigated by a malicious party, or they may come about due to carelessness or negligence on the part of non-malicious personnel. The organization may suffer loss or leakage of data; damage to or destruction of hardware and other physical property; impairment of communications infrastructure; loss of or harm done to personnel; and more. When these negative events become a reality, the organization will need to rely on resiliency and automation strategies to mitigate their effect on day-to-day operations.
single points of failure
Computer systems require protection from hardware failure, software failure, and system failure (failure of network connectivity devices, for instance).
When implementing a network, the goal will always be to minimize the single points of failure and to allow ongoing service provision despite a disaster. To perform IT Contingency Planning (ITCP), think of all the things that could fail, determine whether the result would be a critical loss of service, and whether this is unacceptable. Then identify strategies to make the system resilient. How resilient a system is can be determined by measuring or evaluating several properties.
high availability
One of the key properties of a resilient system is high availability. Availability is the percentage of time that the system is online, measured over the defined period (typically one year). The corollary of availability is downtime (that is, the percentage or amount of time during which the system is unavailable). The maximum tolerable downtime (MTD) metric states the requirement for a particular business function. High availability is usually loosely described as 24x7 (24 hours per day, 7 days per week) or 24x365 (24 hours per day, 365 days per year). For a critical system, availability will be described as “two-nines” (99%) up to five- or six-nines (99.9999%).
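To make the "nines" concrete, an availability percentage can be converted into the downtime it permits over a period. The sketch below assumes a 365-day year (8,760 hours) and ignores leap years; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours(availability_percent: float,
                       period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime allowed in the period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage from 0 to 100")
    return period_hours * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    hours = max_downtime_hours(pct)
    print(f"{pct}% -> {hours:.4f} hours ({hours * 60:.2f} minutes) per year")
```

Two-nines allows roughly 87.6 hours of downtime per year, while five-nines allows a little over five minutes.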
fault tolerant
A system that can experience failures and continue to provide the same (or nearly the same) level of service is said to be fault tolerant. Fault tolerance is often achieved by provisioning redundancy for critical components and single points of failure. A redundant component is one that is not essential to the normal function of a system but that allows the system to recover from the failure of another component.
Examples of devices and solutions that provide fault tolerance include the following:
- Redundant components (power supplies, network cards, drives (RAID), and cooling fans) provide protection against hardware failures. Hot swappable components allow for easy replacement (without having to shut down the server).
- Uninterruptible Power Supplies (UPS) and Standby Power Supplies.
- Backup strategies—provide protection for data.
- Cluster services are a means of ensuring that the total failure of a server does not disrupt services generally.
While these computer systems are important, thought also needs to be given to how to make a business “fault tolerant” in terms of staffing, utilities (heat, power, communications, transport), customers, and suppliers.
Scalability
A resilient system does not just need to be able to cope with faults and outages; it must also be able to cope with changing demand levels. These properties are measured as scalability and elasticity.
Scalability means that the costs involved in supplying the service to more users are linear. For example, if the number of users doubles in a scalable system, the costs to maintain the same level of service would also double (or less than double). If costs more than double, the system is less scalable.
To scale out is to add more resources in parallel with existing resources. To scale up is to increase the power of existing resources.
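The linear-cost test for scalability can be written as a comparison of growth ratios. This is a minimal sketch with hypothetical figures; it treats a system as scalable when costs grow no faster than the number of users.

```python
def is_scalable(users_before: int, cost_before: float,
                users_after: int, cost_after: float) -> bool:
    """True when cost grows no faster than users (linear or better)."""
    # Cross-multiplied form of: cost_after / cost_before <= users_after / users_before
    return cost_after * users_before <= cost_before * users_after


# Hypothetical figures: users double from 1,000 to 2,000.
print(is_scalable(1000, 10_000, 2000, 20_000))  # True: costs exactly double
print(is_scalable(1000, 10_000, 2000, 25_000))  # False: costs more than double
```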
Elasticity
refers to the system’s ability to handle changes in demand in real time. A system with high elasticity will not experience loss of service or performance if demand suddenly doubles (or triples, or quadruples). Conversely, it may be important for the system to be able to reduce costs when demand is low. Elasticity is a common selling point for cloud services. Instead of running a cloud resource for 24 hours a day, 7 days a week, that resource can diminish in power or shut down completely when demand for that resource is low. When demand picks up again, the resource will grow in power to the level required. This results in cost-effective operations.
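Elastic behavior can be sketched as a sizing rule that follows demand in both directions. The example below assumes demand is measured in requests per second and that each instance has a fixed capacity; the function name, units, and figures are hypothetical and do not describe any particular cloud provider's autoscaling API.

```python
import math


def instances_needed(demand_rps: float, capacity_per_instance: float,
                     minimum: int = 0) -> int:
    """Instances required to serve the current demand, never below the minimum."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity per instance must be positive")
    return max(minimum, math.ceil(demand_rps / capacity_per_instance))


# Hypothetical capacity of 100 requests per second per instance.
print(instances_needed(250, 100))  # 3
print(instances_needed(500, 100))  # 5 after demand doubles
print(instances_needed(0, 100))    # 0: the resource can shut down when idle
```

Setting `minimum` to 1 or more would keep a baseline running at all times, trading some cost savings for faster response when demand returns.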