3.4 Flashcards
network load balancer
a device that is used to evenly distribute incoming network traffic across multiple servers or resources when there is a high volume of traffic coming into the company’s network or web server
load balancer (info)
They distribute Transmission Control Protocol (TCP),
User Datagram Protocol (UDP), Hypertext Transfer
Protocol (HTTP), and Transport Layer Security (TLS)
traffic across multiple servers to efficiently allocate
resources and offer failover solutions
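One common way to spread traffic and provide failover is round-robin rotation over the servers that are currently healthy. The sketch below assumes that strategy. The server names and the `healthy` set are hypothetical, and real load balancers support other algorithms as well.

```python
def distribute(requests, servers, healthy):
    """Assign requests in rotation to servers that are marked healthy."""
    # Failover: servers that are not healthy are left out of the rotation.
    pool = [s for s in servers if s in healthy]
    if not pool:
        raise RuntimeError("no healthy servers available")
    return {req: pool[i % len(pool)] for i, req in enumerate(requests)}


servers = ["web1", "web2", "web3"]           # hypothetical server names
requests = ["req1", "req2", "req3", "req4"]  # hypothetical request IDs

all_up = distribute(requests, servers, healthy={"web1", "web2", "web3"})
web2_down = distribute(requests, servers, healthy={"web1", "web3"})
```

With every server healthy the fourth request wraps back to `web1`; with `web2` marked down, traffic alternates between `web1` and `web3`.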
clustering
intended to improve performance and availability of a complex physical or virtual system
involves an active node and a passive node that share a common quorum disk, reinforced by a witness server, heartbeat communication and a VIP at the forefront
Clusters are designed to be a redundant set of service functionalities based on active-standby or active-active deployments
Cluster deployments are often measured by:
- Reliability – the ability to successfully provide responses on each incoming request
- Availability – the uptime of the server (usually measured as % of annual uptime)
- Performance – the average time the service takes to provide responses, or its throughput
- Scalability – the ability to handle a growing amount of work in a capable manner without degradation in the quality of service
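Since availability is usually quoted as a percentage of annual uptime, it can help to convert that percentage into allowed downtime. A minimal sketch, assuming a 365-day year; the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year (8,760 hours)


def annual_downtime_hours(availability_percent):
    """Convert an availability percentage into hours of downtime per year."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


three_nines = annual_downtime_hours(99.9)   # about 8.76 hours per year
four_nines = annual_downtime_hours(99.99)   # about 0.876 hours per year
```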
clustering techniques
High availability clusters
Load balancing clusters
High-performance clusters
Storage clusters
High availability clusters
prioritize resilience over
other advantages and can be implemented in either
Active-Passive or Active-Active architecture
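In an Active-Passive cluster, the standby node typically takes over when the active node's heartbeat goes stale. The sketch below illustrates that idea only; the node names, timestamps, and 3-second timeout are assumptions, and it leaves out quorum and witness handling.

```python
def choose_active(nodes, last_heartbeat, now, timeout=3.0):
    """Return the first node, in priority order, whose heartbeat is still fresh."""
    for node in nodes:
        if now - last_heartbeat[node] <= timeout:
            return node
    return None  # no node has reported within the timeout


nodes = ["primary", "standby"]  # priority order, hypothetical names

# Primary last reported 5 s ago (stale), standby 0.5 s ago (fresh).
after_failure = choose_active(nodes, {"primary": 10.0, "standby": 14.5}, now=15.0)

# Both nodes are fresh, so the higher-priority primary stays active.
normal = choose_active(nodes, {"primary": 14.0, "standby": 14.5}, now=15.0)
```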
Load balancing clusters
emphasize balancing jobs
among all of the servers in the cluster and incorporate load balancing software in the controller node
High-performance clusters
use multiple servers to
execute a specific task very quickly and support data-intensive projects such as live-streaming and
real-time data processing
Storage clusters
offer massive storage arrays,
sometimes in support of high-performance clusters,
but always in a support role for other servers or clusters such as storage area networking or
hypervisor cluster data stores
Full Backups
- The process backs up everything regardless of whether the archive bit is set or not:
- Clears the archive bit once the backup completes
- This method takes the longest to back up and the time depends on how much must be backed up
- A full backup is quickest to restore as only the most recent full backup is required
- A full backup should be scheduled, automated, and tested although it is common to perform this manually
incremental backups
This method backs up any new file or any file that has changed since whichever of these is more recent:
* The last full backup
* The last incremental backup
- Subsequent backups only store changes that were made since the previous backup
- An incremental backup clears the archive bit once the backup completes
- The process of restoring lost data from an incremental backup is longer, but the backup process is much quicker
- It is not recommended to perform incremental backups manually
differential backups
This method backs up any file that has the archive
bit set
* Backs up any new file or any file that has changed since the last full backup
* A differential backup DOES NOT clear the archive bit when the backup completes
* It is slower to back up than an incremental backup but quicker to restore
* The last full backup and the most recent differential backup are needed for restoration
* It is not recommended to perform differential backups manually
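The way full, incremental, and differential backups treat the archive bit can be modeled in a few lines. This is a toy sketch, not a real backup tool: `files` is a hypothetical mapping from file name to archive bit, where `True` means the file is new or changed.

```python
def run_backup(files, kind):
    """Return the files a backup of the given kind copies, updating archive bits."""
    if kind == "full":
        copied = sorted(files)  # everything, regardless of the archive bit
    else:
        copied = sorted(name for name, bit in files.items() if bit)
    # Full and incremental backups clear the bit; differential leaves it set.
    if kind in ("full", "incremental"):
        for name in copied:
            files[name] = False
    return copied


files = {"a.txt": True, "b.txt": True, "c.txt": True}
full = run_backup(files, "full")            # copies all three, clears every bit

files["a.txt"] = True                       # a.txt changes
diff1 = run_backup(files, "differential")   # copies a.txt, bit stays set

files["b.txt"] = True                       # b.txt changes
diff2 = run_backup(files, "differential")   # copies a.txt and b.txt

incr = run_backup(files, "incremental")     # copies a.txt and b.txt, clears bits
files["c.txt"] = True                       # c.txt changes
incr2 = run_backup(files, "incremental")    # copies only c.txt
```

Because the differential run leaves the bit set, each differential grows to include everything changed since the last full backup, which is why a restore needs only the full backup plus the latest differential.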
snapshots
Are immediate point-in-time virtual copies of the source data
* Offer easier and faster backups and restores
* Should be replicated to another medium or cloud storage to be considered a backup
* Do not increase time to back up based on amount of data
* Improve Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
* Have fast restores
* Result in less data lost during an outage
* Can easily be encrypted and decrypted
backup frequency
Backup frequency is often based on the business impact analysis metric known as the Recovery Point
Objective (RPO):
* RPO is the maximum amount of data loss that you
can tolerate in case of a disaster
* The lower the RPO, the more frequently you need to
back up your data
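The link between RPO and backup frequency can be made concrete: in the worst case, a disaster strikes just before the next backup, so the backup interval should not exceed the RPO. A small sketch under that assumption, with illustrative function names:

```python
import math


def meets_rpo(backup_interval_hours, rpo_hours):
    """Worst-case data loss equals the interval between backups."""
    return backup_interval_hours <= rpo_hours


def min_backups_per_day(rpo_hours):
    """Fewest evenly spaced daily backups that keep the interval within the RPO."""
    return math.ceil(24 / rpo_hours)


nightly_ok = meets_rpo(backup_interval_hours=24, rpo_hours=4)  # False
needed = min_backups_per_day(rpo_hours=4)                      # 6 backups per day
```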
journaling
Journaling is also referred to as journal-based
backup
* Journaling is the simultaneous (real-time) logging of all data-file updates
* This log offers an audit trail and is used to reconstruct the database if the original file is damaged or destroyed
* Journal-based backup is an alternate method of backup that uses a change journal maintained by a hardware or software storage manager
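Reconstructing data from a journal amounts to replaying the logged updates, in order, on top of the last good copy. The sketch below assumes a simple hypothetical journal format of `(operation, key, value)` tuples; real storage managers use their own formats.

```python
def replay(backup, journal):
    """Rebuild the current state from a backup plus the updates logged since."""
    restored = dict(backup)  # work on a copy so the backup stays untouched
    for operation, key, value in journal:
        if operation == "set":
            restored[key] = value
        elif operation == "delete":
            restored.pop(key, None)
    return restored


backup = {"alice": 100, "bob": 50}
journal = [
    ("set", "alice", 120),   # update logged after the backup was taken
    ("set", "carol", 75),    # new record
    ("delete", "bob", None), # record removed
]
restored = replay(backup, journal)
```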
encrypting backup
Encrypting the database and other data backups helps secure the data
Continuity of operations plan (COOP) or business
continuity plan (BCP)
helps to ensure that the entity
remains operational at a pre-determined level when
disaster strikes
- These are plans and documents approved by executive management that:
- Outline the risk to business
- Populate risk register/ledger
- Provide requirements to mitigate incidents
- Identify the procedures needed to recover from a
disaster
business impact analysis (BIA)
Recovery Time Objective (RTO)
Maximum tolerable downtime (MTD)
Recovery Point Objective (RPO)
Mean time to repair (MTTR)
Mean time between failures (MTBF)
Mean time between failures (MTBF):
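The average operating time between one failure of a product and the next; its inverse is the failure rate, often quoted as the number of failures per million hours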
Mean time to repair (MTTR):
The average time needed to repair or replace a failed system or module
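MTBF and MTTR are often combined into an availability estimate using the standard steady-state formula MTBF / (MTBF + MTTR). That formula is not stated in these cards; the sketch below assumes it, with MTBF taken as average operating hours between failures and both values in hours.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: share of time the system is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# A system that runs 999 hours between failures and takes 1 hour to repair
# is available 99.9% of the time.
estimate = availability(mtbf_hours=999, mttr_hours=1)
```

The formula shows why both metrics matter: availability improves either by failing less often (higher MTBF) or by repairing faster (lower MTTR).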
Recovery Point Objective (RPO):
The maximum targeted period in which an asset or data may be lost from an IT service due to a major
event
Maximum tolerable downtime (MTD):
Absolute maximum amount of time that a resource, service, or function can be unavailable
Recovery Time Objective (RTO):
The target amount of time within which a process must be restored after disruption
Disaster recovery planning (DRP)
- Outlines the technical aspects involved for restoration:
- Order of restoration (most critical to least critical)
- Backups, snapshots, and restores
- Contact information
- Communication plans
- Chain of authority
- Step-by-step instructions
- Locations of documents, software, and keys
- Recovery sites: Hot, warm, cold, mobile, cloud,
shared
Multicloud
- Is a cloud computing model where an enterprise leverages a combination of clouds (two or more public clouds, two or more private clouds, or a combination of public, private, and edge clouds)
- Enables the distribution of data, applications, and services to accelerate app transformation and the delivery of new apps
- Supports disaster recovery by leveraging more than one provider for enhanced high availability
and durability
geographic dispersion
- Distance between systems, or geographic dispersion, has benefits but also has physical and practical limitations
- For a disaster recovery solution, typically, the greater the distance between the systems, the greater the protection you will have from areawide disasters
- This distance will come with application environment impacts:
- When distance is added to a data replication solution, latency is introduced
- Latency is the added time it takes for data to reach the target system
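The latency cost of distance can be estimated from signal propagation alone. The sketch below assumes light travels through optical fiber at roughly two-thirds of its speed in a vacuum, about 200 km per millisecond; it ignores routing, queuing, and equipment delays, so real latency will be higher.

```python
FIBER_KM_PER_MS = 200.0  # assumed propagation speed in fiber, about 2/3 of c


def round_trip_ms(distance_km):
    """Lower bound on round-trip time between two sites, in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_MS


metro = round_trip_ms(50)       # 0.5 ms for sites 50 km apart
regional = round_trip_ms(1000)  # 10.0 ms for sites 1,000 km apart
```

For synchronous replication, each write waits at least one round trip, which is why long distances often push designs toward asynchronous replication.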
capacity planning
- Is a technique for analyzing how much production capacity organizations need to meet consumer demand
- Is widely used in the data center, manufacturing, and cloud services industries
- Assists organizations in determining whether they have enough raw materials, people, technology, and infrastructure to deliver the value proposition
types of capacity planning
Product
Workforce
Tool
Production
Read-through (plan review)
is where the business
continuity plan owner and business continuity team discuss the business continuity plan:
- Look for missing elements and inconsistencies within the plan or with the organization
- Is a type of checklist test that is useful to train new members of a team, including the business function owner
Tabletop testing
is where participants gather in a
room to execute documented plan activities in a stress-free environment:
- Can use blueprints, topological diagrams, or computer models to effectively demonstrate whether team members know their duties in an emergency and if they need training
- Identifies documentation errors, missing information, and inconsistencies across business continuity plans
Walkthrough testing
is a planned rehearsal of a
possible incident designed to evaluate an organization’s capability to manage that incident:
* Provides an opportunity to improve the organization’s future responses and enhance the
relevant competencies of those involved
* Is often done on a limited basis or by scheduling each department or building separately for fire and active shooter drills
Simulation testing
determines if business continuity
management procedures and resources work in a realistic situation:
* May be the most elaborate test most entities ever conduct
* Uses established business continuity resources, such
as the recovery site, backup equipment, services from recovery vendors, and transportation
* Can require sending teams to alternate sites to restart technology as well as business functions
A parallel test
involves bringing the recovery site to a state of operational readiness, but maintaining operations at the primary site:
* Staff are relocated, backup tapes are transferred, and operational readiness is established in accordance with the disaster recovery plan, while operations at
the primary site continue normally
* This may be the most comprehensive test most
entities ever conduct
full interruption test
operations are completely shut down at the primary site to fully
emulate the disaster:
* The enterprise transfers to the recovery site in accordance with the disaster recovery plan
* This is a very thorough test, which is also expensive (may be cost-prohibitive)
* The full interruption test has the capacity to cause a major disruption of operations if the test fails
types of power outages
blackout
brownouts
permanent fault
rolling blackouts
blackout
is a complete loss of power to an area:
- This is the most severe type of power outage,
typically affecting large numbers of people over
potentially large areas
brownouts
typically occur if there is a drop in
electrical voltage or a drop in the overall electrical
power supply:
- While brownouts do not cause a complete loss of power, they can cause poor performance from some
equipment and some devices
permanent fault
a sudden loss of power
typically caused by a power line fault:
* These are simple and easy to deal with; once the fault is removed or repaired, power is automatically
restored
rolling blackouts
are different from the other three
as they are planned power outages:
* These are usually implemented in areas with unstable grids or with infrastructure that cannot
handle the population it serves
* Rolling blackouts can also occur if there is not enough fuel to generate power at full capacity, whether for the short term or the long term
uninterruptible power supply (UPS)
an electrical component that delivers emergency power to a load when the main power source
(typically utility power) fails
It conditions incoming power to ensure clean and uninterrupted power, protects devices from power
problems, and enables seamless system shutdown during complete outages
- A UPS system is particularly beneficial for networking equipment and other devices that can lose data when power is suddenly lost
- The UPS is a critical investment to thwart damage, data loss, and downtime caused by power issues
generators
- A backup generator is a failover power solution that provides power to business operations and homes
- They are typically stationary and require a concrete pad used as a foundation usually situated outside a facility or site
- Standby generators are a robust solution that can offer power for days during extended power outages, depending on the fuel type and configuration of the generator
- Many sites employ prime or continuous generators for disaster recovery site solutions
multiple power sources (info)
- Electricity companies can operate in the same area because they can compete to provide electricity to consumers
- While the power may come from the same grid or transmission lines, different companies can generate and supply electricity to the grid
- These companies then compete based on factors such as pricing, customer service, and renewable energy offerings
- It is similar to how different phone carriers can operate using the same cell towers and infrastructure
mobile site recovery time
24 to 48 hours
mobile site advantages
- Moderately priced
- Typically, can be in place for 36 to 72 hours
- Can be placed in the parking lot adjacent to your impacted facility
mobile site disadvantages
- Recovery time typically is at least 2 to 5 days longer than a hot site
- Access to the impacted facility may be hindered because of the event
- A trailer may not be configured exactly as you need it
mobile site (info)
This approach avoids employee travel issues but has limitations on equipment availability and outbound bandwidth if very small aperture terminal (VSAT) satellite links must be used for communication. If the disaster
profile includes events such as hurricanes, floods or toxic spills, these solutions may not be appropriate.
cold site recovery time
72 plus hours
cold site advantages
- Lowest cost solution
- Basic infrastructure (power, air, and communication) is in place
- Can rent the facility for a longer term at lower cost
cold site disadvantages
- This has the longest recovery time
- All equipment must be ordered, delivered, installed, and made operational
- This is the worst solution for supporting ongoing operations
cold site (info)
An environmentally appropriate space can be either
provisioned internally or contracted from a commercial facilities service provider. Cold-site strategies are usually based on “quick-ship” delivery agreements to allow server, storage, and communications hardware and network service
providers to quickly build out the data center and/or client workspace infrastructure.
Reciprocal agreement recovery time
12 to 48 hours
Reciprocal agreement advantages
- Least costly solution
- Better than no strategy
Reciprocal agreement disadvantages
- Reciprocal agreement seldom works
- Typically, organizations are in the same geographic area and a wide-range disaster like an earthquake renders it of no use
- There is no easy way to test
Reciprocal agreement (info)
This is typically a formal agreement between two trusted, non-competing partners in different industries in which each provides secure sites for the other. This
option is the least favorable and has the greatest risk
associated with it.
Cloud recovery time
0 to 24 hours
cloud advantages
- Data and applications available immediately
- Location independent
- Easy to test
cloud disadvantages
- Security is a concern
- Cloud may not allow enough time for a daily cycle
processing window
Cloud (info)
Data should be in place so activation would only be limited by connectivity and network addressing (DNS propagation)
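The recovery times on the cards above can be compared against a Recovery Time Objective to shortlist site options. The sketch below uses only the ranges given in these cards and assumes a site qualifies when its worst-case recovery time fits within the RTO. The cold site is listed as "72 plus hours" with no upper bound, so it never qualifies under that rule.

```python
# (best case, worst case) recovery time in hours, as listed on the cards.
# None marks an open-ended upper bound ("72 plus hours").
RECOVERY_HOURS = {
    "cloud": (0, 24),
    "reciprocal agreement": (12, 48),
    "mobile site": (24, 48),
    "cold site": (72, None),
}


def sites_within_rto(rto_hours):
    """List site types whose worst-case recovery time fits within the RTO."""
    return sorted(
        name
        for name, (_, worst) in RECOVERY_HOURS.items()
        if worst is not None and worst <= rto_hours
    )


one_day = sites_within_rto(24)   # ['cloud']
two_days = sites_within_rto(48)  # ['cloud', 'mobile site', 'reciprocal agreement']
```

Recovery time is only one factor; the advantages and disadvantages listed for each site type (cost, testability, shared geography) still need to be weighed.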