Lesson 20 Flashcards
High Availability
High availability is usually loosely described as 24x7 (24 hours per day, 7 days per week) or 24x365 (24 hours per day, 365 days per year). For a critical system, availability will be described as “two-nines” (99%) up to five- or six-nines (99.9999%):
Availability    Annual Downtime
99.9999%        00:00:32
99.999%         00:05:15
99.99%          00:52:34
99.9%           08:45:36
99.0%           87:36:00
Maximum tolerable downtime (MTD)
The maximum tolerable downtime (MTD) metric
expresses the availability requirement for a particular business function.
Scheduled service intervals versus unplanned outages
Downtime is calculated from the sum of scheduled service intervals plus unplanned outages over
the period.
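To see how the annual downtime figures in the table above are derived, here is a minimal Python sketch (the percentages and the 365-day year are the only inputs; this is an illustration, not part of the course material):

    # Convert an availability percentage into maximum annual downtime (HH:MM:SS).
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60

    def annual_downtime(availability_pct):
        """Return (hours, minutes, seconds) of downtime allowed per year."""
        down = SECONDS_PER_YEAR * (1 - availability_pct / 100)
        hours, rem = divmod(round(down), 3600)
        minutes, seconds = divmod(rem, 60)
        return hours, minutes, seconds

    for pct in (99.9999, 99.999, 99.99, 99.9, 99.0):
        h, m, s = annual_downtime(pct)
        print(f"{pct}% -> {h:02}:{m:02}:{s:02}")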
Scalability
•Increase capacity within similar cost ratio
•Scale out versus scale up
Scalability is the capacity to
increase resources to meet demand within similar cost ratios. This means that if service
demand doubles, costs do not more than double. There are two types of scalability:
• To scale out is to add more resources in parallel with existing resources.
• To scale up is to increase the power of existing resources.
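As a quick numeric sketch of the “similar cost ratio” test (the demand and cost figures below are invented for illustration):

    # Scalability check: cost should grow no faster than demand.
    def is_scalable(cost_before, cost_after, demand_before, demand_after):
        """True if the cost ratio stays within the demand ratio."""
        return cost_after / cost_before <= demand_after / demand_before

    # Example (assumed figures): demand doubles, cost rises by only 80%.
    print(is_scalable(cost_before=100, cost_after=180,
                      demand_before=1000, demand_after=2000))   # True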
Elasticity
•Cope with changes to demand in real time
Elasticity refers to the system’s ability to handle changes to demand in real time. A system with high elasticity will not experience loss of service or performance if demand suddenly increases rapidly.
Fault tolerance and redundancy
A system that can experience failures and continue to provide the same (or nearly the
same) level of service is said to be fault tolerant. Fault tolerance is often achieved
by provisioning redundancy for critical components and single points of failure. A
redundant component is one that is not essential to the normal function of a system
but that allows the system to recover from the failure of another component.
Power problems
•Spikes and surges
•Blackouts and brownouts
All types of computer systems require a stable power supply to operate. Electrical
events, such as voltage spikes or surges, can crash computers and network appliances,
while loss of power from brownouts or blackouts will cause equipment to fail.
Power management
Power management means deploying systems to ensure that equipment is protected against these events (blackouts, brownouts, spikes, and surges) and that network operations can either continue uninterrupted or be recovered quickly.
Dual Power Supplies
•Component redundancy for server chassis
An enterprise-class server or appliance enclosure is likely to feature two or more power supply units (PSUs) for redundancy. A hot plug PSU can be replaced (in the event of failure) without powering down the system.
Managed power distribution units (PDUs)
•Protection against spikes, surges, and brownouts
•Remote monitoring
The power circuits supplying grid power to a rack, network closet, or server room
must be enough to meet the load capacity of all the installed equipment, plus room
for growth. Consequently, circuits to a server room will typically be higher capacity
than domestic or office circuits (30 or 60 amps as opposed to 13 amps, for instance).
These circuits may be run through a power distribution unit (PDU). These come with
circuitry to “clean” the power signal, provide protection against spikes, surges, and
brownouts, and can integrate with uninterruptible power supplies (UPSs). Managed
PDUs support remote power monitoring functions, such as reporting load and
status, switching power to a socket on and off, or switching sockets on in a particular
sequence.
Battery backups and uninterruptible power supply (UPS)
•Battery backup at component level
•UPS battery backups for servers and appliances
If there is loss of power, system operation can be sustained for a few minutes or hours
(depending on load) using battery backup. Battery backup can be provisioned at the
component level for disk drives and RAID arrays. The battery protects any read or write
operations cached at the time of power loss. At the system level, an uninterruptible
power supply (UPS) will provide a temporary power source in the event of a blackout
(complete power loss). This may range from a few minutes for a desktop-rated model
to hours for an enterprise system. In its simplest form, a UPS comprises a bank of
batteries and their charging circuit plus an inverter to generate AC voltage from the DC
voltage supplied by the batteries.
Generators
A backup power generator can provide power to the whole building, often for several
days. Most generators use diesel, propane, or natural gas as a fuel source. With diesel
and propane, the main drawback is safe storage (diesel also has a shelf-life of between
18 months and two years); with natural gas, the issue is the reliability of the gas
supply in the event of a natural disaster. Data centers are also investing in renewable
power sources, such as solar, wind, geothermal, hydrogen fuel cells, and hydro. The
ability to use renewable power is a strong factor in determining the best site for new
data centers. Large-scale battery solutions, such as Tesla’s Powerpack (tesla.com/powerpack), may be able to provide an alternative to backup power generators. There are also emerging technologies to use all the battery resources of a data center as a microgrid for power storage (scientificamerican.com/article/how-big-batteries-at-datacenters-could-replace-power-plants/).
Network redundancy
Networking is another critical resource where a single point of failure could cause significant service disruption.
Network Interface Card (NIC) Teaming
Network interface card (NIC) teaming, or adapter teaming, means that the server
is installed with multiple NICs, or NICs with multiple ports, or both. Each port is
connected to separate network cabling. During normal operation, this can provide a
high-bandwidth link. For example, four 1 Gbps ports give an overall bandwidth of 4 Gbps. If there is a problem with one cable, or one NIC, the network connection will continue to work, though bandwidth drops to 3 Gbps.
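The arithmetic, as a minimal Python sketch (four 1 Gbps ports are example values):

    # Aggregate bandwidth of a NIC team before and after one link fails.
    ports_gbps = [1, 1, 1, 1]            # four 1 Gbps ports (example)
    print(sum(ports_gbps), "Gbps")       # 4 Gbps during normal operation
    surviving = ports_gbps[1:]           # one cable or NIC fails
    print(sum(surviving), "Gbps")        # 3 Gbps, but the connection survives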
From Wikipedia: A network interface controller (NIC, also known as a network interface card, network adapter, LAN adapter or physical network interface, and by similar terms) is a computer hardware component that connects a computer to a computer network.
Switching and routing (for network redundancy)
•Design network with multiple paths
Network cabling should be designed to allow for multiple paths between the various
switches and routers, so that during a failure of one part of the network, the rest
remains operational.
Load balancers (for network redundancy)
•Load balancing switch to distribute workloads
•Clusters provision multiple redundant servers to share data and session information
NIC teaming provides load balancing at the adapter level. Load balancing and
clustering can also be provisioned at a service level:
• A load balancing switch distributes workloads between available servers.
• A load balancing cluster enables multiple redundant servers to share data and
session information to maintain a consistent service if there is failover from one
server to another.
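To make the distribution idea concrete, here is a minimal round-robin sketch in Python (the server names and the round-robin policy are assumptions; real load balancers offer many scheduling policies and health checks):

    from itertools import cycle

    # Round-robin: each incoming request goes to the next server in rotation.
    servers = cycle(["srv-a", "srv-b", "srv-c"])   # example server pool

    def dispatch(request):
        target = next(servers)
        print(f"{request} -> {target}")
        return target

    for req in ("req1", "req2", "req3", "req4"):
        dispatch(req)                    # req4 wraps around to srv-a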
Disk Redundancy
Redundancy is critical for disk and storage resources. While a backup protects data integrity when a disk fails, restoring from backup would require installing a new storage unit, restoring the data, and testing the system configuration. Disk redundancy ensures that a server can continue to operate if one, or possibly more, storage devices fail.
Redundant array of independent disks (RAID)
When a storage system is configured as a Redundant Array of Independent Disks
(RAID), many disks can act as backups for each other to increase reliability and fault
tolerance. If one disk fails, the data is not lost, and the server can keep functioning.
The RAID advisory board defines RAID levels, numbered from 0 to 6, where each level
corresponds to a specific type of fault tolerance. There are also proprietary and nested
RAID solutions. Some of the most commonly implemented types of RAID are listed in
the following table.
RAID 1, 5, 6, Nested, and Level 0
RAID 1
•Mirroring
•50% storage efficiency
RAID 5 and RAID 6
•Striping with distributed parity
•Better storage efficiency
Nested RAID
•Better performance or redundancy
RAID Level 1
Mirroring means that data is written to two disks simultaneously, providing redundancy (if one disk fails, there is a copy of data on the other). The main drawback is that storage efficiency is only 50%.
RAID Level 5
Striping with parity means that data is written across three or more disks, but additional information (parity) is calculated. This allows the volume to continue if one disk is lost. This solution has better storage efficiency than RAID 1.
RAID Level 6
Double parity, or level 5 with an additional parity stripe, allows the volume to continue when two devices have been lost.
Nested RAID (0+1, 1+0, or 5+0)
Nesting RAID sets generally improves performance or redundancy. For example, some nested RAID solutions can support the failure of more than one disk.
RAID Level 0
RAID level 0 refers to striping without parity. Data is written in blocks across several disks
simultaneously, but with no redundancy. This can improve performance, but if one disk
fails, so does the whole volume, and data on it will be corrupted. There are some use cases
for RAID 0, but typically striping without parity is only implemented to improve performance
in a nested RAID solution.
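The storage efficiency trade-offs can be expressed as a small calculation. A minimal sketch (disk counts and sizes are example values; real arrays vary):

    # Usable capacity of a RAID set built from n identical disks.
    def usable_tb(level, n_disks, disk_tb):
        if level == 0:                   # striping, no redundancy
            return n_disks * disk_tb
        if level == 1:                   # mirroring: 50% efficiency
            return disk_tb
        if level == 5:                   # one disk's worth of parity
            return (n_disks - 1) * disk_tb
        if level == 6:                   # two disks' worth of parity
            return (n_disks - 2) * disk_tb
        raise ValueError("unsupported RAID level")

    for level, n in ((0, 4), (1, 2), (5, 4), (6, 4)):
        print(f"RAID {level}: {usable_tb(level, n, 2)} TB usable from {n} x 2 TB disks")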
Multipath
•Controller and cabling redundancy
Where RAID provides redundancy for the storage devices, multipath is focused on
the bus between the server and the storage devices or RAID array. A storage system is
accessed via some type of controller. The controller might be connected to disk units
locally installed in a server, or it might connect to storage devices within a storage area
network (SAN). Multipath input/output (I/O) ensures that there is controller redundancy and/or multiple network paths to the storage devices.
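A minimal failover sketch (the path names and the health-check stand-in are assumptions; in practice multipath I/O is handled by the operating system or HBA driver):

    # Multipath I/O concept: use the first healthy path to the storage.
    PATHS = ["controller-A", "controller-B"]      # redundant paths (example)

    def read_block(block, path_ok):
        """path_ok(path) -> bool stands in for real path health detection."""
        for path in PATHS:
            if path_ok(path):
                return f"read block {block} via {path}"
        raise IOError("all paths to storage have failed")

    # Simulate controller-A failing; I/O continues over controller-B.
    print(read_block(42, path_ok=lambda p: p != "controller-A"))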
Replication context
- Local storage (RAID)
- Storage area network (SAN)
- Database
- Virtual machine (VM)
Data replication is technology that maintains exact copies of data at more than one
location. RAID mirroring and parity implement types of replication between local
storage devices. Data replication can be applied in many other contexts:
• Storage Area Network (SAN)—most enterprise storage is configured as a SAN. A
SAN is a high-speed fiber optic network of storage devices built from technologies
such as Fibre Channel, Small Computer System Interface (SCSI), or Infiniband.
Redundancy can be provided within the SAN, and replication can also take place
between SANs using WAN links.
• Database—much data is stored within a database. Where a database is replicated
between multiple servers or sites, it is very important to maintain consistency
between the replicas. Database management systems come with specific tools to
implement different kinds of replication.
• Virtual Machine (VM)—the same VM instance may need to be deployed in multiple
locations. This can be achieved by replicating the VM’s disk image and configuration
settings.
Geographic dispersal
Geographic dispersal refers to replicating data to hot and warm sites that are physically distant from one another. This means that data is protected against a natural disaster wiping out storage at one of the sites. This is also described as a geo-redundant solution.
Asynchronous and synchronous replication
- Synchronous (must be written at both sites—expensive)
- Asynchronous (one site is primary and the others secondary)
- Optimum distances between sites
Synchronous replication is designed to write data to all replicas simultaneously.
Therefore, all replicas should always have the same data all of the time. Asynchronous
replication writes data to the primary storage first, and then copies data to the replicas
at scheduled intervals.
Asynchronous replication isn’t a good choice for a solution that requires data in
multiple locations to be consistent, such as data from product inventory lists accessed
in different regions. Many geo-redundant replication services rely on asynchronous
replication due to the distances between data centers in multiple regions. In some
cases, business solutions work around the limitations of asynchronous replication. For
example, an online retailer may choose only to show inventory from their local regional
warehouse.
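A minimal sketch contrasting the two modes (the in-memory lists are a toy model of primary and replica storage):

    # Synchronous vs. asynchronous replication, in outline.
    primary, replicas, pending = [], [[], []], []

    def write_sync(record):
        """Synchronous: the write completes only when every replica has it."""
        primary.append(record)
        for replica in replicas:
            replica.append(record)      # all copies identical at all times

    def write_async(record):
        """Asynchronous: acknowledge after the primary write; copy later."""
        primary.append(record)
        pending.append(record)          # replicas lag until the next flush

    def flush_async():
        """Runs at scheduled intervals to bring the replicas up to date."""
        for replica in replicas:
            replica.extend(pending)
        pending.clear()

    write_sync("order-1")               # both replicas now hold order-1
    write_async("order-2")              # replicas unchanged until flush
    flush_async()                       # now all replicas hold order-2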
On-Premises versus Cloud
High availability through redundancy and replication is resource-intensive, especially
when configuring multiple hot or warm sites. For on-premises sites, provisioning the
storage devices and high-bandwidth, low-latency WAN links required between two
geographically dispersed hot sites could incur unaffordable costs. This cost is one of the
big drivers of cloud services, where local and geographic redundancy are built into the
system, if you trust the CSP to operate the cloud effectively. For example, in the cloud,
geo-redundancy replicates data or services between data centers physically located
in two different regions. Disasters that occur at the regional level, like earthquakes,
hurricanes, or floods, should not impact availability across multiple zones.
Backups and Retention Policy
- Short term retention
  - Version control and recovery from corruption/malware
- Long term retention
  - Regulatory/business requirements
- Recovery window
  - Recovery point objective (RPO)
•Short term retention
- Version control and recovery from corruption/malware
In the short term, files that change frequently might need retaining for version
control. Short-term retention is also important in recovering from malware
infection. Consider the scenario where a backup is made on Monday, a file is
infected with a virus on Tuesday, and when that file is backed up later on Tuesday,
the copy made on Monday is overwritten. This means that there is no good means
of restoring the uninfected version of the file. Short-term retention is determined by
how often the youngest media sets are overwritten.
•Long term retention
- Regulatory/business requirements
•Recovery window
- Recovery point objective (RPO)
For these reasons, backups are retained going back to certain points in time. As backups take up
a lot of space, and there is never limitless storage capacity, this introduces the need for
storage management routines to reduce the amount of data occupying backup storage
media while giving adequate coverage of the required recovery window. The recovery
window is determined by the recovery point objective (RPO), which is determined
through business continuity planning. Advanced backup software can prevent media
sets from being overwritten in line with the specified retention policy.
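As an illustration of retention-aware overwriting, a minimal sketch (the 28-day policy and media names are assumptions):

    from datetime import date, timedelta

    RETENTION = timedelta(days=28)      # example retention policy

    def overwritable(media_sets, today):
        """Return media sets whose backups have aged out of retention."""
        return [name for name, taken in media_sets.items()
                if today - taken > RETENTION]

    sets = {"tape-1": date(2023, 1, 2), "tape-2": date(2023, 1, 30)}
    print(overwritable(sets, today=date(2023, 2, 6)))   # ['tape-1']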
Backup Types
(See the graphic in the slide or guide for a comparison of backup types.)
•Full (all selected data)
•Incremental (new and modified data since the last backup)
•Differential (all new and modified data since the last full backup)
•Snapshots
- Feature of file system allowing open file copy
- Volume Shadow Copy Service (VSS)
- VM snapshots and checkpoints
- Image-based backup
- System images
Snapshots are a means of getting around the problem of open files. If the data that
you’re considering backing up is part of a database, such as SQL data or an Exchange
messaging system, then the data is probably being used all the time. Often copy-based
mechanisms will be unable to back up open files. Short of closing the files, and so too
the database, a copy-based system will not work. A snapshot is a point-in-time copy
of data maintained by the file system. A backup program can use the snapshot rather
than the live data to perform the backup. In Windows, snapshots on NTFS volumes are provided by the Volume Shadow Copy Service (VSS). Snapshots are also supported on Sun’s ZFS file system, and under some enterprise distributions of Linux.
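Conceptually, a snapshot freezes a point-in-time view while writes continue against the live data. A toy sketch of the idea (this is not how VSS or ZFS implement snapshots internally):

    # Point-in-time snapshot of a simple key/value "file system".
    live = {"report.doc": "v1", "db.mdb": "v1"}
    snapshot = dict(live)            # point-in-time copy taken at backup start

    live["report.doc"] = "v2"        # the open file keeps changing

    print(live["report.doc"])        # v2 (current data)
    print(snapshot["report.doc"])    # v1 (what the backup program reads)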
Virtual system managers can usually take snapshot or cloned copies of VMs. A
snapshot remains linked to the original VM, while a clone becomes a separate VM from
the point that the cloned image was made.
An image backup is made by duplicating an OS installation. This can be done either
from a physical hard disk or from a VM’s virtual hard disk. Imaging allows the system
to be redeployed quickly, without having to reinstall third-party software, patches, and
configuration settings. A system image should generally not contain any user data files,
as these will quickly become out of date.