3.4 Resiliency Flashcards by Charlie Aligaen

High availability

Redundancy doesn’t always mean always available
– May need to be powered on manually
* HA (high availability)
– Always on, always available
* May include many different components
working together
– Active/Active can provide scalability advantages
* Higher availability almost always means higher costs
– There’s always another contingency you could add
– Upgraded power, high-quality server components, etc.

How well did you know this?

Not at all

Perfectly

Server clustering

Combine two or more servers
– Appears and operates as a single large server
– Users only see one device
* Easily increase capacity and availability
– Add more servers to the cluster
* Usually configured in the operating system
– All devices in the cluster commonly use the same OS

How well did you know this?

Not at all

Perfectly

Load balancing

Load is distributed across multiple servers
– The servers are often unaware of each other
* Distribute the load across multiple devices
– Can be different operating systems
* The load balancer adds or removes devices
– Add a server to increase capacity
– Remove any servers not responding

How well did you know this?

Not at all

Perfectly

Site resiliency

Recovery site is prepped
– Data is synchronized
* A disaster is called
– Business processes failover to the alternate
processing site
* Problem is addressed
– This can take hours, weeks, or longer
* Revert back to the primary location
– The process must be documented for
both directions

How well did you know this?

Not at all

Perfectly

Hot site

An exact replica
– Duplicate everything
* Stocked with hardware
– Constantly updated
– You buy two of everything
* Applications and software are constantly updated
– Automated replication
* Flip a switch and everything moves
– This may be quite a few switches

How well did you know this?

Not at all

Perfectly

Cold site

No hardware
– Empty building
* No data
– Bring it with you
* No people
– Bus in your team

How well did you know this?

Not at all

Perfectly

Warm site

Somewhere between cold and hot
– Just enough to get going
* Big room with rack space
– You bring the hardware
* Hardware is ready and waiting
– You bring the software and data
– Geographic dispersion
* These sites should be physically different than the
organization’s primary location
– Many disruptions can affect a large area
– Hurricane, tornado, floods, etc.
* Can be a logistical challenge
– Transporting equipment
– Getting employee’s on-site
– Getting back to the main office

How well did you know this?

Not at all

Perfectly

Platform diversity

Every operating system contains potential security issues
– You can’t avoid them
* Many security vulnerabilities are specific to a single OS
– Windows vulnerabilities don’t commonly affect Linux
or macOS
– And vice versa
* Use many different platforms
– Different applications, clients, and OSes
– Spread the risk around

How well did you know this?

Not at all

Perfectly

Multi-cloud systems

There are many cloud providers
– Amazon Web Service, Microsoft Azure,
Google Cloud, etc.
* Plan for cloud outages
– These can sometimes happen
* Data is both geographically dispersed and
cloud service dispersed
– A breach with one provider would not affect the
others
– Plan for every contingency

How well did you know this?

Not at all

Perfectly

Continuity of operations planning (COOP)

Not everything goes according to plan
– Disasters can cause a disruption to the norm
We rely on our computer systems
– Technology is pervasive
There needs to be an alternative
– Manual transactions
– Paper receipts
– Phone calls for transaction approvals
These must be documented and tested before a
problem occurs

How well did you know this?

Not at all

Perfectly

Capacity planning

Match supply to the demand
– This isn’t always an obvious equation
* Too much demand
– Application slowdowns and outages
* Too much supply
– You’re paying too much
* Requires a balanced approach
– Add the right amount of people
– Apply appropriate technology
– Build the best infrastructure

How well did you know this?

Not at all

Perfectly

People

Some services require human intervention
– Call center support lines
– Technology services
* Too few employees
– Recruit new staff
– It may be time consuming to add more staff
* Too many employees
– Redeploy to other parts of the organization
– Downsize

How well did you know this?

Not at all

Perfectly

Technology

Pick a technology that can scale
– Not all services can easily grow and shrink
* Web services
– Distribute the load across multiple web services
* Database services
– Cluster multiple SQL servers
– Split the database to increase capacity
* Cloud services
– Services on demand
– Seemingly unlimited resources (if you pay the money)

How well did you know this?

Not at all

Perfectly

Infrastructure

The underlying framework
– Application servers, network services, etc.
– CPU, network, storage
* Physical devices
– Purchase, configure, and install
* Cloud-based devices
– Easier to deploy
– Useful for unexpected capacity changes

How well did you know this?

Not at all

Perfectly

Recovery testing

Test yourselves before an actual event
– Scheduled update sessions (annual, semi-annual, etc.)
* Use well-defined rules of engagement
– Do not touch the production systems
* Very specific scenario
– Limited time to run the event
* Evaluate response
– Document and discuss

How well did you know this?

Not at all

Perfectly

Tabletop exercises

Study These Flashcards

Performing a full-scale disaster drill can be costly
– And time consuming
* Many of the logistics can be determined through analysis
– You don’t physically have to go through a disaster or drill
* Get key players together for a tabletop exercise
– Talk through a simulated disaster

Fail over

Study These Flashcards

A failure is often inevitable
– It’s “when,” not “if”
* We may be able to keep running
– Plan for the worst
* Create a redundant infrastructure
– Multiple routers, firewalls, switches, etc.
* If one stops working, fail over to the operational unit
– Many infrastructure devices and services can do this
automatically

Simulation

Study These Flashcards

Test with a simulated event
– Phishing attack, password requests, data breaches
* Going phishing
– Create a phishing email attack
– Send to your actual user community
– See who bites
* Test internal security
– Did the phishing get past the filter?
* Test the users
– Who clicked?
– Additional training may be required

Parallel processing

Study These Flashcards

Split a process through multiple (parallel) CPUs
– A single computer with multiple CPU cores or
multiple physical CPUs
– Multiple computers
* Improved performance
– Split complex transactions across
multiple processors
* Improved recovery
– Quickly identify a faulty system
– Take the faulty device out of the list of available
processors
– Continue operating with the remaining processors

Backups

Study These Flashcards

Incredibly important
– Recover important and valuable data
– Plan for disaster
* Many different implementations
– Total amount of data
– Type of backup
– Backup media
– Storage location
– Backup and recovery software
– Day of the week

Onsite vs. offsite backups

Study These Flashcards

On site backups
– No Internet link required
– Data is immediately available
– Generally less expensive than off site
* Off site backups
– Transfer data over Internet or WAN link
– Data is available after a disaster
– Restoration can be performed from anywhere
* Organizations often use both
– More copies of the data
– More options when restoring

Frequency

Study These Flashcards

How often to backup
– Every week, day, hour?
* This may be different between systems
– Some systems may not change much each day
* May have multiple backup sets
– Daily, weekly, and monthly
* This requires significant planning
– Multiple backup sets across different days
– Lots of media to manage

Encryption

Study These Flashcards

A history of data is on backup media
– Some of this media may be offsite
* This makes it very easy for an attacker
– All of the data is in one place
* Protect backup data using encryption
– Everything on the backup media is unreadable
– The recovery key is required to restore the data
* Especially useful for cloud backups and storage
– Prevent anyone from eavesdropping

Snapshots

Study These Flashcards

Became popular on virtual machines
– Very useful in cloud environments
* Take a snapshot
– An instant backup of an entire system
– Save the current configuration and data
* Take another snapshot after 24 hours
– Contains only the changes between snapshots
* Take a snapshot every day
– Revert to any snapshot
– Very fast recovery

Recovery testing

It’s not enough to perform the backup – You have to be able to restore * Disaster recovery testing – Simulate a disaster situation – Restore from backup * Confirm the restoration – Test the restored application and data * Perform periodic audits – Always have a good backup – Weekly, monthly, quarterly checks

Replication

An ongoing, almost real-time backup – Keep data synchronized in multiple locations * Data is available – There’s always a copy somewhere * Data can be stored locally to all users – Replicate data to all remote sites * Data is recoverable – Disasters can happen at any time

Journaling

Power goes out while writing data to storage – The stored data is probably corrupted * Recovery could be complicated – Remove corrupted files, restore from backup * Before writing to storage, make a journal entry – After the journal is written, write the data to storage * After the data is written to storage, update the journal – Clear the entry and get ready for the next

Power resiliency

Power is the foundation of our technology – It’s important to properly engineer and plan for outages * We usually don’t make our own power – Power is likely provided by third-parties – We can’t control power availability * There are ways to mitigate power issues – Short power outages – Long-term power issues

UPS

Uninterruptible Power Supply – Short-term backup power – Blackouts, brownouts, surges * UPS types – Offline/Standby UPS – Line-interactive UPS – On-line/Double-conversion UPS * Features – Auto shutdown, battery capacity, outlets, phone line suppression

Generators

Long-term power backup – Fuel storage required * Power an entire building – Some power outlets may be marked as generator-powered * It may take a few minutes to get the generator up to speed – Use a battery UPS while the generator is starting

3.4 Resiliency Flashcards

(30 cards)