Business Continuity Flashcards
Define business continuity…
Seeks to minimise business activity disruption when something unexpected happens
Define Disaster recovery…
The act of responding to an event that threatens business continuity
Define high availability…
Designing in redundancies to reduce the chance of impacting service levels
Define fault tolerance…
The ability to tolerate faults, achieved by designing in the capacity to absorb problems without impacting service levels
What is a service level agreement?
An agreed goal or target for a given service on its performance or availability
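To make an availability target concrete, it can be converted into a downtime budget. The sketch below is illustrative only: the function name `allowed_downtime_minutes` and the 30-day period are assumptions, not part of any SLA definition.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime an availability target permits over the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets over an assumed 30-day period.
for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {budget:.1f} minutes of downtime")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why higher targets need more redundancy.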
Define RTO…
Recovery Time Objective…
The target time within which business processes must be restored to their service levels after a disruption
Define RPO…
Recovery Point Objective…
An acceptable amount of data loss measured in time
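Both objectives are time spans, so an incident can be checked against them with simple date arithmetic. This is a minimal sketch: the function names and the incident timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, disruption: datetime, rpo: timedelta) -> bool:
    """Data written after the last backup is lost, so that gap must fit within the RPO."""
    return disruption - last_backup <= rpo


def meets_rto(disruption: datetime, restored: datetime, rto: timedelta) -> bool:
    """The outage duration must fit within the RTO."""
    return restored - disruption <= rto


# Hypothetical incident: nightly backup at 02:00, failure at 09:30, service back at 11:00.
last_backup = datetime(2024, 1, 1, 2, 0)
disruption = datetime(2024, 1, 1, 9, 30)
restored = datetime(2024, 1, 1, 11, 0)

print(meets_rpo(last_backup, disruption, rpo=timedelta(hours=4)))  # False: 7.5 h of data at risk
print(meets_rto(disruption, restored, rto=timedelta(hours=2)))     # True: restored in 1.5 h
```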
What does the Business continuity plan define?
The acceptable RPO and RTO
What justifies the HA investment?
The RPO and RTO
What does the disaster recovery plan deliver?
The RTO and RPO
Name and provide examples of the 9 categories of disasters…
1) Hardware failure- Network switch power supply fails and brings down a LAN
2) Deployment failure- Deploying a patch that breaks a key ERP business process
3) Load induced- DDoS attack
4) Data induced- Ariane 5 rocket failure caused by a float-to-integer conversion overflow
5) Credential expiration- An SSL/TLS certificate expires on your site
6) Dependency- S3 subsystem failure which causes other services to fail
7) Infrastructure- A construction crew cuts through a fibre cable
8) Identifier exhaustion- We currently don’t have sufficient capacity in the AZ you have requested
9) Human error!
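Some of these categories can be caught before they become disasters. Credential expiration is one: a scheduled check can warn ahead of time. The sketch below assumes the certificate's expiry date has already been obtained by some other means; the function names and the 30-day warning window are illustrative.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before the credential expires (negative if already expired)."""
    return (not_after - now).days


def needs_renewal(not_after: datetime, now: datetime, warn_days: int = 30) -> bool:
    """True when the credential expires within the warning window."""
    return not_after - now <= timedelta(days=warn_days)


# Hypothetical certificate that expires 20 days from the check date.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2024, 1, 21, tzinfo=timezone.utc)
print(days_until_expiry(not_after, now), needs_renewal(not_after, now))
```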
What are the 4 disaster recovery architectures?
1) Backup and restore
2) Pilot light
3) Warm standby
4) Multi-site
Name 2 pros and cons of a backup and restore DR architecture…
Pro-
1) Very common entry point into AWS
2) Minimal effort to configure
Con-
1) Least flexibility
2) Analogous to off-site back-up
Name 2 pros and 3 cons of a Pilot light DR architecture…
Pro-
1) Cost effective way to maintain a “hot site”
2) Suitable for a variety of landscapes and applications
Con-
1) Usually requires manual intervention for fail over
2) Spinning up cloud environments will take minutes to hours
3) Must keep AMIs up-to-date with on-prem counterparts
Name 2 pros and cons of a Warm standby DR architecture…
Pro-
1) All services are up and ready to accept a failover within minutes or seconds
2) Can be used as a “shadow environment” for testing or production staging
Con-
1) Resources would need to be scaled to accept production load
2) Still requires some environment adjustments, but these could be scripted
Name 3 pros and 2 cons of a multi-site DR architecture…
Pro-
1) Ready all the time to take full production load; effectively a mirrored data centre
2) Fails over in seconds or less
3) No or little intervention required
Con-
1) Most expensive option
2) Can be perceived as wasteful as you have resources just standing around waiting for the primary to fail
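The trade-off across the four architectures is cost against recovery speed, so one way to choose is to take the cheapest pattern whose recovery time still meets the RTO. The sketch below is illustrative only: the typical recovery times in the table are placeholder assumptions, not AWS figures, and should be replaced with measured values.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive.
# The recovery times are illustrative assumptions only.
DR_PATTERNS = [
    ("Backup and restore", timedelta(hours=24)),
    ("Pilot light", timedelta(hours=4)),
    ("Warm standby", timedelta(minutes=15)),
    ("Multi-site", timedelta(seconds=30)),
]


def cheapest_pattern(required_rto: timedelta) -> str:
    """Return the first (cheapest) pattern whose assumed recovery time meets the RTO."""
    for name, typical_recovery in DR_PATTERNS:
        if typical_recovery <= required_rto:
            return name
    raise ValueError("no pattern in the table meets the required RTO")


print(cheapest_pattern(timedelta(hours=8)))    # Pilot light
print(cheapest_pattern(timedelta(minutes=1)))  # Multi-site
```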
Are EBS volumes replicated automatically within a single AZ or multi-AZ by default?
A single AZ by default
… This makes them vulnerable to AZ failure
What is RAID0?
Aka striping, provides the fastest reads and writes but no redundancy of data stored on drives
What is RAID1?
Aka mirroring, where data is mirrored across 2 drives. Can tolerate total failure of 1 drive.
What is RAID6?
High redundancy, as 2 drives can fail, but writes are very slow
Which RAID configuration does AWS NOT recommend, and why?
RAID5/6, as EBS volumes are accessed over the network and writing parity bits consumes IOPS
Which RAID configuration does AWS recommend?
RAID1
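The capacity and redundancy trade-offs of the three RAID levels above can be worked out with simple arithmetic. This is a minimal sketch using the general textbook rules for each level; the function name and return shape are made up for illustration.

```python
def raid_summary(level: int, drives: int, drive_gib: int) -> dict:
    """Usable capacity and number of drive failures tolerated for RAID 0, 1 and 6."""
    if level == 0:
        if drives < 2:
            raise ValueError("RAID 0 needs at least 2 drives")
        usable, tolerated = drives * drive_gib, 0          # striping: all space, no redundancy
    elif level == 1:
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        usable, tolerated = drive_gib, drives - 1          # mirroring: one drive's worth of space
    elif level == 6:
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        usable, tolerated = (drives - 2) * drive_gib, 2    # two drives' worth of parity
    else:
        raise ValueError("only RAID 0, 1 and 6 are covered in this sketch")
    return {"usable_gib": usable, "failures_tolerated": tolerated}


print(raid_summary(0, drives=2, drive_gib=100))
print(raid_summary(1, drives=2, drive_gib=100))
print(raid_summary(6, drives=4, drive_gib=100))
```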
Does EFS support multi-AZ?
Yes
What is critical for rapid failover in HA and BC systems?
Up-to-date AMIs
What is the only way to GUARANTEE that a resource such as an EC2 instance will be available when you need it?
Using reserved instances (a zonal Reserved Instance includes a capacity reservation in its AZ)
How can Route53 be used to provide a DR solution?
By conducting health checks and redirecting traffic, e.g. from on-prem to an AWS environment
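The health-check-and-redirect idea can be sketched as a tiny simulation. This is not the Route 53 API: the `resolve` function and the endpoint names are hypothetical and only illustrate the decision a failover routing policy makes.

```python
from typing import Callable


def resolve(primary: str, secondary: str, is_healthy: Callable[[str], bool]) -> str:
    """Send traffic to the primary while its health check passes, otherwise to the secondary."""
    return primary if is_healthy(primary) else secondary


# Hypothetical health-check results: the on-prem site is down, the AWS site is up.
health = {"onprem.example.com": False, "aws.example.com": True}

target = resolve("onprem.example.com", "aws.example.com", lambda endpoint: health[endpoint])
print(target)  # aws.example.com
```

In a real deployment the health checks run continuously and DNS record TTLs affect how quickly clients follow the change.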
Describe the order of preference when choosing a database in terms of HA and BC…
DynamoDB > Aurora (redundant and auto recover features) > Multi-AZ RDS with frequent RDS snapshots
Is a master to a standby asynchronous or synchronous in a Multi-AZ RDS architecture?
Synchronous
Is a master to a read replica asynchronous or synchronous in a Multi-AZ RDS architecture?
Asynchronous
What happens if we lose a master RDS in a multi-AZ RDS architecture?
The standby is promoted to the master
What happens if we lose an entire region in a multi-AZ RDS architecture?
The read replica is promoted to master, and another RDS instance is spun up to act as the read replica and standby. This is manual but can be scripted using a CloudWatch alarm.
Does Redshift support multi-AZ deployment?
No
What is the best HA option for Redshift?
The best option is to use a multi-node cluster that supports data replication and node recovery
What is your only option to restore if a single-node Redshift cluster fails?
You have to restore from a snapshot in S3.
A single-node Redshift cluster does not support data replication.
Does Memcached support replication?
No, a node failure will result in data loss
How can you minimise data lost in Memcached?
You can use multiple nodes in each shard to minimise data loss on an AZ failure
How would you architect HA in redis?
Use multiple nodes in each shard and distribute these nodes across multiple AZs. Can also enable Multi-AZ replication to permit automatic failover if the primary node fails.
How do you ensure HA in your VPC network when using VPN?
Create at least 2 VPN tunnels into your virtual private gateway
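The benefit of a second tunnel can be estimated with the standard formula for redundant components. The sketch below assumes the tunnels fail independently, which is an idealisation; the 99% figure is a made-up example, not an AWS number.

```python
def combined_availability(single: float, count: int) -> float:
    """Probability that at least one of `count` independent components is up."""
    if not 0 <= single <= 1 or count < 1:
        raise ValueError("single must be in [0, 1] and count must be at least 1")
    return 1 - (1 - single) ** count


# Assumed 99% availability per tunnel (illustrative only).
print(f"one tunnel:  {combined_availability(0.99, 1):.4f}")
print(f"two tunnels: {combined_availability(0.99, 2):.4f}")
```

If both tunnels share a failure cause, such as the same customer router, the real gain is smaller than this estimate.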
What is FMEA?
Failure mode and effects analysis
A systematic process to examine what could go wrong, what impact it might have, how likely it is to occur, and how well we can detect and react to it.
What are the 3 steps in an FMEA?
1) Collect all possible failures
2) Assign scores (risk priority number, RPN; higher is worse)
3) Prioritise based on RPN, highest first
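The three steps can be sketched in a few lines. This assumes the common FMEA convention of scoring severity, occurrence and detection from 1 to 10 and multiplying them to get the RPN; the failure modes and scores below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = easily detected, 10 = very hard to detect

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means worse."""
        return self.severity * self.occurrence * self.detection


def prioritise(modes: list[FailureMode]) -> list[FailureMode]:
    """Order failure modes by RPN, highest first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Step 1: collect failures. Step 2: assign (hypothetical) scores.
modes = [
    FailureMode("Expired TLS certificate", severity=7, occurrence=4, detection=2),
    FailureMode("Fibre cable cut", severity=9, occurrence=2, detection=3),
    FailureMode("Bad patch deployment", severity=8, occurrence=5, detection=4),
]

# Step 3: prioritise by RPN.
for mode in prioritise(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```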
What is the relationship between RPO and BC?
The recovery point objective will define the potential for data loss during a disaster. This can inform an expectation of manual data re-entry for BC planners
Which RAID option provides the highest write performance?
RAID0
What is Aurora Global database?
A service that allows you to failover to a secondary cluster in a different region. It means your database will survive even in the unlikely event of a regional degradation or outage