Disaster Recovery + Migrations Flashcards
What is RPO?
Recovery Point Objective:
It’s the maximum amount of data loss you are willing to accept, measured as time. In practice it is set by how often you run backups: the gap between your latest backup and the time of a disaster.
When a disaster happens, any data written between the latest backup and the disaster is lost.
For example, if you back up data every hour, your RPO is 1 hour. When disaster strikes, you can go back at most an hour to recover your data, so the data you lose is whatever was written between the latest backup and the disaster.
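The hourly-backup example can be worked through as a small calculation. This is only a sketch: the function names, the fixed backup schedule, and the timestamps are assumptions made up for illustration.

```python
from datetime import datetime, timedelta

def last_backup_before(disaster: datetime, first_backup: datetime,
                       interval: timedelta) -> datetime:
    """Return the most recent scheduled backup at or before the disaster."""
    if disaster < first_backup:
        raise ValueError("no backup exists before the disaster")
    # Number of whole backup intervals completed before the disaster.
    completed = (disaster - first_backup) // interval
    return first_backup + completed * interval

def data_loss_window(disaster: datetime, first_backup: datetime,
                     interval: timedelta) -> timedelta:
    """Time span of data lost: from the latest backup to the disaster."""
    return disaster - last_backup_before(disaster, first_backup, interval)

# Hypothetical schedule: hourly backups starting at midnight (RPO = 1 hour).
first_backup = datetime(2024, 1, 1, 0, 0)
interval = timedelta(hours=1)
disaster = datetime(2024, 1, 1, 14, 40)

loss = data_loss_window(disaster, first_backup, interval)
print(loss)  # 40 minutes of data lost, never more than the 1 hour RPO
```

With a fixed schedule like this, the loss window can never exceed the backup interval, which is why the interval is treated as the RPO.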
Which value identifies how much data loss you are willing to accept in case a disaster happens?
The RPO: Recovery Point Objective
What happens to data written between the latest backup (the recovery point) and the time a disaster strikes?
That data is lost.
If you back up your data once a week, what is your RPO?
RPO = 1 week.
What is RTO?
Recovery Time Objective: the amount of downtime an application has or can tolerate.
RTO is the downtime between the time of a disaster and the time you are back in production (meaning a replica was activated, or a backup was restored and put into production, etc.).
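As a rough illustration of the definition, downtime can be measured from two timestamps and compared against an RTO target. The function names, timestamps, and the 4 hour target are all hypothetical.

```python
from datetime import datetime, timedelta

def downtime(disaster: datetime, back_in_production: datetime) -> timedelta:
    """Downtime runs from the disaster until production is serving again."""
    if back_in_production < disaster:
        raise ValueError("recovery cannot finish before the disaster")
    return back_in_production - disaster

def meets_rto(actual_downtime: timedelta, rto_target: timedelta) -> bool:
    """True when the recovery finished within the RTO target."""
    return actual_downtime <= rto_target

# Hypothetical incident: disaster at 14:40, backup restored and live at 18:10.
outage = downtime(datetime(2024, 1, 1, 14, 40), datetime(2024, 1, 1, 18, 10))
print(outage)                                 # 3:30:00
print(meets_rto(outage, timedelta(hours=4)))  # True
```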
What are the disaster recovery strategies?
Backup and Restore
Pilot Light
Warm Standby
Hot Site / Multi Site approach
What are warm and cold disaster recovery setups?
Colder setups have higher (slower) RPO and RTO; warmer setups have lower (faster) RPO and RTO.
For example, backup and restore is cold, since it has a high RPO and RTO compared to site recovery strategies or replication strategies.
What are some backup and restore strategies in AWS?
Backup Examples:
Back up data from the corporate DC into S3 through Storage Gateway, and move it to Glacier with lifecycle policies.
This could have an RPO of 1 day for example.
Or once a week you send a Snowball device with tons of data from your DC to AWS, where it is loaded into S3 and can then be moved to Glacier. Here your RPO is 1 week.
Also, when using AWS services like EBS volumes, RDS, or Redshift, you can schedule regular snapshots. Your RPO could be 1 day, 2 hours, or 1 hour, depending on how frequently you run these snapshots.
These are all backup strategies, and they have a higher RPO than replication strategies.
Restore Examples:
Use AMIs to recreate EC2 instances and spin up your applications, or restore your RDS databases, etc., straight from a snapshot.
Restoring your data from backups takes a lot of time, so you get a high RTO as well.
RTO and RPO are high, but backup and restore is cheaper.
What is Pilot Light strategy?
It’s a disaster recovery strategy, in which a smaller version of your production systems (apps, databases, servers, configurations) is always up and running in the cloud. These are the “critical core” components of your systems. You only include what is critical for your business to operate, so that in case of a disaster it’s ready to run and to be scaled into production quickly.
How do you keep a version of your critical core running in the cloud? With continuous replication of those critical servers, for example a database.
Then, in case of a disaster, you restore the less critical servers from backup.
This lowers your RPO and RTO compared to backup and restore.
This could be from on-premises to the cloud, or from one cloud region to another.
What do you need to do in case of a disaster when using pilot light as a disaster recovery strategy?
Similar to backup and restore, but your critical systems are already running somewhere else (for example, in the cloud), so you only need to restore and add the less critical systems.
What is Warm Standby?
All your servers are up and running in the cloud, but at minimal size.
Then, upon a disaster, you scale them up to production load.
This could be from on-premises to the cloud, or from one cloud region to another.
Scaling can be triggered with alarms and an Auto Scaling Group (ASG) in the case of EC2, or with RDS scaling.
Lower RTO, because all standby resources are already running and only need to be scaled up to handle production load.
More expensive than pilot light, because you keep more resources running on standby.
What role does Route 53 take in disaster recovery situations?
Route 53 can fail over your infrastructure when a disaster occurs in your on-premises DC or in an AWS region. The failover destination would be another AWS region.
Route 53 can reroute traffic from unhealthy resources to backup resources, thus performing failovers.
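The failover idea can be sketched as a toy model. This is not the Route 53 API: the `Endpoint` class, the `resolve` function, and the endpoint names are hypothetical, and raising an error when nothing is healthy is a choice of this sketch, not a statement about how the real service behaves.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    healthy: bool  # stands in for the result of a health check

def resolve(primary: Endpoint, secondary: Endpoint) -> str:
    """Send traffic to the primary while healthy, else fail over."""
    if primary.healthy:
        return primary.name
    if secondary.healthy:
        return secondary.name
    # Toy-model choice: fail loudly when nothing is healthy.
    raise RuntimeError("no healthy endpoint available")

# Hypothetical sites: an on-premises DC backed by an AWS region.
primary = Endpoint("on-prem-dc", healthy=True)
secondary = Endpoint("aws-backup-region", healthy=True)

before = resolve(primary, secondary)  # traffic goes to the primary
primary.healthy = False               # disaster: health check starts failing
after = resolve(primary, secondary)   # traffic fails over to the backup
print(before, "->", after)
```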
What is the Multi Site / Hot Site approach?
It has a very low RTO (minutes or seconds).
You have full production scale running both on-premises and in the AWS cloud (or only in the cloud, using 2 AWS regions).
This would be an active-active setup, with Route 53 routing traffic to both sites.
The most expensive option. Lowest RTO and RPO.
Multi DC type of infrastructure.
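The four strategies trade cost against recovery speed, so choosing one can be sketched as picking the cheapest option that still meets a required RTO. The cost ordering follows these cards; the RTO figures in minutes are hypothetical placeholders, not measured or official values.

```python
from typing import NamedTuple, Optional

class Strategy(NamedTuple):
    name: str
    cost_rank: int            # 1 = cheapest, 4 = most expensive
    typical_rto_minutes: int  # HYPOTHETICAL numbers, for illustration only

STRATEGIES = [
    Strategy("Backup and Restore", 1, 24 * 60),
    Strategy("Pilot Light", 2, 4 * 60),
    Strategy("Warm Standby", 3, 30),
    Strategy("Multi Site / Hot Site", 4, 1),
]

def cheapest_meeting_rto(max_rto_minutes: int) -> Optional[Strategy]:
    """Return the cheapest strategy whose assumed RTO fits the requirement."""
    candidates = [s for s in STRATEGIES
                  if s.typical_rto_minutes <= max_rto_minutes]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_rank)

choice = cheapest_meeting_rto(60)  # the business tolerates 1 hour of downtime
print(choice.name if choice else "no strategy fits")
```

Under these assumed figures a 1 hour requirement rules out the two colder strategies, so the sketch picks Warm Standby rather than paying for a full Multi Site setup.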
What are good options for backing up data from on-premises to the cloud?
Snowball
Storage Gateway
Which service helps you migrate DNS from one region to another, or from on-premises to AWS?
Route 53