AWS Reliability Pillar (March 2018) Flashcards
Reliability Pillar description
The ability of a system to recover from disruptions, scale dynamically to meet demand, and mitigate disruptions
Service Availability Definition
Percentage of time an application is operating normally
Availability = Normal Operations Time / Total Time
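A minimal Python sketch of the formula above. The function name and the sample hours are hypothetical, chosen only for illustration:

```python
def availability(normal_operation_time: float, total_time: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return normal_operation_time / total_time


# Hypothetical figures: 8,750 hours of normal operation in an 8,760-hour year.
yearly = availability(8750, 8760)
print(f"Availability: {yearly:.4%}")
```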
Availability with hard dependencies
Many systems have hard dependencies on other systems, so an interruption in a dependency directly interrupts the system that calls it
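With hard dependencies, the combined availability is the product of the individual availabilities, so it is always lower than the weakest link. A rough sketch, assuming the failures are independent (the function name and the 99.99% figures are illustrative):

```python
from math import prod


def availability_with_hard_dependencies(own: float, dependencies: list[float]) -> float:
    """Multiply the system's own availability by that of each hard dependency."""
    return own * prod(dependencies)


# Illustrative case: a 99.99% service that hard-depends on one 99.99% service.
combined = availability_with_hard_dependencies(0.9999, [0.9999])
print(f"Combined availability: {combined:.4%}")
```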
Calculating availability with redundant components
When a system uses independent, redundant components (e.g. AZs), the theoretical availability is 100% minus the product of the component failure rates
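The redundancy calculation can be sketched as follows. This assumes the components fail independently, which real systems only approximate; the function name and the 99.9% figures are illustrative:

```python
from math import prod


def redundant_availability(component_availabilities: list[float]) -> float:
    """Return 1 minus the product of the component failure rates."""
    failure_rates = [1 - a for a in component_availabilities]
    return 1 - prod(failure_rates)


# Illustrative case: two independent components, each available 99.9% of the time.
combined = redundant_availability([0.999, 0.999])
print(f"Theoretical availability: {combined:.6%}")
```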
Calculating dependency availability
Estimate by determining MTBF (mean time between failures) and MTTR (mean time to recover)
Availability = MTBF / (MTBF + MTTR)
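A small sketch of the MTBF/MTTR estimate. Both values must be in the same unit; the sample numbers (150 days between failures, 1 hour to recover) are illustrative:

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate availability as MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative case: a failure every 150 days on average, 1 hour to recover.
estimate = availability_from_mtbf(mtbf_hours=150 * 24, mttr_hours=1)
print(f"Estimated availability: {estimate:.2%}")
```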
Describe costs of high availability
innovation suffers because of need to move slowly
more testing and validation
software and services are more expensive
First Step of planning network topology -
Planning IP addressing
Allow IP address space for > 1 VPC per region
Consider cross-account connections (connecting multiple VPCs in the organization)
Within a VPC, allow space for multiple subnets that span multiple AZs
leave unused CIDR block space in a VPC
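The addressing guidance above can be checked on paper with Python's standard `ipaddress` module. This is only a planning sketch: the CIDR ranges, prefix sizes, and AZ labels are hypothetical, and it makes no AWS calls:

```python
import ipaddress

# Hypothetical regional block, large enough for more than one VPC.
region_block = ipaddress.ip_network("10.0.0.0/14")
vpc_blocks = list(region_block.subnets(new_prefix=16))  # four /16 VPC ranges

# Carve the first VPC range into /20 subnets (16 available).
vpc = vpc_blocks[0]
subnets = list(vpc.subnets(new_prefix=20))

# Place one subnet in each AZ (placeholder labels) and leave the rest unused.
azs = ["az-a", "az-b", "az-c"]
allocated = dict(zip(azs, subnets))
unused = subnets[len(azs):]

for az, net in allocated.items():
    print(f"{az}: {net}")
print(f"Unused /20 blocks left in {vpc}: {len(unused)}")
```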
Second step of planning network topology -
Ensure resiliency of connectivity
How will you provide resiliency from failures
how will you handle misconfigurations that cause outages
how will you handle unexpected increases in traffic
how will you handle DoS attacks
Where is connectivity to a VPC governed?
In route table entries
These all function through the route table:
internet gateway
NAT gateway
virtual private gateway
VPC peering
Key services for Network Topology
VPC
also:
Direct Connect
EC2 - run VPN appliances
Route 53 - DNS integrated with ELB helps defend from DoS
ELB - balances load across AZs, provides Layer 7 routing, integrates with AWS WAF and Auto Scaling
AWS Shield - automatic protection against DDoS
AWS Shield Advanced - protects ELB, CloudFront and Route53 Zones
Questions to ask when planning for reliability (how many nines do you really need)?
Note that five nines (99.999%) is possible but typically too expensive to be practical.
What problems are you trying to solve
what specific aspects of the app require specific levels of availability
what amount of cumulative downtime can this workload realistically accumulate in a year
In essence, what’s the real impact of the system being unavailable?
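To make "how many nines" concrete, an availability target can be converted into a yearly downtime budget. A rough sketch, assuming a 365-day year (the function name and the list of targets are illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each additional nine cuts the budget by a factor of ten, which is why the cost of the next nine rises so quickly.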
What’s the goal of decomposing an application into its parts and evaluating reliability requirements for each?
To find the ones that truly require high reliability - to minimize the expense in making things HA that don’t need it
Define “Data Plane” and “Control Plane”
Data Plane delivers real time service
e.g. EC2 instances, RDS databases
Control Plane configures the environment
e.g. launching new instances, adding or changing table metadata
Do data planes typically have higher availability requirements than control planes?
Yes
5 most common ways to improve availability
fault isolation zones
redundant components
micro-service architecture
recovery oriented computing
distributed systems best practices