Resilient Architecture Flashcards
Design HA and Fault Tolerant Systems
How do you autoscale in AWS?
- Set up an Auto Scaling group.
- Set up a load balancer.
- Configure Auto Scaling to respond to CloudWatch alarms.
What is the difference between High Availability (HA), Fault Tolerance (FT) and Disaster Recovery (DR)?
HA is to guarantee maximum uptime
- there can be minimal disruption to service but it is restored quickly
- an off-road vehicle carrying a spare tire encounters a flat
FT is to work through malfunctioning components in the system
- there typically cannot be any loss of functionality during component outage
- a plane in the air with engine failure uses redundant second engine
- a patient on an operation table on critical monitoring equipment that cannot stop functioning
- FT costs a lot to implement and is more complex in design than HA
DR deals with failures on a larger scale than HA or FT can handle
- human induced or natural
- entire system is compromised or lost
- typically solved by having a second physical location to take over, far away from disaster site
- backups should be stored off site for on-prem solutions
- determine what your RTO (Recovery Time Objective) and RPO (Recovery Point Objective) need to be for the use case
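A rough way to sanity-check RTO and RPO targets, assuming the worst-case data loss equals the gap between backups and the worst-case downtime equals detection time plus restore time. The function names and numbers are hypothetical, chosen only for illustration.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is the time between two backups."""
    return backup_interval <= rpo


def meets_rto(detect: timedelta, restore: timedelta, rto: timedelta) -> bool:
    """Worst-case downtime is detection time plus restore time."""
    return detect + restore <= rto


# Hourly backups satisfy a 4 hour RPO, nightly backups do not.
hourly_ok = meets_rpo(timedelta(hours=1), timedelta(hours=4))
nightly_ok = meets_rpo(timedelta(hours=24), timedelta(hours=4))
```

A tighter RPO means more frequent backups or replication; a tighter RTO means a warmer standby site. Both cost more.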
What is Route53?
A DNS service from AWS
- Registers domains; a global service with a single DB
- Hosts Zone Files
- Managed Nameservers (NS), 4 per hosted zone
- Liaises with the TLD registry and provides the NS records that say where a particular domain resides (eg: )
- Zone files store record sets
How does DNS work? High level part 1
- Root Hints file on the DNS resolver (ISP provided) points to the 13 DNS Root servers where the Root Zone lies
- Root Zone is authoritative
- Root Zone is a DB of the top level domains (.com etc)
What are the different types of DNS records?
- “A” record points to the IPv4 of the server
- “CNAME” (canonical name) record points to another name, typically one that has an “A” record; used for alternate names that resolve to the same IP (eg, ftp.google.com, mail.google.com)
- A CNAME can only point to another name, never directly to an IP address (exam question!)
- MX records: Points to a server for a specific mail domain
- TXT records: Arbitrary text to prove domain ownership
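The A and CNAME relationship can be sketched with a toy zone. This is a minimal illustration, not how a real resolver works; the names, the IP address, and the `resolve_ipv4` helper are made up.

```python
# Hypothetical zone data: (name, record type) -> value.
ZONE = {
    ("www.example.com", "A"): "192.0.2.10",
    ("ftp.example.com", "CNAME"): "www.example.com",
    ("mail.example.com", "CNAME"): "www.example.com",
}


def resolve_ipv4(name: str, max_hops: int = 8) -> str:
    """Follow CNAME records until a name with an A record is reached."""
    for _ in range(max_hops):
        if (name, "A") in ZONE:
            return ZONE[(name, "A")]
        if (name, "CNAME") in ZONE:
            # A CNAME value is always another name, never an IP address.
            name = ZONE[(name, "CNAME")]
            continue
        raise LookupError(f"no A or CNAME record for {name}")
    raise LookupError("CNAME chain too long")
```

Both alternate names end up at the same IPv4 address because they alias the name that holds the A record.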
What does TTL on a DNS record indicate?
The TTL value (in seconds) indicates how long a resolver can cache the answer (eg the IPv4 address) returned by the domain resolution
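A minimal sketch of TTL-based caching as a resolver might do it. The `TtlCache` class and its injectable clock are assumptions for illustration, not part of any DNS library.

```python
import time


class TtlCache:
    """Caches answers until their TTL (in seconds) runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl_seconds):
        self._entries[name] = (value, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # TTL expired: forget the answer so the caller asks DNS again.
            del self._entries[name]
            return None
        return value
```

A low TTL means changes propagate quickly but resolvers query more often; a high TTL does the opposite.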
What is an ALB?
Application Load Balancer
- “Target” is a single compute resource
- “Target groups” are groups of targets
- Rules are evaluated to determine which target group to send requests to
- Rules are “path” based or “host” based
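Rule evaluation can be sketched as below, assuming rules are checked in priority order (lowest number first) and a default target group catches everything else. The rules, host names, and target group names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (priority, host pattern or None, path pattern or None, target group).
RULES = [
    (10, "api.example.com", None, "tg-api"),   # host based
    (20, None, "/images/*", "tg-images"),      # path based
]
DEFAULT_TARGET_GROUP = "tg-web"


def pick_target_group(host: str, path: str) -> str:
    """Return the target group of the first rule whose conditions all match."""
    for _priority, host_pat, path_pat, target_group in sorted(RULES, key=lambda r: r[0]):
        host_ok = host_pat is None or fnmatchcase(host, host_pat)
        path_ok = path_pat is None or fnmatchcase(path, path_pat)
        if host_ok and path_ok:
            return target_group
    return DEFAULT_TARGET_GROUP
```

This is what lets one ALB front several applications: each host or path maps to its own target group.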
What are Launch configurations and Launch Templates?
- Templates came after Configurations
- Allow you to define the configuration of an EC2 instance in advance (AMI, instance type, networking, user data, attached IAM role etc)
- LTs have versions and are recommended over LCs
What is an ASG?
Auto Scaling Group
- automatic scaling for EC2
- uses the EC2 configuration within LTs or LCs
- 3 important values: Min size, Desired and Maximum (eg: 1:2:4)
- Provisions or terminates instances to keep the group at the Desired level
- Scaling policies based on Metrics
- Runs in a VPC across one or more Subnets
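The Min:Desired:Max behaviour can be sketched as a simple reconciliation step. The `reconcile` function is a hypothetical illustration of the idea, not the ASG's actual logic.

```python
def reconcile(running: int, minimum: int, desired: int, maximum: int) -> int:
    """Return instances to launch (positive) or terminate (negative)."""
    if not minimum <= desired <= maximum:
        raise ValueError("desired must lie between min and max")
    return desired - running


# With 1:2:4 and one instance running, the group launches one more.
delta = reconcile(running=1, minimum=1, desired=2, maximum=4)
```

Scaling policies only ever move Desired; Min and Max are the hard limits it moves between.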
Types of Auto scaling?
- Manual
- Scheduled scaling based on time
- Dynamic scaling
  - simple scaling: based on a single metric, eg CPU: if CPU > 50% add 1 to desired capacity, else remove 1 from desired capacity
  - step scaling: lets you define more detail, eg add 1 instance if CPU > 50%, add 3 instances if CPU > 80%; bigger breaches get bigger steps; preferable to simple scaling
  - target tracking: eg keep aggregate CPU across all instances in the group at 40%
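The difference between stepped and target-based dynamic scaling can be sketched like this. The thresholds come from the examples above; the proportional formula for the target-based variant is an illustrative assumption, not AWS's actual algorithm, and all function names are hypothetical.

```python
import math


def step_adjustment(cpu_percent: float) -> int:
    """Stepped scaling: a bigger breach triggers a bigger step."""
    if cpu_percent > 80:
        return 3
    if cpu_percent > 50:
        return 1
    return 0


def target_desired(current: int, cpu_percent: float, target_percent: float = 40.0) -> int:
    """Assumed proportional rule: size the group so average CPU returns to target."""
    return max(1, math.ceil(current * cpu_percent / target_percent))


def clamp(desired: int, minimum: int, maximum: int) -> int:
    """Desired capacity never leaves the Min..Max range."""
    return max(minimum, min(desired, maximum))
```

With the target-based variant you only state the goal (40% CPU) and the group works out how many instances that takes.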
What is a cool down period?
EC2 has a minimum billing period, so launching and terminating instances too frequently can be costly
The cool down period is how long the ASG waits after the last scaling action before another scaling action can be applied
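A cool down can be sketched as a gate that rejects scaling actions arriving too soon after the previous one. `CooldownGate` is a hypothetical helper, and the 300 second value in the usage lines is only an example.

```python
class CooldownGate:
    """Allows a scaling action only if the cool down has elapsed."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown_seconds = cooldown_seconds
        self._last_action_at = None

    def try_scale(self, now: float) -> bool:
        if (self._last_action_at is not None
                and now - self._last_action_at < self.cooldown_seconds):
            return False  # still cooling down, ignore this trigger
        self._last_action_at = now
        return True


gate = CooldownGate(cooldown_seconds=300)
first = gate.try_scale(now=0)     # allowed
second = gate.try_scale(now=120)  # rejected, only 120s since last action
third = gate.try_scale(now=300)   # allowed again
```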
What are NLBs?
Network Load balancer
Operates at layer 4: understands TCP, TLS and UDP but has no visibility into HTTP(S)
Lower latency: ~100ms vs ~400ms for ALBs
Rapid scaling - millions of requests per second
1 network interface with a static IP per AZ; can use Elastic IPs (EIPs)
What is SSL Offload?
ELBs can handle SSL in 3 ways:
- Bridging - SSL is terminated on the LB, so the LB needs an SSL cert matching the domain name; the LB then opens a new encrypted connection to the EC2 instances (the ALB decrypts and re-encrypts, so the EC2 instances also need a cert and must decrypt, which can be an overhead)
- Pass through - usually used with NLBs; the LB does not (and cannot) decrypt the data and passes it through to EC2; AWS never sees the cert you use on the EC2 instance; the admin and compute overhead stays on EC2
- Offload - the ELB has the cert, but no cert is needed on the EC2 instances since the LB-to-instance connection is not HTTPS; only the ELB decrypts, so there is no crypto overhead on the EC2 instances
What is Session Stickiness?
If enabled, the LB generates a cookie called “AWSALB”
Duration is defined by you (1 second to 7 days)
The LB sends requests to the same backend EC2 instance while the cookie is present and still valid
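Stickiness can be sketched with a toy balancer that pins a session to a target for a fixed duration. This is a simplification: the toy keeps a server-side table keyed by a made-up token, whereas a real ALB issues its own cookie value (named AWSALB for duration-based stickiness). The `StickyBalancer` class and instance IDs are hypothetical.

```python
import itertools


class StickyBalancer:
    COOKIE = "AWSALB"

    def __init__(self, targets, duration_seconds):
        self._round_robin = itertools.cycle(targets)
        self._duration = duration_seconds
        self._sessions = {}  # cookie value -> (target, expires_at)
        self._ids = itertools.count(1)

    def route(self, cookies: dict, now: float):
        """Return (target, cookies to set on the response)."""
        session = self._sessions.get(cookies.get(self.COOKIE))
        if session and now < session[1]:
            return session[0], {}  # valid cookie: same backend as before
        # No cookie or expired: pick the next target and start a session.
        target = next(self._round_robin)
        token = f"session-{next(self._ids)}"
        self._sessions[token] = (target, now + self._duration)
        return target, {self.COOKIE: token}
```

The trade-off is uneven load: sticky users stay on one instance even when others are idle, which is why externalising session state is often preferred.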
What is boot time to service time?
The time from launch until an instance can serve requests: AWS provisioning the EC2 instance, plus software updates and installation within the OS. For AWS-provided AMIs this is typically minutes.
What is SQS?
Simple Queue Service
HA, performant by design
Standard Qs and FIFO Qs
FIFO guarantees order
Standard Qs try to deliver in order but this is not guaranteed
256 KB message size max
VisibilityTimeout (VT) - Fault Tolerance - a received message is hidden from other consumers for the VT, and the client explicitly deletes it after processing. If the client dies while processing, the message comes back into the Q after the VT so another worker can see and process it
DeadLetter Q - problematic or corrupt messages can be moved here for later examination
ASGs can scale instances based on length of a Q
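The VisibilityTimeout behaviour can be sketched with an in-memory toy queue. This is a simplification: real SQS identifies a received message by a receipt handle, while the hypothetical `ToyQueue` below just uses the message id.

```python
class ToyQueue:
    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # id -> [body, visible_at]
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._messages[self._next_id] = [body, 0.0]
        return self._next_id

    def receive(self, now: float):
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                # Hide the message from other consumers for the timeout.
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        # Called by the worker only after processing succeeded.
        self._messages.pop(msg_id, None)
```

If a worker crashes before calling `delete`, the message simply becomes visible again once the timeout passes, so no work is lost.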
What is Fan out architecture WRT SQS and SNS?
You publish a message to an SNS topic
The message is distributed to multiple SQS queues, with a different workload at the end of each Q
Each workload can then work in parallel on the message received in its own Q
Useful when multiple un-connected things have to be done based on a single event (me)
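Fan out can be sketched in memory with a topic that copies each message to every subscribed queue. The `Topic` class, queue names, and message text are hypothetical stand-ins for SNS and SQS.

```python
from collections import deque


class Topic:
    """Toy stand-in for an SNS topic with queue subscriptions."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, queue: deque):
        self._subscribers.append(queue)

    def publish(self, message):
        # Every subscribed queue receives its own copy of the message.
        for queue in self._subscribers:
            queue.append(message)


thumbnails, audit = deque(), deque()
topic = Topic()
topic.subscribe(thumbnails)
topic.subscribe(audit)
topic.publish("upload:cat.png")
```

Each consumer drains its own queue at its own pace, so a slow workload does not hold up the others.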
Standard Qs vs FIFO Qs
STD - multi-lane highway; the same msg can be delivered more than once (at-least-once delivery); scales much further than FIFO queues but messages can arrive out of order
FIFO - single-lane highway; msgs are delivered in order and processed exactly once; 3,000 msgs/s with batching or 300/s without batching
Billed per request - one receive request can return 0 to 10 msgs, and empty responses are still billed, so polling very frequently is not cost-efficient
Messages can be encrypted with KMS while they sit in the Q
Short Polling vs Long Polling
SP = returns immediately, could return 0 messages
LP = you specify a wait time, up to 20s; the request waits for messages to arrive before returning; this is how you should poll SQS
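A quick bit of arithmetic shows why long polling is cheaper on a quiet queue. It assumes a short poller that retries once per second (an arbitrary choice) and a long poller that uses the full 20s wait; `empty_receives` is a hypothetical helper.

```python
import math


def empty_receives(idle_seconds: float, seconds_per_request: float) -> int:
    """Number of billed receive requests made while the queue stays empty."""
    return math.ceil(idle_seconds / seconds_per_request)


short_polling = empty_receives(3600, 1)   # one request per second for an hour
long_polling = empty_receives(3600, 20)   # each request waits up to 20s
```

In boto3 the wait is set with the `WaitTimeSeconds` parameter of `receive_message`.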
What is Kinesis?
Scalable streaming service
Designed to ingest lots of data from lots of apps
Public and HA by design
Persistence - rolling 24 hour window, data stays for 24 hours by default, older data is replaced by new data entering
Lots of producers pushing to a stream
Shard architecture - each shard provides 1 MB/s ingestion and 2 MB/s consumption capacity; Kinesis Data Records are stored across shards
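Shard count can be estimated from the per-shard limits, assuming 1 MB/s in and 2 MB/s out per shard and ignoring any other per-shard limits. `shards_needed` is a hypothetical helper.

```python
import math

INGEST_MB_PER_SHARD = 1.0   # MB/s written per shard
CONSUME_MB_PER_SHARD = 2.0  # MB/s read per shard


def shards_needed(ingest_mb_per_s: float, consume_mb_per_s: float) -> int:
    """The stream needs enough shards for whichever side is the bottleneck."""
    for_ingest = math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD)
    for_consume = math.ceil(consume_mb_per_s / CONSUME_MB_PER_SHARD)
    return max(1, for_ingest, for_consume)
```

For example, 5 MB/s in and 6 MB/s out is write-bound and needs 5 shards, while 1.5 MB/s in and 8 MB/s out is read-bound and needs 4.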
Kinesis Data Firehose
KDF can move data from a Kinesis stream en masse into another destination such as S3 to store it for longer
Difference between SQS and Kinesis - how to pick between the two
Is it about ingestion of data or about async communication, de-coupling between entities?
Ingestion of data = Kinesis
Decoupling = SQS