Resilient Architecture Flashcards

Design HA and Fault Tolerant Systems

1
Q

How do you autoscale in AWS?

A
  1. Set up an Auto Scaling group.
  2. Set up a load balancer.
  3. Configure Auto Scaling to respond to CloudWatch alarms.
2
Q

What is the difference between HA and Fault Tolerance and DR?

A

HA aims to maximize uptime

  • there can be a brief disruption to service, but it is restored quickly
  • analogy: an off-road vehicle gets a flat but carries a spare tire, so it is moving again after a short stop

FT is the ability to keep working through malfunctioning components in the system

  • there typically cannot be any loss of functionality during a component outage
  • analogy: a plane in the air with an engine failure keeps flying on its redundant second engine
  • analogy: a patient on an operating table relies on critical monitoring equipment that cannot stop functioning
  • FT costs more to implement and is more complex in design than HA

DR covers failures on a larger scale than HA or FT can handle

  • human-induced or natural
  • the entire system is compromised or lost
  • typically solved by having a second physical location, far away from the disaster site, ready to take over
  • backups should be stored off site for on-prem solutions
  • determine what your RTO and RPO need to be for the use case
3
Q

What is Route53?

A

A DNS service from AWS

  • Registers domains; a global service backed by a single database
  • Hosts zone files
  • Managed nameservers (NS): 4 per domain
  • Liaises with the TLD registry and provides the NS records that say where a particular domain is hosted (eg: )
  • Zone files store record sets
4
Q

How does DNS work? High level part 1

A
  • The root hints file on the DNS resolver (often ISP-provided) points to the 13 DNS root servers, which serve the root zone
  • The root zone is authoritative
  • The root zone is a database of the top-level domains (.com etc)
5
Q

What are the different types of DNS records?

A
  • “A” record: points a name to the IPv4 address of the server
  • “CNAME” (canonical name) record: an alternate name that points to another name, typically one with an “A” record, so several names resolve to the same IP (eg, ftp.google.com, mail.google.com)
  • A CNAME can only point to another name, never directly to an IP address (exam question!)
  • MX records: point to the mail server for a domain
  • TXT records: arbitrary text, often used to prove domain ownership
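
The CNAME-to-A relationship above can be pictured with a toy lookup table. This is only a sketch: the zone contents, the example.com names and the 192.0.2.10 address are made up for illustration, and real resolvers do far more than this.

```python
# Toy zone data. All names and the IP are hypothetical examples.
ZONE = {
    ("www.example.com", "A"): "192.0.2.10",
    ("ftp.example.com", "CNAME"): "www.example.com",
    ("mail.example.com", "CNAME"): "www.example.com",
}


def resolve(name, max_hops=5):
    """Follow CNAME records until an A record is found, or give up."""
    for _ in range(max_hops):
        if (name, "A") in ZONE:
            return ZONE[(name, "A")]
        if (name, "CNAME") in ZONE:
            # A CNAME value is always another name, never an IP address.
            name = ZONE[(name, "CNAME")]
            continue
        return None
    return None


print(resolve("ftp.example.com"))
```

Both alternate names end up at the same address because they point at the name that owns the A record.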
6
Q

What does TTL on a DNS record indicate?

A

TTL (time to live) indicates how long, in seconds, the resolver can cache the answer (for example an IPv4 address) returned by the domain resolution
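
A minimal sketch of TTL-based caching, assuming a hypothetical `TtlCache` class and a caller-supplied clock value (`now`, in seconds) so the behaviour is easy to follow. It is not how any particular resolver is implemented.

```python
class TtlCache:
    """Keeps an answer only until its TTL has elapsed."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl_seconds, now):
        self._entries[name] = (value, now + ttl_seconds)

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # TTL expired: drop the entry so the caller resolves again.
            del self._entries[name]
            return None
        return value


cache = TtlCache()
cache.put("www.example.com", "192.0.2.10", ttl_seconds=300, now=0)
```

A short TTL means changes to a record are picked up quickly; a long TTL means fewer lookups but slower propagation of changes.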

7
Q

What is an ALB?

A

Application Load Balancer

  • “Target” is a single compute resource
  • “Target groups” are groups of targets
  • Rules are evaluated to determine which target group to send requests to
  • Rules are “path” based or “host” based
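
Rule evaluation can be sketched as a first-match walk over an ordered list. The rule layout, host names, path prefix and target group names below are all hypothetical; the real ALB supports more condition types than this.

```python
# Each rule: (host to match or None, path prefix to match or None, target group).
# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    ("api.example.com", None, "api-targets"),
    (None, "/images/", "image-targets"),
]
DEFAULT_TARGET_GROUP = "web-targets"


def pick_target_group(host, path):
    for rule_host, rule_prefix, group in RULES:
        if rule_host is not None and host != rule_host:
            continue  # host-based condition did not match
        if rule_prefix is not None and not path.startswith(rule_prefix):
            continue  # path-based condition did not match
        return group
    return DEFAULT_TARGET_GROUP


print(pick_target_group("www.example.com", "/images/logo.png"))
```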
8
Q

What are Launch configurations and Launch Templates?

A
  • Launch Templates (LTs) came after Launch Configurations (LCs)
  • Both let you define the configuration of an EC2 instance in advance (AMI, instance type, networking, user data, attached IAM role, etc)
  • LTs support versions and are recommended over LCs
9
Q

What is an ASG?

A

Auto Scaling Group

  • automatic scaling for EC2
  • uses the EC2 configuration defined in LTs or LCs
  • 3 important values: Minimum, Desired and Maximum size (eg: 1:2:4)
  • Provisions or terminates instances to stay at the Desired level
  • Scaling policies adjust Desired based on metrics
  • Runs in a VPC across one or more subnets
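
The Min/Desired/Max relationship can be sketched as a small reconciliation step. The `reconcile` function is a hypothetical helper for illustration, not an AWS API.

```python
def reconcile(min_size, desired, max_size, running):
    """Return how many instances to launch (positive) or terminate (negative)."""
    # Desired capacity is always kept within [min_size, max_size].
    target = max(min_size, min(desired, max_size))
    return target - running


# With the 1:2:4 example: one instance running, so one more is launched.
print(reconcile(min_size=1, desired=2, max_size=4, running=1))
```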
10
Q

Types of Auto scaling?

A
  • Manual
  • Scheduled scaling based on time
  • Dynamic scaling
    • simple scaling: based on a single metric, eg: if CPU > 50% add 1 to desired capacity, otherwise remove 1 from desired capacity
    • step scaling: lets you define bigger or smaller steps, eg: add 1 instance if CPU > 50%, add 3 instances if CPU > 80%, so it can react more strongly to larger breaches; preferable to simple scaling
    • target tracking: eg: keep aggregate CPU across all instances in the group at 40%
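
The difference between the dynamic policy types can be sketched in a few lines. The thresholds are the ones from the card; the function names are hypothetical, and the proportional formula used for target tracking is only an assumed approximation for intuition, not the algorithm AWS actually runs.

```python
import math

# (CPU threshold %, instances to add), checked from the highest threshold down.
STEPS = [(80, 3), (50, 1)]


def step_adjustment(cpu_percent):
    """Step scaling: a bigger breach triggers a bigger step."""
    for threshold, instances_to_add in STEPS:
        if cpu_percent > threshold:
            return instances_to_add
    return 0


def target_tracking_capacity(current_capacity, cpu_percent, target_percent=40):
    """Assumed approximation: size the group so average CPU lands near the target."""
    return max(1, math.ceil(current_capacity * cpu_percent / target_percent))


print(step_adjustment(85), target_tracking_capacity(4, 60))
```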
11
Q

What is cool down period?

A

EC2 has a minimum billing period, so adding and removing instances too frequently can be costly
The cooldown period is how long the group waits after the last scaling action before another scaling action is applied
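
A cooldown check can be sketched as a simple time comparison. The `can_scale` helper and the 300-second value are assumptions for illustration; timestamps are plain seconds supplied by the caller.

```python
def can_scale(now, last_scaling_at, cooldown_seconds=300):
    """Allow a new scaling action only once the cooldown has elapsed."""
    if last_scaling_at is None:
        return True  # nothing has scaled yet, so there is nothing to wait for
    return now - last_scaling_at >= cooldown_seconds


print(can_scale(now=120, last_scaling_at=0))
```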

12
Q

What are NLBs?

A

Network Load Balancer
Works at layer 4: understands TCP and UDP only, not HTTP(S)
Lower latency: ~100ms vs ~400ms for ALBs
Rapid scaling: millions of requests per second
1 interface with a static IP per AZ; can use Elastic IPs (EIPs)

13
Q

What is SSL Offload?

A

ELBs have 3 modes for handling SSL:

  1. Bridging - SSL is terminated on the LB, so the LB needs an SSL cert matching the domain name. The LB then opens a new encrypted connection to the EC2 instances (the ALB decrypts, then re-encrypts when talking to the instances, so the instances must also decrypt, which can be an overhead)
  2. Pass through - NLBs usually use this. The LB does not and cannot decrypt the data; it passes it through to EC2. AWS does not know which cert you use on the EC2 instance, and the admin and compute overhead stays on EC2
  3. SSL Offload - the ELB has the cert, but no cert is needed on the EC2 instances because the connection from the ELB to them is plain HTTP. Only the ELB decrypts, so there is no crypto overhead on the EC2 instances
14
Q

What is Session Stickiness?

A

If enabled, the LB generates a cookie called “AWSALB”
Duration is defined by you (1 second to 7 days)
The LB sends requests to the same backend EC2 instance while the cookie is present and valid
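
The cookie-to-instance mapping can be sketched with a toy balancer. The `StickyBalancer` class, the cookie values and the instance names are hypothetical, and cookie expiry is left out to keep the sketch short.

```python
import itertools


class StickyBalancer:
    """Round-robin for new clients; a known cookie always maps to the same target."""

    def __init__(self, targets):
        self._targets = itertools.cycle(targets)
        self._cookie_to_target = {}
        self._cookie_ids = itertools.count(1)

    def route(self, cookie=None):
        if cookie in self._cookie_to_target:
            return cookie, self._cookie_to_target[cookie]
        # No usable cookie: pick the next target and issue a new cookie.
        cookie = f"session-{next(self._cookie_ids)}"
        target = next(self._targets)
        self._cookie_to_target[cookie] = target
        return cookie, target


lb = StickyBalancer(["instance-a", "instance-b"])
first_cookie, first_target = lb.route()
second_cookie, second_target = lb.route()
```

The trade-off is that sticky clients can load the backends unevenly, since the balancer no longer chooses freely for every request.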

15
Q

What is Boot time to service time?

A

Time required for AWS to provision the EC2 instance, plus software updates and installation within the OS. For AWS-provided AMIs that is measured in minutes.

16
Q

What is SQS?

A
Simple Queue Service
HA and performant by design
Standard Qs and FIFO Qs
FIFO guarantees order
Standard Qs try to deliver in order, but order is not guaranteed
256 KB max message size

VisibilityTimeout - Fault Tolerance - A received message is hidden from other consumers for the VisibilityTimeout (VT). The client explicitly deletes the message after processing it. If the client dies while processing, the message reappears in the Q after the VT so another worker can see and process it

Dead-letter Q - problematic or corrupt messages can be moved here for later examination

ASGs can scale instances based on the length of a Q
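
The visibility timeout behaviour can be sketched with an in-memory toy queue. `ToyQueue` is a hypothetical class for illustration only; it does not use the SQS API, and time is passed in as plain seconds.

```python
class ToyQueue:
    """In-memory model of receive / hide / delete with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, invisible_until]
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._messages[self._next_id] = [body, 0]
        return self._next_id

    def receive(self, now):
        for message_id, entry in self._messages.items():
            if now >= entry[1]:
                # Hide the message from other consumers while it is processed.
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def delete(self, message_id):
        # Called by the consumer once processing has succeeded.
        self._messages.pop(message_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("job")
```

If a worker receives the message and crashes without deleting it, a second `receive` after 30 seconds returns the same message to another worker.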

17
Q

What is fan-out architecture with respect to SQS and SNS?

A

You publish a message to an SNS topic

The message is distributed to multiple SQS queues, each with a different workload at the end of it

Each workload can then work in parallel on the message received in its own Q

Useful when multiple un-connected things have to be done based on a single event (me)
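
Fan-out can be sketched with a toy topic that copies each message into every subscribed queue. `ToyTopic`, the queue names and the message text are hypothetical; real SNS and SQS are not involved.

```python
from collections import deque


class ToyTopic:
    """Delivers a copy of every published message to each subscribed queue."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        for queue in self.subscribers:
            queue.append(message)


topic = ToyTopic()
thumbnail_queue, audit_queue = deque(), deque()
topic.subscribe(thumbnail_queue)
topic.subscribe(audit_queue)
topic.publish("image-uploaded")
```

Each workload then drains its own queue at its own pace, so a slow consumer does not hold up the others.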

18
Q

Standard Qs vs FIFO Q

A

STD - multi-lane highway; the same msg can be delivered more than once (at-least-once delivery); scales much further than FIFO queues, but messages can arrive out of order

FIFO - single-lane highway; msgs are delivered exactly once and in order; 3K msgs/s with batching or 300/s without batching

Billed per request - one request can receive between 1 and 10 msgs, and a request can also return 0 messages, so polling very frequently is not cost-efficient

Messages can be encrypted with KMS while they sit in the Q
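
Because billing is per request and a request can carry 1 to 10 messages, the effect of batching is simple arithmetic. The `requests_needed` helper below is a hypothetical illustration that assumes every request is filled to the batch size.

```python
import math


def requests_needed(message_count, batch_size=10):
    """Requests required to receive message_count messages at a given batch size."""
    if not 1 <= batch_size <= 10:
        raise ValueError("a request can carry between 1 and 10 messages")
    return math.ceil(message_count / batch_size)


print(requests_needed(25), requests_needed(25, batch_size=1))
```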

19
Q

Short Polling vs Long Polling

A

SP = returns immediately, could return 0 messages

LP = you specify a wait time of up to 20s and the request waits for messages to arrive; this is how you should poll SQS
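
The cost difference shows up most clearly on an idle queue. The sketch below counts receive calls over an idle period; the helper name and the once-per-second short-polling loop are assumptions made for illustration.

```python
import math


def receive_calls_while_idle(idle_seconds, wait_time_seconds, short_poll_interval=1):
    """Count receive calls made against an empty queue during idle_seconds."""
    if wait_time_seconds > 0:
        # Long polling: each call waits up to wait_time_seconds before returning.
        return math.ceil(idle_seconds / wait_time_seconds)
    # Short polling: each call returns immediately, so the loop interval decides.
    return math.ceil(idle_seconds / short_poll_interval)


print(receive_calls_while_idle(60, wait_time_seconds=20))
print(receive_calls_while_idle(60, wait_time_seconds=0))
```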

20
Q

What is Kinesis?

A

Scalable streaming service

Designed to ingest lots of data from lots of apps

Public and HA by design

Persistence - rolling 24 hour window, data stays for 24 hours by default, older data is replaced by new data entering

Lots of producers pushing to a stream

Shard architecture - each shard provides 1 MB/s of ingestion and 2 MB/s of consumption capacity; Kinesis Data Records are stored across shards
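
The per-shard limits turn stream sizing into simple arithmetic. The `shards_needed` helper is a hypothetical sketch that considers only the two throughput figures from the card and ignores any other limits.

```python
import math

INGEST_MB_PER_SHARD = 1.0   # MB/s written per shard
CONSUME_MB_PER_SHARD = 2.0  # MB/s read per shard


def shards_needed(ingest_mb_per_s, consume_mb_per_s):
    """Smallest shard count that satisfies both the write and the read rate."""
    return max(
        1,
        math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD),
        math.ceil(consume_mb_per_s / CONSUME_MB_PER_SHARD),
    )


print(shards_needed(ingest_mb_per_s=5, consume_mb_per_s=4))
```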

21
Q

Kinesis Data Firehose

A

KDF can move data from a Kinesis stream en masse to another destination such as S3, to store it for longer

22
Q

Difference between SQS and Kinesis - how to pick between the two

A

Is it about ingestion of data or about async communication, de-coupling between entities?

Ingestion of data = Kinesis
Decoupling = SQS