Compute & Load Balancing Flashcards
EC2 Main Family Types
R - need lots of RAM - in-memory caches
C - need lots of CPU - compute/database
M - balanced (think medium) - general purpose / web apps
I - need good local I/O (instance storage) - databases
G - need GPU - video rendering/machine learning
T2/T3 (standard) - burstable up to capacity using CPU credits, drops to baseline when credits run out
T2/T3 (unlimited) - can sustain bursts beyond the credit balance for an extra charge
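The standard burst model above can be sketched as a credit bucket: one CPU credit is 100% of a vCPU for one minute, earned at the baseline rate and spent when bursting above it. The rates and bucket size below are illustrative, not the real per-instance-type numbers.

```python
# Illustrative sketch of the T2/T3 CPU credit bucket (numbers are
# hypothetical, not real per-instance-type earn rates).
def simulate_credits(hours, baseline_pct=10, usage_pct=None, max_credits=144):
    """Track the credit balance hour by hour: earn at the baseline
    rate, spend at the actual CPU usage rate, clamp to [0, max]."""
    earn_per_hour = baseline_pct * 60 / 100   # 10% baseline -> 6 credits/hour
    credits = max_credits                      # assume a full bucket at launch
    history = []
    for h in range(hours):
        use = usage_pct[h] if usage_pct else baseline_pct
        spend = use * 60 / 100
        credits = min(max_credits, credits + earn_per_hour - spend)
        credits = max(credits, 0.0)
        history.append(credits)
    return history

# Bursting at 100% CPU drains the bucket; idling refills it (capped).
drain = simulate_credits(3, usage_pct=[100, 100, 100])
refill = simulate_credits(3, usage_pct=[0, 0, 0])
```

Once `drain` hits zero, a standard T2/T3 is pinned to its baseline; an unlimited T2/T3 keeps bursting and is billed for the extra credits instead.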
EC2 Placement Groups
- Cluster - all instances within same AZ - low latency. Good for HPC.
- Spread - max 7 instances per group per AZ - critical applications
- Partition - same AZ but across different partitions, can scale to 100s of instances per group. Good for Cassandra, Kafka, Hadoop
Can you modify EC2 Placement Groups?
Yes. Stop the instance, use the CLI command modify-instance-placement, then start it again.
EC2 Launch Types
- On-demand - short workload, predictable pricing, reliable
- Spot - short workload, cheap, can afford to lose instances, can be up to 90% cheaper
- Reserved - long workloads
- Convertible Reserved - long workloads, with the flexibility to convert the instance types
- Scheduled Reserved - reserved for a specific recurring schedule
- Dedicated Instances - no other customer will share the hardware
- Dedicated Host - entire physical server, control instance placement, for licenses that bill per core or CPU socket, CAN define host affinity across reboots
EC2 Instance Recovery
Can use CW alarm to monitor Instance or System Status and recover using EC2 Instance Recovery Action retaining same:
- private, public, elastic ip, placement group, metadata
can then send an alert to SNS
Does ASG support Spot Fleet?
Yes, a mix of on-demand and spot instances (set the max price you are willing to pay), and a mix of instance types.
Target Capacity can be huge: 10,000 per spot/EC2 fleet, 100,000 per region per fleet. Supports EC2 standalone, ASGs, AWS Batch (Managed Compute Env), ECS
How to upgrade ASG AMI?
- Modify launch configuration/template
- Manually terminate all instances (can use CloudFormation)
- ASG will start launching new instances using the new launch configuration/template
ASG Lifecycle Hooks
- Action before an instance is in service or is terminated
eg cleanup, log extraction, special health check etc
ASG Health Checks
- EC2 Status
- ELB Health checks (HTTP-based)
EC2 Spot Block
Block spot instances for 1 to 6 hours without interruptions. In rare situations instances can still be reclaimed. Good for batch jobs, data analysis, or workloads that are resilient to failures.
No critical jobs or databases
ECS ALB integration
Supports dynamic port mapping
Fargate
is like ECS but serverless - you only need task definitions. No more EC2 instances to manage :)
ECS Security
Two levels of roles:
- EC2 instance roles to have ECS permissions, so that ECS agent can work correctly
- ECS Task level IAM task roles. Trust relationship example:
  "Principal": {
    "Service": "ecs-tasks.amazonaws.com"
  },
  "Action": "sts:AssumeRole"
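The fragment above is one statement inside a full IAM trust policy document; the "Version" and "Statement" envelope below is the standard IAM policy wrapper, sketched here for completeness.

```python
import json

# Full trust policy wrapping the statement from the card above.
# This is what you'd attach as the assume-role policy of the task role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialize to the JSON document IAM expects
policy_json = json.dumps(trust_policy, indent=2)
```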
ECS Secrets
Can inject from SSM Parameter Store & Secrets Manager
To inject sensitive data into your containers as environment variables, use the secrets container definition parameter.
To reference sensitive information in the log configuration of a container, use the secretOptions container definition parameter.
ECS Networking
none - no network connectivity, no port mappings
bridge - Docker virtual container-based networking
host - bypass Docker networking, use underlying host networking
awsvpc - every task on the instance gets own ENI and private IP
– Default for Fargate
– Monitoring, VPC Flow Logs, SGs, enhanced security
ECS - Autoscaling
Scaling happens at the task level. On ECS classic you also need to scale the underlying EC2 instances; on Fargate this is handled automatically.
CAN use RAM as a metric to enable the following scaling strategies:
- Target Tracking
- Scheduled Scaling
- Step Scaling
ECS Spot instances
Supported in both ECS classic (cheaper but less reliable; trigger drain mode when a spot instance is shut down) and Fargate (can specify a baseline number of tasks)
AWS Lambda Integrations
- Thumbnail creation triggered by S3 upload: store the thumbnail back to S3 and put metadata in DynamoDB for caching
- Serverless cron job, through scheduled CW event.
Lambda Limits
RAM - 128MB -> 3GB
More RAM means more CPU; a second vCPU is added at 1,792MB (~1.8GB)
Timeout is 15 minutes
Has 512 MB temp storage
Deployment package 250MB max including layers
Concurrent executions - 1,000 soft limit, can be increased
Lambda Latency Considerations
Cold invocation - 100 ms
Warm invocations - a matter of ms
New feature “provisioned concurrency” to keep invocations warm
Hops to API Gateway or CloudFront will add ~100ms
Use X-ray to debug end-to-end latency
Lambda Security
- Execution Role (IAM role) to grant the lambda access to other services
- Resource-based policies to allow:
- other AWS services to invoke the lambda
- other accounts to invoke or manage the lambda
Lambda in VPC
- Is a deployment option
- By default, the lambda runs in the AWS-owned network and can access the public internet and public AWS services (e.g. DynamoDB)
- In VPC, gets ENI and can have SGs assigned to it
- To talk to an external API:
- Option 1 - NAT Gateway in a public subnet plus an IGW
- Option 2 (better for DynamoDB) - DynamoDB VPC Endpoint Gateway, the private access route to DynamoDB from the private subnet; needs route table configuration
Lambda Logging, Tracing and Monitoring
- Make sure execution role has permissions to write to CW Logs
- X-ray can be enabled via lambda configuration, also need IAM role permissions to access X-ray
Lambda Sync
Invocations from CLI, SDK and APIG are synchronous
Lambda Async
S3, SNS, CW Events. Retried on errors (3 attempts total), so ensure the processing within the lambda is idempotent. Can define a DLQ with SNS or SQS as targets.
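Because async invocations are retried, the handler must tolerate seeing the same event more than once. A minimal idempotency sketch, using an in-memory set as a stand-in for a durable dedupe store (in practice that would be DynamoDB or similar):

```python
processed = set()   # stand-in for a durable store such as DynamoDB
results = []        # stand-in for the real side effect (e.g. a write)

def handler(event):
    """Idempotent handler: a retried event with the same id is a no-op."""
    event_id = event["id"]
    if event_id in processed:
        return "skipped"     # retry of an already-handled event
    processed.add(event_id)
    results.append(event_id)  # the actual side effect happens once
    return "done"

# Lambda retrying the same async event does not duplicate the side effect
first = handler({"id": "evt-1"})
retry = handler({"id": "evt-1"})
```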
Lambda Event Source Mapping
Records are polled from the source; order is preserved except for SQS standard queues.
If the function fails, the entire batch is re-processed until success, meaning:
- Kinesis and DynamoDB Streams will stop processing the shard, or you can send failed events to SNS or SQS
- SQS FIFO processing stops unless a DLQ is defined
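The whole-batch retry semantics above can be sketched as follows; the function names and the fixed attempt cap are hypothetical, the point is that the entire batch is retried, so per-record processing must be idempotent.

```python
def process_batch(batch, process_record, max_attempts=3, dlq=None):
    """Re-process the ENTIRE batch until every record succeeds,
    mimicking event source mapping semantics. After max_attempts,
    divert the batch to the DLQ if one is configured; otherwise
    processing is blocked (e.g. Kinesis shard / FIFO queue stops)."""
    for attempt in range(max_attempts):
        try:
            for record in batch:
                process_record(record)
            return "ok"
        except Exception:
            continue  # whole batch retried -> process_record must be idempotent
    if dlq is not None:
        dlq.extend(batch)
        return "sent-to-dlq"
    return "blocked"

seen = []
def flaky(record):
    """A processor with a poison record that always fails."""
    seen.append(record)
    if record == "bad":
        raise ValueError("poison record")

dlq = []
status = process_batch(["good", "bad"], flaky, dlq=dlq)
```

Note that "good" is processed once per attempt: without idempotency, a single poison record causes repeated side effects for every healthy record in the batch.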
Lambda Destinations
Send results to a destination:
- Asynchronous invocations -> can send to different destinations based on success or failure
- Destinations: SNS, SQS, Lambda, EventBridge bus
DLQ vs Destinations
Destinations you can also send to Lambda or EventBridge bus
SQS DLQ
SQS itself can be configured to send to DLQ
Lambda versions
- $LATEST mutable,
- version immutable, have dedicated ARN = code + configuration
- all versions are accessible
- versions support aliases (dev, test)
- aliases are mutable
Lambda Aliases
- Versions support aliases (dev, test)
- Aliases are mutable, and have dedicated ARNs
- Enable blue/green deployments by assigning weights to lambda functions
Lambda & CodeDeploy
CodeDeploy can help automate traffic shifting for aliases
- Linear: grow traffic every N minutes
- Canary: 10% - 5 minutes, no errors? roll out 100%
- Can create Pre and Post traffic hook for testing canary, and deciding whether to rollback
- All At Once
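The Linear and Canary strategies above differ only in the shape of the cumulative traffic schedule; a small sketch (the schedule format and function names are illustrative, not CodeDeploy API):

```python
def linear_schedule(step_pct, interval_min):
    """Cumulative (minute, traffic %) pairs for a Linear deployment,
    e.g. the equivalent of Linear10PercentEvery1Minute."""
    pct, out = 0, []
    while pct < 100:
        pct = min(100, pct + step_pct)
        out.append((len(out) * interval_min, pct))
    return out

def canary_schedule(canary_pct, bake_min):
    """Canary: a small slice first, then all traffic if no errors."""
    return [(0, canary_pct), (bake_min, 100)]

linear = linear_schedule(10, 1)   # 10% more every minute
canary = canary_schedule(10, 5)   # 10% for 5 minutes, then 100%
```

The pre/post traffic hooks run around these shifts and can trigger a rollback before the 100% step.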
Load Balancers
- CLB - HTTP(S),TCP
- ALB - HTTP(S), WebSockets
- NLB - TCP, TLS, UDP, WebSockets
Can be placed in either private or public subnets, which determines whether it gets a private or public IP address
What is the purpose of certificate SAN?
Subject Alternative Name (SAN) Certificates can secure multiple fully qualified domain names with a single certificate.
ALB
Layer 7, supports HTTP2, and redirects (ex. from HTTP to HTTPS)
- Multiple applications across machines (target groups)
- Multiple applications on the same machine (containers)
ALB Routing to target groups
- url based
- hostname
- query strings
- headers
- can route to multiple target groups
ALB and lambda
It is possible to have lambda as part of the target group, with the embedded health check
ALB target groups
- EC2 instances can be managed by ASG
- ECS tasks
- Lambda functions, HTTP is transformed into JSON
- IP addresses (must be private)
- Health checks are at the Target Group level
NLB
Layer 4 - TCP, TLS, UDP (WebSockets and HTTP(S) traffic pass through at the connection level)
- Can handle millions of requests per second
- NLB has 1 static IP per AZ and supports assigning Elastic IPs (helpful for whitelisting specific IP)
- is commonly used with AWS PrivateLink to expose a service internally
NLB and lambda
not supported
Why use Proxy Protocol with NLB?
Proxy Protocol sends additional connection info (sender and destination), letting you retrieve the originating client IP address
Cross-Zone Load Balancing
When enabled, each load balancer node distributes traffic evenly across all registered instances in all AZs
CLB - not default, no charge if enabled
ALB - default, can’t switch it off, no charge
NLB - not default, pay for cross-zone load balancing if enabled
Load Balancers and Stickiness
Available in CLB and ALB, through session cookies, so the same client goes to the same backend instance. The cookie has an expiration date which can be edited. Use case: so that the client doesn't lose its session data. Alternative: cache session data in ElastiCache or DynamoDB
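The cookie-based stickiness above boils down to: honour the session cookie if present, otherwise pick a backend and set the cookie. A sketch (the backend ids are hypothetical; `AWSALB` is used as the cookie name, matching the ALB's load-balancer-generated cookie):

```python
import random

BACKENDS = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance ids

def route(cookies, rng=random):
    """Sticky routing sketch: reuse the backend named in the session
    cookie; if there is no cookie (or the backend is gone), pick one
    and set the cookie, as the load balancer would via Set-Cookie."""
    backend = cookies.get("AWSALB")
    if backend not in BACKENDS:
        backend = rng.choice(BACKENDS)
        cookies["AWSALB"] = backend
    return backend

session = {}               # one client's cookie jar
first = route(session)
second = route(session)    # same cookie jar -> same backend
```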
API Gateway
- Authorization
- OpenAPI support
- API keys, throttling
- resp/req transformations
- API versioning
- Caching
- CORS
- Endpoint can be AWS API (ex. trigger step function)
API Gateway Limits
- 29 seconds timeout
- 10MB max payload size
API Gateway in front of S3
Use lambda to generate a presigned URL and pass it to the client so it uploads the file directly to S3, since the API Gateway payload limit is only 10MB
API Gateway Endpoint types
- Edge-Optimized. Default (for global clients, better latency). Uses CloudFront edge locations. API Gateway itself still lives in one region
- Regional. (all clients in one region). Could manually combine with CloudFront for caching and distribution
- Private. Only accessible from a given VPC. using interface VPC endpoint (ENI). Use resource policy to define access.
API Gateway Caching
- Settings are per method.
- Default TTL 300s. Min 0s. Max 1 hour.
- Clients can invalidate cache using the header Cache-Control: max-age=0. Needs proper IAM authorization.
- Ability to flush cache immediately
- Cache encryption option
- Capacity from 0.5GB - 237GB
API Gateway Security
- Load SSL certificates and have Route53 define CNAME
- CORS
- Resource policy, defining who can access (users, ip, cidr, vpc, vpce)
- Execution role policies to invoke AWS API (eg lambda)
API Gateway Authentication
- IAM user credentials in headers through SigV4
- Lambda Authorizer (OAuth, SAML, 3rd party)
- Cognito User Pools (Client authenticates with Cognito gets token, passes a token, API Gateway knows how to verify Cognito token out of the box using Cognito User Pool)
Route53 Records - Managed DNS
A - hostname to IPv4
AAAA - hostname to IPv6
CNAME - hostname to hostname
ALIAS - hostname to AWS resource (ELBs, CloudFront, S3 Bucket, Elastic Beanstalk) - can be used for the root apex record
Route53 - Routing Policies
- Simple - hostname to a single resource, no health checks, no failover; if multiple values are returned the client chooses a random one
- Weighted - health checks; weights do not need to sum to 100
- Failover (Active-Passive) - a health check on the active record is mandatory
- Latency - routes based on latency from the user to the designated AWS region; supports failover if you enable health checks
- Geolocation - should create a default record for when no location matches
Route53 - Multivalue Routing
The client chooses which record to use; if it fails there is no need to re-issue the DNS lookup. Up to 8 records are returned. Can associate health checks with the records
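The multivalue behaviour above, sketched end to end: Route53 returns up to 8 healthy records, and the client can fall back within that same answer set without another lookup. Health state is faked locally here and the IPs are illustrative.

```python
import random

def resolve_multivalue(records):
    """Sketch of multivalue answer: return only healthy records,
    capped at 8, as Route53 would."""
    healthy = [addr for addr, ok in records if ok]
    return healthy[:8]

def connect(answers, is_up, rng=random):
    """Client picks a random answer; on connection failure it tries
    the next one from the SAME answer set - no new DNS lookup."""
    pool = list(answers)
    rng.shuffle(pool)
    for ip in pool:
        if is_up(ip):
            return ip
    return None  # all answers down

records = [("10.0.0.1", True), ("10.0.0.2", True), ("10.0.0.3", False)]
answers = resolve_multivalue(records)
# Simulate 10.0.0.2 going down after resolution: client still connects
chosen = connect(answers, is_up=lambda ip: ip != "10.0.0.2")
```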
Route53 - Private DNS
- Must enable the VPC settings enableDnsHostnames & enableDnsSupport
Route53 - Health checks
- Calculated Health checks (health checks monitoring other health checks, how many must pass to pass the parent healthchecks)
- Health checks monitoring CW alarms (eg throttles of DynamoDB, alarms on RDS, custom metrics, etc)
- Health checks themselves can trigger CW alarms
- Based on response code or text (first 5120 bytes)
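The calculated health check logic above is simply a threshold over child checks; a minimal sketch:

```python
def calculated_health(child_statuses, threshold):
    """A calculated health check passes when at least `threshold`
    of its child health checks pass."""
    passing = sum(1 for ok in child_statuses if ok)
    return passing >= threshold

# 2 of 3 children healthy, parent requires 2 -> parent passes
parent_ok = calculated_health([True, True, False], threshold=2)
# only 1 of 3 healthy -> parent fails
parent_bad = calculated_health([True, False, False], threshold=2)
```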
Route53 - Health Checks and Private Hosted Zones
Since Route53 health checkers live outside your private VPCs, you can:
- Assign public ip address to resource
- Check health of external resource the internal resource relies on, eg database
- Configure internal resource to use CW metric and setup alarm which is used by health check
Route53 Health Check automatic multi-region failover RDS
- Have EC2 instance monitor DB health and expose rest endpoint OR CW Alarm
- Use 1. as Health Check resources
- Raise CW Alarm in 2. fails
- Raise CW Event or SNS topic when 3. happens
- Trigger lambda to update DNS record in Route53 to point at read replicate
- Promote read replicas to be primary
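The lambda in the steps above would call Route53's change_resource_record_sets; a sketch of building the ChangeBatch it needs (the record name and replica endpoint are hypothetical, but the UPSERT/ResourceRecordSet structure matches the Route53 API shape):

```python
def failover_change_batch(record_name, replica_endpoint, ttl=60):
    """Build the ChangeBatch the failover lambda would pass to
    change_resource_record_sets to repoint the DB CNAME at the
    promoted read replica. A short TTL keeps failover fast."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",   # create or overwrite the record
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": replica_endpoint}],
                },
            }
        ]
    }

batch = failover_change_batch(
    "db.example.internal.",
    "replica.abc123.eu-west-1.rds.amazonaws.com",
)
```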
Route53 - Sharing central private DNS
1. Establish connectivity using VPC peering
2. Use the CLI to associate the VPC with the central private hosted zone (one association per account)
EC2 with Elastic IP
- Use Elastic IP re-assignment to fail over to a secondary standby EC2 instance
- Cheap, easy but does not scale
Stateless WebApp with horizontal scaling
Multiple EC2 instances behind a DNS A record with a 1-hour TTL. Clients can get outdated records and must handle hostname resolution failures themselves; a newly added instance may not receive traffic until the TTL expires.
ALB + ASG
Route53 Alias record (1 hour TTL) pointing at an ALB with health checks and Multi-AZ, fronting an ASG. New instances are available right away, but time to scale is slow due to EC2 startup and bootstrap times (a pre-built AMI can help).
Cannot handle a massive peak (may need pre-warming) and could lose a few requests. CW is used for scaling the ASG.
ALB + ECS on EC2 in ASG
Same as ALB + ASG, plus Docker; increased parallelism thanks to multiple task replicas enabled by dynamic port mapping. Tough to auto-scale since both ECS tasks and the underlying EC2 capacity must scale.
ALB + ECS on Fargate
Same as ALB + ECS on EC2 in ASG, but time to be in service is much quicker
ALB + Lambda
Limited to Lambda runtimes, seamless scaling (limited to 1,000 concurrent executions), can combine with WAF, good for hybrid setups, cheaper alternative to API Gateway + Lambda at the cost of the more advanced API Gateway features
API Gateway + Lambda
More expensive, pay per request, seamless scaling, soft limits of 10,000 req/s on API Gateway and 1,000 concurrent lambda executions (both can be raised), authentication, rate limiting, API keys, caching, lambda cold starts can increase latency, use X-Ray to debug, limited to 10MB payload