Compute & Load Balancing Flashcards
EC2 Main Family Types
R - need lots of RAM - in-memory caches
C - need lots of CPU - compute/database
M - balanced, (think medium) - general/ web app
I - need good local I/O (instance storage) - databases
G - need GPU - video rendering/machine learning
T2/T3 - burstable above a CPU baseline, up to the accrued credit capacity
T2/T3 Unlimited - unlimited burst above the baseline (surplus credits are billed)
EC2 Placement Groups
- Cluster - all instances within same AZ - low latency. Good for HPC.
- Spread - max 7 instances per group per AZ - critical applications
- Partition - instances spread across partitions (separate racks), up to 7 partitions per AZ, can span AZs in the same region, can scale to 100s of instances per group. Good for Cassandra, Kafka, Hadoop
Can you modify EC2 Placement Groups?
Yes. Stop the instance, run the CLI command modify-instance-placement, then start the instance again.
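The stop / modify / start sequence can be sketched as an ordered list of AWS CLI calls. This is a minimal sketch that only builds the commands and does not run them; the function name `placement_move_commands` and the IDs in the usage line are hypothetical.

```python
from typing import List


def placement_move_commands(instance_id: str, group_name: str) -> List[List[str]]:
    """Return the AWS CLI calls, in order, to move an instance into a placement group."""
    return [
        ["aws", "ec2", "stop-instances", "--instance-ids", instance_id],
        # The instance must be fully stopped before its placement can change.
        ["aws", "ec2", "wait", "instance-stopped", "--instance-ids", instance_id],
        ["aws", "ec2", "modify-instance-placement",
         "--instance-id", instance_id, "--group-name", group_name],
        ["aws", "ec2", "start-instances", "--instance-ids", instance_id],
    ]


# Hypothetical instance ID and group name, for illustration only.
commands = placement_move_commands("i-0123456789abcdef0", "my-cluster-group")
```

Each inner list could be handed to `subprocess.run` once credentials and a region are configured.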
EC2 Launch Types
- On-demand - short workload, predictable pricing, reliable
- Spot - short workload, cheap, can afford to lose instances, can be up to 90% cheaper
- Reserved - long workloads
- Convertible Reserved - long workloads, with the flexibility to convert the instance types
- Scheduled Reserved - specific schedule
- Dedicated Instances - no other customer will share the hardware
- Dedicated Host - entire physical server, control instance placement, for licenses that operate at the core or CPU socket level, can define host affinity so the instance stays on the same host across reboots
EC2 Instance Recovery
Can use a CW alarm to monitor the instance or system status checks and recover with the EC2 instance recovery action. The recovered instance keeps the same:
- private IP, public IP, Elastic IP, placement group, metadata
The alarm can also send an alert to SNS
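A recovery alarm of this kind could be described with parameters like the ones below, in the shape that boto3's `cloudwatch.put_metric_alarm` accepts. This is a sketch that only builds the dictionary; the alarm name, period, evaluation periods and statistic are assumptions chosen for illustration, and `recovery_alarm_params` is a hypothetical helper.

```python
from typing import Optional


def recovery_alarm_params(instance_id: str, region: str,
                          sns_topic_arn: Optional[str] = None) -> dict:
    """Build alarm parameters that trigger the EC2 recover action on a failed system check."""
    actions = [f"arn:aws:automate:{region}:ec2:recover"]  # built-in EC2 recover action
    if sns_topic_arn:
        actions.append(sns_topic_arn)  # optional alert to SNS
    return {
        "AlarmName": f"recover-{instance_id}",  # assumed naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",   # assumed
        "Period": 60,             # assumed: 1-minute periods
        "EvaluationPeriods": 2,   # assumed: two failing periods in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }
```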
Does ASG support Spot Fleet?
Yes, a mix of on-demand and spot instances (set the max price you are willing to pay), can have a mix of instance types
Target capacity can be huge: 10,000 per Spot/EC2 fleet, 100,000 across all fleets per region. Supports standalone EC2, ASGs, AWS Batch (Managed Compute Env), ECS
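A mix of on-demand and spot capacity across instance types is expressed on an ASG through a mixed instances policy. The sketch below builds that structure as a plain dictionary; the helper name, the default percentages and the allocation strategy are assumptions for illustration, not recommended values.

```python
from typing import List, Optional


def mixed_instances_policy(launch_template_name: str, instance_types: List[str],
                           on_demand_base: int = 1,
                           on_demand_pct_above_base: int = 25,
                           spot_max_price: Optional[str] = None) -> dict:
    """Build a MixedInstancesPolicy mixing on-demand and spot across instance types."""
    distribution = {
        "OnDemandBaseCapacity": on_demand_base,
        "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
        "SpotAllocationStrategy": "capacity-optimized",  # assumed strategy
    }
    if spot_max_price is not None:
        # Max price willing to pay per instance hour, passed as a string.
        distribution["SpotMaxPrice"] = spot_max_price
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": distribution,
    }
```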
How to upgrade ASG AMI?
- Modify launch configuration/template
- Manually terminate all instances (can use CloudFormation)
- ASG will start launching new instances loading new launch configuration/template
ASG Lifecycle Hooks
- Action before an instance is in service or is terminated
eg cleanup, log extraction, special health check etc
ASG Health Checks
- EC2 Status
- ELB Health checks (HTTP-based)
EC2 Spot Block
Block Spot instances for 1 to 6 hours without interruptions. In rare situations the instances can still be reclaimed.
Good for batch jobs, data analysis or workloads that are resilient to failures.
Not for critical jobs or databases.
ECS ALB integration
Supports dynamic port mapping
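With the bridge network mode, dynamic port mapping is requested by setting the host port to 0 in the container definition, so each task gets an ephemeral host port that the ALB target group can register. A minimal sketch follows; the helper name and the container values in the usage line are hypothetical.

```python
def dynamic_port_container(name: str, image: str, container_port: int) -> dict:
    """Build a container definition whose host port is assigned dynamically."""
    return {
        "name": name,
        "image": image,
        "portMappings": [
            # hostPort 0 asks for an ephemeral host port (bridge network mode).
            {"containerPort": container_port, "hostPort": 0, "protocol": "tcp"}
        ],
    }


# Hypothetical container, for illustration only.
web = dynamic_port_container("web", "nginx:latest", 80)
```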
Fargate
Is like ECS but serverless: you only need task definitions. No more EC2 :)
ECS Security
Two levels of roles:
- EC2 instance roles to have ECS permissions, so that ECS agent can work correctly
- ECS Task level IAM task roles. Trust relationship example
"Principal": {
"Service": "ecs-tasks.amazonaws.com"
},
"Action": "sts:AssumeRole"
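The fragment above sits inside a full trust policy document. One way to assemble the complete document is sketched below; the helper name `ecs_task_trust_policy` is hypothetical.

```python
import json


def ecs_task_trust_policy() -> dict:
    """Trust policy that lets ECS tasks assume the task role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# JSON text suitable for pasting into the role's trust relationship.
policy_json = json.dumps(ecs_task_trust_policy(), indent=2)
```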
ECS Secrets
Can inject from SSM Parameter Store & Secrets Manager
To inject sensitive data into your containers as environment variables, use the secrets container definition parameter.
To reference sensitive information in the log configuration of a container, use the secretOptions container definition parameter.
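The two parameters can be sketched in one container definition: `secrets` for environment variables and `secretOptions` inside the log configuration. Every name, ARN, URL and the choice of log driver below is a placeholder assumed for illustration.

```python
def container_with_secrets(db_secret_arn: str, log_token_param_arn: str) -> dict:
    """Build a container definition that injects secrets by reference, not by value."""
    return {
        "name": "app",                   # hypothetical container name
        "image": "example/app:latest",   # hypothetical image
        # Injected into the container as the environment variable DB_PASSWORD.
        "secrets": [{"name": "DB_PASSWORD", "valueFrom": db_secret_arn}],
        "logConfiguration": {
            "logDriver": "splunk",       # assumed log driver
            "options": {"splunk-url": "https://splunk.example.com"},
            # Sensitive log driver option resolved from SSM or Secrets Manager.
            "secretOptions": [{"name": "splunk-token", "valueFrom": log_token_param_arn}],
        },
    }
```

Only references (`valueFrom`) appear in the definition, so the secret values themselves stay out of the task definition.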
ECS Networking
none - no network connectivity, no port mappings
bridge - Docker virtual container-based networking
host - bypass Docker networking, use underlying host networking
awsvpc - every task on the instance gets own ENI and private IP
– Default for Fargate
– Monitoring, VPC flow logs, SGs, enhanced security
ECS - Autoscaling
Scaling is at the task level. On classic ECS you also need to scale the underlying EC2 instances; on Fargate this is handled automatically.
Can use RAM (or CPU) as the metric with the following scaling strategies:
- Target Tracking
- Scheduled Scaling
- Step Scaling
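Target tracking on memory can be sketched as a policy configuration plus a rough estimate of the task count it would aim for. The proportional formula below is a simplified approximation that ignores cooldowns and alarm evaluation; both helper names and the min/max defaults are assumptions.

```python
import math


def memory_target_tracking_config(target_percent: float) -> dict:
    """Target tracking configuration keyed on average service memory utilization."""
    return {
        "TargetValue": float(target_percent),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
        },
    }


def estimate_desired_tasks(current_tasks: int, current_percent: float,
                           target_percent: float,
                           min_tasks: int = 1, max_tasks: int = 10) -> int:
    """Approximate the task count that brings utilization back to the target."""
    raw = math.ceil(current_tasks * current_percent / target_percent)
    return max(min_tasks, min(max_tasks, raw))  # clamp to the allowed range
```

For example, 4 tasks at 90% memory with a 60% target suggests scaling out to 6 tasks.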
ECS Spot instances
Supported in both ECS classic (cheaper but less reliable; the instance goes into drain mode when a spot instance is shut down) and Fargate (can specify a baseline number of tasks)
AWS Lambda Integrations
- Thumbnail creation in S3, store to S3 and put metadata in DynamoDb for caching
- Serverless cron job, through scheduled CW event.
Lambda Limits
RAM - 128 MB -> 10,240 MB (10 GB)
More RAM more CPU, the second CPU gets added after 1.5GB
Timeout is 15 minutes
Has 512 MB temp storage (/tmp) by default, configurable up to 10 GB
Deployment package 250 MB max unzipped, including layers
Concurrent executions - 1,000 soft limit per region, can be increased
Lambda Latency Considerations
Cold invocation - 100 ms
Warm invocations - a matter of ms
New feature “provisioned concurrency” to keep invocations warm
Hops to API Gateway or CloudFront will add ~100ms
Use X-ray to debug end-to-end latency
Lambda Security
- IAM execution role to grant the function access to other services
- Resource-based policy to allow:
- other AWS services to invoke the lambda
- other accounts to invoke or manage the lambda
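A resource-based policy statement that lets another AWS service invoke a function might look like the sketch below, here for S3. The helper name and the `Sid` are hypothetical, and the ARNs and account ID are supplied by the caller.

```python
def s3_invoke_statement(function_arn: str, bucket_arn: str, account_id: str) -> dict:
    """Policy statement allowing S3 to invoke the function for one bucket."""
    return {
        "Sid": "AllowS3Invoke",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": function_arn,
        "Condition": {
            # Restrict the caller to one bucket owned by one account.
            "StringEquals": {"AWS:SourceAccount": account_id},
            "ArnLike": {"AWS:SourceArn": bucket_arn},
        },
    }
```

Cross-account access follows the same shape, with an AWS account principal in place of the service principal.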
Lambda in VPC
- Is a deployment option
- By default, Lambda runs in an AWS-managed network and can access the public internet and public services (DynamoDB)
- In a VPC, it gets an ENI and can have SGs assigned to it
- To talk to an external API or DynamoDB from a private subnet:
- Option 1: NAT Gateway in a public subnet plus an IGW
- Option 2 (better solution for DynamoDB): DynamoDB gateway VPC endpoint, the private access route to DynamoDB from the private subnet, needs route table configuration
Lambda Logging, Tracing and Monitoring
- Make sure execution role has permissions to write to CW Logs
- X-ray can be enabled via lambda configuration, also need IAM role permissions to access X-ray
Lambda Sync
Invocations from the CLI, SDK and API Gateway are synchronous
Lambda Async
S3, SNS, CW Events. Retried twice on errors (3 attempts in total), so the processing within the lambda needs to be idempotent. Can define a DLQ with SNS or SQS as the target.
Lambda Event Source Mapping
Records are polled from the source; order is preserved except for SQS standard queues.
If the function fails, the entire batch is re-processed until success, meaning:
- Kinesis and DynamoDB Streams stop processing the shard, or you can send failed events to SNS or SQS
- SQS FIFO stops unless a DLQ is defined
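The whole-batch retry behaviour can be illustrated with a small local simulation. It is not the Lambda service itself: `process_stream` is a hypothetical helper, `max_attempts` stands in for a configured retry limit, and the `failed` list stands in for an on-failure destination.

```python
from typing import Callable, List, Tuple


def process_stream(batches: List[list], handler: Callable[[list], None],
                   max_attempts: int = 3) -> Tuple[List[list], List[list]]:
    """Process batches in order, retrying a failed batch as a whole."""
    processed, failed = [], []
    for batch in batches:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(batch)           # the entire batch is handed over again
                processed.append(batch)
                break                    # later batches wait until this one is done
            except Exception:
                if attempt == max_attempts:
                    failed.append(batch)  # give up and divert the batch
    return processed, failed
```

Later batches are only attempted after the current one succeeds or is diverted, which mirrors how a failing batch blocks its shard.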