AWS: CDA Flashcards
What is Elastic Beanstalk?
- Developer-centric view of deploying an app on AWS
- Fully managed service
- Handles capacity provisioning, load balancing, scaling, monitoring
- Free service but underlying AWS resources will have costs involved
What are the components involved in Elastic Beanstalk?
- Application
- Collection of components
- Application version
- Iteration of app code
- Environment
- Collection of AWS resources running an app version
- Env tiers to support diff types of apps
- Multiple env can be created ie. dev/prod
What are the use cases for the different environment tiers?
- Web server environment tier
- Website
- Web app
- Web app serving HTTP requests
- Worker environment tier
- Processing long-running workloads on demand
- Perform tasks on a schedule
List the different deployment options available for Elastic Beanstalk
- All-at-once deployment
- Fastest
- Instances are down during update
- No additional cost
- Rolling deployment
- Zero downtime
- Deployment time depends on number of instances
- No additional cost
- Rolling deployment with additional batches
- Zero downtime
- Deployment time depends on number of instances
- Small additional cost
- Immutable deployment
- Zero downtime
- Longest deployment
- High cost, double capacity
- Quick rollback in case of failure
- Blue/green deployment
- Zero downtime
- Traffic-splitting deployment
- Zero downtime
- Quick rollback in case of failure
How do rolling deployments work?
- Apps running below capacity with set bucket size
- Instances in the bucket will be down during update
- Once instances in the bucket are updated, process repeats for next bucket (batch of instances)
How do rolling deployments with additional batches work?
- App running at capacity with set bucket size
- New instances created with upgraded version
- Existing instances go through rolling deployments
- After rolling deployment is complete, the newly-created instances will be terminated
How do blue/green deployments work?
- Create a new environment (green) where new version is deployed
- Easy rollback to old environment (blue) if issues arise
How do traffic-splitting deployments work?
- Used for canary testing
- New app version deployed to temp ASG with same capacity
- Small % of traffic sent to temp ASG for a configurable amount of time
- New instances migrated from temp to original ASG and then old version is terminated
- Automated rollback if issues arise
What is the lifecycle policy for Elastic Beanstalk?
- A configurable policy to limit no. of app versions to retain for future deployments
- Limit by count
- Limit by age
- Must be enabled first to configure policy
What happens under the hood for Elastic Beanstalk?
- Relies on CloudFormation
- CloudFormation is infra as code
- Used to provision other AWS services
What is Elastic Beanstalk cloning?
- Clone an environment with exact same configuration
- All resources and config are preserved
- After cloning an environment, you can modify settings
- Useful for deploying a “test” version of your app
What is API Gateway?
- Serverless service to manage and secure APIs
- A single interface for all microservices
- Use API endpoints with various resources
- Apply forwarding and transformation rules at API Gateway level
What are some features of API Gateway?
- Support websocket protocols
- Transform and validate requests/responses
- Handle request throttling
- Cache API responses
- Handle API versioning
- Handle different environments
- Handle security
What are the different endpoint types for API Gateway?
- Edge-optimised (default)
- Requests routed through CloudFront Edge locations to improve latency
- API Gateway still only lives in one region
- Regional
- For clients within same region
- Could be manually combined with CloudFront
- Private
- Only accessed from your VPC using interface VPC endpoint
- Use resource policy to define access
What are the user authentication strategies available for API Gateway?
- IAM roles
- Useful for internal applications
- AWS Cognito
- Useful for external users
- Custom authoriser (your own logic via Lambda function)
How can you have security with your own custom domain name by integrating API Gateway with ACM?
- If using edge-optimised endpoint, certificate must be in us-east-1
- If using regional endpoint, certificate must be in API Gateway region
- Must setup CNAME or A-alias record in Route 53
What are stage variables in API Gateway?
- Similar to environment variables
- Used to update frequently changing config values
- If used in Lambda functions, they are passed to the “context” object
What are the use cases for stage variables in API Gateway?
- They can be used in:
- Lambda function ARN
- HTTP endpoint
- Parameter mapping templates
- Use cases:
- Configure HTTP endpoints that the stages talk to (dev, test, prod)
- Pass config parameters to Lambda functions through mapping templates
How to perform canary deployments in API Gateway?
- Usually done with prod
- Choose % of traffic the canary channel receives
- Metrics/logs are separate for better monitoring
- Stage variables can be overridden for canary deployments
- Once canary deployments have been tested and if all good, they can be promoted to entire stage
What are the different integration types for API Gateway?
- Mock integration type
- API Gateway returns a response without sending a request to backend
- HTTP/AWS services
- Both integration req and res must be configured
- Setup data mapping using mapping templates for req and res
- AWS proxy integration type
- Incoming req from client is the input to Lambda
- Lambda function is responsible for logic of req/res
- No mapping template/headers/query params are passed as arguments
- HTTP proxy integration type
- HTTP req is passed to backend
- HTTP res from backend is forwarded by API Gateway
- No mapping templates
- Optionally add HTTP headers if needed eg. API key
What are mapping templates in API Gateway?
- Templates used to modify req/res
- Rename/modify query string params
- Modify body content
- Add headers
- Filter result output
- To set the template, the content-type must be set to either application/json or application/xml
- Not used for proxy integration types
How can API Gateway utilise request validation?
- Importing Open API definitions
- The spec is used to verify if req corresponds to proper schema before proceeding with req
- If validation fails, API Gateway immediately fails req
- Reduces unnecessary calls to backend
How does caching work in API Gateway?
- Caching reduces number of calls to backend
- API Gateway will first check cache
- If cache miss, call backend
- Default TTL is 300s
- Cache is expensive - makes sense for prod but may not make sense for dev/test
What is the difference between latency vs integration latency in API Gateway?
- Integration latency
- Time between when API Gateway relays req to backend and receives a response from backend
- Latency
- Time between when API Gateway receives req from client and when it returns response to client
- Includes integration latency and other API Gateway overhead
What are WebSockets?
- Two-way interactive communication between user’s browser and a server
- Server can push information to client
- Enables stateful application use cases
- Often used for real-time apps
How does API Gateway handle WebSocket routing?
- API Gateway uses a route key table that incoming JSON messages are evaluated against
- If no routes, sent to $default
- Route is then connected to the backend setup through API Gateway
What are the different security strategies for API Gateway?
- IAM
- Great for existing users/roles in AWS
- Authentication via IAM
- Authorisation via IAM policies
- Can be combined with resource policies for cross-accounts
- Leverages Signature v4 (SigV4) where IAM credentials are passed in headers
- Custom authoriser
- Great for 3rd party tokens
- Authentication via 3rd party system but verified in Lambda
- Authorisation via Lambda fn
- Lambda must return IAM policy for user - result is cached
- Cognito User Pool
- Great if you want a fully managed authentication service
- Tokens expire automatically
- Authentication via Cognito User Pools
- Authorisation via API Gateway methods
What is DynamoDB?
- Managed NoSQL database
- Highly available with replication across multi AZ
- Scales to massive workloads
- Fast and consistent performance
- Low cost and auto-scaling capabilities
What are features of NoSQL databases?
- Non-relational databases
- eg. MongoDB, DynamoDB
- Distributed
- Scale horizontally
- Do not support query joins/aggregation computations
- All data needed is presented in one row
Describe DynamoDB tables
- Each table has a primary key
- Must be decided at creation time
- Non-null
- Each item has attributes
- Similar to columns but more powerful
- Can be added over time - can be null at creation time
What are the different strategies of choosing a primary key for DynamoDB tables?
- Partition key (hash)
- Unique for each item
- Diverse so data is distributed
- Example: “user_id” for “users” table
- Partition key + sort key
- Combination must be unique for each item
- Data grouped by partition key
- Example: “user_id” as partition key and “game_id” as sort key for a “users_games” table
What happens when read and write throughput is exceeded for DynamoDB?
- Table must have provisioned read/write capacity units
- Can setup auto-scaling
- Throughput can be exceeded temporarily using burst capacity
- If burst capacity has been consumed, a ProvisionedThroughputExceededException is thrown
- It’s then advised to retry with exponential backoff or distribute writes across partition keys, as shown below
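A minimal boto3 sketch of the backoff approach; the table name and item are hypothetical, and note that boto3's built-in retry modes already handle some of this automatically:

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")  # hypothetical table

def put_with_backoff(item, max_retries=5):
    for attempt in range(max_retries):
        try:
            return table.put_item(Item=item)
        except ClientError as e:
            if e.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt * 0.1)  # exponential backoff: 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("retries exhausted")

put_with_backoff({"user_id": "u1", "name": "Alice"})
```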
What is the difference between eventually consistent read vs strongly consistent read?
- Eventually consistent read
- Possibility of a lag where data has not been replicated but a read has been made
- Strongly consistent read
- Ensures no data staleness
- Consumes twice the RCUs
What are operations for writing data to DynamoDB?
- PutItem
- Creates new item or replaces old item (same primary key)
- Consumes WCUs
- UpdateItem
- Edits an existing item’s attributes or adds a new item if it doesn’t exist
- Can be used to implement atomic counters
- Conditional writes
- Accepts a write/update/delete only if conditions are met
- Helps with concurrent access to items
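A minimal boto3 sketch of a conditional write; the table name, key, and attributes are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("users")  # hypothetical table

try:
    # Only create the item if no item with this primary key exists yet
    table.put_item(
        Item={"user_id": "u1", "name": "Alice"},
        ConditionExpression="attribute_not_exists(user_id)",
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Item already exists - write rejected")
    else:
        raise
```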
What are the operations for reading data from DynamoDB?
- GetItem
- Read based on primary key
- Eventually consistent reads by default, with the option to use strongly consistent reads (more RCUs)
- Query
- Returns items based on KeyConditionExpression and FilterExpression
- Ability to paginate results
- Scan
- Scans entire table and then filter data (inefficient)
- Consumes a lot of RCUs
- Use parallel scans for faster performance - consumes significantly more RCUs
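A hedged boto3 sketch of a paginated Query; the table, key names and filter are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

# Hypothetical table with partition key "user_id" and sort key "game_id"
table = boto3.resource("dynamodb").Table("users_games")

kwargs = {
    "KeyConditionExpression": Key("user_id").eq("u1"),  # targets a single partition
    "FilterExpression": Attr("score").gt(100),          # applied after items are read
}
while True:
    resp = table.query(**kwargs)
    for item in resp["Items"]:
        print(item)
    if "LastEvaluatedKey" not in resp:  # paginate until exhausted
        break
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```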
What are the operations for deleting items from DynamoDB?
- DeleteItem
- Deletes individual item
- Ability to perform conditional delete
- DeleteTable
- Delete entire table and its contents
What are the benefits of batching operations in DynamoDB?
- Reduce latency by reducing number of API calls
- Operations are done in parallel for better performance
What are the operations for batching in DynamoDB?
- BatchGetItem
- Returns items from one or more tables
- Items retrieved in parallel to reduce latency
- BatchWriteItem
- Can’t update items
- UnprocessedItems
- Returned for failed write operations - retry them (e.g. with exponential backoff)
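A short boto3 sketch; the resource-level batch_writer helper batches PutItem calls and retries UnprocessedItems for you (the table name is hypothetical):

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# batch_writer sends requests in batches of up to 25 and
# automatically resubmits any UnprocessedItems
with dynamodb.Table("users").batch_writer() as batch:  # hypothetical table
    for i in range(100):
        batch.put_item(Item={"user_id": f"u{i}"})
```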
What is PartiQL?
- SQL-like syntax to manipulate DynamoDB tables
- Run queries across multiple DynamoDB tables
- Supports some (not all) SQL statements
- INSERT
- UPDATE
- SELECT
- DELETE
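A minimal boto3 sketch of running a PartiQL statement; the table and key are hypothetical:

```python
import boto3

client = boto3.client("dynamodb")

# Parameters use the low-level AttributeValue format
resp = client.execute_statement(
    Statement='SELECT user_id, email FROM "users" WHERE user_id = ?',
    Parameters=[{"S": "u1"}],
)
print(resp["Items"])
```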
What is the difference between GSI vs LSI?
- Global Secondary Index (GSI)
- Query a specific index that spans all data in base table, across all partitions
- Support eventual consistency only
- Can be added/updated after table creation
- Must provision RCUs and WCUs for index
- Queries or scans on this index consume capacity units from the index, not from the base table
- If writes are throttled on the GSI, the main table will also be throttled
- Local Secondary Index (LSI)
- Only added at table creation
- Uses RCUs and WCUs of main table
- No special throttling consideration
What is optimistic locking in DynamoDB?
- Conditional writes
- A strategy to ensure an item hasn’t changed before it is updated/deleted
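A sketch of optimistic locking with a version attribute; all names and values are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("products")  # hypothetical table

def update_price(product_id, new_price, expected_version):
    try:
        table.update_item(
            Key={"product_id": product_id},
            UpdateExpression="SET price = :p, version = :new",
            ConditionExpression="version = :expected",  # fails if someone updated first
            ExpressionAttributeValues={
                ":p": new_price,
                ":new": expected_version + 1,
                ":expected": expected_version,
            },
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            print("Item changed since it was read - re-read and retry")
        else:
            raise
```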
What is DynamoDB Accelerator (DAX)?
Fully managed, highly available, in-memory cache for DynamoDB
What are the key features of DAX?
- Microseconds latency for cached reads
- Compatible with existing DynamoDB APIs
- Doesn’t require any modifications to application logic
- Solves the “hot key” problem (too many reads)
- Secure
- Multi AZ
- Min 3 nodes recommended for prod
- Default TTL of 5 mins
- Up to 10 nodes per cluster
What are DynamoDB streams?
- Ordered stream of item-level modifications (create/update/delete) in a table
- Streamed records can be:
- Sent to Kinesis
- Read by Lambda
- Data retention up to 24 hours
- Records are not retroactively populated after streams are enabled
What are some use cases for DynamoDB streams?
- React to changes in real-time
- Analytics
- Implement cross-region replication
How do DynamoDB streams work with Lambda?
- Define Event Source Mapping to poll from DynamoDB streams and receive records in batches
- Ensure Lambda function has appropriate permissions to read from stream
- Lambda function is invoked synchronously with batch of records
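A minimal boto3 sketch of creating the Event Source Mapping; the stream ARN and function name are hypothetical:

```python
import boto3

lam = boto3.client("lambda")

lam.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/users/stream/2024-01-01T00:00:00.000",
    FunctionName="process-user-changes",
    StartingPosition="LATEST",  # required for stream sources
    BatchSize=100,              # records delivered per invocation
)
```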
How does the TTL feature work in DynamoDB?
- Automatically delete items after an expiry timestamp
- Expired items are deleted within 48 hrs
- Deleted from both GSI and LSI
- TTL attribute must be a “number” data type with Unix epoch timestamp value
- Doesn’t consume any WCUs
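A short boto3 sketch of enabling TTL and writing an expiring item; the table and attribute names are hypothetical:

```python
import time
import boto3

client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table("sessions")  # hypothetical table

# Enable TTL on the table, pointing at a Number attribute
client.update_time_to_live(
    TableName="sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write an item that expires ~24 hours from now (Unix epoch seconds)
table.put_item(Item={
    "session_id": "abc123",
    "expires_at": int(time.time()) + 24 * 3600,
})
```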
What are the use cases for enabling a TTL in DynamoDB?
- Reduce stored data by keeping only current items
- Adhere to regulatory obligations
In DynamoDB CLI, what does the --projection-expression flag do?
One or more attributes to retrieve
In DynamoDB CLI, what does the --filter-expression flag do?
Filter items before being returned
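The boto3 equivalents of these two flags, sketched on a hypothetical table; note the filter is applied server-side after items are read, so the scan still consumes RCUs for every item scanned:

```python
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("users")  # hypothetical table

resp = table.scan(
    ProjectionExpression="user_id, email",  # attributes to retrieve
    FilterExpression=Attr("age").gt(21),    # filter items before they are returned
)
print(resp["Items"])
```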
How does the transactions feature work in DynamoDB?
- Co-ordinated all-or-nothing operations
- Provides ACID (atomicity, consistency, isolation, durability)
- Read modes:
- Eventual consistency
- Strong consistency
- Transactional consistency
- Write modes:
- Standard consistency
- Transactional consistency
- Consumes 2x WCUs and RCUs
- Performs 2 operations for every item (prepare and commit)
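A hedged boto3 sketch of an all-or-nothing transfer using TransactWriteItems; table, keys and amounts are hypothetical:

```python
import boto3

client = boto3.client("dynamodb")

# Both updates succeed together or neither is applied
client.transact_write_items(TransactItems=[
    {"Update": {
        "TableName": "accounts",  # hypothetical table
        "Key": {"account_id": {"S": "a1"}},
        "UpdateExpression": "SET balance = balance - :amt",
        "ConditionExpression": "balance >= :amt",  # reject overdrafts
        "ExpressionAttributeValues": {":amt": {"N": "100"}},
    }},
    {"Update": {
        "TableName": "accounts",
        "Key": {"account_id": {"S": "a2"}},
        "UpdateExpression": "SET balance = balance + :amt",
        "ExpressionAttributeValues": {":amt": {"N": "100"}},
    }},
])
```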
What are the use cases for transactions?
- Financial transactions
- Managing orders
- Multi-player games
What is the difference between using DynamoDB as session state cache vs ElastiCache or EFS?
- ElastiCache
- ElastiCache is in-memory but DynamoDB is serverless
- Both are key/value stores
- DynamoDB has auto-scaling
- EFS
- Must be attached to EC2 instances as network drives
What are the different write types of DynamoDB?
- Concurrent writes
- Conditional writes
- Atomic writes
- Batch writes
What are some AWS services that can be used to decouple applications?
- SQS
- SNS
- Kinesis
What is SQS?
- Fully managed service that queues messages
- Consist of:
- Producer(s) - sends messages to the queue
- Consumer(s) - polls and processes messages from the queue
What are the different types of SQS queues?
- Standard
- FIFO
What are key features of SQS?
- Unlimited throughput
- Unlimited no. of messages in the queue
- Retention of messages
- Default: 4 days
- Max: 14 days
- Low latency (<10 ms)
- Message size limit of 256 KB/message
- Can have duplicate messages - at least once delivery
- Can have out of order messages - best effort ordering
How does SQS produce messages?
- Send messages using SDK
- SendMessage API
- Message persisted in SQS until a consumer deletes it which signifies that it has been processed
How does SQS consume messages?
- Poll SQS for messages
- Receive up to 10 messages at a time
- Process the messages
- Delete the message using SDK
- DeleteMessage API
- Scale consumers horizontally to improve throughput
- Can have multiple consumers process messages in parallel
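A minimal boto3 sketch of the produce/consume loop, including long polling; the queue URL is hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

# Producer
sqs.send_message(QueueUrl=queue_url, MessageBody="hello")

# Consumer: long poll, process, then delete to mark the message as processed
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,  # up to 10 messages per call
    WaitTimeSeconds=20,      # long polling (1-20s)
)
for msg in resp.get("Messages", []):
    print("processing:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```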
How can SQS be used with ASG to increase throughput?
- Have multiple EC2 instances in an ASG consuming SQS messages
- SQS has built-in CloudWatch metrics that can trigger an alarm if messages go over a certain number
- ApproximateNumberOfMessages
- Alarm can trigger ASG to scale
What are the security strategies for SQS?
- Encryption
- In-flight encryption using HTTPS API
- At-rest encryption using KMS keys
- Client-side encryption (client will need to handle encryption/decryption itself)
- Access controls
- IAM policies to regulate access to SQS API
- SQS queue access policies
- Resource policy (similar to S3 bucket policies)
- Useful for cross-account access
- Useful for other services to write to SQS
What does message visibility timeout mean in SQS?
- After message is polled by consumer, it becomes invisible to other consumers
- Default 30s for messages to be processed
- After message visibility timeout lapses, message is then visible again in SQS
What are dead letter queues in SQS?
- If consumer fails to process a message within visibility timeout, then message goes back into the queue
- Threshold can be set to limit how many times a message can go back into the queue
- After threshold (MaximumReceives) is exceeded, message sent to DLQ
- DLQ must be the same queue type as its source queue
- DLQ of FIFO queue must also be FIFO queue
- DLQ of standard queue must also be standard queue
- Useful for debugging
- Set a long retention period (max 14 days) on the DLQ so messages can be processed before they expire
What is the “re-drive to source” feature of DLQ?
- Help consume messages in DLQ to understand what is wrong
- Allow manual inspection and debugging
- When code is fixed, we can re-send message back into source queue in batches to be reprocessed
- No custom code needed
What are delay queues in SQS?
- Delays a message so consumers can’t receive it immediately
- Default is 0s - message available immediately
- Can be delayed up to 15 mins
- Default can be overridden on send using DelaySeconds parameter
What is long polling in SQS?
- When a consumer requests messages from the queue, it can optionally wait for messages to arrive if there are none in the queue
- Wait time can be 1-20s
- Long polling decreases the no. of API calls while reducing latency, as messages are returned as soon as they arrive
- Can be enabled at:
- Queue level
- API level using ReceiveMessageWaitTimeSeconds
What is SQS Extended Client?
- Java library
- Used to send large messages (e.g. 1GB), since the standard message size limit is 256 KB
How does SQS Extended Client work?
- Producer stores the large message in S3 bucket
- Producer sends metadata message to SQS which references the path to S3 bucket
- Consumer receives metadata message from SQS and uses it to retrieve large message from S3 bucket
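The Extended Client itself is a Java library, but the pattern is easy to hand-roll; a Python sketch with hypothetical bucket and queue names:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "my-large-messages"  # hypothetical bucket
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/big-queue"  # hypothetical

# Producer: store the payload in S3, send only a small pointer through SQS
large_payload = b"x" * (10 * 1024 * 1024)  # stand-in for a message over 256 KB
key = f"payloads/{uuid.uuid4()}"
s3.put_object(Bucket=BUCKET, Key=key, Body=large_payload)
sqs.send_message(QueueUrl=QUEUE_URL,
                 MessageBody=json.dumps({"s3_bucket": BUCKET, "s3_key": key}))

# Consumer: resolve the pointer back into the real payload
msg = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=10)["Messages"][0]
ref = json.loads(msg["Body"])
payload = s3.get_object(Bucket=ref["s3_bucket"], Key=ref["s3_key"])["Body"].read()
sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```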
What are the key features of FIFO SQS queues?
- Messages are processed in order by consumer
- Exactly-once send capability by removing duplicates
- Limited throughput
- 300 messages/sec without batching
- 3000 messages/sec with batching
What are the ways FIFO queues handle message deduplication?
- Deduplication interval is 5 mins
- Two methods:
- Content-based deduplication via hashing the message body
- Explicitly provide Message Deduplication ID
How do FIFO queues handle message grouping?
- If same MessageGroupID is used in a FIFO queue, you can only have 1 consumer and all messages are in order
- For ordering at the level of a subset of messages, specify different values for MessageGroupID
- Each group ID can have a different consumer
- Ordering across groups is not guaranteed
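A minimal boto3 sketch of a FIFO send with grouping and explicit deduplication; the queue URL and IDs are hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # hypothetical

sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"order_id": 42, "status": "PAID"}',
    MessageGroupId="user-u1",                # ordering preserved within this group
    MessageDeduplicationId="order-42-paid",  # duplicates within 5 mins are dropped
)
```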
What is SNS?
- Simple Notification Service
- Uses the pub/sub model
- The “event producer” sends messages to SNS topic
- “Event receivers” subscribe to SNS notifications
- Each subscriber will get all the messages
What are the key features of SNS?
- Up to 12,500,000 subscriptions per topic
- 100k topics limit
- Integrates with a lot of AWS services
How to publish events using SNS?
- Topic publish using the SDK
- Create a topic
- Create a subscription (or many subscriptions)
- Publish to topic
- Direct publish using mobile apps SDK
- Create a platform app
- Create platform endpoint
- Publish to platform endpoint
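A minimal boto3 sketch of the topic-publish flow; the topic name and queue ARN are hypothetical:

```python
import boto3

sns = boto3.client("sns")

topic_arn = sns.create_topic(Name="order-events")["TopicArn"]  # hypothetical topic

# Subscribe an (assumed pre-existing) SQS queue to the topic;
# the queue's access policy must allow SNS to send messages (not shown)
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:orders-queue",  # hypothetical ARN
)

sns.publish(TopicArn=topic_arn, Message='{"order_id": 42}')
```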
What are the security strategies for SNS?
- Encryption
- In-flight encryption using HTTPS APIs
- At-rest encryption using KMS keys
- Client-side encryption (client will handle encryption/decryption itself)
- Access controls
- IAM policies to regulate access to SNS API
- SNS access policies
- Resource policy (similar to S3 bucket policies)
- Useful for cross-account access
- Useful for other services to write to SNS
What is the fan out pattern involving SNS and SQS?
- Push once in SNS and receive the event in multiple SQS queues that are subscribers
- Fully decoupled with no data loss
How does SNS handle message filtering?
- JSON policy used to filter messages sent to SNS topic subscribers
- If subscriber does not have a filter policy, it receives every message
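A sketch of attaching a filter policy and publishing a matching message attribute; the ARNs are hypothetical:

```python
import json
import boto3

sns = boto3.client("sns")

# Only deliver messages whose "state" attribute is "PLACED" to this subscriber
sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:order-events:sub-id",  # hypothetical
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"state": ["PLACED"]}),
)

# Publishers must set the matching message attribute for filtering to apply
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",  # hypothetical
    Message='{"order_id": 42}',
    MessageAttributes={"state": {"DataType": "String", "StringValue": "PLACED"}},
)
```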
What is Kinesis?
- Collect, process and analyse streaming data in real-time
- Ingest real-time data eg. logs, metrics, IoT telemetry data
What are the different data types for Kinesis?
- Kinesis Data Streams
- Capture, process and store data streams
- Kinesis Data Firehose
- Load data streams into AWS data stores
- Kinesis Data Analytics
- Analyse data streams with SQL or Apache Flink
- Kinesis Video Streams
- Capture, process and store video streams
How do Kinesis Data Streams work?
- Consist of multiple shards
- Shards are numbered
- Must be provisioned ahead of time
- Producers produce records into Kinesis Data Stream
- Producers can send data at a rate of 1MB/sec or 1000 records/sec per shard
- Kinesis Data Stream will send data via records to multiple consumers
What are Kinesis producers?
- Puts data records into data streams
- PutRecord API
- Use batching to reduce costs and increase throughput
- Example producers:
- AWS SDK
- Kinesis Producer Library (KPL)
- Kinesis Agent
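A minimal boto3 sketch of producing records, single and batched; the stream name and partition keys are hypothetical:

```python
import boto3

kinesis = boto3.client("kinesis")

# Records with the same partition key land on the same shard (ordering preserved)
kinesis.put_record(
    StreamName="telemetry",  # hypothetical stream
    Data=b'{"device": "d1", "temp": 21.5}',
    PartitionKey="d1",
)

# Batching with PutRecords reduces cost and increases throughput
kinesis.put_records(
    StreamName="telemetry",
    Records=[
        {"Data": b'{"temp": 22.0}', "PartitionKey": "d1"},
        {"Data": b'{"temp": 19.3}', "PartitionKey": "d2"},
    ],
)
```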
What are Kinesis consumers?
- Get data records from data streams and process them
- Example consumers:
- AWS SDK
- AWS Lambda
- Kinesis Data Analytics
- Kinesis Data Firehose
- Kinesis Client Library (KCL)
What do records consist of in Kinesis Data Streams?
- Partition key
- Determines which shard the record goes to (the partition key is hashed)
- Sequence number
- Unique per partition key
- Represents where the record was in the shard
- Data blob
- Value itself
- Up to 1MB
What are the two consumption modes for Kinesis Data Streams?
- Shared (classic)
- Consumers poll data from Kinesis
- 2MB/sec/shard shared across all consumers
- Latency ~200ms
- Inexpensive
- Enhanced
- Kinesis pushes data to consumers
- 2MB/sec/shard per consumer
- Latency ~70ms
- Expensive
What are the key features of Kinesis Data Streams?
- Retention between 1-365 days
- Ability to reprocess (replay) data
- Immutability
- Once data is inserted in Kinesis, it can’t be deleted
- Ordering
- Data sharing the same partition key goes to the same shard
What are the capacity modes for Kinesis Data Streams?
- Provisioned mode
- Choose the number of shards provisioned
- Pay per shard provisioned per hour
- On-demand mode
- Auto-scaling
- Pay per stream per hour
What are the security strategies for Kinesis Data Streams?
- Control access/authorisation using IAM policies
- Encryption
- In-flight encryption using HTTPS APIs
- At-rest encryption using KMS keys
- Client-side encryption
- VPC endpoints available for Kinesis to access within VPC without going through internet
What is the ProvisionedThroughputExceeded error and its solution?
- Caused by over producing into a shard
- Limit is 1MB/sec/shard (1000 records/sec)
- Solutions:
- Use a highly distributed partition key
- Implement retries with exponential backoff
- Increase shards (shard splitting)
What is the Kinesis Client Library?
- Java library that helps read records from Kinesis Data Streams with distributed applications sharing read workload
- Each shard is read by 1 KCL instance
- 1 shard = max 1 KCL instance
- 5 shards = max 5 KCL instances
How can we scale Kinesis?
- Scale up via shard splitting
- Scale down via shard merging
What does shard splitting mean in Kinesis?
- Shard split into 2 new shards and old shard is closed - old shard deleted once data expires
- Used to increase stream capacity
- Used to divide a “hot shard”
- Increased cost
- No automatic scaling
- Shards can’t be split into more than 2 shards in a single operation
What does shard merging mean in Kinesis?
- Merge 2 shards with low traffic (aka “cold shards”) - new shard is created and old shard is closed with data deleted once it expires
- Used to decrease stream capacity and save costs
- More than 2 shards can’t be merged in a single operation
What is Kinesis Data Firehose?
- Fully managed, serverless service
- Automatic scaling
- Only pay for data going through Firehose
- Near real-time
- Data is sent in batches
- Min. 60s latency for non-full batches; batches are also flushed once at least 1 MB accumulates
- Support data transformations using Lambda function
What are the components that make up Kinesis Data Firehose?
- Producers:
- Kinesis Data Streams
- Kinesis Agent
- SDK, KPL
- Optionally transform data via Lambda function
- Optionally send failed writes or all data to a backup S3 bucket
- Consumers (aka destinations):
- 3rd party partner destinations
- AWS destinations
- S3
- Redshift via copying through S3
- OpenSearch
- Custom destination (HTTP endpoint)
What is the difference between Kinesis Data Streams vs Kinesis Data Firehose?
- Kinesis Data Streams
- Streaming service to ingest data at scale
- Manage scaling
- Write custom code for producers/consumers
- Real-time
- Data retention between 1-365 days
- Supports replay capability
- Kinesis Data Firehose
- Streaming service to load data into destinations
- Fully managed
- Near real-time
- Automated scaling
- No data retention
- No support for replay capability
What is Kinesis Data Analytics?
- Real-time analytics on Kinesis Data Streams and Firehose using SQL
- Fully managed
- Automated scaling
- Pay for actual consumption rate
- Use cases:
- Time-series analytics
- Real-time dashboard/metrics
What is the difference between SQS vs SNS vs Kinesis?
- SQS
- Consumer pull data
- Data deleted after being consumed
- Message ordering only on FIFO queues
- SNS
- Uses pub/sub model
- Pushes data to many subscribers where they each get a copy of the data
- Data is not persisted (lost if not delivered)
- Kinesis
- Standard mode (pull data)
- Enhanced mode (push data)
- Message ordering at shard level
What are the differences between EC2 vs Lambda?
- EC2
- Virtual servers in the cloud
- Limited by RAM and CPU
- Continuously running
- Scaling requires intervention to add/remove servers
- Lambda
- Virtual functions - no servers to manage
- Limited by time (max 15 mins)
- Runs on-demand and billed only when invoked
- Automatic scaling
What are the requirements for running Lambda on a container?
- Container image must implement Lambda Runtime API
- ECS/Fargate is preferred for running Docker images
Which services work synchronously with Lambda?
- ALB
- API Gateway
- CloudFront (Lambda@Edge)
- S3 batch
- Cognito
- Step functions
Which services work asynchronously with Lambda?
- S3
- SNS
- CloudWatch Events
- CodeCommit
- CodePipeline
How does Lambda work with ALB?
- Expose Lambda function as HTTP(s) endpoint
- Lambda function must be registered in a target group
- Payload converted from HTTP to JSON to be consumed by Lambda
- Response converted from JSON to HTTP to be consumed by ALB
How does Lambda work with ALB having multi-header values?
- ALB support multi-header values which can be enabled from target group attribute settings
- When enabled, HTTP headers and query string params that are sent with multiple values are shown as an array when passed to Lambda
How does Lambda work with CloudWatch Events/EventBridge?
- Using a cron or rate-based EventBridge schedule rule
- Triggered by a schedule for Lambda to perform a task
- Using CodePipeline EventBridge rule
- Triggered by state changes for Lambda to perform a task
How does Lambda work with S3 event notifications?
- S3 event notifications deliver events in seconds but can sometimes take longer
- Enable versioning on the S3 bucket to ensure notifications are sent for every successful write
- Examples:
- S3:ObjectCreated
- S3:ObjectRemoved
- S3:ObjectRestored
- Use cases:
- Generating thumbnails of images uploaded to S3
- Saving file metadata into database
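A sketch of a Lambda handler consuming an S3 event notification; the event shape is the standard S3 notification format, while the processing step is hypothetical:

```python
# Hypothetical Lambda handler for an S3 ObjectCreated notification
def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"new object: s3://{bucket}/{key}")
        # e.g. generate a thumbnail or save file metadata to a database here
```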
How does Lambda use Event Source Mapping?
- Used in:
- Kinesis Data Streams
- DynamoDB Streams
- SQS
- The Event Source Mapping polls records from the source and invokes Lambda with an event batch
- Synchronous invocation
How does Lambda work with streams?
- Relates to Kinesis Data Streams and DynamoDB streams
- The Event Source Mapping creates an iterator for each shard
- Items processed in order at shard-level
- Processed items are not removed from stream
- Other consumers can still read them
- Low traffic stream
- Use batch window to accumulate records before processing
- High traffic stream
- Process multiple batches in parallel at shard-level
- Items processed in order for each partition key
How does Lambda handle errors with streams?
- By default, if function returns an error then entire batch is reprocessed until it succeeds or items in batch expire
- To ensure in-order processing, processing of the affected shard is paused until the error is resolved
- Event Source Mapping can be configured to:
- Discard old events - can be sent to Lambda destination
- Restrict number of retries
- Split batch on error to work around timeout issue
How does Lambda work with queues?
- Relates to SQS
- The Event Source Mapping will poll SQS using long polling
- Recommended to set queue visibility timeout to 6x the Lambda function timeout
- Use Lambda destination (or DLQ) for failures
- If using DLQ, set it up on SQS not on Lambda as DLQ on Lambda is only for asynchronous invocations and this is synchronous
How does Lambda ensure in-order processing of queues?
- Supports in-order processing of FIFO queues
- For standard queues, items are not necessarily processed in order
How does Lambda handle errors with queues?
- Batches are returned to the queue as individual items and might be reprocessed in different grouping to original batch
- Lambda deletes the items from queue after they’re processed successfully
- Source queue can be configured to send failed items to DLQ
How does Lambda handle scalability with streams and queues?
- Streams
- One invocation per stream per shard
- If parallelisation used, up to 10 batches/shard
- SQS standard
- Up to 1000 batches of messages processed simultaneously
- SQS FIFO
- Scales up to number of active message groups
What is the difference between event object vs context object in Lambda?
- Event object
- Contains data around the invoking service
- Lambda runtime converts event to an object
- Examples:
- Input arguments
- Invoking service arguments
- Context object
- Contains metadata about the invocation itself
- Passed to Lambda at runtime
- Examples:
- function_name
- invoked_function_arn
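A small Python handler showing the two objects side by side; the printed fields are standard context attributes:

```python
# Sketch of a handler contrasting the event and context objects
def handler(event, context):
    # Event object: data from the invoking service (shape varies per service)
    print("payload:", event)

    # Context object: metadata about this invocation, provided by the runtime
    print("function:", context.function_name)
    print("arn:", context.invoked_function_arn)
    print("request id:", context.aws_request_id)
    print("ms remaining:", context.get_remaining_time_in_millis())
```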
What are Lambda destinations?
- Async invocations can define destinations to send a result (successful or failed)
- Recommended to use destinations rather than DLQ
- Allow more targets to be destinations whereas DLQ can only be used with SQS/SNS
What are Lambda environment variables?
- Adjusts function behaviour without updating code
- Key/value pairs stored as strings
- Helpful to store secrets encrypted by KMS
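A minimal sketch; TABLE_NAME is a hypothetical variable set in the function configuration:

```python
import os

# Read once at init time; changing the variable requires no code change
TABLE_NAME = os.environ["TABLE_NAME"]

def handler(event, context):
    return {"table": TABLE_NAME}
```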
What are Lambda Execution Roles and provide examples
- IAM role to grant Lambda function permissions to AWS services
- When using Event Source Mapping, Lambda uses the execution role to read event data
- Recommended to create 1 Lambda Execution Role per function
- Examples:
- AWSLambdaBasicExecutionRole
- Upload logs to CloudWatch
- AWSLambdaKinesisExecutionRole
- Read from Kinesis
- AWSLambdaDynamoDBExecutionRole
- Read from DynamoDB streams
How does Lambda handle logging, monitoring and tracing?
- CloudWatch logs
- Lambda execution logs are stored in CloudWatch
- Lambda function requires appropriate execution role to write to CloudWatch
- CloudWatch metrics
- Lambda metrics displayed in CloudWatch metrics
- X-Ray
- Can be enabled in Lambda config (active tracing)
- Runs X-Ray daemon
- Lambda function requires IAM Execution Role
- AWSXRayDaemonWriteAccess
How does Lambda work with CloudFront?
- CloudFront functions
- Lambda@Edge
What is the difference between CloudFront functions vs Lambda@Edge?
- CloudFront functions
- Lightweight functions written in JS
- For high-scale, latency-sensitive CDN customisations
- Modify viewer req/res
- Lambda@Edge
- Functions written in NodeJS or Python
- Modify all req/res from CloudFront incl. origin req/res
How does Lambda work with VPCs by default?
- Lambda is launched outside of VPC
- Unable to access resources within VPC
What are the requirements for Lambda to be deployed within VPC?
- Define VPC ID, subnets and assign security group to Lambda function
- Lambda will create ENI in your subnets (happens in background) and uses that to access resources within VPC
- Ensure appropriate access execution role is setup
- AWSLambdaVPCAccessExecutionRole
- Ensure appropriate access is given to resources to allow access by ENI
How can Lambda gain access to internet when deployed within VPC?
- By default, Lambda will not have internet access (even if deployed in a public subnet or assigned a public IP)
- Lambda is required to be deployed within private subnet and route outbound traffic to a NAT gateway/instance in public subnet
- Use VPC endpoints to privately access AWS services without a NAT
What is the execution context in Lambda?
- Temp runtime environment that initialises any external dependencies used by Lambda code
- Great for db connections, HTTP clients, SDK clients
- Incl. /tmp directory
- Execution context is maintained for some time in anticipation of another invocation
- Context is reused to save time during initialisation
What are the benefits of using Lambda layers?
- Custom runtimes
- eg. C++, Rust
- Externalise dependencies to reuse them
- Enable faster function deployments
- Don’t need to repackage dependencies every time
- Can be reused across Lambda functions
How does Lambda handle concurrency and throttling?
- Concurrency limit: default 1000 concurrent executions per account, per region
- Can request AWS for a higher quota
- Can set a reserved concurrency limit at function-level
- Each invocation over concurrency limit will trigger throttling
- Throttling behaviours:
- Synchronous invocations: ThrottleError 429
- Async invocations: auto retry and then DLQ
What can you do to avoid cold starts in Lambda?
- Cold starts
- First request served by a new instance has higher latency due to initialisation
- Provisioned concurrency
- Concurrency is allocated in advance, before the function is invoked
- Cold starts never happen
- All invocations will have lower latency
- Application Auto Scaling can manage provisioned concurrency (eg. on a schedule or by target utilisation)
What are the benefits of using CodeDeploy with Lambda?
- Help automate traffic shift for Lambda aliases
- Feature is integrated within SAM framework
- Deployment types:
- AllAtOnce
- Linear: grow traffic every x minutes until 100%
- Canary: grow traffic by x percent then 100%
- Can create pre/post hooks to check health of function and roll back if necessary
What are the execution and deployment limitations for Lambda?
- Execution
- Memory: 128MB - 10GB (1MB increments)
- The more RAM added, the more vCPU
- If app is CPU-bound (heavy computations), increase RAM
- Max execution time: 15 mins
- If more than 15 mins, better to use ECS/Fargate/EC2 instead
- Disk capacity in /tmp: 512MB - 10GB
- Concurrent executions: 1000
- Deployment
- Compressed: 50MB
- Uncompressed: 250MB
- Can use /tmp to load larger files at startup
- Environment variables: 4KB
How can we deploy Lambdas with CloudFormation?
- Inline
- Used for simple functions without dependencies
- Add code to Code.ZipFile property in template
- Stored as a zip file in S3
- Refer to S3 zip file location in template
- S3Bucket
- S3Key
- S3ObjectVersion
- If new version uploaded to S3 but template is not updated, CloudFormation will not update the function
What are the requirements to deploy Lambdas with CloudFormation across multiple accounts?
- S3 bucket policy to allow access
- Accounts that are deploying the Lambda function to have an execution role attached
How does Lambda versioning work?
- When working on a Lambda function, it will be tagged as $LATEST
- New version will be created upon publish
- Versions are immutable
- Versions have their own ARN
- Each prior version can still be accessed
What are Lambda aliases?
- Pointers that point to a Lambda function version
- Multiple pointers can be defined to point to different versions
- eg. dev/test/prod
- Aliases are mutable and have their own ARN
- Useful for canary deployments by assigning weights to Lambda function versions
- Provide end user with stable endpoint as each Lambda publish will result in a new version
- Aliases cannot reference aliases, only versions
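A minimal boto3 sketch of weighted alias routing for a canary; the function name, versions and weight are hypothetical:

```python
import boto3

lam = boto3.client("lambda")

# Point the "prod" alias at version 2, but send 10% of traffic to version 3
lam.update_alias(
    FunctionName="my-function",  # hypothetical
    Name="prod",
    FunctionVersion="2",
    RoutingConfig={"AdditionalVersionWeights": {"3": 0.10}},
)
```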
What are step functions?
- Low code, visual workflow service
- Manages failures, retries, service integrations and parallelisations
- Used to orchestrate services, automate business processes and build serverless apps
- Model workflows as state machines
- One state machine per workflow
- Useful for order fulfilment, data processing etc
- JSON format
How can workflows be started in step functions?
- SDK call
- API Gateway
- EventBridge (CloudWatch Events)
What are task states in a step function?
- Does the work in your state machine
- Can be used to invoke AWS service or run an activity task
- Use cases for AWS service
- Invoke Lambda function
- Run batch job
- Launch another step function workflow
- Use cases for activity tasks
- Activities poll step functions for work
- Activities send results back to step functions
How do step functions perform error handling?
- State can encounter runtime errors for various reasons
- Use retry and catch to handle errors in step functions instead of in application code
What is the result path in a step function?
- A path that determines what input is sent to the next state specified in “Next” field
What is the “wait for task token” feature of step functions?
- Allows you to pause step functions during a task until task token is returned
- Used to wait for human approval, other services, 3rd party integration etc
- Enabled by appending .waitForTaskToken to Resource field
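A sketch of the callback side: an external system holds the task token and resumes the workflow with it (the function names here are hypothetical):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Called by the external system (e.g. after human approval) to resume the workflow
def approve(task_token):
    sfn.send_task_success(
        taskToken=task_token,
        output=json.dumps({"approved": True}),
    )

def reject(task_token, reason):
    sfn.send_task_failure(taskToken=task_token, error="Rejected", cause=reason)
```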
What is the difference between standard vs express step function workflow?
- Standard step function workflow
- Execution model: exactly once
- Execution rate: > 2000/s
- Use cases: non-idempotent actions eg. payment processing
- Express step function workflow
- Execution model: at least once (async) or at most once (synchronous)
- Execution rate: > 100,000/s
- Use cases: streaming data, mobile backends
What is AppSync?
- Managed service that uses GraphQL
- Retrieve real-time data with WebSockets
- Local data access and synchronisation with mobile apps
What makes up CloudWatch metrics?
- Metrics
- Variable to monitor eg. CPUUtilization, NetworkIn
- Metrics belong to namespaces
- Metrics have timestamps
- Dimensions
- Attribute of a metric eg. instance ID, environment
What are CloudWatch logs?
- Stores application logs
- Logs can be sent to:
- S3
- Kinesis Data Stream/Kinesis Data Firehose
- Lambda
- OpenSearch
- Can define log expiration dates
- Logs are encrypted by default
- Can setup KMS encryption with your own keys
What are the requirements to obtain CloudWatch logs on EC2 instances?
- By default, no logs from EC2 are sent to CloudWatch
- You need to run a CloudWatch agent on EC2 to push logs to CloudWatch
- CloudWatch log agent can be set up on-premises too
- Ensure correct IAM permissions
What is the difference between CloudTrail vs CloudWatch vs X-Ray?
- CloudTrail
- Audit API calls made by users/services
- Detect unauthorised calls or root cause of changes
- CloudWatch
- CloudWatch Metrics: monitoring
- CloudWatch Logs: storing application logs
- CloudWatch Alarms: sending notifications for unexpected metrics
- X-Ray
- Troubleshoot app performance and errors
- Distributed tracing of microservices
- Useful for latency, errors and fault analysis
What are the use cases for CloudWatch EventBridge?
- Schedule: cron jobs
- ie. schedule Lambda function to run every hour
- Event pattern: event rules to react to a service
- ie. send to SNS with email notification for user log-in events
- Trigger Lambda functions
- Send SQS/SNS messages
What are event buses in EventBridge?
- Event buses receive events from various sources and match them to rules in your account
- Different types of event buses receive events from different sources
- Event bus: event from AWS service
- Partner event bus: events from partner application
- Enabling event discovery on an event bus will generate EventBridge schemas for events on that bus
How can we enable X-Ray?
- Application code must import X-Ray SDK
- SDK will then capture:
- Calls to AWS services
- HTTP/HTTPS requests
- Calls to database
- Calls to SQS
- Install X-Ray Daemon or enable X-Ray integration
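A minimal sketch using the Python X-Ray SDK (aws-xray-sdk), assuming the daemon is running and the package is installed:

```python
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # auto-instruments supported libraries (boto3, requests, ...)

@xray_recorder.capture("process_order")  # custom subsegment around this function
def process_order(order_id):
    ...  # downstream AWS/HTTP calls are traced automatically
```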
What are common troubleshooting steps for X-Ray running on EC2 vs Lambda?
- If X-Ray is not working on EC2 instance:
- Ensure EC2 IAM role has permissions
- Ensure EC2 is running X-Ray Daemon
- If X-Ray is not working on Lambda:
- Ensure Lambda has correct IAM execution role
- Ensure X-Ray imported in the application code
- Enable Lambda X-Ray active tracing
What are the components that make up X-Ray?
- Segments
- Sent by each application/service
- Subsegments
- Granular details on segments
- Represent your application’s view of downstream calls as a client
- ie. calls to AWS services, HTTP API, SQL db
- Service graph
- JSON document containing information about services/resources that make up your application
- Service graph data is retained for 30 days
- Edges
- Connect services that work together to serve requests
- Trace
- Segments collected together to form an e2e trace from a single request
- Sampling algorithm
- Reduce amount of requests sent to X-Ray by determining which requests get traced
- Annotations
- Key/value pairs used to index traces
- Used with filter expressions
- Metadata
- Key/value pairs not used for indexing/searching
List the X-Ray write APIs and their function
- PutTraceSegments: upload segments to X-Ray
- PutTelemetryRecords: used by X-Ray Daemon to send telemetry
- GetSamplingRules: retrieve sampling rules
- GetSamplingTargets: retrieve sampling targets
- GetSamplingStatisticsSummaries: retrieve sampling statistics summaries
List the X-Ray read APIs and their function
- GetServiceGraph: main graph
- BatchGetTraces: retrieve list of traces specified by ID
- GetTraceSummaries: retrieves IDs and annotations for traces available for a specified time period using optional filter
- GetTraceGraph: retrieve service graph for one or more specific trace IDs
What are the requirements to integrate X-Ray with Elastic Beanstalk?
- Ensure correct IAM permissions
- Application code is importing X-Ray SDK
- Elastic Beanstalk platform includes X-Ray Daemon
- Run the Daemon by enabling the option in the console or with config file
- X-Ray Daemon is not provided for multi-container docker
What are the requirements to integrate X-Ray with ECS?
- ECS cluster
- X-Ray container as a Daemon
- Running X-Ray Daemon container on each EC2 instance
- X-Ray container as a “sidecar”
- Running X-Ray Daemon container alongside each application container within EC2 instance
- Fargate cluster
- Can only run X-Ray container as a “sidecar” pattern
What are task definitions in ECS?
- Metadata in JSON format to instruct ECS on how to run a Docker container
- Image name
- IAM role
- Port bindings for container and host
- Memory and CPU
- Environment variables
- Logging config
- Can define up to 10 containers in a task definition
What is the difference between ECS vs EKS vs Fargate vs ECR?
- ECS (Elastic Container Service)
- Container orchestration service to easily scale containerised applications
- Integrated with ECR and Docker
- EKS (Elastic Kubernetes Service)
- Managed Kubernetes service
- Fargate
- Serverless compute engine for containers
- Works with both ECS and EKS
- ECR (Elastic Container Registry)
- Fully managed Docker container registry
- Store and deploy container images
What is the difference between EC2 vs Fargate launch types in ECS?
- EC2 launch type
- Must provision/maintain infra
- Each EC2 instance must run Docker ECS Agent to register in ECS cluster
- AWS takes care of starting/stopping containers
- Fargate launch type
- Serverless
- Do not need to provision/maintain infra
- Need to create task definitions
- AWS takes care of running ECS tasks based on CPU/RAM
- For scalability, just increase/decrease no. of tasks
What are port mappings and how do they get configured in ECS?
- Port mappings expose a container’s port on the host so it can send/receive traffic
- Specified as part of container definition which is configured in task definition
How can ALB find the right port on EC2 instances on EC2 launch type?
- Using dynamic host port mapping if you define only the container port in task definition
- Must allow “any port” on the EC2 instance’s security group from ALB security group
What are task placements in EC2 launch types?
- Contains the task placement strategy and constraints that ECS uses to determine which EC2 instance to place a new task on (or which task to terminate when scaling in)
What process does ECS follow to select container instances in EC2 launch types?
- Identify instances that satisfy CPU, memory and port requirements in task definition
- Identify instances that satisfy task placement constraints
- Identify instances that satisfy task placement strategies
- Select the instance for task placement and place the task there
What are the different task placement strategies in EC2 launch types?
- Binpack
- Place tasks based on least amount of available CPU/memory
- Minimise no. of instances in use to reduce costs
- Random
- Place tasks randomly
- Spread
- Place tasks evenly based on specified value
- ie. spread on AZ or instanceId
What are the task placement constraints in EC2 launch types?
- distinctInstance
- Place each task on a different container instance
- memberOf
- Place tasks on instances that satisfies an expression
What is the difference between EC2 instance profile role vs ECS task role?
- EC2 instance profile role
- Only for EC2 launch type
- Used by the ECS agent to make API calls to the ECS service
- Pull docker image from ECR
- ECS task role
- Available for both launch types
- Allows each task to have specific role
- Task role defined in task definition
How can ECS persist data?
- Using data volumes by mounting EFS onto ECS tasks
- Tasks running in any AZ will share same data in EFS
- Works for both EC2 and Fargate launch types
- S3 cannot be mounted as file system
- Use cases:
- Persistent multi AZ shared storage for containers
What are the different ways for ECS to handle service auto scaling?
- ECS service auto scaling (task level) is not the same as EC2 auto scaling (instance level)
- Use Fargate auto scaling if no EC2 instances involved as it’s much easier to setup
- Strategies:
- Target tracking: scale based on target value for specific CloudWatch metric
- Step scaling: scale based on specified CloudWatch alarm
- Scheduled scaling: scale based on specified date/time
What is the difference between ASG scaling vs ECS cluster capacity provider for EC2 launch type?
- ASG scaling
- Scale ASG based on CPU utilisation
- Add EC2 instances over time
- ECS cluster capacity provider (smarter approach)
- Capacity provider paired with ASG
- Automatically provision and scale infrastructure for ECS tasks
- Add EC2 instances when you’re missing capacity (CPU, RAM etc)
How does ECS handle environment variables?
- Values are fetched and resolved at runtime and injected as environment variables into the ECS task
- Non-sensitive information
- Hard-coded eg. URLs
- Sensitive information
- SSM parameter store eg. API keys
- Secrets manager eg. DB passwords
- Environment files (bulk) can also be loaded from S3
How can ECS share data?
- Data volumes (bind mounts)
- Mount EFS volume
- Share data between multiple containers in same task definition
- Works for both EC2 and Fargate launch types
- EC2 launch type
- Uses EC2 instance storage
- Data tied to lifecycle of EC2 instance
- Fargate launch type
- Uses ephemeral storage
- Data tied to containers using them
- Use cases:
- Share ephemeral storage between containers
- “Sidecar” container pattern where sidecar is used to send metrics/logs to other destinations
How does ECS handle load balancing for Fargate launch type?
- Each ECS task has unique private IP
- ECS ENI security group allow port 80 from ALB
- ALB security group allow 80/443 from web
- Only define container port
- Host port not applicable for Fargate
What is CodeDeploy?
- Service that automates application deployment
- Defined in appspec.yml file
- Deploy new app versions to:
- EC2/on-prem servers
- Lambda functions
- ECS
- Automated rollback on failure or when a CloudWatch alarm is triggered
- Gradual deployment control
How does CodeDeploy work with EC2/on-prem platform?
- Perform in-place deployments or blue/green deployments
- Must run CodeDeploy Agent on target instances
- Installed as a prerequisite or via Systems Manager
- EC2 instances must have appropriate permissions to access S3 for deployment bundles
- Define deployment speed:
- AllAtOnce: most downtime
- HalfAtATime: reduced capacity by 50%
- OneAtATime: slowest, but lowest impact
- Custom