Messaging Flashcards
List some features of Amazon SQS
Unlimited Throughput
Unlimited number of messages in the queue
default message retention of 4 days, up to a maximum of 14 days
low latency (< 10 ms on publish and receive)
limit of 256 KB per message sent
can have duplicate messages
can have out-of-order messages
at least once delivery
How does message consumption work?
Consumers are compute entities (EC2 instances, on-premises servers, Lambda functions) that poll messages, up to 10 at a time. After processing a message, the consumer deletes it using the DeleteMessage API, as in the sketch below.
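A minimal consumer sketch using Python's boto3; the queue URL and the process() helper are placeholders:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Poll up to 10 messages in a single call
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)

for msg in resp.get("Messages", []):
    process(msg["Body"])  # hypothetical processing function
    # Delete the message only after it has been processed successfully
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```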
SQS with ASG (Auto Scaling Group) -
Which metric triggers the scaling?
Queue Length - the ApproximateNumberOfMessagesVisible CloudWatch metric
A CloudWatch alarm on this metric is set up; once the threshold is breached, it triggers scaling of the consumers (a sketch follows)
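A hedged sketch of such an alarm with boto3; the queue name, threshold, and scaling-policy ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on queue depth; fires the ASG scale-out policy when breached
cloudwatch.put_metric_alarm(
    AlarmName="sqs-queue-depth-high",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],  # placeholder
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1000,  # illustrative threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:autoscaling:..."],  # placeholder scale-out policy ARN
)
```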
SQS and Security
Encryption:
In-flight encryption using HTTPS API
At-rest encryption using KMS Key
Access Controls:
IAM policies to regulate access to the SQS API
SQS Access Policies (similar to S3 bucket policies):
useful for cross-account access
useful for other services (SNS, S3, …) to write to SQS queues
How do Access Policies work with SQS?
Cross-account access:
a consumer in another account polls messages
specify a policy on the queue in the source account that grants access to the consumer in the target account
the policy has to specify the account ID of the target account, the actions to allow, and the ARN of the SQS queue
Let S3 write messages to an SQS queue:
specify in the policy the SQS queue ARN, the action to allow, and, in the condition section, the ARN of the S3 bucket as well as the S3 account ID (a sketch of such a policy follows)
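A sketch of the S3-to-SQS case using boto3; all ARNs and account IDs are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/111122223333/my-queue"  # placeholder

# Allow S3 (from one specific bucket and account) to send messages to the queue
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": "arn:aws:sqs:eu-west-1:111122223333:my-queue",
        "Condition": {
            "ArnLike": {"aws:SourceArn": "arn:aws:s3:::my-bucket"},
            "StringEquals": {"aws:SourceAccount": "111122223333"},
        },
    }],
}
sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes={"Policy": json.dumps(policy)})
```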
What is the Message Visibility Timeout in SQS?
After a message is polled by a consumer it is invisible to other consumers for 30 seconds (default)
If message is not deleted during the timeout it returns to the queue
Hence, if a message is not processed and deleted within the timeout, it will be processed twice
The consumer can call the ChangeMessageVisibility API to extend the timeout for a message it is still processing (sketched below)
too high a timeout (hours) => reprocessing after a consumer failure is slow
too low a timeout (seconds) => messages are processed multiple times
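A sketch of extending the timeout with boto3 (placeholder queue URL):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    # Processing will take longer than the default 30 s, so extend the timeout
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=msg["ReceiptHandle"],
        VisibilityTimeout=120,  # seconds, illustrative value
    )
```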
What are Dead Letter Queues in SQS?
A threshold can be set on how many times a message can be returned to the queue after the visibility timeout expires
After this MaximumReceives threshold is exceeded, the message is moved into the dead letter queue (configured via a redrive policy, sketched below)
useful for debugging
make sure to process the messages before they expire
set retention to 14 days
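A sketch of attaching a redrive policy with boto3; the queue URL and DLQ ARN are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder
DLQ_ARN = "arn:aws:sqs:eu-west-1:123456789012:my-dlq"  # DLQ must already exist

sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": DLQ_ARN,
            "maxReceiveCount": "3",  # the MaximumReceives threshold
        })
    },
)
```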
What is a delay queue in SQS?
Delay a message by up to 15 minutes before consumers can see it
Default is 0 seconds
Can be set at queue level
Can override the default per message on send using the DelaySeconds parameter (sketched below)
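A per-message delay sketch with boto3 (placeholder queue URL):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Override the queue's default delay for this one message (max 900 s = 15 min)
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody="hello", DelaySeconds=300)
```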
What is Long Polling in SQS?
When a consumer requests messages from a queue, it can optionally wait until a message arrives
Reduces the number of API calls made to the SQS queue while decreasing latency and increasing the efficiency of the application
Wait times between 1 sec and 20 sec
Long polling is preferable to short polling and is enabled per request using WaitTimeSeconds (sketched below)
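A long-polling sketch with boto3 (placeholder queue URL):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Long poll: wait up to 20 s for messages instead of returning immediately
resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)
```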
What is SQS Extended Client?
A Java library to send messages larger than the 256 KB limit, up to 1 GB
A small metadata message is sent to the queue that contains a pointer to the S3 object holding the large payload
Hence, the large message itself is written to and retrieved from S3 (the pattern is sketched below)
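The actual library is Java-only; this is a rough Python/boto3 sketch of the same pointer pattern, with a placeholder bucket and queue URL:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "my-large-payload-bucket"  # placeholder
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

def send_large(payload: bytes) -> None:
    # Store the large payload in S3 ...
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    # ... and send only a small pointer message through SQS
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"s3Bucket": BUCKET, "s3Key": key}),
    )
```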
Useful SQS API calls
CreateQueue (MessageRetentionPeriod), DeleteQueue
PurgeQueue: delete all messages in queue
SendMessage (DelaySeconds), ReceiveMessage, DeleteMessage
MaxNumberOfMessages: default 1, max 10 (for the ReceiveMessage API)
ReceiveMessageWaitTimeSeconds: Long Polling
ChangeMessageVisibility: change the message visibility timeout
Batch APIs for SendMessage, DeleteMessage, and ChangeMessageVisibility help to reduce costs (a batch send is sketched below)
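A batch-send sketch with boto3 (placeholder queue URL):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

# One API call for up to 10 messages instead of 10 separate SendMessage calls
sqs.send_message_batch(
    QueueUrl=QUEUE_URL,
    Entries=[
        {"Id": str(i), "MessageBody": f"message-{i}"}
        for i in range(10)
    ],
)
```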
What about the FIFO queue in SQS?
Messages are ordered (First In, First Out), unlike in a standard queue
Limited throughput: 300 msg/s without batching, 3,000 msg/s with batching
Exactly-once send capability (duplicates are removed) - a feature of FIFO queues
Messages are processed in order by the consumer
What is Deduplication in a FIFO queue?
If the same message arrives twice within 5 minutes, the second one is rejected
Two methods:
Content-based deduplication using a SHA-256 hash of the message body
Explicitly provide a Message Deduplication ID
What is Message Grouping in a FIFO queue?
If the same value for MessageGroupId is provided, then all of those messages will be processed by the same consumer in the order they were sent
To get a level of ordering for a subset of messages, use different values for MessageGroupId
Each group ID can have a different consumer
Ordering across groups is not guaranteed
Often the total ordering of all messages is not needed, only per-group ordering (a FIFO send is sketched below)
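A FIFO send sketch with boto3; queue URL, group ID, and deduplication ID are placeholders:

```python
import boto3

sqs = boto3.client("sqs")
FIFO_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue.fifo"  # placeholder

# Messages sharing a MessageGroupId are delivered in order to one consumer;
# the MessageDeduplicationId suppresses duplicates within the 5-minute window
sqs.send_message(
    QueueUrl=FIFO_URL,
    MessageBody="order-created",
    MessageGroupId="customer-42",
    MessageDeduplicationId="order-1001",
)
```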
What is AWS Simple Notification Service (SNS)?
Want to send one message to many receivers
uses Pub/Sub
one topic posts to many subscribers
event producer sends messages to only one SNS topic
as many receivers as we want will listen to the SNS topic notifications
each subscriber to the topic will get all the messages (note: a newer feature allows filtering messages)
up to 10 million subscriptions per topic
100,000 topics limit
Subscribers can be:
Lambda, SQS, Emails, HTTP/S, SMS messages, Mobile Notifications
How is SNS integrated into AWS?
Many AWS services can send data directly to SNS
CloudWatch Alarms
AutoScaling Actions
Amazon S3
How to publish a message with SNS?
Topic Publish:
Create a topic
Create at least one subscription
Publish to the topic (sketched after this card)
Direct Publish:
Create a platform app
Create a platform endpoint
Publish to the platform endpoint
Works with Google GCM, Apple APNS, Amazon ADM
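A topic-publish sketch with boto3; the topic name and email address are placeholders:

```python
import boto3

sns = boto3.client("sns")

# Create a topic, add one subscriber, publish
topic_arn = sns.create_topic(Name="my-topic")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="me@example.com")
sns.publish(TopicArn=topic_arn, Subject="hello", Message="first notification")
```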
Security with SNS
Encryption:
In-flight encryption with HTTPS API
At-rest encryption using KMS keys
Client-side encryption if the client wants to perform encryption/decryption itself
Access Controls:
IAM policies to regulate access to the SNS API
SNS Access Policies:
Useful for cross-account access
Useful for other services to write to an SNS topic
What is Fan Out in SNS and SQS?
Push once to SNS, receive in all SQS queues that are subscribers
Fully decoupled, no data loss
SQS allows for: data persistence, delayed processing, and retries of work
Ability to add more SQS subs over time
Make sure your SQS queue access policy allows SNS to write (a subscription call is sketched below)
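A fan-out subscription sketch with boto3 (placeholder ARNs):

```python
import boto3

sns = boto3.client("sns")

# Subscribe an existing SQS queue to the topic; the queue's access policy
# must allow sns.amazonaws.com to SendMessage
sns.subscribe(
    TopicArn="arn:aws:sns:eu-west-1:123456789012:my-topic",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:eu-west-1:123456789012:my-queue",
    Attributes={"RawMessageDelivery": "true"},  # deliver the message body as-is
)
```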
What is AWS Kinesis?
Makes it easy to collect, process, and analyze streaming data in real time
4 Kinesis Services:
Kinesis Data Streams: capture, process, and store data streams
Kinesis Data Firehose: load data streams into AWS data stores
Kinesis Data analytics: analyze data streams with Apache Flink or SQL
Kinesis Video Streams: not in the exam
How does Kinesis Data Streams work?
Producers (apps, clients, SDKs, the Kinesis Agent) write records (partition key, sequence number, data blob) to the shards of a Kinesis Data Stream at up to 1 MB/sec or 1,000 msg/sec per shard
The shards deliver records (partition key, sequence number, data blob) to consumers (apps, Lambda, Kinesis Data Firehose, Kinesis Data Analytics) at either 2 MB/sec per shard shared across all consumers (classic) or 2 MB/sec per shard per consumer (enhanced)
Some more info on Kinesis Data Stream
Billed per shard
Can have as many shards as desired
Retention of data between 1 day (default) and 365 days
Ability to reprocess/replay data
Once data is inserted into Kinesis, it can't be deleted (immutability)
data in the same partition goes to the same shard (ordering)
Producers: AWS SDK, Kinesis Producer Library (KPL), Kinesis Agent
Consumers:
my own: Kinesis Client Library, SDK
managed: Lambda, Kinesis Firehose, Kinesis Analytics
Kinesis Data Security
Control Access/Authorization using IAM Policies
Encryption in flight with HTTPS endpoints
Encryption at rest using KMS
Encryption on client side can be implemented (harder)
VPC endpoints available to access over VPC network
Monitor API calls using CloudTrail
What do Kinesis Producers do?
Put data records into data streams
A data record consists of: partition key, sequence number, data blob (up to 1 MB)
Producers: AWS SDK; Kinesis Producer Library (KPL), a C++/Java library with batching, compression, and retries; Kinesis Agent (monitors log files)
Write throughput: 1 MB/sec or 1,000 records/sec per shard
PutRecord API
Use batching with PutRecords API to reduce costs and increase throughput
ProvisionedThroughputExceeded: too many records sent in too short a time to one shard; choose the partition key wisely, retry with exponential backoff, or increase the number of shards (producer calls are sketched below)
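Producer sketches with boto3; stream name, data, and partition keys are placeholders:

```python
import boto3

kinesis = boto3.client("kinesis")

# Single record: records with the same partition key land on the same shard
kinesis.put_record(
    StreamName="my-stream",  # placeholder
    Data=b'{"temperature": 21.5}',
    PartitionKey="sensor-1",
)

# Batching with PutRecords reduces cost and increases throughput
kinesis.put_records(
    StreamName="my-stream",
    Records=[
        {"Data": b'{"temperature": 22.0}', "PartitionKey": "sensor-1"},
        {"Data": b'{"temperature": 19.3}', "PartitionKey": "sensor-2"},
    ],
)
```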
What do Kinesis Consumers do?
Get data records from data streams and process them
AWS Lambda, Kinesis Data Analytics, Kinesis Data Firehose
Custom consumer (AWS SDK) - classic or enhanced fan-out
Kinesis Client Library - library to simplify reading from data stream
Kinesis Consumer Types
Shared Fan-out consumer - pull
low number of consuming applications
Read throughput: 2 mb/sec per shard across all consumers
Max: 5 GetRecords API calls per sec per shard
Latency around 200 ms
cheap
Consumers poll data from shards using the GetRecords API call (see the sketch after this card)
Returns up to 10 MB (then throttles for 5 sec) or up to 10,000 records
Enhanced fan-out consumer - push
2 mb/sec per consumer/shard
latency around 70 ms
more expensive
Kinesis pushes data to consumers over HTTP/2 - SubscribeToShard API
Soft limit of 5 consumer applications (KCL) per data stream (default)
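A shared fan-out consumer sketch with boto3, assuming a single-shard stream with a placeholder name:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # placeholder

# Read the first shard from the beginning (classic, shared fan-out)
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(0.2)  # stay under the 5 GetRecords calls/sec/shard limit
```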
Lambda as a Kinesis Consumer
Supports classic and enhanced fan-out consumers
read records in batches
can configure batch size and batch window
uses the GetRecords API call under the hood
on error, Lambda retries until the data expires
can process up to 10 batches per shard simultaneously (an event source mapping is sketched below)
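A sketch of wiring Lambda to a stream via an event source mapping in boto3; the stream ARN and function name are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Lambda then polls the shards and invokes the function with record batches
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:eu-west-1:123456789012:stream/my-stream",
    FunctionName="my-consumer-fn",
    StartingPosition="LATEST",
    BatchSize=100,                     # batch size
    MaximumBatchingWindowInSeconds=5,  # batch window
)
```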
What is the Kinesis Client Library?
A java library that helps distributed apps with the read workload from a Kinesis Data Stream
At most one KCL instance per shard; fewer is okay
Progress is checkpointed to DynamoDB (requires IAM permissions)
Workers track each other and share the work among shards using DynamoDB
KCL can run on EC2, Elastic Beanstalk, and on-premises
Records are read in order at the shard level
Versions: KCL 1.x supports only shared consumers; KCL 2.x also supports enhanced fan-out
Kinesis Shard Splitting
Used to increase stream capacity (e.g. make 2 shards out of 1; each shard adds 1 MB/sec of write capacity)
used to divide a hot shard
old shard is closed and will be deleted once data expires
no automatic scaling
can't split into more than 2 shards in one operation
Kinesis Shard Merging
Decrease stream capacity and reduce costs
can be used to group two shards with low traffic (cold shards)
old shards are closed and will be deleted once data expires
can't merge more than two shards in one operation (both operations are sketched below)
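Sketches of both resharding operations with boto3; stream name, shard IDs, and the hash key are placeholders:

```python
import boto3

kinesis = boto3.client("kinesis")

# Split a hot shard in two at a chosen hash key
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit="shardId-000000000000",
    NewStartingHashKey="170141183460469231731687303715884105728",  # 2**127, midpoint of the hash range
)

# Merge two adjacent cold shards back into one
kinesis.merge_shards(
    StreamName="my-stream",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```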
Kinesis Data Firehose
Can take data from a Kinesis Data Stream, but also from all producers, CloudWatch, ...
writes data to Destinations:
S3, Redshift (copy through S3), Amazon ElasticSearch
or 3rd-party destinations (e.g. MongoDB), or a custom API endpoint
fully managed service, serverless, autoscaling, no administration
pay for data going through firehose
near real time, not real time
support custom data transformation with lambda
can send failed or all data to a backup S3 bucket (a PutRecord call is sketched below)
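A direct-put sketch with boto3 (placeholder delivery stream name):

```python
import boto3

firehose = boto3.client("firehose")

# Send a record to a delivery stream; Firehose buffers it and writes it
# to the configured destination in near real time
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": b'{"event": "click"}\n'},
)
```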
Kinesis Data Analytics
Gets data from a Kinesis Data Stream or Data Firehose
enables applying the SQL language to data from these sources
Data Analytics can then send this data to a Stream or Firehose, and thus to all endpoints that these two offer
real-time analytics for stream data using SQL
fully managed, no servers to provision
automatic scaling
real-time analytics
pay for actual consumption rate
can create streams out of the real-time queries
use cases: time series analytics, real-time dashboards, real-time metrics
Ordering Data in Kinesis and SQS FIFO