Kinesis Flashcards
What is the default shard limit per Kinesis stream?
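The shard quota is counted per AWS account per Region, not per stream. The commonly quoted default is 500 shards in the largest Regions; it varies by Region and can be raised with a quota increase request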
What is a Kinesis shard?
A shard is a uniquely identified sequence of data records in a stream. Each record consists of a partition key, a sequence number and a data payload
What are the read limits of a Kinesis shard?
5 read transactions/sec and 2 MB/sec per shard, shared by all standard consumers
What are the write limits of a Kinesis shard?
1,000 records/sec and 1 MB/sec per shard
What is the size limit of a data payload in KDS?
1MB
How do you scale a KDS?
You add or subtract shards in a process called resharding.
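As a rough sizing aid before resharding, the per-shard limits from the cards above can be turned into a shard-count estimate. This is a minimal sketch: the function name `shards_needed` and its parameters are made up for illustration, and it ignores hot partition keys and enhanced fan-out.

```python
import math

# Per-shard limits quoted in the cards above.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, write_records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(write_records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
    )


if __name__ == "__main__":
    # Hypothetical workload: 4.5 MB/s in, 2,000 records/s in, 6 MB/s out.
    print(shards_needed(4.5, 2000, 6))
```

The estimate takes the maximum rather than the sum because each shard provides all three capacities at once.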
By default how long is data retained in Kinesis Data Streams?
24 hours
What is the minimum and maximum retention period of data in Kinesis Data Streams?
24 hours min, 365 days maximum
What is a partition key in KDS?
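Attribute that determines which shard a record is sent to: Kinesis takes the MD5 hash of the key and maps it to a shard's hash key range. Records with the same partition key land on the same shard, so they are handled by the same record processor.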
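The routing can be sketched in a few lines. Kinesis hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The helper names below (`hash_key`, `even_ranges`, `shard_for`) are hypothetical, and the evenly split ranges are an assumption; a real stream's ranges come from its shard list.

```python
import hashlib

MAX_HASH = 2**128 - 1  # MD5 yields a 128-bit value


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count):
    """Assumption: the hash space is split evenly across shards."""
    size = 2**128 // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        end = MAX_HASH if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the range that contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash key ranges do not cover the full hash space")


if __name__ == "__main__":
    ranges = even_ranges(4)
    for key in ("user-1", "user-2", "user-1"):
        print(key, "-> shard", shard_for(key, ranges))
```

Because the mapping depends only on the key, a single very busy key cannot be spread across shards; that is what makes a shard "hot".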
Why use Kinesis Firehose over KDS?
Firehose scales automatically, is fully managed and integrates directly with AWS services, but it is only near real time, holds undelivered data for at most 24 hours and cannot replay records
KDS is real time and low latency, suits custom applications, stores data and can replay records, but scaling (resharding) requires custom work
What are the 4 main benefits of using KCL?
The Kinesis Client Library (KCL):
1. Integrates with KPL automatically to de-aggregate records
2. Checkpoints processed records for you
3. Rebalances shard leases across workers when the worker or shard count changes
4. Sends custom metrics to CloudWatch automatically
What languages does KCL support?
KCL is written in Java but allows you to use other runtimes like Python via MultiLangDaemon
What is a record processor in KCL?
The class that holds your record-processing logic. A worker instantiates one record processor for each shard it holds a lease on
How many workers are there in KCL?
There is 1 worker per KCL application instance, with 1 or more application instances running in a distributed fashion
How would you resolve issues with throttling on a shard with multiple consumers?
The standard 2 MB/sec read limit is per shard and shared by every consumer of that shard. Enabling enhanced fan-out gives each registered consumer its own dedicated 2 MB/sec per shard instead of a share of the common limit.
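The difference is simple arithmetic, sketched below. The function name `per_consumer_read_mb` is hypothetical, and the model assumes consumers share the standard limit evenly, which real polling consumers will only approximate.

```python
SHARD_READ_MB_PER_SEC = 2.0  # per-shard read limit from the cards above


def per_consumer_read_mb(consumers, enhanced_fan_out):
    """Approximate read throughput available to each consumer of one shard."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        # Each registered consumer gets a dedicated pipe.
        return SHARD_READ_MB_PER_SEC
    # Standard consumers split one shared limit (assumed evenly here).
    return SHARD_READ_MB_PER_SEC / consumers


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, per_consumer_read_mb(n, False), per_consumer_read_mb(n, True))
```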
Is KPL synchronous or async?
The KPL API is asynchronous (addUserRecord returns a future). You can block on the future for synchronous behavior, but async is the default and recommended
How many consumers can there be of a shard?
Multiple consumers can read from a shard
What are the 4 benefits of KPL?
- Increases performance by aggregating small records
- Provides automatic retry logic if there is record failure
- Handles multi-threading, batching, aggregation
- Sends metrics to CloudWatch automatically
What is a downside of KPL?
There can be some extra processing delay due to the wrapper code, up to the RecordMaxBufferedTime
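The delay comes from buffering: records wait so they can be batched, but never longer than a configured maximum. The sketch below imitates that trade-off only; `BufferingProducer` and its parameters are hypothetical, not the KPL's actual implementation, and the clock is injectable so the behavior can be checked without sleeping.

```python
import time


class BufferingProducer:
    """Toy producer that batches records until a size or age limit is hit."""

    def __init__(self, max_buffered_seconds, max_batch_bytes, send,
                 clock=time.monotonic):
        self._max_buffered = max_buffered_seconds
        self._max_bytes = max_batch_bytes
        self._send = send          # callable that receives a list of records
        self._clock = clock
        self._buffer = []
        self._size = 0
        self._oldest = None        # arrival time of the oldest buffered record

    def put(self, data):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(data)
        self._size += len(data)
        self._maybe_flush()

    def tick(self):
        """Call periodically so old records are flushed without new input."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._buffer:
            return
        too_old = self._clock() - self._oldest >= self._max_buffered
        too_big = self._size >= self._max_bytes
        if too_old or too_big:
            self._send(list(self._buffer))
            self._buffer.clear()
            self._size = 0


if __name__ == "__main__":
    producer = BufferingProducer(0.1, 1024, print)
    producer.put(b"small record")
    time.sleep(0.1)
    producer.tick()  # the record is sent only after the buffering delay
```

A larger maximum buffering time gives bigger batches and better throughput at the cost of latency, which is why latency-sensitive producers sometimes call the PutRecord API directly instead.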
What happens in Firehose if a data producer sends more data than Firehose is able to deliver to S3?
Firehose raises the buffer size dynamically so that delivery can catch up with the incoming data
Does Firehose support KPL de-aggregation from a KDS?
Yes, de-aggregates before delivering to a destination or before Lambda pre-processing
Which Kinesis option supports native S3 Backup integration?
Firehose supports S3 backup of original source data as well as failed data (processing or delivery failure)
What is a common Firehose task?
Converting record formats from JSON to Parquet or ORC, then storing in S3
Can also have a Firehose Lambda to transform source data into JSON first e.g. CSV into JSON
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
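A transformation Lambda for the CSV-to-JSON case might look like the sketch below. Firehose passes records as base64 strings and expects each one back with the same `recordId`, a `result` and base64 `data`. The column names in `COLUMNS` are purely hypothetical, and the handler assumes one CSV line per record.

```python
import base64
import csv
import io
import json

COLUMNS = ["id", "name", "value"]  # hypothetical CSV schema


def lambda_handler(event, context):
    """Convert each single-line CSV record into a JSON line."""
    output = []
    for record in event["records"]:
        try:
            line = base64.b64decode(record["data"]).decode("utf-8")
            row = next(csv.reader(io.StringIO(line)))
            if len(row) != len(COLUMNS):
                raise ValueError("unexpected column count")
            payload = json.dumps(dict(zip(COLUMNS, row))) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("ascii"),
            })
        except (ValueError, StopIteration):
            # Hand the original data back so Firehose can route it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


if __name__ == "__main__":
    sample = {"records": [{
        "recordId": "1",
        "data": base64.b64encode(b"1,widget,9.5\n").decode("ascii"),
    }]}
    print(lambda_handler(sample, None))
```

Records marked `ProcessingFailed` are the ones that end up in the S3 backup location for failed data mentioned in the earlier card.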
What are possible data sources for Firehose?
- KDS
- Kinesis Agent
- AWS SDK
- CloudWatch (Logs and Events)
- AWS IoT
What are the possible output destinations for KDA?
- KDS
- Kinesis Firehose
- Lambda
What are the 2 interfaces to write KDA apps?
SQL Interface or Apache Flink (Java) Interface
Can you write KDA to multiple outputs? What is the limit?
Yes you can write to multiple destinations, up to 3
What are the 3 windowed query types for KDA?
- Stagger: time-based windows that open when the first matching event arrives, which reduces problems with late or out-of-order data
- Tumbling: non-overlapping windows that open and close at regular intervals
- Sliding: overlapping windows over a fixed time or row-count interval, aggregated continuously
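The contrast between tumbling and sliding windows is easier to see on a small list of event timestamps. This is a plain-Python illustration, not KDA SQL; the function names are hypothetical and the sliding window is taken to cover the interval (t - size, t] ending at each event.

```python
from collections import Counter


def tumbling_counts(timestamps, window_seconds):
    """Count events per fixed, non-overlapping window keyed by window start."""
    counts = Counter(
        int(ts // window_seconds) * window_seconds for ts in timestamps
    )
    return dict(sorted(counts.items()))


def sliding_counts(timestamps, window_seconds):
    """For each event, count events in the window that ends at that event."""
    ordered = sorted(timestamps)
    result = []
    start = 0
    for i, ts in enumerate(ordered):
        # Drop events that fell out of the window (ts - size, ts].
        while ordered[start] <= ts - window_seconds:
            start += 1
        result.append((ts, i - start + 1))
    return result


if __name__ == "__main__":
    events = [1, 4, 9, 11, 12, 21]
    print(tumbling_counts(events, 10))
    print(sliding_counts(events, 10))
```

Each event falls into exactly one tumbling window, while the same event can contribute to several sliding results.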
Why use MSK over Kinesis?
- MSK has unlimited retention period
- MSK allows larger messages: the 1 MB default can be raised through broker configuration (message.max.bytes), whereas a Kinesis record is capped at 1 MB
What are possible data sources for KDA Flink vs. SQL?
- Flink: KDS, MSK
- SQL: KDS, Firehose
What are the downsides of MSK?
- Cluster provisioning model
- 3rd party tooling not integrated with AWS natively
- Scaling is not seamless to clients
What is the Firehose buffer size min/max for S3 and ES?
- S3 is 1MB to 128 MB
- ES is 1MB to 100 MB
What is the Firehose buffer interval?
60 to 900 seconds
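The buffer size and buffer interval work together: Firehose delivers a batch when either threshold is reached, whichever comes first. A minimal sketch of that rule for an S3 destination follows. The function name is hypothetical, the range checks use the limits from the cards above, and the defaults of 5 MB and 300 seconds are an assumption about typical settings.

```python
S3_BUFFER_MB_RANGE = (1, 128)      # from the buffer size card
BUFFER_INTERVAL_RANGE = (60, 900)  # from the buffer interval card


def should_flush(buffered_mb, seconds_since_last_flush,
                 buffer_size_mb=5, buffer_interval_s=300):
    """Return True when either buffering hint has been reached."""
    low, high = S3_BUFFER_MB_RANGE
    if not low <= buffer_size_mb <= high:
        raise ValueError("buffer size must be between 1 and 128 MB for S3")
    low, high = BUFFER_INTERVAL_RANGE
    if not low <= buffer_interval_s <= high:
        raise ValueError("buffer interval must be between 60 and 900 seconds")
    return (buffered_mb >= buffer_size_mb
            or seconds_since_last_flush >= buffer_interval_s)


if __name__ == "__main__":
    print(should_flush(5, 10))    # size threshold reached
    print(should_flush(1, 10))    # neither reached
    print(should_flush(1, 300))   # interval reached
```

On a low-volume stream the interval is what triggers delivery, which is why Firehose is described as near real time rather than real time.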
What is the payload limit for MSK?
- 1 MB by default, configurable higher through the broker setting message.max.bytes
What is the payload limit for Firehose?
- 1,000 KiB (roughly 1 MB) per record, before base64 encoding
What is the process of resharding in KDS?
- Merge Shards
- Split Shards
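Both operations act on hash key ranges: a split divides one shard's range at a new starting hash key, and a merge joins two shards whose ranges are adjacent. The sketch below models only the range arithmetic. The function names are hypothetical and no AWS API is called.

```python
def split_range(start, end, new_starting_hash_key=None):
    """Split (start, end) into two child ranges; defaults to the midpoint."""
    if new_starting_hash_key is None:
        new_starting_hash_key = start + (end - start + 1) // 2
    if not start < new_starting_hash_key <= end:
        raise ValueError("new starting hash key must fall inside the range")
    return (start, new_starting_hash_key - 1), (new_starting_hash_key, end)


def merge_ranges(left, right):
    """Merge two ranges; they must be adjacent, as merging shards requires."""
    if left[1] + 1 != right[0]:
        raise ValueError("ranges are not adjacent")
    return (left[0], right[1])


if __name__ == "__main__":
    whole = (0, 2**128 - 1)
    left, right = split_range(*whole)
    print(left, right)
    print(merge_ranges(left, right) == whole)
```

Splitting a hot shard adds capacity for its keys; merging two cold shards reduces cost.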
What are the 5 destinations KDS can write to?
- Lambda
- Kinesis Firehose
- Kinesis Data Analytics
- KCL
- Glue Streaming
What are the 3 data sources for KDS?
- KPL
- Kinesis Agent
- PUT to Kinesis API (SDK)
What are the 5 destinations for Kinesis Firehose?
- Redshift
- S3
- OpenSearch (formerly Elasticsearch, ES)
- HTTP endpoint
- Vendor integrations (third-party service providers)
What are the 2 data sources for KDA?
- KDS
- Kinesis Firehose
How would you get realtime events from CloudWatch? What are the valid destinations?
- Use CloudWatch Logs with subscription filters
- Destinations are Lambda, KDS, Firehose
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
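When the subscription filter targets Lambda, the log batch arrives base64-encoded and gzip-compressed under `awslogs.data`. A minimal decoding sketch follows. The function name is hypothetical and the sample payload is made up for the demonstration.

```python
import base64
import gzip
import json


def decode_awslogs_event(event):
    """Return the decoded CloudWatch Logs payload from a Lambda event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


if __name__ == "__main__":
    # Build a fake event the same way the service encodes a real one.
    payload = {"logGroup": "demo", "logEvents": [{"message": "hello"}]}
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    event = {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}
    print(decode_awslogs_event(event))
```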
What streaming service should you use if you have reference data in S3 that needs to be joined/merged?
Kinesis Data Analytics