Section 19: AWS Integration & Messaging: SQS, SNS & Kinesis: Kinesis Flashcards
What does AWS Kinesis do?
Collect, process, and analyze real-time video and data streams
What kind of real-time data is Kinesis well suited to ingest?
logs, metrics, website clickstreams, IoT telemetry data
There are four types of Kinesis, what are they? (names only)
- Kinesis Data Streams
- Kinesis Data Firehose
- Kinesis Data Analytics
- Kinesis Video Streams
Which Kinesis type is best to: capture, process, and store data streams?
Your options are: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Kinesis Video Streams.
Kinesis Data Streams
Which Kinesis type is best to: load data streams into AWS data stores?
Your options are: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Kinesis Video Streams.
Kinesis Data Firehose
Which Kinesis option is best suited to: analyze data streams with SQL or Apache Flink?
Your options are: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Kinesis Video Streams.
Kinesis Data Analytics
Which Kinesis option is best to: capture, process, and store video streams?
Your options are: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Kinesis Video Streams.
Kinesis Video Streams.
What is a shard, in terms of data?
A part of a dataset, when that dataset has been partitioned.
Can applications, clients, the AWS SDK, the Kinesis Producer Library (KPL), and the Kinesis Agent all be Kinesis Data Streams producers?
Yes. Several producers can write to the same stream at the same time.
Is a Kinesis data stream made up of shards?
Yes. Again, as it pertains to data, a shard is one part of a partitioned dataset.
Can you scale up the number of shards in a Kinesis data stream?
Yes
- What is the max size of a record that can be sent to a Kinesis Data Stream?
- What is the MB/sec throughput at which records can be sent to a Kinesis Data Streams shard?
- What is the alternative records-per-second throughput at which records can be sent to a Kinesis Data Streams shard?
- 1 MB
- 1 MB/sec
- 1,000 records/sec
When sending a record from a producer to a Kinesis stream shard, what does the record consist of?
- A partition key (must be specified when putting records into the stream; it is used to group data by shard within the stream)
- A data blob (up to 1 MB)
- The sequence number is not sent by the producer. Kinesis assigns it when the record is written, and it is unique per partition key within the shard.
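A producer-side record can be sketched as the arguments of a PutRecord-style request. This is a minimal illustration, not the SDK itself: `build_put_record_args` and the stream and key names are hypothetical, and the 1 MB limit is taken from the card above (treated here as 1024 * 1024 bytes, which is an assumption).

```python
MAX_BLOB_BYTES = 1024 * 1024  # 1 MB data blob limit from the card (assumed to mean 1 MiB)


def build_put_record_args(stream_name: str, partition_key: str, payload: bytes) -> dict:
    """Return keyword arguments shaped like a PutRecord request (hypothetical helper)."""
    if not partition_key:
        raise ValueError("a partition key is required for every record")
    if len(payload) > MAX_BLOB_BYTES:
        raise ValueError("data blob exceeds 1 MB")
    return {"StreamName": stream_name, "PartitionKey": partition_key, "Data": payload}


# Example: one IoT reading, keyed by device so that device's readings share a shard.
args = build_put_record_args("example-stream", "device-42", b'{"temp": 21.5}')
```

With boto3 installed and credentials configured, a dict like this could be passed along the lines of `client.put_record(**args)`.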
Can all of the following be consumers of Kinesis Data Stream records?
* AWS Lambda
* Kinesis Data Firehose
* Kinesis Data Analytics
* Custom consumer (AWS SDK) - Classic or Enhanced Fan-Out
* Kinesis Client Library (KCL) - library to simplify reading from a data stream
* Applications running on EC2 (shown on one of the slides, but not on another)
Yes
What are the three things that are in each record being sent from a Kinesis Data Stream stream-shard to a Kinesis Data Stream consumer?
- Partition key
- Sequence number
- data blob
What are the two rates at which records can be sent from a Kinesis Data Streams shard to a consumer?
- 2 MB/sec per shard, shared across all consumers (classic)
- OR 2 MB/sec per shard per consumer (enhanced fan-out)
The mode is chosen per consumer, not per stream: classic consumers and consumers registered for enhanced fan-out can read from the same stream at the same time.
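The difference between the two read rates shows up once several consumers read the same shard. A rough sketch of the arithmetic, assuming shared consumers split the 2 MB/sec evenly (a simplification; `per_consumer_read_mb_s` is a hypothetical helper):

```python
SHARD_READ_MB_S = 2.0  # read throughput per shard, from the card above


def per_consumer_read_mb_s(num_consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput each consumer gets from one shard."""
    if num_consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        # Each registered consumer gets its own 2 MB/sec pipe per shard.
        return SHARD_READ_MB_S
    # Classic consumers share the same 2 MB/sec (even split assumed here).
    return SHARD_READ_MB_S / num_consumers


shared = per_consumer_read_mb_s(4, enhanced_fan_out=False)    # 0.5
enhanced = per_consumer_read_mb_s(4, enhanced_fan_out=True)   # 2.0
```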
About Kinesis Data Streams:
1. What is the retention period?
2. Do you have the ability to reprocess (replay) data?
3. Can data be deleted once it has been inserted into Kinesis?
- Between 1 and 365 days, inclusive.
- Yes.
- No. Data inserted into Kinesis is immutable; it only goes away when the retention period expires.
About Kinesis Data Streams:
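4. Does data that shares the same partition key go into the same shard? (This is confusing if you assume a shard is just a partition: the shard is the partition, and the partition key is the value hashed to pick the shard.)
Yes. This is what gives ordering: records with the same partition key go to the same shard, in order.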
- What producers are available for a Kinesis data stream?
- AWS SDK
- Kinesis Producer Library (KPL)
- Kinesis Agent
- What consumers are available for a Kinesis data stream?
- You can write your own consumer, using the Kinesis Client Library (KCL) or the AWS SDK.
- Alternatively, you can use a managed consumer such as AWS Lambda, Kinesis Data Firehose, or Kinesis Data Analytics.
Kinesis Data Streams have two capacity modes, what are they?
- Provisioned
- on-demand mode
You need the following things, do you use Kinesis Data Stream Provisioned capacity mode, or Kinesis Data Stream On-Demand capacity mode?
* you need to choose the number of shards provisioned, and scale manually or using API
* you need each shard to have up to 1MB/s in (or 1000 records per second)
* you need each shard to get 2MB/s out (for a classic or enhanced fan-out consumer)
* you need to pay per shard provisioned per hour
Capacity mode: Provisioned
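In provisioned mode, the per-shard limits above turn shard sizing into simple arithmetic. A minimal sketch, assuming classic (shared) consumers and using a hypothetical helper name `shards_needed`; real sizing should also leave headroom for spikes:

```python
import math

WRITE_MB_S_PER_SHARD = 1.0      # 1 MB/s in per shard
WRITE_RECORDS_PER_SHARD = 1000  # or 1000 records per second in per shard
READ_MB_S_PER_SHARD = 2.0       # 2 MB/s out per shard


def shards_needed(write_mb_s: float, records_per_s: float, read_mb_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    by_write_size = math.ceil(write_mb_s / WRITE_MB_S_PER_SHARD)
    by_write_count = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    by_read = math.ceil(read_mb_s / READ_MB_S_PER_SHARD)
    return max(1, by_write_size, by_write_count, by_read)


# 5.5 MB/s in needs 6 shards, even though 3000 records/s alone would need only 3.
size_bound = shards_needed(write_mb_s=5.5, records_per_s=3000, read_mb_s=8)
# Many small records: 4500 records/s needs 5 shards despite only 0.5 MB/s.
count_bound = shards_needed(write_mb_s=0.5, records_per_s=4500, read_mb_s=1)
```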
You need the following things, do you use Kinesis Data Stream Provisioned capacity mode, or Kinesis Data Stream On-Demand capacity mode?
* you don’t want to provision or manage the capacity
* you’re perfectly happy with 4MB/s in or 4000 records per second (this is the default capacity provisioned for this capacity mode)
* you’re very happy for your stream to scale automatically based on the observed throughput peak during the last 30 days
* you’re happy to pay per stream per hour & data in/out per GB
Capacity mode: On-demand mode
Let’s talk about Kinesis Data Stream Security:
1. how do you control access/authorization?
2. can you do encryption in flight? using what?
3. can you do encryption at rest? using what?
- control access/authorization using IAM policies
- encryption in flight using HTTPS endpoints
- encryption at rest using KMS
Let’s talk about Kinesis Data Stream Security:
1. can you implement encryption/decryption of data on the client side?
2. are vpc endpoints available for kinesis to access within VPC?
3. how can you monitor api calls?
- you can implement encryption/decryption of data on the client side, but it is harder
- VPC endpoints are available for Kinesis to access within a VPC
- you can monitor API calls using CloudTrail
What do Kinesis producers do?
put data records into data streams
What are the Kinesis data stream producer options? (Possibly these producers are not specific to Data Streams, but also apply to the other Kinesis subservices: Firehose, Video Streams, Analytics.)
- AWS SDK (simple producer)
- Kinesis Producer Library (KPL): C++, Java, batching, compression, retries
- Kinesis Agent (monitors log files)
Kinesis producer write throughput?
1MB/sec or 1000 records/sec per shard
Kinesis producer APIs?
- PutRecord API
- PutRecords API - sends multiple records in one request, which helps reduce costs and increase throughput
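The batching idea behind PutRecords can be sketched by grouping records before sending them. This is only an illustration: the 500-records-per-request cap is an assumption based on the documented PutRecords limit, and `batches` is a hypothetical helper.

```python
from typing import Iterator, List

MAX_RECORDS_PER_CALL = 500  # assumed PutRecords per-request record limit


def batches(records: List[dict], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[dict]]:
    """Yield consecutive chunks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# 1200 click events become 3 requests instead of 1200 single PutRecord calls.
records = [{"PartitionKey": f"user-{i}", "Data": b"click"} for i in range(1200)]
batch_sizes = [len(batch) for batch in batches(records)]  # [500, 500, 200]
```

With boto3, each batch could be passed as the `Records` argument of `put_records`. A batch can partially fail, so the response's `FailedRecordCount` should be checked and the failed entries retried.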
T/F
A hash function is used to map record partition keys (from the producer) to the shard handling that particular set of partition keys.
True
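A rough sketch of that mapping: Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The equal-size ranges and the helper names `hash_key` and `shard_for` are assumptions for illustration; real shard ranges come from the stream's shard descriptions and change after splits and merges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to an integer in [0, 2**128)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, num_shards: int) -> int:
    """Index of the shard whose hash range holds the key (equal ranges assumed)."""
    range_size = HASH_SPACE // num_shards
    # min() keeps the last few hash values inside the final shard's range.
    return min(hash_key(partition_key) // range_size, num_shards - 1)


# The same key always lands on the same shard, which is what preserves ordering.
first = shard_for("device-42", 4)
second = shard_for("device-42", 4)
```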
What is the SubscribeToShard API
The SubscribeToShard API is a high-performance streaming API that pushes data from shards to consumers over a persistent connection without a request cycle from the client. The SubscribeToShard API uses the HTTP/2 protocol to deliver data to registered consumers whenever new data arrives on the shard, typically within 70 milliseconds, offering approximately 65% faster delivery compared to the GetRecords API. The consumers will enjoy fast delivery even when multiple registered consumers are reading from the same shard.
Q: What is enhanced fan-out?
Q: When should I use enhanced fan-out?
- Enhanced fan-out is an optional feature for Kinesis Data Streams consumers that provides logical 2 MB/second throughput pipes between consumers and shards. This allows you to scale the number of consumers reading from a data stream in parallel, while maintaining high performance.
- You should use enhanced fan-out if you have, or expect to have, multiple consumers retrieving data from a stream in parallel, or if you have at least one consumer that requires the use of the SubscribeToShard API to provide sub-200 millisecond data delivery speeds between producers and consumers.
- Why might you be seeing “ProvisionedThroughputExceeded”?
- What can you do about it option 1.
- What can you do about it option 2.
- What can you do about it option 3.
- In on-demand mode, you might see it if traffic from your producers grows to more than double the previous peak within a 15-minute window. In provisioned mode, it means writes exceeded a shard's limit (1 MB/sec or 1,000 records/sec).
- add more shards, then retry the throttled requests
- retry with exponential backoff
- use a highly distributed partition key
What is high distribution, as it relates to keys?
Key distribution
- An indexed storage system uses keys to determine where it should store, or look for, associated data. One strategy to optimize access to data is to spread that data out over a number of storage locations. Because each location then holds less data, it is faster to find something within a given collection. Consider one million records: searching for a given record will, in the worst case, require the system to look at one million elements. If those one million records were divided into one thousand groups, limiting the search to one group would reduce that worst case to only one thousand elements.
- The relative probability that a key will direct a search to a given location is known as the “key distribution”. An even distribution means that, for any given key, being directed to any one location is as likely as any other, so finding data is more efficient. For Kinesis, a highly distributed partition key spreads records evenly across shards instead of overloading one of them.
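The effect of key distribution on shards can be sketched by counting where keys land. This is an illustration under assumptions: keys are hashed with MD5 and the hash space is split into equal shard ranges; `shard_index`, `shard_counts`, and the example keys are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable


def shard_index(key: str, num_shards: int) -> int:
    """Scale the 128-bit MD5 hash of the key down to a shard index."""
    value = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return value * num_shards // 2 ** 128


def shard_counts(keys: Iterable[str], num_shards: int) -> Counter:
    """Count how many records each shard would receive."""
    return Counter(shard_index(key, num_shards) for key in keys)


# Poorly distributed: every record uses the same key, so one shard takes all the load.
skewed = shard_counts(["eu-west-1"] * 10_000, num_shards=8)
# Highly distributed: one key per device spreads records across all shards.
spread = shard_counts([f"device-{i}" for i in range(10_000)], num_shards=8)
```

In the skewed case a single shard receives all 10,000 records, which is the "hot shard" situation that leads to throttling even when the stream as a whole has spare capacity.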
What does “retries with exponential backoff” mean?
Retries with exponential backoff is a technique that retries a failed operation, waiting exponentially longer before each new attempt, until a maximum retry count has been reached.
https://learn.microsoft.com/en-us/dotnet/architecture/microservices/implement-resilient-applications/implement-retries-exponential-backoff (everything else is from aws and steph’s course)
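A minimal sketch of the technique, not taken from any SDK: `ThrottledError` is a hypothetical stand-in for a throttling error such as ProvisionedThroughputExceeded, and the delay values and the random jitter are illustrative choices.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the service."""


def retry_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call operation(); on throttling, wait longer each time, then give up."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the error
            # The ceiling doubles each attempt: 0.1, 0.2, 0.4, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # A random wait below the ceiling keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the waiting can be swapped out, for example in tests. The AWS SDKs already retry throttled calls on their own, so a wrapper like this mostly matters for custom producers.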
What are some Kinesis Data Stream Consumers?
- AWS Lambda
- Kinesis Data Firehose
- Kinesis Data Analytics
- Custom consumer (AWS SDK) - Classic or Enhanced Fan-Out
- Kinesis Client Library (KCL) - library to simplify reading from a data stream
- Applications running on EC2 (shown on one of the slides, but not on another)
What do Kinesis Data Stream Consumers do?
get data records from data streams and process them