Kinesis Flashcards
For how long can data be retained in a Kinesis data stream?
One day to 365 days
Is there an ability to replay data in a Kinesis data stream?
Yes
How many capacity modes are there in Kinesis Data Streams?
Provisioned mode:
1. You choose the number of shards provisioned; scale manually or via the API
2. Each shard gets 1MB/s in (or 1000 records per second)
3. Each shard gets 2MB/s out (classic or enhanced fan out consumer)
4. You pay per shard provisioned per hour
On demand mode:
1. No need to provision or manage the capacity
2. Default capacity provisioned (4 MB/s in or 4000 records per second)
3. Scales automatically based on observed throughput peak during the last 30 days
4. Pay per stream per hour & data in/out per GB
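A minimal sketch of creating a stream in each capacity mode using boto3 (the Python AWS SDK); the stream names and shard count are made up:

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you pick the shard count and scale it yourself
kinesis.create_stream(
    StreamName="orders-provisioned",
    ShardCount=4,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count to manage, capacity scales automatically
kinesis.create_stream(
    StreamName="orders-on-demand",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```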
Where is Kinesis data stream deployed - AZ or Region?
Region
What security features are there for Kinesis Data Streams?
- Control access / authorization using IAM policies
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- You can implement encryption/decryption of data on client side (harder)
- VPC Endpoints available for Kinesis to access within a VPC
- Monitor API calls using CloudTrail
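As a sketch of the KMS at-rest encryption feature, here is how it can be enabled on an existing stream with boto3 (the stream name is a placeholder; the key shown is the AWS-managed Kinesis key, but you can pass your own CMK):

```python
import boto3

kinesis = boto3.client("kinesis")

# Turn on server-side encryption at rest for the stream
kinesis.start_stream_encryption(
    StreamName="orders-stream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",  # or the ARN / alias of your own KMS key
)
```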
Which APIs or technologies can be used to produce to a Kinesis data stream?
- AWS SDK
- Kinesis Producer Library
- Kinesis Agent
- Apache Spark
- Kafka
- Third party libraries
The exam expects you to know this list.
What is the API to produce records with the Kinesis Producer SDK?
PutRecord, or PutRecords (for multiple records)
Use case: low throughput, low volume of records, higher latency, simple API, AWS Lambda
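A rough sketch of both calls with boto3; the stream name and payloads are made up:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Single record: PutRecord
kinesis.put_record(
    StreamName="orders-stream",
    Data=json.dumps({"order_id": 1, "item": "phone"}).encode(),
    PartitionKey="customer-42",  # records with the same key go to the same shard
)

# Multiple records in one call: PutRecords
kinesis.put_records(
    StreamName="orders-stream",
    Records=[
        {"Data": json.dumps({"order_id": i}).encode(), "PartitionKey": f"customer-{i}"}
        for i in range(10)
    ],
)
```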
What is the meaning of ProvisionedThroughputExceededException in the AWS Kinesis API?
- It happens when the data exceeds the limits of MB/s or records/sec for a shard.
- It may happen when you have a hot shard - meaning many records share the same partition key and therefore go to the same shard. For example, you have one shard for car orders and one for mobile phone orders, but most incoming records are mobile phone orders. Distributing records across the shards will help resolve this error. Solutions:
A. increase the number of shards
B. create a proper partition key that can distribute the data properly
C. retry capability with backoff
Very Important
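A minimal sketch of option C, retrying with exponential backoff, using boto3; the stream name, payload, and helper function are hypothetical:

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, partition_key: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName="orders-stream",
                Data=data,
                PartitionKey=partition_key,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt * 0.1)  # back off: 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after retries")
```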
What are the features of the Kinesis Producer Library (KPL)?
Batching, compression, retries, synchronous and asynchronous APIs, CloudWatch metrics
- Easy to use and highly configurable C++ / Java library
- Used for building high performance, long running producers
- Automated and configurable retry mechanism - for handling errors like ProvisionedThroughputExceededException
- Synchronous or asynchronous API (better performance for async) - when you see synchronous or asynchronous in the exam, think of the KPL.
- Submits metrics to CloudWatch for monitoring
- Batching (both types turned on by default) increases throughput and decreases cost:
A. Collection - write multiple records to one or more shards in the same PutRecords API call
B. Aggregation - increased latency:
* Capability to store multiple records in one record (go over the 1000 records per second limit)
* Increase payload size and improve throughput (maximize the 1 MB/s limit) - compression must be implemented by the user to make the records smaller
- KPL Records must be decoded with KCL or special helper library
What is Kinesis batching?
Batching means performing a single action on multiple items instead of repeating that action for each individual item. Here the item is a record and the action is sending it to Kinesis Data Streams. With batching, each HTTP request can carry multiple records instead of just one; in a non-batching situation, you would place each record in a separate Kinesis Data Streams record and make one HTTP request per record.
The KPL supports two types of batching:
Aggregation – Storing multiple records within a single Kinesis Data Streams record.
Collection – Using the API operation PutRecords to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream.
The two types of KPL batching are designed to coexist and can be turned on or off independently of one another. By default, both are turned on.
We can influence batching efficiency by introducing some delay with RecordMaxBufferedTime.
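The KPL itself is a Java library, so purely as a conceptual illustration, here is collection-style batching with the plain Python SDK: records are buffered and then sent in one PutRecords call (stream name and buffer size are made up; the buffer threshold is a crude stand-in for RecordMaxBufferedTime):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
buffer = []

def produce(record: dict, partition_key: str, flush_at: int = 500):
    # Accumulate records locally instead of one HTTP request per record
    buffer.append({"Data": json.dumps(record).encode(), "PartitionKey": partition_key})
    if len(buffer) >= flush_at:  # PutRecords accepts up to 500 records per call
        kinesis.put_records(StreamName="orders-stream", Records=buffer)
        buffer.clear()
```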
Where can you not use the KPL (Kinesis Producer Library)?
The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance, but applications that cannot tolerate this additional delay should not use the KPL and should use the AWS SDK directly.
For example, an IoT sensor ships with the AWS SDK and produces to Kinesis Data Streams, but the sensor is sometimes offline. If you use the KPL, during an offline period the KPL keeps accumulating data, and when the device comes back online it may take some time to transfer all of that data to the stream. If we only want to act on the latest data, it is more sensible to implement the application directly with the SDK API calls PutRecord / PutRecords, because then we can choose to discard stale data and send only the latest, most relevant data when the device is back online (see the sketch below).
This may come up in the exam
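A sketch of that "latest data only" idea with boto3; all names here are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis")
latest_reading = None

def on_sensor_reading(reading: dict):
    global latest_reading
    latest_reading = reading  # older buffered readings are simply discarded

def on_back_online(device_id: str):
    # Send only the most recent reading once connectivity returns
    if latest_reading is not None:
        kinesis.put_record(
            StreamName="sensor-stream",
            Data=json.dumps(latest_reading).encode(),
            PartitionKey=device_id,
        )
```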
What is the Kinesis Data Streams Agent?
Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams. The agent continuously monitors a set of files and sends new data to your stream. The agent handles file rotation, checkpointing, and retry upon failures.
- Writes from multiple directories and writes to multiple streams
- Routing feature based on directory / log file
- Pre process data before sending to streams (single line, csv to json, log to json…)
- The agent handles file rotation, checkpointing, and retry upon failures
- Emits metrics to CloudWatch for monitoring
It installs in Linux-based server environments.
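As a rough sketch (the file path and stream name are placeholders), the agent is driven by a JSON configuration file, typically /etc/aws-kinesis/agent.json, where each flow maps monitored files to a stream and can pre-process records:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/app/*.log",
      "kinesisStream": "app-log-stream",
      "dataProcessingOptions": [
        { "optionName": "LOGTOJSON", "logFormat": "COMMONAPACHELOG" }
      ]
    }
  ]
}
```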
What is the per-shard capacity for producers and consumers?
1 MB/s (or 1000 records/s) per shard for producers
2 MB/s per shard for consumers
What are Kinesis consumers?
- Kinesis SDK
- Kinesis Client Library (KCL)
- Kinesis Connector Library
- Kinesis Firehose
- Apache Spark (important for the exam)
- AWS Lambda
- Kinesis Consumer Enhanced Fan-out
What is the API for getting data with the Kinesis Consumer SDK?
GetRecords can return up to 10 MB of data (or up to 10,000 records) per call, but the capacity for sending data from a shard to consumers is 2 MB/s. So if a call returns 10 MB, you have to wait about 5 seconds before you can get another set of records.
There is also another limit: you can make up to 5 GetRecords API calls per second per shard, which means roughly 200 milliseconds of latency on your data.
If 5 consumer applications consume from the same shard, each consumer can effectively poll once per second and receives less than 400 KB/s - important
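A minimal polling-consumer sketch with boto3 (the stream name is a placeholder, and only the first shard is read for simplicity):

```python
import time
import boto3

kinesis = boto3.client("kinesis")

shard_id = kinesis.describe_stream(StreamName="orders-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="orders-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=10000)
    for record in resp["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = resp["NextShardIterator"]
    time.sleep(0.2)  # stay under the 5 GetRecords calls per second per shard limit
```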
What is the main feature of the KCL (Kinesis Client Library)?
De-aggregation, sharing multiple shards across multiple consumers, and checkpointing progress per shard using DynamoDB
The Kinesis Client Library is written in Java, but there are similar libraries for Python, Ruby, Node, and .NET.
1. Read records from Kinesis produced with the KPL (to de-aggregate the records aggregated by the KPL)
2. Use shard discovery to share multiple shards with multiple consumers in one "group"
3. If transmission gets interrupted, use the checkpointing feature to resume
4. Leverages DynamoDB for coordination and checkpointing (one row per shard)
* Make sure you provision enough WCU / RCU
* Or use On Demand for DynamoDB
* Otherwise DynamoDB may slow down KCL
* If you are getting ExpiredIteratorException, you need to increase WCU in DynamoDB (exam question)
- Record processors will process the data
Why is DynamoDB used in the Kinesis Client Library?
Amazon DynamoDB is used in the Kinesis Client Library (KCL) to store the processing state of the Kinesis data stream. The KCL is a Java library that helps developers consume and process data from Amazon Kinesis streams.
When using the KCL, each shard is processed by a single worker, which needs to keep track of which records have already been processed and which are pending. This processing state (shard leases and checkpoints) is stored in DynamoDB.
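The KCL API itself is Java, but as a rough illustration of the state it keeps, you can peek at the DynamoDB lease table it creates (the table is named after your KCL application; the attribute names below reflect what the KCL typically stores and should be treated as an assumption):

```python
import boto3

# The KCL creates one table per consumer application, one row per shard
table = boto3.resource("dynamodb").Table("my-kcl-application")

for item in table.scan()["Items"]:
    # which worker currently owns the lease and the last checkpointed position
    print(item.get("leaseKey"), item.get("leaseOwner"), item.get("checkpoint"))
```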
What is the Kinesis Connector Library?
An old Java library that leverages the KCL under the hood. It writes data to Amazon S3, DynamoDB, Redshift, and ElasticSearch.
The Connector Library must be running on an EC2 instance.
Kinesis Firehose has replaced the connector library for a few of the targets mentioned above, and Lambda has replaced some more.
Using AWS Lambda with Kinesis?
Lambda can read records from a Kinesis data stream, and the Lambda consumer also has a small library that can de-aggregate records produced with the KPL. So you can produce with the KPL and read from a Lambda consumer using that small library. Lambda can be used for lightweight ETL: send data to Amazon S3, DynamoDB, Redshift, ElasticSearch, or really anywhere you want, as long as you can program it.
Lambda can also be used to read in real time from Kinesis data streams and trigger notifications or, for example, send emails in real time.
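A sketch of a Python Lambda handler consuming from a Kinesis event source; records arrive base64-encoded in the event payload (the event source mapping and downstream targets are assumed to be configured separately):

```python
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data is base64-encoded in the Lambda event
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # lightweight ETL: filter, transform, forward, or notify as needed
        print(record["kinesis"]["partitionKey"], payload)
```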
Kinesis Enhanced Fan-Out?
- New game changing feature from August 2018.
- Works with KCL 2.0 and AWS Lambda (Nov 2018)
- **Each consumer gets 2 MB/s of provisioned throughput per shard. That means 20 consumers will get 40 MB/s per shard aggregated - no more 2 MB/s limit!**
- This is possible since Kinesis pushes data to consumers over HTTP/2
- Reduce latency (~70 ms)
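A sketch of registering an enhanced fan-out consumer with boto3 (stream and consumer names are placeholders); once registered, the consumer reads via the SubscribeToShard API, which pushes records over HTTP/2:

```python
import boto3

kinesis = boto3.client("kinesis")

stream_arn = kinesis.describe_stream(StreamName="orders-stream")[
    "StreamDescription"]["StreamARN"]

# Each registered consumer gets its own 2 MB/s per shard
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="analytics-app",
)["Consumer"]

print(consumer["ConsumerARN"], consumer["ConsumerStatus"])
```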
When to use Enhanced Fan-Out vs. standard consumers?
Standard consumers:
* Low number of consuming applications (1,2,3…)
* Can tolerate ~200 ms latency
* Minimize cost
Enhanced Fan Out Consumers:
* Multiple Consumer applications for the same Stream
* Low Latency requirements ~70ms
* Higher costs (see Kinesis pricing page)
* Default limit of 20 consumers using enhanced fan out per data stream