Kinesis Flashcards

1
Q

For how long can data be retained in a Kinesis data stream?

A

One day (the default) up to 365 days

2
Q

Is there an ability to replay data in a Kinesis data stream?

A

Yes; consumers can re-read (replay) data as long as it is within the retention period.

3
Q

How many capacity modes are there in Kinesis Data Streams?

A

Two.

Provisioned mode:
1. You choose the number of shards provisioned; scale manually or using the API
2. Each shard gets 1 MB/s in (or 1,000 records per second)
3. Each shard gets 2 MB/s out (shared across classic consumers, or per consumer with enhanced fan-out)
4. You pay per shard provisioned per hour

On-demand mode:
1. No need to provision or manage capacity
2. Default capacity provisioned (4 MB/s in or 4,000 records per second)
3. Scales automatically based on the observed throughput peak during the last 30 days
4. Pay per stream per hour, plus data in/out per GB (a creation sketch for both modes follows)
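A minimal boto3 sketch of creating a stream in each mode; the stream names are hypothetical, and StreamModeDetails assumes a reasonably recent boto3:

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you pick the shard count and scale it yourself.
kinesis.create_stream(
    StreamName="orders-provisioned",  # hypothetical name
    ShardCount=4,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count to manage, capacity scales automatically.
kinesis.create_stream(
    StreamName="orders-on-demand",  # hypothetical name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Manual scaling of a provisioned stream via the API:
kinesis.update_shard_count(
    StreamName="orders-provisioned",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)
```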

4
Q

Where is a Kinesis data stream deployed: per AZ or per Region?

A

Region (data is synchronously replicated across three AZs within the Region)

5
Q

What security features are there for Kinesis Data Streams?

A
  1. Control access / authorization using IAM policies
  2. Encryption in flight using HTTPS endpoints
  3. Encryption at rest using KMS
  4. You can implement encryption/decryption of data on the client side (harder)
  5. VPC endpoints available to access Kinesis from within a VPC
  6. Monitor API calls using CloudTrail
6
Q

Which APIs or technologies can be used to produce to a Kinesis data stream?

A
  1. AWS SDK
  2. Kinesis Producer Library
  3. Kinesis Agent
  4. Apache Spark
  5. Kafka
  6. Third party libraries

The exam expects you to know this list.

7
Q

What is the API to produce records with the Kinesis producer SDK?

A

PutRecord, or PutRecords for multiple records.
Use case: low throughput, low record volume, tolerance for higher latency, simple API, AWS Lambda. A minimal sketch follows.
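A minimal boto3 sketch of both calls, assuming a hypothetical "orders" stream:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Single record: the partition key determines which shard receives it.
kinesis.put_record(
    StreamName="orders",
    Data=json.dumps({"order_id": 1}).encode(),
    PartitionKey="order-1",
)

# Multiple records in one call (up to 500 records per request):
kinesis.put_records(
    StreamName="orders",
    Records=[
        {"Data": json.dumps({"order_id": i}).encode(), "PartitionKey": f"order-{i}"}
        for i in range(10)
    ],
)
```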

8
Q

What is the meaning of ProvisionedThroughputExceededException in the AWS Kinesis API?

A
  1. It happens when the data exceeds the per-shard limits in MB/s or records/sec.
  2. It may happen when you have a hot shard: many records share the same partition key, so they all land on the same shard. For example, if one shard receives car orders and another receives mobile phone orders, but most incoming records are mobile phone orders, that shard becomes hot. Distributing records across the shards resolves this error. Fixes:
    A. Increase the number of shards
    B. Choose a partition key that distributes the data evenly
    C. Retry with exponential backoff (see the sketch below)
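A sketch of fix C, retrying PutRecord with exponential backoff; the helper name and stream are hypothetical:

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(stream, data, partition_key, max_retries=5):
    """Retry PutRecord with exponential backoff when the shard is throttled."""
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName=stream, Data=data, PartitionKey=partition_key
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(0.1 * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after all retries")
```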
9
Q

Very Important

What are the features of the Kinesis Producer Library (KPL)?

A

Batching, compression, retries, synchronous and asynchronous APIs, CloudWatch metrics

  1. Easy-to-use and highly configurable C++ / Java library
  2. Used for building high-performance, long-running producers
  3. Automated and configurable retry mechanism, for handling errors like ProvisionedThroughputExceededException
  4. Synchronous or asynchronous API (better performance for async); when you see "synchronous or asynchronous" in the exam, think of the KPL
  5. Submits metrics to CloudWatch for monitoring
  6. Batching (both mechanisms turned on by default) increases throughput and decreases cost:
    A. Collection: gather records and write to multiple shards in the same PutRecords API call
    B. Aggregation (adds some latency):
    * Capability to store multiple records in one Kinesis record (go beyond the 1,000 records per second per shard limit)
    * Increases payload size and improves throughput (helps maximize the 1 MB/s per shard limit)
  7. Compression must be implemented by the user to make the records smaller
  8. KPL records must be de-aggregated with the KCL or a special helper library
10
Q

What is Kinesis batching?

A

Batching means performing a single action on multiple items instead of repeating the action for each individual item; here the item is a record, and the action is sending it to Kinesis Data Streams. With batching, each HTTP request can carry multiple records instead of just one. Without batching, you would place each record in a separate Kinesis Data Streams record and make one HTTP request per record.

The KPL supports two types of batching:

Aggregation – Storing multiple records within a single Kinesis Data Streams record.

Collection – Using the API operation PutRecords to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream.

The two types of KPL batching are designed to coexist and can be turned on or off independently of one another. By default, both are turned on.

We can influence batching efficiency by introducing some delay with RecordMaxBufferedTime.

11
Q

When should you not use the KPL (Kinesis Producer Library)?

A

The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance, but applications that cannot tolerate this additional delay should not use the KPL and should use the AWS SDK directly.

Example: an IoT sensor that produces to Kinesis Data Streams but is sometimes offline. With the KPL, during an offline event the library keeps accumulating data, and when the device comes back online it may take a long time to transfer all the buffered data to the stream, even though you may only want to act on the latest data.

In that case, instead of using the KPL, it is more sensible to implement the application directly with the SDK call PutRecords: you can choose to discard stale data and send only the latest, most relevant data once the device is back online.

This may come up in the exam

12
Q

What is the Kinesis Data Streams Agent?

A

Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams. The agent continuously monitors a set of files and sends new data to your stream. The agent handles file rotation, checkpointing, and retry upon failures.

  • Write from multiple directories and write to multiple streams
  • Routing feature based on directory / log file
  • Pre-process data before sending to streams (single line, CSV to JSON, log to JSON…)
  • The agent handles file rotation, checkpointing, and retry upon failures
  • Emits metrics to CloudWatch for monitoring

It installs in Linux-based server environments.

13
Q

What is the capacity of a shard for producers and consumers?

A

1 MB/s (or 1,000 records/s) in per shard for producers
2 MB/s out per shard for consumers

14
Q

What are Kinesis consumers?

A
  1. Kinesis SDK
  2. Kinesis Client Library (KCL)
  3. Kinesis Connector Library
  4. Kinesis Firehose
  5. Apache Spark (important for the exam)
  6. AWS Lambda
  7. Kinesis Consumer Enhanced Fan-out
15
Q

What is the API for getting data with the Kinesis consumer SDK?

A

GetRecords can return up to 10 MB of data or up to 10,000 records per call, but a shard can only send data to consumers at 2 MB/s. So if one call returns the full 10 MB, you must wait 5 seconds before the shard can serve the next batch of records.

There is also another limit: you can make up to 5 GetRecords API calls per second per shard, which means a minimum of about 200 ms of latency on your data.

If 5 consumer applications consume from the same shard, each consumer can poll only about once per second and receives less than 400 KB/s (important). A minimal polling sketch follows.
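A minimal boto3 polling sketch against the first shard of a hypothetical stream, pacing calls to respect the 5 GetRecords calls per second per shard limit:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

shards = kinesis.describe_stream(StreamName="orders")["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName="orders",
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp["NextShardIterator"]
    time.sleep(0.2)  # at most 5 GetRecords calls/second/shard
```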

16
Q

What are the main features of the KCL (Kinesis Client Library)?

A

De-aggregation, sharing multiple shards among multiple consumers, and checkpointing each shard's progress using DynamoDB.
The Kinesis Client Library is written in Java, but there are similar libraries for Python, Ruby, Node, and .NET.
1. Reads records from Kinesis produced with the KPL (de-aggregates the records the KPL aggregated)
2. Uses shard discovery to share multiple shards among multiple consumers in one "group"
3. If consumption is interrupted, uses the checkpointing feature to resume
4. Leverages DynamoDB for coordination and checkpointing (one row per shard)
* Make sure you provision enough WCU / RCU
* Or use On-Demand mode for DynamoDB
* Otherwise DynamoDB may slow down the KCL
* If you are getting ExpiredIteratorException, increase the WCU on the DynamoDB table (exam question)
5. Record processors process the data
17
Q

Why is DynamoDB used in the Kinesis Client Library?

A

Amazon DynamoDB is used in the Kinesis Client Library (KCL) to store the state of the processing of the Kinesis data stream. The KCL is a Java library that helps developers consume and process data from Amazon Kinesis streams.

When using the KCL, each shard in the Kinesis data stream is processed by a single worker, which maintains its own processing state. The worker needs to keep track of which records have been processed and which are pending. This processing state (checkpoint) is stored in DynamoDB.

18
Q

What is the Kinesis Connector Library?

A

An older Java library that leverages the KCL under the hood. It writes data to Amazon S3, DynamoDB, Redshift, and Elasticsearch.

The Connector Library must run on an EC2 instance.
Kinesis Firehose has replaced the Connector Library for several of the targets above, and AWS Lambda covers others.

19
Q

Using AWS Lambda for Kinesis?

A

Lambda can read records from a Kinesis data stream, and a small helper library lets the Lambda consumer de-aggregate records produced with the KPL, so you can produce with the KPL and consume with Lambda. Lambda can be used for lightweight ETL, sending data to Amazon S3, DynamoDB, Redshift, Elasticsearch, or anywhere else you can program.

Lambda can also be used to read in real time from Kinesis data streams and trigger notifications, for example sending emails in real time.

20
Q

What is Kinesis Enhanced Fan-Out?

A
  1. Game-changing feature released in August 2018.
  2. Works with KCL 2.0 and AWS Lambda (November 2018)
  3. Each consumer gets 2 MB/s of provisioned throughput per shard. That means 20 consumers get 40 MB/s per shard aggregated. No more shared 2 MB/s limit!
  4. This is possible because Kinesis pushes data to consumers over HTTP/2
  5. Reduced latency (~70 ms); a registration sketch follows
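A sketch of registering an enhanced fan-out consumer with boto3; the stream and consumer names are hypothetical, and in practice KCL 2.x runs the SubscribeToShard loop for you:

```python
import boto3

kinesis = boto3.client("kinesis")

stream_arn = kinesis.describe_stream(StreamName="orders")[
    "StreamDescription"
]["StreamARN"]

# Each registered consumer gets its own 2 MB/s per shard, pushed over HTTP/2.
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="analytics-app",  # hypothetical consumer name
)
print(consumer["Consumer"]["ConsumerARN"])  # used by SubscribeToShard
```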
21
Q

When to use Enhanced Fan-Out vs. Standard Consumers?

A

Standard consumers:
* Low number of consuming applications (1,2,3…)
* Can tolerate ~200 ms latency
* Minimize cost

Enhanced Fan Out Consumers:
* Multiple Consumer applications for the same Stream
* Low Latency requirements ~70ms
* Higher costs (see Kinesis pricing page)
* Default limit of 20 consumers using enhanced fan out per data stream

22
Q

Kinesis Operations - Adding Shards (a.k.a. shard splitting)

A

It is used to increase the capacity of a shard that is getting a lot of data, in other words a hot shard. A single shard has 1 MB/s of ingest capacity, and each split adds another 1 MB/s; e.g., splitting one shard into two gives you 2 MB/s across the two child shards.

The old (parent) shard is closed, and it is deleted once its data has expired. A split sketch follows.
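A minimal boto3 sketch of splitting a shard at the midpoint of its hash key range; the stream name is hypothetical, and the hot shard is assumed to be the first one listed:

```python
import boto3

kinesis = boto3.client("kinesis")

# Look up the hot shard's hash key range, then split it down the middle.
shard = kinesis.describe_stream(StreamName="orders")["StreamDescription"]["Shards"][0]
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])

kinesis.split_shard(
    StreamName="orders",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),  # midpoint -> two equal children
)
```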

23
Q

What is shard merging?

A
  1. Decreases the stream capacity and saves costs
  2. Can be used to group two shards with low traffic
  3. Old shards are closed and deleted once their data expires (a merge sketch follows)
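A minimal boto3 merge sketch; the stream name and shard IDs are hypothetical, and the two shards must cover adjacent hash key ranges:

```python
import boto3

kinesis = boto3.client("kinesis")

# Combine two adjacent low-traffic shards into a single shard.
kinesis.merge_shards(
    StreamName="orders",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```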
24
Q

What is resharding?

A

Resharding is necessary when you need to scale up or down the capacity of your Kinesis data stream. For example, if you need to ingest more data into the stream, you can add more shards to increase the capacity. Conversely, if you have too many shards and don’t need that much capacity, you can remove some shards to reduce costs.

The resharding process involves splitting or merging shards in the data stream. When splitting a shard, a new shard is created, and the data in the original shard is split between the two shards. When merging shards, two smaller shards are combined into one larger shard.

Resharding can be done using the Kinesis API or the AWS Management Console. It is important to note that resharding can impact the ordering of records in the data stream and requires careful planning to avoid data loss or duplication.

25
Q

Exam use case: a producer is sending data to a Kinesis data stream with the correct partition key, yet at some point your consumer receives the data out of order.

A

The reason for that would be resharding.
After a split, you can read from the child shards. However, data the client application hasn't read yet could still be in the parent.
If you start reading from the child shards before finishing the parent's data, you could read data for a particular hash key out of order.

After a reshard, read all data from parent before you start pulling the data from the new shard. This logic should be implemented in the client application.

Note: The Kinesis Client Library (KCL) has this logic already built in, even after resharding operations

26
Q

How do you handle duplicate records sent by producers in a Kinesis data stream? (Exam question)

A

Consider a producer that experiences a network-related timeout after it calls PutRecord, but before it can receive an acknowledgment from Amazon Kinesis Data Streams. The producer cannot be sure whether the record was delivered. Assuming that every record is important to the application, the producer would retry the call with the same data. If both PutRecord calls on that same data were successfully committed to Kinesis Data Streams, there will be two Kinesis Data Streams records. Although the two records have identical data, they have unique sequence numbers.

Applications that need strict guarantees should embed a primary key within the record to remove duplicates later when processing. The consumer should have the logic to detect and deal with duplicate records.

Note that the number of duplicates due to producer retries is usually low compared to the number of duplicates due to consumer retries.

27
Q

How do you handle duplicates on the consumer side for a Kinesis data stream?
(Exam Question)

A

Fixes:
* Make your consumer application idempotent: a property of some operations such that you achieve the same result no matter how many times you execute them (a sketch follows below).
* If the final destination can handle duplicates, it's recommended to handle them there; for example, a database with a unique key will refuse to insert a duplicate record.
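A sketch of the idempotency idea: the producer embeds a unique ID in each record, and the consumer uses a DynamoDB conditional write so reprocessing the same record is a no-op. The table and attribute names are hypothetical:

```python
import json
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-orders")  # hypothetical table

def process(record_data: bytes) -> None:
    item = json.loads(record_data)
    try:
        table.put_item(
            Item=item,
            # Succeeds only the first time this order_id is seen, so a
            # duplicate delivery of the same record is silently skipped.
            ConditionExpression="attribute_not_exists(order_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
```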

28
Q

Kinesis Security?

A
  1. Control access / authorization using IAM policies
  2. Encryption in flight using HTTPS endpoints
  3. Encryption at rest using KMS
  4. Client side encryption must be manually implemented (harder)
  5. VPC Endpoints available for Kinesis to access within VPC
29
Q

What are the key points about Kinesis Firehose?

A
  1. Producers for Kinesis Firehose can be any application or device using the SDK, the KPL, the Kinesis Agent, AWS IoT, CloudWatch, or a Kinesis data stream
  2. A record can be up to 1 MB in size
  3. Kinesis Firehose transforms data using Lambda functions
  4. It writes data in batches, so it is near real time; because of the batching there can be a delay of 60 seconds or more. Kinesis Data Streams, by contrast, is real time
  5. AWS destinations are Amazon S3, Amazon Redshift (COPY through S3), and Amazon Elasticsearch; Splunk is the main third-party destination. It is important to remember these destinations
  6. Other destinations include third-party destinations and custom destinations via HTTP endpoints
  7. Either failed data or all data can be backed up to an S3 bucket
  8. It's a fully managed service
  9. It scales automatically with the data volume
  10. It supports data conversion from JSON to Parquet or ORC (only for S3)
  11. It supports compression when the target is S3
  12. Only GZIP-compressed data can be further loaded into Redshift
  13. You only pay for the amount of data that goes through Firehose
  14. Spark and the KCL do not read from Kinesis Data Firehose; they only read from Kinesis Data Streams. A minimal producer sketch follows
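A minimal boto3 producer sketch for a hypothetical delivery stream; the trailing newline is a common convention so records land line-delimited in S3 (Firehose does not add delimiters itself):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers this record and delivers it in a batch (near real time).
firehose.put_record(
    DeliveryStreamName="orders-to-s3",  # hypothetical delivery stream
    Record={"Data": json.dumps({"order_id": 1}).encode() + b"\n"},
)
```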
30
Q

Where can Kinesis Firehose save data to? (Important for the exam)

A

S3, Splunk, Redshift, and Elasticsearch.
Spark and the KCL do not read from Kinesis Data Firehose (the exam may try to trick you into saying they can); they read from Kinesis Data Streams.

31
Q

What is a Kinesis Data Firehose delivery stream?

A

A delivery stream is the underlying entity of Kinesis Data Firehose: you create a delivery stream, configure its destination, and send records to it; Firehose buffers the records and delivers them to the destination.

32
Q

What is Firehose buffer sizing? What are the different buffer kinds?

A

When buffering is enabled, Kinesis Firehose collects incoming data records and stores them in a buffer. The buffer size and buffering interval can be configured based on your requirements. Once the buffer is full or the buffering interval expires, the data records are delivered to the destination.

Buffering helps to improve the reliability and efficiency of data delivery by reducing the number of requests sent to the destination. By batching multiple records together, buffering reduces the overhead of individual record deliveries, which can result in lower costs and better performance.

Kinesis Firehose offers two types of buffering:

Size-based buffering: In size-based buffering, you can specify a maximum buffer size in bytes. Once the buffer reaches the maximum size, Firehose delivers the data to the destination.

Time-based buffering: In time-based buffering, you can specify a maximum buffer time in seconds. Once the buffer has been open for the specified time period, Firehose delivers the data to the destination.
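A sketch of setting both buffer kinds on a hypothetical S3 delivery stream with boto3; delivery fires on whichever threshold is reached first, and the role/bucket ARNs are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="orders-to-s3",  # hypothetical
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-orders-bucket",  # placeholder
        "BufferingHints": {
            "SizeInMBs": 5,            # size-based: flush at 5 MB
            "IntervalInSeconds": 300,  # time-based: flush after 5 minutes
        },
    },
)
```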

33
Q

What is the difference between Kinesis Data Streams and Kinesis Firehose?

A

Kinesis Firehose is designed to load streaming data into destinations like Amazon S3, Amazon Redshift, or Amazon Elasticsearch. It is a fully managed service that takes care of scaling, monitoring, and managing the infrastructure required to deliver the data to the destination.

Kinesis Data Streams, on the other hand, is designed for real-time streaming data processing, with multiple consumers reading from a stream in parallel. Consumers of Kinesis Data Streams process the data using custom applications built with the Kinesis Client Library or AWS Lambda.

While both services deal with streaming data, they serve different use cases and have different architectures. Note that a Kinesis data stream can be used as a source for Kinesis Firehose by configuring Firehose to consume from the stream.

34
Q

Where can you stream CloudWatch logs?

A
  1. Kinesis data stream
  2. Kinesis Data Firehose
  3. AWS Lambda
35
Q

How can you stream CloudWatch logs?

A

By using CloudWatch Logs subscription filters. Currently this can only be set up using the AWS CLI or the API, not the console. A sketch follows.
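A sketch of creating a subscription filter through the API with boto3 (the CLI equivalent is aws logs put-subscription-filter); all names and ARNs are hypothetical, and the role must allow CloudWatch Logs to write to the stream:

```python
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/aws/my-app",  # hypothetical log group
    filterName="errors-to-kinesis",
    filterPattern="ERROR",  # forward only matching events ("" forwards all)
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/log-stream",
    roleArn="arn:aws:iam::123456789012:role/cwlogs-to-kinesis",
)
```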

36
Q

Exam Question

How to use a CloudWatch Logs subscription filter in "near" real time to load data into Amazon Elasticsearch?

A

You have CloudWatch Logs flowing into your subscription filter, which you link to Kinesis Data Firehose because of the near-real-time requirement.
Kinesis Data Firehose can then send the data in near real time into Amazon Elasticsearch, as well as other destinations such as Redshift or Amazon S3. The useful part is that, because we're using Kinesis Data Firehose, we can also integrate a Lambda function to do transformations; that Lambda function can do any kind of data cleaning or enrichment you may want.

37
Q

Exam Question

How to use a CloudWatch Logs subscription filter in "real" time to load data into Amazon Elasticsearch?

A

We still have our subscription filter, but this time we load data directly into a Lambda function, which streams all the records. The Lambda function can use API calls to load data in real time into Amazon Elasticsearch. We've removed Kinesis Data Firehose because it is only near real time, and replaced it with a Lambda function.

37
Q

Exam Question

How to use a CloudWatch Logs subscription filter to do real-time analysis?

A

In this scenario, we can utilize CloudWatch Logs and Subscription Filters to stream data into a Kinesis Data Stream. This approach allows us to take advantage of Kinesis Data Analytics to perform real-time analytics on the stream. The results of the analytics can then be streamed into a Lambda function, which can perform various actions based on the outcome, such as sending alerts. This lecture highlights that CloudWatch Log Subscriptions can be directed to three different destinations.

38
Q

What is the minimum and maximum buffering interval for Kinesis firehose?

A

The minimum buffering interval for Kinesis Firehose is 60 seconds, and the maximum is 900 seconds (15 minutes).

This means that Firehose will wait for at least 60 seconds or until the buffer size reaches the configured size limit (if using size-based buffering) before delivering the data to the destination. If the buffer is not full, Firehose will continue to collect data until the buffering interval expires or the buffer size limit is reached, whichever comes first.

The maximum buffering interval of 900 seconds allows for efficient use of Firehose resources while also ensuring timely delivery of data to the destination. By adjusting the buffer size and buffering interval, you can optimize Firehose performance to meet your specific needs.

39
Q

Key points for AWS SQS?

A
  • Oldest offering in the AWS portfolio, over 10 years old
  • Fully managed service
  • Can scale from 1 message per second to 10,000s per second
  • Default retention of messages is 4 days, maximum retention is 14 days
  • No limit to the number of messages that can be in the queue
  • Low latency (<10ms on publish and receive)
  • Can scale horizontally in terms of the number of consumers
  • Provides at-least-once message delivery, which means there can be duplicate messages occasionally
  • Provides best-effort ordering, which means there can be out-of-order messages
  • Limitation of 256KB per message sent. (important for exam)
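A minimal boto3 round trip illustrating several of the points above (at-least-once delivery, batched receive, explicit delete); the queue name is hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="orders-queue")["QueueUrl"]  # hypothetical

sqs.send_message(QueueUrl=queue_url, MessageBody="order-1")  # body up to 256 KB

resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10
)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # At-least-once delivery: delete after processing, or the message
    # becomes visible again and is redelivered.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```
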
40
Q

Important Exam

What is the difference between AWS SQS and Kinesis Data Streams?

A

Kinesis Data Stream:
- Data can be consumed many times
- Data is deleted after the retention period
- Ordering of records is preserved (at the shard level) even during replays
- Build multiple applications reading from the same stream independently (Pub/sub)
- Offers “Streaming MapReduce” querying capability (Spark, Flink)
- Checkpointing needed to track progress of consumption (ex: KCL with DynamoDB)
- Supports provisioned mode or on-demand mode

SQS:
- Used to decouple applications
- Only one application can access a queue
- Records are deleted after consumption (ack/fail)
- Messages are processed independently for standard queue
- Supports ordering for FIFO queues (decreased throughput)
- Capability to “delay” messages
- Offers dynamic scaling of load (no ops)

41
Q

How can we send large messages to AWS SQS?

A

Instead of sending the actual message, the message can contain an S3 object reference. The receiver then retrieves the object from S3 to get the complete message. (The SQS Extended Client Library for Java automates this pattern; a manual sketch follows.)
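A manual sketch of the S3-pointer pattern; the bucket, queue URL, and message schema are all hypothetical:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-large-payloads"  # hypothetical bucket
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

def send_large(payload: bytes) -> None:
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)  # store the real body in S3
    sqs.send_message(  # the SQS message carries only a pointer
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"s3_bucket": BUCKET, "s3_key": key}),
    )

def receive_large(message_body: str) -> bytes:
    ref = json.loads(message_body)
    return s3.get_object(Bucket=ref["s3_bucket"], Key=ref["s3_key"])["Body"].read()
```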

42
Q

SQS: Maximum of _ _ _ _ _ inflight messages being processed by consumers

A

120,000

43
Q

SQS: Batch Request has a maximum of _ _ _ _ _ _ _ messages of max size _ _ _ _ _ _ _

A

Batch Request has a maximum of 10 messages max 256KB

44
Q

SQS: Message content can be in following format _ _ _ _ _ _ _, _ _ _ _ _ _ _, _ _ _ _ _ _ _

A

Message content is XML, JSON, Unformatted text

45
Q

SQS: Standard queues have an unlimited _ _ _ _ _ _ _

A

SQS: Standard queues have an unlimited transactions per second

46
Q

SQS: FIFO queues support up to _ _ _ _ _ _ _ messages per second (using batching)

A

FIFO queues support up to 3,000 messages per second (using batching)

47
Q

SQS: Max message size is _ _ _ _ _ _ _ (or use Extended Client)

A

Max message size is 256KB (or use Extended Client)

48
Q

SQS: Data retention from _ _ _ _ _ _ _ to _ _ _ _ _ _ _

A

Data retention from 1 minute to 14 days

49
Q

SQS: Pricing: Pay per _ _ _ _ _ _ _, Pay per _ _ _ _ _ _ _ usage

A

Pay per API Request, Pay per network usage

50
Q

Not so important

AWS SQS Security

A
  • Encryption in flight using the HTTPS endpoint
  • Can enable SSE (Server Side Encryption) using KMS
  • Can set the CMK (Customer Master Key) we want to use
  • SSE only encrypts the body, not the metadata (message ID, timestamp, attributes)
  • IAM policy must allow usage of SQS
  • SQS queue access policy
    * Finer grained control over IP
    * Control over the time the requests come in
51
Q

SQS vs Kinesis Use cases

A

SQS use cases:
* Order processing
* Image Processing
* Auto scaling queues according to messages.
* Buffer and Batch messages for future processing.
* Request Offloading

Kinesis Data Streams use cases:
* Fast log and event data collection and processing
* Real Time metrics and reports
* Mobile data capture
* Real Time data analytics
* Gaming data feed
* Complex Stream Processing
* Data Feed from “Internet of Things”

52
Q

What is IoT Message Broker?

A

AWS IoT message broker is a fully-managed publish/subscribe message broker that enables devices and applications to securely send and receive messages from AWS IoT Core. It allows devices to communicate with each other and with cloud-based applications over MQTT, HTTP, and WebSocket protocols. The message broker supports QoS levels 0, 1, and 2, allowing for reliable message delivery. It also provides various security features, such as device authentication, authorization, and encryption, to ensure the security and privacy of messages exchanged over the broker.