Kinesis Flashcards

1
Q

For how long can data be retained in a Kinesis data stream?

A

One day (the default) up to 365 days

2
Q

Is there an ability to replay data in a Kinesis data stream?

A

Yes; consumers can re-read (replay) data as long as it is within the retention period.

3
Q

How many capacity modes are there in Kinesis Data Streams?

A

Two.

Provisioned mode:
1. You choose the number of shards provisioned; scale manually or using the API
2. Each shard gets 1 MB/s in (or 1,000 records per second)
3. Each shard gets 2 MB/s out (shared across classic consumers, or per consumer with enhanced fan-out)
4. You pay per shard provisioned per hour

On-demand mode:
1. No need to provision or manage capacity
2. Default capacity provisioned (4 MB/s in or 4,000 records per second)
3. Scales automatically based on the observed throughput peak during the last 30 days
4. Pay per stream per hour, plus data in/out per GB (a creation sketch for both modes follows)
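A minimal boto3 sketch of creating a stream in each mode; the stream names are hypothetical, and StreamModeDetails assumes a reasonably recent boto3:

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you pick the shard count and scale it yourself.
kinesis.create_stream(
    StreamName="orders-provisioned",  # hypothetical name
    ShardCount=4,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count to manage, capacity scales automatically.
kinesis.create_stream(
    StreamName="orders-on-demand",  # hypothetical name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Manual scaling of a provisioned stream via the API:
kinesis.update_shard_count(
    StreamName="orders-provisioned",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)
```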

4
Q

Where is a Kinesis data stream deployed: per AZ or per Region?

A

Region (data is synchronously replicated across three AZs within the Region)

5
Q

What security features are there for Kinesis Data Streams?

A
  1. Control access / authorization using IAM policies
  2. Encryption in flight using HTTPS endpoints
  3. Encryption at rest using KMS
  4. You can implement encryption/decryption of data on the client side (harder)
  5. VPC endpoints available to access Kinesis from within a VPC
  6. Monitor API calls using CloudTrail
6
Q

Which APIs or technologies can be used to produce to a Kinesis data stream?

A
  1. AWS SDK
  2. Kinesis Producer Library
  3. Kinesis Agent
  4. Apache Spark
  5. Kafka
  6. Third party libraries

The exam expects you to know this list.

7
Q

What is the API to produce records with the Kinesis producer SDK?

A

PutRecord, or PutRecords for multiple records.
Use case: low throughput, low record volume, tolerance for higher latency, simple API, AWS Lambda. A minimal sketch follows.
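A minimal boto3 sketch of both calls, assuming a hypothetical "orders" stream:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Single record: the partition key determines which shard receives it.
kinesis.put_record(
    StreamName="orders",
    Data=json.dumps({"order_id": 1}).encode(),
    PartitionKey="order-1",
)

# Multiple records in one call (up to 500 records per request):
kinesis.put_records(
    StreamName="orders",
    Records=[
        {"Data": json.dumps({"order_id": i}).encode(), "PartitionKey": f"order-{i}"}
        for i in range(10)
    ],
)
```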

8
Q

What is the meaning of ProvisionedThroughputExceededException in the AWS Kinesis API?

A
  1. It happens when the data exceeds the per-shard limits in MB/s or records/sec.
  2. It may happen when you have a hot shard: many records share the same partition key, so they all land on the same shard. For example, if one shard receives car orders and another receives mobile phone orders, but most incoming records are mobile phone orders, that shard becomes hot. Distributing records across the shards resolves this error. Fixes:
    A. Increase the number of shards
    B. Choose a partition key that distributes the data evenly
    C. Retry with exponential backoff (see the sketch below)
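A sketch of fix C, retrying PutRecord with exponential backoff; the helper name and stream are hypothetical:

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(stream, data, partition_key, max_retries=5):
    """Retry PutRecord with exponential backoff when the shard is throttled."""
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName=stream, Data=data, PartitionKey=partition_key
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(0.1 * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after all retries")
```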
9
Q

Very Important

What are the features of the Kinesis Producer Library (KPL)?

A

Batching, compression, retries, synchronous and asynchronous APIs, CloudWatch metrics

  1. Easy-to-use and highly configurable C++ / Java library
  2. Used for building high-performance, long-running producers
  3. Automated and configurable retry mechanism, for handling errors like ProvisionedThroughputExceededException
  4. Synchronous or asynchronous API (better performance for async); when you see "synchronous or asynchronous" in the exam, think of the KPL
  5. Submits metrics to CloudWatch for monitoring
  6. Batching (both mechanisms turned on by default) increases throughput and decreases cost:
    A. Collection: gather records and write to multiple shards in the same PutRecords API call
    B. Aggregation (adds some latency):
    * Capability to store multiple records in one Kinesis record (go beyond the 1,000 records per second per shard limit)
    * Increases payload size and improves throughput (helps maximize the 1 MB/s per shard limit)
  7. Compression must be implemented by the user to make the records smaller
  8. KPL records must be de-aggregated with the KCL or a special helper library
10
Q

What is Kinesis batching?

A

Batching means performing a single action on multiple items instead of repeating the action for each individual item; here the item is a record, and the action is sending it to Kinesis Data Streams. With batching, each HTTP request can carry multiple records instead of just one. Without batching, you would place each record in a separate Kinesis Data Streams record and make one HTTP request per record.

The KPL supports two types of batching:

Aggregation – Storing multiple records within a single Kinesis Data Streams record.

Collection – Using the API operation PutRecords to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream.

The two types of KPL batching are designed to coexist and can be turned on or off independently of one another. By default, both are turned on.

We can influence batching efficiency by introducing some delay with RecordMaxBufferedTime.

11
Q

When should you not use the KPL (Kinesis Producer Library)?

A

The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance, but applications that cannot tolerate this additional delay should not use the KPL and should use the AWS SDK directly.

Example: an IoT sensor that produces to Kinesis Data Streams but is sometimes offline. With the KPL, during an offline event the library keeps accumulating data, and when the device comes back online it may take a long time to transfer all the buffered data to the stream, even though you may only want to act on the latest data.

In that case, instead of using the KPL, it is more sensible to implement the application directly with the SDK call PutRecords: you can choose to discard stale data and send only the latest, most relevant data once the device is back online.

This may come up in the exam

12
Q

What is the Kinesis Data Streams Agent?

A

Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams. The agent continuously monitors a set of files and sends new data to your stream. The agent handles file rotation, checkpointing, and retry upon failures.

  • Write from multiple directories and write to multiple streams
  • Routing feature based on directory / log file
  • Pre-process data before sending to streams (single line, CSV to JSON, log to JSON…)
  • The agent handles file rotation, checkpointing, and retry upon failures
  • Emits metrics to CloudWatch for monitoring

It installs in Linux-based server environments.

13
Q

What is the capacity of a shard for producers and consumers?

A

1 MB/s (or 1,000 records/s) in per shard for producers
2 MB/s out per shard for consumers

14
Q

What are Kinesis consumers?

A
  1. Kinesis SDK
  2. Kinesis Client Library (KCL)
  3. Kinesis Connector Library
  4. Kinesis Firehose
  5. Apache Spark (important for the exam)
  6. AWS Lambda
  7. Kinesis Consumer Enhanced Fan-out
15
Q

What is the API for getting data with the Kinesis consumer SDK?

A

GetRecords can return up to 10 MB of data or up to 10,000 records per call, but a shard can only send data to consumers at 2 MB/s. So if one call returns the full 10 MB, you must wait 5 seconds before the shard can serve the next batch of records.

There is also another limit: you can make up to 5 GetRecords API calls per second per shard, which means a minimum of about 200 ms of latency on your data.

If 5 consumer applications consume from the same shard, each consumer can poll only about once per second and receives less than 400 KB/s (important). A minimal polling sketch follows.
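A minimal boto3 polling sketch against the first shard of a hypothetical stream, pacing calls to respect the 5 GetRecords calls per second per shard limit:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

shards = kinesis.describe_stream(StreamName="orders")["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName="orders",
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp["NextShardIterator"]
    time.sleep(0.2)  # at most 5 GetRecords calls/second/shard
```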

16
Q

What are the main features of the KCL (Kinesis Client Library)?

A

De-aggregation, sharing multiple shards among multiple consumers, and checkpointing each shard's progress using DynamoDB.
The Kinesis Client Library is written in Java, but there are similar libraries for Python, Ruby, Node, and .NET.
1. Reads records from Kinesis produced with the KPL (de-aggregates the records the KPL aggregated)
2. Uses shard discovery to share multiple shards among multiple consumers in one "group"
3. If consumption is interrupted, uses the checkpointing feature to resume
4. Leverages DynamoDB for coordination and checkpointing (one row per shard)
* Make sure you provision enough WCU / RCU
* Or use On-Demand mode for DynamoDB
* Otherwise DynamoDB may slow down the KCL
* If you are getting ExpiredIteratorException, increase the WCU on the DynamoDB table (exam question)
5. Record processors process the data
17
Q

Why is DynamoDB used in the Kinesis Client Library?

A

Amazon DynamoDB is used in the Kinesis Client Library (KCL) to store the state of the processing of the Kinesis data stream. The KCL is a Java library that helps developers consume and process data from Amazon Kinesis streams.

When using the KCL, each shard in the Kinesis data stream is processed by a single worker, which maintains its own processing state. The worker needs to keep track of which records have been processed and which are pending. This processing state (checkpoint) is stored in DynamoDB.

18
Q

What is the Kinesis Connector Library?

A

An older Java library that leverages the KCL under the hood. It writes data to Amazon S3, DynamoDB, Redshift, and Elasticsearch.

The Connector Library must run on an EC2 instance.
Kinesis Firehose has replaced the Connector Library for several of the targets above, and AWS Lambda covers others.

19
Q

Using AWS Lambda for Kinesis?

A

Lambda can read records from a Kinesis data stream, and a small helper library lets the Lambda consumer de-aggregate records produced with the KPL, so you can produce with the KPL and consume with Lambda. Lambda can be used for lightweight ETL, sending data to Amazon S3, DynamoDB, Redshift, Elasticsearch, or anywhere else you can program.

Lambda can also be used to read in real time from Kinesis data streams and trigger notifications, for example sending emails in real time.

20
Q

What is Kinesis Enhanced Fan-Out?

A
  1. Game-changing feature released in August 2018.
  2. Works with KCL 2.0 and AWS Lambda (November 2018)
  3. Each consumer gets 2 MB/s of provisioned throughput per shard. That means 20 consumers get 40 MB/s per shard aggregated. No more shared 2 MB/s limit!
  4. This is possible because Kinesis pushes data to consumers over HTTP/2
  5. Reduced latency (~70 ms); a registration sketch follows
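A sketch of registering an enhanced fan-out consumer with boto3; the stream and consumer names are hypothetical, and in practice KCL 2.x runs the SubscribeToShard loop for you:

```python
import boto3

kinesis = boto3.client("kinesis")

stream_arn = kinesis.describe_stream(StreamName="orders")[
    "StreamDescription"
]["StreamARN"]

# Each registered consumer gets its own 2 MB/s per shard, pushed over HTTP/2.
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="analytics-app",  # hypothetical consumer name
)
print(consumer["Consumer"]["ConsumerARN"])  # used by SubscribeToShard
```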
21
Q

When to use Enhanced Fan-Out vs. Standard Consumers?

A

Standard consumers:
* Low number of consuming applications (1,2,3…)
* Can tolerate ~200 ms latency
* Minimize cost

Enhanced Fan Out Consumers:
* Multiple Consumer applications for the same Stream
* Low Latency requirements ~70ms
* Higher costs (see Kinesis pricing page)
* Default limit of 20 consumers using enhanced fan out per data stream

22
Q

Kinesis Operations - Adding Shards (a.k.a. shard splitting)

A

It is used to increase the capacity of a shard that is getting a lot of data, in other words a hot shard. A single shard has 1 MB/s of ingest capacity, and each split adds another 1 MB/s; e.g., splitting one shard into two gives you 2 MB/s across the two child shards.

The old (parent) shard is closed, and it is deleted once its data has expired. A split sketch follows.
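A minimal boto3 sketch of splitting a shard at the midpoint of its hash key range; the stream name is hypothetical, and the hot shard is assumed to be the first one listed:

```python
import boto3

kinesis = boto3.client("kinesis")

# Look up the hot shard's hash key range, then split it down the middle.
shard = kinesis.describe_stream(StreamName="orders")["StreamDescription"]["Shards"][0]
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])

kinesis.split_shard(
    StreamName="orders",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),  # midpoint -> two equal children
)
```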

23
Q

What is shard merging?

A
  1. Decreases the stream capacity and saves costs
  2. Can be used to group two shards with low traffic
  3. Old shards are closed and deleted once their data expires (a merge sketch follows)
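A minimal boto3 merge sketch; the stream name and shard IDs are hypothetical, and the two shards must cover adjacent hash key ranges:

```python
import boto3

kinesis = boto3.client("kinesis")

# Combine two adjacent low-traffic shards into a single shard.
kinesis.merge_shards(
    StreamName="orders",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```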
24
Q

What is resharding?

A

Resharding is necessary when you need to scale up or down the capacity of your Kinesis data stream. For example, if you need to ingest more data into the stream, you can add more shards to increase the capacity. Conversely, if you have too many shards and don’t need that much capacity, you can remove some shards to reduce costs.

The resharding process involves splitting or merging shards in the data stream. When splitting a shard, a new shard is created, and the data in the original shard is split between the two shards. When merging shards, two smaller shards are combined into one larger shard.

Resharding can be done using the Kinesis API or the AWS Management Console. It is important to note that resharding can impact the ordering of records in the data stream and requires careful planning to avoid data loss or duplication.

25
Q

Exam use case: a producer is sending data to a Kinesis data stream with the correct partition key, yet at some point your consumer receives the data out of order.

A

The reason for that would be resharding.
After a split, you can read from the child shards. However, data the client application hasn't read yet could still be in the parent.
If you start reading from the child shards before finishing the parent's data, you could read data for a particular hash key out of order.

After a reshard, read all data from parent before you start pulling the data from the new shard. This logic should be implemented in the client application.

Note: The Kinesis Client Library (KCL) has this logic already built in, even after resharding operations

26
Q

How do you handle duplicate records sent by producers in a Kinesis data stream? (Exam question)

A

Consider a producer that experiences a network-related timeout after it calls PutRecord, but before it can receive an acknowledgment from Amazon Kinesis Data Streams. The producer cannot be sure whether the record was delivered. Assuming that every record is important to the application, the producer would retry the call with the same data. If both PutRecord calls on that same data were successfully committed to Kinesis Data Streams, there will be two Kinesis Data Streams records. Although the two records have identical data, they have unique sequence numbers.

Applications that need strict guarantees should embed a primary key within the record to remove duplicates later when processing. The consumer should have the logic to detect and deal with duplicate records.

Note that the number of duplicates due to producer retries is usually low compared to the number of duplicates due to consumer retries.

27
Q

How do you handle duplicates on the consumer side for a Kinesis data stream?
(Exam Question)

A

Fixes:
* Make your consumer application idempotent: a property of some operations such that you achieve the same result no matter how many times you execute them (a sketch follows below).
* If the final destination can handle duplicates, it's recommended to handle them there; for example, a database with a unique key will refuse to insert a duplicate record.
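A sketch of the idempotency idea: the producer embeds a unique ID in each record, and the consumer uses a DynamoDB conditional write so reprocessing the same record is a no-op. The table and attribute names are hypothetical:

```python
import json
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-orders")  # hypothetical table

def process(record_data: bytes) -> None:
    item = json.loads(record_data)
    try:
        table.put_item(
            Item=item,
            # Succeeds only the first time this order_id is seen, so a
            # duplicate delivery of the same record is silently skipped.
            ConditionExpression="attribute_not_exists(order_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
```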

28
Q

Kinesis Security?

A
  1. Control access / authorization using IAM policies
  2. Encryption in flight using HTTPS endpoints
  3. Encryption at rest using KMS
  4. Client side encryption must be manually implemented (harder)
  5. VPC Endpoints available for Kinesis to access within VPC
29
Q

What are the key points about Kinesis Firehose?

A
  1. Producers for Kinesis Firehose can be any application or device using the SDK, the KPL, the Kinesis Agent, AWS IoT, CloudWatch, or a Kinesis data stream
  2. A record can be up to 1 MB in size
  3. Kinesis Firehose transforms data using Lambda functions
  4. It writes data in batches, so it is near real time; because of the batching there can be a delay of 60 seconds or more. Kinesis Data Streams, by contrast, is real time
  5. AWS destinations are Amazon S3, Amazon Redshift (COPY through S3), and Amazon Elasticsearch; Splunk is the main third-party destination. It is important to remember these destinations
  6. Other destinations include third-party destinations and custom destinations via HTTP endpoints
  7. Either failed data or all data can be backed up to an S3 bucket
  8. It's a fully managed service
  9. It scales automatically with the data volume
  10. It supports data conversion from JSON to Parquet or ORC (only for S3)
  11. It supports compression when the target is S3
  12. Only GZIP-compressed data can be further loaded into Redshift
  13. You only pay for the amount of data that goes through Firehose
  14. Spark and the KCL do not read from Kinesis Data Firehose; they only read from Kinesis Data Streams. A minimal producer sketch follows
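A minimal boto3 producer sketch for a hypothetical delivery stream; the trailing newline is a common convention so records land line-delimited in S3 (Firehose does not add delimiters itself):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers this record and delivers it in a batch (near real time).
firehose.put_record(
    DeliveryStreamName="orders-to-s3",  # hypothetical delivery stream
    Record={"Data": json.dumps({"order_id": 1}).encode() + b"\n"},
)
```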
30
Q

Where can Kinesis Firehose save data to? (Important for the exam)

A

S3, Splunk, Redshift, and Elasticsearch.
Spark and the KCL do not read from Kinesis Data Firehose (the exam may try to trick you into saying they can); they read from Kinesis Data Streams.

31
Q

What is a Kinesis Data Firehose delivery stream?

A

A delivery stream is the underlying entity of Kinesis Data Firehose: you create a delivery stream, configure its destination, and send records to it; Firehose buffers the records and delivers them to the destination.

32
Q

What is Firehose buffer sizing? What are the different buffer kinds?

A

When buffering is enabled, Kinesis Firehose collects incoming data records and stores them in a buffer. The buffer size and buffering interval can be configured based on your requirements. Once the buffer is full or the buffering interval expires, the data records are delivered to the destination.

Buffering helps to improve the reliability and efficiency of data delivery by reducing the number of requests sent to the destination. By batching multiple records together, buffering reduces the overhead of individual record deliveries, which can result in lower costs and better performance.

Kinesis Firehose offers two types of buffering:

Size-based buffering: In size-based buffering, you can specify a maximum buffer size in bytes. Once the buffer reaches the maximum size, Firehose delivers the data to the destination.

Time-based buffering: In time-based buffering, you can specify a maximum buffer time in seconds. Once the buffer has been open for the specified time period, Firehose delivers the data to the destination.
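A sketch of setting both buffer kinds on a hypothetical S3 delivery stream with boto3; delivery fires on whichever threshold is reached first, and the role/bucket ARNs are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="orders-to-s3",  # hypothetical
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-orders-bucket",  # placeholder
        "BufferingHints": {
            "SizeInMBs": 5,            # size-based: flush at 5 MB
            "IntervalInSeconds": 300,  # time-based: flush after 5 minutes
        },
    },
)
```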

33
Q

What is the difference between Kinesis Data Streams and Kinesis Firehose?

A

Kinesis Firehose is designed to load streaming data into destinations like Amazon S3, Amazon Redshift, or Amazon Elasticsearch. It is a fully managed service that takes care of scaling, monitoring, and managing the infrastructure required to deliver the data to the destination.

Kinesis Data Streams, on the other hand, is designed for real-time streaming data processing, with multiple consumers reading from a stream in parallel. Consumers of Kinesis Data Streams process the data using custom applications built with the Kinesis Client Library or AWS Lambda.

While both services deal with streaming data, they serve different use cases and have different architectures. Note that a Kinesis data stream can be used as a source for Kinesis Firehose by configuring Firehose to consume from the stream.

34
Q

Where can you stream CloudWatch logs?

A
  1. Kinesis data stream
  2. Kinesis Data Firehose
  3. AWS Lambda
35
Q

How can you stream CloudWatch logs?

A

By using CloudWatch Logs subscription filters. Currently this can only be set up using the AWS CLI or the API, not the console. A sketch follows.
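A sketch of creating a subscription filter through the API with boto3 (the CLI equivalent is aws logs put-subscription-filter); all names and ARNs are hypothetical, and the role must allow CloudWatch Logs to write to the stream:

```python
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/aws/my-app",  # hypothetical log group
    filterName="errors-to-kinesis",
    filterPattern="ERROR",  # forward only matching events ("" forwards all)
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/log-stream",
    roleArn="arn:aws:iam::123456789012:role/cwlogs-to-kinesis",
)
```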

36
Q

Exam Question

How to use a CloudWatch Logs subscription filter in "near" real time to load data into Amazon Elasticsearch?

A

You have CloudWatch Logs flowing into your subscription filter, which you link to Kinesis Data Firehose because of the near-real-time requirement.
Kinesis Data Firehose can then send the data in near real time into Amazon Elasticsearch, as well as other destinations such as Redshift or Amazon S3. The useful part is that, because we're using Kinesis Data Firehose, we can also integrate a Lambda function to do transformations; that Lambda function can do any kind of data cleaning or enrichment you may want.

37
Q

Exam Question

How to use a CloudWatch Logs subscription filter in "real" time to load data into Amazon Elasticsearch?

A

We still have our subscription filter, but this time we load data directly into a Lambda function, which streams all the records. The Lambda function can use API calls to load data in real time into Amazon Elasticsearch. We've removed Kinesis Data Firehose because it is only near real time, and replaced it with a Lambda function.

37
Q

Exam Question

How to use a CloudWatch Logs subscription filter to do real-time analysis?

A

In this scenario, we can utilize CloudWatch Logs and Subscription Filters to stream data into a Kinesis Data Stream. This approach allows us to take advantage of Kinesis Data Analytics to perform real-time analytics on the stream. The results of the analytics can then be streamed into a Lambda function, which can perform various actions based on the outcome, such as sending alerts. This lecture highlights that CloudWatch Log Subscriptions can be directed to three different destinations.

38
Q

What is the minimum and maximum buffering interval for Kinesis firehose?

A

The minimum buffering interval for Kinesis Firehose is 60 seconds, and the maximum is 900 seconds (15 minutes).

This means that Firehose will wait for at least 60 seconds or until the buffer size reaches the configured size limit (if using size-based buffering) before delivering the data to the destination. If the buffer is not full, Firehose will continue to collect data until the buffering interval expires or the buffer size limit is reached, whichever comes first.

The maximum buffering interval of 900 seconds allows for efficient use of Firehose resources while also ensuring timely delivery of data to the destination. By adjusting the buffer size and buffering interval, you can optimize Firehose performance to meet your specific needs.

39
Q

Key points for AWS SQS?

A
  • Oldest offering in the AWS portfolio, over 10 years old
  • Fully managed service
  • Can scale from 1 message per second to 10,000s per second
  • Default retention of messages is 4 days, maximum retention is 14 days
  • No limit to the number of messages that can be in the queue
  • Low latency (<10ms on publish and receive)
  • Can scale horizontally in terms of the number of consumers
  • Provides at-least-once message delivery, which means there can be duplicate messages occasionally
  • Provides best-effort ordering, which means there can be out-of-order messages
  • Limitation of 256KB per message sent. (important for exam)
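A minimal boto3 round trip illustrating several of the points above (at-least-once delivery, batched receive, explicit delete); the queue name is hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="orders-queue")["QueueUrl"]  # hypothetical

sqs.send_message(QueueUrl=queue_url, MessageBody="order-1")  # body up to 256 KB

resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10
)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # At-least-once delivery: delete after processing, or the message
    # becomes visible again and is redelivered.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```
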
40
Q

Important Exam

What is the difference between AWS SQS and Kinesis Data Streams?

A

Kinesis Data Stream:
- Data can be consumed many times
- Data is deleted after the retention period
- Ordering of records is preserved (at the shard level) even during replays
- Build multiple applications reading from the same stream independently (Pub/sub)
- Offers “Streaming MapReduce” querying capability (Spark, Flink)
- Checkpointing needed to track progress of consumption (ex: KCL with DynamoDB)
- Supports provisioned mode or on-demand mode

SQS:
- Used to decouple applications
- Only one application can access a queue
- Records are deleted after consumption (ack/fail)
- Messages are processed independently for standard queue
- Supports ordering for FIFO queues (decreased throughput)
- Capability to “delay” messages
- Offers dynamic scaling of load (no ops)

41
Q

How can we send large messages to AWS SQS?

A

Instead of sending the actual message, the message can contain an S3 object reference. The receiver then retrieves the object from S3 to get the complete message. (The SQS Extended Client Library for Java automates this pattern; a manual sketch follows.)
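A manual sketch of the S3-pointer pattern; the bucket, queue URL, and message schema are all hypothetical:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-large-payloads"  # hypothetical bucket
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

def send_large(payload: bytes) -> None:
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)  # store the real body in S3
    sqs.send_message(  # the SQS message carries only a pointer
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"s3_bucket": BUCKET, "s3_key": key}),
    )

def receive_large(message_body: str) -> bytes:
    ref = json.loads(message_body)
    return s3.get_object(Bucket=ref["s3_bucket"], Key=ref["s3_key"])["Body"].read()
```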

42
Q

SQS: Maximum of _ _ _ _ _ inflight messages being processed by consumers

A

120,000

43
Q

SQS: Batch Request has a maximum of _ _ _ _ _ _ _ messages of max size _ _ _ _ _ _ _

A

Batch Request has a maximum of 10 messages max 256KB

44
Q

SQS: Message content can be in following format _ _ _ _ _ _ _, _ _ _ _ _ _ _, _ _ _ _ _ _ _

A

Message content is XML, JSON, Unformatted text

45
Q

SQS: Standard queues have an unlimited _ _ _ _ _ _ _

A

SQS: Standard queues have an unlimited transactions per second

46
Q

SQS: FIFO queues support up to _ _ _ _ _ _ _ messages per second (using batching)

A

FIFO queues support up to 3,000 messages per second (using batching)

47
Q

SQS: Max message size is _ _ _ _ _ _ _ (or use Extended Client)

A

Max message size is 256KB (or use Extended Client)

48
Q

SQS: Data retention from _ _ _ _ _ _ _ to _ _ _ _ _ _ _

A

Data retention from 1 minute to 14 days

49
Q

SQS: Pricing: Pay per _ _ _ _ _ _ _, Pay per _ _ _ _ _ _ _ usage

A

Pay per API Request, Pay per network usage

50
Q

Not so important

AWS SQS Security

A
  • Encryption in flight using the HTTPS endpoint
  • Can enable SSE (Server Side Encryption) using KMS
  • Can set the CMK (Customer Master Key) we want to use
  • SSE only encrypts the body, not the metadata (message ID, timestamp, attributes)
  • IAM policy must allow usage of SQS
  • SQS queue access policy
    * Finer grained control over IP
    * Control over the time the requests come in
51
Q

SQS vs Kinesis Use cases

A

SQS use cases:
* Order processing
* Image Processing
* Auto scaling queues according to messages.
* Buffer and Batch messages for future processing.
* Request Offloading

Kinesis Data Streams use cases:
* Fast log and event data collection and processing
* Real Time metrics and reports
* Mobile data capture
* Real Time data analytics
* Gaming data feed
* Complex Stream Processing
* Data Feed from “Internet of Things”

52
Q

What is IoT Message Broker?

A

AWS IoT message broker is a fully-managed publish/subscribe message broker that enables devices and applications to securely send and receive messages from AWS IoT Core. It allows devices to communicate with each other and with cloud-based applications over MQTT, HTTP, and WebSocket protocols. The message broker supports QoS levels 0, 1, and 2, allowing for reliable message delivery. It also provides various security features, such as device authentication, authorization, and encryption, to ensure the security and privacy of messages exchanged over the broker.