Section 19: AWS Integration & Messaging: SQS, SNS & Kinesis: Kinesis Flashcards

Question

Let's talk about Kinesis Data Streams security: 1. Can you implement encryption/decryption of data on the client side? 2. Are VPC endpoints available for Kinesis to access within a VPC? 3. How can you monitor API calls?

Answer 1

1. You can implement encryption/decryption of data on the client side, but it is harder. 2. VPC endpoints are available for Kinesis to access within a VPC. 3. You can monitor API calls using CloudTrail.

Answer 2

put data records into data streams

Answer 3

* AWS SDK (simple producer) * Kinesis Producer Library (KPL): C++, Java, batch, compression, retries * Kinesis Agent (monitor log files)

Answer 4

1MB/sec or 1000 records/sec per shard
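A rough sizing sketch based on the two per-shard write limits above. The function name and the example traffic numbers are made up for illustration; the only facts used are the 1 MB/sec and 1000 records/sec limits.

```python
import math

# Per-shard write limits quoted in the answer above.
MB_PER_SEC_PER_SHARD = 1.0
RECORDS_PER_SEC_PER_SHARD = 1000


def shards_needed(ingest_mb_per_sec: float, records_per_sec: float) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(ingest_mb_per_sec / MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    # Whichever limit is hit first decides the shard count; never go below 1.
    return max(1, by_bytes, by_records)


# Hypothetical workloads: one bound by bytes, one bound by record count.
print(shards_needed(3.5, 2000))
print(shards_needed(0.5, 4500))
```

Small records tend to hit the records/sec limit first, large records the MB/sec limit.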

Answer 5

* PutRecord API * PutRecords API - this one will help reduce costs and increase throughput
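A minimal sketch of why PutRecords helps: many records travel in one request instead of one request each. It only shows the batching step, assuming the limit of 500 records per PutRecords request; the helper name `batches` is hypothetical, and the actual send (for example with an SDK client) is left out.

```python
from typing import Iterable, Iterator, List

# PutRecords accepts at most 500 records per request.
MAX_RECORDS_PER_CALL = 500


def batches(records: Iterable[bytes], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[bytes]]:
    """Yield lists of at most `size` records, preserving order."""
    batch: List[bytes] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Send the final, partially filled batch as well.
        yield batch


# 1201 records become three requests instead of 1201 PutRecord calls.
sizes = [len(b) for b in batches([b"x"] * 1201)]
print(sizes)
```

Each yielded batch would then go out as a single PutRecords request.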

Answer 6

The SubscribeToShard API is a high-performance streaming API that pushes data from shards to consumers over a persistent connection without a request cycle from the client. The SubscribeToShard API uses the HTTP/2 protocol to deliver data to registered consumers whenever new data arrives on the shard, typically within 70 milliseconds, offering approximately 65% faster delivery compared to the GetRecords API. The consumers will enjoy fast delivery even when multiple registered consumers are reading from the same shard.

Answer 7

* Enhanced fan-out is an optional feature for Kinesis Data Streams consumers that provides logical 2 MB/second throughput pipes between consumers and shards. This allows you to scale the number of consumers reading from a data stream in parallel, while maintaining high performance. * You should use enhanced fan-out if you have, or expect to have, multiple consumers retrieving data from a stream in parallel, or if you have at least one consumer that requires the use of the SubscribeToShard API to provide sub-200 millisecond data delivery speeds between producers and consumers.

Answer 8

1. You might be seeing it if traffic from your producers more than doubled the previous peak within a 15-minute window. 2. Add more shards, then retry the throttled requests. 3. Use retries with exponential backoff. 4. Use a highly distributed partition key.

Answer 9

Key distribution * An indexed storage system will use keys to determine where it should store, or look for, associated data. One strategy to optimize access to data is to spread that data out over a number of storage locations. As a result, because each location should have less data, it will be faster to find something within a given collection. Consider having one million records: searching for a given record will, in the worst case, require the system to look at one million different elements to find what was asked for. On the other hand, if those one million records were divided into one-thousand groups, being able to limit your search to one group would reduce that worst-case to only one thousand elements. * The relative probability that a key will direct a search to a given location is known as the "key distribution". An even distribution means that, for any given key, the probability of being directed to any location is as likely as another and finding data will therefore be more efficient.
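To make the idea of an even versus uneven key distribution concrete, here is a small sketch that hashes keys into a fixed number of storage locations and counts how many land in each. CRC32 is used only as a convenient stand-in hash, and the key names and bucket count are made up.

```python
import zlib
from collections import Counter
from typing import Iterable


def bucket_for(key: str, buckets: int) -> int:
    """Map a key to one of `buckets` locations using a stand-in hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % buckets


def distribution(keys: Iterable[str], buckets: int) -> Counter:
    """Count how many keys are directed to each location."""
    return Counter(bucket_for(k, buckets) for k in keys)


# One repeated key: every record is directed to the same location (a hot spot).
skewed = distribution(["device-1"] * 1000, 4)

# Many distinct keys: records spread out over the locations.
spread = distribution([f"device-{i}" for i in range(1000)], 4)

print(dict(skewed))
print(dict(spread))
```

With a single repeated key, one location holds all 1000 records; with distinct keys the counts are spread over the locations.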

Answer 10

Retries with exponential backoff is a technique that retries an operation with an exponentially increasing wait time between attempts (the exponential backoff), until a maximum retry count has been reached. https://learn.microsoft.com/en-us/dotnet/architecture/microservices/implement-resilient-applications/implement-retries-exponential-backoff (everything else is from aws and steph's course)
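A minimal sketch of the technique. The function name, default delays, and the random jitter are my own assumptions, not anything from the course or a specific SDK; the `sleep` parameter exists only so the waiting can be swapped out when trying the code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with a doubling wait cap."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                # Maximum retry count reached: give up and surface the error.
                raise
            # Wait cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Assumption: random jitter so many clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

A throttled write would be wrapped as `retry_with_backoff(lambda: send(batch))`, where `send` is whatever function performs the request.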

Answer 11

* lambda * kinesis data firehose * kinesis data analytics * custom consumer (aws sdk) - Classic or Enhanced Fan Out * Kinesis Client Library (KCL) - Library to simplify reading from a data stream * apps with an ec2 symbol was also one of the slides, but not another

Answer 12

get data records from data streams and process them

Answer 13

Shared (Classic) Fan-out and Enhanced Fan-out consumer

Answer 14

* With Shared (Classic) Fan-Out Consumer you get 2 MB/sec per shard across all consumers. That means if four consumers are getting records from Shard 1, then each consumer might be taking up 0.5 MB/sec throughput (it was not mentioned whether throughput was distributed evenly, i just made up those numbers for the purpose of illustration). * With Enhanced Fan-Out consumers you get 2 MB/sec per consumer per shard. That means if four consumers are trying to get records from Shard 1, each consumer might be taking up 2 MB/sec throughput, and together the four consumers might be using 8 MB/sec throughput.
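The arithmetic above as a tiny sketch. The even split between shared consumers is the same made-up assumption as in the note, kept only for illustration.

```python
# Read throughput per shard quoted above.
SHARD_READ_MB_PER_SEC = 2.0


def shared_per_consumer(consumers: int) -> float:
    """Shared (classic): 2 MB/sec per shard, divided among all consumers.

    Assumes an even split, which is an illustration only.
    """
    return SHARD_READ_MB_PER_SEC / consumers


def enhanced_total(consumers: int) -> float:
    """Enhanced fan-out: each consumer gets its own 2 MB/sec per shard."""
    return SHARD_READ_MB_PER_SEC * consumers


print(shared_per_consumer(4))  # MB/sec available to each of four shared consumers
print(enhanced_total(4))       # MB/sec used in total by four enhanced consumers
```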

Answer 15

For Shared (Classic) Fan-out Consumer the API is GetRecords(), but for Enhanced Fan-out Consumer it's SubscribeToShard().

Answer 16

Shared (Classic) Fan-Out Consumer

Answer 17

It's the kind where consumers poll data from Kinesis using GetRecords API.

Answer 18

The kind where kinesis pushes data to consumers over HTTP/2 using SubscribeToShard API.

Answer 19

1. yes 2. lambda will retry until it succeeds or data expires.

Answer 20

1. Yes 2. Yes

Answer 21

1. AWS Lambda 2. DynamoDB

Answer 22

A java library that helps read records from a Kinesis Data Stream with distributed applications sharing the read workload.

Answer 23

no it sounds like both java and python have packages and libraries. It seems like libraries tend to have maybe a ton of packages in them? or a bunch of smaller pieces of code anyway.

Answer 24

No. It means that two shards could be read by the same KCL instance, but that each shard will only get read by one instance. Just look at the picture.
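Since the picture isn't here, a toy sketch of the rule: one instance may hold several shards, but every shard belongs to exactly one instance. The round-robin assignment and the names are assumptions for illustration only; this is not how the KCL actually balances work.

```python
from typing import Dict, List


def assign_shards(shards: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Give every shard to exactly one worker; a worker may hold several shards."""
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    for i, shard in enumerate(shards):
        # Round-robin is an illustrative choice, not the KCL's real algorithm.
        assignment[workers[i % len(workers)]].append(shard)
    return assignment


shards = ["shard-1", "shard-2", "shard-3", "shard-4"]

# Two instances, four shards: each instance reads two shards.
print(assign_shards(shards, ["kcl-a", "kcl-b"]))

# Six instances, four shards: two instances have nothing to read.
print(assign_shards(shards, [f"kcl-{n}" for n in range(6)]))
```

The second call shows why running more instances than shards does not add read capacity.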

Answer 25

EC2, Elastic Beanstalk, on-premises

Answer 26

1. yes 2. yes, IAM access

Answer 27

1. shared consumer 2. supports shared and enhanced fan out consumer

Answer 28

1. Well, i think stream capacity is the number of shards in your stream. 2. You can increase it by shard splitting (a Kinesis operation). When you split a shard, you turn one "hot shard" into two shards, and each of the new shards gets its own 1 MB/s of incoming data. You can't split a shard into more than two shards in a single operation. There is no automatic scaling - this happens manually.
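A sketch of what a split does to capacity, assuming each shard owns a contiguous range of the 128-bit hash key space and that the split point is the middle of that range (one reasonable choice; you can pick a different point). The function name is hypothetical.

```python
from typing import Tuple

# Hash keys are 128-bit values, so the full space runs from 0 to 2**128 - 1.
MAX_HASH_KEY = 2**128 - 1

HashRange = Tuple[int, int]


def split_range(start: int, end: int) -> Tuple[HashRange, HashRange]:
    """Split one shard's hash key range into two adjacent child ranges."""
    if start >= end:
        raise ValueError("range is too small to split")
    # Assumption: split at the midpoint so each child covers about half the keys.
    new_start = (start + end) // 2 + 1
    return (start, new_start - 1), (new_start, end)


# One hot shard covering the whole key space becomes two shards,
# each with its own 1 MB/s of incoming capacity.
first, second = split_range(0, MAX_HASH_KEY)
print(first)
print(second)
```

Only two children come out of one call, matching the "no more than two shards per operation" rule.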

Answer 29

* You can decrease stream capacity by merging shards (shard merging is another kinesis operation). * you can merge two shards with low traffic (shards with low traffic are called cold shards, btw). You can't merge more than two shards in a single operation. * old shards are closed and will be deleted once the data is expired.

Answer 30

* Apps * clients * sdks * kinesis agents * AWS IoT * Amazon CloudWatch (Logs and Events) * Kinesis Data Streams

Answer 31

A destination is the data store where your data will be delivered. Kinesis Data Firehose currently supports Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Datadog, NewRelic, Dynatrace, Sumo Logic, LogicMonitor, MongoDB, and HTTP End Point as destinations.

Answer 32

Streaming ETL is the processing and movement of real-time data from one place to another. ETL is short for the database functions extract, transform, and load. Extract refers to collecting data from some source. Transform refers to any processes performed on that data. Load refers to sending the processed data to a destination, such as a warehouse, a datalake, or an analytical tool.

Answer 33

A delivery stream is the underlying entity of Kinesis Data Firehose. You use Firehose by creating a delivery stream and then sending data to it. You can create a Kinesis Data Firehose delivery stream through the Firehose Console or the CreateDeliveryStream operation. For more information, see Creating a Delivery Stream.

Answer 34

* A record is the data of interest your data producer sends to a delivery stream. The maximum size of a record (before Base64-encoding) is 1024 KB if your data source is Direct PUT or Kinesis Data Streams. The maximum size of a record (before Base64-encoding) is 10 MB if your data source is Amazon MSK. * Okay, this is possibly noteworthy. Steph says that the max size of a record is 1 MB. The aws docs say the max size of a record (within the parameters mentioned above) is 1024 KB. Strictly, 1 MB is 1000 KB in decimal units, while 1024 KB is 1 MiB in binary units, so the two figures differ by about 2%. So I'm thinking maybe he rounded here, and am wondering where else he might have rounded. If you see this on the exam, either could be the intended answer; maybe try 1024 KB first. https://aws.amazon.com/kinesis/data-firehose/faqs/

Answer 35

batch writes.

Answer 36

AWS Lambda/lambda functions.

Answer 37

You pay for data going through Firehose.

Answer 38

No, it works in near real time. It has 60 second latency minimum for non full batches, or a minimum of 1MB of data at a time (you have the option)
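A toy sketch of that "size or time, whichever comes first" buffering, using the 1 MB and 60 second figures from the answer as defaults. The class and its names are made up, and as a simplification it only checks the timer when a new record arrives, unlike a real timer-driven flush.

```python
from typing import List, Optional


class Buffer:
    """Collect records and flush when a size or age threshold is reached."""

    def __init__(self, max_bytes: int = 1_000_000, max_seconds: float = 60.0) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records: List[bytes] = []
        self.size = 0
        self.opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record; return the flushed batch if a threshold was hit."""
        if self.opened_at is None:
            self.opened_at = now  # the batch clock starts with its first record
        self.records.append(record)
        self.size += len(record)
        full = self.size >= self.max_bytes
        old = now - self.opened_at >= self.max_seconds
        return self.flush() if (full or old) else None

    def flush(self) -> List[bytes]:
        """Hand over the current batch and start an empty one."""
        batch, self.records, self.size, self.opened_at = self.records, [], 0, None
        return batch
```

A busy stream flushes on size well before the interval; a quiet one waits out the interval, which is where the near-real-time latency comes from.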

Answer 39

My guess is you'd use Kinesis Data Streams when you're like, "no, i need real time (~200ms) processing/movement/whatever and I don't care if I have to write custom code for my producers/consumers to get that real time stuff. and yes I need data storage for 1 to 365 days and replay capability". And then you'd use Kinesis Data Firehose when you're like "actually i don't care if processing/storage happens in about ~200 ms, I'm perfectly happy with that processing/transforming/storage stuff taking about 60 seconds and I do like that firehose is aws fully managed and yes I will gladly pay the extra cost for all that. And also I don't need data storage or replay capability".

Answer 40

Nope. 1 is actually Kinesis Data Streams, and 2 is actually Kinesis Data Firehose.

Answer 41

1. sources 2. sinks (eye roll. like they couldn't just use the same term)

Answer 42

1. Kinesis Data Streams 2. Kinesis Data Firehose

Answer 43

1. SQL Statements 2. Reference Data from S3 (used to enrich streaming data - note that i'm not 100% sure you *have* to add the reference data from S3 and a hands on would probably clear that up)

Answer 44

Okay, unfortunately aws is confusing about this. 1. Per steph's slides, it looks like Kinesis Data Analytics for SQL Applications can connect directly to sink options Kinesis Data Streams (which can connect directly to aws lambda or applications) and Kinesis Data Firehose (which can connect to s3 or redshift-copy-through-s3 or other firehose destinations). 2. *However* the aws doc https://docs.aws.amazon.com/kinesisanalytics/latest/dev/what-is.html is not clearly written. In one place it says what steph says above. But in the next paragraph it says that "Kinesis Data Analytics supports Amazon Kinesis Data Firehose (Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk), AWS Lambda, and Amazon Kinesis Data Streams as destinations.". It's possible that the secondary paragraph, the one in quotations above, is referring to Kinesis Data Analytics *as a whole* and the thing steph and aws agree on is specific to Kinesis Data Analytics for SQL Applications? But then again, that quote is from a page *titled* Kinesis Data Analytics for SQL Applications. Good luck figuring that out.

Answer 45

1. A: With Amazon Kinesis Data Analytics for SQL Applications, you can process and analyze streaming data using standard SQL (first line from an aws doc). B: real time analytics on Kinesis Data Streams and Firehose using SQL (steph slide) 2. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics.

Answer 46

1. you'd use Kinesis Data Streams as an output when you want to create streams out of the real-time analytics queries 2. you'd use Kinesis Data Firehose as an output when you want to send analytics query results to destinations

Answer 47

1. fully managed, no servers to provision 2. automatic scaling 3. pay for actual consumption rate

Answer 48

Amazon Managed Service for Apache Flink

Answer 49

Kinesis Data Analytics for Apache Flink

Answer 50

Stateful Computations over Data Streams Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale. source: https://flink.apache.org/, since this quote didn't come from steph or aws.

Answer 51

It uses Flink (Java, Scala or SQL) to process and analyze streaming data.

Answer 52

* Kinesis Data Streams * Amazon MSK (Amazon Managed Streaming for Apache Kafka) * another slide says that Flink does not read from Firehose, but that you can use Kinesis Analytics for SQL instead.

Answer 53

Yes. Apache Flink applications run on a managed aws cluster (hope i'm getting this wording correct)

Answer 54

* provisioning compute resources, parallel computation, automatic scaling * application backups (implemented as checkpoints and snapshots) * use any apache flink programming features

Answer 55

Yes. application backups are implemented as checkpoints and snapshots.

Answer 56

* Send using a "Partition Key" value of the "truck_id". * The same key will always go to the same shard.
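Why the same key always lands on the same shard: Kinesis takes an MD5 hash of the partition key, which gives a 128-bit number, and each shard owns a range of those numbers. The sketch below assumes the shards split the hash space into equal ranges, which is a simplification; the function name and truck IDs are made up.

```python
import hashlib

# MD5 produces a 128-bit value, so hash keys fall in [0, 2**128).
HASH_SPACE = 2**128


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains this key.

    Assumes `shard_count` shards with equal-sized, contiguous hash ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# Every record for one truck maps to the same shard, so its order is kept.
for truck_id in ["truck_1", "truck_2", "truck_1", "truck_3", "truck_1"]:
    print(truck_id, "-> shard", shard_for(truck_id, 3))
```

The hash is deterministic, so repeated sends with "truck_1" always give the same shard index.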

Answer 57

For SQS standard, there is no ordering.

Answer 58

For SQS FIFO, if you don't use a Group ID, messages are consumed in the order they are sent, with only one consumer (i think they mean that only one consumer is used with this method).

Answer 59

If you want to scale out the number of consumers and you want the messages to be grouped when they are related to each other, then you use a Group ID (similar to Partition Key in Kinesis).
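A toy sketch of what a Group ID buys you: messages are ordered within their own group, and different groups can go to different consumers. The function and the sample messages are made up; no SQS calls are involved.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_in_order(messages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split (group_id, body) pairs into per-group lists, keeping send order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for group_id, body in messages:
        groups[group_id].append(body)
    return dict(groups)


# Messages in the order they were sent, interleaved across two groups.
sent = [
    ("truck_1", "left depot"),
    ("truck_2", "left depot"),
    ("truck_1", "arrived"),
    ("truck_2", "refuelled"),
]

# Each group could be handed to its own consumer; order inside a group is preserved.
for group_id, bodies in group_in_order(sent).items():
    print(group_id, bodies)
```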

Answer 60

1. 20 2. yes 3. 5 4. 5 MB/s

Answer 61

1. one 2. 100 3. 100 (due to the 100 group IDs) 4. 3000 if using batching

Answer 62

1. SQS 2. SNS 3. Kinesis

(94 cards)