Kinesis Flashcards

1
Q

Kinesis

A

a managed alternative to Apache Kafka

a big data streaming tool which allows you to collect application logs, metrics, IoT, clickstream, basically anything that is real-time big data.

Overall, Kinesis is associated with real-time big data. (exam)

2
Q

compatible with

A

many stream-processing frameworks; you may have heard of Apache Spark, Apache NiFi, etc.

Basically these are frameworks allowing you to perform computations in real time on data that arrives through a stream.

3
Q

data replication

A

the data is automatically replicated to 3 Availability Zones.

4
Q

three sub-Kinesis products

A
  1. Kinesis Streams: what people often just call Kinesis; it's how you ingest streams at scale with low latency. (exam)
  2. Kinesis Analytics: to perform real-time analytics on streams using SQL, perform filters, computation, aggregations in real-time
  3. Firehose: load your stream into other parts of AWS such as S3, Redshift, ElasticSearch and so on.
5
Q

How do we get data into the streams?

A

Our clickstreams, IoT devices, metrics, and logs produce data directly into our Kinesis streams.

6
Q

Kinesis Analytics

A

Once we have the data in Kinesis Streams, we want to process it: perform computations, metrics, monitoring, alerting, whatever you want. For this you need to perform some computation in real time, which is what Kinesis Analytics is for.

7
Q

Kinesis Firehose

A

Once these computations are done, it's good to have the results stored somewhere: S3, a database, Redshift, et cetera.
Kinesis Firehose is how you deliver them there, for example into S3 or Redshift.

8
Q

Streams are divided into

A

ordered Shards or Partitions.

9
Q

Shard

A

Think of it as one little queue.

We have our producers, and they produce
to a Kinesis stream; maybe this one has three shards.
The data goes into one of the shards, and the consumers consume from any of the shards as well.

10
Q

if we want to scale up our stream

A

we just add shards

If we wanted higher throughput, we would increase the number of shards.

11
Q

in a shard, the data is not there forever

A

By default, the data is retained for one day. We can configure each shard to keep the data for up to 7 days.
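A minimal sketch of raising retention with the AWS SDK for Python (boto3); the stream name is a placeholder:

```python
import boto3

kinesis = boto3.client("kinesis")

# Raise retention from the 24-hour default to the 7-day maximum (168 hours).
# "my-stream" is a hypothetical stream name.
kinesis.increase_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=168,
)
```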

12
Q

why would you have such short data retention?

A

Because Kinesis is just a massive highway, a massive pipe. You want to process your data, do something with it, and put it somewhere else as soon as possible.

13
Q

difference with SQS

A

Kinesis is also awesome because it allows you to reprocess and replay data. With SQS, once the data was consumed, it was gone.

With Kinesis the data is still there; it only expires after some time.

14
Q

multiple consumers

A

You're also able to have multiple applications consume

the same stream, somewhat like an SNS sort of mindset.

We just need one stream with a stream of data,
and we can have many applications, many consumers, consume that same stream.

15
Q

Kinesis is not a database

A

Once the data is inserted into Kinesis, you cannot delete it.

This is called immutability. Kinesis behaves like a log: you add data over time and then you process it using consumers.

The data stays in Kinesis for one to seven days, during which you do something with it.

16
Q

shards size

A

Streams are made of many shards.

A shard represents one megabyte per second or 1,000 messages per second on the write side. So a producer can write up to 1,000 messages per second
or one megabyte per second per shard.

On the read side you have two megabytes per second
of throughput per shard.
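As a quick back-of-the-envelope sketch (the traffic figures below are made up), the shard count on the write side is driven by whichever per-shard limit you hit first:

```python
import math

# Hypothetical producer traffic
records_per_second = 5000        # messages/s across all producers
bytes_per_second = 3 * 1024**2   # 3 MB/s total payload

# Per-shard write limits: 1 MB/s or 1,000 records/s
shards_for_records = math.ceil(records_per_second / 1000)
shards_for_bytes = math.ceil(bytes_per_second / (1024**2))

shards_needed = max(shards_for_records, shards_for_bytes)
print(shards_needed)  # 5 shards; the record-count limit is the bottleneck here
```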

17
Q

you’re going to pay for

A

how many shards you provisioned. You can have as many shards as you want, but if you over-provision your shards and don't use them to their full capacity, you will overpay. Similarly, if you have more throughput than your shards can handle, you will have throughput issues.

18
Q

batch

A

You have the ability to batch messages and calls.

And this allows you to efficiently push messages into Kinesis.

to reduce costs and increase throughput.
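A minimal sketch of batching with the SDK's PutRecords call (stream name and payloads are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Batch several messages into a single PutRecords call to cut per-request
# overhead; each record still carries its own partition key.
events = [{"user_id": f"user-{i}", "action": "click"} for i in range(10)]

response = kinesis.put_records(
    StreamName="my-stream",  # placeholder
    Records=[
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": e["user_id"]}
        for e in events
    ],
)
# Throttled records are reported here and should be retried.
print(response["FailedRecordCount"])
```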

19
Q

the records will be ordered

A

per shard

20
Q

resharding

A

adding shards (shard splitting) to increase the stream's capacity

21
Q

merging

A

removing a shard by combining two adjacent shards into one, decreasing the stream's capacity
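With the SDK these two operations map to SplitShard and MergeShards; a hedged sketch (the stream name, shard IDs, and hash key are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Split one hot shard into two (resharding / shard splitting).
kinesis.split_shard(
    StreamName="my-stream",               # placeholder
    ShardToSplit="shardId-000000000000",  # placeholder shard ID
    NewStartingHashKey="170141183460469231731687303715884105728",  # midpoint of the hash key range
)

# Merge two adjacent, under-used shards back into one.
kinesis.merge_shards(
    StreamName="my-stream",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```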

22
Q

producers sending data

A

You need to send data along with a partition key. Your data is the gray box and the message key is the orange box,
and the message key is whatever string you want.

This key gets hashed to determine the shard ID.
So the key is basically a way for you to route the data
to a specific shard.

The same key always goes to the same partition.

If you want all your data in order for a given key, you just provide that key on every data point, and they will be in order for you.

When your data is produced, the messages know which shard to go to because of the message key.
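A minimal producer sketch with boto3 (stream name and payload are placeholders): Kinesis hashes the partition key to pick the shard, so every record with the same key lands on the same shard, in order.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

record = {"user_id": "user-42", "action": "login"}  # hypothetical event

kinesis.put_record(
    StreamName="my-stream",          # placeholder
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["user_id"],  # same key -> same shard -> ordered
)
```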

23
Q

sequence number

A

When messages are sent to a shard, they get a sequence number, and that sequence number is always increasing.

24
Q

if you need to choose a partition key

A

You need to choose one that is going to be
highly distributed (exam);
that prevents the hot partition.

If your key isn't distributed, then all your data will go to the same shard and that one shard will be overwhelmed.

25
Q

if we have an application with one million users

A

User ID is a great key, because we have one million users and realistically the users will do actions at different times, yet we still get ordering for each user ID, which is our message key. So user ID is a good one.

Very distributed, very active, and useful from a business perspective.

But if you have country ID as a field and it turns out that 90% of your users are in one country, say
the United States, then it's not good, because
most of your data will go to one shard.
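An illustrative sketch (not the exact Kinesis algorithm, though Kinesis does hash the partition key with MD5): hashing user IDs spreads records across shards, while a country key concentrates them on one shard.

```python
import hashlib
from collections import Counter

def shard_for(key: str, num_shards: int = 5) -> int:
    # Hash the partition key and map it onto a shard index.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# 1,000 distinct user IDs spread roughly evenly across 5 shards...
print(Counter(shard_for(f"user-{i}") for i in range(1000)))

# ...whereas records keyed on a dominant country all hash to a single shard.
countries = ["US"] * 900 + ["FR"] * 50 + ["DE"] * 50
print(Counter(shard_for(c) for c in countries))
```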

26
Q

ProvisionedThroughputExceeded

A

If you get an exception called ProvisionedThroughputExceeded,

that's when you go over the limits:
we sent more data than what was provisioned,
exceeding the number of megabytes per second
or transactions per second.

27
Q

to produce messages you use

A

the CLI, but you can also use the SDK or producer libraries

from various frameworks.

28
Q

ProvisionedThroughputExceeded solutions

A

For this you can:

  1. use retries and exponential backoff (see the sketch below).
  2. increase the number of shards.
  3. ensure that the partition key is a good one.
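A hedged sketch of retrying with exponential backoff around PutRecord (stream name and helper are hypothetical; in practice the SDK's built-in retry configuration can do much of this for you):

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, partition_key: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="my-stream",  # placeholder
                Data=data,
                PartitionKey=partition_key,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            # Wait 0.1s, 0.2s, 0.4s, ... before retrying the throttled call.
            time.sleep(0.1 * (2 ** attempt))
    raise RuntimeError("still throttled after retries")
```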
29
Q

you can use a normal consumer using

A

the CLI, the SDK, or the Kinesis Client Library (available in Java, Node, Python, Ruby or .Net.)
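A minimal SDK-based consumer sketch for a single shard (stream and shard ID are placeholders; the KCL handles this plumbing for you across shards):

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Start reading a single shard from its oldest available record.
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",          # placeholder
    ShardId="shardId-000000000000",  # placeholder
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # avoid hammering the per-shard read limits
```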

30
Q

Kinesis Client Library

A

It also uses DynamoDB to checkpoint the offsets
and to track the workers and share the work amongst shards.

We have a DynamoDB table, and the Kinesis apps
that use the KCL (the client library) checkpoint their progress in Amazon DynamoDB and synchronize their work between them to consume messages from different shards.

31
Q

Kinesis Security

A
  1. we can control access and authorization to Kinesis
    using IAM policies.
  2. encryption in flight using HTTPS endpoints.
  3. encryption at rest using KMS.
  4. There is a possibility to also encrypt and decrypt
    the data client side but it’s much harder to implement.
    You need to write your own code.
  5. you can also have VPC Endpoints available for Kinesis to access privately within a VPC.
32
Q

Kinesis data firehose

A

A fully managed service.

There is no administration needed, it scales automatically,

fully serverless.

We're not going to provision anything in advance.

It's going to be near real time (Kinesis Streams was real time): 60 seconds latency minimum for non-full batches.

33
Q

Kinesis data firehose used for

A

to load data into Redshift, Amazon S3, ElasticSearch and Splunk. (exam)

34
Q

Kinesis data firehose minimum data

A

Firehose writes about 32 MB of data as a minimum at a time when loading into these stores.

35
Q

Kinesis data firehose supports

A

many formats, conversions, transformations, and compression.

It's handy with CSV and JSON.

36
Q

Kinesis data firehose you pay

A

the amount of data going through Firehose.

You don't pay for provisioning Firehose, but you do for Kinesis Data Streams.

37
Q

Kinesis data firehose data sources

A

Sources can be the Kinesis Producer Library, a Kinesis Agent,
or a Kinesis data stream.

Both the agent and Kinesis data streams can send data directly into Data Firehose.

Sources can even be CloudWatch Logs or CloudWatch Events.
And then you can do some transformations using a Lambda function.
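Besides the agent and Kinesis Data Streams, applications can also write directly into a delivery stream with the SDK; a minimal sketch (delivery stream name and payload are placeholders):

```python
import json
import boto3

firehose = boto3.client("firehose")

event = {"level": "INFO", "msg": "user signed in"}  # hypothetical log line

# Firehose buffers these records and flushes them to the destination
# (S3, Redshift, ElasticSearch, Splunk) in batches.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # placeholder
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```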

38
Q

what is the difference between kinesis data streams

and Firehose?

A

Streams =

  1. When you write custom code: you need to write your own producer and your own consumer most of the time.
  2. It's going to be real time, about 200 ms latency.
  3. You must manage scaling yourself, via shard splitting or shard merging, which means you have to do capacity planning over time.
  4. You can store data and it expires between one and seven days, so if you need a place to just store data for three days, Kinesis Data Streams is a great way of doing it.
  5. Thanks to this you get replay capability.
  6. It's multi consumer.

Firehose

  1. fully managed, no capacity to provision,
  2. you send data to S3, Splunk, Redshift and ElasticSearch.
  3. serverless, so data transformations with Lambda.
  4. near real time
  5. automated scaling.
  6. there is no data storage, so you cannot replay from Firehose.
39
Q

Kinesis Analytics

A

It can take data from Kinesis Data Streams and Kinesis Data Firehose and perform some queries on it.

The output of these queries can be analyzed by your analytics tools or sent to output destinations.

It performs real-time analytics using SQL.

Auto scaling,

managed,

no servers to provision,

continuous: it's going to be real time.

Out of these queries you can create new streams, which can be consumed again by consumers or by Kinesis Data Firehose.

40
Q

Kinesis Analytics you only pay for

A

the actual consumption rate of kinesis data analytics.

41
Q

Data Ordering for Kinesis vs SQS FIFO

A

In Kinesis, if we have 5 shards and 100 truck IDs, more or less 20 trucks will be assigned to each shard (partition), based on the hashed ID of the objects we are processing.

We can therefore have only 5 parallel consumers.

In SQS FIFO there is only one queue, but we can create message groups with IDs. So we can have 100 groups based on the IDs of the objects, and we will be able to have 100 parallel consumers.

42
Q

SQS vs SNS vs Kinesis —– SQS

A
  1. The consumers pull data, and the data is deleted right after being consumed.
  2. You can have as many consumers as you want.
  3. You don't need to provision throughput; it scales automatically for you.
  4. There is no ordering guarantee unless you use a FIFO queue, but if you use a FIFO queue you get limited throughput.
  5. There is an individual message delay capability, so you can take a message and say it should be consumed in 15 minutes.
43
Q

SQS vs SNS vs Kinesis —– SNS

A
  1. Pub/sub, so you push data to many subscribers.
  2. You can have up to 10 million subscribers to one topic, and up to 10,000 topics.
  3. The data is not persisted, meaning it's lost if not delivered.
  4. You don't need to provision throughput in advance.
  5. If you want to persist the data, deliver it to many SQS queues.
  6. You can use a fan-out architecture to integrate it with SQS.
44
Q

SQS vs SNS vs Kinesis —– Kinesis

A
  1. A pull of data: like SQS we pull data, whereas SNS pushes data.
  2. We can have as many consumers as we want, but only one consumer per shard (within a consuming application).
  3. The possibility to replay data is available; we could reprocess a whole day of data.
  4. Meant for real-time big data, analytics and ETL (exam): we want to do real-time ingestion of data from IoT.

Any time you hear real-time big data, think Kinesis.

  5. There is ordering, but it's at the shard level.
  6. The data expires after X number of days.
  7. There is some data retention, but it's temporary.
  8. You must provision your throughput in advance.