Kinesis Flashcards
Kinesis
a managed alternative to Apache Kafka
a big data streaming tool that lets you collect application logs, metrics, IoT data, clickstreams, basically anything that is real-time big data.
overall, Kinesis is associated with real-time big data. (exam)
compatible with
many stream-processing frameworks, such as Apache Spark, Apache NiFi, etc.
These frameworks let you perform computations in real time on data that arrives through a stream.
data replication
the data is automatically replicated to 3 Availability Zones.
three Kinesis sub-products
- Kinesis Streams: people often just call it Kinesis; it lets you ingest streams at scale with low latency. (exam)
- Kinesis Analytics: perform real-time analytics on streams using SQL: filters, computations, aggregations in real time
- Kinesis Firehose: load your stream into other parts of AWS such as S3, Redshift, ElasticSearch, and so on.
How does data get into the streams?
our clickstreams, IoT devices, metrics, and logs produce data directly into our Kinesis streams.
Kinesis Analytics
Once the data is in Kinesis Streams, you typically want to process it: computations, metrics, monitoring, alerting, whatever you need. For this you perform computations in real time with Kinesis Analytics.
Kinesis Firehose
once these computations are done, it’s good to store the results somewhere: S3, a database, Redshift, et cetera.
Firehose is how you put the data in S3 or in Redshift, for example.
Streams are divided into
ordered Shards or Partitions.
Shard
think of it as one little queue.
our producers produce
to a Kinesis stream; say this one has three shards.
The data goes into one of the shards, and the consumers consume from the shards as well.
if we wanna scale up our stream
we just add shards
if we want higher throughput, we increase the number of shards (see the sketch below).
the data does not stay in a shard forever
By default it is retained for one day; each shard can be configured to keep your data for up to 7 days.
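A minimal boto3 sketch of these operations (the stream name and shard counts are made up; the calls are the standard Kinesis API):

```python
import boto3

kinesis = boto3.client("kinesis")

# Create a stream with 3 shards.
kinesis.create_stream(StreamName="my-stream", ShardCount=3)

# Scale up: more shards = more throughput.
kinesis.update_shard_count(
    StreamName="my-stream",
    TargetShardCount=6,
    ScalingType="UNIFORM_SCALING",
)

# Extend retention from the default 24 hours to the 7-day maximum.
kinesis.increase_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=168,
)
```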
why would you have such short data retention?
because Kinesis is just a massive highway, a massive pipe: you want to process your data, do something with it, and put it somewhere else as soon as possible.
difference with SQS
Kinesis is also awesome because it allows you to reprocess and replay data. With SQS, once the data was consumed, it was gone.
With Kinesis the data is still there; it only expires after the retention period.
multiple consumers
You’re also able to have multiple applications consume
the same stream, sort of like an SNS mindset.
We just need one stream of data,
and many applications, many consumers, can consume that same stream.
Kinesis is not a database
Once the data is inserted into Kinesis you cannot delete it.
This is called immutability: you append data over time, like a log, and then process it using consumers.
The data stays in Kinesis for one to seven days, during which you do something with it.
shards size
streams are made of many shards.
A shard represents one megabyte per second or 1,000 messages per second on the write side. So a producer can write up to 1,000 messages per second
or one megabyte per second per shard.
On the read side you get two megabytes per second
of throughput per shard
you’re going to pay for
how many shards you provisioned. You can have as many shards as you want, but if you over-provision your shards and don’t use them to full capacity, you overpay. Similarly, if you have more throughput than your shards can handle, you’re going to have throughput issues.
batch
You have the ability to batch messages and calls.
And this allows you to efficiently push messages into Kinesis.
to reduce costs and increase throughput.
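As a sketch, batching with the PutRecords API in boto3 might look like this (stream name, keys, and payloads are illustrative):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Send up to 500 records per PutRecords call instead of
# one HTTP request per message.
records = [
    {"Data": json.dumps({"event": i}).encode(), "PartitionKey": f"user-{i % 10}"}
    for i in range(100)
]
response = kinesis.put_records(StreamName="my-stream", Records=records)

# PutRecords is not all-or-nothing: inspect per-record failures.
if response["FailedRecordCount"] > 0:
    failed = [r for r in response["Records"] if "ErrorCode" in r]
    print(f"{len(failed)} records failed and should be retried")
```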
the records will be ordered
per shard
resharding
splitting a shard to increase capacity (adds a shard)
merging
combining two adjacent shards to decrease capacity (deletes a shard)
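A hedged sketch of both resharding operations with boto3 (shard IDs and the hash key are illustrative; in practice you would read the real ranges from list_shards):

```python
import boto3

kinesis = boto3.client("kinesis")

# Split a hot shard in two. NewStartingHashKey must fall inside
# the shard's own hash key range; this value assumes the shard
# covers the full 128-bit space.
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit="shardId-000000000000",
    NewStartingHashKey=str(2**127),
)

# Merge two adjacent, under-used shards back into one.
kinesis.merge_shards(
    StreamName="my-stream",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```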
producers sending data
you need to send the data along with a partition key (the message key),
and the message key is whatever string you want.
And this key will get hashed to determine the shard ID.
So the key is basically a way for you to route the data
to a specific shard.
the same key always goes to the same partition
if you want all your data in order for the same key, provide that key with every data point, and the records will be in order for that key.
When your data is produced, the messages know which shard to go to because of the message key.
sequence number
when messages are sent to a shard, they get a sequence number, and that sequence number is always increasing
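For example, a single PutRecord call in boto3 (names and payload invented) shows both ideas: the partition key picks the shard, and the response carries the per-shard sequence number:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"action": "login"}).encode(),
    PartitionKey="user-1234",  # same key -> same shard -> ordered
)
print(response["ShardId"])         # e.g. shardId-000000000002
print(response["SequenceNumber"])  # always increasing within the shard
```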
if you need to choose a partition key
you need to choose one that is going to be
highly distributed (exam)
which prevents hot partitions.
if your key isn’t well distributed, all your data will go through the same shard, and that one shard will be overwhelmed.
if our application has one million users,
user ID is a great key: realistically, users act at different times, and we still get ordering per user ID, which is our message key. So user ID is a good one.
Very distributed, very active and useful from a business perspective.
But if you use country ID as the key and it turns out that 90% of your users are in one country, say
the United States, then it’s not a good key, because
most of your data will go to one shard.
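To see why, here is a small illustration (not the exact Kinesis internals, though Kinesis does MD5-hash the partition key into a 128-bit space that the shards divide up): a user-ID key spreads records across shards, while a country key concentrates them on one.

```python
import hashlib
from collections import Counter

def shard_for(key: str, num_shards: int) -> int:
    # Approximate Kinesis routing: MD5 the key into a 128-bit
    # integer, then map it onto evenly sized shard ranges.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h * num_shards // 2**128

# A well-distributed key (user ID) balances the 5 shards...
print(Counter(shard_for(f"user-{i}", 5) for i in range(100_000)))

# ...while a skewed key (country) creates a hot partition.
countries = ["US"] * 90 + ["CA"] * 5 + ["FR"] * 5
print(Counter(shard_for(c, 5) for c in countries))
```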
ProvisionedThroughputExceeded
this exception occurs when you go over the limits,
that is, when we send more data than was provisioned:
we exceed the number of megabytes per second
or transactions per second.
to produce messages you use
the CLI, but you can also use the SDK or producer libraries
from various frameworks.
ProvisionedThroughputExceeded solutions
For this you can:
- use retries with exponential backoff (sketched below)
- increase the number of shards
- ensure that your partition key is a good, well-distributed one
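A sketch of the retry-with-backoff approach in boto3 (stream name, retry count, and delays are arbitrary choices):

```python
import time
import random
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, key: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName="my-stream", Data=data, PartitionKey=key
            )
        except ClientError as e:
            if e.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, ...
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still throttled after retries")
```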
you can use a normal consumer using
the CLI, the SDK, or the Kinesis Client Library (available in Java, Node, Python, Ruby, or .NET).
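A bare-bones SDK consumer for one shard might look like this sketch (stream and shard names invented); the KCL below handles all of this bookkeeping for you:

```python
import boto3

kinesis = boto3.client("kinesis")

# Start reading from the oldest record still retained
# (TRIM_HORIZON); LATEST would read only new records.
it = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while it is not None:
    resp = kinesis.get_records(ShardIterator=it, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    if resp["MillisBehindLatest"] == 0:
        break  # caught up; a real consumer would keep polling
    it = resp.get("NextShardIterator")
```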
Kinesis Client Library
it also uses DynamoDB to checkpoint the offsets
and to track the other workers and share the work amongst shards.
There will be a DynamoDB table, and the Kinesis apps
that use the KCL checkpoint their progress in Amazon DynamoDB and synchronize their work between them to consume messages from different shards.
Kinesis Security
- we can control access and authorization to Kinesis using IAM policies
- encryption in flight using HTTPS endpoints
- encryption at rest using KMS (see the sketch after this list)
- it is also possible to encrypt and decrypt the data client side, but it’s much harder to implement: you need to write your own code
- VPC Endpoints are available for Kinesis, so you can access it privately within a VPC
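For instance, encryption at rest can be switched on with one call; a sketch (the stream name is made up, and alias/aws/kinesis is the AWS-managed key):

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption at rest with KMS.
kinesis.start_stream_encryption(
    StreamName="my-stream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",
)
```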
Kinesis data firehose
a fully managed service.
There is no administration needed, it scales automatically,
and it is fully serverless:
we’re not going to provision anything in advance.
It’s near real time (Kinesis Streams is real time): 60 seconds minimum latency for non-full batches.
Kinesis data firehose used for
to load data into Redshift, Amazon S3, ElasticSearch and Splunk. (exam)
Kinesis data firehose minimum data
it writes a minimum of about 32 MB of data at a time when loading into these stores.
Kinesis data firehose supports
many formats, conversions, transformations, and compression.
handy with CSV, JSON
Kinesis data firehose you pay for
the amount of data going through Firehose.
You don’t pay for provisioning Firehose, but you do for Kinesis Data Streams.
Kinesis data firehose data sources
could be the Kinesis Producer Library, a Kinesis agent,
or a Kinesis data stream.
Both the agent and a Kinesis data stream can send data directly into Firehose.
can even be CloudWatch Logs or CloudWatch Events.
And then you can do some transformations using a Lambda function.
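Writing directly to Firehose from the SDK is a one-liner; a sketch with an invented delivery stream name:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers this record and delivers it to the configured
# destination (e.g. S3) in near real time. The trailing newline
# keeps records separated in the delivered S3 objects.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": json.dumps({"event": "click"}).encode() + b"\n"},
)
```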
what is the difference between kinesis data streams
and Firehose?
Streams =
- you write custom code: you need to write your own producer and, most of the time, your own consumer
- it’s real time, with about 200 ms latency
- you must manage scaling yourself, via shard splitting and shard merging,
which means doing capacity planning over time
- you can store data for one to seven days before it expires; if you need a place to just store data for three days,
Kinesis Data Streams is a great way to do it
- thanks to that storage, you get replay capability
- it supports multiple consumers
Firehose
- fully managed, with no capacity to provision
- you send data to S3, Splunk, Redshift, and ElasticSearch
- serverless data transformations with Lambda
- near real time
- automated scaling
- there is no data storage, so you cannot replay from Firehose
Kinesis Analytics
it can take data from Kinesis Data Streams and Kinesis Data Firehose and perform queries on it.
The output of these queries can be analyzed by your analytics tools or fed to real-time outputs.
performs a real time analytics using SQL.
auto scaling,
managed
no servers to provision,
continuous: it’s going to be real time
from these queries you can create new streams, which can be consumed again by consumers or by Kinesis Data Firehose
Kinesis Analytics you only pay for
the actual consumption rate of kinesis data analytics.
Data Ordering for Kinesis vs SQS FIFO
in Kinesis, if we have 5 shards and 100 truck IDs, more or less 20 trucks will be assigned to each shard (partition), based on the hashed ID of the objects we are processing.
We can therefore have only 5 parallel consumers.
In SQS FIFO there is only one queue, but we can create message groups with IDs: with 100 groups based on the object IDs, we can have up to 100 parallel consumers.
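A sketch of the SQS FIFO side (queue URL and IDs invented): each MessageGroupId is its own ordered lane, so 100 truck IDs allow up to 100 parallel consumers.

```python
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/trucks.fifo",
    MessageBody='{"truck": 42, "gps": [48.85, 2.35]}',
    MessageGroupId="truck-42",          # ordering scope: one truck
    MessageDeduplicationId="msg-0001",  # or enable content-based dedup
)
```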
SQS vs SNS vs Kinesis —– SQS
- the consumers pull data and the data is going to be deleted right after being consumed.
- you can have as many consumers as you want
- you don’t need to provision throughput, it will scale automatically for you.
- there is no ordering guarantee unless you use a FIFO queue, but FIFO queues have limited throughput
- there is an individual message delay capability: you can take a message and say “be consumed in 15 minutes” (see the sketch below)
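The delay capability in the last bullet looks like this in boto3 (queue URL invented; 900 seconds is the SQS per-message maximum):

```python
import boto3

sqs = boto3.client("sqs")

# This message stays invisible to consumers for 15 minutes.
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
    MessageBody="process me later",
    DelaySeconds=900,
)
```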
SQS vs SNS vs Kinesis —– SNS
- pub/sub so you push data to many subscribers.
- you can have up to 10 million subscribers per topic, and up to 100,000 topics.
- the data is not persisted, so that means that it’s lost if not delivered.
- you don’t need to provision the throughput in advance
- if you want to persist the data, deliver it to many SQS queues:
- you can use a fan-out architecture to integrate it with SQS
SQS vs SNS vs Kinesis —– Kinesis
- a pull model: like SQS we pull data, whereas SNS pushes data.
- We can have as many consumers as we want, but we can only have one consumer per shard.
- the possibility to replay data is available; we could reprocess a whole day of data
- meant for real-time big data, analytics, and ETL (exam): real-time ingestion of IoT data, for example;
any time you see “real-time big data”, think Kinesis
- There is ordering but it’s at the shard level.
- the data expires after X number of days
- there is some data retention but it’s temporary.
- you must provision your throughput in advance