Kinesis Flashcards
Kinesis
a managed alternative to Apache Kafka
a big data streaming service that lets you collect application logs, metrics, IoT data, and clickstreams: essentially any real-time big data.
overall, Kinesis is associated with real-time big data. (exam)
compatible with
many stream processing frameworks, such as Apache Spark and Apache NiFi.
These frameworks let you perform computations in real time on data that arrives through a stream.
data replication
the data is automatically replicated to 3 Availability Zones.
three sub-Kinesis products
- Kinesis Streams: often just called Kinesis; ingests streams at scale with low latency. (exam)
- Kinesis Analytics: performs real-time analytics on streams using SQL (filters, computations, aggregations).
- Kinesis Firehose: loads your stream into other parts of AWS such as S3, Redshift, Elasticsearch, and so on.
Now how do we get the streams?
our clickstreams, IoT devices, metrics, and logs produce data directly into our Kinesis streams.
Kinesis Analytics
Once the data is in Kinesis streams, you will want to process it (computations, metrics, monitoring, alerting, and so on), which requires performing computation in real time.
Kinesis Firehose
once these computations are done, it is useful to store the results somewhere, such as S3, a database, or Redshift.
Streams are divided into
ordered Shards or Partitions.
Shard
think of it as one little queue.
we have our producers and they are going to produce
to a Kinesis stream, which might have, say, three shards.
The data goes into any one of the shards, and the consumers consume from any of the shards as well.
if we want to scale up our stream
we just add shards
if we wanted higher throughput, we would increase the number of shards.
in this shard the data is not there forever
By default it is kept for one day. We can configure it so the data is kept for up to 7 days.
why would you have such short data retention?
because Kinesis is just a massive highway, a massive pipe: you want to process your data and put it somewhere else as soon as possible.
difference with SQS
Kinesis allows you to reprocess and replay data. With SQS, once the data was consumed, it was gone.
With Kinesis the data is still there after it has been consumed, and it expires only after the retention period.
multiple consumers
You can also have multiple applications consume
the same stream, similar in spirit to SNS.
We just need one stream of data,
and many applications (many consumers) can consume that same stream.
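To make the replay and multiple-consumer ideas concrete, here is a minimal in-memory sketch. `InMemoryStream` is a hypothetical toy class standing in for a single shard; it is not the Kinesis API, and it ignores retention and expiry.

```python
class InMemoryStream:
    """Toy append-only log standing in for one shard (hypothetical, not the Kinesis API)."""

    def __init__(self):
        self._records = []

    def put(self, data):
        # Records are only ever appended, never removed by readers.
        self._records.append(data)
        return len(self._records) - 1  # position of the new record

    def read_from(self, position):
        # Reading returns a copy of the tail; the stored records stay in place.
        return self._records[position:]


stream = InMemoryStream()
for event in ["click-1", "click-2", "click-3"]:
    stream.put(event)

# Two independent consumers read the same records.
analytics_view = stream.read_from(0)
archiver_view = stream.read_from(0)

# A consumer that has already read everything can still replay from the start.
replayed = stream.read_from(0)
```

With a queue such as SQS, the second and third reads would find nothing once the first consumer had processed and deleted the messages.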
Kinesis is not a database
Once the data is inserted into Kinesis you cannot delete it.
This is called immutability. Kinesis behaves like a log: you append data over time and then process it using consumers.
The data stays in Kinesis for one to seven days, during which you do something with it.
shard size
streams are made of many shards.
On the write side, a shard accepts one megabyte per second or 1000 messages per second. So producers can write up to 1000 messages per second
or one megabyte per second to each shard.
On the read side you have two megabytes per second
of throughput per shard
you’re going to pay for
how many shards you provision. You can have as many shards as you want, but if you over-provision and do not use them to their full capacity, you overpay. Conversely, if your traffic exceeds what your shards can handle, you will have throughput issues.
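The per-shard limits above turn provisioning into simple arithmetic. The sketch below is one way to estimate a shard count; the function name `shards_needed` and the example traffic figures are made up for illustration, and the limits are the ones quoted in these notes.

```python
import math

# Per-shard limits as quoted in the notes above.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Smallest shard count that covers all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
    )


# Example: 3.5 MB/s written, 2500 records/s, 7 MB/s read in total.
print(shards_needed(write_mb_per_sec=3.5, records_per_sec=2500, read_mb_per_sec=7.0))  # 4
```

Provisioning more than this number means paying for idle capacity; provisioning fewer means hitting throughput limits.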
batch
You can batch the messages and the calls.
This lets you push messages into Kinesis efficiently,
reducing costs and increasing throughput.
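As a rough sketch of batching, the helper below groups records into chunks before sending. `batch_records` is a hypothetical helper, and the default of 500 records per batch is the PutRecords per-request limit as I recall it, so treat that number as an assumption and check the current service limits. The record dictionaries follow the `Data` / `PartitionKey` shape that boto3's `put_records` takes in its `Records` parameter; no AWS call is made here.

```python
def batch_records(records, max_batch_size=500):
    """Yield consecutive slices of `records`, each at most `max_batch_size` long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(records), max_batch_size):
        yield records[start:start + max_batch_size]


# 1200 made-up events, spread over 10 partition keys.
events = [
    {"Data": f"event-{i}".encode("utf-8"), "PartitionKey": f"user-{i % 10}"}
    for i in range(1200)
]

batches = list(batch_records(events))
# Each batch could be sent in one request instead of one call per record.
print([len(batch) for batch in batches])  # [500, 500, 200]
```

Three requests instead of 1200 is where the cost and throughput gain comes from.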
the records will be ordered
per shard
resharding
adding a shard
merging
deleting a shard
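Resharding can be pictured as splitting or joining ranges of hash keys. The sketch below models a shard as an inclusive `(start, end)` range over a 128-bit hash space; `split_shard` and `merge_shards` are hypothetical helpers for illustration, not the service's SplitShard and MergeShards operations, and splitting at the midpoint is an assumption.

```python
MAX_HASH_KEY = 2**128 - 1  # top of the 128-bit hash key space


def split_shard(hash_range):
    """Split one inclusive (start, end) range into two halves: adds a shard."""
    start, end = hash_range
    if start >= end:
        raise ValueError("range is too small to split")
    midpoint = (start + end) // 2
    return (start, midpoint), (midpoint + 1, end)


def merge_shards(left, right):
    """Join two adjacent ranges into one: removes a shard."""
    if left[1] + 1 != right[0]:
        raise ValueError("shards must be adjacent to merge")
    return (left[0], right[1])


whole = (0, MAX_HASH_KEY)
low, high = split_shard(whole)   # one shard becomes two
restored = merge_shards(low, high)  # two shards become one again
```

Splitting raises capacity (and cost) for a busy range of keys; merging does the reverse for ranges that are underused.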
producers sending data
you need to send data with a partition key. In the diagram, your data is the gray box and the message key is the orange box,
and the message key is any string you choose.
This key gets hashed to determine the shard ID.
So the key is basically a way for you to route the data
to a specific shard.
the same key always goes to the same partition
if you want all the data for a given key in order, provide that key with every data point and the data will be in order for you.
when your data is produced, the messages know which shard to go to because of the message key.
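The key-to-shard routing can be sketched in a few lines. AWS documents an MD5 hash of the partition key mapped onto 128-bit hash key ranges; the assumption here is that the shards split that space into equal contiguous ranges, which is not guaranteed after resharding. `shard_for_key` is a hypothetical helper name.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 digests are 128-bit integers


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal-sized hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hashed = int(digest, 16)
    # Scale the 128-bit hash down to an index in [0, shard_count).
    return hashed * shard_count // HASH_SPACE


first = shard_for_key("user-42", shard_count=3)
second = shard_for_key("user-42", shard_count=3)
# The same key always lands on the same shard, which is what preserves per-key order.
```

Because the mapping is deterministic, every record sent with `"user-42"` ends up in one shard, in the order it was written.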
sequence number
when messages are sent to a shard, each one gets a sequence number, and that sequence number is always increasing
if you need to choose a partition key
you need to choose one that is going to be
highly distributed (exam)
that prevents a hot partition.
if your key is not well distributed, all your data goes through the same shard and that one shard is overwhelmed.
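A quick way to see the hot partition effect is to count how many records each shard receives under two key choices. This is a sketch under the same assumptions as before: MD5 hashing of the key and shards that split the hash space into equal ranges. The helper names and the key values are made up for illustration.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2**128


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal-sized hash ranges."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hashed * shard_count // HASH_SPACE


def shard_load(keys, shard_count):
    """Count how many records land on each shard for the given keys."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# Highly distributed key: one distinct value per device.
distributed = shard_load([f"device-{i}" for i in range(10_000)], shard_count=4)

# Poorly distributed key: every record shares the same value.
skewed = shard_load(["all-devices"] * 10_000, shard_count=4)

print("distributed:", dict(distributed))
print("skewed:", dict(skewed))
```

With the shared key, a single shard takes all 10,000 records while the other three sit idle; with per-device keys the load spreads across all four.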