Kinesis Flashcards
Kinesis
a managed alternative to Apache Kafka
a big data streaming tool that lets you collect application logs, metrics, IoT data, clickstreams: basically any real-time big data.
Overall, Kinesis is associated with real-time big data. (exam)
compatible with
many stream processing frameworks, such as Apache Spark and Apache NiFi.
These frameworks let you perform computations in real time on data that arrives through a stream.
data replication
the data is automatically replicated to 3 Availability Zones.
three Kinesis sub-products
- Kinesis Streams: often just called Kinesis; ingests streams at scale with low latency. (exam)
- Kinesis Analytics: performs real-time analytics on streams using SQL (filters, computations, aggregations).
- Kinesis Firehose: loads your streams into other parts of AWS such as S3, Redshift, Elasticsearch, and so on.
how do we get the streams?
Clickstreams, IoT devices, metrics, and logs produce data directly into our Kinesis streams.
Kinesis Analytics
Once the data is in Kinesis streams, you will want to process it: compute metrics, monitor, alert, or whatever else you need. This requires computation in real time, which is what Kinesis Analytics is for.
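To make "computation in real time" a little more concrete, here is a minimal sketch of a tumbling-window count, the kind of aggregation a streaming SQL query might express. It is plain Python over an in-memory list, not the Kinesis Analytics API; the function name `tumbling_counts` and the sample events are hypothetical.

```python
from collections import Counter


def tumbling_counts(events, window_seconds=60):
    """Count events per (window_start, key).

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    Each event falls into exactly one fixed-size, non-overlapping window.
    """
    counts = Counter()
    for ts, key in events:
        window_start = ts - ts % window_seconds  # start of the window containing ts
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical clickstream: (seconds since start, page name)
events = [(0, "home"), (15, "cart"), (59, "home"), (60, "home"), (125, "cart")]
result = tumbling_counts(events)
for (window_start, page), n in sorted(result.items()):
    print(f"window starting at {window_start}s, page {page}: {n} hits")
```

A real streaming job would update these counts incrementally as records arrive rather than looping over a finished list.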
Kinesis Firehose
Once these computations are done, it is good to store the results somewhere, for example in S3, Redshift, or another database. Firehose handles that delivery.
Streams are divided into
ordered shards (also called partitions).
Shard
think of it as one little queue.
producers write
to a Kinesis stream, which might for example have three shards.
Each record goes into one of the shards, and consumers read from any of the shards as well.
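How a record ends up in a particular shard can be sketched as follows. This goes slightly beyond the flashcards: Kinesis hashes each record's partition key with MD5 into a 128-bit integer, and every shard owns a range of that hash space. The even split of the range across shards and the function name `shard_for_key` are assumptions made for illustration.

```python
import hashlib

MAX_HASH = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Assumption: the hash space is split into equal, contiguous ranges.
    return hash_value * shard_count // MAX_HASH


# Hypothetical producers writing to a three-shard stream.
for key in ["device-1", "device-2", "device-3", "device-1"]:
    print(key, "-> shard", shard_for_key(key, 3))
```

Records with the same partition key always land in the same shard, which is what keeps them ordered relative to each other.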
if we want to scale up our stream
we just add shards.
If we want higher throughput, we increase the number of shards.
data does not stay in a shard forever
By default it is kept for one day. Retention can be extended so that the data is kept for up to 7 days.
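The retention rule can be pictured with a small sketch: a record is readable only while it is younger than the retention window. This is a toy model in plain Python, not how Kinesis is implemented; the function name `unexpired` and the sample records are hypothetical.

```python
from datetime import datetime, timedelta


def unexpired(records, now, retention_hours=24):
    """Keep only records newer than the retention window.

    `records` is a list of (arrival_time, data) tuples.
    """
    cutoff = now - timedelta(hours=retention_hours)
    return [(ts, data) for ts, data in records if ts > cutoff]


now = datetime(2024, 1, 8, 12, 0)
records = [
    (now - timedelta(hours=1), "recent"),
    (now - timedelta(days=2), "two days old"),
    (now - timedelta(days=8), "eight days old"),
]

# One-day default retention versus the 7-day extended retention.
default_view = [data for _, data in unexpired(records, now)]
extended_view = [data for _, data in unexpired(records, now, 7 * 24)]
print(default_view)
print(extended_view)
```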
why would you have such short data retention?
because Kinesis is a massive highway, a massive pipe. You want to process your data and put it somewhere else as soon as possible.
difference with SQS
Kinesis lets you reprocess and replay data. With SQS, once a message has been consumed and deleted, it is gone.
With Kinesis the data is still there after it has been read, and it only expires after the retention period.
multiple consumers
You can also have multiple applications consume
the same stream, similar in spirit to SNS.
We only need one stream of data,
and many applications (consumers) can consume that same stream.
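The replay and multiple-consumer behaviour can be pictured with a toy in-memory log: reading never removes records, and each consumer tracks its own position. This is an illustration only, not the Kinesis API; the class `ToyStream` and its methods are hypothetical.

```python
class ToyStream:
    """Append-only list of records; reading never removes anything."""

    def __init__(self):
        self.records = []

    def put(self, data):
        self.records.append(data)

    def read(self, position, limit=10):
        """Return (batch, next_position) starting at `position`."""
        batch = self.records[position:position + limit]
        return batch, position + len(batch)


stream = ToyStream()
for event in ["click-1", "click-2", "click-3"]:
    stream.put(event)

# Two independent consumers, each starting from the beginning of the stream.
analytics_batch, analytics_pos = stream.read(0)
archiver_batch, archiver_pos = stream.read(0)

# Replay: a consumer rewinds to position 0 and reads the same data again.
replayed_batch, _ = stream.read(0)
print(analytics_batch, archiver_batch, replayed_batch)
```

In a queue such as SQS, the second read would find nothing once the first consumer had processed and deleted the messages.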
Kinesis is not a database
Once the data is inserted into Kinesis you cannot delete it.
This is called immutability. Kinesis behaves like a log: you append data over time and then process it using consumers.
The data stays in Kinesis for one to seven days, and in that time you do something with it.
shard size
Streams are made of many shards.
Each shard supports 1 MB per second or 1,000 messages per second on the write side, so producers can write up to 1,000 messages per second
or 1 MB per second to a shard.
On the read side you have 2 MB per second
of throughput per shard.
you are going to pay for
how many shards you provision. You can have as many shards as you want, but if you over-provision and do not use them to their full capacity, you overpay. Conversely, if your traffic exceeds what your shards can handle, you will have throughput issues.
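The per-shard limits above turn sizing into simple arithmetic: take the largest shard count demanded by any one limit. A minimal sketch, assuming the limits quoted in these flashcards; the function name `shards_needed` and the example workload numbers are hypothetical.

```python
import math

WRITE_MB_PER_SHARD = 1          # MB/s per shard on the write side
WRITE_RECORDS_PER_SHARD = 1000  # messages/s per shard on the write side
READ_MB_PER_SHARD = 2           # MB/s per shard on the read side


def shards_needed(write_mb_s, write_records_s, read_mb_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(write_records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Example workload: 2.5 MB/s written, 1,800 messages/s, 5 MB/s read.
print(shards_needed(2.5, 1800, 5))
```

Provisioning more shards than this number means paying for unused capacity; provisioning fewer means running into the throughput limits.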