Kafka Basics Flashcards
What is the problem with ingesting real-time data directly into a streaming engine?
If the engine goes down, or a load spike overwhelms it and it starts dropping messages, we can no longer process data in real time. We either lose data or the data becomes stale.
How can a messaging system like Kafka help with real-time data processing?
Placing a messaging system between the source and the real-time processing engine solves both problems. If the engine goes down, the messaging system preserves the messages until it recovers. If the load increases, the messages are queued and nothing is lost.
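The buffering idea can be sketched with a toy in-memory model. This is not Kafka's API; the names `buffer`, `produce` and `drain` are made up for illustration, and a plain deque stands in for the messaging system.

```python
from collections import deque

buffer = deque()  # stands in for the messaging system (assumption: in-memory only)

def produce(msg):
    """The source keeps writing whether or not the engine is running."""
    buffer.append(msg)

def drain(engine_up, max_batch):
    """The engine takes up to max_batch messages, but only while it is up."""
    processed = []
    while engine_up and buffer and len(processed) < max_batch:
        processed.append(buffer.popleft())
    return processed

for i in range(5):
    produce(f"event-{i}")

during_outage = drain(engine_up=False, max_batch=10)   # engine is down: nothing processed
after_recovery = drain(engine_up=True, max_batch=10)   # engine is back: backlog is intact
```

While the engine is down the messages simply wait in the buffer, so the backlog is still there when the engine comes back.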
What is the solution when there are many point-to-point connections or data pipelines between multiple systems and the communication becomes a mess?
Place a messaging system like Kafka in between so that all communication goes through it. This simplifies the communication architecture.
Kafka DECOUPLES data pipelines.
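A rough way to see the simplification is to count connections. The sketch below assumes the worst case where every source system talks to every target system; the function names are hypothetical.

```python
def point_to_point_links(sources, targets):
    """Every source connects directly to every target."""
    return sources * targets

def brokered_links(sources, targets):
    """Every system connects once, to the messaging system only."""
    return sources + targets

direct = point_to_point_links(4, 6)   # 4 * 6 connections
via_broker = brokered_links(4, 6)     # 4 + 6 connections
```

With 4 sources and 6 targets that is 24 direct connections versus 10 through the messaging system, and adding one more system adds one connection instead of several.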
What is a big advantage of Kafka in terms of producers and consumers?
It decouples producers from consumers, and therefore decouples the data pipelines.
What is Kafka in terms of messaging?
It is a distributed publish-subscribe messaging system, originally developed at LinkedIn. Kafka is fast, horizontally scalable, reliable, fault-tolerant, durable, and distributed by design.
What is the difference between distributed queuing and publish-subscribe systems?
In a queuing system, once a message is read it is removed from the queue, so only one consumer gets to read it. In a publish-subscribe system, if multiple consumer groups are interested in a message, all of them receive it. It works like a mailing list: everyone who has subscribed to the mail chain gets the mail.
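The two delivery models can be contrasted with a small in-memory sketch. It is a toy model rather than Kafka code; the worker and group names are invented for the example.

```python
from collections import deque

messages = ["m1", "m2", "m3"]

# Queue semantics: a message is removed when read, so exactly one worker sees it.
queue = deque(messages)
workers = ["worker_a", "worker_b"]
queue_reads = {name: [] for name in workers}
turn = 0
while queue:
    queue_reads[workers[turn % len(workers)]].append(queue.popleft())
    turn += 1

# Publish-subscribe semantics: the log is not consumed by reading,
# so every subscribed group sees every message.
log = list(messages)
pubsub_reads = {group: list(log) for group in ("billing", "analytics")}
```

In the queue case the three messages are split between the two workers; in the publish-subscribe case both groups end up with all three.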
Is Kafka push based or pull based?
Kafka is a pull-based messaging system. Consumers pull messages at whatever rate they can support. So if there are two subscribers to a particular message stream, one consumer can be faster and further ahead while the other is slower and behind.
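A minimal sketch of the pull model, assuming each consumer tracks its own read position (offset) in a shared log. The `Consumer` class and its `poll` method here are a toy, not the real Kafka client.

```python
class Consumer:
    """Toy pull-based consumer: it decides when, and how much, to read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # position of the next record this consumer will read

    def poll(self, max_records):
        records = self.log[self.offset:self.offset + max_records]
        self.offset += len(records)
        return records

log = [f"msg-{i}" for i in range(10)]
fast = Consumer(log)
slow = Consumer(log)

fast.poll(8)   # the fast consumer pulls a large chunk
slow.poll(3)   # the slow consumer pulls what it can handle
```

Because each consumer owns its offset, the slow one falls behind without holding the fast one back, and it picks up exactly where it left off on the next poll.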
Which way of reading is better in Kafka: a single message at a time or a batch of messages at a time?
Reading a batch of messages at a time is better. It reduces I/O and network round trips, so throughput is typically many times higher than reading a single message at a time.
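The saving can be illustrated by counting fetch round trips. This is only arithmetic under the assumption that each fetch costs one round trip; the real speedup depends on message size, network and disk.

```python
def fetch_calls(total_messages, batch_size):
    """Number of fetch round trips needed to read all messages (ceiling division)."""
    return -(-total_messages // batch_size)

single = fetch_calls(1000, 1)     # one message per fetch
batched = fetch_calls(1000, 100)  # one hundred messages per fetch
```

For 1000 messages that is 1000 round trips one at a time versus 10 with batches of 100.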
What is the default retention duration for data stored on disk?
By default it is 7 days (the broker setting log.retention.hours defaults to 168). After that the data is deleted. The retention period is configurable and is limited only by the disk space you have.
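Time-based retention can be sketched as follows. Kafka's broker setting `log.retention.hours` defaults to 168 hours (7 days); everything else here is a simplification, since real Kafka deletes whole log segments rather than individual records, and `apply_retention` is a hypothetical helper.

```python
RETENTION_HOURS = 168  # default of Kafka's log.retention.hours (7 days)

def apply_retention(records, now_hours, retention_hours=RETENTION_HOURS):
    """Keep only records younger than the retention window.

    records is a list of (timestamp_in_hours, payload) tuples.
    """
    return [(ts, payload) for ts, payload in records
            if now_hours - ts < retention_hours]

records = [(0, "old"), (100, "mid"), (190, "new")]
kept = apply_retention(records, now_hours=200)
```

At hour 200 the record written at hour 0 is older than 168 hours and is dropped, while the other two are kept.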
What are the key Kafka terms?
1) Producer
2) Message
3) Consumer
4) Topic
5) Partition
6) Zookeeper
7) Broker
What is a Producer?
A producer is any application that publishes messages to a topic.
What is a Consumer?
A consumer is any application that subscribes to a topic and consumes its messages.
What is a Topic?
Logically, a topic is a feed name to which records are published. It's like labelling traffic in an MPLS system. Messages with different labels may be published by different producers, and consumers can subscribe only to the labels/topics they are interested in. Physically there is no single object called a topic; a topic is an amalgamation of partitions.
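The "topic is just a set of partitions" idea can be sketched like this. It is a toy model: the `Topic` class is hypothetical, and `zlib.crc32` is used only as a stand-in hash for choosing a partition from the key, not as the hash Kafka itself uses.

```python
import zlib

class Topic:
    """Toy topic: a name plus a list of partitions, each an append-only list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Stand-in partitioner (assumption): hash the key, take it modulo the partition count.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index

topic = Topic("orders", num_partitions=3)
first = topic.publish("user-1", "created")
second = topic.publish("user-1", "paid")
```

Records with the same key land in the same partition, so their relative order is preserved there; the topic itself holds nothing beyond its partitions.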
What is a Broker?
A Kafka cluster is a set of servers, each of which is called a broker. The brokers are the real processing heart of Kafka.
What is a partition?
Topics are broken into “ordered” commit logs called partitions.
Why is a partition called an ordered commit log?
Because a partition is an append-only structure: you cannot change what has already been written, which makes it similar to a commit log. It is ordered because, being append-only, messages are read in the same order in which they were written.
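A minimal sketch of an append-only, ordered log, assuming records are identified by their position (offset). The `Partition` class is a toy illustration, not Kafka's storage format.

```python
class Partition:
    """Toy append-only log: records can be added at the end and read, never changed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return a copy of all records from offset onward, in write order."""
        return list(self._records[offset:])

partition = Partition()
offsets = [partition.append(r) for r in ("a", "b", "c")]
replay = partition.read_from(1)
```

Offsets only grow, there is no update or delete method, and reading from any offset returns records in the order they were written.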