Apache Kafka Flashcards
What is a broker?
The servers (physical or virtual) that hold the queues
What is a partition?
An ordered, immutable sequence of messages that we append to, like a log file. Each broker can contain multiple partitions.
In fact, a topic is partitioned into multiple queues to increase throughput and scale horizontally. Events in a partition are processed in-order.
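The notes do not say how records are spread across partitions. A minimal sketch, assuming records carry a key and the partition is chosen by hashing that key: zlib.crc32 is only a stand-in here (Kafka's default partitioner uses a different hash), and choose_partition, the three-partition topic, and the sample events are all hypothetical.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (illustrative hash, not Kafka's)."""
    return zlib.crc32(key) % num_partitions

# Hypothetical topic with 3 partitions, each modeled as an append-only list.
partitions = [[] for _ in range(3)]

events = [
    (b"user-1", "login"),
    (b"user-2", "login"),
    (b"user-1", "click"),
    (b"user-1", "logout"),
]
for key, value in events:
    partitions[choose_partition(key, len(partitions))].append((key, value))

# Every event for user-1 lands in the same partition, in production order.
p = choose_partition(b"user-1", len(partitions))
user1_events = [v for k, v in partitions[p] if k == b"user-1"]
print(user1_events)  # ['login', 'click', 'logout']
```

Because the same key always hashes to the same partition, per-key ordering is preserved even though the topic as a whole is split across several logs.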
What is a topic?
A topic is a logical grouping of partitions. You publish to and consume from topics in Kafka. Ordering is guaranteed only within a partition, so events across a topic as a whole are not necessarily processed in the order they were added.
What are producers and consumers?
Producers
Write messages/records to topics
Consumers
Read messages/records from topics. Consumers provide the offset of the message that they want to read.
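Offset-based reading can be illustrated with a toy in-memory log. This is a sketch under simplifying assumptions, not Kafka's client API: PartitionLog and its append/read methods are hypothetical names standing in for a single partition.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition: an append-only list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]

log = PartitionLog()
for msg in ["a", "b", "c", "d"]:
    log.append(msg)

# The consumer, not the log, keeps track of its position.
offset = 0
batch = log.read(offset, max_records=2)  # records at offsets 0 and 1
offset += len(batch)                     # the next read starts at offset 2

# Reading does not delete anything, so an old offset can be read again.
replay = log.read(0, max_records=1)
```

Keeping the position on the consumer side is what lets different consumers read the same partition at their own pace.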
Explain Consumer group
Consumer Group
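A consumer group scales read throughput and ensures that each message is delivered to only one consumer in the group, so other consumers in the same group do not re-process it. Note that this is not an exactly-once guarantee: by default delivery is at-least-once, so a message can be read again after a consumer failure or rebalance.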
Kafka will ensure that each partition is consumed by only one consumer from that group.
So, if you have a topic with two partitions and only one consumer in a group, that consumer would consume records from both partitions.
After another consumer joins the same group, each consumer would consume only one partition.
If you have more consumers in a group than you have partitions, extra consumers will sit idle, since all the partitions are taken.
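The three cases above can be sketched with a simplified round-robin assignment. This is an illustration only, not one of Kafka's actual assignor implementations; assign_partitions and the partition and consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over consumers; each partition gets exactly one owner."""
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

partitions = ["topic-0", "topic-1"]

one = assign_partitions(partitions, ["c1"])
two = assign_partitions(partitions, ["c1", "c2"])
three = assign_partitions(partitions, ["c1", "c2", "c3"])

print(one)    # {'c1': ['topic-0', 'topic-1']}
print(two)    # {'c1': ['topic-0'], 'c2': ['topic-1']}
print(three)  # {'c1': ['topic-0'], 'c2': ['topic-1'], 'c3': []}
```

In the last case c3 owns no partition and sits idle, which is why adding consumers beyond the partition count does not increase read throughput.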
How does replication work in Kafka?
The replication factor is configured at the topic level, and the unit of replication is the topic partition. Replication uses a single-leader architecture, with the difference that by default all reads and writes go to the leader of the partition (followers only replicate the data).
The In-Sync Replica (ISR) set is the set of replicas that are fully caught up with the leader replica of a partition. The ISR list is dynamic: replicas join and leave it as they catch up or fall behind. It determines which replicas are eligible to be promoted to leader when the current leader fails.
How does ISR work?
How ISR Works
Adding a Replica to ISR: When a new replica is created or when a replica rejoins the Kafka cluster after being out of sync, it starts replicating data from the leader. Once it catches up with the leader’s log, it is added to the ISR list.
Failing to Keep Up: If a replica falls behind the leader by more than a configured threshold (defined by replica.lag.time.max.ms), it is removed from the ISR list. This threshold is designed to ensure that only replicas that are sufficiently up-to-date are considered in-sync.
Leader Election: If the leader fails, Kafka selects a new leader from the ISR list. This ensures that the new leader has the most recent data, minimizing data loss.
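The three steps above can be modeled in a few lines. This is a simplified sketch, not how the broker is implemented: the Replica class, its fields, the helper functions, and the 30-second threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Illustrative value standing in for replica.lag.time.max.ms.
REPLICA_LAG_TIME_MAX_MS = 30_000

@dataclass
class Replica:
    broker_id: int
    last_caught_up_ms: int  # last time this replica was fully caught up with the leader
    alive: bool = True

def in_sync_replicas(replicas, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Replicas that are alive and were caught up within the last max_lag_ms."""
    return [r for r in replicas
            if r.alive and now_ms - r.last_caught_up_ms <= max_lag_ms]

def elect_leader(replicas, failed_leader_id, now_ms):
    """Pick a new leader from the ISR, skipping the failed leader; None if no candidate."""
    candidates = [r for r in in_sync_replicas(replicas, now_ms)
                  if r.broker_id != failed_leader_id]
    return candidates[0].broker_id if candidates else None

now = 100_000
replicas = [
    Replica(broker_id=1, last_caught_up_ms=now),           # current leader
    Replica(broker_id=2, last_caught_up_ms=now - 5_000),   # slightly behind, still in sync
    Replica(broker_id=3, last_caught_up_ms=now - 60_000),  # lagging too long, out of the ISR
]

isr_ids = [r.broker_id for r in in_sync_replicas(replicas, now)]

# The leader fails; only an in-sync replica may take over.
replicas[0].alive = False
new_leader = elect_leader(replicas, failed_leader_id=1, now_ms=now)
```

Here broker 3 is excluded from the election because it fell behind for longer than the threshold, so broker 2 is the only eligible successor.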
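Related parameter: min.insync.replicas (when a producer uses acks=all, a write is accepted only if at least this many replicas are currently in the ISR)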
With this ISR model and f+1 replicas, a Kafka topic can tolerate f failures without losing committed messages. Note that Kafka's guarantee against data loss relies on at least one replica remaining in sync. If all the nodes replicating a partition die, this guarantee is invalidated.
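A small arithmetic sketch of the f+1 rule and of how min.insync.replicas interacts with it, assuming producers write with acks=all. The function names are hypothetical and the model ignores everything except replica counts.

```python
def tolerated_failures(replication_factor: int) -> int:
    """With f + 1 replicas, committed messages survive f failures."""
    return replication_factor - 1

def can_commit(isr_size: int, min_insync_replicas: int) -> bool:
    """With acks=all, a write is accepted only if the ISR is large enough."""
    return isr_size >= min_insync_replicas

def failures_before_writes_stall(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can drop out of the ISR while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas

# Example: replication factor 3, min.insync.replicas 2.
rf, min_isr = 3, 2
print(tolerated_failures(rf))                     # 2
print(failures_before_writes_stall(rf, min_isr))  # 1
print(can_commit(isr_size=2, min_insync_replicas=min_isr))  # True
print(can_commit(isr_size=1, min_insync_replicas=min_isr))  # False
```

The two numbers differ on purpose: already-committed data survives two failures in this example, but new acks=all writes are rejected as soon as the ISR shrinks below two replicas.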
Kafka Use Cases
Anytime a message queue is needed
Processing can be done asynchronously. For example, YouTube transcoding
In-order message processing (Ticketmaster waiting queue)
Decouple producer and consumer so they can scale independently
LeetCode or an online judge
Anytime a stream is needed
Need to process a lot of data in real-time (ad click aggregator)
Stream of messages to be processed by multiple consumers simultaneously (Messenger or FB Live Comments)