Kafka Internals Flashcards

Question

Request processing flow in Kafka?

Answer 1

Kafka uses the Acceptor-Connector pattern. The network threads (also called processor threads) take requests from client connections and place them in the request queue. The queued requests are then picked up for processing by I/O threads (also called request handler threads). For a fetch request, the I/O thread reads the data from disk and places the response in the response queue, from which a network thread sends it to the consumer. For a produce request, the ack is sent back to the producer by the network threads in the same way.

Answer 2

1) ACL permission - the client has Write or Read permission for the topic. 2) Acks - the value must be one of 0, 1 or all. 3) If acks is all, there must be enough in-sync replicas.
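
The three checks above can be sketched as a small function. This is only an illustration of the order of the checks, not broker code; the function name, its parameters, and the returned reason strings are all hypothetical.

```python
# Hypothetical sketch of the three produce-request checks listed above.
VALID_ACKS = {"0", "1", "all"}

def validate_produce_request(has_write_permission, acks, in_sync_replicas, min_in_sync_replicas):
    """Return None if the request passes, otherwise a short reason string."""
    # 1) ACL permission on the topic
    if not has_write_permission:
        return "not authorized"
    # 2) acks must be one of 0, 1 or all
    if str(acks) not in VALID_ACKS:
        return "invalid acks"
    # 3) with acks=all there must be enough in-sync replicas
    if str(acks) == "all" and in_sync_replicas < min_in_sync_replicas:
        return "not enough in-sync replicas"
    return None
```

For example, `validate_produce_request(True, "all", 1, 2)` reports that there are not enough in-sync replicas, while the same call with `acks=1` passes.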

Answer 3

No, Kafka does not wait until the data is persisted to disk. It relies on replication for message durability.

Answer 4

If acks is set to all, the request is held in a buffer called "purgatory" until the leader observes that the follower replicas have replicated the message, at which point the response is sent to the client.

Answer 5

Fetch requests are sent from a consumer to a broker to request data from a particular partition of a topic. The client also specifies a limit on how much data the broker can return for each partition. This limit is important: without it, the broker could send back so much data that the consumer runs out of memory.

Answer 6

A consumer can only consume messages that have been replicated to all in-sync replicas. If the in-sync replicas have not yet received the latest messages, a consumer asking for those messages gets an empty response.
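
A minimal sketch of this rule, assuming each in-sync replica reports a log end offset (the next offset it would write). The lowest of those offsets acts as the limit, usually called the high watermark, and consumers only see offsets below it. The function names and the dictionary layout are hypothetical.

```python
def high_watermark(isr_log_end_offsets):
    """Lowest log end offset among the in-sync replicas (broker id -> log end offset)."""
    return min(isr_log_end_offsets.values())

def fetchable(messages, fetch_offset, hw):
    """Messages a consumer may read: offsets from fetch_offset up to, not including, hw."""
    return [m for offset, m in enumerate(messages) if fetch_offset <= offset < hw]

# The leader (broker 1) has 5 messages, but one follower has only replicated 3.
log = ["m0", "m1", "m2", "m3", "m4"]
hw = high_watermark({1: 5, 2: 3, 3: 4})
visible = fetchable(log, 0, hw)        # only the fully replicated messages
not_yet = fetchable(log, 3, hw)        # empty: offset 3 is not replicated everywhere yet
```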

Answer 7

10 partitions * 3 replicas = 30 replicas. 30 replicas / 6 brokers = 5 replicas per broker. 10 partitions need 10 leaders.
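
The same arithmetic as a worked example, assuming the figures describe a topic with 10 partitions and a replication factor of 3 spread over 6 brokers (the card does not state this, so the variable names are an interpretation):

```python
partitions = 10          # assumed: partitions in the topic
replication_factor = 3   # assumed: replicas per partition
brokers = 6              # assumed: brokers in the cluster

total_replicas = partitions * replication_factor   # 10 * 3
replicas_per_broker = total_replicas // brokers     # 30 / 6
leaders = partitions                                # one leader per partition
```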

Answer 8

Kafka is not a database or permanent storage. It is a durable publish-subscribe mechanism, so it does not keep data forever, nor does it wait for all consumers to read a message before deleting it. The Kafka administrator configures the retention for each topic in one of two ways: 1) the amount of time to store messages before deleting them; 2) the amount of data to store before older messages are purged.

Answer 9

Finding the messages that need deletion in a large file and then deleting a portion of that file is both time-consuming and error-prone, so each partition is split into segments. By default each segment holds either 1 GB of data or a week's worth of data, whichever is smaller. As the Kafka broker writes to a partition, once the segment limit is reached it closes the file and starts a new one.
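
The roll decision can be sketched as a simple check against the two default limits quoted above. This is a simplification for illustration; the constant and function names are hypothetical, and a real broker applies more detailed rules.

```python
SEGMENT_MAX_BYTES = 1 * 1024 ** 3             # 1 GB, the default size limit quoted above
SEGMENT_MAX_AGE_SECONDS = 7 * 24 * 60 * 60    # one week, the default age limit quoted above

def should_roll(segment_bytes, segment_age_seconds):
    """True when the active segment has hit either limit and a new file should be started."""
    return (segment_bytes >= SEGMENT_MAX_BYTES
            or segment_age_seconds >= SEGMENT_MAX_AGE_SECONDS)
```

A small segment that is more than a week old still rolls, which is why "whichever is smaller" applies.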

Answer 10

Each segment is stored in a single data file. The format of the data on disk is identical to the format of the messages that we send. Using the same message format on disk is what allows Kafka to use the zero-copy optimization when sending messages to consumers.

Answer 11

Answer 12

When a consumer requests a message, the broker first consults the index with the topic, partition and offset. The index tells it which segment holds that offset and where in the segment the data is located, so it can go directly to that position and return the data.
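
A rough sketch of such a lookup, assuming segments are identified by the first offset they contain and that the index holds sorted (offset, byte position) pairs for only some of the offsets. Both helper functions and their data layout are hypothetical, not Kafka's on-disk format.

```python
from bisect import bisect_right

def find_segment(base_offsets, target_offset):
    """Pick the segment whose first offset is the largest one <= target_offset."""
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before the first segment")
    return base_offsets[i]

def find_position(index_entries, target_offset):
    """Byte position of the closest indexed offset at or before target_offset."""
    offsets = [offset for offset, _ in index_entries]
    i = bisect_right(offsets, target_offset) - 1
    return index_entries[i][1] if i >= 0 else 0

segments = [0, 100, 200]                          # first offset of each segment
index_100 = [(100, 0), (120, 4096), (140, 8192)]  # index for the segment starting at 100
segment = find_segment(segments, 130)
position = find_position(index_100, 130)          # reading starts here and scans forward to 130
```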

Answer 13

No. Either the whole segment is deleted or none of it is.

Answer 14

Setting the policy to compaction only makes sense for topics whose events contain both keys and values.

Answer 15

It deletes events older than the retention time and, through compaction, keeps only the most recent value for each key in the topic.
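
The "most recent value per key" part can be illustrated with a few lines of code. This sketch covers only that rule and ignores retention time; the record layout (offset, key, value) and the function name are assumptions made for the example.

```python
def compact(records):
    """Keep only the latest record for each key, in offset order.

    records: list of (offset, key, value) tuples in increasing offset order.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)   # a later record replaces an earlier one
    return sorted(latest.values())           # offsets are unique, so this sorts by offset

log = [(0, "a", 1), (1, "b", 2), (2, "a", 3)]
compacted = compact(log)
```

Here the record at offset 0 is dropped because key "a" has a newer value at offset 2.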

Answer 16

The compaction fails.

Answer 17

Messages are eligible for compaction only once their segment becomes inactive. Kafka starts compacting when 50% of the topic contains dirty records. Dirty records are those written after the last compaction, so they may still include older values for a key that has since been updated.

Answer 18

Suppose the leader of a partition goes down and the remaining replicas are out of sync. If this property is enabled, an out-of-sync replica can be chosen as the leader. Out-of-sync replicas that become leaders are called "unclean leaders". The default value is false.

Answer 19

Suppose the leader of a partition goes down and the replicas are out of sync. If the "unclean.leader.election.enable" property is set, any of the out-of-sync replicas can be chosen as the leader. That out-of-sync leader is called an unclean leader.

Answer 20

Then the partition will go offline until we bring the old leader back online. We will not be able to produce any records to that partition.

Answer 21

Suppose we wrote offsets 11-14 on the new (unclean) leader and the old leader now comes back online. Kafka will give it priority and try to make it the leader again. But if the old leader already held its own records at offsets 11-14 from before it went down, the records written to the new unclean leader will be lost. Some consumers may have consumed the new messages from the unclean leader, while others may have consumed the old messages at the same offsets from the old leader. So two consumers can end up with different sets of messages for offsets 11-14.

Answer 22

The problem that arises is duplicates. Say we are working on a banking system and we send the message "add $10 to account". The message is written to Kafka and processed by the consumer, but we do not receive the ack because of a network issue. We then retry and send "add $10 to account" again, which credits the account with $20. We do not want that.

Answer 23

To avoid problems from duplicates, we should make messages idempotent. An idempotent message ensures that even if the same message is sent twice, it has no negative effect on correctness. For example, "Add $10 to account" is NOT idempotent, but "Account balance is $110" is idempotent.
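
The difference shows up as soon as the same message is applied twice. The two handler functions below are hypothetical stand-ins for a consumer applying each kind of message to an account that starts at $100.

```python
def apply_add(balance, amount):
    """Non-idempotent: every delivery changes the balance again."""
    return balance + amount

def apply_set(balance, new_balance):
    """Idempotent: repeated deliveries leave the same final balance."""
    return new_balance

start = 100
# The producer retried, so each message is delivered twice.
after_add_twice = apply_add(apply_add(start, 10), 10)     # credited twice
after_set_twice = apply_set(apply_set(start, 110), 110)   # same result as one delivery
```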

Answer 24

Some utilities come prepackaged with Kafka that help test whether the configuration we have chosen meets our requirements. These are the VerifiableProducer and VerifiableConsumer classes in the org.apache.kafka.tools package.

Answer 25

1) It performs the complementary check. 2) It consumes events and prints out the events it consumed, in order. 3) It also prints information related to commits and rebalances.

Answer 26

1) It produces a sequence of messages numbered from 1 to a number you provide. 2) You can configure it by setting the number of acks, the retries and the rate at which messages are produced.

Answer 27

1) Leader election - What happens if we kill the leader? How long does it take for the producer and consumer to start working again? 2) Controller election - How long does it take the system to resume after a restart of the controller? 3) Rolling restart - Can I restart the brokers one by one without losing any messages? 4) Unclean leader election test - What happens when we kill all the replicas of a partition one by one (to make sure each goes out of sync) and then restart a broker that was out of sync? What needs to happen in order to resume operations? Is the behavior acceptable?

Answer 28

Latency = consumer receive timestamp - producer produce timestamp.

Answer 29

The number of events that arrive within a specific amount of time.
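
Both measurements, latency from the previous card and throughput from this one, reduce to simple arithmetic on timestamps and counts. The function names and the sample numbers below are made up for illustration.

```python
def latency_ms(produce_ts_ms, receive_ts_ms):
    """Latency: consumer receive timestamp minus producer produce timestamp."""
    return receive_ts_ms - produce_ts_ms

def throughput_per_second(event_count, window_seconds):
    """Throughput: events that arrived divided by the length of the window."""
    return event_count / window_seconds

sample_latency = latency_ms(1_000, 1_045)          # produced at t=1000 ms, received at t=1045 ms
sample_throughput = throughput_per_second(600, 60) # 600 events in a 60 second window
```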

Answer 30

1) GC parameters 2) Batch size 3) Sync or Async (linger.ms) property 4) Compression

Answer 31

Sets the maximum time to buffer data in async mode. By default the producer does not wait; it sends the buffer any time data is available. Increasing linger.ms trades higher latency for higher throughput.
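
As a rough illustration, a throughput-leaning producer configuration might look like the dictionary below. The property names are standard Kafka producer settings, but every value here is an arbitrary example rather than a recommendation, and the right numbers depend on the workload.

```python
# Example values only; tune against your own latency and throughput targets.
producer_config = {
    "linger.ms": 20,            # wait up to 20 ms to fill a batch (the default is to not wait)
    "batch.size": 65536,        # larger batches, in bytes
    "compression.type": "lz4",  # compress batches before sending
    "acks": "all",              # wait for the in-sync replicas
}
```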

Answer 32

1) GC parameters 2) Fetch size - Maximum message size a consumer can read. Must be at least as large as "message.max.bytes".

Answer 33

Maximum message size a consumer can read. Must be at least as large as "message.max.bytes".

Answer 34

Adding a new consumer can increase the overall throughput. Adding a consumer group has no effect on performance.

Answer 35

1) message.max.bytes - Maximum message size the broker will accept. 2) log.segment.bytes - Size of the Kafka data file. Must be larger than any single message. 3) replica.fetch.max.bytes - Maximum message size a broker can replicate. Must be larger than message.max.bytes, or a broker can accept messages it cannot replicate, resulting in data loss. 4) num.replica.fetchers - The number of threads that replicate data from the leader to the follower. If threads are available, use more replica fetchers so that replication runs in parallel. 5) num.io.threads - Number of threads that put data on disk. The right setting depends directly on how many disks you have in your cluster. These threads are used by the server to execute requests. We should have at least as many threads as we have disks.
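
The size relationships in this list can be checked mechanically. The sketch below is a hypothetical helper, not a Kafka tool; the "disks" key is invented for the example, and treating equal sizes as acceptable for replica.fetch.max.bytes is an assumption.

```python
def check_broker_config(cfg):
    """Return a list of problems found in a dict of broker settings."""
    problems = []
    # A broker must be able to replicate any message it accepts.
    if cfg["replica.fetch.max.bytes"] < cfg["message.max.bytes"]:
        problems.append("replica.fetch.max.bytes is smaller than message.max.bytes")
    # A segment file must be larger than any single message.
    if cfg["log.segment.bytes"] <= cfg["message.max.bytes"]:
        problems.append("log.segment.bytes is not larger than message.max.bytes")
    # At least one I/O thread per disk ("disks" is a made-up key for this sketch).
    if cfg["num.io.threads"] < cfg["disks"]:
        problems.append("num.io.threads is lower than the number of disks")
    return problems

good = {"message.max.bytes": 1_000_000, "replica.fetch.max.bytes": 2_000_000,
        "log.segment.bytes": 1_073_741_824, "num.io.threads": 8, "disks": 4}
bad = dict(good, **{"replica.fetch.max.bytes": 500_000, "num.io.threads": 2})
```

Running `check_broker_config(bad)` flags the replication limit and the thread count, while `check_broker_config(good)` returns an empty list.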


(60 cards)