Design Ads Click Aggregator Flashcards
Trade-offs between DynamoDB and Cassandra
You know it
What is replication lag in a relational database?
You know it
Why do we want different storage for individual click and aggregated data? Why do we want to store individual clicks?
You know it
How could we miss some click events, and how do we handle the resulting data inconsistency?
You know it
How do we in general solve hot partitions?
You know that
What guarantee does Kafka provide when multiple consumers in a consumer group process messages from multiple partitions?
You know it
What is an offset in Kafka?
You know it
How does Kafka decide whether to delete a message?
Apache Kafka handles message deletion and retention primarily based on the policies defined at the topic level. Kafka’s approach to message storage is designed to deal with large volumes of data efficiently, making it a popular choice for systems that require reliable, high-throughput, low-latency handling of streaming data.
- Immutable Logs: Kafka stores messages in topics, which are split into partitions. Each partition is essentially an ordered, immutable sequence of messages that is continually appended to—structured as a commit log.
- Retention Policy: The decision to delete messages is based on the retention policy specified for a topic. Kafka does not delete individual messages based on content; rather, it cleans up messages in chunks as specified by the retention settings on a topic.
Kafka offers several configuration options that control how long messages are retained:
- Time-based Retention (retention.ms): The maximum time that messages are retained in a log. For example, if set to two days, messages older than two days become eligible for deletion at the next retention check.
- Size-based Retention (retention.bytes): The maximum size in bytes of the log per partition. If the log exceeds this size, the oldest segments are deleted until the log is back under the limit.
- Log Compaction (cleanup.policy=compact): Another approach to managing storage. Instead of deleting old messages based on time or size, log compaction retains at least the latest message for each key within a partition. This is particularly useful for topics that represent state rather than events, since the latest state for every key can be recovered from the log while storage is reduced.
- Log Cleanup Process: Kafka periodically checks for data to remove. Brokers look for segments eligible for deletion at an interval set by log.retention.check.interval.ms; compaction is handled separately by the log cleaner threads.
- Deletion Details: When deletion is triggered by size or time limits, Kafka removes entire log segments (files in the file system). A segment becomes eligible for deletion once its newest message is older than the retention period, or when the partition's log exceeds the retention size, in which case the oldest segments are removed first.
- Efficient Storage Management: Kafka’s design allows it to handle deletion efficiently without needing to scan through all messages. By dealing with files (log segments) rather than individual messages, it minimizes the overhead associated with deletion.
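The log compaction policy from the list above can be illustrated with a small sketch. This is a simplified in-memory model, not Kafka's implementation: the `compact` function and the `(offset, key, value)` tuple layout are assumptions made for illustration, and real compaction runs in the background over segments rather than over the whole log at once.

```python
def compact(log):
    """Keep only the newest record per key, preserving offset order.

    `log` is a list of (offset, key, value) tuples in offset order.
    This is a hypothetical model of what compaction leaves behind.
    """
    latest = {}
    for offset, key, value in log:
        # A later offset for the same key replaces the earlier record.
        latest[key] = (offset, key, value)
    # Offsets are unique, so sorting the tuples restores offset order.
    return sorted(latest.values())


# Click counts per ad, written as successive state updates.
log = [
    (0, "ad-1", 10),
    (1, "ad-2", 5),
    (2, "ad-1", 12),
    (3, "ad-3", 7),
    (4, "ad-2", 9),
]

compacted = compact(log)
print(compacted)
```

After compaction only one record per key remains, and surviving records keep their original offsets, which is why a consumer reading the compacted topic can still rebuild the latest state for every key.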
Let’s say you have a topic with retention.ms set to 172800000 (2 days). Any message in this topic older than 2 days becomes eligible for deletion at the next retention check. However, the actual deletion depends on the log segment that contains the message: if the segment also holds newer messages that are still within the threshold, the whole segment is kept until its newest message expires.
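The two-day example can be sketched in code to show why deletion happens per segment rather than per message. This is a minimal model under stated assumptions: the `Segment` class and `expired_segments` function are hypothetical names, and each segment is reduced to its base offset and the timestamp of its newest record.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000
RETENTION_MS = 2 * DAY_MS  # 172800000, the retention.ms value from the example


@dataclass
class Segment:
    """Hypothetical stand-in for one log segment file of a partition."""
    base_offset: int
    max_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_segments(segments, now_ms, retention_ms=RETENTION_MS):
    """Return the segments a time-based retention check would remove.

    Walks from the oldest segment and stops at the first one whose
    newest record is still inside the retention window.
    """
    deleted = []
    for segment in sorted(segments, key=lambda s: s.base_offset):
        if now_ms - segment.max_timestamp_ms <= retention_ms:
            break
        deleted.append(segment)
    return deleted


now = 10 * DAY_MS
segments = [
    # Newest record is 3 days old: the whole segment has expired.
    Segment(base_offset=0, max_timestamp_ms=now - 3 * DAY_MS),
    # May contain records older than 2 days, but its newest is 1 day old.
    Segment(base_offset=1000, max_timestamp_ms=now - 1 * DAY_MS),
    # Still receiving fresh records.
    Segment(base_offset=2000, max_timestamp_ms=now),
]

deleted = expired_segments(segments, now)
print([s.base_offset for s in deleted])
```

Only the first segment is removed. Any old messages in the second segment survive past the two-day mark because they share a file with newer messages.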
Kafka’s data retention mechanism ensures efficient use of storage and provides flexibility to meet various application needs through configuration settings. Understanding these settings and their impact on Kafka’s performance and data availability is crucial for effectively managing Kafka clusters and ensuring that the system handles data in compliance with organizational policies or regulatory requirements.
Does Kafka ensure ordering of messages across different partitions? What if we do need some form of ordering?
You know that
What is consumer rebalancing in Kafka and what are its benefits?
You know it