Kafka Cluster Architectures and Administering Kafka Flashcards
When do we need cross-cluster mirroring?
When there are multiple interdependent clusters and administrators need to continuously copy data between them.
Copying data between clusters is called mirroring.
Apache Kafka's built-in cross-cluster replicator is called MirrorMaker.
What is mirroring?
Copying data from one cluster to another is called mirroring.
Use cases of mirroring?
1) Regional and central clusters - Regional Kafka applications write data to a regional cluster, and there is also a central cluster, used by a central team, that needs this data. Mirroring copies the regional data to the central cluster.
2) Disaster Recovery (DR) Redundancy
3) Cloud migrations - A new application is deployed in the cloud but also needs some data that is updated by applications running in an on-premises datacenter.
Drawbacks of Cross-datacenter communication?
1) High latencies
2) Limited bandwidth
3) High costs
What are the different types of multi-cluster architectures?
1) Hub and Spokes Architecture
2) Active-Active Architecture
3) Active-Standby Architecture
4) Stretch Clusters
What is Hub and Spokes architecture?
When there are multiple local Kafka clusters and all of their data is needed in one central Kafka cluster (for monitoring, for example), the setup is called the Hub and Spokes architecture.
With this architecture, each regional datacenter needs at least one mirroring process running in the central datacenter.
Advantage and Disadvantage of Hub and Spokes architecture?
Disadvantage:
Data from one regional datacenter is not available in the other regional datacenters.
Advantage:
Data is always produced to the local cluster, and events from each datacenter are mirrored only once, to the central datacenter.
Applications that process data from a single datacenter can run in that datacenter.
Applications that need to process data from multiple datacenters run in the central datacenter.
The architecture is simple to deploy, configure and monitor.
Define Active-Active architecture?
Data produced to cluster A is copied to cluster B, and data produced to cluster B is copied to cluster A. With a third cluster C, the same two-way mirroring applies between C and each of A and B.
Define Active-Standby architecture?
All data produced to cluster A is also copied to cluster B, which stays on standby. It is used for disaster recovery.
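The mirroring directions of the architectures above can be sketched as lists of (source, target) pairs. This is only an illustration: the cluster names "A", "B", "C" and "central" are hypothetical, and the active-active function assumes every cluster mirrors to every other cluster.

```python
def hub_and_spokes(regional_clusters, central_cluster):
    """Each regional cluster is mirrored once, into the central cluster."""
    return [(source, central_cluster) for source in regional_clusters]


def active_active(clusters):
    """Assumption: every cluster is mirrored to every other cluster."""
    return [(src, dst) for src in clusters for dst in clusters if src != dst]


def active_standby(active_cluster, standby_cluster):
    """The active cluster is mirrored one way, into the standby cluster."""
    return [(active_cluster, standby_cluster)]


if __name__ == "__main__":
    # Hypothetical cluster names, used only for illustration.
    print("hub and spokes:", hub_and_spokes(["A", "B", "C"], "central"))
    print("active-active: ", active_active(["A", "B", "C"]))
    print("active-standby:", active_standby("A", "B"))
```

Each pair corresponds to at least one mirroring process, so the number of pairs gives a rough idea of how much mirroring each architecture needs.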
Define Stretch clusters?
TBA
What is Apache MirrorMaker?
Kafka includes a simple tool, called MirrorMaker, for mirroring data between two datacenters.
What does MirrorMaker consist of internally?
1) MirrorMaker has a single producer.
2) MirrorMaker has multiple consumers.
3) Each consumer runs in its own thread.
4) Each consumer consumes events from the topics and partitions assigned to it on the source cluster and uses the shared producer to send those events to the target cluster.
5) Every 60 seconds (by default), the consumers tell the producer to send all the events it has buffered to Kafka and wait until acks are received for those events.
6) The consumers then contact the source Kafka cluster to commit the offsets for those events. Because offsets are committed only after the target cluster has acknowledged the events, no data is lost.
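The flush-then-commit order in steps 5 and 6 can be sketched with in-memory stand-ins. This is a simulation under stated assumptions, not MirrorMaker's code: FakeProducer, FakeConsumer and mirror are hypothetical names, a single consumer is used, and a fixed number of polls stands in for the 60-second interval.

```python
class FakeProducer:
    """Hypothetical stand-in for the shared producer."""

    def __init__(self):
        self.buffer = []      # events sent but not yet acknowledged
        self.target_log = []  # events acknowledged by the target cluster

    def send(self, event):
        self.buffer.append(event)

    def flush(self):
        # Pretend the target cluster acknowledges every buffered event.
        self.target_log.extend(self.buffer)
        self.buffer.clear()


class FakeConsumer:
    """Hypothetical stand-in for one consumer reading the source cluster."""

    def __init__(self, source_log):
        self.source_log = source_log
        self.position = 0   # next offset to read
        self.committed = 0  # offset committed back to the source cluster

    def poll(self, max_events):
        batch = self.source_log[self.position:self.position + max_events]
        self.position += len(batch)
        return batch

    def commit(self):
        self.committed = self.position


def mirror(consumer, producer, polls_per_commit=2, batch_size=3):
    """Copy events; commit source offsets only after the producer flush."""
    exhausted = False
    while not exhausted:
        for _ in range(polls_per_commit):
            batch = consumer.poll(batch_size)
            if not batch:
                exhausted = True
                break
            for event in batch:
                producer.send(event)
        producer.flush()   # step 5: wait until the target has acked everything
        consumer.commit()  # step 6: only now commit offsets on the source


source_events = [f"event-{i}" for i in range(10)]
consumer = FakeConsumer(source_events)
producer = FakeProducer()
mirror(consumer, producer)
print(len(producer.target_log), consumer.committed)
```

Because the commit always follows the flush, a crash between the two would cause some events to be mirrored again rather than lost.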
What is the most important prerequisite before starting mirroring with MirrorMaker?
The topics configured in the source cluster must exist in the destination cluster.
Sample MirrorMaker command
sh kafka-mirror-maker --consumer.config --producer.config --new.consumer --num.streams 2 --whitelist ".*"
whitelist specifies the topics that need to be mirrored.
What configuration needs to be provided in MirrorMaker consumer.config properties?
1) group.id - All consumers in a MirrorMaker process share the same configuration, which means there can be only one source cluster and one group.id, so all the consumers are part of the same consumer group.
2) bootstrap.servers
MirrorMaker commits offsets itself, and by default it replicates only the events that are written to the source cluster after MirrorMaker starts (the consumer setting auto.offset.reset defaults to latest).
What does num.streams parameter represent in MirrorMaker command?
It represents the number of consumers that will be used.
What configuration needs to be provided in MirrorMaker command producer.config properties?
The only mandatory configuration is bootstrap.servers.
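A minimal sketch of what the two properties files might contain, rendered from Python dictionaries. The host names and the group name are hypothetical placeholders; auto.offset.reset is the standard Kafka consumer setting behind the "latest events" default and is shown only for illustration.

```python
def to_properties(settings):
    """Render a dict as the text of a Java-style .properties file."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Consumer side: points at the source cluster.
consumer_config = {
    "bootstrap.servers": "source-kafka:9092",  # hypothetical source broker
    "group.id": "mirrormaker-group",           # hypothetical group name
    "auto.offset.reset": "latest",             # consumer default, shown explicitly
}

# Producer side: points at the target cluster.
producer_config = {
    "bootstrap.servers": "target-kafka:9092",  # hypothetical target broker
}

consumer_properties = to_properties(consumer_config)
producer_properties = to_properties(producer_config)

print(consumer_properties)
print(producer_properties)
```

The consumer file names exactly one source cluster and one group, which matches the one-source-cluster-per-process limit.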
How many source clusters can be there for single process of MirrorMaker?
There can only be one source cluster per MirrorMaker process.
What is the new.consumer property in the MirrorMaker command?
It indicates that the new version of the consumer is to be used instead of the old consumer.
What is num.streams property in MirrorMaker command?
Each stream is another consumer reading from the source cluster.
All consumers in the same MirrorMaker share the same producer.
It takes multiple consumers to saturate the producer.
If more throughput is needed beyond that point, multiple MirrorMaker processes have to be created.
What is whitelist property in MirrorMaker command?
A regular expression of the topics that need to be mirrored. All the topic names that match the regular expression will be mirrored.
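The effect of a whitelist expression can be sketched with Python's re module. MirrorMaker itself uses Java regular expressions, so this is only an approximation; the topic names are hypothetical, and matching against the whole topic name is an assumption of this sketch.

```python
import re


def topics_to_mirror(whitelist, topics):
    """Return the topics whose full name matches the whitelist expression."""
    pattern = re.compile(whitelist)
    return [topic for topic in topics if pattern.fullmatch(topic)]


# Hypothetical topic names on the source cluster.
source_topics = ["orders.us", "orders.eu", "payments"]

print(topics_to_mirror(".*", source_topics))           # every topic matches
print(topics_to_mirror(r"orders\..*", source_topics))  # only the orders topics
```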
What are the important things to monitor when deploying MirrorMaker in production?
1) Lag monitoring - The lag is the difference between the latest message in the source Kafka cluster and the latest message in the destination cluster.
2) Metrics monitoring - Collect and monitor metrics available in MirrorMaker producer and consumer.
Consumer: fetch-size-avg, fetch-size-max, fetch-rate, fetch-throttle-time-avg and fetch-throttle-time-max.
Producer: batch-size-avg, batch-size-max, requests-in-flight and record-retry-rate.
Both: io-ratio and io-wait-ratio.
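The lag definition from item 1 can be written as a small calculation. This is a sketch under assumptions: the offsets are hypothetical numbers, and how they are collected from the two clusters is left out.

```python
def partition_lag(source_latest_offset, mirrored_offset):
    """Lag for one partition: messages in the source not yet mirrored."""
    return max(source_latest_offset - mirrored_offset, 0)


def total_lag(source_latest, mirrored):
    """Sum the lag over all partitions, keyed by (topic, partition)."""
    return sum(
        partition_lag(latest, mirrored.get(topic_partition, 0))
        for topic_partition, latest in source_latest.items()
    )


# Hypothetical offsets for two partitions of one topic.
source_latest = {("orders", 0): 120, ("orders", 1): 80}
mirrored = {("orders", 0): 100, ("orders", 1): 80}

print(total_lag(source_latest, mirrored))
```

A lag that keeps growing suggests that mirroring is not keeping up with the rate at which the source cluster is written.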