Q & A Flashcards
What is a batch?
A collection of messages being produced to the same topic and partition.
What is a controller?
The “lead” broker within a cluster that manages the state of partitions, replicas, and partition reassignment.
How is a controller elected?
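The first broker to start in the cluster becomes the controller by creating an ephemeral /controller node in Zookeeper. When the controller goes down, that node disappears, the remaining brokers are notified, and the first one to recreate the node becomes the new controller.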
What is Zookeeper?
Zookeeper is a centralized service that keeps track of the brokers, configurations, topics, and partitions.
What is a leader?
Each partition is owned by a single broker in the cluster, and that broker is the partition's leader. By default, all produce and consume requests for the partition go through the leader.
What are the three types of replicas?
Leader, follower, and preferred (the preferred leader is the replica that was the leader when the topic was created).
How are new leader replicas determined?
When the existing leader goes down (or becomes unresponsive), the controller chooses a new leader from the in-sync replicas. If auto.leader.rebalance.enable=true (the default), the controller also checks periodically whether the preferred leader is in sync but not currently the leader, and if so moves leadership back to it.
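As a rough illustration of the selection rule, here is a minimal sketch in Python. The function name and arguments are hypothetical and only model the decision; they are not part of any Kafka API. It assumes the first replica in the assignment list is the preferred leader.

```python
def choose_leader(replicas, in_sync, failed):
    """Pick a new leader for a partition.

    replicas: broker ids in assignment order; replicas[0] is the preferred leader.
    in_sync:  set of broker ids currently in the in-sync replica set.
    failed:   broker id of the leader that just went down.
    """
    # Keep assignment order so the preferred leader wins whenever it is eligible.
    candidates = [r for r in replicas if r in in_sync and r != failed]
    if not candidates:
        return None  # no in-sync replica is available to take over
    return candidates[0]
```

With replicas [1, 2, 3] all in sync and broker 1 failing, this picks broker 2. If broker 2 fails while 1 and 3 are in sync, it picks the preferred leader, broker 1.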
What is retention? What are the two types of retention?
Retention determines how long messages are kept within a given topic, and it can be limited by time or by size. The two types of retention (cleanup.policy) are delete and compact.
What is log compaction?
Compaction is a type of retention in which only the latest value for each key is retained; older values for the same key are removed when the log is cleaned.
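The idea can be sketched in a few lines of Python. This is a simplified model with a hypothetical compact function, not how the broker implements log cleaning; it only shows which records survive.

```python
def compact(log):
    """Keep only the most recent record for each key.

    log: list of (offset, key, value) tuples in offset order.
    Returns the surviving records, still in offset order.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    return sorted(latest.values(), key=lambda record: record[0])
```

For a log of (0, "a", "1"), (1, "b", "2"), (2, "a", "3"), the result keeps offsets 1 and 2: the older value for key "a" is dropped.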
Where is broker information stored in Zookeeper?
Under the /brokers/ids directory.
What is an ephemeral node?
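When a broker starts up, it creates an ephemeral node in Zookeeper to represent itself. The node is removed automatically when the broker stops or loses its Zookeeper session, but the broker ID remains in other data structures (such as replica lists), so a broker that comes back with the same ID immediately rejoins the cluster with its previous partitions.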
What is an ensemble?
A cluster of Zookeeper nodes.
What is the preferred number of Zookeeper nodes?
An odd number (2N+1 nodes, which tolerates N failed nodes), because Zookeeper needs a majority quorum to hold elections and respond to requests.
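The arithmetic behind the odd-number rule can be checked with a short sketch. The function names are illustrative only.

```python
def quorum_size(ensemble_size):
    """Smallest majority of an ensemble with the given number of nodes."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)
```

An ensemble of 3 tolerates 1 failure, and so does an ensemble of 4; it takes 5 nodes to tolerate 2. The extra even-numbered node adds cost without adding fault tolerance.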
What is the default port(s) for Zookeeper?
2181 is the client port, 2888 is used by followers to connect to the leader, and 3888 is used for leader election.
What is the default broker port?
9092
What is the default KSQL port?
8088
What is the default schema registry port?
8081
What is the auto.create.topics.enable setting? What actions can cause a new topic to be created?
When enabled, auto.create.topics.enable allows the broker to dynamically create topics if they don’t already exist. Any attempts to produce, consume, or request metadata from a topic will cause it to be created (using the default replication and partition settings from the broker).
What is the default number of partitions?
One partition is the default, although it is not preferred for scaling purposes.
How is a request handled in Kafka?
The process goes:
- Client Request
- Broker
- Partition Leader(s)
- Response
- Client
Are there any guarantees within Kafka with regards to ordering of messages?
Messages are always guaranteed to be ordered over a single partition.
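One practical consequence is that messages sharing a key stay in order when the key decides the partition. The sketch below assumes key-based partitioning and uses CRC32 as a stand-in hash; Kafka's default partitioner uses a different hash function, and the function names here are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key (bytes) to a partition. CRC32 is a stand-in, not Kafka's hash."""
    return zlib.crc32(key) % num_partitions


def route(messages, num_partitions):
    """Append (key, value) messages to per-partition lists in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions
```

Every message for a given key lands in the same list, so its relative order is preserved. Messages for different keys may land in different partitions, where no ordering between them is guaranteed.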
What is a segment? What do they contain?
Partitions are divided into segments. By default, a segment is closed once it reaches 1 GB of data or is a week old, whichever comes first. Each segment stores the messages (keys and values) along with two indexes: one based on offsets and one based on timestamps.
What is the unit of storage within Kafka?
A partition
What is stored within a message on disk?
The key, the value, the offset, the message size, a checksum for detecting corruption, a magic byte indicating the message format version, the compression codec, and a timestamp.
When is compaction evaluated?
Compaction runs in background cleaner threads and only touches closed segments; the active segment of a partition is never compacted.
Where does Kafka store any dynamic topic configurations?
In Zookeeper
What producer setting determines when a given message is ready to be consumed? What are the different options for it?
The acks property determines when a message has been “received” and it can be set to the values of 0, 1, or all.
What do the different ack configurations mean?
If acks=0, no acknowledgment is required from the broker that the message was received. If acks=1, the partition leader must confirm that it has received the message. If acks=all, the leader waits until all in-sync replicas have the message before responding, and the topic-specific min.insync.replicas property sets the minimum number of in-sync replicas that must exist for the write to be accepted at all.
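The three modes can be modeled with a small sketch. This is a simplified model with a hypothetical function name: it assumes that with acks=all the leader waits for every replica currently in sync, and that the write is rejected when fewer than min.insync.replicas replicas are in sync.

```python
def required_acks(acks, isr_size, min_insync_replicas=1):
    """How many replicas must hold the record before the producer sees success."""
    if acks == "0":
        return 0  # fire and forget: the producer does not wait for the broker
    if acks == "1":
        return 1  # only the partition leader has to write the record
    if acks == "all":
        if isr_size < min_insync_replicas:
            # Too few in-sync replicas: the write is rejected, not acknowledged.
            raise RuntimeError("not enough in-sync replicas")
        return isr_size
    raise ValueError(f"unsupported acks value: {acks!r}")
```

With three in-sync replicas and min.insync.replicas=2, an acks=all write needs all three. If only one replica is in sync, the same write fails instead of being acknowledged.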
What are the three minimally required settings for a producer to have configured?
A producer at a minimum needs a broker (e.g. bootstrap server), a key serializer, and a value serializer.
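As a sketch, the three settings map to the properties bootstrap.servers, key.serializer, and value.serializer. The host below is a placeholder and the helper function is hypothetical; the serializer shown is the string serializer from the Java client, used here only as an example value.

```python
REQUIRED_PRODUCER_SETTINGS = ("bootstrap.servers", "key.serializer", "value.serializer")

# Placeholder values: adjust the host and serializers for a real cluster.
producer_config = {
    "bootstrap.servers": "localhost:9092",
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.StringSerializer",
}


def missing_settings(config):
    """Return the required producer settings that are absent from config."""
    return [name for name in REQUIRED_PRODUCER_SETTINGS if name not in config]
```

Calling missing_settings on a config that only sets bootstrap.servers reports the two serializer properties as missing.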
What is linger.ms? How can it help with batching?
linger.ms defines the time to wait before sending a batch, which can allow time for more messages to arrive and be sent, which can increase throughput.
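The effect is easy to see in a toy simulation. The model below is deliberately simplified and the function is hypothetical: a batch is sent when it is full or when a message arrives more than linger_ms after the batch's first message. The real producer also considers batch size in bytes and whether the sender is busy.

```python
def simulate_batches(arrivals_ms, linger_ms, max_batch):
    """Group message arrival times (in ms, ascending) into batches."""
    sent, current = [], []
    for t in arrivals_ms:
        # The open batch expired before this message arrived: send it first.
        if current and t - current[0] > linger_ms:
            sent.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch:  # batch is full: send immediately
            sent.append(current)
            current = []
    if current:
        sent.append(current)
    return sent
```

For arrivals at 0, 1, 2, 10, 11, and 30 ms, a linger of 0 ms yields six single-message batches in this model, while a linger of 5 ms yields three batches: [0, 1, 2], [10, 11], and [30].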
What are some common mechanisms to increase producer throughput?
Make sure batching is effective: use an adequate batch.size and linger.ms, and enable compression (compression.type).
What is the default compression used for Kafka messages? What are the other options?
By default, messages are not compressed. Supported compression types are snappy, gzip, lz4, and (since Kafka 2.1) zstd.
What are the roles of the producer and consumer during compression?
The producer is responsible for compressing the messages and the consumer is responsible for decompressing them.
What is max.in.flight.requests.per.connection? What are the potential dangers of adjusting it?
max.in.flight.requests.per.connection is a producer setting that indicates how many requests (batches of messages) can be sent to a broker without receiving a response.
Setting this too high can make batching less efficient and increase memory usage, and if a failed request is retried, messages may be written out of order.
Setting the value to 1 ensures that only a single request is in flight at a time, which guarantees message ordering even when retries occur.
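The ordering risk can be illustrated with a toy model. This is a sketch under simplifying assumptions, with a hypothetical function: batches are sent in windows of max_in_flight, and a batch that fails on its first attempt is retried only after the rest of its window has been written.

```python
def delivery_order(batches, max_in_flight, fails_once):
    """Return the order in which batches end up written to the log."""
    log, pending, already_failed = [], list(batches), set()
    while pending:
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retries = []
        for batch in window:
            if batch in fails_once and batch not in already_failed:
                already_failed.add(batch)  # first attempt fails
                retries.append(batch)
            else:
                log.append(batch)          # write succeeds
        pending = retries + pending        # retried batches go out next
    return log
```

With batches b1, b2, b3 where b1 fails once, two requests in flight produce the order b2, b1, b3, while one request in flight preserves b1, b2, b3.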
What are the two types of errors when producing a Kafka message? What are some examples of each?
Retriable and non-retriable.
Retriable errors include those where the broker or partition leader was not available; the send is retried automatically in the hope that a new leader has been elected and the message goes through.
Non-retriable errors are those that will consistently fail without some kind of intervention, such as a message that exceeds the maximum allowed size (RecordTooLargeException in the Java client); these throw an exception immediately.
Why are the brokers referred to as bootstrap servers?
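Clients use the listed brokers only to bootstrap their connection to the cluster. Since each broker holds metadata about the other brokers, topics, and partitions, a client can contact any one of them to discover the full cluster and then talk directly to the appropriate partition leaders.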
What is unclean.leader.election.enable?
If unclean.leader.election.enable is set to true, it allows for an out-of-sync replica to be elected as leader. This is only useful if you don’t necessarily care about data integrity and prefer availability over it.
What is a consumer group?
A group of related consumers that perform a specific task. Each consumer within a group will have mutually exclusive partitions and offsets.
What is a consumer?
Similar to a producer that produces messages to a broker and a specific topic, a consumer consumes messages from a given topic.
What is a rebalance? What can cause them?
Rebalancing is the act of moving responsibility for a given partition from one consumer to another. It can occur when the number of partitions for a topic changes or when a consumer is added to or removed from the group.
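A rebalance can be pictured as recomputing the partition assignment whenever group membership changes. The sketch below uses a simplified round-robin rule with a hypothetical function; it is not the algorithm of any particular Kafka assignor.

```python
def assign(partitions, consumers):
    """Spread partitions across consumers in round-robin order."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment
```

With four partitions and consumers c1 and c2, c1 gets partitions 0 and 2 and c2 gets 1 and 3. When c3 joins, recomputing gives c1 partitions 0 and 3, c2 partition 1, and c3 partition 2, so some partitions change owners.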
What are the three supported syntaxes to subscribe to a topic?
A topic can be subscribed to via a string, an array of topics, or a pattern (e.g. regular expression).
What is the polling loop responsible for? What happens on the initial poll?
The polling loop will handle retrieving records, metadata, partition rebalancing info, heartbeats, and more during each poll. Additionally the initial poll is responsible for registering with the group coordinator and finding the appropriate partition to use.
What is the session.timeout.ms property?
It’s the number of milliseconds that a consumer can go without sending a heartbeat to the group coordinator before it is considered dead and a rebalance is triggered.
What is auto.offset.reset? What are the supported values of it?
auto.offset.reset tells a consumer where to start reading a partition when its group has no committed offset for it (or the committed offset is no longer valid). It can be set to earliest, latest (the default), or none (which throws an exception if no committed offset is found).
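The decision can be sketched as a small function. The name and arguments are hypothetical; the model assumes the setting only matters when there is no committed offset or the committed offset falls outside the range still available in the log.

```python
def starting_offset(committed, log_start, log_end, auto_offset_reset):
    """Decide where a consumer starts reading a partition."""
    # A valid committed offset always wins; the reset policy is not consulted.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    if auto_offset_reset == "earliest":
        return log_start  # oldest record still available
    if auto_offset_reset == "latest":
        return log_end    # only records produced from now on
    if auto_offset_reset == "none":
        raise LookupError("no valid committed offset for this group")
    raise ValueError(f"unsupported value: {auto_offset_reset!r}")
```

A group with a committed offset of 42 resumes at 42 regardless of the setting. A new group starts at the beginning of the log with earliest, at the end with latest, and fails with none.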
What is a worker?
A single Java process, generally used with regards to Kafka Connect.
What is the tasks.max property? What are the preferred configurations for sink connectors? What about source connectors?
This designates the maximum number of tasks that the connector may create to move data into or out of Kafka topics; the tasks run on the available workers.
For source connectors that cannot read in parallel, it should be set to one. For sink connectors, it can be higher, up to the number of topic partitions being consumed.
Where does Kafka store schemas from the schema registry?
In the _schemas topic (the default name, with a single leading underscore).
What is a SpecificRecord? What about a GenericRecord?
A SpecificRecord is a strongly typed Java class generated from existing Avro schema files, typically via a Maven or Gradle plugin.
A GenericRecord is not tied to a generated class: it carries its schema at runtime, and its fields must be accessed by name or index.