Test 1 Flashcards
In Avro, adding a field to a record without a default is a __ schema evolution
Forward
Explanation
Clients using the old schema will be able to read records written with the new schema (the added field is simply ignored), which is forward compatibility.
A consumer wants to read messages from a specific partition of a topic. How can this be achieved?
Call assign() passing a collection of TopicPartitions as the argument
assign() can be used for manual assignment of partitions to a consumer, in which case subscribe() must not be used. assign() takes a collection of TopicPartition objects as an argument (https://kafka.apache.org/23/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#assign-java.util.Collection-)
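A minimal Java sketch of manual assignment (the broker address, topic name, and String deserializers are placeholder assumptions):

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ManualAssignExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // assign() replaces subscribe(): this consumer reads only partition 0 of "my-topic"
                consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0)));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.offset() + ": " + record.value());
                }
            }
        }
    }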
We would like to be in an at-most once consuming scenario. Which offset commit strategy would you recommend?
Commit the offsets in Kafka, before processing the data
Explanation
Here, we must commit the offsets right after receiving a batch from a call to .poll() and before processing the records.
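A minimal sketch of an at-most-once consumer, assuming manual commits with enable.auto.commit=false (broker address, group id, and topic are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtMostOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "at-most-once-group");      // placeholder
            props.put("enable.auto.commit", "false");         // we commit manually
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    consumer.commitSync(); // commit BEFORE processing: at-most-once semantics
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(record.value()); // if processing fails, records are not re-read
                    }
                }
            }
        }
    }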
The exactly-once guarantee in Kafka Streams is for which flow of data?
Kafka => Kafka
Explanation
Kafka Streams can only guarantee exactly once processing if you have a Kafka to Kafka topology.
To import data from external databases, I should use
Kafka Connect Source
Explanation
Kafka Connect Sink is used to export data from Kafka to external databases and Kafka Connect Source is used to import from external databases into Kafka.
You want to sink data from a Kafka topic to S3 using Kafka Connect. There are 10 brokers in the cluster, the topic has 2 partitions with replication factor of 3. How many tasks will you configure for the S3 connector?
2
Explanation
You cannot have more sink tasks (= consumers) than the number of partitions, so 2.
What is the risk of increasing max.in.flight.requests.per.connection while also enabling retries in a producer?
Message order is not preserved
Explanation
Some messages may require multiple retries. If there is more than one request in flight, retried messages may be received out of order. Note an exception to this rule: if you enable the producer setting enable.idempotence=true, it takes care of the reordering problem on its own. See: https://issues.apache.org/jira/browse/KAFKA-5494
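A sketch of a producer configured with retries and several in-flight requests, where idempotence keeps ordering intact (broker address, topic, and values are placeholder assumptions):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class InFlightOrderingExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("retries", Integer.toString(Integer.MAX_VALUE));
            props.put("max.in.flight.requests.per.connection", "5");
            // enable.idempotence=true preserves ordering even with retries and several in-flight requests
            props.put("enable.idempotence", "true");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("payments", "order-1")); // placeholder topic and value
            }
        }
    }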
Which client protocols are supported by the Schema Registry? (select two)
HTTP
HTTPS
Explanation
Clients can interact with the Schema Registry using its HTTP or HTTPS interface.
Where are the dynamic configurations for a topic stored?
In Zookeeper
Explanation
Dynamic topic configurations are maintained in Zookeeper.
Which of the following errors are retriable from a producer perspective? (select two)
NOT_LEADER_FOR_PARTITION
NOT_ENOUGH_REPLICAS
Explanation
Both of these are retriable errors; the other listed errors are non-retriable.
See the full list of errors and their “retriable” status here: https://kafka.apache.org/protocol#protocol_error_codes
There are two consumers C1 and C2 belonging to the same group G subscribed to topics T1 and T2. Each of the topics has 3 partitions. How will the partitions be assigned to consumers with PartitionAssignor being RoundRobinAssignor?
C1 will be assigned partitions 0 and 2 from T1 and partition 1 from T2. C2 will have partition 1 from T1 and partitions 0 and 2 from T2.
Explanation
The correct option is the only one where the two consumers share an equal number of partitions amongst the two topics of three partitions. An interesting article to read is: https://medium.com/@anyili0928/what-i-have-learned-from-kafka-partition-assignment-strategy-799fdf15d3ab
Is KSQL ANSI SQL compliant?
No
Explanation
KSQL is not ANSI SQL compliant; for now there is no defined standard for streaming SQL languages.
A bank uses a Kafka cluster for credit card payments. What should be the value of the property unclean.leader.election.enable?
False
Explanation
Setting unclean.leader.election.enable to true means we allow out-of-sync replicas to become leaders. We would lose messages when this occurs, effectively losing credit card payments and making our customers very angry.
You want to perform table lookups against a KTable everytime a new record is received from the KStream. What is the output of KStream-KTable join?
KStream
Explanation
Here the KStream is processed, using lookups into the KTable, to create another KStream.
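A minimal Kafka Streams sketch of such a join (topic names and the join logic are placeholder assumptions); it only builds the topology:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class StreamTableJoinExample {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> orders = builder.stream("orders");     // placeholder topic
            KTable<String, String> customers = builder.table("customers"); // placeholder topic
            // The result of a KStream-KTable join is another KStream
            KStream<String, String> enriched =
                orders.join(customers, (order, customer) -> order + " / " + customer);
            enriched.to("orders-enriched");                                // placeholder topic
        }
    }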
What isn’t a feature of the Confluent schema registry?
Store Avro data
Explanation
The Schema Registry stores only the schemas; the Avro data itself is stored on the Kafka brokers.
If I want to send binary data through the REST proxy to topic “test_binary”, it needs to be base64 encoded. A consumer connecting directly into the Kafka topic “test_binary” will receive
Binary data
Explanation
On the producer side, after receiving base64 data, the REST Proxy will convert it into bytes and then send that byte payload to Kafka. Therefore consumers reading directly from Kafka will receive binary data.
A topic receives all the orders for the products that are available on a commerce site. Two applications want to process all the messages independently - order fulfilment and monitoring. The topic has 4 partitions, how would you organise the consumers for optimal performance and resource usage?
Create 2 consumer groups for 2 applications with 4 consumers each
Explanation
Two consumer groups - one for each application - so that all messages are delivered to both applications. Four consumers in each group, as the topic has 4 partitions and you cannot have more active consumers per group than the number of partitions (otherwise the extra consumers will be inactive and waste resources).
If I want to have an extremely high confidence that leaders and replicas have my data, I should use
acks=all, replication factor=3, min.insync.replicas=2
Explanation
acks=all means the leader will wait for all in-sync replicas to acknowledge the record. Also, the min.insync.replicas setting specifies the minimum number of replicas that need to be in-sync for the partition to remain available for writes.
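A sketch of the producer side of this setup (broker address is a placeholder; min.insync.replicas itself is a topic or broker setting, not a producer setting):

    import java.util.Properties;

    public class DurableProducerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all"); // leader waits for all in-sync replicas to acknowledge
            // min.insync.replicas=2 would be applied as a topic config, e.g.
            // kafka-topics.sh ... --config min.insync.replicas=2
            System.out.println(props);
        }
    }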
Which is an optional field in an Avro record?
doc
Explanation
doc is an optional field holding a human-readable description of the record.
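A small sketch using the Avro Java library, with an illustrative schema showing doc at the record and field level (the names and descriptions are made up):

    import org.apache.avro.Schema;

    public class AvroDocExample {
        public static void main(String[] args) {
            // "doc" is optional documentation; "type", "name" and "fields" are required for a record
            String schemaJson = "{"
                + " \"type\": \"record\", \"name\": \"Customer\","
                + " \"doc\": \"A customer of the shop\","
                + " \"fields\": ["
                + "   {\"name\": \"id\", \"type\": \"int\", \"doc\": \"unique customer id\"},"
                + "   {\"name\": \"name\", \"type\": \"string\"}"
                + " ]}";
            Schema schema = new Schema.Parser().parse(schemaJson);
            System.out.println(schema.getDoc()); // prints the record-level doc
        }
    }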
Which of the following settings increases the chance of batching for a Kafka Producer?
increase linger.ms
Explanation
linger.ms forces the producer to wait before sending messages, hence increasing the chance of creating batches
Where are the ACLs stored in a Kafka cluster by default?
Under Zookeeper node /kafka-acl/
Explanation
ACLs are stored in the Zookeeper node /kafka-acl/ by default.
You are using JDBC source connector to copy data from 2 tables to two Kafka topics. There is one connector created with max.tasks equal to 2 deployed on a cluster of 3 workers. How many tasks are launched?
2
Explanation
We have two tables, so the maximum number of tasks is 2.
What isn’t an internal Kafka Connect topic?
connect-jars
Explanation
connect-configs stores connector configurations, connect-status stores the status of connectors and tasks, and connect-offsets stores source offsets for source connectors. connect-jars is not an internal Connect topic.
A Kafka topic has a replication factor of 3 and a min.insync.replicas setting of 1. What is the maximum number of brokers that can be down so that a producer with acks=all can still produce to the topic?
2
Explanation
Two brokers can go down, and one replica will still be able to receive and serve data
I am producing Avro data on my Kafka cluster that is integrated with the Confluent Schema Registry. After a schema change that is incompatible, I know my data will be rejected. Which component will reject the data?
The Confluent Schema Registry
Explanation
The Confluent Schema Registry is your safeguard against incompatible schema changes and will be the component that ensures no breaking schema evolution is possible. Kafka brokers do not look at your payload or its schema, and therefore will not reject the data.
Which of the following event processing applications are stateless? (select two)
Read events from a stream and modify them from JSON to Avro
Read Log messages from a stream and write error events to a high priority stream and the rest to a low priority stream
Explanation
Stateless means processing of each message depends only on the message, so converting from JSON to Avro or filtering a stream are both stateless operations
A customer has many consumer applications that process messages from a Kafka topic. Each consumer application can only process 50 MB/s. Your customer wants to achieve a target throughput of 1 GB/s. What is the minimum number of partitions you will suggest to the customer for that particular topic?
20
Explanation
Each consumer can process only 50 MB/s, so we need at least 20 consumers, each consuming one partition, so that the 20 * 50 MB/s = 1000 MB/s = 1 GB/s target is achieved.
If I produce to a topic that does not exist, and the broker setting auto.create.topics.enable=true, what will happen?
Kafka will automatically create the topic with the broker settings num.partitions and default.replication.factor
Explanation
These broker settings come into play when a topic is auto-created.
A Kafka producer application wants to send log messages to a topic that does not include any key. What are the properties that are mandatory to configure for the producer configuration? (select three)
key.serializer
value.serializer
bootstrap.servers
Explanation
bootstrap.servers, key.serializer and value.serializer are all mandatory, even though the messages are sent without a key.
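A minimal keyless producer sketch showing only the three mandatory settings (broker address and topic are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MinimalProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder, mandatory
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");   // mandatory
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); // mandatory
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("logs", "a log line without a key")); // key is null
            }
        }
    }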
Producing with a key allows you to…
Influence partitioning of the producer messages
Explanation
Keys are necessary if you require strong ordering or grouping for messages that share the same key. Attaching a key ensures that messages with the same key always go to the same partition of a topic, and Kafka guarantees order within a partition (but not across partitions). Not providing a key results in round-robin distribution across partitions, which does not maintain such ordering.
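A sketch of producing with a key (broker address, topic, keys, and values are illustrative):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Both records share key "truck-42", so they go to the same partition and keep their order
                producer.send(new ProducerRecord<>("truck-positions", "truck-42", "lat=50.1,lon=8.6"));
                producer.send(new ProducerRecord<>("truck-positions", "truck-42", "lat=50.2,lon=8.7"));
            }
        }
    }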
How do Kafka brokers ensure great performance between the producers and consumers? (select two)
It does not transform the messages
It leverages zero-copy optimizations to send data straight from the page-cache
Explanation
Kafka transfers data with zero-copy and sends the raw bytes it receives from the producer straight to the consumer, leveraging the RAM available as page cache
If a topic has a replication factor of 3…
Each Partition will live on 3 different brokers
Explanation
Replicas are spread across available brokers, and each replica = one broker. RF 3 = 3 brokers
To enhance compression, I can increase the chances of batching by using
linger.ms = 20
Explanation
linger.ms forces the producer to wait before sending messages, hence increasing the chance of creating batches that can be heavily compressed.
To get acknowledgement of writes to only the leader partition, we need to use the config…
acks = 1
Explanation
Producers can set acks=1 to get acknowledgement from partition leader only.
A client connects to a broker in the cluster and sends a fetch request for a partition in a topic. It gets a NotLeaderForPartitionException in the response. How does the client handle this situation?
Send Metadata request to the same broker for the topic and select the broker hosting the leader replica
Explanation
If the consumer has the wrong leader of a partition, it will issue a metadata request. The metadata request can be handled by any node, so clients know afterwards which brokers are the leaders for the topic's partitions. Produce and fetch requests can only be sent to the broker hosting the partition leader.
What information isn’t stored inside of Zookeeper? (select two)
Consumer Offset
Schema Registry Schemas
Explanation
Consumer offsets are stored in the Kafka topic __consumer_offsets, and the Schema Registry stores its schemas in the _schemas topic.
How will you find out all the partitions where one or more of the replicas for the partition are not in-sync with the leader?
kafka-topics.sh --zookeeper localhost:2181 --describe --under-replicated-partitions
A consumer application is using KafkaAvroDeserializer to deserialize Avro messages. What happens if message schema is not present in AvroDeserializer local cache?
Fetches schema from Schema Registry
Explanation
First, the local cache is checked for the message schema. On a cache miss, the schema is fetched from the Schema Registry. An exception will be thrown if the Schema Registry does not have the schema (which should never happen if it is set up properly).
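A sketch of a consumer wired to the Schema Registry (broker address, group id, topic, and registry URL are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AvroConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "avro-consumers");          // placeholder
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
            // The deserializer fetches schemas from here when they are not in its local cache
            props.put("schema.registry.url", "http://localhost:8081"); // placeholder
            try (KafkaConsumer<String, Object> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("avro-topic")); // placeholder topic
                consumer.poll(Duration.ofMillis(500)).forEach(r -> System.out.println(r.value()));
            }
        }
    }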
You are running a Kafka Streams application in a Docker container managed by Kubernetes, and upon application restart, it takes a long time for the Docker container to replicate the state and get back to processing the data. How can you dramatically speed up the application restart?
Mount a persistent volume for your RocksDB
Explanation
Although a Kafka Streams application can always rebuild its state from Kafka, that recovery can take a while and use a lot of resources. To speed up recovery, it is advised to store the Kafka Streams state on a persistent volume, so that only the missing part of the state needs to be recovered.
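A sketch of pointing the Streams state directory at the mounted volume (application id, broker address, and mount path are assumptions):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsStateDirConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // Point the RocksDB state directory at a path backed by the persistent volume
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/mnt/streams-state");     // assumed mount path
            System.out.println(props);
        }
    }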
Which of the following Kafka Streams operators are stateful? (select all that apply)
joining
aggregate
count
reduce
Explanation
See: https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#stateful-transformations
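A small sketch of one of these stateful operators, count(), which maintains a per-key state store (topic names are placeholders); it only builds the topology:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class CountSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> words = builder.stream("words"); // placeholder topic
            // count() is stateful: it keeps one running count per key in a state store
            KTable<String, Long> counts = words.groupByKey().count();
            counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long())); // placeholder topic
        }
    }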
Which KSQL queries write to Kafka?
COUNT and JOIN
CREATE STREAM AS SELECT and CREATE TABLE AS SELECT
CREATE STREAM WITH and CREATE TABLE WITH
Explanation
SHOW STREAMS and EXPLAIN statements run against the KSQL server that the KSQL client is connected to. They don’t communicate directly with Kafka. CREATE STREAM WITH and CREATE TABLE WITH write metadata to the KSQL command topic. Persistent queries based on CREATE STREAM AS SELECT and CREATE TABLE AS SELECT read and write to Kafka topics. Non-persistent queries based on SELECT that are stateless only read from Kafka topics, for example SELECT … FROM foo WHERE …. Non-persistent queries that are stateful read and write to Kafka, for example, COUNT and JOIN. The data in Kafka is deleted automatically when you terminate the query with CTRL-C.
A Zookeeper ensemble contains 3 servers. Over which ports the members of the ensemble should be able to communicate in default configuration? (select three)
2888
3888
2181
Explanation
2181 - client port, 2888 - peer (follower-to-leader) port, 3888 - leader election port
Using the Confluent Schema Registry, where are Avro schema stored?
In the _schemas topic
Explanation
The Schema Registry stores all the schemas in the _schemas Kafka topic
There are 3 producers writing to a topic with 5 partitions. There are 5 consumers consuming from the topic. How many Controllers will be present in the cluster?
1
Explanation
There is only one controller in a cluster at all times.
You have a Kafka cluster and all the topics have a replication factor of 3. One intern at your company stopped a broker, and accidentally deleted all the data of that broker on the disk. What will happen if the broker is restarted?
The broker will start, and won't be able to serve data until all the data it needs has been replicated from the other leaders.
Explanation
Kafka's replication mechanism makes it resilient to a broker losing its data on disk, because the broker can recover by replicating from the other brokers. This makes Kafka amazing!
Select all that apply
acks is a producer setting
min.insync.replicas is a topic setting
min.insync.replicas only matters if acks=all
Explanation
acks is a producer setting; min.insync.replicas is a topic or broker setting and is only effective when acks=all.
How will you read all the messages from a topic in your KSQL query?
Use KSQL CLI to set auto.offset.reset property to earliest
Explanation
Consumers can set the auto.offset.reset property to earliest to start consuming from the beginning. In KSQL: SET 'auto.offset.reset'='earliest';
A consumer starts and has auto.offset.reset=latest, and the topic partition currently has data for offsets going from 45 to 2311. The consumer group has committed the offset 643 for the topic before. Where will the consumer read from?
643
Explanation
The offsets are already committed for this consumer group and topic partition, so the property auto.offset.reset is ignored
Your producer is producing at a very high rate and the batches are completely full each time. How can you improve the producer throughput? (select two)
Enable compression
Increase Batch Size
Explanation
batch.size controls how many bytes of data to collect before sending messages to the Kafka broker. Set this as high as possible without exceeding available memory. Enabling compression can also make batches more compact and increase the throughput of your producer. linger.ms will have no effect here as the batches are already full.
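A sketch of a throughput-oriented producer configuration (broker address and tuning values are illustrative, not recommendations):

    import java.util.Properties;

    public class HighThroughputProducerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("batch.size", "262144");       // larger batches (the default is 16384 bytes)
            props.put("compression.type", "snappy"); // compress batches to send fewer bytes per record
            // linger.ms is not tuned here: it only helps when batches are not already full
            System.out.println(props);
        }
    }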
Select all the ways for one consumer to subscribe simultaneously to the following topics - topic.history, topic.sports, topic.politics? (select two)
consumer.subscribe(Arrays.asList("topic.history", "topic.sports", "topic.politics"));
consumer.subscribe(Pattern.compile("topic..*"));
Explanation
Multiple topics can be passed as a list or regex pattern.