Introduction Flashcards
What is a paradigm shift that has been happening in recent years (according to Confluent)?
A shift from state-based systems to event-driven systems.
What’s the basic analogy for what Kafka is?
It’s a distributed log store.
How many companies use Kafka worldwide?
35% of the Fortune 500.
What is the basic use case for Kafka?
Processing large streams of events in event-driven systems.
What is a producer in Kafka?
An application that gets data into the cluster.
What is a broker in Kafka?
A broker is an individual “node” in a Kafka cluster.
What is a consumer in Kafka?
An application that processes the data from the cluster.
Can a consumer also be a producer?
Yes. An application can be both, generating events as well as consuming them.
What is the ZooKeeper ensemble?
It’s a small cluster responsible for keeping consensus and cluster status data for the Kafka cluster.
Will ZooKeeper be used in the future?
Since version 2.8, Kafka has had an option (KRaft mode) to self-manage the cluster without ZooKeeper.
What is a topic in Kafka?
It’s a store of related events (like a log).
Is there a limit on the number of topics?
There’s no theoretical limit on the number of topics, but there is a practical limit on the number of partitions.
What is a partition in Kafka?
Partitions are the blocks of data that compose a topic; they can be stored on different brokers for the sake of durability and replication.
How is data written to topics?
Data is always appended to the end of the “log”/topic and is immutable.
What are the semantics of writing data to kafka?
- You always write at the end of the log
- Data is immutable
- Data can have an expiration date
- Each event has a sequential offset number
What are the semantics of reading data from Kafka?
- Reading doesn’t remove or destroy the data.
- Consumers read data independently, each from its own offset.
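A minimal sketch of that independence, assuming the standard Java KafkaConsumer with manual partition assignment (the broker address, topic name “orders”, and offset 42 are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IndependentOffsetsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign a partition and jump to an arbitrary offset.
            // Reading from offset 42 destroys nothing: another consumer can
            // independently read the very same records starting from offset 0.
            TopicPartition partition = new TopicPartition("orders", 0); // hypothetical topic
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 42L);

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```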
What is the structure of a kafka record?
- key
- value
- timestamp
- optional headers
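As a sketch, here’s how those fields map onto the Java client’s ProducerRecord (the topic name, key, value, and header are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordFieldsSketch {
    public static void main(String[] args) {
        // key, value, timestamp (and optionally an explicit partition).
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "orders",                   // hypothetical topic
                null,                       // partition: null lets the partitioner decide
                System.currentTimeMillis(), // timestamp
                "order-123",                // key
                "{\"amount\": 42}");        // value

        // Optional headers: arbitrary metadata as byte arrays.
        record.headers().add("source", "checkout-service".getBytes(StandardCharsets.UTF_8));
    }
}
```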
What is the native language support for producers/consumers?
Java.
What is the role of the key in a kafka record?
If a key is provided with the data, the key is hashed and the record is stored in a particular partition. This guarantees that records with the same key are ordered correctly.
This is used when ordering of the data is important.
What happens when you don’t specify a key for a record?
It will be allocated to a partition in a round-robin fashion.
What is the consumer offset topic?
A special internal topic that keeps track of the offsets each consumer has committed.
What is a consumer group?
A way to scale and group consumers that do the same job.
By default, each consumer is its own consumer group.
How do you subscribe to multiple topics at once?
Either specify a list of names or a regular expression to match the names.
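For example, with the Java consumer (the topic names and the pattern are placeholders):

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SubscribeSketch {
    static void subscribeExamples(KafkaConsumer<String, String> consumer) {
        // Subscribe to an explicit list of topic names...
        consumer.subscribe(Arrays.asList("orders", "payments"));

        // ...or to every topic whose name matches a regular expression.
        consumer.subscribe(Pattern.compile("orders\\..*"));
    }
}
```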
What is the code architecture of a producer?
- KafkaProducer class.
- Server list
- Serializer for keys/values
- ProducerRecord(topic, key, value)
- producer.send(record)
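A minimal sketch of that structure with the plain Java kafka-clients producer (the broker address, topic, key, and value are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        // Server list and serializers for keys/values.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // KafkaProducer class + ProducerRecord(topic, key, value) + send().
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "order-123", "{\"amount\": 42}");
            producer.send(record); // asynchronous; returns a Future<RecordMetadata>
            producer.flush();
        }
    }
}
```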
What is the code architecture of a consumer?
- Server list
- Group id
- OnMessage()
- OnError()
- OnConsumerError()
- consumer.subscribe("topic")
- while(true) { consumer.poll() }
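The OnMessage/OnError hooks above look like a higher-level wrapper; with the plain Java kafka-clients consumer the same structure is a poll loop. A minimal sketch (broker address, group id, and topic name are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerSketch {
    public static void main(String[] args) {
        // Server list, group id and deserializers.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // poll() fetches a batch; handle each record ("OnMessage") in the loop body.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```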
How long are topics kept by default?
1 week
How do you set the data retention policy?
- Globally
- By topic
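The broker-wide default comes from settings like log.retention.hours, and individual topics can override it with retention.ms. A sketch of a per-topic override using the Java AdminClient (the broker address, topic name, and 3-day value are assumptions):

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionOverrideSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Override retention for one topic to 3 days (259,200,000 ms);
            // the broker-wide default stays untouched.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> update =
                    Collections.singletonMap(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(update).all().get();
        }
    }
}
```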
How are delivery guarantees defined?
- At most once
- At least once
- Exactly once
Who’s a good candidate for “at most once” processing?
Data that you can afford to lose and where latency matters more. Example: non-critical logs.
Who’s a good candidate for “at least once” processing?
Idempotent producers/consumers, where it doesn’t matter if a duplicate slips through.
What are the semantics for “exactly once” processing?
- Strong transactional guarantees
- A guarantee that each message is processed only a single time
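A sketch of what this looks like on the producer side using the Java client’s idempotence and transactions settings (the transactional.id, topic, key, value, and broker address are placeholders; consumers of transactional data would also set isolation.level=read_committed):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");             // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");                       // retries won't create duplicates
        props.put("transactional.id", "order-producer-1");             // hypothetical id, enables transactions

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders", "order-123", "{\"amount\": 42}"));
                producer.commitTransaction();  // read_committed readers see the write exactly once
            } catch (Exception e) {
                producer.abortTransaction();   // nothing from this transaction becomes visible
                throw e;
            }
        }
    }
}
```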
What is a compacted topic?
It’s a topic configured to keep only the latest event for each key, for when you don’t care about the previous events.
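Compaction is a per-topic setting (cleanup.policy=compact). A sketch of creating such a topic with the Java AdminClient (the topic name, partition count, and replication factor are made up):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Keep only the latest value per key instead of deleting data by age.
            NewTopic customerProfiles = new NewTopic("customer-profiles", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singleton(customerProfiles)).all().get();
        }
    }
}
```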
What is Kafka Connect?
It’s a pluggable system for integrating automatically with external systems like Elasticsearch, Cassandra, MySQL, etc.
How does Kafka Connect run?
It’s its own service, separate from the Kafka cluster (probably kinda like Kibana in a way?).
Does Kafka support an HTTP REST API?
Yes, via a separate REST proxy.
What is the Confluent Schema Registry?
It’s Confluent’s solution to the problem of versioning data between producers and consumers and handling schema migrations while keeping everyone in sync.
How is the Schema Registry run?
It’s a separate application like Kafka Connect.
What is AVRO?
It’s an Apache data serialization format whose schemas are defined in JSON.
What is ksqlDB?
It’s a SQL-like language for performing operations over streams of data (aggregations, for example), so you don’t have to write a consumer for simple tasks.
How do ksqlDB queries work?
They create a continuously running stream that outputs its results to a result topic.
What is Kafka Streams?
It’s a library for doing higher-level operations on topics, like exactly-once semantics, aggregations, filtering, microservice operations, etc.