Module 11 - Apache Kafka Flashcards
What is Apache Kafka?
Kafka is a stream-processing and event-processing platform designed to handle high volumes of updates in a distributed system
What are the 3 main features of Apache Kafka?
- Publish-Subscribe (message-oriented communication)
- Real-time stream processing
- Distributed and replicated storage of messages and streams
Kafka supports many different _______ for interacting with other systems.
APIs
The Producer and Consumer APIs are useful for what functionality?
Providing asynchronous communication between applications
What are some uses for Apache Kafka?
Any of:
- High-throughput messaging
- Website activity tracking
- Metric collection
- Low-latency log aggregation
- Stream processing
- Event sourcing and commit logging
The Processor API can be used to implement both _____ as well as ______ operations
stateless
stateful
With the Apache Kafka processor API, ______ operations are achieved through the use of _____ stores
stateful
state
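The difference between stateless and stateful operations can be illustrated with a minimal Python sketch. This is not the Kafka Processor API; `to_upper`, `CountProcessor`, and the dict used as a state store are hypothetical names chosen for illustration.

```python
def to_upper(record):
    """Stateless: the output depends only on the current record."""
    key, value = record
    return key, value.upper()


class CountProcessor:
    """Stateful: the output depends on records seen earlier."""

    def __init__(self):
        # Stand-in for a state store (assumption: a plain in-memory dict).
        self.store = {}

    def process(self, record):
        key, _ = record
        self.store[key] = self.store.get(key, 0) + 1
        return key, self.store[key]


counter = CountProcessor()
stateless_out = [to_upper(r) for r in [("a", "x"), ("a", "y")]]
stateful_out = [counter.process(r) for r in [("a", "x"), ("a", "y")]]
```

The stateless function could run on any record in isolation, while the counter only gives correct results if its store survives between records.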
What is a Topic in Kafka?
A topic is a stream of records, or a log of events in the message queue.
What exactly is a “record”?
A collection of data items arranged for processing by a program. In Kafka, a record consists of a key, a value, and a timestamp
In Kafka, you create different _____ to hold different kinds of _____. Different ______ hold filtered and transformed _____ of the same kind of event.
topics
events
topics
versions
A _____ is a stream of records. It is stored as a partitioned _____. The _____ period for ______ records is ______
topic
log
retention
published
configurable
Why do we partition a topic's log into multiple partitions?
Improve scalability, throughput, and storage capacity
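As a rough illustration of partitioning, the sketch below spreads records over several partition logs by hashing the record key. The CRC32 hash and the names `choose_partition` and `append` are assumptions made for this example; they are not Kafka's actual partitioner.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a record key to a partition index (illustrative hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(partitions, key, value):
    """Append the record to the partition chosen for its key."""
    p = choose_partition(key, len(partitions))
    partitions[p].append((key, value))
    return p


# Three partition logs for one hypothetical topic.
partitions = [[] for _ in range(3)]
for key, value in [("alice", "login"), ("bob", "login"), ("alice", "logout")]:
    append(partitions, key, value)
```

Records with the same key always land in the same partition, so their relative order is preserved, while different keys can be written and read in parallel.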
Kafka stores for each consumer/reader the ______ of the next record to be read. This is represented as an ______ in the log.
position
offset
Kafka stores state for the position of the next ______ to be read/consumed by the reader/consumer
record
What is the benefit of having the Kafka system store the state of the next record to be read instead of the reader?
The state does not need to be maintained by the reader. Fault tolerance improves because the Kafka system stores the position of each reader/consumer, so a reader that crashes or restarts can resume where it left off
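A minimal sketch of this idea, assuming a single in-memory partition: the log keeps, for each reader, the offset of the next record to hand out. `PartitionLog` and `poll` are hypothetical names for illustration, not Kafka client calls.

```python
class PartitionLog:
    """One partition that remembers each reader's next offset."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # reader name -> offset of the next record to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the record

    def poll(self, reader, max_records=2):
        """Return the next batch for this reader and advance its offset."""
        start = self.offsets.get(reader, 0)
        batch = self.records[start:start + max_records]
        self.offsets[reader] = start + len(batch)
        return batch


log = PartitionLog()
for r in ["a", "b", "c"]:
    log.append(r)

first = log.poll("reader-1")    # ["a", "b"]
second = log.poll("reader-1")   # ["c"]
other = log.poll("reader-2", max_records=1)  # ["a"], independent position
```

Because the positions live with the log, a restarted reader only needs its name to continue from the right record.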
What are 3 characteristics of producers in Kafka?
- Pushes records to Kafka brokers, and chooses which partition to contact for a given topic
- Can batch records and send them to broker asynchronously
- Can perform idempotent delivery (avoids duplicate commits)
What is the benefit of the producer being able to batch records and send them to the broker asynchronously?
Much better throughput, with a negligible latency penalty
In Kafka, producers choose which _____ to contact for a given ______
partition
topic
What are 2 characteristics of consumers in Kafka?
- Pulls records in batches from a Kafka broker, which advances the consumer’s offset in the topic
- Able to achieve “exactly once” semantics when a client consumes from one topic and produces to another
A consumer in Kafka pulls ______ in ______ from a broker. The broker advances that consumer’s _____ within the topic.
records
batches
offset
When a client consumes from one topic and produces to another, Kafka achieves _____ _____ semantics
exactly once
A Kafka broker is also known as a Kafka ______
server
The producer can push records _____ at a time, or it can _____ them
one
batch
What does “idempotent delivery” mean in the context of Kafka?
How is it ensured in Kafka?
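If a producer pushes the same message more than once (for example, when retrying after a lost acknowledgement), the message is still committed only once
It is ensured by tagging each message with a unique identifier (in Kafka, a producer ID plus a sequence number) so that duplicates can be detected and discarded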
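The deduplication idea can be sketched as follows. This is a simplified model, assuming one log and a per-producer sequence number; `IdempotentLog` and `send` are hypothetical names, and the real broker logic is more involved.

```python
class IdempotentLog:
    """Log that ignores a message whose sequence number was already committed."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer id -> highest sequence number committed

    def send(self, producer_id, seq, value):
        """Commit the value unless (producer_id, seq) was seen before."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate: drop it without committing again
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = IdempotentLog()
results = [
    log.send("p1", 0, "x"),
    log.send("p1", 0, "x"),  # retry of the same message
    log.send("p1", 1, "y"),
]
```

Here `results` is `[True, False, True]` and the log holds `["x", "y"]`, so the retry has no visible effect.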
Producers will produce _____ but consumers can consume in _____
individually
groups
Kafka internally uses a ____ ____ to store the state of a _____ operator
state store
stateful
What are the console producer and console consumer?
Command-line tools provided with Kafka for writing records to a topic and reading records from a topic
In Kafka, there are two ways to interpret the semantics of a stream. What are they, and what are their properties?
Record Stream: Each record represents a state transition (ex: the balance of account number ABC is increased by XYZ)
Change-log stream: Each record represents a state (ex: account number ABC has a balance of XYZ)
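To make the two interpretations concrete, the sketch below reads the same list of records both ways. The account names and amounts are made-up sample data, and the function names are hypothetical.

```python
records = [("ABC", 100), ("XYZ", 50), ("ABC", 25)]


def as_record_stream(records):
    """Each record is a state transition: add the amount to the balance."""
    balances = {}
    for account, delta in records:
        balances[account] = balances.get(account, 0) + delta
    return balances


def as_changelog_stream(records):
    """Each record is a state: the latest value replaces the previous one."""
    balances = {}
    for account, balance in records:
        balances[account] = balance
    return balances


transitions = as_record_stream(records)   # {"ABC": 125, "XYZ": 50}
states = as_changelog_stream(records)     # {"ABC": 25, "XYZ": 50}
```

The records are identical; only the interpretation changes the final balance of account ABC.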
What is the representation for Record Streams and Change-log streams in the Kafka Streams API?
KStream for record streams
KTable for change-log streams
The _____ of streams and tables refers to the fact that change-log streams and tables are logically _______
duality
interchangeable
Explain how change-log streams and tables are logically interchangeable.
Show how they can be represented as tables
- Each record in a change-log stream defines one row of the table, and overwrites any prior row for the same key
- A table can be viewed as a snapshot of the latest value for each key in a change-log stream
This is why a change-log stream in Kafka is represented using a KTable
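The round trip between a change-log stream and a table can be sketched in a few lines. This is a plain-Python model, not the Kafka Streams API; `replay` and `snapshot_as_stream` are hypothetical helper names.

```python
def replay(changelog):
    """Build a table by applying each record as an upsert on its key."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


def snapshot_as_stream(table):
    """View the table as a change-log stream with one record per key."""
    return list(table.items())


changelog = [("ABC", 100), ("XYZ", 50), ("ABC", 125)]
table = replay(changelog)               # {"ABC": 125, "XYZ": 50}
rebuilt = replay(snapshot_as_stream(table))
```

Replaying the stream yields the table, and replaying the table's snapshot yields the same table again, which is the sense in which the two are interchangeable.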
What is a KGroupedStream object in Kafka?
KGroupedStream is an abstraction of a grouped record stream of KeyValue pairs. It is an intermediate representation of a KStream in order to apply an aggregation operation on the original KStream records.
What are some transformations for converting from a KStream to a KGroupedStream object in Kafka?
- groupBy
- groupByKey
What are some transformations for converting from a KGroupedStream object to a KTable object in Kafka?
- count
- reduce
- aggregate
What is the transformation for converting from a KTable object to a KStream object in Kafka?
- toStream
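The chain of conversions (KStream to KGroupedStream to KTable and back to KStream) can be mimicked with ordinary Python collections. The functions below only borrow the names of the Kafka Streams operations; they are an illustrative model that processes a finite list at once, whereas the real operations work incrementally on unbounded streams.

```python
from collections import defaultdict


def group_by_key(stream):
    """Model of groupByKey: collect the values of each key."""
    grouped = defaultdict(list)
    for key, value in stream:
        grouped[key].append(value)
    return dict(grouped)


def count(grouped):
    """Model of count: turn the grouped stream into a key -> count table."""
    return {key: len(values) for key, values in grouped.items()}


def to_stream(table):
    """Model of toStream: emit one record per table row."""
    return list(table.items())


clicks = [("page-a", "click"), ("page-b", "click"), ("page-a", "click")]
counts = count(group_by_key(clicks))   # {"page-a": 2, "page-b": 1}
count_stream = to_stream(counts)       # [("page-a", 2), ("page-b", 1)]
```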
What is a KStream in Kafka?
An abstraction of a record stream of KeyValue pairs which represent events.
What is a KTable in Kafka?
An abstraction of a change-log stream from a primary-keyed table. Each entry represents a state.
What does windowing allow for in Kafka?
Windowing lets you control how records that have the same key are grouped into “windows” for stateful operations such as aggregations or joins.
What are “hopping time windows” in Kafka?
Windows are defined by
- size
- advance interval (also known as “hop”)
For example, every 10 seconds (the hop), compute some transformation over the last 60 seconds (the size).
Hopping windows overlap when the hop is smaller than the size; a hop larger than the size would leave gaps between windows
What are “Tumbling time windows” in Kafka?
Special case of hopping windows where the window size = the advance interval. There are no overlaps and no gaps.
Each record belongs to exactly one window
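The relationship between hopping and tumbling windows can be checked with a small calculation. The sketch below assumes windows are half-open intervals `[start, end)` whose starts are non-negative multiples of the hop; `windows_for` is a hypothetical helper, not a Kafka Streams call.

```python
def windows_for(timestamp, size, hop):
    """Return every [start, end) window that contains the timestamp."""
    windows = []
    start = (timestamp // hop) * hop  # latest window start at or before timestamp
    while start > timestamp - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


# Hopping: size 60, hop 10. A record at t=65 falls into six overlapping windows.
hopping = windows_for(65, size=60, hop=10)

# Tumbling: size equals hop, so the same record falls into exactly one window.
tumbling = windows_for(65, size=60, hop=60)   # [(60, 120)]
```

With a hop of 10 and a size of 60, each record belongs to up to 60 / 10 = 6 windows; setting the hop equal to the size reduces that to one.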
What are “Sliding windows” in Kafka?
Windows that slide continuously over the time axis, originally used only for joins (newer Kafka Streams versions also support sliding-window aggregations)
What are “Session windows” in Kafka?
Windows that aggregate data by period of activity. A new session is created when the period of inactivity exceeds a given threshold
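A minimal sketch of session windowing for a single key, assuming integer timestamps and treating a gap larger than the threshold as the start of a new session. `sessionize` and its parameters are hypothetical names for illustration.

```python
def sessionize(timestamps, inactivity_gap):
    """Group timestamps into sessions split wherever the gap exceeds the threshold."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= inactivity_gap:
            sessions[-1].append(ts)  # still within the current period of activity
        else:
            sessions.append([ts])    # inactivity exceeded the gap: new session
    return sessions


sessions = sessionize([1, 3, 4, 20, 22, 50], inactivity_gap=5)
# [[1, 3, 4], [20, 22], [50]]
```

Unlike hopping or tumbling windows, the session boundaries here come from the data itself rather than from a fixed size.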
Kafka is a distributed ____ streaming platform that lets you read, write, store, and process _____
event
events
Kafka Streams Transformations are available in two types: ______ and ______
Stateless
Stateful
KTable behaviour: If a key in the stream exists, it will be ______. If the key does not exist it will be ______.
updated
inserted
________ is an abstraction of a grouped record stream of KeyValue pairs. It is an intermediate representation of a _____ in order to apply an ______ operation on the original KStream records.
KGroupedStream
KStream
aggregation