Kafka Extended API Flashcards
What are Source Connectors used for?
To get data from Common Data Sources
Source Connectors are responsible for ingesting data into Kafka.
What are Sink Connectors used for?
To publish data to Common Data Stores
Sink Connectors send data from Kafka to external systems.
What is a task in Kafka Connect?
A task is linked to a connector configuration and does the actual work of copying data; a single connector's job can be split across multiple tasks.
What is a Kafka Connect Worker?
A worker is a single Java process that executes tasks.
What is Standalone Mode in Kafka Connect?
A single process runs connectors and tasks, easy for development but lacks fault tolerance and scalability.
What is Distributed Mode in Kafka Connect?
Multiple workers run connectors and tasks, easy to scale and fault tolerant.
Define a stream in Kafka.
A sequence of immutable data records that is fully ordered, can be replayed, and is fault tolerant.
What is a stream processor?
A node in the processor topology that transforms incoming streams, record by record, and may create a new stream from it.
What is a topology in Kafka?
A graph of processors chained together by streams.
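A minimal topology sketch in Java, assuming topics named "input-topic" and "output-topic" already exist (the application id, broker address, and topic names are illustrative):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class TopologyDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "flashcards-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> stream = builder.stream("input-topic");  // source processor
            stream.mapValues(v -> v.toUpperCase())                           // stream processor
                  .to("output-topic");                                       // sink processor

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }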
What is a Source Processor?
A processor that takes its data directly from a Kafka Topic.
What is a Sink Processor?
A processor that sends stream data directly to a Kafka topic.
What characterizes KStreams?
All inserts, similar to a log, and represent an infinite, unbounded data stream.
What characterizes KTables?
All upserts on non-null values, deletes on null values, and are similar to a database table.
When should you use KStreams?
When reading from a topic that's not log-compacted, or when new data is partial (transactional) information.
When should you use KTables?
When reading from a log compacted topic or needing a structure like a database table.
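A minimal sketch of the three read options (assuming a StreamsBuilder named builder as in the topology sketch above; KTable and GlobalKTable come from org.apache.kafka.streams.kstream; topic names are illustrative):

    KStream<String, String> kStream = builder.stream("regular-topic");             // every record is an insert
    KTable<String, String> kTable = builder.table("compacted-topic");              // latest value per key (upsert)
    GlobalKTable<String, String> globalKTable = builder.globalTable("reference-topic"); // fully replicated to every instance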
Define stateless transformation.
A transformation where the result only depends on the data-point being processed.
Define stateful transformation.
A transformation where the result depends on external information or state.
What does MapValues do?
Affects only values, does not change keys, and doesn’t trigger a repartition.
What does Map do?
Affects both keys and values and triggers a repartition.
What does Filter do?
Produces zero or one record, doesn’t change keys/values, and doesn’t trigger a repartition.
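A short sketch of the three operations above (assuming a KStream<String, String> named stream, as in the topology sketch; KeyValue comes from org.apache.kafka.streams):

    KStream<String, Long> lengths = stream.mapValues(v -> (long) v.length());     // values only, no repartition
    KStream<String, String> rekeyed = stream.map((k, v) -> KeyValue.pair(v, k));   // key changes, triggers repartition
    KStream<String, String> nonEmpty = stream.filter((k, v) -> !v.isEmpty());      // 0 or 1 record out, no repartition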
What is FilterNot?
The inverse of the filter operation.
What does FlatMapValues do?
Produces zero, one, or more records without changing keys, and doesn’t trigger a repartition.
What does FlatMap do?
Changes keys and triggers a repartition.
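A short sketch (same assumed stream; Arrays is java.util.Arrays):

    KStream<String, String> words = stream.flatMapValues(v -> Arrays.asList(v.split(" ")));  // 0..n records, keys unchanged, no repartition
    KStream<String, Integer> exploded = stream.flatMap(
        (k, v) -> Arrays.asList(KeyValue.pair(v, 1), KeyValue.pair(k, 2)));                  // keys change, triggers repartition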
What does KStream Branch do?
Branches a KStream based on one or more predicates, resulting in multiple KStreams.
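A short sketch of branching (same assumed stream; branch() is the older API, newer Kafka Streams versions expose this as split()). A record goes to the first predicate it matches; records matching no predicate are dropped:

    KStream<String, String>[] branches = stream.branch(
        (k, v) -> v.startsWith("A"),   // branches[0]: values starting with "A"
        (k, v) -> true);               // branches[1]: everything else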
What does SelectKey do?
Assigns a new key to the record, changes the key, and marks data for re-partitioning.
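A short sketch (same assumed stream):

    KStream<String, String> byInitial = stream.selectKey((k, v) -> v.substring(0, 1)); // new key, marked for repartition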
What can you read from Kafka?
A topic as a KStream, KTable, or GlobalKTable.
What can you write to Kafka?
Any KStream or KTable back to Kafka.
What is log compaction?
An optimization that keeps only the latest value for each key, removing older messages while preserving the order of the remaining messages.
What are the myths about log compaction?
- Does not prevent duplicate data.
- Does not prevent reading duplicates.
- Can fail occasionally.
What is the significance of KStream and KTable duality?
A stream can be a changelog of a table, and a table can be a snapshot of the latest value for each key.
How can you transform a KTable to a KStream?
In one line of code, by calling toStream(); the resulting stream is a changelog of the table's changes.
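A minimal sketch, assuming a KTable<String, String> named table (e.g., obtained from builder.table(...)):

    KStream<String, String> changelog = table.toStream(); // every table update becomes a record in the stream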
What methods can be used to transform a KStream to a KTable?
- Chain groupByKey() and aggregation step
- Write back to Kafka and read as KTable
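A sketch of both approaches (same assumed builder and stream; the intermediate topic name is illustrative):

    // 1) groupByKey() plus an aggregation step (here: keep the latest value per key)
    KTable<String, String> latest = stream
        .groupByKey()
        .reduce((oldValue, newValue) -> newValue);

    // 2) write back to Kafka, then read the topic as a KTable
    stream.to("intermediate-topic");
    KTable<String, String> table = builder.table("intermediate-topic");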
What does KTable GroupBy do?
Allows more aggregations within a KTable and triggers a repartition.
What is KGroupedStream?
Obtained after a groupBy/groupByKey() call on a KStream.
What does the Count method do on KGroupedStream?
Counts the number of records by grouped key, ignoring null keys or values.
What is the difference between Aggregate and Reduce?
Aggregate requires an initializer, an adder, a Serde, and a state store, and its output type can differ from the input type; Reduce requires the input and output types to be the same.
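A sketch of aggregate (same assumed stream; Materialized comes from org.apache.kafka.streams.kstream and Serdes from the Kafka clients library; the store configuration here is illustrative):

    KTable<String, Long> counts = stream
        .groupByKey()
        .aggregate(
            () -> 0L,                                           // initializer
            (key, value, aggregate) -> aggregate + 1L,          // adder
            Materialized.with(Serdes.String(), Serdes.Long())); // Serdes / state store config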
What does KStream Peek do?
Applies a side-effect operation to a KStream while returning the same KStream.
What is Exactly Once Semantics?
Guarantees that data processing and pushing the message back to Kafka happen only once.
What is At Least Once Semantics?
Messages may be received twice under certain conditions, such as broker reboots.
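Exactly Once Semantics is enabled with a single Streams config entry (assuming the props object from the topology sketch above; EXACTLY_ONCE_V2 is the newer constant, older versions use StreamsConfig.EXACTLY_ONCE):

    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // default is AT_LEAST_ONCE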
What’s windowing?
Windowing gives a snapshot of an aggregate within a given timeframe.
What are the 4 types of windowing?
Tumbling: a special type of hopping window where the advance interval is equal to the window size, so windows don't overlap or duplicate records the way hopping windows can.
Hopping: bound by time; has a fixed start and end point, a fixed size, and a fixed advance interval, so windows can overlap.
Session: has a window start and end, but they are not fixed as in tumbling and hopping; the window boundaries are determined by the events themselves.
Sliding: similar to hopping and tumbling in that the window size is fixed, but it is driven by events rather than by time.
What do you need to use Hopping windowing?
define the window size (the duration of the window, which determines which events fall into it) and the advance interval (how the window moves forward, e.g., advance by one minute)
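A sketch of hopping vs. tumbling windows (same assumed stream; Duration is java.time.Duration, TimeWindows and Windowed come from org.apache.kafka.streams.kstream; ofSizeWithNoGrace is the newer builder, older versions use TimeWindows.of(...); durations are illustrative):

    TimeWindows hopping = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))
                                     .advanceBy(Duration.ofMinutes(1));          // 5-minute windows, overlapping
    TimeWindows tumbling = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)); // advance == size, no overlap

    KTable<Windowed<String>, Long> hoppingCounts = stream
        .groupByKey()
        .windowedBy(hopping)
        .count();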
What is session windowing good for?
User browser sessions
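A sketch of a session window defined by an inactivity gap (same assumed stream; SessionWindows comes from org.apache.kafka.streams.kstream; ofInactivityGapWithNoGrace is the newer builder, older versions use SessionWindows.with(...)):

    KTable<Windowed<String>, Long> sessionCounts = stream
        .groupByKey()
        .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(5))) // a new session starts after 5 minutes of inactivity
        .count();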