Stream processing Flashcards
When is a problem a data streaming problem?
When the data is unbounded. That is, when the data is not static and does not have a size property. This includes when a process is generating data continuously, or when the user need to continuously monitor processes.
Unbounded data can only be enumerated/iterated upon when given a snapshot.
Why do we need streaming windows?
To split the stream into batches of data, which allows the data to be processed as if it were bounded.
What types of windows do we get with stream processing?
There are two main types of windows: Streaming windows and Sessions windows.
Steaming windows are static in size or time-length, and are based upon a range and a slide. The relation between the range and the slide affect the type of window.
There are three types of streaming windows:
Sliding window:: Where range > slide. So windows can overlap
Tumbling window: Where range == slide. So the end of a window is the start of a new window (provided it is not the first/last window)
Jumping window: where range < slide. So there can be a timeframe without a window.
Session windows are dynamically sized windows that aggregate batches of (typically) user activity. Session windows end once a session gap (ie a certain amount of time) has passed since the latest activity.
What is the difference between event, processing and ingestion time?
Event time is the time at which an event has occurred.
Processing time is the time at which an event is observed in the system.
Ingestion time is the time at which an event is processed by the system.
What is the difference between micro batching and stream processing?
Micro-batching aggregates data in batches, and processes a batch in one go.
Stream processing processes events one by one as they arrive.
Stream processing can emulator micro-batching by using delayed triggers.
(So, Micro-batching aggregates data before processing it, while stream processing processes each event individually as they come in)
What is the problem with state in streaming systems?
Streaming systems only contain state modifications. This means that they do not inherently contain a state, and is instead managed ‘externally’, making it unreliable to use for application logic.
How can we disseminate (= distribute) events from producers to consumers?
To just move the data from producers to consumers, Messaging systems are used.
There are three main types of messaging systems:
Direct messaging systems (DMS): use simple network communication to broadcast messages. They are fast, but can easily lead to data loss, which must be dealt with by producers and consumers.
Message brokers, or queues: Centralized systems that sit between producers and consumers to deal with the complexities of reliably message delivery. A producer can send messages in two different modes:
Fire and forget: The consumer, or broker, immediately acknowledges (acks) the message
Transaction-based: The broker writes the message to disk before acting it
Log-based messaging systems: Producers append to a log, which a consumer can connect to and pull messages from. This log is distributed on a cluster of machines
What are the messaging patterns used by Publish/Subscribe messaging systems?
There are three types of messaging patterns:
Competing workers: Multiple consumers read from a single queue, competing for incoming messages.
Fan out: Each consumer has its own queue, and the incoming message is replicated on all queue
Message routing: Each message is assigned its own key. The consumer creates topic queue by specifying the keys it wants to receive.
How do we take consistent snapshots?
To take consistent, global snapshots, the Chandy-Lamport algorithm can be used. This models a distributed system as as a graph of processes.
To ensure global consistency in the snapshots, epoch markers are interleaved with messages. This structure is used to make operators wait for the same epoch marker from all channels before taking a snapshot.
Another method of ensuring global consistency is by having support for (one of) the guarantees of the streaming systems. There are three possible guarantees offered by streaming system:
At most once: An event will be processed once, if delivered at all
At least once: An event might flow through a system twice, in case of failure
Exactly once: An event only flows through a set of operators exactly once. This is a combination of At most once and At least once.