WIP Stream Processing Flashcards
What is a message broker?
A database dedicated to storing messages/events and notifying listeners that new data is available.
Decouples producers from consumers.
Can provide delivery guarantees and durability of messages (by persisting them to disk).
Increases fault tolerance if messages are persisted.
Increases latency and decreases throughput - the message needs to be stored -> IO operations -> relatively slow.
Usually, if messages are persisted, they are only kept for a SHORT period of time (unlike a classic RDBMS).
Optimized for reading topics sequentially. Not good for randomly accessing or searching data.
Usually communication is done asynchronously. The producer receives an ACK meaning the broker accepted the message, not that it was delivered to any consumer.
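A minimal sketch of this async produce/ACK flow, assuming the kafka-python client (broker address and topic name are placeholders):

```python
from kafka import KafkaProducer

# Connect to the broker; acks=1 means the ACK only confirms the broker
# accepted the message, not that any consumer has read it.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks=1)

# send() is asynchronous and returns a future immediately.
future = producer.send("events", b'{"type": "page_view", "user": 42}')

# Block until the broker ACKs (or raise on failure).
metadata = future.get(timeout=10)
print(metadata.topic, metadata.partition, metadata.offset)
```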
What are the disadvantages of polling-based consumers?
If done too frequently - wasted throughput: many polling requests return empty or small batches.
This increases network and database load.
If done too rarely - the delay (latency) between messages being published and processed goes up.
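A sketch of a polling loop with simple exponential backoff to limit wasted empty polls (fetch_batch and handle are hypothetical stand-ins for the real poll and processing calls):

```python
import time

MIN_DELAY, MAX_DELAY = 0.1, 5.0  # seconds

def consume(fetch_batch, handle):
    """Poll in a loop; back off while polls come back empty."""
    delay = MIN_DELAY
    while True:
        batch = fetch_batch()          # hypothetical poll call
        if batch:
            for message in batch:
                handle(message)
            delay = MIN_DELAY          # data is flowing, poll eagerly again
        else:
            time.sleep(delay)          # empty poll, wait before retrying
            delay = min(delay * 2, MAX_DELAY)
```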
What is a stream? What does it consist of?
A stream, in general, is data made incrementally available over time.
Usually it’s assumed to be infinite - unbounded.
It consists of events/messages.
These are SELF-CONTAINED, IMMUTABLE objects.
They describe what happened in the system.
Can be e.g. in JSON format.
Contains a timestamp (time-of-day type) of when the message was published.
Should be relatively small to achieve the desired throughput (heavier events = less data can be pushed into or read from the stream per unit of time).
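A sketch of what such an event could look like, modeled as a small, immutable Python object (field names are illustrative):

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass(frozen=True)  # frozen = immutable once created
class Event:
    event_type: str
    user_id: int
    published_at: float  # time-of-day timestamp, e.g. epoch seconds

    def to_json(self) -> str:
        # Self-contained: everything needed to interpret the event is inside it.
        return json.dumps(asdict(self))

event = Event("page_view", 42, time.time())
print(event.to_json())
```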
Difference between stream and batch processing?
Stream processing works with data unbounded in size. New data can appear anytime.
Batch processing works with fixed size data.
Streaming systems usually process data shortly after it is published. Some systems like Kafka/Kinesis allow replaying the stream of events, so the processing time might not be related to the event's publish time (see the replay sketch below).
Batch - data from a fixed period of time can be processed at any time.
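A minimal sketch of replaying a Kafka topic from the beginning, assuming kafka-python (broker address and topic name are placeholders):

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,  # we manage positions ourselves
)

# Manually assign a partition and rewind to the oldest retained offset.
tp = TopicPartition("events", 0)
consumer.assign([tp])
consumer.seek_to_beginning(tp)

# Events are now re-processed long after their original publish time.
for record in consumer:
    print(record.offset, record.value)
```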
What if the consumer cannot keep up with processing the stream because the producer is pushing too much data?
Messages can be buffered.
Requires selecting an optimal buffer size (too large would be an extra cost, too small - running the risk of dropping messages).
Backpressure mechanism
The consumer or broker blocks the producer or refuses to accept a message (see the sketch below).
TCP - blocking buffer: once it fills up, the producer won't receive an ACK until the buffer is read by the consumer.
Kinesis - each shard has limits on the write side; an error status code is returned if producers exceed them. They should then retry.
Dropping messages.
Whether this is acceptable depends on the business perspective. High-volume sensor data might fall into this category.
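A sketch of the backpressure idea with a bounded in-process buffer: the producer blocks as soon as the queue is full, the same idea TCP's blocking buffer implements across the network (sizes and sleep times are arbitrary):

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=100)  # bounded buffer between producer and consumer

def producer():
    for i in range(10_000):
        # put() BLOCKS when the queue is full - the producer is slowed down
        # to the consumer's pace instead of messages being dropped.
        buffer.put(f"event-{i}")

def consumer():
    while True:
        event = buffer.get()
        time.sleep(0.01)  # simulate slow processing
        buffer.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
buffer.join()  # wait until every buffered event was processed
```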
Is the approach of producers retrying a message fault tolerant?
It's not, unless the producer persists messages until they are successfully published.
The producer could crash while awaiting the broker's ACK or while making a retry request.
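A sketch of making the retry loop fault tolerant by persisting messages locally before sending (an outbox-style pattern; the SQLite schema and the publish function are illustrative assumptions):

```python
import sqlite3

db = sqlite3.connect("outbox.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload BLOB)")

def enqueue(payload: bytes):
    # Persist BEFORE attempting to publish; survives a producer crash.
    with db:
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def flush(publish):
    # publish() is a hypothetical call that blocks until the broker ACKs.
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)
        with db:
            # Delete only after a successful ACK; a crash in between just
            # means the message is re-sent on restart (at-least-once).
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
```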
What is log compaction?
The process of cleaning up the stream in log-based message brokers. The goal is to keep only the latest update message per record/key. Older ones can then be discarded, saving disk space.
Deletes are handled by a tombstone (a message with a null value).
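A toy sketch of the compaction idea, simplified so that keys whose latest message is a tombstone are dropped immediately (real brokers retain tombstones for a while so consumers can observe the deletion):

```python
def compact(log):
    """log: iterable of (key, value) pairs in publish order."""
    latest = {}
    for key, value in log:
        if value is None:        # tombstone: the record was deleted
            latest.pop(key, None)
        else:
            latest[key] = value  # newer message shadows older ones
    return latest

log = [("user:1", "Ann"), ("user:2", "Bob"), ("user:1", "Anna"), ("user:2", None)]
print(compact(log))  # {'user:1': 'Anna'}
```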
What time-related challenge is there with stream processing? How can this be dealt with?
tldr: straggler events
To compute some metrics/operations or perform joins, the processor might want to gather all relevant events from within some time window. Some events might arrive in the stream late, AFTER the window they belong to has already been completed.
This could be due to clock desync, the producer/broker being overloaded, network faults etc.
Stragglers can sometimes be ignored (a small % of the data).
The result of a windowed computation can be corrected taking the straggler into account; the previous result would need to be somehow invalidated, and the previous output would still need to be available to perform the correction (see the sketch below).
The producer passes a special message to the consumer saying that the window is completed. Tricky to maintain (adding a producer or consumer means extra data needs to be tracked).
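A sketch of the correction approach: a tumbling-window counter that re-emits an updated result when a straggler arrives for an already-emitted window (window size and emit strategy are arbitrary choices):

```python
from collections import defaultdict

WINDOW = 60  # tumbling window size in seconds

counts = defaultdict(int)   # window start -> event count
emitted = set()             # windows whose result was already published

def on_event(ts: float):
    window = int(ts // WINDOW) * WINDOW
    counts[window] += 1
    if window in emitted:
        # Straggler: re-emit a correction that invalidates the old result.
        print(f"CORRECTION window={window} count={counts[window]}")

def close_window(window: int):
    # Called when we believe the window is complete (e.g. on a watermark).
    emitted.add(window)
    print(f"RESULT window={window} count={counts[window]}")
```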
What can stream processing be used for?
- Deriving new views of data like search index
- Synchronizing data with an external system
- Remote procedure call
- Analysing metrics, aggregates, finding anomalies
- Real-time notifications when a pattern is matched (like saved searches @ otomoto)
Describe some types of computation window.
Tumbling window - takes events in a given time interval. None of the windows overlap.
Hopping window - windows can overlap with each other (e.g. with a 1-minute margin: the next window starts with the final 1 minute "worth" of the previous window's data).
Sliding window - overlapping with as fine a resolution as possible. At a given point in processing time, take all events published after the start time and before the end time.
Session window - no fixed duration. Take all events from the start of the session until it expires (e.g. after a period of inactivity).
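A sketch of how events map to tumbling vs. hopping windows (pure arithmetic; sizes, hop, and the t >= 0 clamp are arbitrary choices):

```python
def tumbling_window(ts: int, size: int) -> int:
    """Each event belongs to exactly one non-overlapping window; returns its start."""
    return (ts // size) * size

def hopping_windows(ts: int, size: int, hop: int) -> list[int]:
    """Overlapping fixed-size windows that advance by `hop` (hop < size)."""
    first = ((ts - size) // hop + 1) * hop   # earliest window still containing ts
    return [start for start in range(max(first, 0), ts + 1, hop)
            if start <= ts < start + size]

print(tumbling_window(125, 60))       # 120: the single 1-minute window for t=125
print(hopping_windows(125, 300, 60))  # [0, 60, 120]: 5-min windows (1-min hop) containing t=125
```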