Stream Data Processing Flashcards
Why can't event stream data be stored directly as big data?
It would result in one small file per event -> too many files
Event hub
A buffer that collects incoming events and stores them in batches.
The different kinds of stream processing
1 - stream data integration
2 - stream analytics
Stream data integration
Focuses on ingesting and processing data sources, targeting ETL-style use cases
Stream analytics
Targets analytics use cases. Calculates aggregates and detects patterns.
Native streaming
Events are processed as they arrive -> lowest latency, but fault tolerance is harder and more costly to achieve
Window
A bounded subset of the stream on which computations are performed
Three types of windows
1 - fixed/tumbling windows
2 - sliding/hopping windows
3 - session windows
Fixed/tumbling windows
The window closes when it is full, based on either an item count or a fixed time interval; consecutive windows do not overlap
Sliding/hopping windows
Defined by a window length plus a sliding (hop) interval; consecutive windows can overlap
Session windows
Sequences of temporally related events terminated by a gap of inactivity
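The three window types can be illustrated with a minimal, count- and gap-based sketch in plain Python (the function names and the simplification to in-memory lists are my own, not from the cards):

```python
from collections import deque

def tumbling_windows(events, size):
    """Count-based tumbling windows: non-overlapping chunks of `size` events."""
    window = []
    for e in events:
        window.append(e)
        if len(window) == size:     # window is full -> emit and start a new one
            yield list(window)
            window.clear()

def sliding_windows(events, size, hop):
    """Count-based sliding (hopping) windows: emit every `hop` events,
    each window covering the most recent `size` events."""
    buf = deque(maxlen=size)
    for i, e in enumerate(events, start=1):
        buf.append(e)
        if i >= size and (i - size) % hop == 0:
            yield list(buf)

def session_windows(timestamps, gap):
    """Session windows: a new session starts whenever the time since the
    previous event exceeds `gap` (a period of inactivity)."""
    session = []
    for t in timestamps:
        if session and t - session[-1] > gap:
            yield session
            session = []
        session.append(t)
    if session:
        yield session

print(list(tumbling_windows([1, 2, 3, 4, 5, 6], 3)))   # [[1, 2, 3], [4, 5, 6]]
print(list(sliding_windows([1, 2, 3, 4, 5, 6], 3, 1))) # overlapping windows of 3
print(list(session_windows([1, 2, 3, 10, 11, 30], 5))) # [[1, 2, 3], [10, 11], [30]]
```

Note how tumbling windows partition the stream, sliding windows overlap, and session windows have no fixed size at all.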
Which two kinds of queries are there?
1 - ad-hoc queries
2 - standing queries
Standing queries
Queries that are stored and executed continuously over the stream
Ad-hoc queries
One-time queries, asked once against the current state of the stream
Main differences between batch processing and stream processing?
1 - The input is not controlled by the system
2 - The input timing/rate is often unknown
Multi-query optimization
Sharing common sub-queries across queries so that intermediate results are computed once instead of being recomputed per query
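A tiny sketch of the idea (the two queries and the shared sub-result are illustrative assumptions, not from the cards): both queries need per-key counts, so that sub-computation is done once and reused.

```python
from collections import Counter

events = ["a", "b", "a", "c", "a", "b"]

# Shared sub-computation: per-key counts, computed once over the stream.
counts = Counter(events)

# Query 1: most frequent key. Query 2: number of distinct keys.
# Both reuse `counts` instead of re-scanning the event stream.
top_key = counts.most_common(1)[0][0]
distinct_keys = len(counts)

print(top_key, distinct_keys)  # a 3
```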
Ways to handle overload?
- Back pressure
- Load shedding
- Distributed stream processing
Back pressure
Slows down the sources to avoid data loss (for example via blocking queues)
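Back pressure via a blocking queue can be sketched in a few lines: a bounded buffer makes the fast producer wait for the slow consumer, so no events are dropped (a minimal single-machine sketch, assuming one producer and one consumer thread):

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=2)   # bounded buffer: put() blocks when full
consumed = []

def consumer():
    while True:
        item = buf.get()
        if item is None:       # sentinel -> shut down
            break
        time.sleep(0.01)       # simulate a slow consumer
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(10):
    buf.put(i)                 # blocks (back pressure) once 2 items are queued
buf.put(None)
t.join()

print(consumed)                # all 10 items arrive; nothing is shed
```

Contrast this with load shedding below, which keeps the source fast by dropping data instead.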
Load shedding variants
1 - random sampling-based shedding
2 - relevance-based shedding
3 - summary-based shedding
Random-sampling-based load shedding
Take a random sample of the stream and compute the output from it (an approximation)
Relevance-based load shedding
Use an algorithm that knows which events are relevant to the query and discards the irrelevant ones
Summary-based load shedding
Instead of processing the full queue, a summary of the queued items is computed and used as input for the next step
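Random-sampling-based shedding is the easiest variant to sketch: process only a fraction of the events and approximate the answer from the sample (function name, `keep_prob` parameter, and the mean-estimation example are my own assumptions):

```python
import random

def approx_mean(stream, keep_prob, seed=0):
    """Random-sampling-based load shedding: keep each event with
    probability `keep_prob`, shed the rest, and approximate the mean
    from the surviving sample."""
    rng = random.Random(seed)
    total, n = 0.0, 0
    for x in stream:
        if rng.random() < keep_prob:   # event survives shedding
            total += x
            n += 1
    return total / n if n else None

# True mean of 0..999 is 499.5; the estimate is close but approximate.
est = approx_mean(range(1000), keep_prob=0.1)
print(est)
```

The trade-off is explicit: ~90% less work in exchange for an approximate result.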
Distributed stream processing
Distribute the stream based on data flow (distributes the query) or on key range (distributes the data stream itself)
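Partitioning the stream itself by key can be sketched as hash routing: every event with the same key goes to the same worker, so per-key state stays local (function name and the use of CRC32 as a stable hash are illustrative assumptions):

```python
import zlib

def partition_by_key(events, num_workers):
    """Key-based partitioning of a stream: route each (key, value) event
    to a worker determined by a stable hash of its key."""
    partitions = [[] for _ in range(num_workers)]
    for key, value in events:
        worker = zlib.crc32(key.encode()) % num_workers  # stable key -> worker
        partitions[worker].append((key, value))
    return partitions

events = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition_by_key(events, 2)
print(parts)  # both "a" events land in the same partition
```

Data-flow distribution would instead split the query into operators and place different operators on different workers.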
Three options for delivery guarantees
1 - at most once -> can cause data loss on failures
2 - at least once -> might create incorrect states because data is processed multiple times
3 - exactly once -> no loss and no duplicates, but the most expensive to guarantee
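The difference between at-least-once and exactly-once state updates can be sketched with deduplication: if redelivered events carry stable IDs, tracking seen IDs makes processing idempotent (event IDs, the sample deliveries, and the function name are illustrative assumptions):

```python
def process_with_dedup(deliveries):
    """At-least-once delivery may redeliver events after a failure;
    skipping already-seen event IDs yields exactly-once state updates."""
    seen = set()
    state = 0
    for event_id, amount in deliveries:
        if event_id in seen:      # duplicate redelivery -> skip
            continue
        seen.add(event_id)
        state += amount
    return state

# Event 1 is redelivered after a failure; without dedup the state would be 60.
deliveries = [(1, 10), (2, 20), (1, 10), (3, 20)]
print(process_with_dedup(deliveries))  # 50
```

The cost of "exactly once" shows up here as the `seen` set, which must itself survive failures in a real system.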