Streaming Flashcards
What is data streaming in big data?
Data streaming is the continuous processing of data as it arrives, in real time. It involves simpler computations over small, continuously incoming pieces of data, as opposed to batch analysis over a complete stored data set.
Why is streaming data processing important?
Lower latency, enabling timely data analysis
More consistent resource consumption
Real-time decision making
Workload balancing across the system
What are some use cases of data streaming?
Operational monitoring
Real-time user interaction in web analytics
Online advertising with real-time bidding
Social media analysis
IoT applications such as health monitoring
What is the cash register model in data streams?
The cash register model represents data as increments to a variable
For example, counting packets sent to each IP address
The state is only ever INCREMENTED, never decremented
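The cash register model above can be sketched in a few lines. This is only an illustration: the function name cash_register and the sample packet stream are made up for the example, and each update is assumed to be a (key, increment) pair.

```python
from collections import Counter

def cash_register(stream):
    """Apply (key, increment) updates; only non-negative increments are allowed."""
    state = Counter()
    for key, inc in stream:
        if inc < 0:
            raise ValueError("cash register model allows increments only")
        state[key] += inc
    return state

# Hypothetical stream: (destination IP, number of packets seen)
packets = [("10.0.0.1", 1), ("10.0.0.2", 3), ("10.0.0.1", 2)]
packet_counts = cash_register(packets)
print(packet_counts["10.0.0.1"])  # 3
```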
What is the turnstile model in data streaming?
The turnstile model allows updates that can both increment and decrement a variable
For example, tracking people entering and exiting a gym
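A minimal sketch of the turnstile model, using the gym example. The function name turnstile and the event list are hypothetical; each event is assumed to be +1 (someone enters) or -1 (someone leaves).

```python
def turnstile(stream):
    """Apply signed updates and record the state after each one."""
    occupancy = 0
    history = []
    for delta in stream:
        occupancy += delta  # delta may be positive or negative
        history.append(occupancy)
    return history

# Hypothetical gym events: +1 = enter, -1 = exit
gym_events = [+1, +1, -1, +1, -1, -1]
occupancy_history = turnstile(gym_events)
print(occupancy_history)  # [1, 2, 1, 2, 1, 0]
```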
How are data streaming systems classified based on latency tolerance?
Hard (no tolerance for delay)
Firm (some tolerance for delay, perhaps a few seconds)
Soft (data is still valuable when it arrives late, e.g. weather monitoring)
What are the key characteristics of streaming data?
Generated continuously
One-pass processing (each item is seen only once)
Often requires approximation (memory and time constraints)
Unbounded input (the computation must handle a potentially infinite stream)
Low latency
What components are involved in data streaming architecture?
Data Collection
Messaging Queue
Analysis
Data Consumer
(in-memory storage)
(long-term storage)
What is the role of message queuing in data streaming?
It facilitates the safe and organized exchange of data between the collection and analysis tiers, decoupling the system components.
It is also worth mentioning that it acts as a unified log: in addition to managing data flows between components in the architecture, it provides an append-only record of events that consumers (for example the analysis tier, such as Apache Flink) can read.
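The unified log idea can be illustrated with a toy in-memory log. This is a sketch only, not how Kafka or any real system is implemented: the class UnifiedLog and its methods append and read are invented for the example. The point it shows is that events are only ever appended, and each consumer keeps its own read position (offset).

```python
class UnifiedLog:
    """Toy append-only log; events are never modified or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, offset, max_events=10):
        return self._events[offset:offset + max_events]

log = UnifiedLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

# The analysis consumer reads two events and advances its own offset.
analytics_offset = 0
batch = log.read(analytics_offset, max_events=2)
analytics_offset += len(batch)

# A second consumer reads independently from the start of the same log.
archive_batch = log.read(0)
```

Because reading does not remove anything, the two consumers do not interfere with each other, which is the decoupling described above.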
What are common message queuing systems?
Apache Kafka (stores data in a log, organized into topics) and Apache Flume (moves data to long-term storage such as HDFS)
What are the message delivery semantics in streaming?
Exactly once (no data loss, and each event is read a single time)
At most once (each event is read once or not at all, so data may be lost)
At least once (each event is guaranteed to be read, possibly more than once)
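One way to see the difference between at most once and at least once is the order in which a consumer commits its offset and processes the event. The simulation below is a simplified sketch under that assumption: run_consumer, its parameters, and the single simulated crash are all hypothetical, and the crash is placed between the two steps.

```python
def run_consumer(events, commit_before_processing, crash_at):
    """Consume events with one simulated crash, restarting from the committed offset."""
    processed = []
    committed = 0   # offset to resume from after a restart
    crashed = False
    offset = 0
    while offset < len(events):
        crash_now = (offset == crash_at and not crashed)
        if commit_before_processing:
            committed = offset + 1            # commit first ...
            if crash_now:                     # ... crash before processing
                crashed = True
                offset = committed
                continue
            processed.append(events[offset])
        else:
            processed.append(events[offset])  # process first ...
            if crash_now:                     # ... crash before committing
                crashed = True
                offset = committed
                continue
            committed = offset + 1
        offset += 1
    return processed

at_most_once = run_consumer(["a", "b", "c"], True, crash_at=1)
at_least_once = run_consumer(["a", "b", "c"], False, crash_at=1)
print(at_most_once)   # ['a', 'c']
print(at_least_once)  # ['a', 'b', 'b', 'c']
```

With commit-first, event b is lost; with process-first, event b is handled twice. Exactly-once behaviour needs something extra on top of this, for example deduplicating repeated events, which this sketch does not attempt.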
What is a continuous query in data streaming?
A continuous query is issued once and then runs continuously over incoming data, updating its results in real time.
It often requires state management for intermediate results.
What challenges are associated with continuous queries?
State must be updated efficiently under memory and time constraints
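As an illustration of a continuous query with small state, the sketch below keeps a running mean. The function name running_mean and the sample readings are assumptions for the example. The state is just two numbers (count and total), so memory stays constant no matter how long the stream runs.

```python
def running_mean(stream):
    """Emit the mean of all values seen so far after each new value."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # updated result after every arrival

# Hypothetical sensor readings
readings = [10, 20, 30, 40]
means = list(running_mean(readings))
print(means)  # [10.0, 15.0, 20.0, 25.0]
```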
What is windowing in data streaming?
Windowing divides a data stream into finite chunks for processing, based on either time or number of items
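The two kinds of windowing mentioned above can be sketched as follows, using non-overlapping windows. The function names count_windows and time_windows are hypothetical, and the time-based version assumes events are (timestamp, value) pairs that arrive in timestamp order.

```python
def count_windows(stream, size):
    """Group items into consecutive windows of `size` items."""
    window = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []
    if window:
        yield window  # final, possibly partial, window

def time_windows(events, width):
    """Group (timestamp, value) events into windows of `width` time units."""
    current_start = None
    values = []
    for ts, value in events:
        start = (ts // width) * width  # start of the window this event falls in
        if current_start is not None and start != current_start:
            yield current_start, values
            values = []
        current_start = start
        values.append(value)
    if values:
        yield current_start, values

by_count = list(count_windows(range(5), 2))
by_time = list(time_windows([(0, "a"), (3, "b"), (5, "c"), (12, "d")], 5))
print(by_count)  # [[0, 1], [2, 3], [4]]
print(by_time)   # [(0, ['a', 'b']), (5, ['c']), (10, ['d'])]
```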
What are sliding windows?
Sliding windows have a fixed length and advance at regular intervals shorter than that length, so consecutive windows overlap (typically time-based)
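A small sketch of time-based sliding windows. The function sliding_windows and its parameters size and slide are hypothetical names; it assumes a finite, ordered list of (timestamp, value) events with non-negative timestamps, and it rescans the list for every window, which is fine for illustration but not how a real streaming engine would do it.

```python
def sliding_windows(events, size, slide):
    """Return (window_start, values) for windows [start, start + size)."""
    if not events:
        return []
    last_ts = events[-1][0]
    results = []
    start = 0
    while start <= last_ts:
        values = [v for ts, v in events if start <= ts < start + size]
        results.append((start, values))
        start += slide  # slide < size means consecutive windows overlap
    return results

events = [(1, "a"), (4, "b"), (6, "c"), (9, "d")]
windows = sliding_windows(events, size=10, slide=5)
print(windows)  # [(0, ['a', 'b', 'c', 'd']), (5, ['c', 'd'])]
```

Events c and d appear in both windows, which is the overlap that distinguishes sliding windows from non-overlapping ones.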