Lecture 9 Flashcards
Black box approach for big data analysis
- Users issue analysis queries with real-time semantics
- Streams of data updates, time-varying rates, generated in real-time
- Streams of result data
- Processing in near real-time
What is a Distributed Stream Processing System?
- Queries consist of operators
- Operators form graphs
- Operators process streams of tuples on-the-fly
- Operators span nodes
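Example (a minimal sketch, not taken from the lecture): the operator-graph idea can be illustrated with Python generators, where each operator consumes a stream of tuples and emits results on-the-fly. The operator names (`source`, `filter_op`, `map_op`) and the sensor data are hypothetical, and everything runs in one process, so the "operators span nodes" aspect is not modelled here.

```python
from typing import Callable, Iterable, Iterator


def source(readings: Iterable[dict]) -> Iterator[dict]:
    """Source operator: emits one tuple at a time."""
    for reading in readings:
        yield reading


def filter_op(stream: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Filter operator: forwards only tuples that satisfy the predicate."""
    for tup in stream:
        if predicate(tup):
            yield tup


def map_op(stream: Iterable[dict], fn: Callable[[dict], tuple]) -> Iterator[tuple]:
    """Map operator: transforms each tuple independently."""
    for tup in stream:
        yield fn(tup)


# Hypothetical input stream of sensor readings.
readings = [
    {"sensor": "a", "temp": 21.0},
    {"sensor": "b", "temp": 35.5},
    {"sensor": "a", "temp": 40.0},
]

# The query is a small graph: source -> filter -> map.
pipeline = map_op(
    filter_op(source(readings), lambda t: t["temp"] > 30.0),
    lambda t: (t["sensor"], t["temp"]),
)
results = list(pipeline)
```

Because generators are lazy, each tuple flows through the whole chain before the next one is read, which mirrors on-the-fly processing.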
How do we build a stream processing platform in the cloud?
Intra-query parallelism
Provisioning for workload peaks is unnecessarily conservative
- Dynamic scale out:
Increase resources when peaks appear
Failure resilience:
- Active fault-tolerance needs 2x resources
- Passive fault-tolerance leads to long recovery times
- Hybrid fault-tolerance
Low resource overhead with fast recovery
“Both mechanisms must support stateful operators”
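Example (a rough sketch under assumed numbers): the cost difference between provisioning for the peak and dynamic scale out can be worked through with simple arithmetic. The input rates, the per-instance capacity, and the helper `instances_needed` are all hypothetical; a real system would also have to account for the cost of moving operator state when scaling.

```python
import math


def instances_needed(rate: float, capacity_per_instance: float) -> int:
    """Smallest number of operator instances that can absorb the given rate."""
    return max(1, math.ceil(rate / capacity_per_instance))


# Hypothetical tuple rates (tuples/s) observed over four intervals.
rates = [200, 450, 1800, 600]
capacity = 500  # assumed tuples/s one instance can process

# Dynamic scale out: resize for each interval.
dynamic = [instances_needed(r, capacity) for r in rates]

# Static provisioning: size every interval for the worst-case peak.
static_peak = max(dynamic)

dynamic_cost = sum(dynamic)             # instance-intervals used
static_cost = static_peak * len(rates)  # instance-intervals used
```

With these assumed numbers, the dynamic plan uses half the instance-intervals of the peak-provisioned plan.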
Stateless vs. stateful operators: which one handles failure recovery and scale out easily?
Stateless (easy)
- Failure recovery
- Scale out
Stateful (hard)
X Failure recovery
X Scale out
Diagrams for processing state, routing state, buffer state
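Example (an illustrative sketch, not from the lecture): the difference between the two operator kinds shows up when an operator is restarted. The function `to_upper` and the class `WordCounter` are hypothetical; only processing state is modelled, and routing and buffer state are left out.

```python
from collections import Counter


def to_upper(word: str) -> str:
    """Stateless operator: the output depends only on the current tuple."""
    return word.upper()


class WordCounter:
    """Stateful operator: the output depends on every tuple seen so far."""

    def __init__(self) -> None:
        self.counts = Counter()  # processing state

    def process(self, word: str) -> tuple:
        self.counts[word] += 1
        return word, self.counts[word]


counter = WordCounter()
outputs = [counter.process(w) for w in ["a", "b", "a"]]

# Simulated failure: a fresh instance starts with empty state,
# so its answer for "a" no longer reflects the earlier tuples.
restarted = WordCounter()
after_restart = restarted.process("a")
```

A restarted `to_upper` behaves exactly as before, whereas the restarted `WordCounter` gives a wrong count unless its state is recovered first.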
What is a Checkpoint?
Takes snapshot of state and makes it externally available
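Example (a minimal sketch, assuming the state is a plain dictionary of counts): a checkpoint can be modelled as serialising the state into a value that lives outside the operator. The class `CountingOperator` and its method names are hypothetical, and JSON is just one possible serialisation format.

```python
import json


class CountingOperator:
    """Hypothetical stateful operator that counts tuples per key."""

    def __init__(self) -> None:
        self.counts = {}

    def process(self, key: str) -> None:
        self.counts[key] = self.counts.get(key, 0) + 1

    def checkpoint(self) -> str:
        """Snapshot the state as a string that can be stored externally."""
        return json.dumps(self.counts, sort_keys=True)

    @classmethod
    def restore(cls, snapshot: str) -> "CountingOperator":
        """Rebuild an operator from a previously taken snapshot."""
        op = cls()
        op.counts = json.loads(snapshot)
        return op


op = CountingOperator()
for key in ["x", "y", "x"]:
    op.process(key)

snapshot = op.checkpoint()  # state as of this point in the stream
op.process("x")             # later updates do not change the snapshot

recovered = CountingOperator.restore(snapshot)
```

The recovered operator reflects the state at checkpoint time, so tuples processed after the snapshot would have to be replayed.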
What is a Backup?
Moves copy of state from one operator to another
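Example (a sketch under assumptions): a backup can be modelled as one operator holding a copy of another operator's state, so that a replacement can be started from that copy. The `Operator` class is hypothetical, and choosing the upstream operator as the holder is an assumption made for illustration; the card only says the copy moves from one operator to another.

```python
class Operator:
    """Hypothetical operator that can hold backups for other operators."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = {}
        self.backups = {}  # snapshots held on behalf of other operators

    def snapshot(self) -> dict:
        """Return a copy of the state, independent of later updates."""
        return dict(self.state)

    def backup_to(self, holder: "Operator") -> None:
        """Move a copy of this operator's state to another operator."""
        holder.backups[self.name] = self.snapshot()


upstream = Operator("upstream")
downstream = Operator("downstream")
downstream.state = {"x": 2}

downstream.backup_to(upstream)
downstream.state["x"] = 99  # later change, not covered by the backup

# Simulated failure of `downstream`: start a replacement from the backup.
replacement = Operator("downstream")
replacement.state = dict(upstream.backups["downstream"])
```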
What is a Partition?
Splits state in semantically correct fashion for parallel processing
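Example (a minimal sketch, assuming keyed state): one way to split state "in a semantically correct fashion" is to assign each key to exactly one partition using the same function that later routes incoming tuples. The helpers `partition_of` and `partition_state` are hypothetical; `zlib.crc32` is used instead of Python's built-in `hash` because the latter is randomised per process for strings.

```python
import zlib


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministically map a key to a partition index."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_state(state: dict, num_partitions: int) -> list:
    """Split keyed state so that each key lands in exactly one partition."""
    parts = [{} for _ in range(num_partitions)]
    for key, value in state.items():
        parts[partition_of(key, num_partitions)][key] = value
    return parts


# Hypothetical per-key counts held by a single operator before scale out.
state = {"apple": 3, "banana": 1, "cherry": 7, "date": 2}
parts = partition_state(state, 2)

# Merging the partitions gives back the original state.
merged = {}
for part in parts:
    merged.update(part)
```

Routing a new tuple with `partition_of` sends it to the instance that holds the state for its key, which is what keeps the parallel result consistent with the single-operator result.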