Lecture 9 Flashcards
Black box approach for big data analysis
- Users issue analysis queries with real-time semantics
- Streams of data updates, time-varying rates, generated in real-time
- Streams of results data
- Processing in near real-time
What is a distributed stream processing system?
- Queries consist of operators
- Operators form graphs
- Operators process streams of tuples on-the-fly
- Operators span nodes
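A query as a graph of operators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(key, value)` tuple shape and the names `source`, `filter_op`, `map_op`, and `run_query` are hypothetical, and everything runs in one process, so it does not model operators spanning nodes.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # assumed tuple shape: (key, value)


def source(data: Iterable[Record]) -> Iterator[Record]:
    """Source operator: emits tuples one at a time."""
    for record in data:
        yield record


def filter_op(stream: Iterator[Record],
              predicate: Callable[[Record], bool]) -> Iterator[Record]:
    """Filter operator: forwards only tuples that satisfy the predicate."""
    for record in stream:
        if predicate(record):
            yield record


def map_op(stream: Iterator[Record],
           fn: Callable[[Record], Record]) -> Iterator[Record]:
    """Map operator: transforms each tuple on the fly."""
    for record in stream:
        yield fn(record)


def run_query(data: Iterable[Record]) -> List[Record]:
    """Wire the operators into a linear graph: source -> filter -> map."""
    stream = source(data)
    stream = filter_op(stream, lambda r: r[1] > 0)
    stream = map_op(stream, lambda r: (r[0], r[1] * 2))
    return list(stream)


result = run_query([("a", 1), ("b", -3), ("c", 5)])
print(result)  # [('a', 2), ('c', 10)]
```

Because each operator is a generator, a tuple flows through the whole chain before the next one is read, which mirrors on-the-fly processing.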
How do we build a stream processing platform in the cloud?
Intra-query parallelism:
- Provisioning for workload peaks is unnecessarily conservative
- Dynamic scale-out: increase resources when peaks appear
Failure resilience:
- **Active** fault tolerance needs 2x resources
- **Passive** fault tolerance leads to long recovery times
- Hybrid fault tolerance: low resource overhead with fast recovery
“Both mechanisms must support stateful operators”
Stateless vs. stateful operators: which one handles failure recovery and scale-out well?
Stateless
- Failure recovery
- Scale out
Stateful
X Failure recovery
X Scale out
![](https://s3.amazonaws.com/brainscape-prod/system/cm/144/571/879/a_image_thumb.jpg?1659463450)
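A small sketch can show why the two kinds of operator differ under failure. The classes `StatelessDouble` and `StatefulCounter` are hypothetical, and a "failure" is simulated simply by replacing the operator with a fresh instance.

```python
class StatelessDouble:
    """Stateless operator: output depends only on the current tuple."""

    def process(self, value: int) -> int:
        return value * 2


class StatefulCounter:
    """Stateful operator: output depends on all tuples seen so far."""

    def __init__(self) -> None:
        self.counts = {}

    def process(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


# Stateless: a restarted instance gives the same answer as before.
before_restart = StatelessDouble().process(4)
after_restart = StatelessDouble().process(4)

# Stateful: a restarted instance has lost its history.
counter = StatefulCounter()
counter.process("a")
count_before_failure = counter.process("a")        # second "a" seen
count_after_failure = StatefulCounter().process("a")  # fresh instance starts over
```

The same reasoning applies to scale-out: a stateless operator can be copied freely, while a stateful one needs its state split or moved along with it.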
Diagrams for processing state, routing state, buffer state
![](https://s3.amazonaws.com/brainscape-prod/system/cm/144/572/014/a_image_thumb.jpg?1659463451)
What does the checkpoint operation do?
Takes snapshot of state and makes it externally available
![](https://s3.amazonaws.com/brainscape-prod/system/cm/144/572/561/a_image_thumb.jpg?1659463451)
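As a minimal sketch, a checkpoint can be modelled as a deep copy of the operator's state that is handed to the caller. The `CountingOperator` class and its method names are hypothetical, and only processing state is captured here.

```python
import copy


class CountingOperator:
    """Hypothetical stateful operator that counts tuples per key."""

    def __init__(self) -> None:
        self.counts = {}

    def process(self, key: str) -> None:
        self.counts[key] = self.counts.get(key, 0) + 1

    def checkpoint(self) -> dict:
        """Return a snapshot that later processing cannot modify."""
        return copy.deepcopy(self.counts)


op = CountingOperator()
op.process("a")
op.process("a")
snapshot = op.checkpoint()  # state is now available outside the operator
op.process("a")             # the operator keeps running after the snapshot
```

The deep copy matters: returning `self.counts` directly would let later tuples change the snapshot.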
What does the backup operation do?
Moves copy of state from one operator to another
![](https://s3.amazonaws.com/brainscape-prod/system/cm/144/572/701/a_image_thumb.jpg?1659463451)
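A backup can be sketched as storing one operator's checkpoint on another operator. The `Operator` class, its `backups` field, and the `restore_from` step are assumptions added for illustration; the restore is included only to show why holding the copy elsewhere is useful.

```python
import copy


class Operator:
    """Hypothetical operator that can hold backups for other operators."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = {}
        self.backups = {}  # operator name -> snapshot held on its behalf

    def checkpoint(self) -> dict:
        return copy.deepcopy(self.state)

    def backup_to(self, other: "Operator") -> None:
        """Move a copy of this operator's state to another operator."""
        other.backups[self.name] = self.checkpoint()

    def restore_from(self, other: "Operator") -> None:
        """Assumed recovery step: reload state from the held backup."""
        self.state = copy.deepcopy(other.backups[self.name])


holder = Operator("holder")
counter = Operator("counter")
counter.state["a"] = 3
counter.backup_to(holder)

# Simulated failure: a replacement operator starts empty, then restores.
replacement = Operator("counter")
replacement.restore_from(holder)
```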
What does the partition operation do?
Splits state in semantically correct fashion for parallel processing
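One way to read "semantically correct" is that every key ends up in exactly one partition, and tuples are routed with the same function that split the state. The sketch below assumes keyed state and hash partitioning; `partition_of` and `partition_state` are hypothetical names. It uses `zlib.crc32` rather than the built-in `hash`, because string hashes in Python vary between processes.

```python
import zlib
from typing import Dict, List


def partition_of(key: str, n: int) -> int:
    """Deterministically map a key to one of n partitions."""
    return zlib.crc32(key.encode("utf-8")) % n


def partition_state(state: Dict[str, int], n: int) -> List[Dict[str, int]]:
    """Split keyed state so each key lands in exactly one partition."""
    parts: List[Dict[str, int]] = [{} for _ in range(n)]
    for key, value in state.items():
        parts[partition_of(key, n)][key] = value
    return parts


state = {"a": 3, "b": 1, "c": 7, "d": 2}
parts = partition_state(state, 2)

# Routing a new tuple uses the same function, so it reaches the
# partition that already holds the state for its key.
target = partition_of("c", 2)
```

After the split, each partition can be given to a separate operator instance and processed in parallel.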