Stream processing Flashcards
some examples of streaming data (3)
- log files generated by customers using a mobile application
- social network activity
- e-commerce purchases
what is stream processing
a processing mode in which individual records or small sets of records are processed continuously as they arrive, producing a simple response
can streaming data be processed by batch processing?
yes - the stream can be collected into bounded batches and each batch processed periodically, at the cost of added latency
what is bounded data?
datasets that are finite in size
what is unbounded data?
datasets that are (at least theoretically) infinite in size; new data can arrive and become available at any point in time
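To make the distinction concrete, here is a minimal Python sketch; the record shape is a made-up example. A bounded dataset is a finite collection, while an unbounded one is a generator that can keep producing records indefinitely.

```python
import itertools
import random
import time

def bounded_dataset():
    # Finite in size: iteration terminates.
    return [{"id": i, "value": random.random()} for i in range(100)]

def unbounded_stream():
    # (Theoretically) infinite: new records can arrive at any point in time.
    for i in itertools.count():
        yield {"id": i, "value": random.random(), "ts": time.time()}
        time.sleep(0.1)  # simulate records arriving over time

# A batch job can consume all of bounded_dataset(); a streaming system
# must process unbounded_stream() incrementally, record by record.
```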
What are streaming systems designed with in mind?
Unbounded data
What is a data surge?
a sudden and significant increase in the volume of data flowing through a streaming data processing system.
For real-time systems, why is failing to produce a processing result within a time window as bad as not producing a result at all?
The events may become “insignificant” and the insights or trends produced may no longer be valid or accurate
Examples of streaming data (4)
- Messages from social platforms (e.g. Twitter)
- Internet traffic going through a network device such as a switch
- Readings from an IoT device
- Interactions of users with a web application
Frameworks for the ingestion of unbounded data (7)
- Apache Kafka
- Apache Flume
- Amazon Kinesis Data Firehose
- AWS IoT Events
- Azure Event Hubs
- Azure IoT Hub
- Google Cloud Pub/Sub
What are streams
sequences of immutable records that arrive over time
Other phrases for streams (3)
- event streams
- event logs
- message queues
What type of dataset are streams?
datasets in motion
What type of dataset are tables?
datasets at rest
what are the components of processing elements (PE)? (3)
- input queue
- computing element
- output queue
why are input queues needed? (3)
- to maintain the order of the incoming events/data
- to signal upstream systems to slow down (backpressure) when the rate at which data arrives might exceed the processing capacity of the stream processing system
- to decouple the data source from the processing components, allowing for more flexibility in management (see the sketch below)
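A minimal Python sketch of a processing element built from these three components. The queue sizes and the placeholder computation are illustrative assumptions, not a specific framework's API.

```python
import queue
import threading

input_queue = queue.Queue(maxsize=1000)   # bounded: a full queue applies
output_queue = queue.Queue(maxsize=1000)  # backpressure to the producer

def computing_element():
    while True:
        record = input_queue.get()   # FIFO preserves arrival order
        result = record * 2          # placeholder computation
        output_queue.put(result)     # blocks if downstream is slow
        input_queue.task_done()

threading.Thread(target=computing_element, daemon=True).start()
```

A producer calling `input_queue.put(x)` blocks once the queue is full, which is how the PE signals upstream to slow down; the queues also decouple the data source from the processing logic.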
What is a spout in Apache Storm?
Elements that generate streams from external sources
What is a stream in Apache Storm?
An unbounded sequence of tuples
What is a bolt in Apache Storm?
A processing element that consumes and generates streams
What is a topology in Apache Storm?
a directed graph of spouts and bolts connected by streams
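A conceptual Python sketch of the Storm model (spout, bolts, topology). This is not the actual Apache Storm API, which is Java-based; all class and method names here are illustrative.

```python
import random

class SentenceSpout:
    """Spout: generates a stream of tuples from an external source."""
    def next_tuple(self):
        return (random.choice(["the quick fox", "hello world"]),)

class SplitBolt:
    """Bolt: consumes a stream of sentences and emits a stream of words."""
    def process(self, tup):
        for word in tup[0].split():
            yield (word,)

class CountBolt:
    """Bolt: consumes the word stream and emits running counts."""
    def __init__(self):
        self.counts = {}
    def process(self, tup):
        word = tup[0]
        self.counts[word] = self.counts.get(word, 0) + 1
        yield (word, self.counts[word])

# The "topology" wires spout -> bolt -> bolt into a directed graph:
spout, split, count = SentenceSpout(), SplitBolt(), CountBolt()
for _ in range(5):
    for words in split.process(spout.next_tuple()):
        for word_count in count.process(words):
            print(word_count)
```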
what does the order of tuples in a stream represent?
the time at which they arrive at the streaming system
what does the term at-least-once processing mean?
a guarantee that each message or event in a system will be processed at least once, but potentially more than once - ensuring that no data is lost in the event of failure
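A sketch of at-least-once consumption using the kafka-python client, assuming a local broker and a hypothetical topic named "events". Committing the offset only after processing means a crash can cause redelivery (duplicate processing), but never data loss.

```python
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="my-group",
    enable_auto_commit=False,  # we control when offsets are committed
)

for message in consumer:
    process(message.value)     # hypothetical processing function
    consumer.commit()          # commit only AFTER successful processing;
                               # a crash before this line replays the message
```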
what is event time?
the time at which an event was generated by its source
what is processing time?
the time at which events are seen by the stream processing system
what are unordered streams?
streams where the event time ordering is different from the processing time ordering
what is processing-time lag?
the delay between the generation of an event and its observation by the system (this lag varies from tuple to tuple)
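A small Python sketch that separates the two timestamps; the events are hypothetical. Note that the third event was generated before the second but is observed after it, so the stream is unordered.

```python
import time

base = time.time()
# Hypothetical events; event_time is when the source generated them.
events = [
    {"id": 1, "event_time": base - 5.0},  # generated 5 s ago
    {"id": 2, "event_time": base - 1.0},
    {"id": 3, "event_time": base - 3.0},  # generated before id=2,
]                                         # but observed after it

for e in events:
    processing_time = time.time()             # when the system sees the event
    lag = processing_time - e["event_time"]   # processing-time lag, per tuple
    print(f"event {e['id']}: lag = {lag:.1f} s")
```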
when is event time or processing time relevant in processing? (2)
- To aggregate tuples and produce an aggregate computation (e.g., count, average, ..)
- To observe temporal patterns
how do streaming systems deal with temporal dimensions?
windowing
what are the types of window? (3)
- fixed
- session
- sliding
what does stream processing by windowing require? (2)
- buffers to store tuples
- a trigger stream that fires the computation (see the sketch below)
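A minimal Python sketch of fixed (tumbling) windowing: tuples are buffered per window and a count is computed when the window is complete. The window size and events are illustrative; a real system fires the computation via a trigger when the window's end time passes, rather than at end of input.

```python
from collections import defaultdict

WINDOW_SIZE = 10.0  # seconds (illustrative)

# (event_time, value) tuples; times are illustrative
events = [(1.0, "a"), (4.2, "b"), (9.9, "c"), (12.3, "d"), (15.0, "e")]

buffers = defaultdict(list)  # buffer: one list of tuples per window
for event_time, value in events:
    window_start = int(event_time // WINDOW_SIZE) * WINDOW_SIZE
    buffers[window_start].append(value)

# The "trigger" here is simply the end of input.
for window_start, values in sorted(buffers.items()):
    print(f"window [{window_start}, {window_start + WINDOW_SIZE}): "
          f"count = {len(values)}")
```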
how does Spark streaming work? (4)
- Receives input data streams
- Divides the data into micro-batches by temporal windowing
- Treats each micro-batch as an RDD and processes it with the Spark engine
- Emits the results as batches
What does Spark streaming use for stream processing?
Micro-batches
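A minimal PySpark word count using the classic DStream API, mirroring the four steps above. The socket source (e.g. fed by `nc -lk 9999`) and the 5-second batch interval are assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchWordCount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)     # 1. receive input stream
counts = (lines.flatMap(lambda line: line.split())  # 2./3. each micro-batch
               .map(lambda word: (word, 1))         #       is an RDD
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # 4. results emitted per batch

ssc.start()
ssc.awaitTermination()
```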
a key difference between streaming and micro-batch processing?
micro-batching has higher latency (delays) - up to the time interval defining the micro-batch
why is the inverse reduction function used when using a sliding window?
it avoids recomputing the entire aggregation from scratch for each window update: the previous aggregate value is reused, and an inverse operation adjusts it for the elements leaving and entering the window
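In Spark Streaming this corresponds to the two-function form of reduceByKeyAndWindow, where the second argument is the inverse reduction. The source, window, and slide durations below are illustrative; checkpointing is required when the inverse form is used.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SlidingWindowCount")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("checkpoint")  # required by the inverse-function form

pairs = ssc.socketTextStream("localhost", 9999) \
           .map(lambda word: (word, 1))

# New window = previous window + entering batches - leaving batches:
counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # add tuples entering the window
    lambda a, b: a - b,   # inverse: subtract tuples leaving the window
    windowDuration=30,
    slideDuration=10,
)
counts.pprint()
ssc.start()
ssc.awaitTermination()
```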
What is Kafka described as?
“the HDFS of unbounded data sources”