Introduction to Spark Streaming Flashcards
What is Spark Streaming?
Apache Spark's stream processing component, widely used in industry for processing live data.
What is Stream Processing?
Processing data continuously as it arrives rather than in periodic bulk jobs. It reduces the time between data acquisition and computation and suits constantly changing data that needs quick analysis.
What are the typical data pipeline steps?
Extract, Transform, Load, Analyze
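The four steps can be illustrated with a minimal plain-Python sketch. The record format, function names and in-memory "warehouse" are all hypothetical, chosen only to show each step in order; a real pipeline would read from and write to external systems.

```python
# Hypothetical raw records; the "user,value" format is illustrative only.
RAW = ["alice,3", "bob,5", "alice,4", "bad-row"]

def extract(lines):
    """Extract: read raw records from a source (here, an in-memory list)."""
    return list(lines)

def transform(records):
    """Transform: parse each record and drop rows that do not parse."""
    rows = []
    for rec in records:
        parts = rec.split(",")
        if len(parts) == 2 and parts[1].isdigit():
            rows.append((parts[0], int(parts[1])))
    return rows

def load(rows, store):
    """Load: write cleaned rows to a destination (here, a list)."""
    store.extend(rows)
    return store

def analyze(store):
    """Analyze: compute a total per user over the loaded rows."""
    totals = {}
    for user, value in store:
        totals[user] = totals.get(user, 0) + value
    return totals

warehouse = load(transform(extract(RAW)), [])
totals = analyze(warehouse)
```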
What is the difference between Batch and Stream Processing?
Batch processing works with static datasets while stream processing handles real-time, dynamic data, ensuring faster analysis of new or changed data.
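As a rough illustration of the difference, the sketch below computes the same average two ways: once over a complete static dataset, and once incrementally as each value arrives. The class and function names are assumptions made for this example and are not part of Spark.

```python
def batch_average(dataset):
    """Batch: the whole static dataset is available before computing."""
    return sum(dataset) / len(dataset)

class StreamingAverage:
    """Stream: keep a small running state and update it for every new value."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

readings = [10, 20, 30, 40]

batch_result = batch_average(readings)

stream = StreamingAverage()
# An up-to-date answer is available after every record, not only at the end.
stream_results = [stream.update(v) for v in readings]
```

Both approaches end at the same value; the streaming version simply has a current answer after every record.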
What are the forms of streaming in Spark?
Micro-batch processing and continuous stream processing.
What is Micro-Batch processing?
Processes the stream as a series of small batch jobs, reaching end-to-end latencies as low as about 100 ms, and provides an “exactly-once” guarantee.
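The idea can be modelled in plain Python, assuming events carry an arrival time in milliseconds. This is a conceptual sketch, not Spark's implementation: `to_micro_batches` and `IdempotentSink` are hypothetical names, and committing each batch id at most once is just one common way to keep a retried batch from being written twice.

```python
from collections import defaultdict

def to_micro_batches(events, interval_ms):
    """Group (arrival_ms, value) events into batches keyed by trigger tick."""
    batches = defaultdict(list)
    for arrival_ms, value in events:
        batches[arrival_ms // interval_ms].append(value)
    return dict(sorted(batches.items()))

class IdempotentSink:
    """Commits each batch id at most once, so a replayed batch is a no-op."""

    def __init__(self):
        self.committed = {}

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return False  # already written; ignore the retry
        self.committed[batch_id] = list(rows)
        return True

events = [(10, "a"), (60, "b"), (120, "c"), (199, "d"), (250, "e")]
batches = to_micro_batches(events, interval_ms=100)

sink = IdempotentSink()
for batch_id, rows in batches.items():
    sink.write(batch_id, rows)

# Simulate a retry of batch 1 after a failure: nothing is duplicated.
replayed = sink.write(1, batches[1])
```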
What is continuous stream processing?
Processes each record as it arrives, with latency as low as about 1 ms, but offers only an “at-least-once” guarantee, so duplicates are possible.
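Why "at-least-once" allows duplicates can be seen in a small simulation, assuming progress is checkpointed only every few records. The function name and parameters are hypothetical, and the crash is simulated; this models the guarantee rather than Spark's internals.

```python
def run_with_crash(records, checkpoint_every, crash_after):
    """Process records, crash once, then restart from the last checkpoint."""
    output = []
    checkpoint = 0  # index of the first record not yet covered by a checkpoint
    for i, rec in enumerate(records):
        if i == crash_after:
            break  # simulated failure before the next checkpoint
        output.append(rec)
        if (i + 1) % checkpoint_every == 0:
            checkpoint = i + 1
    # Restart: replay everything from the last checkpoint onward.
    for rec in records[checkpoint:]:
        output.append(rec)
    return output

records = ["r0", "r1", "r2", "r3", "r4"]
out = run_with_crash(records, checkpoint_every=2, crash_after=3)
```

Here `r2` was emitted before the crash but after the last checkpoint, so it is emitted again on restart. No record is lost, but one appears twice.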
What are some challenges with streaming?
Late events (events arriving later than expected can affect state consistency), end-to-end guarantees (fault tolerance is required across the source, processing and sink stages) and code portability (the Batch API and Streaming API are similar but require minor modifications for compatibility).
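The late-events challenge is often handled with a watermark, a threshold that trails the latest event time seen. The sketch below is a simplified plain-Python model, assuming events are (event_time_seconds, key) pairs listed in arrival order; it drops any event older than the watermark, which is coarser than the per-window rule a real engine applies. All names are hypothetical.

```python
def count_by_window(events, window_s, lateness_s):
    """Count events per event-time window, dropping events behind the watermark."""
    counts = {}
    dropped = []
    max_event_time = 0
    for event_time, key in events:  # events are listed in arrival order
        # The watermark trails the newest event time seen so far.
        watermark = max_event_time - lateness_s
        if event_time < watermark:
            dropped.append((event_time, key))  # too late: state is not touched
            continue
        window_start = (event_time // window_s) * window_s
        counts[window_start] = counts.get(window_start, 0) + 1
        max_event_time = max(max_event_time, event_time)
    return counts, dropped

events = [(1, "a"), (12, "b"), (8, "c"), (25, "d"), (3, "e")]
counts, dropped = count_by_window(events, window_s=10, lateness_s=10)
```

The event at time 8 arrives out of order but within the allowed lateness, so it still updates its window. The event at time 3 arrives after the watermark has moved to 15 and is discarded.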
What is the Spark Streaming Architecture?
Source (data is read), Sink (data is written), State (intermediate computations) and Result Table (processed results).
*Triggers fire at specified intervals to process newly arrived data.
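How these pieces fit together can be sketched with a toy word-count query in plain Python. The `WordCountQuery` class, its attributes and the `max_rows` parameter are hypothetical; the sketch only mirrors the roles of source, state, result table, sink and trigger, and writes the full table on each trigger for simplicity.

```python
class WordCountQuery:
    """Toy model of a streaming query: source, state, result table and sink."""

    def __init__(self, source):
        self.source = source  # Source: where input rows are read from
        self.offset = 0       # how far into the source has been read
        self.state = {}       # State: running counts kept between triggers
        self.sink = []        # Sink: where output is written

    def trigger(self, max_rows):
        """One trigger: read new rows, update state, write the result table."""
        new_rows = self.source[self.offset:self.offset + max_rows]
        self.offset += len(new_rows)
        for word in new_rows:
            self.state[word] = self.state.get(word, 0) + 1
        result_table = dict(self.state)  # Result Table: processed results
        self.sink.append(result_table)
        return result_table

query = WordCountQuery(["cat", "dog", "cat", "owl", "dog", "cat"])
first = query.trigger(max_rows=3)
second = query.trigger(max_rows=3)
```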
What are the output modes in Spark Streaming?
Complete Output Mode, Update Output Mode and Append Output Mode.
What is Complete Output Mode?
Writes the entire result table to the sink during each trigger, used for aggregate queries.
What is Update Output Mode?
Writes only the rows of the result table that changed since the last trigger; unchanged rows are not rewritten.
What is Append Output Mode?
Writes only rows newly added to the result table since the last trigger; existing rows are never updated. Aggregations can be used in append mode only with watermarks.
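The three modes can be compared with a small plain-Python model that takes the result table before and after a trigger and returns what would be written to the sink. The function `rows_to_write` is hypothetical, and raising an error when append mode meets a changed row is a simplification of the "no updates allowed" rule, not Spark's actual check.

```python
def rows_to_write(mode, previous, current):
    """Return the rows written to the sink for one trigger under each mode."""
    if mode == "complete":
        # Entire result table, changed or not.
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Existing rows must not change; only brand-new rows are written.
        changed = [k for k in previous if k in current and previous[k] != current[k]]
        if changed:
            raise ValueError("append mode cannot update existing rows: %s" % changed)
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError("unknown output mode: %s" % mode)

previous = {"cat": 2, "dog": 1}
current = {"cat": 3, "dog": 1, "owl": 1}

complete_rows = rows_to_write("complete", previous, current)
update_rows = rows_to_write("update", previous, current)

# The count for "cat" changed, so append mode rejects this result table.
try:
    rows_to_write("append", previous, current)
    append_rejected = False
except ValueError:
    append_rejected = True

# With no changed rows, append mode writes just the new row.
append_rows = rows_to_write("append", {"cat": 2}, {"cat": 2, "dog": 1})
```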
What are the key limitations of output modes?
Append Mode cannot support queries requiring updates and Complete Mode does not support non-aggregate queries.
Which output modes support aggregate queries and which don’t?
Complete Mode: Works best for aggregate queries because it rewrites the entire result table.
Append Mode: Supports aggregate queries only when watermarks are used, because it appends new records and never updates existing ones.
Update Mode: Can work with aggregate and non-aggregate queries, as it updates rows incrementally.