Introduction to Spark Streaming Flashcards

1
Q

What is Spark Streaming?

A

Apache Spark's stream-processing component, a popular tool in industry for processing live data streams.

2
Q

What is Stream Processing?

A

An approach that reduces the time between data acquisition and computation. It deals with constantly changing data that needs quick analysis.

3
Q

What are the typical data pipeline steps?

A

Extract, Transform, Load, Analyze
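
Example: the four steps can be sketched as plain Python functions chained together. This is only an illustration; the CSV record format, the function names, and the in-memory list standing in for a data store are all assumptions and not part of any Spark API.

```python
import csv
import io
from statistics import mean

# Hypothetical raw input: sensor readings as CSV text.
RAW = "sensor,temp_f\na,68.0\nb,86.0\na,50.0\n"

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert Fahrenheit strings to Celsius floats."""
    return [
        {"sensor": r["sensor"], "temp_c": (float(r["temp_f"]) - 32) * 5 / 9}
        for r in rows
    ]

def load(rows, store):
    """Load: append cleaned rows to the target store (a list here)."""
    store.extend(rows)
    return store

def analyze(store):
    """Analyze: average temperature per sensor."""
    by_sensor = {}
    for r in store:
        by_sensor.setdefault(r["sensor"], []).append(r["temp_c"])
    return {s: mean(v) for s, v in by_sensor.items()}

warehouse = []
load(transform(extract(RAW)), warehouse)
averages = analyze(warehouse)
```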

4
Q

What is the difference between Batch and Stream Processing?

A

Batch processing works with static datasets while stream processing handles real-time, dynamic data, ensuring faster analysis of new or changed data.
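
Example: a minimal sketch of the contrast, using an average as the computation. The batch function needs the whole dataset up front, while the streaming class keeps a small state and produces a fresh answer after every record. The names `batch_mean` and `RunningMean` are hypothetical.

```python
def batch_mean(dataset):
    """Batch: the whole dataset is available before computation starts."""
    return sum(dataset) / len(dataset)

class RunningMean:
    """Stream: keep a small state and update it for each arriving record."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

values = [4, 8, 6, 2]
stream = RunningMean()
# One up-to-date result per arriving value.
running = [stream.update(v) for v in values]
```

Once every value has arrived, the last streaming result matches the batch result; the difference is that the streaming version had an answer available after each record.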

5
Q

What are the forms of streaming in Spark?

A

Micro-batch processing and continuous stream processing.

6
Q

What is Micro-Batch processing?

A

Processes the stream as a series of small batch jobs, with end-to-end latency as low as about 100 ms, and provides an exactly-once (“once-and-only-once”) guarantee.
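
Example: a toy simulation of micro-batching in plain Python, not Spark code. Events carry a hypothetical arrival time in milliseconds and are grouped into one batch per trigger interval; the 100 ms interval is only an illustrative value.

```python
from collections import defaultdict

def micro_batches(events, interval_ms):
    """Group (arrival_ms, value) events into batches, one per trigger interval."""
    batches = defaultdict(list)
    for arrival_ms, value in events:
        batches[arrival_ms // interval_ms].append(value)
    # Return the non-empty batches in trigger order.
    return [batches[k] for k in sorted(batches)]

events = [(5, "a"), (40, "b"), (120, "c"), (199, "d"), (310, "e")]
batches = micro_batches(events, interval_ms=100)
```

Each inner list is then processed as an ordinary small batch job; intervals in which nothing arrived produce no batch in this sketch.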

7
Q

What is continuous stream processing?

A

Processes records as they arrive, with latency as low as about 1 ms, but offers only an “at-least-once” guarantee, so duplicates are possible.
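
Example: a toy model of why an at-least-once guarantee allows duplicates. After a simulated failure, processing restarts from the last checkpoint, so records handled since that checkpoint are delivered a second time. The function name, the checkpoint scheme, and the failure position are assumptions for illustration; this is not how Spark implements recovery internally.

```python
def deliver_at_least_once(records, checkpoint_every, fail_at):
    """Simulate one failure: restart from the last checkpoint, so records
    handled after that checkpoint are delivered again."""
    delivered = []
    checkpoint = 0   # index of the next record to read after a restart
    i = 0
    failed = False
    while i < len(records):
        if i == fail_at and not failed:
            failed = True
            i = checkpoint   # replay from the last committed position
            continue
        delivered.append(records[i])
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = i
    return delivered

records = ["r0", "r1", "r2", "r3", "r4"]
out = deliver_at_least_once(records, checkpoint_every=2, fail_at=3)
```

No record is lost, but `r2` reaches the output twice because it was processed after the last checkpoint and before the failure.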

8
Q

What are some challenges with streaming?

A

Late events: events arriving later than expected can affect state consistency.

End-to-end guarantees: fault tolerance is required across the source, processing and sink stages.

Code portability: the Batch API and Streaming API are similar but require minor modifications for compatibility.
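
Example: a simplified sketch of one common way to handle late events, a watermark. Events are counted per event-time window, and any event older than the watermark (the largest event time seen so far minus an allowed lateness) is dropped. The function name, the integer time units, and the per-event watermark update are assumptions; real engines typically advance the watermark at batch boundaries.

```python
def count_by_window(events, window, allowed_lateness):
    """Count events per event-time window, dropping events older than the watermark."""
    counts = {}
    dropped = []
    max_event_time = 0
    for event_time, value in events:   # events are listed in arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append((event_time, value))   # too late to be counted
            continue
        start = (event_time // window) * window   # start of the event's window
        counts[start] = counts.get(start, 0) + 1
    return counts, dropped

arrivals = [(1, "a"), (12, "b"), (8, "c"), (25, "d"), (3, "e")]
counts, dropped = count_by_window(arrivals, window=10, allowed_lateness=10)
```

The event at time 8 arrives out of order but is still within the allowed lateness, so it is counted; the event at time 3 arrives after the watermark has moved to 15 and is dropped.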

9
Q

What is the Spark Streaming Architecture?

A

Source (data is read), Sink (data is written), State (intermediate computations) and Result Table (processed results).

Note: triggers fire at set intervals to process newly arrived data.
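
Example: a toy word-count query in plain Python that names the four parts. The class `WordCountQuery` and its `trigger` method are hypothetical; here each method call stands in for one trigger, whereas a real engine fires triggers on a timer.

```python
from collections import Counter

class WordCountQuery:
    """Toy query: source -> state -> result table -> sink, driven by triggers."""
    def __init__(self, source):
        self.source = iter(source)   # Source: where data is read
        self.state = Counter()       # State: intermediate running counts
        self.sink = []               # Sink: where each trigger's output is written

    def trigger(self, max_records):
        """One trigger: read new records, update state, write the result table."""
        for _ in range(max_records):
            line = next(self.source, None)
            if line is None:         # source has no more data for now
                break
            self.state.update(line.split())
        result_table = dict(self.state)   # Result table: processed results so far
        self.sink.append(result_table)
        return result_table

query = WordCountQuery(["a b", "b c", "c c"])
first = query.trigger(max_records=2)
second = query.trigger(max_records=2)
```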

10
Q

What are the output modes in Spark Streaming?

A

Complete Output Mode, Update Output Mode and Append Output Mode.
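
Example: a minimal sketch of what each mode sends to the sink on one trigger, given the result table before and after that trigger. The function `emit` and the dictionary representation of the result table are assumptions, and the sketch ignores the rules about which queries each mode accepts.

```python
def emit(mode, previous, current):
    """Return the rows a sink would receive for one trigger under each mode."""
    if mode == "complete":
        return dict(current)   # the whole result table
    if mode == "update":
        # Only rows that are new or whose value changed.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows that did not exist before.
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown mode: {mode}")

previous = {"a": 1, "b": 2}
current = {"a": 1, "b": 3, "c": 1}
```

With these tables, complete mode writes all three rows, update mode writes `b` and `c`, and append mode writes only `c`.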

11
Q

What is Complete Output Mode?

A

Writes the entire result table to the sink on every trigger; used for aggregate queries.

12
Q

What is Update Output Mode?

A

Writes only the rows of the result table that changed since the last trigger; unchanged rows are not rewritten.

13
Q

What is Append Output Mode?

A

Only adds new rows to the result table; no updates allowed. Aggregations with watermarks can be used in append mode.
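
Example: a simplified sketch of why a watermark makes aggregation possible in append mode. A window's count is emitted exactly once, and only after the watermark has passed the end of the window, so the emitted row never needs updating. The function name, the integer time units, and the set used to track emitted windows are assumptions for illustration.

```python
def append_finalized(window_counts, window, watermark, already_emitted):
    """Emit each window's count once, only after the watermark passes its end."""
    out = []
    for start in sorted(window_counts):
        if start + window <= watermark and start not in already_emitted:
            out.append((start, window_counts[start]))
            already_emitted.add(start)   # never emit this window again
    return out

emitted = set()
counts = {0: 2, 10: 1, 20: 1}   # window start -> count
first = append_finalized(counts, window=10, watermark=15, already_emitted=emitted)
second = append_finalized(counts, window=10, watermark=32, already_emitted=emitted)
```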

14
Q

What are the key limitations of output modes?

A

Append Mode cannot support queries requiring updates and Complete Mode does not support non-aggregate queries.

15
Q

Which output modes support aggregate queries and which ones don’t?

A

Complete Mode: Works best for aggregate queries because it rewrites the entire result table.

Append Mode: Does not support aggregate queries unless watermarks are used because it only appends new records without updating existing ones.

Update Mode: Can work with aggregate and non-aggregate queries, as it updates rows incrementally.
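
Example: the rules on this card written as a small lookup function. It encodes only what the card states and is not Spark's actual validation logic; the function name and parameters are hypothetical.

```python
def mode_supported(mode, has_aggregation, has_watermark=False):
    """Return True if the card's rules allow this mode for the query."""
    if mode == "complete":
        return has_aggregation                      # aggregate queries only
    if mode == "append":
        return (not has_aggregation) or has_watermark
    if mode == "update":
        return True                                 # aggregate and non-aggregate
    raise ValueError(f"unknown mode: {mode}")
```

For instance, `mode_supported("append", has_aggregation=True)` is false under these rules, while adding `has_watermark=True` makes it true.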
