Spark Libraries Flashcards
Briefly summarize the two Spark streaming paradigms
- Discretized Stream (DStream): Breaks the stream into micro-batches and processes each micro-batch as an RDD, using ordinary RDD operations. Processed results are returned in batches, in the same order.
- Structured Streaming: A high-level API built on top of the Spark SQL engine that treats the stream as an unbounded table with rows and columns. It lets you express complex streaming computations with the DataFrame and Dataset APIs (Spark SQL), while the system automatically manages incremental execution, checkpointing, and fault tolerance. Because it is built on Spark SQL, it also benefits from the Catalyst optimizer out of the box, as sketched below.
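A minimal word-count sketch of the Structured Streaming API, assuming a socket text source on localhost:9999 and a console sink (both chosen only for illustration):
~~~
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredWordCount {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate();

    // Each line arriving on the socket becomes a new row in an unbounded table
    // with a single "value" column.
    Dataset<Row> lines = spark.readStream()
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load();

    // Ordinary DataFrame operations; Catalyst optimizes the whole query plan.
    Dataset<Row> wordCounts = lines
        .select(explode(split(col("value"), " ")).as("word"))
        .groupBy("word")
        .count();

    // Spark manages incremental execution, checkpointing and fault tolerance.
    StreamingQuery query = wordCounts.writeStream()
        .outputMode("complete")
        .format("console")
        .start();

    query.awaitTermination();
  }
}
~~~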
When using DStreams, how can we make Spark incrementally undo reduce operations on older batches as new batches come in? What are the advantages, and give one scenario where this would be useful.
We can supply an invertible (inverse) reduce function alongside the standard reduce function in reduceByKeyAndWindow:
lines.flatMap(…).mapToPair(…)
     .reduceByKeyAndWindow((i1, i2) -> i1 + i2,
                           (i1, i2) -> i1 - i2, Durations.seconds(1000))
Here (i1, i2) -> i1 - i2 is the invertible reduce function, which tells Spark how to incrementally remove the contribution of batches that fall out of the window. This means we do not have to recompute the whole window: we only add the results of newly arrived batches and subtract those of the batches that just left.
This saves a lot of computation when the sliding window spans many batches. For example, with a batch size of 1 second and a sliding window of 24 hours, you do not have to recompute 24 hours' worth of batches every second; you simply subtract the oldest second and add the newest one.
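A fuller, runnable sketch of this pattern (the socket source, checkpoint path, and a 60-second window are illustrative assumptions; the checkpoint directory is required when using the inverse-reduce overload):
~~~
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class IncrementalWindowedCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("IncrementalWindowedCount");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    jssc.checkpoint("/tmp/checkpoint"); // required for the invertible-reduce overload

    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    JavaPairDStream<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKeyAndWindow(
            (i1, i2) -> i1 + i2,     // forward reduce: add counts from batches entering the window
            (i1, i2) -> i1 - i2,     // inverse reduce: subtract counts from batches leaving the window
            Durations.seconds(60),   // window length
            Durations.seconds(1));   // slide interval (one new batch per slide)

    counts.print();
    jssc.start();
    jssc.awaitTermination();
  }
}
~~~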
Which of these reduce functions are invertible?
A. Count
B. Addition to a set
C. Addition to a multiset
D. Keeping only the most frequent items
Briefly explain each one.
A. Count: Yes, we can increment the count on arrival and decrement it when it falls out of the window.
B. Addition to a set: No. Sets do not allow duplicates, so we cannot tell whether an item was previously added more than once, which means we cannot safely remove it when it falls out of the window.
C. Addition to a multiset: Yes. Multisets allow duplicates, so we can simply remove one occurrence of the element when it leaves the window.
D. Keeping only the most frequent items: Not straightforward, but possible with specialized algorithms. Keeping track of exact frequencies over a stream is expensive and requires auxiliary data structures and/or algorithms.
Which output modes are there in Structured Streaming?
- Complete mode: The entire updated result table is emitted to the sink for each trigger interval (includes old and new results).
- Update mode: Only the rows in the result table that have changed (i.e. updated) or are new since the last trigger interval are emitted to the sink.
- Append mode: Only new rows that have been added to the result table since the last trigger interval are emitted to the sink.
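A minimal sketch of how an output mode is selected when starting a query (assuming `wordCounts` is a streaming aggregation DataFrame like the one in the earlier sketch, with the console sink used only for illustration):
~~~
// Choose the output mode when starting the query: "complete" re-emits the
// whole result table on each trigger, "update" emits only changed rows, and
// "append" emits only rows that will never change again.
StreamingQuery query = wordCounts.writeStream()
    .outputMode("update")
    .format("console")
    .start();
~~~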
Provide two downsides of DStreams.
- It is not declarative: harder to write, and Spark cannot optimize it automatically.
- RDDs are untyped: the input consists of raw objects, so it requires a lot of extra processing such as parsing.
Which types of sliding windows are there in spark structured streaming?
- Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window.
- Sliding windows are also fixed-sized, but they can overlap when the slide duration is smaller than the window duration; in that case an input can be bound to multiple windows (see the sketch after the session-window example below).
- Session windows behave differently from the previous two types: the window length is dynamic and depends on the inputs. A session window starts with an input and expands if the following input is received within the gap duration. With a static gap duration, a session window closes when no input is received within the gap duration after the latest input.
An example:
~~~
Dataset<Row> events = ... // streaming DataFrame of schema { timestamp: Timestamp, userId: String }

// Group the data by session window and userId, and compute the count of each group
Dataset<Row> sessionizedCounts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        session_window(col("timestamp"), "5 minutes"),
        col("userId"))
    .count();
~~~
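For tumbling and sliding windows, the `window` function is used instead of `session_window`. A small sketch assuming the same `events` stream (the 10-minute/5-minute durations are illustrative):
~~~
// Count events per user in 10-minute windows that slide every 5 minutes;
// with equal window and slide durations this becomes a tumbling window.
Dataset<Row> windowedCounts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window(col("timestamp"), "10 minutes", "5 minutes"),
        col("userId"))
    .count();
~~~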
What determines the performance of structured streaming?
- Declarative language -> Catalyst optimizations
- Processing of DataFrames/Datasets is faster than processing of RDDs
When is DStream preferred over Structured Streaming?
When you have legacy code that already performs operations on RDDs and it would take a lot of time to make it compatible with DataFrames/Datasets.
Explain what BlinkDB does and which mechanism is important.
BlinkDB and Taster answer queries very fast by returning approximate results. Instead of running a query over the whole dataset, they run it over a cleverly chosen sample. You can provide either a target execution time OR an error bound and confidence interval for the query; these parameters influence each other.
These types of queries are called Interactive Queries, and can be useful for IoT applications.
How does Taster differ from BlinkDB?
Samples are dropped and loaded in an online fashion: samples that are no longer relevant are expelled from the query.
Briefly explain the following Spark DStream Streaming commands:
- transform
- transformToPair
- join
- The transform operation takes a function as an argument, which receives the RDD of the input DStream, applies the transformation, and returns a new RDD. The output of the transform operation is a new DStream generated from the transformed RDDs.
- transformToPair is a specialized version of the transform operation that is used when you need to apply an RDD-to-RDD function to a PairDStream (DStream of key-value pairs). The function passed to transformToPair should take an RDD of key-value pairs as input and return a new RDD of key-value pairs.
- join is a DStream operation that allows you to join two DStreams based on their keys. The result of the join operation is a new DStream containing key-value pairs where each key exists in both input DStreams, and the values are combined into a tuple.
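A small sketch of these three operations. The DStream and RDD names are hypothetical: `lines` is a DStream of strings, `clicks` and `views` are keyed DStreams of (userId, count) pairs, and `blacklistRDD` is a static pair RDD of (userId, flag) entries:
~~~
// transform: apply an arbitrary RDD-to-RDD function to every micro-batch,
// e.g. de-duplicate the records within each batch.
JavaDStream<String> deduped = lines.transform(rdd -> rdd.distinct());

// transformToPair: same idea, but the function returns an RDD of key-value pairs,
// e.g. join each micro-batch against a static blacklist RDD and drop matching users.
JavaPairDStream<String, Integer> cleaned = clicks.transformToPair(rdd ->
    rdd.leftOuterJoin(blacklistRDD)
       .filter(t -> !t._2()._2().isPresent())   // keep users that are not on the blacklist
       .mapToPair(t -> new Tuple2<String, Integer>(t._1(), t._2()._1())));

// join: combine two keyed DStreams batch-by-batch on their keys;
// the values for each key are paired up in a tuple.
JavaPairDStream<String, Tuple2<Integer, Integer>> joined = clicks.join(views);
~~~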