Spark Flashcards
What are Spark RDDs? Why was Spark so revolutionary?
RDD stands for Resilient Distributed Dataset. RDDs made distributed datasets look like ordinary collections.
RDDs reside on disk but can be cached in memory, are transparently distributed across the cluster, and support the familiar functional programming primitives (map, filter, reduce, etc.).
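A minimal sketch (assuming an existing SparkContext `sc`; the input path is hypothetical) of treating an RDD like an ordinary collection:

```scala
// Assumes a live SparkContext `sc`; the file path is made up for illustration.
val lines  = sc.textFile("hdfs:///data/app.log")   // RDD[String], read lazily from disk
val errors = lines.filter(_.contains("ERROR"))     // familiar FP primitive, still lazy
errors.cache()                                     // opt in to in-memory caching
println(errors.count())                            // action: triggers the distributed job
```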
What is the difference between RDDs and Pair RDDs? Why do we need both?
Plain RDDs hold arbitrary elements and are handled as iterables, whereas Pair RDDs hold key-value tuples and can be handled as both iterable and indexable by key.
Pair RDDs allow for more functionality (e.g. join).
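A small sketch (assuming `sc`; the data is invented) contrasting a plain RDD with Pair RDDs and a key-based join:

```scala
// Plain RDD: arbitrary elements, iterable only
val words = sc.parallelize(Seq("spark", "rdd", "spark"))

// Pair RDDs (RDD[(K, V)]) unlock key-based operations such as join
val counts  = sc.parallelize(Seq(("spark", 2), ("rdd", 1)))
val origins = sc.parallelize(Seq(("spark", "UC Berkeley"), ("rdd", "NSDI '12")))

val joined = counts.join(origins)   // RDD[(String, (Int, String))]
joined.collect().foreach(println)
```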
What are the key Spark API calls?
There are two types of operations in Spark: transformations and actions.
Common transformations are map, flatMap, and filter; Pair RDDs additionally offer reduceByKey, aggregateByKey, and join. Transformations are lazy: they only extend the lineage graph.
Common actions are collect, take, reduce, fold, and aggregate. Actions trigger the actual computation and return a result to the driver.
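A sketch (assuming `sc`) showing that transformations stay lazy until an action forces evaluation:

```scala
val nums = sc.parallelize(1 to 10)

// Transformations: nothing is computed yet
val evens   = nums.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions: computation runs, results return to the driver
val total  = doubled.reduce(_ + _)   // 60
val first3 = doubled.take(3)         // Array(4, 8, 12)
```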
What are wide and narrow dependencies?
A wide dependency means that multiple partitions of the target RDD depend on a single partition of the source RDD (i.e., one source partition can feed multiple target partitions), which requires a shuffle.
A narrow dependency means that each partition of the source RDD is used by at most one partition of the target RDD (think of it as a function mapping each source partition to at most one target partition).
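One way to make the distinction visible (assuming `sc`): mapValues creates a narrow dependency, reduceByKey a wide one, and the lineage shows the resulting shuffle boundary:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val mapped  = pairs.mapValues(_ * 10)    // narrow: no data movement between partitions
val reduced = mapped.reduceByKey(_ + _)  // wide: values for a key must be co-located

println(reduced.toDebugString)           // prints the lineage, including the shuffle
```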
How does Spark deal with faults?
Spark uses RDD lineage information to know which partition(s) to recompute in case of a node failure.
Spark also uses checkpointing to save RDD state reliably. This effectively truncates the RDD lineage graph.
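A minimal checkpointing sketch (assuming `sc`; the directory path is hypothetical):

```scala
// The checkpoint directory should live on reliable storage (e.g. HDFS)
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // path is made up

val derived = sc.parallelize(1 to 1000000).map(_ * 2)
derived.checkpoint()   // marks the RDD for reliable materialization
derived.count()        // the checkpoint is written when the first action runs
```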
What types of partitioning can we employ for distributed systems like Spark?
There are three types of partitioning:
Default partitioning: Split the data into equally sized partitions, without looking at the data's properties.
Range partitioning: Look at the natural order of keys to split the dataset into the required number of partitions. This requires naturally ordered keys that are evenly distributed across the value range.
Hash partitioning: Calculate the hash of each item's key, then take it modulo the number of partitions to determine the item's partition.
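A sketch of the latter two schemes (assuming `sc`), using Spark's built-in partitioners:

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

val kv = sc.parallelize(Seq((10, "x"), (3, "y"), (7, "z")))

// Hash partitioning: partition = hash(key) mod numPartitions
val hashed = kv.partitionBy(new HashPartitioner(4))

// Range partitioning: samples the keys to derive ordered key ranges
val ranged = kv.partitionBy(new RangePartitioner(4, kv))
```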
How does Catalyst optimize queries?
By using range conditions to restrict data volumes, which also allows for parallel querying.
By applying relational algebra optimizations.
By performing recursive tree substitutions on the query plan until it no longer changes.
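An illustrative sketch (assuming a SparkSession `spark`; the column names are invented): explain(true) prints the logical and physical plans that Catalyst produced:

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

// Catalyst rewrites the logical plan (e.g. pruning columns, pushing
// the filter down) before producing the physical plan shown by explain()
df.filter($"id" > 1).select($"label").explain(true)
```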
How does Spark schedule jobs on a cluster?
A job is initiated when an action is called on an RDD. The dependency graph is evaluated backwards from that RDD and a graph of stages is built, with stage boundaries at wide (shuffle) dependencies. Each stage consists of multiple tasks (one per partition), which are scheduled in parallel on cluster nodes.
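For example (assuming `sc`), this word count runs as a single job with two stages, split at the shuffle introduced by reduceByKey:

```scala
val text = sc.parallelize(Seq("a b", "b c"))

text.flatMap(_.split(" "))   // stage 1: narrow transformations...
    .map(w => (w, 1))        // ...pipelined into the same tasks
    .reduceByKey(_ + _)      // shuffle: stage boundary
    .collect()               // the action that triggers the job
```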
Which RDD API call is a ‘performance killer’ in Spark?
groupByKey, because it shuffles (rearranges) ALL the data over the network.
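A sketch of the usual remedy (assuming `sc`): reduceByKey computes partial sums map-side before shuffling, moving far less data:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey ships every (key, value) record across the network
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values locally first, then shuffles the partial sums
val fast = pairs.reduceByKey(_ + _)
```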
What methods shuffle the data?
groupByKey, reduceByKey, combineByKey, and join.