BigData - Spark Flashcards
What is Spark?
An "engine" for distributed data processing over a cluster.
What does it mean that RDDs are resilient and immutable?
Resilient: a lost partition can be recreated from its history of transformations (its lineage).
Immutable: once created, an RDD cannot be modified; transformations produce new RDDs.
RDD in Spark
Spark distributes the data in an RDD across the nodes of the cluster to achieve parallelization: the data is split into partitions, and the partitions are processed in parallel.
What is a partition?
A chunk of an RDD's data that is processed as one unit; different partitions are operated on in parallel.
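A minimal sketch of the idea in plain Python (not the Spark API; the helper names are illustrative): an "RDD" is modeled as a list of partitions, and each partition is handled by its own worker task.

```python
# Plain-Python sketch, NOT Spark: model an RDD as a list of partitions and
# process each partition as an independent parallel task.
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split data into roughly equal partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element, one task per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda p: [fn(x) for x in p], partitions))

parts = partition(list(range(10)), 3)       # [[0..3], [4..7], [8, 9]]
result = map_partitions(parts, lambda x: x * x)
```

Real Spark does the same split-then-parallel-apply, but the partitions live on different cluster nodes rather than in one process.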
Explain - Dependencies between RDDs: Narrow dependencies.
Narrow dependencies
Each partition of the parent RDD(s) is used by at most one partition of the child RDD.
(1) operations with narrow dependencies are "pipelined" locally on one cluster node;
(2) if one partition is lost, recomputation is also local: only the corresponding parent partition is needed.
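A plain-Python sketch (illustrative, not the Spark API) of why narrow dependencies make recovery cheap: each child partition comes from exactly one parent partition, so a lost one can be rebuilt in isolation.

```python
# Sketch of a narrow dependency: each child partition is computed from exactly
# one parent partition, so work is pipelined and recovery is local.
parent = [[1, 2], [3, 4], [5, 6]]           # parent "RDD": 3 partitions

def narrow_map(part):                        # e.g. map(x -> x * 10)
    return [x * 10 for x in part]

child = [narrow_map(p) for p in parent]      # computed per partition

# "Failure": child partition 1 is lost; one parent partition suffices to rebuild it.
recovered = narrow_map(parent[1])            # [30, 40]
```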
Explain - Dependencies between RDDs: Wide dependencies.
A parent partition has multiple child partitions depending on it.
(1) operations with wide dependencies require data from all parent partitions to be shuffled on the network (MapReduce-like);
(2) if one partition is lost, recomputing it may require all parent partitions, potentially cascading back through the whole lineage!
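A plain-Python sketch (illustrative names, not the Spark API) of a wide dependency: a groupByKey-style shuffle, where every child partition may receive records from every parent partition, so rebuilding any child partition means re-reading all parents.

```python
# Sketch of a wide dependency: a hash-partitioned, groupByKey-style shuffle.
# Data crosses partition boundaries, so no child partition can be rebuilt
# from a single parent partition.
parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]  # 2 parent partitions
NUM_CHILD = 2

def shuffle(parent_partitions, num_child):
    children = [dict() for _ in range(num_child)]
    for part in parent_partitions:           # reads EVERY parent partition
        for key, value in part:
            target = hash(key) % num_child   # hash-partition by key
            children[target].setdefault(key, []).append(value)
    return children

child = shuffle(parent, NUM_CHILD)
# All values for a given key land in the same child partition.
```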
Spark streaming: Batching
Streaming input data is batched (discretized):
At a small time interval (e.g., every 1 second);
The smaller this interval, the lower the latency;
Each batch is processed with Spark.
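A plain-Python sketch of discretization (illustrative, not the Spark Streaming API): timestamped records are grouped into fixed-interval micro-batches, each of which would then be processed as one RDD.

```python
# Sketch: discretize a stream of (timestamp, value) records into
# fixed-interval micro-batches, keyed by interval index.
def discretize(records, interval):
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return batches

stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
discretize(stream, 1.0)  # {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```

Shrinking `interval` yields smaller, more frequent batches, which is exactly the latency knob described above.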
D-Stream and RDD
Each D-Stream periodically generates an RDD, either from live input data or by transforming the RDDs generated by a parent D-Stream.
Windowed operations then group all records from the batches in the past sliding window into one RDD.
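A plain-Python sketch of a sliding window over micro-batches (illustrative, not the Spark Streaming API): each window result unions the records of the last few batches.

```python
# Sketch: for each batch position, combine the records of the last
# `window_len` batches into one windowed result.
def windowed(batches, window_len):
    """batches: list of record lists, oldest first."""
    windows = []
    for i in range(len(batches)):
        start = max(0, i - window_len + 1)
        windows.append([x for b in batches[start:i + 1] for x in b])
    return windows

batches = [[1], [2, 3], [4]]
windowed(batches, 2)  # [[1], [1, 2, 3], [2, 3, 4]]
```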
Spark Streaming: Guarantees
D-Streams (and RDDs) track their lineage (the graph of transformations)
- at the level of partitions within each RDD
- when a node fails, its RDD partitions are rebuilt on other machines from the original input data stored in the cluster.
D-Streams provide consistent, exactly-once processing.
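A plain-Python sketch of lineage-based recovery (hypothetical class and field names, not the Spark API): each derived "RDD" records its parent and the transformation used, so a lost partition can be recomputed from base input data instead of restored from a replica.

```python
# Sketch: an "RDD" that stores its lineage (parent + transformation) and can
# recompute any partition on demand from base data, mimicking fault recovery.
class SketchRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions   # materialized base data, or None
        self.parent = parent           # lineage: where this RDD came from
        self.fn = fn                   # lineage: how it was derived

    def compute(self, i):
        """(Re)compute partition i by walking the lineage."""
        if self.partitions is not None:
            return self.partitions[i]  # base input (stored in the cluster)
        return [self.fn(x) for x in self.parent.compute(i)]

base = SketchRDD(partitions=[[1, 2], [3, 4]])
mapped = SketchRDD(parent=base, fn=lambda x: x + 1)
# "Node failure": partition 0 of `mapped` was never stored; rebuild it.
mapped.compute(0)  # [2, 3]
```

Because recomputation replays deterministic transformations over the same input partition, each record contributes to the result exactly once, which is the basis of the exactly-once guarantee above.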