BigData - Spark Flashcards
What is Spark?
An "engine" for distributed data processing over a cluster.
What does it mean that RDDs are resilient and immutable?
Resilient: a lost partition can be recreated from its history of transformations (its lineage).
Immutable: once created, an RDD cannot be modified; transformations produce new RDDs.
RDD in Spark
Spark distributes the data in an RDD across the nodes of the cluster to achieve parallelization: the data is split into partitions, and the partitions are processed in parallel.
What is a partition?
A chunk of an RDD's data that is processed as one unit; different partitions are operated on in parallel.
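A minimal sketch of the idea in plain Python (not the Spark API; the helper names are illustrative): an "RDD" is modeled as a list of partitions, and each partition is handled by its own worker task.

```python
# Plain-Python sketch, NOT Spark: model an RDD as a list of partitions and
# process each partition as an independent parallel task.
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split data into roughly equal partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partitions(partitions, fn):
    """Apply fn to every element, one task per partition."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda p: [fn(x) for x in p], partitions))

parts = partition(list(range(10)), 3)       # [[0..3], [4..7], [8, 9]]
result = map_partitions(parts, lambda x: x * x)
```

Real Spark does the same split-then-parallel-apply, but the partitions live on different cluster nodes rather than in one process.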
Explain - Dependencies between RDDs: Narrow dependencies.
Narrow dependencies
Each partition of the parent RDD(s) is used by at most one partition of the child RDD.
(1) operations with narrow dependencies are "pipelined" locally on one cluster node;
(2) if one partition is lost, recomputation is also local: only the corresponding parent partition is needed.
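A plain-Python sketch (illustrative, not the Spark API) of why narrow dependencies make recovery cheap: each child partition comes from exactly one parent partition, so a lost one can be rebuilt in isolation.

```python
# Sketch of a narrow dependency: each child partition is computed from exactly
# one parent partition, so work is pipelined and recovery is local.
parent = [[1, 2], [3, 4], [5, 6]]           # parent "RDD": 3 partitions

def narrow_map(part):                        # e.g. map(x -> x * 10)
    return [x * 10 for x in part]

child = [narrow_map(p) for p in parent]      # computed per partition

# "Failure": child partition 1 is lost; one parent partition suffices to rebuild it.
recovered = narrow_map(parent[1])            # [30, 40]
```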
Explain - Dependencies between RDDs: Wide dependencies.
A parent partition has multiple child partitions depending on it.
(1) operations with wide dependencies require data from all parent partitions to be shuffled on the network (MapReduce-like);
(2) if one partition is lost, recomputing it may require all parent partitions, potentially cascading back through the whole lineage!
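A plain-Python sketch (illustrative names, not the Spark API) of a wide dependency: a groupByKey-style shuffle, where every child partition may receive records from every parent partition, so rebuilding any child partition means re-reading all parents.

```python
# Sketch of a wide dependency: a hash-partitioned, groupByKey-style shuffle.
# Data crosses partition boundaries, so no child partition can be rebuilt
# from a single parent partition.
parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]  # 2 parent partitions
NUM_CHILD = 2

def shuffle(parent_partitions, num_child):
    children = [dict() for _ in range(num_child)]
    for part in parent_partitions:           # reads EVERY parent partition
        for key, value in part:
            target = hash(key) % num_child   # hash-partition by key
            children[target].setdefault(key, []).append(value)
    return children

child = shuffle(parent, NUM_CHILD)
# All values for a given key land in the same child partition.
```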
Spark streaming: Batching
Streaming input data is batched (discretized):
At a small time interval (e.g., every 1 second);
The smaller this interval, the lower the latency;
Each batch is processed with Spark.
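A plain-Python sketch of discretization (illustrative, not the Spark Streaming API): timestamped records are grouped into fixed-interval micro-batches, each of which would then be processed as one RDD.

```python
# Sketch: discretize a stream of (timestamp, value) records into
# fixed-interval micro-batches, keyed by interval index.
def discretize(records, interval):
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return batches

stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
discretize(stream, 1.0)  # {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```

Shrinking `interval` yields smaller, more frequent batches, which is exactly the latency knob described above.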
D-Stream and RDD
Each D-Stream periodically generates an RDD, either from live input data or by transforming the RDDs generated by a parent D-Stream.
Windowed operations then group all records from the batches in the past sliding window into one RDD.
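A plain-Python sketch of a sliding window over micro-batches (illustrative, not the Spark Streaming API): each window result unions the records of the last few batches.

```python
# Sketch: for each batch position, combine the records of the last
# `window_len` batches into one windowed result.
def windowed(batches, window_len):
    """batches: list of record lists, oldest first."""
    windows = []
    for i in range(len(batches)):
        start = max(0, i - window_len + 1)
        windows.append([x for b in batches[start:i + 1] for x in b])
    return windows

batches = [[1], [2, 3], [4]]
windowed(batches, 2)  # [[1], [1, 2, 3], [2, 3, 4]]
```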
Spark Streaming: Guarantees
D-Streams (and RDDs) track their lineage (the graph of transformations)
- at the level of partitions within each RDD
- when a node fails, its RDD partitions are rebuilt on other machines from the original input data stored in the cluster.
D-Streams provide consistent, exactly-once processing.
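A plain-Python sketch of lineage-based recovery (hypothetical class and field names, not the Spark API): each derived "RDD" records its parent and the transformation used, so a lost partition can be recomputed from base input data instead of restored from a replica.

```python
# Sketch: an "RDD" that stores its lineage (parent + transformation) and can
# recompute any partition on demand from base data, mimicking fault recovery.
class SketchRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions   # materialized base data, or None
        self.parent = parent           # lineage: where this RDD came from
        self.fn = fn                   # lineage: how it was derived

    def compute(self, i):
        """(Re)compute partition i by walking the lineage."""
        if self.partitions is not None:
            return self.partitions[i]  # base input (stored in the cluster)
        return [self.fn(x) for x in self.parent.compute(i)]

base = SketchRDD(partitions=[[1, 2], [3, 4]])
mapped = SketchRDD(parent=base, fn=lambda x: x + 1)
# "Node failure": partition 0 of `mapped` was never stored; rebuild it.
mapped.compute(0)  # [2, 3]
```

Because recomputation replays deterministic transformations over the same input partition, each record contributes to the result exactly once, which is the basis of the exactly-once guarantee above.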