Execution models for distributed computing Flashcards
1
Q
Big data properties
A
- Volume: The quantity of generated and stored data
- Velocity: The speed at which the data is generated and processed
- Variety: The type and nature of the data set
- Variability: Inconsistency of the data set
- Veracity: The quality of captured data
2
Q
Types of data declustering
A
- Attribute-less partitioning
-> Random
-> Round-Robin
- Single-attribute schemes
-> Hash de-clustering
-> Range de-clustering
- Multiple-attribute schemes
-> MAGIC, BERD, etc.
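
A minimal Python sketch of the two single-attribute schemes; the node count and range boundaries are illustrative assumptions, not part of any specific system:

```python
NUM_NODES = 4

def hash_decluster(key: int) -> int:
    # Hash de-clustering: node = h(partitioning attribute) mod number of nodes.
    return hash(key) % NUM_NODES

def range_decluster(key: int, boundaries=(25, 50, 75)) -> int:
    # Range de-clustering: consecutive key ranges map to consecutive nodes.
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return len(boundaries)  # last node holds the top range

print([hash_decluster(k) for k in (3, 42, 77)])   # [3, 2, 1]
print([range_decluster(k) for k in (3, 42, 77)])  # [0, 1, 3]
```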
3
Q
Declustering tradeoffs between types
A
- When selectivity is low, Range spreads out the load and is ideal
- When selectivity increases, Range causes a high workload on one/few nodes, while Hash spreads out the load
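
A rough simulation of this tradeoff, under assumed parameters (4 nodes, uniform keys 0..99, identity hash): it counts how many qualifying tuples each node handles for a narrow vs. a wide range predicate.

```python
from collections import Counter

NUM_NODES = 4

def hash_node(key):   return key % NUM_NODES         # hash de-clustering
def range_node(key):  return key * NUM_NODES // 100  # range de-clustering over keys 0..99

def load(query_keys, placement):
    # Count how many qualifying tuples land on each node.
    return Counter(placement(k) for k in query_keys)

low_sel  = range(10, 15)   # 5% selectivity: narrow range predicate
high_sel = range(10, 60)   # 50% selectivity: wide range predicate

print("low sel,  range:", load(low_sel, range_node))   # one node only
print("low sel,  hash: ", load(low_sel, hash_node))    # spread out
print("high sel, range:", load(high_sel, range_node))  # hot spot on a few consecutive nodes
print("high sel, hash: ", load(high_sel, hash_node))   # roughly even spread
```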
4
Q
MapReduce
A
- A programming model and associated implementation for processing big data on clusters
- Everything is a <key, value> pair
- Three steps:
-> Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.
-> Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
-> Reduce: worker nodes now process each group of output data, per key, in parallel.
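
A minimal single-process Python sketch of the three steps, using word count as the classic example (no real cluster or distributed file system involved):

```python
from collections import defaultdict

docs = ["the quick fox", "the lazy dog", "the fox"]

# Map: emit <key, value> pairs from each input record.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]

mapped = [pair for doc in docs for pair in map_fn(doc)]

# Shuffle: group all values by key, as if redistributing them
# so that each key ends up on one worker node.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: process each key's group independently (hence in parallel).
def reduce_fn(key, values):
    return key, sum(values)

print(dict(reduce_fn(k, vs) for k, vs in groups.items()))
# {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```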
5
Q
Problems with MapReduce
A
- Performance: Extensive I/O
-> Everything is a file stored on hard disk; no distributed memory
- Programming model: Limited expressiveness
-> E.g. iterations, cyclic processes
-> Procedural code in map and reduce, difficult to optimize
6
Q
Resilient Distributed Datasets (RDD)
A
- Distributed, fault-tolerant collections of elements that can be operated on in parallel
- Core properties:
-> Immutable
-> Distributed
-> Lazily evaluated
-> Cacheable
-> Replicated on request
- Contains details about the data (data location or the data itself)
- Contains its history (lineage) to enable recreating a lost split of an RDD
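
A short sketch of these properties using Spark's Python API, assuming a local pyspark installation; the app name and partition count are arbitrary choices:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Distributed: the collection is split into partitions across workers.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Immutable + lazily evaluated: map/filter only record new RDDs
# derived from the old one; nothing is computed yet.
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# Cacheable: keep the computed partitions in memory for reuse.
squares.cache()

# An action forces evaluation.
print(squares.count())

# Lineage (history): how Spark would recreate a lost split.
print(squares.toDebugString().decode())

sc.stop()
```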