Execution models for distributed computing Flashcards
1
Q
Big data properties
A
- Volume: The quantity of generated and stored data
- Velocity: The speed at which the data is generated and processed
- Variety: The type and nature of the data set
- Variability: Inconsistency of the data set
- Veracity: The quality of captured data
2
Q
Types of data declustering
A
- Attribute-less partitioning
-> Random
-> Round-Robin
- Single-attribute schemes
-> Hash de-clustering
-> Range de-clustering
- Multiple-attribute schemes
-> MAGIC, BERD, etc.
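
A minimal Python sketch of the two single-attribute schemes; the node count and range boundaries are illustrative assumptions, not part of any specific system:

```python
NUM_NODES = 4

def hash_decluster(key: int) -> int:
    # Hash de-clustering: node = h(partitioning attribute) mod number of nodes.
    return hash(key) % NUM_NODES

def range_decluster(key: int, boundaries=(25, 50, 75)) -> int:
    # Range de-clustering: consecutive key ranges map to consecutive nodes.
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return len(boundaries)  # last node holds the top range

print([hash_decluster(k) for k in (3, 42, 77)])   # [3, 2, 1]
print([range_decluster(k) for k in (3, 42, 77)])  # [0, 1, 3]
```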
3
Q
Declustering tradeoffs between types
A
- When selectivity is low, Range spreads out the load and is ideal
- When selectivity increases, Range causes a high workload on one/few nodes, while Hash spreads out the load
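
A rough simulation of this tradeoff, under assumed parameters (4 nodes, uniform keys 0..99, identity hash): it counts how many qualifying tuples each node handles for a narrow vs. a wide range predicate.

```python
from collections import Counter

NUM_NODES = 4

def hash_node(key):   return key % NUM_NODES         # hash de-clustering
def range_node(key):  return key * NUM_NODES // 100  # range de-clustering over keys 0..99

def load(query_keys, placement):
    # Count how many qualifying tuples land on each node.
    return Counter(placement(k) for k in query_keys)

low_sel  = range(10, 15)   # 5% selectivity: narrow range predicate
high_sel = range(10, 60)   # 50% selectivity: wide range predicate

print("low sel,  range:", load(low_sel, range_node))   # one node only
print("low sel,  hash: ", load(low_sel, hash_node))    # spread out
print("high sel, range:", load(high_sel, range_node))  # hot spot on a few consecutive nodes
print("high sel, hash: ", load(high_sel, hash_node))   # roughly even spread
```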
4
Q
MapReduce
A
- A programming model and associated implementation for processing big data on clusters
- Everything is a <key, value> pair
- Three steps:
-> Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.
-> Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
-> Reduce: worker nodes now process each group of output data, per key, in parallel.
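
A minimal single-process Python sketch of the three steps, using word count as the classic example (no real cluster or distributed file system involved):

```python
from collections import defaultdict

docs = ["the quick fox", "the lazy dog", "the fox"]

# Map: emit <key, value> pairs from each input record.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]

mapped = [pair for doc in docs for pair in map_fn(doc)]

# Shuffle: group all values by key, as if redistributing them
# so that each key ends up on one worker node.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: process each key's group independently (hence in parallel).
def reduce_fn(key, values):
    return key, sum(values)

print(dict(reduce_fn(k, vs) for k, vs in groups.items()))
# {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```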
5
Q
Problems with MapReduce
A
- Performance: Extensive I/O
-> Everything is a file stored on hard disk; no distributed memory
- Programming model: Limited expressiveness
-> E.g. iterations, cyclic processes
-> Procedural code in map and reduce, difficult to optimize
6
Q
Resilient Distributed Datasets (RDD)
A
- Distributed, fault-tolerant collections of elements that can be operated on in parallel
- Core properties:
-> Immutable
-> Distributed
-> Lazily evaluated
-> Cacheable
-> Replicated on request
- Contains details about the data (data location or the data itself)
- Contains its history (lineage) to enable recreating a lost split of an RDD
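
A short sketch of these properties using Spark's Python API, assuming a local pyspark installation; the app name and partition count are arbitrary choices:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Distributed: the collection is split into partitions across workers.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Immutable + lazily evaluated: map/filter only record new RDDs
# derived from the old one; nothing is computed yet.
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# Cacheable: keep the computed partitions in memory for reuse.
squares.cache()

# An action forces evaluation.
print(squares.count())

# Lineage (history): how Spark would recreate a lost split.
print(squares.toDebugString().decode())

sc.stop()
```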