Execution models for distributed computing Flashcards

1
Q

Big data properties

A
  • Volume: The quantity of generated and stored data
  • Velocity: The speed at which the data is generated and processed
  • Variety: The type and nature of the data set
  • Variability: Inconsistency of the data set
  • Veracity: The quality of captured data
2
Q

Types of data declustering

A
  • Attribute-less partitioning
    -> Random
    -> Round-robin
  • Single-attribute schemes
    -> Hash declustering
    -> Range declustering
  • Multiple-attribute schemes possible
    -> MAGIC, BERD, etc.
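
  A minimal Python sketch (our own; the node count, attribute names, and range
  boundaries are hypothetical) of how each scheme maps a tuple to a node:

    import random

    NUM_NODES = 4  # hypothetical cluster size

    def random_partition(tup):
        # Attribute-less: ignore the tuple, pick any node at random
        return random.randrange(NUM_NODES)

    def round_robin_partition(i):
        # Attribute-less: the i-th tuple goes to node i mod N
        return i % NUM_NODES

    def hash_partition(tup, attr):
        # Single attribute: hash the partitioning attribute
        return hash(tup[attr]) % NUM_NODES

    def range_partition(tup, attr, boundaries):
        # Single attribute: compare the attribute against sorted range
        # boundaries, e.g. [100, 200, 300] defines 4 ranges / nodes
        for node, upper in enumerate(boundaries):
            if tup[attr] < upper:
                return node
        return len(boundaries)

    emp = {"name": "alice", "salary": 250}
    print(random_partition(emp),
          round_robin_partition(0),
          hash_partition(emp, "salary"),
          range_partition(emp, "salary", [100, 200, 300]))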
3
Q

Declustering tradeoffs between types

A
  • When selectivity is low, range declustering is ideal: each query touches only one or a few nodes, and different queries spread the load across the nodes
  • When selectivity increases, range declustering causes a high workload on one/few nodes, while hash declustering spreads out the load
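
  A small Python illustration (hypothetical keys, queries, and node count)
  counting how many tuples each node scans for a range query under range vs.
  hash declustering:

    from collections import Counter

    NUM_NODES = 4
    KEYS = range(1000)  # hypothetical attribute values 0..999

    def range_node(k):
        return k * NUM_NODES // 1000   # contiguous key ranges per node

    def hash_node(k):
        return hash(k) % NUM_NODES     # hashed placement

    def load(place, lo, hi):
        # Tuples each node scans for the range query lo <= key <= hi
        return Counter(place(k) for k in KEYS if lo <= k <= hi)

    # Low selectivity: range declustering touches a single node per query, so
    # concurrent queries on different ranges spread the load across nodes
    print(load(range_node, 100, 120), load(hash_node, 100, 120))

    # High selectivity: range declustering piles the work onto a few nodes,
    # while hash declustering spreads it roughly evenly
    print(load(range_node, 0, 700), load(hash_node, 0, 700))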
4
Q

MapReduce

A
  • A programming model and associated implementation for processing big data on clusters
  • Everything is a <key, value> pair
  • Three steps:
    -> Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.
    -> Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
    -> Reduce: worker nodes now process each group of output data, per key, in parallel.
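
  A single-machine Python sketch (our own, not Hadoop code) of the three steps,
  using the classic word-count example:

    from collections import defaultdict

    def map_fn(_, line):
        # Map: emit <word, 1> for every word in the input line
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Reduce: aggregate all values that share the same key
        yield word, sum(counts)

    def mapreduce(records, map_fn, reduce_fn):
        # Shuffle: group map output by key so that all values for one key
        # end up together (on a real cluster: on the same worker node)
        groups = defaultdict(list)
        for key, value in records:
            for out_key, out_value in map_fn(key, value):
                groups[out_key].append(out_value)
        # Reduce each key group independently (in parallel on a real cluster)
        results = []
        for key, values in groups.items():
            results.extend(reduce_fn(key, values))
        return results

    print(mapreduce(enumerate(["to be or not to be"]), map_fn, reduce_fn))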
5
Q

Problems with MapReduce

A
  • Performance: Extensive I/O
    -> Everything is a file stored on hard disk; no distributed memory
  • Programming model: Limited expressiveness
    -> E.g. iterations, cyclic processes
    -> Procedural code in map and reduce, difficult to optimize
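
  A toy Python driver (our own sketch with a placeholder update step) showing
  why iteration is awkward: every iteration becomes a separate job whose input
  and output live on disk, with no distributed memory to keep state in:

    import json, os, tempfile

    def run_job(input_path, output_path):
        # Stand-in for one full MapReduce job: read input, process, write output
        with open(input_path) as f:
            state = json.load(f)
        state = [x * 0.5 + 1 for x in state]   # placeholder per-iteration update
        with open(output_path, "w") as f:
            json.dump(state, f)

    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "state_0.json")
    with open(path, "w") as f:
        json.dump([1.0, 2.0, 3.0], f)

    for i in range(10):                         # e.g. 10 iterations
        next_path = os.path.join(workdir, f"state_{i + 1}.json")
        run_job(path, next_path)                # full disk round-trip per step
        path = next_path                        # no in-memory reuse between jobs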
6
Q

Resilient Distributed Datasets (RDD)

A
  • Distributed, fault-tolerant collections of elements that can be operated on in parallel
  • Core properties:
    -> Immutable
    -> Distributed
    -> Lazily evaluated
    -> Cacheable
    -> Replicated on request
  • Contains details about the data (data location or data itself)
  • Contains history to enable recreating a lost split of an RDD
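
  A minimal PySpark sketch (assumes a local pyspark installation; the names are
  ours) showing lazy transformations, lineage, and caching on an RDD:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    numbers = sc.parallelize(range(1_000_000), 8)      # distributed collection
    evens = numbers.filter(lambda x: x % 2 == 0)       # new RDD, parent unchanged
    squares = evens.map(lambda x: x * x).cache()       # still nothing computed

    # Actions trigger evaluation; the lineage (parallelize -> filter -> map)
    # lets Spark recreate a lost split instead of relying on replication
    print(squares.count())
    print(squares.take(5))

    sc.stop()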