Lesson 12: Distributed Data Analytics Flashcards
What is the focus of distributed data processing frameworks? What are some examples?
To provide the programming and runtime systems for scalable data processing in distributed systems.
MapReduce, Spark
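A minimal word-count sketch in plain Python, only to illustrate the map/shuffle/reduce programming model these frameworks expose; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are illustrative, not any framework's API.

```python
from collections import defaultdict

# Map phase: turn each input record (a line of text) into (key, value) pairs
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce phase: combine all values for the same key into one result
def reduce_fn(word, counts):
    return (word, sum(counts))

def run_mapreduce(lines):
    # Shuffle: group intermediate pairs by key before reducing
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            groups[word].append(count)
    return [reduce_fn(word, counts) for word, counts in groups.items()]

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
```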
What are some strategies for processing data at scale?
Data Parallel (Divide & Conquer):
- Divide the data and assign partitions to nodes for processing (see the data-parallel sketch after this list)
- Assumption: the data can be partitioned with good load balancing across nodes
Pipelining:
- Divide the work into a sequence of smaller tasks; each node specializes in one or a few tasks and passes its output to the next stage
Model Parallelism:
- Divide the state of the application (the model) across nodes, so each node has less to process based on its subset of the state
- The input is passed to all nodes; each node processes its slice of the model, and the outputs from all nodes are combined
- Nodes may need to communicate, depending on dependencies between slices
What are some key design decisions in MapReduce?
Master data structures:
- for tracking progress
Locality:
- scheduling, placement of intermediate data
Task granularity:
- Finer granularity -> more flexibility (load balancing, faster recovery), but management operations take longer as the number of tasks grows
- Coarser granularity -> lower management overhead, but less flexibility
Fault tolerance:
- master: standby replication
- worker: detect failures or stragglers and re-execute (intermediate files a plus)
Semantics in the presence of failures:
- importance of consistency and complete results?
Backup tasks:
- failures and stragglers are inevitable => speculatively launch backup copies of in-progress tasks and use whichever copy finishes first
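A toy sketch of speculative backup tasks using Python threads: the same task is launched twice and whichever copy finishes first wins. The `task` function and its simulated straggler delay are made up for illustration.

```python
import concurrent.futures
import random
import time

def task(partition_id):
    # Simulate a worker that is occasionally a straggler
    time.sleep(random.choice([0.1, 5.0]))
    return f"result for partition {partition_id}"

def run_with_backup(partition_id):
    # Launch the primary and a speculative backup copy of the same task,
    # then take whichever copy finishes first
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(task, partition_id) for _ in range(2)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)  # don't block on the straggler copy
    return result

print(run_with_backup(0))
```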
MapReduce uses intermediate files to store results from each operation in the pipeline. What are the pros/cons of this?
+ guards against lost work in failure cases
- high I/O overhead -> serialization to/from persistent storage is costly
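A toy sketch of the trade-off, assuming two hypothetical pipeline stages: persisting the intermediate result means a later stage could restart from the file after a failure, but every stage pays serialization plus storage I/O.

```python
import os
import pickle
import tempfile

def stage1(records):
    return [r * 2 for r in records]

def stage2(records):
    return [r + 1 for r in records]

records = list(range(100_000))

# MapReduce-style: persist the intermediate result between stages
path = os.path.join(tempfile.gettempdir(), "stage1_output.pkl")
with open(path, "wb") as f:
    pickle.dump(stage1(records), f)   # serialization + disk write
with open(path, "rb") as f:
    intermediate = pickle.load(f)     # disk read + deserialization
result_disk = stage2(intermediate)

# In-memory pipeline: no I/O, but a crash loses the intermediate data
result_mem = stage2(stage1(records))

assert result_disk == result_mem
```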
What is the key idea in Spark? How does it address the main drawback of MapReduce’s intermediate file design?
Spark allows in-memory data sharing while still achieving fault tolerance.
- Fast DRAM instead of slow HDD
- Avoids serialization overhead
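A minimal PySpark sketch of in-memory sharing, assuming a local Spark installation; the input data is inlined with `parallelize` to keep it self-contained (a real job would read from distributed storage). The lineage recorded by the transformations is what preserves fault tolerance when cached partitions are lost.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-example")

# In a real job this would be sc.textFile("hdfs://..."); parallelize keeps the
# sketch self-contained
lines = sc.parallelize([
    "INFO\tstartup complete",
    "ERROR\tdisk full on node 3",
    "ERROR\ttimeout talking to node 7",
])

# Transformations only record lineage; nothing is computed yet
errors = lines.filter(lambda line: "ERROR" in line).map(lambda line: line.split("\t"))

errors.cache()  # keep the computed partitions in DRAM after the first action

# Both actions reuse the cached in-memory data instead of recomputing or re-reading;
# if a cached partition is lost, Spark recomputes it from the lineage above
print(errors.count())
print(errors.take(2))
```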
What are the pros/cons of Spark?
+ less data to persist on the critical execution path
+ input data can be read as little as once, so less storage I/O
+ more control over locality
- recovery time: lost in-memory data must be recomputed from its lineage, which can take longer than re-reading persisted intermediate files
Spark is great when you have a workload that needs high throughput (lots of writes) and those operations can be specified at coarse granularity (applied in bulk to the whole dataset rather than to individual records).
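A rough illustration of what "coarse granularity" means in practice, again assuming a local PySpark setup: a transformation is applied in bulk to every element, so Spark only has to log the transformation in the lineage, not per-record changes.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "coarse-grained-example")
prices = sc.parallelize([10.0, 42.5, 7.3, 99.9])

# Coarse-grained: one operation applied in bulk to the whole dataset.
# The lineage entry is just "map(normalize) over prices", which is cheap to log
# and enough to recompute any lost partition.
max_price = prices.max()
normalized = prices.map(lambda p: p / max_price)
print(normalized.collect())

# Fine-grained, per-record in-place updates (as in a key-value store) do not fit
# this model: RDDs are immutable, so an "update" is another bulk transformation.
```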