Lesson 12: Distributed Data Analytics Flashcards
What is the focus of distributed data processing frameworks? What are some examples?
To provide the programming and runtime systems for scalable data processing in distributed systems.
MapReduce, Spark
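A minimal word-count sketch in plain Python, only to illustrate the map/shuffle/reduce programming model these frameworks expose; the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) are illustrative, not any framework's API.

```python
from collections import defaultdict

# Map phase: turn each input record (a line of text) into (key, value) pairs
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce phase: combine all values for the same key into one result
def reduce_fn(word, counts):
    return (word, sum(counts))

def run_mapreduce(lines):
    # Shuffle: group intermediate pairs by key before reducing
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            groups[word].append(count)
    return [reduce_fn(word, counts) for word, counts in groups.items()]

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
```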
What are some strategies for processing data at scale?
Data Parallel (Divide & Conquer):
- Divide the data and assign partitions to nodes for processing (see the data-parallel sketch after this list)
- Assumption: the data can be partitioned with good load balancing across nodes
Pipelining:
- Divide the work into a sequence of smaller tasks; each node specializes in one or a few tasks and passes its output to the next stage
Model Parallelism:
- Divide the state of the application (the model) across nodes, so each node has less to process based on its subset of the state
- The input is passed to all nodes; each node processes its slice of the model, and the outputs from all nodes are combined
- Nodes may need to communicate, depending on dependencies between slices
What are some key design decisions in MapReduce?
Master data structures:
- for tracking progress
Locality:
- scheduling, placement of intermediate data
Task granularity:
- Finer granularity -> more flexibility (load balancing, faster recovery), but management operations take longer as the number of tasks grows
- Coarser granularity -> lower management overhead, but less flexibility
Fault tolerance:
- master: standby replication
- worker: detect failures or stragglers and re-execute (intermediate files a plus)
Semantics in the presence of failures:
- importance of consistency and complete results?
Backup tasks:
- failures and stragglers are inevitable => speculatively launch backup copies of in-progress tasks and use whichever copy finishes first
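A toy sketch of speculative backup tasks using Python threads: the same task is launched twice and whichever copy finishes first wins. The `task` function and its simulated straggler delay are made up for illustration.

```python
import concurrent.futures
import random
import time

def task(partition_id):
    # Simulate a worker that is occasionally a straggler
    time.sleep(random.choice([0.1, 5.0]))
    return f"result for partition {partition_id}"

def run_with_backup(partition_id):
    # Launch the primary and a speculative backup copy of the same task,
    # then take whichever copy finishes first
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(task, partition_id) for _ in range(2)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)  # don't block on the straggler copy
    return result

print(run_with_backup(0))
```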
MapReduce uses intermediate files to store results from each operation in the pipeline. What are the pros/cons of this?
+ guards against lost work in failure cases
- high I/O overhead -> serialization to/from persistent storage is costly
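A toy sketch of the trade-off, assuming two hypothetical pipeline stages: persisting the intermediate result means a later stage could restart from the file after a failure, but every stage pays serialization plus storage I/O.

```python
import os
import pickle
import tempfile

def stage1(records):
    return [r * 2 for r in records]

def stage2(records):
    return [r + 1 for r in records]

records = list(range(100_000))

# MapReduce-style: persist the intermediate result between stages
path = os.path.join(tempfile.gettempdir(), "stage1_output.pkl")
with open(path, "wb") as f:
    pickle.dump(stage1(records), f)   # serialization + disk write
with open(path, "rb") as f:
    intermediate = pickle.load(f)     # disk read + deserialization
result_disk = stage2(intermediate)

# In-memory pipeline: no I/O, but a crash loses the intermediate data
result_mem = stage2(stage1(records))

assert result_disk == result_mem
```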
What is the key idea in Spark? How does it address the main drawback of MapReduce’s intermediate file design?
Spark allows in-memory data sharing while still achieving fault tolerance.
- Fast DRAM instead of slow HDD
- Avoids serialization overhead
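A minimal PySpark sketch of in-memory sharing, assuming a local Spark installation; the input data is inlined with `parallelize` to keep it self-contained (a real job would read from distributed storage). The lineage recorded by the transformations is what preserves fault tolerance when cached partitions are lost.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-example")

# In a real job this would be sc.textFile("hdfs://..."); parallelize keeps the
# sketch self-contained
lines = sc.parallelize([
    "INFO\tstartup complete",
    "ERROR\tdisk full on node 3",
    "ERROR\ttimeout talking to node 7",
])

# Transformations only record lineage; nothing is computed yet
errors = lines.filter(lambda line: "ERROR" in line).map(lambda line: line.split("\t"))

errors.cache()  # keep the computed partitions in DRAM after the first action

# Both actions reuse the cached in-memory data instead of recomputing or re-reading;
# if a cached partition is lost, Spark recomputes it from the lineage above
print(errors.count())
print(errors.take(2))
```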
What are the pros/cons of Spark?
+ less data to persist on the critical execution path
+ input data can be read as little as once, so less storage I/O
+ more control over locality
- recovery time: lost in-memory data must be recomputed from its lineage, which can take longer than re-reading persisted intermediate files
Spark is great when you have a workload that needs high throughput (lots of writes) and those operations can be specified at coarse granularity (applied in bulk to the whole dataset rather than to individual records).
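A rough illustration of what "coarse granularity" means in practice, again assuming a local PySpark setup: a transformation is applied in bulk to every element, so Spark only has to log the transformation in the lineage, not per-record changes.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "coarse-grained-example")
prices = sc.parallelize([10.0, 42.5, 7.3, 99.9])

# Coarse-grained: one operation applied in bulk to the whole dataset.
# The lineage entry is just "map(normalize) over prices", which is cheap to log
# and enough to recompute any lost partition.
max_price = prices.max()
normalized = prices.map(lambda p: p / max_price)
print(normalized.collect())

# Fine-grained, per-record in-place updates (as in a key-value store) do not fit
# this model: RDDs are immutable, so an "update" is another bulk transformation.
```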