Chapter 5: Data-Intensive Applications Flashcards
What are the challenges of large-scale data processing?
Writing efficient parallel and distributed processing jobs is hard
■ Extremely high performance/throughput possible
■ Problem: Highly parallel environment
■ Developers don't want to deal with concurrency issues or fault tolerance
● Needed: Suitable abstraction layer for developers
What are the three requirements for an abstraction layer for data-intensive applications?
- Developers don't have to think about parallelization
♦ Can continue to write sequential code
♦ Code is independent of degree of parallelism at runtime
- Developers don't have to think about fault tolerance
♦ Abstraction layer takes care of failed nodes
♦ Re-executes lost parts of computations if necessary
- Developers don't have to think about load balancing
♦ Abstraction layer is in charge of distributing the work evenly across the available compute nodes
What is Map and Reduce?
MapReduce is a programming model.
Map and reduce are second-order functions that take first-order functions provided by the developer as input. The model operates on key-value (KV) pairs: data is passed as KV pairs in all phases.
What are the Signature and Guarantees of the Map Function?
Signature: (k1, v1) → list(k2, v2)
● Guarantees to the first-order function
■ First-order function is invoked once for each KV pair
■ Can produce [0,*] KV pairs as output (zero or more)
● Useful for projections, selection, …
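The map signature above can be sketched as a plain Python function for word count (the name `map_fn` and the line-offset key are illustrative assumptions, not a specific framework's API):

```python
def map_fn(k1, v1):
    """First-order map function for word count.

    k1: input key (e.g. line offset, ignored here)
    v1: input value (one line of text)
    Returns a list of (k2, v2) pairs: one (word, 1) per word.
    """
    return [(word, 1) for word in v1.split()]

print(map_fn(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

Note that the function may emit zero pairs (empty line) or many pairs per invocation, matching the [0,*] guarantee.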
What are the Signature and Guarantees of the Reduce Function?
Signature: (k2, list(v2)) → list(k3, v3)
● Guarantees to the first-order function
■ All KV-pairs with the same key are presented to the same invocation of the first-order function
● Useful for aggregation, grouping, …
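A matching reduce first-order function for word count might look like this (again an illustrative sketch, not a framework API): it receives one key together with all values grouped under that key, per the guarantee above.

```python
def reduce_fn(k2, values):
    """First-order reduce function for word count.

    k2: the key (a word)
    values: list of all v2 values emitted for this key by the map phase
    Returns a list of (k3, v3) pairs: here the word and its total count.
    """
    return [(k2, sum(values))]

print(reduce_fn("be", [1, 1]))
# [('be', 2)]
```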
Name the five steps of Map and Reduce.
- Input Data
- Map Phase
- Shuffle Phase - Group intermediate results of map phase by key
- Reduce phase
- Output Data
Explain how MapReduce works. (Example Word Count)
- Input: KV pairs (e.g. line number, line text) are transferred as input to the map function
- Map Phase: map function is executed on each KV pair → intermediate (word, 1) results for each word per line
- Shuffle Phase: groups the results from the map phase by key (word)
- Reduce Phase: executes the reduce function (sum of counts) on the grouped KV pairs
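The four phases above can be run end-to-end in a minimal, single-process sketch (function names such as `run_mapreduce` are assumptions for illustration; a real framework would execute mappers and reducers in parallel on many nodes):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(k1, v1):
    # Map: emit (word, 1) for each word in the line
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Reduce: sum all counts emitted for one word
    return [(k2, sum(values))]

def run_mapreduce(input_pairs, map_fn, reduce_fn):
    # Map phase: invoke map_fn once per input KV pair
    intermediate = [kv for k, v in input_pairs for kv in map_fn(k, v)]
    # Shuffle phase: group intermediate results by key
    intermediate.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in group])
               for k, group in groupby(intermediate, key=itemgetter(0))]
    # Reduce phase: invoke reduce_fn once per distinct key
    return [kv for k, vs in grouped for kv in reduce_fn(k, vs)]

lines = [(0, "the quick fox"), (1, "the fox")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# [('fox', 2), ('quick', 1), ('the', 2)]
```

The sort-then-group step plays the role of the shuffle phase: it guarantees that all pairs with the same key reach the same reduce invocation.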
Explain the Map function?
■ First-order function provided by user
■ Specifies what happens to the data in the job's map phase
Explain the Mapper?
■ A process running on a worker node
■ Invokes map function for each KV pair
Explain the Reduce function?
■ First-order function provided by user
■ Specifies what happens to the data in the job's reduce phase
Explain the Reducer?
■ Process invoking reduce function on grouped data
How does the Distributed Execution of MapReduce work?
- Client partitions input file into input splits
- Client submits job to master
- A mapper is started for each input split
- Reducers pull data from mappers over network
What are the Limitations of MapReduce?
- Assumes finite input (files only)
♦ Limitation of finite input prevents stream processing
- Data between MR jobs must go to the Google File System (GFS)
♦ Constraint to write to GFS is especially detrimental for iterative algorithms
What are three possible MapReduce fault scenarios?
- Mapper fails
- Reducer fails
- Entire worker node fails
How can a Mapper fault be corrected?
■ Master detects failure through missing status report
■ Mapper is restarted on a different node and re-reads its input data from GFS