Map-Reduce Flashcards
Objective
Sequentially read a lot of data
Phases
Map phase
Group by key
Reduce phase
Write the result
Map
read the input and produce key,value pairs
For each work output its count
Sort & Shuffle
performed by the system
Reduce
collect values with the same key and produce
What is programmer responsible for
- Map function
- Reduce function
What is the MapReduce system responsible for
- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the sort by key & shuffle step
- Handling machine failures
- Managing required inter-machine communication
Master node
Master node coordinates the
execution:
Task status: (idle, in-progress,
completed)
Idle tasks get scheduled as
workers become available
When a map task completes, it
sends the master the location and
sizes of its intermediate files, one
for each reducer
Master pushes this info to
reducers
Master pings workers periodically
to detect failures
Worker node
Worker node performs
map or reduce tasks, as
requested by the
coordinator.
Map worker failure
Upon detection of the failure
of a worker, map tasks
restarted in different worker
Reduce worker failure
Reduce task is restarted in
other worker
Stragglers (slow workers)
If a task is taking too long to
complete, it is launched in
other worker. First result used.
Master failure
MapReduce task is aborted
and client is notified