MapReduce Flashcards
Which technology is MapReduce a part of?
Hadoop. Hadoop consists of HDFS and MapReduce.
What is the data format of the input in MapReduce?
(key, value) pairs of arbitrary serializable types; each pair should fit in memory.
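A minimal illustration of such input, assuming (purely as an example) that the keys are document names and the values are their contents:

```python
# Hypothetical MapReduce input: a small list of (key, value) pairs,
# here assumed to be (document name, document text). Any serializable
# types would do, as long as each pair fits in memory.
input_pairs = [
    ("doc1.txt", "w1 w2 w3"),
    ("doc2.txt", "w2 w3 w3 w3"),
]
```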
What strategy should be employed when cluster components fail during computation in MapReduce?
Split the computation into many small tasks. If a task fails to deliver its result, restart just that task.
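A minimal sketch of this restart-on-failure idea; the helper name and the retry cap are assumptions for illustration, not part of the actual Hadoop scheduler:

```python
# Run a small, self-contained task and, if it fails, simply run it again
# (capped here at an assumed retry limit).
def run_with_retries(task, max_retries=3):
    for _ in range(max_retries):
        try:
            return task()      # a small unit of work
        except Exception:
            continue           # the task failed: restart it from scratch
    raise RuntimeError("task failed after all retries")
```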
Where does the data come from in MapReduce?
The (H)DFS
What are the 4 steps in the Map task?
- Read the input (key, value) pairs from the DFS
- One Map task per chunk of input (the task is scheduled on or near the machine where that chunk is stored)
- Compute any number of intermediate (key, value) pairs, as defined by your Map function (see the sketch after this list)
- Write the output to a buffer region on the local (!) disk
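A minimal sketch of a user-defined Map function, assuming a word-count job where each input value is a line of text (the function name and the word-count scenario are illustrative assumptions):

```python
# Sketch of a user-defined Map function for word counting.
# Input: one (key, value) pair, e.g. (line number, line of text).
# Output: a list of intermediate (key, value) pairs, here (word, 1).
def map_fn(key, value):
    return [(word, 1) for word in value.split()]

# map_fn(0, "w1 w2 w3 w2") -> [("w1", 1), ("w2", 1), ("w3", 1), ("w2", 1)]
```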
What is the main operation of the Shuffle (Master controller) task?
It keeps track of the (key, value) pairs output by all Map tasks, then performs a distributed group-by-key, producing each key together with the list of all values associated with it.
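A minimal, single-machine sketch of what this group-by-key conceptually does (in a real cluster the grouping is distributed, not a single dictionary):

```python
from collections import defaultdict

# Conceptual shuffle: collect the pairs emitted by all Map tasks and
# group the values by key, yielding (key, [values]) pairs for Reduce.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return list(groups.items())

# shuffle([("w2", 1), ("w3", 1), ("w2", 1)]) -> [("w2", [1, 1]), ("w3", [1])]
```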
What 3 qualities define the Reduce task?
- One Reduce call works on one key at a time
- Computes a combined value for each key, as defined by your Reduce function (see the sketch after this list)
- The output is saved to (H)DFS files (one output file per Reduce task)
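A minimal sketch of a user-defined Reduce function, continuing the assumed word-count example:

```python
# Sketch of a user-defined Reduce function for word counting: it receives
# one key together with the list of all its values and combines them into
# a single output value, here their sum.
def reduce_fn(key, values):
    return (key, sum(values))

# reduce_fn("w3", [1, 1, 1, 1]) -> ("w3", 4)
```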
What are the 3 network (switch) levels of data locality when a MapReduce task reads its input from HDFS, in order of fastest to slowest?
- Data local (the data is on the same machine as the task)
- Rack local (the data is on a different machine in the same rack)
- Off rack (the data is on a machine in a different rack)
In general terms, what does a MapReduce job do?
It condenses many data entries that share the same value into a single, user-specified new (key, value) pair, where the value is often a count. Example: given the input “w1, w2, w3, w2, w3, w3, w3”, the output could be “(w1, 1), (w2, 2), (w3, 4)”.
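Putting the earlier sketches together, a toy single-machine run that reproduces this word-count output (no Hadoop involved, purely illustrative):

```python
from collections import defaultdict

# Toy end-to-end word count mimicking the MapReduce flow: map every input
# pair, group the intermediate pairs by key, then reduce each group.
def word_count(input_pairs):
    mapped = []
    for _, text in input_pairs:
        mapped.extend((word, 1) for word in text.split())

    groups = defaultdict(list)
    for word, one in mapped:
        groups[word].append(one)

    return [(word, sum(counts)) for word, counts in groups.items()]

print(word_count([("doc", "w1 w2 w3 w2 w3 w3 w3")]))
# -> [('w1', 1), ('w2', 2), ('w3', 4)]
```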