MapReduce Flashcards
Explain the Map and Reduce functions of the MapReduce programming paradigm.
MapReduce is both a programming paradigm and an execution engine; Hadoop ships a widely used implementation of it.
Mapper: The Mapper takes input data and transforms it into intermediate key-value pairs. The input can be in any format, and the Mapper can perform arbitrary operations on it. Its output is a sequence of key-value pairs representing the intermediate results of the computation; these pairs are then sent to the Reducers for further processing.
Reducer: The Reducer takes the intermediate key-value pairs generated by the Mappers and performs further computation on them. For each key, the Reducer receives that key together with the list of all values emitted for it by the Mappers. It can perform arbitrary operations on these values, and it emits another set of key-value pairs representing the final results of the computation, which are written to disk as the output of the MapReduce job.
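Schematically, the types line up as follows (a sketch in the k/v numbering commonly used to describe MapReduce; the concrete types are up to the job):
~~~
map:    (k1, v1)          -> list of (k2, v2)   # one input record -> intermediate pairs
reduce: (k2, list of v2)  -> list of (k3, v3)   # one key + all its values -> final pairs
~~~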
We can think of it via an SQL analogy:
SELECT SUM(salary*12) FROM Employees GROUP BY deptId
Here the salary*12 projection is the Map, the GROUP BY deptId corresponds to the shuffle that groups rows by key, and the SUM aggregate is the Reduce.
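A rough translation of that query into pseudocode in the same style as the word-count example below (the deptId and salary field names come from the query; the record format is assumed):
~~~
function mapper(row_id, employee_row):
    # the salary*12 projection, keyed by the GROUP BY column
    emit(employee_row.deptId, employee_row.salary * 12)

function reducer(deptId, yearly_salaries):
    # the SUM aggregate, applied once per department group
    total = 0
    for s in yearly_salaries:
        total += s
    emit(deptId, total)
~~~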
Explain how we can obtain the word count across a large file using MapReduce
- Each Mapper reads one input split of the file (by default one split per HDFS block).
- The Mapper tokenizes its split into individual words.
- Mapper writes a key-value pair to the context for each word, where the key is the word and the value is the count (always 1).
- MapReduce framework sorts and shuffles intermediate key-value pairs by key.
- Sorted key-value pairs are partitioned by keys and assigned to specific reducers for processing.
- Within each partition, the reducer processes key-value pairs in the order determined by the sorting step.
- The reducer sums the counts for each individual word (see the toy trace after this list).
- The reducer emits another sequence of key-value pairs, (word, total), representing the final output.
- Final output is stored on disk.
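For a toy input such as "the cat sat on the mat", the stages look like this:
~~~
Mapper output:   (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
After shuffle:   (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])
Reducer output:  (cat,1) (mat,1) (on,1) (sat,1) (the,2)
~~~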
Provide pseudocode for the Mapper and Reducer used in the word-count problem.
Mapper:
~~~
function mapper(doc_id, doc_text):
    # emit one (word, 1) pair for every word in the document
    for each word in doc_text:
        emit(word, 1)
~~~
Reducer:
~~~
function reducer(word, counts):
    # counts is the list of 1s emitted for this word across all mappers
    total = 0
    for count in counts:
        total += count
    emit(word, total)
~~~
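The pseudocode can be exercised in a tiny, single-process Python simulation (this is not the Hadoop API; the shuffle is emulated by sorting the intermediate pairs and grouping them by key):
~~~
from itertools import groupby
from operator import itemgetter

def mapper(doc_id, doc_text):
    # emit one (word, 1) pair per token
    for word in doc_text.split():
        yield (word, 1)

def reducer(word, counts):
    # sum the partial counts for one word
    yield (word, sum(counts))

docs = {1: "the cat sat on the mat", 2: "the dog sat"}

# Map phase: run the mapper over every document.
intermediate = [pair for doc_id, text in docs.items() for pair in mapper(doc_id, text)]

# "Shuffle": sort by key so that all pairs for the same word are adjacent.
intermediate.sort(key=itemgetter(0))

# Reduce phase: one reducer call per distinct key.
output = []
for word, group in groupby(intermediate, key=itemgetter(0)):
    output.extend(reducer(word, [count for _, count in group]))

print(output)
# [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 2), ('the', 3)]
~~~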
What are the upsides of using a paradigm like MapReduce?
It abstracts a lot of complexity away by providing an easy-to-learn interface for the developer. The developer does not need to know where the data is located, how to partition files, or how to handle networking and node management. Hadoop’s MapReduce takes care of all of this behind the scenes.
What is the most expensive phase of MapReduce? Explain the most important components of this phase.
The shuffle phase is the most expensive.
During the shuffle phase, the framework needs to move intermediate key-value pairs from the mappers to the reducers, sort them by key, and partition them using a hash function to determine which reducer will process each key. This generates a large amount of network traffic, which can become a bottleneck if the network is not fast enough.
If the hash function is poorly chosen, or a few keys dominate the data, the workload can be distributed unevenly across the Reducers, creating stragglers that slow down the whole job.
Explain the hash function of the shuffle phase of the MapReduce framework.
During the shuffle phase in MapReduce, the framework partitions the sorted key-value pairs by the keys and then sends each partition to a specific reducer. The partitioning is done using a hash function, which maps each key to a specific reducer.
The hash function typically computes a 32-bit hash value from the key (in Hadoop, the key's hashCode); this value, taken modulo the number of reducers, determines which reducer receives the key-value pair. The hash function should distribute the keys evenly among the reducers, so that the workload is balanced and each reducer receives roughly the same amount of data to process.
The goal of the hash function is to ensure that all key-value pairs with the same key are sent to the same reducer, while also distributing the workload evenly across the reducers. This helps to minimize data transfer and reduce the processing time for the entire MapReduce job.
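A minimal sketch of this partitioning logic in Python (Hadoop's default HashPartitioner does essentially the same thing with the key's hashCode; the function name here is illustrative):
~~~
def partition(key, num_reducers):
    # Same key -> same hash -> same reducer; the mask keeps the value non-negative.
    # Note: a real framework needs a hash that is stable across machines
    # (Python's built-in string hash is salted per process).
    return (hash(key) & 0x7FFFFFFF) % num_reducers
~~~
Every pair with the same key maps to the same index, so all of its values meet at one reducer, while distinct keys are spread (ideally evenly) over the available reducers.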
Explain how jobs are created and tracked under the hood in MapReduce
- You upload a file to HDFS
- You submit a job to the JobTracker, the master daemon (typically running alongside the NameNode).
- The JobTracker splits the job into tasks and assigns them to TaskTrackers, which run locally on the DataNodes.
- The TaskTrackers execute their tasks locally and monitor them.
- The TaskTrackers report progress and completion back to the JobTracker.
This means that code is executed close to the data, which minimizes data transfer over the network.
What is a Combiner function? How can it increase performance?
Let’s say we have a problem where we tokenize a large file. The tokens contain country codes (NL, BE, etc.). It would be wasteful to send each individual key-value pair over the network to the reducer, as there are a limited number of country codes that could be found. In these cases it would be more efficient to do some local (on the same datanode) aggregation before sending the intermediate results to the Reducer.
This local aggregation is defined in the Combiner, which is essentially a mini-reducer that runs over a mapper's local output before it is sent across the network. The combiner is an optional step in the MapReduce process and can drastically reduce the amount of data that has to be shuffled.
Note: the framework does not guarantee that the combiner will be executed (it may run zero, one, or several times), so the job must produce correct results without it.
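For counting jobs such as the country-code example (or word count), the combiner can reuse the reducer logic, because summing partial sums gives the same result as summing all the 1s at once (the operation is associative and commutative). A sketch in the same pseudocode style:
~~~
function combiner(word, partial_counts):
    # runs on the mapper's node, over that node's local output only
    total = 0
    for c in partial_counts:
        total += c
    emit(word, total)
~~~
The reducer itself stays unchanged; it simply receives pre-aggregated counts instead of a long list of 1s.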