Map/reduce Flashcards
what is Map/Reduce?
a programming framework in which every solution must be expressed in terms of a map function and a reduce function
what is the function of the map function?
to emit intermediate key/value pairs when called on an input item
what is the function of the reduce function?
to emit results for a key when called on groups of pairs with the same key
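As a minimal sketch of the two functions described above, here is the classic word-count example run on a single machine. The names map_fn, reduce_fn and run_mapreduce are hypothetical and chosen for illustration only; a real framework would distribute the calls across many nodes.

```python
from collections import defaultdict

def map_fn(line):
    """Emit an intermediate (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(key, values):
    """Emit one result for a key, given all the values grouped under that key."""
    yield (key, sum(values))

def run_mapreduce(items, map_fn, reduce_fn):
    # Map phase: call map_fn on every input item and group the emitted pairs by key.
    groups = defaultdict(list)
    for item in items:
        for key, value in map_fn(item):
            groups[key].append(value)
    # Reduce phase: call reduce_fn once per distinct key, in sorted key order.
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))
    return results

lines = ["the cat sat", "the dog sat"]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)  # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```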
what are the benefits of MapReduce? (3)
- provides a high-level parallel programming abstraction
- implementations of the framework deliver good performance
- greatly reduces parallel programming complexity
where are the opportunities for parallelism when using MapReduce? (2)
- input data can be partitioned into chunks, each of which can be assigned to a different mapper
- the map stage produces collections of key/value pairs that are grouped by key; each distinct key-group can be sent to a different reducer
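The two opportunities above can be sketched with a thread pool standing in for a cluster. This is an illustration of the structure only: the helper names (split_into_chunks, mapper, reducer) and the toy task of summing numbers by remainder are assumptions, and Python threads will not give a real speed-up for CPU-bound work.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Partition the input into roughly equal chunks, one per mapper."""
    return [items[i::n_chunks] for i in range(n_chunks)]

def mapper(chunk):
    """Run the map step over one chunk, emitting (number % 3, number) pairs."""
    return [(n % 3, n) for n in chunk]

def reducer(group):
    """Reduce one key-group to (key, sum of its values)."""
    key, values = group
    return (key, sum(values))

data = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Opportunity 1: each input chunk is handled by a different mapper.
    mapped = list(pool.map(mapper, split_into_chunks(data, 4)))
    # Group the intermediate pairs by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Opportunity 2: each distinct key-group is handled by a different reducer.
    totals = dict(pool.map(reducer, sorted(groups.items())))

print(totals)
```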
what does the reduce function receive?
a key/value pair whose key is an intermediate key and whose value is a list of all the values from the intermediate key/value pairs that share that key
when will reduce jobs run?
only once all map jobs have completed and the synchronisation step has finished
what occurs during the synchronisation step? (6)
- Every key-value item generated by the mappers is collected
- Items are transferred over the network
- Items with the same key are grouped into a list of values
- Data is partitioned among the Reducers
- Data is copied over the network to each Reducer
- The data provided to each Reducer is sorted according to the keys
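The steps above can be sketched on a single machine as follows. The function name shuffle is hypothetical, the two network transfers are not simulated, and the partitioning rule (a CRC32 checksum of the key modulo the number of reducers) is an assumption made for the example.

```python
from collections import defaultdict
import zlib

def shuffle(mapper_outputs, n_reducers):
    """Collect, group, partition and sort mapper output for the reducers."""
    # Collect every key/value item generated by the mappers.
    collected = [pair for output in mapper_outputs for pair in output]
    # Group items with the same key into a list of values.
    grouped = defaultdict(list)
    for key, value in collected:
        grouped[key].append(value)
    # Partition the key-groups among the reducers (assumed rule: checksum mod n).
    partitions = [[] for _ in range(n_reducers)]
    for key, values in grouped.items():
        index = zlib.crc32(key.encode("utf-8")) % n_reducers
        partitions[index].append((key, values))
    # Sort the data handed to each reducer by key.
    return [sorted(part) for part in partitions]

mapper_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = shuffle(mapper_outputs, 2)
for reducer_id, part in enumerate(reducer_inputs):
    print(reducer_id, part)
```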
how does the framework partition key value pairs to the reducers?
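by hashing each key (by default, hash of the key modulo the number of reducers), which spreads the distinct keys across the reducers in an effectively random way while sending all pairs with the same key to the same reducer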
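One common way to spread keys across reducers is to hash the key and take the result modulo the number of reducers, which is how Hadoop's default partitioner behaves. The sketch below uses zlib.crc32 as a stand-in hash because it gives the same value on every run; the function name partition is hypothetical.

```python
import zlib

def partition(key, n_reducers):
    """Return the index of the reducer responsible for this key."""
    # A stable checksum is used so that the same key always maps to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % n_reducers

keys = ["apple", "banana", "cherry", "apple"]
assignment = [partition(k, 3) for k in keys]
print(list(zip(keys, assignment)))
```

Both occurrences of "apple" land on the same reducer, which is what lets a single reduce call see every value for that key.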
why is it not possible to always achieve load balancing between reducers?
some keys occur far more often than others, so the reducer that such a key is assigned to receives many more values to process
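A small sketch of this imbalance, using made-up data in which one key dominates. The partition function and the 90/5/5 split are assumptions for illustration; whichever reducer receives the frequent key ends up with at least 90 of the 100 pairs.

```python
from collections import Counter
import zlib

def partition(key, n_reducers):
    """Assign a key to a reducer using a stable checksum of the key."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

# Skewed intermediate data: "the" appears far more often than the other keys.
pairs = [("the", 1)] * 90 + [("cat", 1)] * 5 + [("dog", 1)] * 5

# Count how many pairs each of the 2 reducers would have to process.
load = Counter(partition(key, 2) for key, _ in pairs)
print(dict(load))
```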
what are the two bottlenecks associated with the map/reduce framework?
the synchronisation step & the network (communication speed)
where are key/value pairs produced by the map function stored?
in memory, up to 100MB; once that limit is reached, the remaining key/value pairs are stored on hard disk
what is partitioning?
the process of dividing all the intermediate key/value pairs produced by the map functions into groups, or partitions, based on their keys (each partition is typically associated with a range of keys)
what is the number of partitions equal to?
the total number of reducers/reduce jobs
what is the purpose of the combiner?
to improve efficiency by reducing the communication volume, acting as a preliminary reducer to perform a local aggregation of the output from the mappers
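A minimal sketch of a combiner for word count, assuming the reduce operation is a sum (local pre-aggregation of this kind is safe when the operation is associative and commutative). The names map_fn and combine are hypothetical.

```python
from collections import Counter

def map_fn(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word, 1)

def combine(pairs):
    """Locally sum the values per key on the mapper's node before sending."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

raw = list(map_fn("to be or not to be"))
combined = combine(raw)
print(len(raw), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Here the mapper would send 4 pairs over the network instead of 6; on real data with many repeated keys the saving can be much larger.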