Map/reduce Flashcards

Question 1

Q

what is Map/Reduce?

Answer

A

a programming framework in which every solution must be expressed in terms of a map function and a reduce function

Question 2

Q

what is the function of the map function?

Answer

A

to emit intermediate key/value pairs when called on an input item

Question 3

Q

what is the function of the reduce function?

Answer

A

to emit results for a key when called on groups of pairs with the same key
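As a rough illustration of how the map and reduce functions fit together, here is a minimal single-process word-count sketch. The names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for this example and do not belong to any real framework; the grouping step is simulated with a dictionary.

```python
from collections import defaultdict

def map_fn(line):
    """Emit an intermediate (word, 1) pair for every word in one input item."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(key, values):
    """Emit one result for a key, given all the values grouped under it."""
    return (key, sum(values))

def run_mapreduce(lines):
    """Hypothetical driver: map every item, group by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

counts = run_mapreduce(["the cat sat", "the dog sat"])
print(counts)
```

A real framework would run many copies of `map_fn` and `reduce_fn` on different machines; only the two functions themselves are written by the programmer.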

Question 4

Q

what are the benefits of MapReduce? (3)

Answer

A

provides a high-level parallel programming abstraction
the framework implementations provide good performance results
greatly reduces parallel programming complexity

Question 5

Q

where are the opportunities for parallelism when using MapReduce? (2)

Answer

A

input data can be partitioned into chunks, each of which can be assigned to a different mapper
the map stage produces collections of key/value pairs that are grouped by key; each distinct key group can be sent to a different reducer
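The first opportunity (one chunk per mapper) can be sketched as below. This is only an illustration: it uses a thread pool on one machine as a stand-in for mappers on separate nodes, and `split_into_chunks` and `map_chunk` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(lines, n_chunks):
    """Partition the input into n_chunks chunks (round-robin assignment)."""
    return [lines[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """One 'mapper': emit (word, 1) pairs for its own chunk only."""
    return [(word, 1) for line in chunk for word in line.split()]

lines = ["a b", "b c", "c d", "d a"]
chunks = split_into_chunks(lines, 2)

# Each chunk is handed to a different worker; the workers are independent.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_chunk, chunks))

pairs = [pair for partial in partials for pair in partial]
print(len(pairs), "intermediate pairs from", len(chunks), "mappers")
```

The mappers never need to talk to each other, which is what makes this stage easy to parallelise.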

Question 6

Q

what does the reduce function receive?

Answer

A

a key/value pair whose key is an intermediate key and whose value is a list of all the values from the grouped intermediate key/value pairs that share that key

Question 7

Q

when will reduce jobs run?

Answer

A

only once all map jobs have completed and the synchronisation step has occurred

Question 8

Q

what occurs during the synchronisation step? (6)

Answer

A

Every key-value item generated by the mappers is collected
Items are transferred over the network
Items with the same key are grouped into a list of values
Data is partitioned among the number of Reducers
Data is copied over the network to each Reducer
The data provided to each Reducer is sorted according to the keys
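The steps above can be sketched in one process as follows. Network transfer is not modelled, the function name `synchronise` is invented for this example, and the hash-modulo partitioning rule is an assumption rather than something the cards specify.

```python
import zlib
from collections import defaultdict

def synchronise(mapper_outputs, n_reducers):
    """Collect, group, partition and sort mapper output for the reducers."""
    # Collect every key/value item generated by every mapper.
    all_pairs = [pair for output in mapper_outputs for pair in output]

    # Group items with the same key into a list of values.
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)

    # Partition the groups among the reducers (assumed rule: stable hash mod n).
    partitions = [[] for _ in range(n_reducers)]
    for key, values in grouped.items():
        index = zlib.crc32(key.encode("utf-8")) % n_reducers
        partitions[index].append((key, values))

    # Sort the data provided to each reducer by key.
    return [sorted(part, key=lambda item: item[0]) for part in partitions]

reducer_inputs = synchronise([[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]], 2)
print(reducer_inputs)
```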

Question 9

Q

how does the framework partition key value pairs to the reducers?

Answer

A

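by key: typically a hash of each key, taken modulo the number of reducers, picks the reducer, so the distinct keys are spread in an effectively random way while all pairs with the same key reach the same reducer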

Question 10

Q

why is it not possible to always achieve load balancing between reducers?

Answer

A

some distinct keys may appear more times than others so the reducer that the distinct key is assigned to will have more values to work with
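A small sketch of this skew, assuming a hash-modulo partitioner (the `partition` function here is hypothetical, with `zlib.crc32` chosen only because it gives the same result on every run):

```python
import zlib
from collections import Counter

def partition(key, n_reducers):
    """Assumed partitioner: stable hash of the key, modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

# One very frequent key and two rare ones.
words = ["the"] * 90 + ["cat"] * 5 + ["dog"] * 5

# Count how many values each reducer would receive.
load = Counter(partition(word, 2) for word in words)
print(dict(load))
```

Whichever reducer receives "the" gets at least 90 of the 100 values, no matter how the other keys are spread, because one key can never be split across reducers.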

Question 11

Q

what are the two bottlenecks associated with the map/reduce framework?

Answer

A

the synchronisation step & the network (communication speed)

Question 12

Q

where are key/value pairs produced by the map function stored?

Answer

A

in an in-memory buffer of up to 100MB; once the buffer is full, the remaining key/value pairs are stored on hard disk

Question 13

Q

what is partitioning?

Answer

A

the process of dividing all the intermediate key/value pairs produced by the map functions into groups, or partitions, based on their keys (each partition is typically associated with a range of keys)

Question 14

Q

what is the number of partitions equal to?

Answer

A

the total number of reducers/reduce jobs

Question 15

Q

what is the purpose of the combiner?

Answer

A

to improve efficiency by reducing the communication volume, acting as a preliminary reducer to perform a local aggregation of the output from the mappers
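A minimal word-count sketch of the idea, with hypothetical `map_fn` and `combine` functions: the combiner runs on one mapper's output before anything is sent over the network.

```python
from collections import Counter

def map_fn(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Local aggregation of a single mapper's output, like a preliminary reducer."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

mapper_output = map_fn("to be or not to be")
combined = combine(mapper_output)

print(len(mapper_output), "pairs before combining")
print(len(combined), "pairs after combining:", combined)
```

Here six pairs shrink to four before the transfer. This only works because summing counts gives the same final result whether or not partial sums are taken first.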

Question 16

Q

when does partitioning occur?

Answer

A

after mapping and before shuffle & sort

Question 17

Q

what is shuffle & sort?

Answer

A

the process of transferring the relevant partitions to the appropriate nodes and sorting the data based on keys

Question 18

Q

when does shuffle & sort occur?

Answer

A

after partitioning and before reducing

Question 19

Q

is the combiner mandatory?

Question 20

Q

give some examples of map/reduce implementations (4)

Answer

A

top-k
inverted index
filtering
numerical summarisation

Question 21

Q

why can top-k only run with one reducer?

Answer

A

because all key/value pairs need to be sent to a single reducer for sorting
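One common way to arrange this, sketched under assumptions not stated in the card: every mapper emits its candidates under a single shared key (here the made-up key "top"), so the framework routes them all to the same reducer. Pre-selecting a local top-k in each mapper is an optional optimisation added for this example.

```python
import heapq

K = 2  # how many top items to keep (illustrative value)

def map_top_k(split):
    """Emit this split's local top-K scores, all under one shared key."""
    return [("top", score) for score in heapq.nlargest(K, split)]

def reduce_top_k(key, values):
    """The single reducer sees every candidate and keeps the global top K."""
    return heapq.nlargest(K, values)

splits = [[3, 9, 1], [7, 4], [8, 2]]
pairs = [pair for split in splits for pair in map_top_k(split)]
values = [value for _, value in pairs]
result = reduce_top_k("top", values)
print(result)
```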

Question 22

Q

what is the minimum requirement for top-k?

Answer

A

the ranking data for a whole input split must fit into the memory of a single reducer

Question 23

Q

is calculating an average using map/reduce an associative operation?
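A small numeric sketch that bears on this question: averaging partial averages does not generally give the overall average, so a mean cannot simply be applied to partial results the way a sum can. Carrying (sum, count) pairs instead, as shown here, is a common workaround and an assumption of this example rather than something stated in the cards.

```python
def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

chunk_a = [1, 2, 3, 4]
chunk_b = [10]

overall = mean(chunk_a + chunk_b)                     # mean of all five values
mean_of_means = mean([mean(chunk_a), mean(chunk_b)])  # mean of the two partial means

def merge(partials):
    """Merge (sum, count) pairs; sums and counts can be added in any grouping."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

total, count = merge([(sum(chunk_a), len(chunk_a)), (sum(chunk_b), len(chunk_b))])
from_partials = total / count

print(overall, mean_of_means, from_partials)
```

The two chunks have different sizes, so the mean of means weights the single value 10 as heavily as the four values in the other chunk.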