Map/reduce Flashcards

1
Q

what is Map/Reduce?

A

a programming framework for which any solution must be expressed using map & reduce functions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

what is the function of the map function?

A

to emit intermediate key/value pairs when called on an input item

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

what is the function of the reduce function?

A

to emit results for a key when called on groups of pairs with the same key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

what are the benefits of MapReduce? (3)

A
  1. provides a high-level parallel programming abstraction
  2. the framework implementations provide good performance results
  3. greatly reduces parallel programming complexity
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

where are the opportunities for parallelism when using MapReduce? (2)

A
  1. input data can be partitioned into chunks, each of which can be assigned to a different mapper
  2. the map stage produces collections of key/value pairs, that are grouped by key. Each distinct key-group can be sent to a different reducer
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

what does the reduce function receive?

A

a key/value pair (intermediate keys), where the new value is a list of all the input values from the grouped key/value pairs with the same key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

when will reduce jobs run?

A

only once all map jobs are completed and the synchronisation step occurs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

what occurs during the synchronisation step? (6)

A
  1. Every key-value item generated by the mappers is collected
  2. Items are transferred over the network
  3. Same key items are grouped into a list of value
  4. Data is partitioned among the number of Reducers
  5. Data is copied over the network to each Reducer
  6. The data provided to each Reducer is sorted according to the keys
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

how does the framework partition key value pairs to the reducers?

A

randomly, based on the number of distinct keys generated

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

why is it not possible to always achieve load balancing between reducers?

A

some distinct keys may appear more times than others so the reducer that the distinct key is assigned to will have more values to work with

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

what are the two bottlenecks associated with the map/reduce framework?

A

the synchronisation step & the network (communication speed)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

where are key/value pairs produced by the map function stored?

A

in memory when up to 100MB, then remaining key/value pairs are stored on hard disk

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

what is partitioning?

A

the process of dividing the all the intermediate key/value pairs produced by the map functions into groups or partitions based on the keys (each partition is typically associated with a range of keys)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

what is the number of partitions equal to?

A

the total number of reducers/reduce jobs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

what is the purpose of the combiner?

A

to improve efficiency by reducing the communication volume, acting as a preliminary reducer to perform a local aggregation of the output from the mappers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

when does partitioning occur?

A

after mapping and before shuffle & sort

17
Q

what is shuffle & sort?

A

the process of transferring the relevant partitions to the appropriate nodes, and it sorting the data based on keys

18
Q

when does shuffle & sort occur?

A

after partitioning and before reducing

19
Q

is the combiner mandatory?

A

no

20
Q

give some examples of map/reduce implementations (4)

A
  1. top-k
  2. inverted index
  3. filtering
  4. numerical sumarisation
21
Q

why can top-k only run with one reducer?

A

because all key/value pairs need to be sent to a single reducer for sorting

22
Q

what is the minimum requirement for top-k?

A

the ranking data for a whole input split must fit into memory of a single reducer

23
Q

is calculating an average using map/reduce an associative operation?

A

no