MapReduce - Week 7 Flashcards

1
Q

Batch Processing

A

jobs that can run without end user interaction, or can be scheduled to run as resources permit

2
Q

Examples of batch processing for data sets that build over time

A

Web crawling
Transaction logs, for analysing trends
Equipment logs, for predicting faults

Huge data sets that may need to be processed on parallel architectures

3
Q

Who originally developed MapReduce?

A

Google

4
Q

What two functions make up MapReduce?

A

map and reduce

5
Q

MapReduce - map function definition

A

map(key1, value1) -> [(key2, value2)]

Given a key and a value, generates a collection of key-value pairs

6
Q

MapReduce - reduce function definition

A

reduce(key2, [value2]) -> [(key3,value3)]

Given a key key2 output by map, and a collection of all the values value2 associated with that key, returns a new collection of key-value pairs

7
Q

Word count with MapReduce - what do the two functions do?

A

Map takes a document, and returns a set of word counts for that document.
e.g.
“the map operation given…” -> {“the”: 1, “map”:1, …}

Reduce takes the outputs from all the maps and combines the counts for each word, e.g.
{“the”: [1,1], “map”: [1,1]} -> {“the”: 2, “map”: 2}
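
A minimal Python-style sketch of the two functions (the names and the convention of returning lists of pairs rather than calling a framework emit function are illustrative, not from the notes):

def wordcount_map(doc_id, document):
    # count occurrences within this one document, then output (word, count) pairs
    counts = {}
    for word in document.split():
        counts[word] = counts.get(word, 0) + 1
    return list(counts.items())

def wordcount_reduce(word, per_document_counts):
    # per_document_counts holds one count for each document the word appears in
    return (word, sum(per_document_counts))

e.g. wordcount_map("d1", "the map the") -> [("the", 2), ("map", 1)]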

8
Q

MapReduce provider, extensions and competitors

A

Hadoop, …

Extensions: Cloudera

Competitors: Apache Spark

9
Q

AWS EC2

A

Purchase of virtual machines of different capabilities, with different operating systems and for different periods

IaaS

10
Q

AWS S3

A

Purchase of storage that is accessed through a simple file system style interface

IaaS

11
Q

EMR (Elastic Map Reduce)

A

The ability to run scalable applications written using the MapReduce programming model over EC2 and S3 infrastructure

PaaS

12
Q

How is S3 used for AWS MapReduce?

A

The input to the map/reduce problem

The JAR that contains the program

The output from the execution of the program

Logging information

13
Q

Use MapReduce or an RDB for single batch tasks?

A

MapReduce; the effort of loading the data into a relational database may not be worth it

14
Q

Use MapReduce or an RDB for data that needs both online transaction processing (OLTP) and analytical tasks?

A

MapReduce won’t help with the OLTP tasks.

A relational database is more flexible and may be able to handle both, though often different systems are used for OLTP and analytics to avoid contention for resources.

15
Q

Use MapReduce or an RDB for data that needs fine-grained access control?

A

MapReduce itself doesn’t provide much in the way of security - the hosting environment does that.

Certain relational databases will provide fine-grained access control

16
Q

MapReduce - what is a job?

A

The unit of work to be performed (the data and the program)

Can consist of several map and reduce tasks

17
Q

MapReduce - what is a split?

A

A part of the input (e.g. a 64 MB filesystem block)

18
Q

MapReduce - what is a task?

A

Map or reduce functions created and run for each split/partition

19
Q

MapReduce - what is a task tracker?

A

Tracks the progress of each of the map or reduce tasks on a node

Keeps the job tracker informed of progress

20
Q

MapReduce - what is a job tracker?

A

Coordinates the different tasks comprising a job

21
Q

MapReduce - How many map tasks are created?

A

One for each split (each part of the input, e.g. a 64 MB filesystem block)

22
Q

MapReduce - What are the steps of the map function?

A
  1. The map task runs on the split, creating key-value pairs
  2. The output of the map is partitioned into groups, for sending to reduce functions, typically by hashing the key (see the sketch after this list)
  3. The partitions are sorted by key
  4. The outputs are written to the local file system
  5. The task tracker notifies the job tracker that the task is completed
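
One common way step 2 is realised (an illustrative Python sketch; Hadoop's actual partitioner differs in detail) is to hash the key modulo the number of reduce tasks:

def partition(key, num_reduce_tasks):
    # every pair with the same key lands in the same partition,
    # and is therefore sent to the same reduce task
    return hash(key) % num_reduce_tasks
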
23
Q

MapReduce - What are the steps of the reduce function?

A
  1. Relevant map partitions are copied to the associated reduce nodes
  2. Data from different maps is merged to produce the inputs for individual reduce operations (see the sketch after this list)
  3. The reduce task is run
  4. Outputs are written to the distributed filesystem
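
A Python sketch of the merge in step 2, assuming the pairs copied from the relevant partition of every map output are gathered into one list:

from itertools import groupby

def prepare_reduce_inputs(copied_pairs):
    # sort by key, then group so every key appears once with all of its values,
    # giving the (key, [values]) inputs for the individual reduce calls
    copied_pairs.sort(key=lambda kv: kv[0])
    for key, group in groupby(copied_pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]
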
24
Q

What are the MapReduce performance issues to look out for?

A

Memory usage - The space required by the code within map and reduce

Skew - The likelihood that data is not distributed evenly across reduce nodes

Intermediate result size - The amount of data that is produced by map or reduce compared to the size of their inputs

25
Q

Write the Basket analysis average problem as MapReduce pseudocode

A

A list of the prices of the baskets each customer has purchased, e.g.

001 - £26
002 - £30
001 - £40
002 - £35

A simple analysis involves getting the average each customer has spent

Map receives a key (identifying the location of an input split) and a value (the contents of that split: a list of customer ids and basket prices)
It emits those customer id and price pairs

The reduce function gets a customer id and the list of basket prices for that customer produced by the maps; it emits the average of those prices, keyed by the customer id it received
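
A Python-style sketch of that pseudocode (the function names, the line-parsing details and the returned-pairs convention are assumptions for illustration):

def basket_map(split_location, lines):
    # lines: the records in this split, e.g. "001 - £26"
    pairs = []
    for line in lines:
        customer_id, _, price = line.partition(" - ")
        pairs.append((customer_id.strip(), float(price.strip().lstrip("£"))))
    return pairs

def basket_reduce(customer_id, basket_prices):
    # basket_prices: every basket value emitted for this customer
    return (customer_id, sum(basket_prices) / len(basket_prices))

e.g. basket_reduce("001", [26.0, 40.0]) -> ("001", 33.0)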

26
Q

MapReduce - summarisation patterns

A

Summarisation aims to send as little information to reducers as possible, e.g.
“hello hello hello” -> (“hello”, 3)

instead of
“hello hello hello” -> (“hello”, 1), (“hello”, 1), (“hello”, 1)

To do this numerically, the operation must be both:
associative - (a x b) x c = a x (b x c)
commutative - a x b = b x a
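
Addition satisfies both properties, which is why partial counts can be pre-combined in any grouping and order without changing the result (a trivial illustrative check in Python):

# counts for "hello" from three occurrences, possibly seen by different maps
a, b, c = 1, 1, 1
assert (a + b) + c == a + (b + c) == 3   # associative
assert a + b == b + a                    # commutative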

27
Q

MapReduce - combiners

A

An optional reducer that is local to a map, and that summarises its results in some way

Must have inputs and outputs that are semantically compatible with the output of a map

In word count, the normal reducer can be used as a combiner; in this case it has the same effect as summarisation
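
In word-count terms (a sketch under the same illustrative conventions as the earlier word-count example), the summing function can serve as both combiner and reducer because its input and output types line up:

def sum_counts(word, counts):
    # usable as the local combiner on each map node and as the final reducer:
    # both take (word, list of counts) and produce (word, count)
    return (word, sum(counts))

# on one map node:  ("the", 1), ("the", 1), ("map", 1)
# after combiner:   ("the", 2), ("map", 1)
# the final reducer then sums the per-node partial counts across all nodes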

28
Q

MapReduce - inverted index pattern

A

supports the construction of an index from the contents of a document to the document

Map emits a key for each token, and a value of the document identifier

Reduce is application specific; it might compute tf-idf or write to an index structure
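
A Python sketch (function names illustrative; the reduce shown here simply collects the distinct documents containing each token):

def index_map(doc_id, document):
    # each token is emitted with the identifier of the document it came from
    return [(token, doc_id) for token in document.split()]

def index_reduce(token, doc_ids):
    # application specific; here just the set of documents containing the token
    return (token, sorted(set(doc_ids)))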

29
Q

MapReduce - Filtering pattern

A

Discard some of the information in the input

Sampling - producing representative examples

Top-k - choosing the best examples

Distinct - removing duplicates

For word count - may only be interested in counting occurrences of dictionary words
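
For that word-count variant, the filter can sit inside the map (a sketch; dictionary is an assumed extra input, a set of allowed words):

def filtered_wordcount_map(doc_id, document, dictionary):
    # non-dictionary words are never emitted, so reducers never see them
    counts = {}
    for word in document.split():
        if word in dictionary:
            counts[word] = counts.get(word, 0) + 1
    return list(counts.items())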

30
Q

MapReduce - Join Pattern

A

Implements a relational join

In Hadoop a job can be associated with two mappers that read from different inputs; then the infrastructure does most of the join

Need to label tuples with their source table, so we can distinguish them in the reduce.

For full explanation see “Join Pattern” slide in the “Map Reduce Programming - Week 7” notes
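
A reduce-side join sketch in Python, with assumed table names (customers and orders, not from the notes), showing the labelling of tuples with their source table:

def customers_map(key, row):
    customer_id, details = row
    return [(customer_id, ("customers", details))]

def orders_map(key, row):
    customer_id, details = row
    return [(customer_id, ("orders", details))]

def join_reduce(customer_id, labelled_rows):
    # all rows sharing a customer_id arrive at the same reduce; the label says
    # which table each came from, so the cross product gives the joined tuples
    customers = [r for label, r in labelled_rows if label == "customers"]
    orders = [r for label, r in labelled_rows if label == "orders"]
    return [(customer_id, (c, o)) for c in customers for o in orders]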