Big Data Lecture 08 Massive Parallel Processing I_ Map Reduce Flashcards
Explain the basic paradigm of processing in MapReduce.
<ol><li><span>Map: [key, value] -> [key, value]; apply a transformation in parallel to all the key-value pairs in your storage,</span></li><li><span>Shuffle: sort and group the data by key,</span></li><li><span>Reduce: [key, [values]] -> [key, value]; one job per key, apply a transformation to each key's values to derive information.</span></li></ol>
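The three phases can be simulated on a single machine. A minimal Python sketch (the function names and the max-temperature example are illustrative, not from the lecture):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(pairs, map_fn, reduce_fn):
    """Toy single-machine simulation of the three MapReduce phases."""
    # Map: transform each input key-value pair independently (parallelizable).
    mapped = [kv for pair in pairs for kv in map_fn(*pair)]
    # Shuffle: sort and group all intermediate pairs by key.
    mapped.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g]) for k, g in groupby(mapped, key=itemgetter(0)))
    # Reduce: one call per key over the list of its values.
    return [reduce_fn(k, vs) for k, vs in grouped]

# Example: maximum temperature per city. Map emits (city, temp); reduce takes the max.
result = map_reduce(
    [(1, ("zrh", 20)), (2, ("zrh", 25)), (3, ("ber", 18))],
    lambda _line_no, v: [v],
    lambda city, temps: (city, max(temps)),
)
```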
What are the data types in the different stages of MapReduce?
They are always key-value pairs; only for the reducer are the values grouped by key, but even then they remain key-value pairs.
On what quantity of data can MapReduce work and on how many nodes?
<ul><li>TBs of data,</li><li>1000s of nodes.</li></ul>
What is the architecture of MapReduce on Hadoop?
A centralized architecture as in previous cases: the main node is called the JobTracker (typically running on the same machine as the NameNode) and the other nodes are called TaskTrackers (on the DataNodes).<br></br>
What does ‘bringing query to the data’ mean?
Queries should be executed as close to the data as possible. In practice, this means shipping a jar to the TaskTrackers.
How are key-values pairs stored during the operation?
In most cases in memory, but if need be they can be flushed to disk and compacted at the same time (using Log-Structured Merge Trees).
How are final outputs of MapReduce stored?
As shards: the output is split into blocks of a more manageable size.
How does the input have to be formatted for MapReduce?
It has to be turned into key-value pairs, which is sometimes not very practical, e.g. with text, which first needs to be tokenized.
How long does it take for a job on a 1000-node cluster to run?
At least several hours: coordinating that many nodes causes significant overhead!
Explain how to count words in a document using MapReduce?
<ol><li>Map each word to (word, 1),</li><li>Reduce per word, summing the 1s to count how many occurrences there are of each.</li></ol>
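The two steps above can be sketched in Python; this single-machine version (function names are mine) makes the intermediate (word, 1) pairs explicit:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) for every token in every document.
    return [(word, 1) for doc in documents for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group the emitted 1s by word.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups

def reduce_phase(groups):
    # Reduce: one sum per word gives the final count.
    return {word: sum(ones) for word, ones in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
```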
How is MapReduce optimized? When is it possible?
Combine data (using the reduction function) when flushing or compacting the data.<br></br><br></br>This is possible when the reduction operation is associative and commutative, and the key-value types must be identical for the input and output of the reduction!
What is a split, task and a slot?
Splits of the data are delivered to slots to be processed. On delivery, each split is processed in a task. Within each slot, processing is sequential.
What is a combine task and a combine phase?
No such thing exists!
How many tasks per slot should we allocate?
There are two guidelines:<br></br><ul><li>0.95 tasks per slot, so that all tasks launch in a single wave with a little slack left over,</li><li>1.75 tasks per slot, so that faster nodes finish their first wave and are immediately assigned a second one.</li></ul>
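Applied to the whole cluster, these factors give a rule of thumb for the total task count; a small sketch of the arithmetic (the cluster sizes are made-up examples):

```python
def task_count(nodes, slots_per_node, factor):
    """Rule-of-thumb total number of tasks for a cluster, given a tasks-per-slot factor."""
    return int(factor * nodes * slots_per_node)

# Hypothetical cluster: 100 nodes with 2 slots each (200 slots total).
single_wave = task_count(100, 2, 0.95)  # one wave of tasks, a few slots to spare
two_waves = task_count(100, 2, 1.75)    # fast nodes finish and start a second wave
```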
Does a map/reduce task end up mapping exactly to one HDFS block?
No, unfortunately not. Splits usually do not align perfectly with block boundaries, so a task may have to read the start or end of a record from the previous/next HDFS block.