Big Data Lecture 08 Massive Parallel Processing I: MapReduce Flashcards

1
Q

Explain the basic paradigm of processing in MapReduce.

A

<ol><li><span>Map: [key, value] -&gt; [key, value]. A transformation is applied in parallel to all key-value pairs held in storage.</span></li><li><span>Shuffle: the intermediate pairs are sorted and grouped by key.</span></li><li><span>Reduce: [key, [values]] -&gt; [key, value]. One reduce call per key aggregates the values grouped under that key to derive information.</span></li></ol>
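
The three steps above can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions (one process, no parallelism, no disk); the names `map_reduce`, `parse_reading`, `max_per_city` and the sample records are hypothetical and not part of Hadoop.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle and reduce sequentially over (key, value) records."""
    # Map: each input pair may produce any number of intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key, visited in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def parse_reading(_line_number, line):
    """Hypothetical mapper: 'city,temperature' -> (city, temperature)."""
    city, temperature = line.split(",")
    yield (city, int(temperature))


def max_per_city(city, temperatures):
    """Hypothetical reducer: keep the highest temperature per city."""
    yield (city, max(temperatures))


records = [(0, "zurich,12"), (1, "bern,9"), (2, "zurich,17")]
result = map_reduce(records, parse_reading, max_per_city)
# result == [("bern", 9), ("zurich", 17)]
```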

2
Q

What are the data types in the different stages of MapReduce?

A

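Data is represented as key-value pairs at every stage. Before the reduce stage the pairs are grouped by key, so each reduce call receives a key together with the list of its values, but the pairs themselves are not changed by the grouping.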

3
Q

On what quantity of data can MapReduce work and on how many nodes?

A

<ul><li>TBs of data,</li><li>1000s of nodes.</li></ul>

4
Q

What is the architecture of MapReduce on Hadoop?

A

A centralized (master-worker) architecture, as in the previous cases: the main node is called the JobTracker (colocated with the NameNode) and the other nodes are called TaskTrackers (colocated with the DataNodes).

5
Q

What does ‘bringing the query to the data’ mean?

A

Queries should be executed on, or as close as possible to, the machines that store the data, instead of moving the data to the computation. In practice, this means shipping a jar to the TaskTracker.

6
Q

How are key-value pairs stored during the operation?

A

In most cases in memory; if need be, they are flushed to disk and, in the same step, compacted (merged) in the manner of Log-Structured Merge-Trees.
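
The flush-and-compact idea can be sketched as follows. This is a simplified illustration, not Hadoop's actual spill mechanism: the "disk" is simulated with in-memory lists, and the names `spill_sorted_runs`, `compact` and the `buffer_size` parameter are hypothetical.

```python
import heapq


def spill_sorted_runs(pairs, buffer_size):
    """Collect pairs in a bounded buffer; flush it as a sorted run when full."""
    runs, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) == buffer_size:
            runs.append(sorted(buffer))  # stands in for a flush to disk
            buffer = []
    if buffer:
        runs.append(sorted(buffer))  # flush whatever is left at the end
    return runs


def compact(runs):
    """Merge several sorted runs into one sorted run, as in an LSM-tree."""
    return list(heapq.merge(*runs))


pairs = [("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
runs = spill_sorted_runs(pairs, buffer_size=2)
merged = compact(runs)
# runs   == [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)], [("b", 1)]]
# merged == [("a", 1), ("a", 1), ("b", 1), ("b", 1), ("c", 1)]
```

Because every run is already sorted, the merge only needs to look at the head of each run, which is why compaction is cheap compared with re-sorting everything.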

7
Q

How are final outputs of MapReduce stored?

A

As shards: the output is split into several files of more manageable size, typically one per reduce task.
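
A rough sketch of how sharded output can arise, assuming each key is assigned to a shard by hashing. The functions `partition` and `write_shards` are hypothetical, the shards are plain in-memory lists rather than files, and the shard names merely imitate Hadoop's `part-r-00000` naming pattern.

```python
import zlib


def partition(key, num_shards):
    """Assign a key to a shard with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def write_shards(pairs, num_shards):
    """Distribute output pairs over num_shards named shards."""
    shards = {f"part-r-{i:05d}": [] for i in range(num_shards)}
    for key, value in pairs:
        name = f"part-r-{partition(key, num_shards):05d}"
        shards[name].append((key, value))
    return shards


output = [("fox", 2), ("the", 3), ("dog", 1), ("lazy", 1)]
shards = write_shards(output, num_shards=3)
# Every pair lands in exactly one shard; the same key always maps to the same shard.
```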

8
Q

How does the input have to be formatted for MapReduce?

A

It has to be turned into key-value pairs, which is sometimes impractical, e.g. with text, which first has to be tokenized.
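
As an illustration, plain text can be turned into key-value pairs by using the byte offset of each line as the key and the line itself as the value, which is the convention of Hadoop's `TextInputFormat`. The helper `text_to_pairs` below is a hypothetical sketch of that convention, not Hadoop code.

```python
def text_to_pairs(text):
    """Turn text into (byte offset, line) pairs, one pair per line."""
    pairs = []
    offset = 0
    for line in text.splitlines(keepends=True):
        pairs.append((offset, line.rstrip("\n")))
        offset += len(line.encode("utf-8"))  # advance by the line's byte length
    return pairs


pairs = text_to_pairs("to be\nor not\n")
# pairs == [(0, "to be"), (6, "or not")]
```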

9
Q

How long does a job on a 1000-node cluster take to run?

A

Several hours at least: coordinating that many nodes adds considerable overhead.

10
Q

Explain how to count words in a document using MapReduce.

A

<ol><li>Map: emit the pair (word, 1) for every word in the document,</li><li>Reduce: for each word, sum the 1s to obtain the number of occurrences.</li></ol>
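
The two steps above can be sketched in a few lines. This is a single-process illustration with naive whitespace tokenization; `map_words`, `reduce_counts` and `word_count` are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map: emit (word, 1) for every whitespace-separated word."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_counts(word, ones):
    """Reduce: sum the 1s emitted for a single word."""
    return (word, sum(ones))


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    pairs.sort(key=itemgetter(0))  # shuffle: bring equal keys together
    return dict(
        reduce_counts(word, [one for _, one in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    )


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
# counts == {"brown": 1, "dog": 1, "fox": 2, "lazy": 1, "quick": 1, "the": 3}
```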

11
Q

How is MapReduce optimized? When is it possible?

A

By combining: intermediate data is pre-aggregated with the reduce function whenever it is flushed or compacted.<br></br><br></br>This is possible when the reduce operation is associative and commutative,<br></br>and when the key-value types of the reduce function's input and output are identical.
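
A small sketch of why these conditions matter: summing gives the same result with or without combining, whereas a naive mean does not. The helpers `combine`, `reduce_all` and `mean`, and the two sample map outputs, are hypothetical.

```python
from collections import defaultdict


def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def combine(pairs, fn):
    """Pre-aggregate one map task's output locally, per key."""
    return [(key, fn(values)) for key, values in group_by_key(pairs).items()]


def reduce_all(pairs, fn):
    """Final reduce over everything that reaches the reducer."""
    return {key: fn(values) for key, values in group_by_key(pairs).items()}


def mean(values):
    return sum(values) / len(values)


map_output_a = [("x", 1), ("x", 2), ("x", 3)]  # output of one map task
map_output_b = [("x", 10)]                     # output of another map task

# Sum is associative and commutative: combining does not change the result.
sum_plain = reduce_all(map_output_a + map_output_b, sum)
sum_combined = reduce_all(combine(map_output_a, sum) + combine(map_output_b, sum), sum)

# A mean of means is not the overall mean: combining changes the result.
mean_plain = reduce_all(map_output_a + map_output_b, mean)
mean_combined = reduce_all(combine(map_output_a, mean) + combine(map_output_b, mean), mean)
# sum_plain == sum_combined == {"x": 16}
# mean_plain == {"x": 4.0}, mean_combined == {"x": 6.0}
```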

12
Q

What are a split, a task and a slot?

A

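A split is a portion of the input data; splits are delivered to slots to be processed. A task is the processing of one split. A slot is a unit of processing capacity on a node that runs one task at a time, so within each slot processing is sequential.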
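
The relationship between tasks and slots can be sketched with a toy scheduler in which each task (one per split) goes to whichever slot becomes free first. This is a simplified model, not Hadoop's scheduler; `run_tasks` and the task durations are hypothetical.

```python
import heapq


def run_tasks(durations, num_slots):
    """Assign tasks in order to the earliest-free slot; slots run sequentially."""
    # Heap of (time at which the slot becomes free, slot id).
    free_at = [(0, slot) for slot in range(num_slots)]
    heapq.heapify(free_at)
    assignment = {slot: [] for slot in range(num_slots)}
    for task_id, duration in enumerate(durations):
        time_free, slot = heapq.heappop(free_at)
        assignment[slot].append(task_id)
        heapq.heappush(free_at, (time_free + duration, slot))
    makespan = max(time_free for time_free, _ in free_at)
    return assignment, makespan


assignment, makespan = run_tasks([3, 1, 2, 2], num_slots=2)
# assignment == {0: [0, 3], 1: [1, 2]}, makespan == 5
```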

13
Q

What are a combine task and a combine phase?

A

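Neither exists! Combining is not a separate task or phase; it takes place while intermediate data is flushed or compacted.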

14
Q

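How many tasks per slot should we allocate?

A

There are two guidelines, both expressed as the number of tasks relative to the number of available slots (the Hadoop documentation states them for reduce tasks):<br></br><ul><li>0.95 tasks per slot, so that all tasks run in a single wave and almost no slots are left idle,</li><li>1.75 tasks per slot, so that faster nodes finish their first wave of tasks and are then assigned a second wave, which balances the load better.</li></ul>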
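
The arithmetic behind the two factors can be sketched as below, reading them as multipliers on the number of available slots, as the Hadoop documentation does for reduce tasks. The function names and the cluster size of 100 slots are hypothetical.

```python
import math


def num_tasks(factor, num_slots):
    """Number of tasks suggested by a factor applied to the slot count."""
    return round(factor * num_slots)


def num_waves(tasks, num_slots):
    """How many rounds are needed if each slot runs one task per round."""
    return math.ceil(tasks / num_slots)


slots = 100
single_wave = num_tasks(0.95, slots)  # 95 tasks
two_waves = num_tasks(1.75, slots)    # 175 tasks
# num_waves(95, 100) == 1 and num_waves(175, 100) == 2
```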

15
Q

Does a map/reduce task end up mapping exactly to one HDFS block?

A

No, unfortunately not. The key-value pairs usually do not line up with the block boundaries: a pair can start in one HDFS block and end in the next. This means that the task has to read from the previous/next HDFS block as well.
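
A sketch of one way to handle records that straddle block boundaries, assuming newline-terminated records and the rule that a record belongs to the split in which its first byte lies. The function `read_split`, the sample data and the block size of 8 bytes are hypothetical; this mirrors the general idea rather than Hadoop's exact implementation.

```python
def read_split(data, start, end):
    """Return the records whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Skip a record that began in the previous block: resume after the
        # first newline at or after position start - 1.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        # The record may end beyond `end`, i.e. in the next block.
        records.append(data[pos:newline])
        pos = newline + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
block_size = 8
splits = [
    read_split(data, start, min(start + block_size, len(data)))
    for start in range(0, len(data), block_size)
]
# splits == [[b"alpha", b"bravo"], [b"charlie"], [b"delta"], []]
```

Every record is read exactly once, even though "bravo", "charlie" and "delta" each cross a block boundary.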
