Big Data Lecture 08 Massive Parallel Processing I_ Map Reduce Flashcards
Explain the basic paradigm of processing in MapReduce.
<ol><li><span>Map: [key, value] -> [key, value]; apply a transformation in parallel to all the key-value pairs in your storage,</span></li><li><span>Shuffle: sort and group the data by key,</span></li><li><span>Reduce: [key, [values]] -> [key, value]; one job per key, apply a transformation to each key's values to derive information.</span></li></ol>
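The three phases can be simulated on a single machine. A minimal Python sketch (the function names and the max-temperature example are illustrative, not from the lecture):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(pairs, map_fn, reduce_fn):
    """Toy single-machine simulation of the three MapReduce phases."""
    # Map: transform each input key-value pair independently (parallelizable).
    mapped = [kv for pair in pairs for kv in map_fn(*pair)]
    # Shuffle: sort and group all intermediate pairs by key.
    mapped.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g]) for k, g in groupby(mapped, key=itemgetter(0)))
    # Reduce: one call per key over the list of its values.
    return [reduce_fn(k, vs) for k, vs in grouped]

# Example: maximum temperature per city. Map emits (city, temp); reduce takes the max.
result = map_reduce(
    [(1, ("zrh", 20)), (2, ("zrh", 25)), (3, ("ber", 18))],
    lambda _line_no, v: [v],
    lambda city, temps: (city, max(temps)),
)
```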
What are the data types in the different stages of MapReduce?
They are always key-value pairs; only for the reducer are the values grouped by key, but even then they remain key-value pairs.
On what quantity of data can MapReduce work and on how many nodes?
<ul><li>TBs of data,</li><li>1000s of nodes.</li></ul>
What is the architecture of MapReduce on Hadoop?
A centralized architecture as in previous cases: the main node is called the JobTracker (typically running on the same machine as the NameNode) and the other nodes are called TaskTrackers (on the DataNodes).<br></br>
What does ‘bringing query to the data’ mean?
Queries should be executed as close to the data as possible. In practice, this means shipping a jar to the TaskTrackers.
How are key-values pairs stored during the operation?
In most cases in memory, but if need be they can be flushed to disk and compacted at the same time (using Log-Structured Merge Trees).
How are final outputs of MapReduce stored?
As shards: the output is split into blocks of a more manageable size.
How does the input have to be formatted for MapReduce?
It has to be turned into key-value pairs, which is sometimes not very practical, e.g. with text, which first needs to be tokenized.
How long does it take for a job on a 1000-node cluster to run?
At least several hours: coordinating that many nodes causes significant overhead!
Explain how to count words in a document using MapReduce?
<ol><li>Map each word to (word, 1),</li><li>Reduce per word, summing the 1s to count how many occurrences there are of each.</li></ol>
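The two steps above can be sketched in Python; this single-machine version (function names are mine) makes the intermediate (word, 1) pairs explicit:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) for every token in every document.
    return [(word, 1) for doc in documents for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group the emitted 1s by word.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups

def reduce_phase(groups):
    # Reduce: one sum per word gives the final count.
    return {word: sum(ones) for word, ones in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
```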
How is MapReduce optimized? When is it possible?
Combine data (using the reduction function) when flushing or compacting the data.<br></br><br></br>This is possible when the reduction operation is associative and commutative, and the key-value types must be identical for the input and output of the reduction!
What is a split, task and a slot?
Splits of the data are delivered to slots to be processed. On delivery, each split is processed in a task. Within each slot, processing is sequential.
What is a combine task and a combine phase?
No such thing exists!
How many tasks per slot should we allocate?
There are two guidelines:<br></br><ul><li>0.95 tasks per slot, so that all tasks launch in a single wave with a little slack left over,</li><li>1.75 tasks per slot, so that faster nodes finish their first wave and are immediately assigned a second one.</li></ul>
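Applied to the whole cluster, these factors give a rule of thumb for the total task count; a small sketch of the arithmetic (the cluster sizes are made-up examples):

```python
def task_count(nodes, slots_per_node, factor):
    """Rule-of-thumb total number of tasks for a cluster, given a tasks-per-slot factor."""
    return int(factor * nodes * slots_per_node)

# Hypothetical cluster: 100 nodes with 2 slots each (200 slots total).
single_wave = task_count(100, 2, 0.95)  # one wave of tasks, a few slots to spare
two_waves = task_count(100, 2, 1.75)    # fast nodes finish and start a second wave
```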
Does a map/reduce task end up mapping exactly to one HDFS block?
No, unfortunately not. Splits usually do not align perfectly with block boundaries, so a task may have to read the start or end of a record from the previous/next HDFS block.