Module 7(a+b) - Hadoop MapReduce Flashcards
<p>What is lambda calculus?</p>
<p>a formal system in mathematical logic for expressing computation based on function abstraction using variable binding and substitution</p>
<p>What does treating functions "anonymously" mean?</p>
<p>Not binding the function to an identifier (a name)</p>
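A quick illustration (Python used as a stand-in language, since these cards don't fix one): the function itself has no name, and binding it to an identifier is optional.

```python
# An anonymous (lambda) function is an expression, not a bound name.
square = lambda x: x * x  # the lambda itself is nameless; "square" is an optional binding

# Used inline, it is never bound to an identifier at all:
doubled = list(map(lambda x: 2 * x, [1, 2, 3]))  # [2, 4, 6]
```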
<p>MapReduce performs _______ computation on _______ volumes of data</p>
parallel
large
<p>In Hadoop MapReduce, are components allowed to share data arbitrarily? why is it like this in terms of scalability?</p>
<p>Components are not allowed to share data arbitrarily.<br></br>
<br></br>
The overhead required to keep data synchronized across components would hurt the system's scalability</p>
<p>Are data elements in MapReduce immutable or mutable?</p>
<p>Data elements in MapReduce are immutable</p>
<p>How does communication occur in MapReduce? (with the assistance of the hadoop system)</p>
<p>By generating new outputs, which are then forwarded by the Hadoop system to the next phase of execution</p>
<p>How many times does a SINGLE MapReduce program transform lists of input data to lists of output data?
<br></br>
<br></br>Explain</p>
<p>Twice. MapReduce uses two different list processing idioms: "map" and "reduce"
<br></br>
<br></br>They're inspired by functional programming paradigms</p>
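These two idioms can be sketched directly with Python's built-ins (a stand-in for the functional languages that inspired MapReduce):

```python
from functools import reduce

data = [1, 2, 3, 4]

# "map": apply a function to every element of a list, producing a new list
mapped = list(map(lambda x: x * x, data))        # [1, 4, 9, 16]

# "reduce": fold a list of values down to a single result
total = reduce(lambda acc, x: acc + x, mapped)   # 30
```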
<p>If MapReduce was a black box, what would be the input and output of this box?</p>
<p><strong>input</strong>: lists of input data elements, as a file (which is loaded using HDFS)<br></br>
<br></br>
<strong>output</strong>: lists of output data elements, as a file (which is generated using HDFS)</p>
<p>The first phase of a MapReduce program is "mapping" how does it work?</p>
<p>A list of data elements is provided (loaded from a file using HDFS), one element at a time, to a function called the "mapper", which transforms each element individually into one output data element (or sometimes zero or more outputs). It does this by applying a function to each element in the list and storing the results in an output list</p>
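A minimal mapper sketch (Python as a stand-in for Hadoop's Java API), showing the zero/one/many outputs per input element using the classic word-count example:

```python
# A mapper may emit zero, one, or several outputs per input element.
# Word-count style: each input line yields one (word, 1) pair per word.
def mapper(line):
    for word in line.split():       # an empty line yields zero outputs
        yield (word.lower(), 1)     # one output per word

pairs = list(mapper("the cat the"))  # [("the", 1), ("cat", 1), ("the", 1)]
```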
<p>What is the primitive purpose of the "reducer" in MapReduce? How does the reducing process work?</p>
<p>The "reducer" lets the system aggregate values together.</p>
<p>- Reducer function receives an iterator of input values from an input list</p>
<p>- Combines these values together, returning a single output value</p>
<p>Example: compute the sum of elements in a list</p>
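The sum example from this card, sketched in Python, with the reducer's signature mirroring Hadoop's key-plus-iterator-of-values convention:

```python
# A reducer receives a key and an iterator over all values for that key,
# and combines them into a single output value.
def reducer(key, values):
    return (key, sum(values))

result = reducer("count", iter([1, 2, 3, 4]))  # ("count", 10)
```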
<p>Describe the overall data flow of MapReduce from the mapper to the reducer</p>
<p>- Each mapper pre-loads its local input data</p>
<p>- Mapping generates intermediate (key, value) data from the inputs</p>
<p>- Intermediate values are exchanged and shuffled (in between mapper and reducer)</p>
<p>- The reducing process generates the outputs from the intermediate data it receives</p>
<p>- Outputs from the reducers are stored (written back to the file system)</p>
<p>Overall, the workflow is map -> shuffle -> reduce</p>
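The whole map -> shuffle -> reduce flow can be simulated in a few lines (a Python word-count sketch; in real Hadoop the shuffle happens across machines, not inside one process):

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # group every value emitted for the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return (key, sum(values))

lines = ["a b a", "b a"]
pairs = [kv for line in lines for kv in mapper(line)]
output = sorted(reducer(k, vs) for k, vs in shuffle(pairs).items())
# output == [("a", 3), ("b", 2)]
```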
<p>What is the interface that is used for inputs to be loaded from the file system in Hadoop?</p>
Hadoop Distributed File System (HDFS)
<p>Suppose we have 2 nodes which are both running a MapReduce program that reads its inputs from a file. These 2 programs are running in parallel.</p>
<p>1. Describe the detailed workflow of the program in terms of method calls for each node (around 8 steps)</p>
<p>2. The point at which they would typically interact</p>
- For both Nodes, files are loaded into input using HDFS
- Inputs are split up using split(), which converts each file into an array of InputSplits
- Each split is passed into a RecordReader(), which breaks up the data into (key, value) pairs
- The (key, value) pairs are passed into the map() function, which applies the lambda function to all the inputs (this is the mapper)
- The output of the map function is passed into the partitioner, which shuffles the outputs across the 2 nodes (this is where they interact)
- The output is passed into the sort() method to organize the data
- The sorted output is passed into the reducer to combine the clusters of outputs
- Outputs are written to the filesystem using HDFS
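The interaction point above (the partitioner shuffling mapper outputs across the 2 nodes) can be sketched as follows; the simple character-sum hash is a hypothetical stand-in for the real hash function, chosen because Python's built-in `hash()` is salted per run for strings:

```python
NUM_NODES = 2

def partition(key, num_partitions=NUM_NODES):
    # stand-in hash: stable across runs, unlike Python's salted hash() for str
    return sum(ord(c) for c in key) % num_partitions

mapper_output = [("apple", 1), ("bee", 1), ("apple", 1)]
buckets = {0: [], 1: []}
for key, value in mapper_output:
    buckets[partition(key)].append((key, value))
# every pair with the same key lands in the same bucket,
# so a single reducer sees all values for that key
```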
What is an “inputSplit” in Hadoop MapReduce?
What does it correspond to with respect to the input file?
What does a record in a file correspond to?
- InputSplit is a unit of work which is assigned to one map task. It is simply an element in the list of items that is passed in as input
- Usually corresponds to a chunk of an input file
- Each record in a file corresponds to exactly one input split.
The framework takes care of dealing with record boundaries
<p>What is meant by "inputFormat" in MapReduce? What is it a factory for?</p>
The “inputFormat” determines how the input files are parsed, and defines the input splits (how the records are separated)
It is the factory for RecordReader objects.
Ex: TextInputFormat, SequenceFileInputFormat
<p>What is the RecordReader in MapReduce?</p>
<p>RecordReader loads data from an InputSplit and creates key-value pairs for the mapper (breaks the data into key-value pairs).</p>
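For example, TextInputFormat's line-oriented RecordReader produces (byte offset, line text) pairs; a Python sketch of that behavior:

```python
def line_record_reader(split_text):
    # key = offset of the line within the split, value = the line's text
    offset = 0
    for line in split_text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

records = list(line_record_reader("hello world\nfoo bar\n"))
# records == [(0, "hello world"), (12, "foo bar")]
```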
What is the Partitioner in MapReduce? Where is it placed in the architecture?
- The Partitioner determines which partition a given key-value pair should go to.
- Partitioner sits in between the mapper and the reducer.
- The default partitioner simply hashes the key emitted by the mapper.
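The default rule is hash(key) mod R, with R the number of reduce tasks (Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks). A Python sketch using a Java-style string hash so the result is reproducible:

```python
def toy_hash(key):
    # Java-style string hash, masked to stay non-negative
    h = 0
    for c in key:
        h = (31 * h + ord(c)) & 0x7FFFFFFF
    return h

def hash_partition(key, num_reduce_tasks):
    return toy_hash(key) % num_reduce_tasks

# the same key always routes to the same reducer partition
```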
<p>What is the OutputFormat in MapReduce? What is it a factory of?</p>
The OutputFormat determines how the output files are formatted. It is a factory for RecordWriter objects.
Ex: TextOutputFormat, SequenceFileOutputFormat
<p>What is the "RecordWriter" in MapReduce?</p>
Writes records (such as key-value pairs) into output files