Module 7(a+b) - Hadoop MapReduce Flashcards

1
Q

<p>What is lambda calculus?</p>

A

<p>a formal system in mathematical logic for expressing computation based on function abstraction using variable binding and substitution</p>

2
Q

<p>What does treating functions "anonymously" mean?</p>

A

<p>Not binding the function to an identifier (a name)</p>
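
As a small illustration (a sketch in Python, which the cards themselves do not use), the same computation can be written as a named function or as an anonymous one. The names `square` and `squares` are only for this example.

```python
# A named function: the identifier `square` is bound to the function object.
def square(x):
    return x * x

# The same computation as an anonymous function (a lambda), passed
# directly to map() without ever being given a name.
squares = list(map(lambda x: x * x, [1, 2, 3]))
print(squares)  # [1, 4, 9]
```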

3
Q

<p>MapReduce performs ______ computation on ______ volumes of data</p>

A

parallel

large

4
Q

<p>In Hadoop MapReduce, are components allowed to share data arbitrarily? Why is this the case, in terms of scalability?</p>

A

<p>Components are not allowed to share data arbitrarily.<br></br>
<br></br>
The overhead required to keep data synchronized across components would hurt the system's scalability</p>

5
Q

<p>Are data elements in MapReduce immutable or mutable?</p>

A

<p>Data elements in MapReduce are immutable</p>

6
Q

<p>How does communication occur in MapReduce (with the assistance of the Hadoop system)?</p>

A

<p>By generating new outputs, which are then forwarded by the Hadoop system to the next phase of execution</p>

7
Q

<p>How many times does a SINGLE MapReduce program transform lists of input data to lists of output data?
<br></br>
<br></br>Explain</p>

A

<p>Twice. MapReduce uses two different list processing idioms: "map" and "reduce"
<br></br>
<br></br>They're inspired by functional programming paradigms</p>
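
The two idioms can be sketched with Python's built-in `map` and `functools.reduce`. This only illustrates the functional-programming idea on a single machine; it is not Hadoop code, and the sample data is made up.

```python
from functools import reduce

lines = ["hello world", "map reduce"]

# "map": apply a function to every element, producing a new list.
# The input list itself is left unchanged.
upper = list(map(str.upper, lines))

# "reduce": fold a list of values into a single output value.
total_chars = reduce(lambda acc, s: acc + len(s), lines, 0)

print(upper)        # ['HELLO WORLD', 'MAP REDUCE']
print(total_chars)  # 21
```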

8
Q

<p>If MapReduce were a black box, what would be the input and output of this box?</p>

A

<p><strong>input</strong>: lists of input data elements, as files (loaded from HDFS)<br></br>
<br></br>
<strong>output</strong>: lists of output data elements, as files (written to HDFS)</p>

9
Q

<p>The first phase of a MapReduce program is "mapping". How does it work?</p>

A

<p>A list of data elements (loaded from files in HDFS) is provided, one element at a time, to a function called the "mapper", which transforms each element individually into zero or more output data elements (often exactly one). The outputs are collected in a new list; the input list is not modified.</p>

10
Q

<p>What is the primary purpose of the "reducer" in MapReduce? How does the reducing process work?</p>

A

<p>The "reducer" lets the system aggregate values together.</p>

<p>- Reducer function receives an iterator of input values from an input list</p>

<p>- Combines these values together, returning a single output value</p>

<p>Example: compute the sum of the elements in a list</p>

11
Q

<p>Describe the overall data flow of MapReduce from the mapper to the reducer</p>

A

<p>- Mappers run on pre-loaded local input data</p>

<p>- Mappers produce intermediate data from that input</p>

<p>- Values are exchanged by the shuffle process (between the mappers and the reducers)</p>

<p>- The reducing process generates outputs from the data it receives</p>

<p>- Outputs are stored by the reducers</p>

<p>Overall, the workflow is map -> shuffle -> reduce</p>
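
The map -> shuffle -> reduce flow can be simulated in a few lines of single-process Python. This is a minimal sketch for intuition only: `run_job`, `word_mapper`, and `sum_reducer` are hypothetical names, and a real Hadoop job distributes each phase across nodes.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle phase: group together all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: reduce each key's values, visiting keys in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def word_mapper(offset, line):
    return [(word, 1) for word in line.split()]

def sum_reducer(word, counts):
    return [(word, sum(counts))]

result = run_job([(0, "a dog"), (6, "a cat")], word_mapper, sum_reducer)
print(result)  # [('a', 2), ('cat', 1), ('dog', 1)]
```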

12
Q

<p>What is the interface used to load inputs from the file system in Hadoop?</p>

A

Hadoop Distributed File System (HDFS)

13
Q

<p>Suppose we have 2 nodes that are both running a MapReduce program which reads its inputs from a file. The 2 programs are running in parallel.</p>

<p>1. Describe the detailed workflow of the program in terms of method calls for each node (around 8 steps)</p>

<p>2. At what point would the two nodes typically interact?</p>

A
  1. On both nodes, input files are read from HDFS
  2. The InputFormat divides the input into InputSplits, one per map task
  3. Each InputSplit is passed to a RecordReader, which breaks the data into (key, value) pairs
  4. The (key, value) pairs are passed to map(), which applies the user-defined function to every input (this is the mapper)
  5. The output of map() is passed to the Partitioner, which decides which reducer each pair goes to; the pairs are then shuffled across the 2 nodes (this is where they interact)
  6. On each node, the received pairs are sorted by key
  7. The sorted pairs are passed to the reducer, which combines the values for each key
  8. Outputs are written to the file system using HDFS
14
Q

What is an “InputSplit” in Hadoop MapReduce?
What does it correspond to with respect to the input file?
What does a record in a file correspond to?

A
  • An InputSplit is a unit of work assigned to one map task. It is one piece of the overall input, containing the records that the task will process
  • Usually corresponds to a chunk of an input file
  • Each record in a file belongs to exactly one input split.

The framework takes care of dealing with record boundaries

15
Q

<p>What is meant by "InputFormat" in MapReduce? What is it a factory for?</p>

A

The “InputFormat” determines how the input files are parsed, and defines the input splits (how the records are separated)

It is the factory for RecordReader objects.
Ex: TextInputFormat, SequenceFileInputFormat

16
Q

<p>What is the RecordReader in MapReduce?</p>

A

<p>The RecordReader loads data from an InputSplit and breaks it into key-value pairs for the mapper.</p>
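
A rough Python sketch of what a line-oriented record reader does: it turns the text of one split into (byte offset, line) pairs, in the style of TextInputFormat. The function name `read_records` is hypothetical, and the sketch ignores records that straddle split boundaries, which the real framework handles.

```python
def read_records(split_text):
    """Yield (byte_offset, line) pairs from the text of one split."""
    offset = 0
    for raw in split_text.splitlines(keepends=True):
        # The key is the byte offset of the line start; the value is the line.
        yield offset, raw.rstrip("\n")
        offset += len(raw.encode("utf-8"))

records = list(read_records("a dog\na cat\n"))
print(records)  # [(0, 'a dog'), (6, 'a cat')]
```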

17
Q

What is the Partitioner in MapReduce? Where is it placed in the architecture?

A
  • The Partitioner determines which partition a given key-value pair should go to.
  • Partitioner sits in between the mapper and the reducer.
  • The default partitioner simply hashes the key emitted by the mapper.
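
A hash partitioner can be sketched as "hash the key, then take it modulo the number of reducers". In this Python sketch, `zlib.crc32` is only a stand-in for the key's hash code (an assumption made so the result is stable across runs), and `partition` is a hypothetical name.

```python
import zlib

def partition(key, num_reducers):
    # crc32 stands in for the key's hash code; Python's built-in hash()
    # for strings changes between runs, so it is avoided here.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("a", 1), ("dog", 1), ("a", 1), ("cat", 1)]
buckets = {r: [] for r in range(2)}
for key, value in pairs:
    buckets[partition(key, 2)].append((key, value))

# Every pair with the same key lands in the same bucket (reducer).
print(buckets)
```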
18
Q

<p>What is the OutputFormat in MapReduce? What is it a factory for?</p>

A

The OutputFormat determines how the output files are formatted. It is a factory for RecordWriter objects.
Ex: TextOutputFormat, SequenceFileOutputFormat

19
Q

<p>What is the "RecordWriter" in MapReduce?</p>

A

Writes records (such as key-value pairs) into output files

20
Q

<p>How does Hadoop primarily achieve fault tolerance?</p>

A
  • Restarting tasks
  • Creating replicas

21
Q

<p>What is a TaskTracker and a JobTracker?</p>

A

TaskTracker - an individual task node

JobTracker - the head node of the system

22
Q

How does Hadoop know to restart tasks and maintain synchronization?

A

The TaskTrackers (individual task nodes) are in constant communication with the JobTracker (head node of the system)

The JobTracker therefore knows when a task needs to be restarted, and can reassign it accordingly

23
Q

<p>What happens if a TaskTracker fails to communicate with the JobTracker for a period of time (let's say 1 minute)?</p>

A

<p>If there is no communication for 1 minute, then the JobTracker will assume that this TaskTracker has crashed</p>
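
The timeout rule can be sketched as a table of last-heard-from times. This is a toy model in Python, not Hadoop's implementation: `JobTrackerSketch`, `TIMEOUT_SECONDS`, and the tracker ids are hypothetical, and the 60-second value is just the example from the card.

```python
TIMEOUT_SECONDS = 60  # the one-minute example above; hypothetical setting

class JobTrackerSketch:
    def __init__(self):
        self.last_heartbeat = {}  # tracker id -> time of last heartbeat

    def heartbeat(self, tracker_id, now):
        self.last_heartbeat[tracker_id] = now

    def failed_trackers(self, now):
        # A tracker that has been silent for longer than the timeout
        # is assumed to have crashed.
        return sorted(t for t, seen in self.last_heartbeat.items()
                      if now - seen > TIMEOUT_SECONDS)

jt = JobTrackerSketch()
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.heartbeat("tt1", now=50)
print(jt.failed_trackers(now=70))  # ['tt2']
```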

24
Q

<p>Suppose that MapReduce is in the mapping phase and a TaskTracker fails. What happens to the map tasks of the failed task node?</p>

A

<p>The TaskTrackers that are still running will be asked to re-execute all the map tasks that were run by the failed TaskTracker</p>

25
Q

<p>Suppose that MapReduce is in the reducing phase and a TaskTracker fails. What happens to the reduce tasks of the failed reducer?</p>

A

The other TaskTrackers that didn’t fail will re-execute the reduce tasks that were in progress on the failed TaskTracker

26
Q

Where are the input and output of MapReduce typically stored?

A

In the file system

27
Q

MapReduce must be able to tolerate faults ______ and have no ______ from restarting tasks. The system must be able to ______ handle internal components failing/restarting

A

smoothly
side-effects
autonomously

28
Q

<p>What is a typical performance problem with the Hadoop system when it comes to dividing tasks across many nodes?</p>

A

<p>It is possible for a few slow nodes to rate-limit (bottleneck) the rest of the program. These nodes are known as stragglers.</p>

29
Q

In MapReduce, individual tasks do not know where their inputs came from (they have no context). Tasks are run in isolation.

What is the reason for this?

A
  • If one node fails, it doesn’t necessarily take down the system with it
  • Since there is no context needed to execute a task, the system can assign the same task to numerous nodes in parallel and improve fault tolerance
30
Q

<p>What is the purpose of the "mapper" in MapReduce?</p>

A

<p>The mapper processes the input data. It transforms each element individually to one output data element, or sometimes zero or more output elements.</p>

<p>Ex: convert each line of input text to uppercase</p>

31
Q

<p>Sometimes Hadoop executes the same task several times in parallel when there are more compute resources than required.
Why does this happen, and when?</p>

A
  • For fault tolerance, and so that slow nodes (stragglers) do not delay the whole job.
  • As most of the tasks in a job are coming to a close, the Hadoop system will schedule redundant copies of the remaining tasks across several nodes which have no other work to do.

<p>This ensures that if one node fails or runs slowly, another can finish the task. This is known as speculative execution.</p>

32
Q

In MapReduce, does the input need to be in one file for the system to work?
What about the output? How many output files are there per reducer?

A

There is no such requirement for the system. The input can be scattered across one or more files.
Same with the output. There is 1 output file per reducer.

33
Q

<p>What does the pseudocode for the primitive mapper function (without a hashmap) look like?</p>

A

<p>mapper(position, line):</p>

<p> <strong>for each</strong> word <strong>in</strong> line:</p>

<p> <strong>emit</strong>(word, 1)</p>
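
One way to turn this pseudocode into runnable Python, with "emit" modelled as `yield`. This is a sketch for study purposes, not the Hadoop Mapper API, and splitting on whitespace is an assumption about what counts as a word.

```python
def mapper(position, line):
    # Emit (word, 1) for every word on the line; `position` is unused here.
    for word in line.split():
        yield word, 1

print(list(mapper(0, "a dog and a cat")))
# [('a', 1), ('dog', 1), ('and', 1), ('a', 1), ('cat', 1)]
```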

34
Q

<p>What exactly does the term "emit" mean in the mapper or reducer?</p>

A

<p>Create an output record associating a key with a value</p>

35
Q

<p>What does the pseudocode for the primitive reducer (without a hashmap) look like?</p>

A

<p>reducer (word, values):<br></br>
sum := 0<br></br>
<strong>for each</strong> value <strong>in</strong> values:<br></br>
sum := sum + value<br></br>
<strong>emit</strong> (word, sum)</p>
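
The same reducer as runnable Python, again with "emit" modelled as `yield`. It is a sketch rather than the Hadoop Reducer API; `total` replaces `sum` only to avoid shadowing Python's built-in.

```python
def reducer(word, values):
    # `values` is an iterator over all counts emitted for `word`.
    total = 0
    for value in values:
        total += value
    yield word, total

print(list(reducer("a", iter([1, 1, 1]))))  # [('a', 3)]
```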

36
Q

<p>The mapper can aggregate the frequency for each word in a document using a hashmap (associative array). Write the pseudocode for the mapper which emits a word with its frequency in the document.</p>

<p>What is the tradeoff with using this approach?</p>

A

(pseudocode in the image)

Tradeoff: this uses more memory than the primitive approach (one hashmap entry for every distinct word)
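
One possible version of such a mapper, sketched in Python. It is not taken from the card's image; `Counter` plays the role of the hashmap, and treating the whole document as one whitespace-separated string is an assumption.

```python
from collections import Counter

def mapper(position, document):
    # Count every word of the document in memory first...
    counts = Counter(document.split())
    # ...then emit each distinct word once, with its frequency.
    for word, count in counts.items():
        yield word, count

print(dict(mapper(0, "a dog a cat a bird")))
# {'a': 3, 'dog': 1, 'cat': 1, 'bird': 1}
```

Compared with emitting (word, 1) for each occurrence, this sends fewer pairs to the shuffle, at the cost of holding the hashmap in memory.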

37
Q

What is the purpose of a combiner with respect to the mapper?
When is it executed in a typical MapReduce program?

A

A combiner is used to aggregate counters across all the documents processed by a map task. It is similar in purpose to the mapper with a hashmap (associative array).

The combiner is executed after the map stage and before the shuffle phase

38
Q

<p>In MapReduce, the combiner is used to aggregate counters across words in the document (similar to a frequency map). Write the pseudocode for the combiner, assuming that it interfaces with the primitive mapper</p>

A

<p>The combiner and Mapper classes are included</p>
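
A hedged Python sketch of how a combiner could sit behind the primitive mapper. The helper `run_map_task` is hypothetical and only simulates what happens locally on one map task before the shuffle; in word count the combiner has the same logic as the reducer.

```python
from collections import defaultdict

def mapper(position, line):
    for word in line.split():
        yield word, 1

def combiner(word, values):
    # Same logic as the word-count reducer, applied to one map task's output.
    yield word, sum(values)

def run_map_task(lines):
    # Group this task's mapper output by word, then combine locally.
    local = defaultdict(list)
    for position, line in enumerate(lines):
        for word, one in mapper(position, line):
            local[word].append(one)
    out = []
    for word in sorted(local):
        out.extend(combiner(word, local[word]))
    return out

print(run_map_task(["a dog", "a cat"]))  # [('a', 2), ('cat', 1), ('dog', 1)]
```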

39
Q

<p>What does "selection" do with regard to the mapper function in MapReduce?</p>

<p>Is a reducer required? How does it work?</p>

<p>What does the pseudocode look like?</p>

A
  • Selection returns a subset of the input elements that satisfy some predicate (ex: x < 10). It is basically a filtering scheme
  • Only a mapper is required (no reducer is needed).
  • The framework will generate one output file per map task.
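
Selection can be sketched as a map-only filter. In this Python sketch, `selection_mapper` and its `predicate` parameter are hypothetical names, and the predicate x < 10 is the example from the card.

```python
def selection_mapper(key, value, predicate=lambda x: x < 10):
    # Emit the element unchanged if it satisfies the predicate,
    # otherwise emit nothing.
    if predicate(value):
        yield key, value

inputs = [(0, 3), (1, 42), (2, 9), (3, 10)]
selected = [pair for k, v in inputs for pair in selection_mapper(k, v)]
print(selected)  # [(0, 3), (2, 9)]
```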
40
Q

What does Projection do in MapReduce?
What is the Reducer needed for in this context?
What does the pseudocode look like?

A
  • Projection returns a subset of the fields of each input element (ex: [x,y,z] to [x,y] )
  • The reducer is needed to eliminate duplicates
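
A small Python sketch of projection with duplicate elimination. The function names are hypothetical, and the grouping loop only stands in for the shuffle. The mapper emits the projected tuple as the key, so identical projections meet at the same reducer, which emits each one once.

```python
def projection_mapper(key, record):
    x, y, z = record
    # Keep only the fields (x, y); the value carries no information.
    yield (x, y), None

def projection_reducer(projected, values):
    # All duplicates of `projected` arrive together; emit it a single time.
    yield projected

records = [(1, 2, 3), (1, 2, 4), (5, 6, 7)]
groups = {}
for i, rec in enumerate(records):
    for k, v in projection_mapper(i, rec):
        groups.setdefault(k, []).append(v)

result = [out for k in sorted(groups) for out in projection_reducer(k, groups[k])]
print(result)  # [(1, 2), (5, 6)]
```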
41
Q

What is the cross-correlation problem statement?
If the input is size N, what would be the output size?
Is MapReduce helpful?

A
  • Given a set of tuples of items, for each possible pair of items, calculate the number of tuples in which those two items co-occur.
  • For an input of size N, the output will be size N^2
  • MapReduce allows this problem to be scaled out
42
Q

<p>The cross-correlation problem has 2 approaches. One of them is faster and one of them is slower. What are they?</p>

A

<p>Pairs Approach: Slower, and uses less memory</p>

<p>Stripes Approach: Faster and uses more memory</p>

43
Q

<p>Describe the "Stripes" approach to solving the cross-correlation problem</p>

A

<p>The faster approach. For each item, it builds an associative array (hashmap) holding the counts of the items that co-occur with it, so it requires more memory. It is more complicated than the Pairs approach.</p>
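
A minimal Python sketch of the Stripes idea on two small tuples, "a dog" and "a cat". The function names are hypothetical, the grouping loop stands in for the shuffle, and the sketch assumes the items within a tuple are distinct and that both orders of a pair are counted.

```python
from collections import Counter, defaultdict

def stripes_mapper(_, tuple_text):
    items = tuple_text.split()
    for i, item in enumerate(items):
        # The "stripe" for `item`: counts of every other item in the tuple.
        stripe = Counter(items[:i] + items[i + 1:])
        yield item, stripe

def stripes_reducer(item, stripes):
    # Merge all stripes for `item` by adding counts element-wise.
    total = Counter()
    for s in stripes:
        total.update(s)
    yield item, dict(total)

grouped = defaultdict(list)
for n, text in enumerate(["a dog", "a cat"]):
    for item, stripe in stripes_mapper(n, text):
        grouped[item].append(stripe)

result = {k: v for item in sorted(grouped)
          for k, v in stripes_reducer(item, grouped[item])}
print(result)
# {'a': {'dog': 1, 'cat': 1}, 'cat': {'a': 1}, 'dog': {'a': 1}}
```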

44
Q

<p>Describe the Pairs approach to solve the cross-correlation problem</p>

A

<p>A simple approach that emits each co-occurring pair individually and keeps no in-memory data structure. It is slow.</p>
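
A minimal Python sketch of the Pairs idea on two small tuples, "a dog" and "a cat". The function names are hypothetical, the grouping loop stands in for the shuffle, and counting both orders of each pair is an assumption.

```python
from collections import defaultdict
from itertools import permutations

def pairs_mapper(_, tuple_text):
    items = tuple_text.split()
    # Emit ((a, b), 1) for every ordered pair of distinct positions.
    for a, b in permutations(items, 2):
        yield (a, b), 1

def pairs_reducer(pair, counts):
    yield pair, sum(counts)

grouped = defaultdict(list)
for n, text in enumerate(["a dog", "a cat"]):
    for pair, one in pairs_mapper(n, text):
        grouped[pair].append(one)

result = [out for pair in sorted(grouped)
          for out in pairs_reducer(pair, grouped[pair])]
print(result)
# [(('a', 'cat'), 1), (('a', 'dog'), 1), (('cat', 'a'), 1), (('dog', 'a'), 1)]
```

Each pair travels through the shuffle as its own record, which is why this approach needs little memory but moves more data.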

45
Q

<p>What are the <strong>inputs</strong> and <strong>outputs</strong> of the mapper and reducer steps when solving the cross-correlation problem for the following 2 tuples (as files) using the <b>Pairs Approach</b>:</p>

R1 = “a dog”
R2 = “a cat”

A

.

46
Q

<p>What are the <strong>inputs</strong> and <strong>outputs</strong> of the mapper and reducer steps when solving the cross-correlation problem for the following 2 tuples (as files) using the <strong>Stripes Approach</strong>:</p>

<p>R1 = "a dog"</p>

<p>R2 = "a cat"</p>

A

.

47
Q

<p>What are the <strong>inputs</strong> and <strong>outputs</strong> of the mapper and reducer steps when solving the cross-correlation problem for the following 2 tuples (as files) using the <strong>Stripes Approach</strong>:</p>

R1 = “a big dog”
R2 = “a small cat”

A

<p></p>

48
Q

<p>What are the <strong>inputs</strong> and <strong>outputs</strong> of the mapper and reducer steps when solving the cross-correlation problem for the following 2 tuples (as files) using the <strong>Pairs Approach</strong>:</p>

<p>R1 = "a big dog"</p>

<p>R2 = "a small cat"</p>

A
49
Q

The Hadoop MapReduce framework takes care of ______ tasks, ______ them and ______ the failed tasks

A

scheduling
monitoring
re-executing