Lecture 2 Flashcards

The MapReduce Framework

1
Q

What is needed for cluster computing?

A
  • Raise the level of abstraction
  • View the whole cluster as a single machine
  • Develop programming frameworks on that high level
2
Q

What are the aims of cluster computing?

A
  • Communicate less
  • Read data from the local node (if that is not possible, read sequentially at large granularity)
3
Q

What rules does MapReduce follow when spreading the computation over a cluster?

A
  • Move computation to data (minimize bandwidth use)
  • Communicate in large transfers
4
Q

What is the name of the file system that MapReduce relies on?

A

The Hadoop Distributed File System (HDFS)

5
Q

How does HDFS work?

A
  • Stores files in large blocks of 64 MB and replicates each block on three machines
  • HDFS files cannot be modified, only appended to
6
Q

How does HDFS keep track of all files and blocks, and of which machines they are stored on?

A

A special master node called the namenode keeps track of the datanodes.

7
Q

How does HDFS store data on datanodes? What happens if one goes down? What is the relation between Linux and HDFS?

A

HDFS in principle stores all blocks of an HDFS file on the same three datanodes, and the datanode that originally wrote the data is always one of them.

If a datanode goes down, the namenode detects this and replicates its blocks on other datanodes to compensate.

The datanodes typically run Linux and store all HDFS data as ordinary Linux files.

8
Q

How does MapReduce framework work?

A
  • The user is expected to write a Map() and Reduce() function, both of which map (key,value) pairs into other (key,value) pairs. Both Map() and Reduce() functions can emit zero or more result pairs for each input pair.
  • The reducer receives as input (key,value*): the second parameter is the full list of all values that were emitted by the mappers for that key - so each key is handled by exactly one Reduce() call.
  • Optionally, a Combine() function may be placed in between Map() and Reduce(), which can reduce the amount of communication between mappers and reducers. Its input and output parameters are the same format as the input of the Reduce() function.
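The model above can be sketched in plain Python. This is an in-process toy runner, not a real Hadoop API; the function names (map_fn, combine_fn, reduce_fn, run_mapreduce) are illustrative, and the real framework would distribute the work across machines.

```python
from collections import defaultdict

def map_fn(key, value):
    # Input pair: (document name, document text).
    # Emits zero or more (word, 1) pairs per input pair.
    for word in value.split():
        yield (word, 1)

def combine_fn(key, values):
    # Optional local pre-aggregation between Map() and Reduce();
    # same (key, value*) signature as Reduce().
    yield (key, sum(values))

def reduce_fn(key, values):
    # Receives the full list of values emitted for this key,
    # in exactly one call.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn, combine_fn=None):
    groups = defaultdict(list)
    for k, v in inputs:
        emitted = list(map_fn(k, v))
        if combine_fn:
            # Pre-aggregate per input split to cut shuffle traffic.
            local = defaultdict(list)
            for mk, mv in emitted:
                local[mk].append(mv)
            emitted = [p for mk, mvs in local.items()
                       for p in combine_fn(mk, mvs)]
        for mk, mv in emitted:
            groups[mk].append(mv)  # the "shuffle" phase: group by key
    out = []
    for k, vs in groups.items():
        out.extend(reduce_fn(k, vs))
    return dict(out)
```

For example, `run_mapreduce([("d1", "a b a"), ("d2", "b")], map_fn, reduce_fn, combine_fn)` counts word occurrences across both documents, yielding `{"a": 2, "b": 2}`.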
9
Q

How does MapReduce optimize data movement?

A

The framework asks HDFS for the locations of the input file's blocks, then assigns each block to a mapper on a machine where that data is stored on the local disk.
