Lecture 2 Flashcards

The MapReduce Framework

1
Q

What is needed for cluster computing?

A
  • Raise the level of abstraction
  • View the whole cluster as a single machine
  • Develop programming frameworks on that high level
2
Q

What are the aims of cluster computing?

A
  • Communicate less
  • Read data from the local node (if that is not possible, read sequentially at large granularity)
3
Q

What rules does MapReduce follow when spreading the computation over a cluster?

A
  • Move computation to data (minimize bandwidth use)
  • Communicate in large transfers
4
Q

What is the name of the file system that MapReduce relies on?

A

The Hadoop Distributed File System (HDFS)

5
Q

How does HDFS work?

A
  • Stores files in large blocks of 64 MB and replicates each block on three machines
  • HDFS files cannot be modified, only appended to
6
Q

How does HDFS keep track of all files and blocks, and of which machines they are stored on?

A

A special master node called the namenode keeps track of the datanodes.

7
Q

How does HDFS store data on datanodes? What happens if one goes down? What is the relation between Linux and HDFS?

A

HDFS in principle stores all blocks of an HDFS file on the same three datanodes, and the datanode that originally wrote the data is always one of them.

If a datanode goes down, the namenode detects this and replicates its blocks on other datanodes to compensate.

The datanodes typically run Linux and store all HDFS data as ordinary Linux files.

8
Q

How does MapReduce framework work?

A
  • The user is expected to write a Map() and Reduce() function, both of which map (key,value) pairs into other (key,value) pairs. Both Map() and Reduce() functions can emit zero or more result pairs for each input pair.
  • The reducer receives as input (key,value*): the second parameter is the full list of all values that were emitted by the mappers for that key - so each key is handled by exactly one Reduce() call.
  • Optionally, a Combine() function may be placed in between Map() and Reduce(), which can reduce the amount of communication between mappers and reducers. Its input and output parameters are the same format as the input of the Reduce() function.
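The model above can be sketched in plain Python. This is an in-process toy runner, not a real Hadoop API; the function names (map_fn, combine_fn, reduce_fn, run_mapreduce) are illustrative, and the real framework would distribute the work across machines.

```python
from collections import defaultdict

def map_fn(key, value):
    # Input pair: (document name, document text).
    # Emits zero or more (word, 1) pairs per input pair.
    for word in value.split():
        yield (word, 1)

def combine_fn(key, values):
    # Optional local pre-aggregation between Map() and Reduce();
    # same (key, value*) signature as Reduce().
    yield (key, sum(values))

def reduce_fn(key, values):
    # Receives the full list of values emitted for this key,
    # in exactly one call.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn, combine_fn=None):
    groups = defaultdict(list)
    for k, v in inputs:
        emitted = list(map_fn(k, v))
        if combine_fn:
            # Pre-aggregate per input split to cut shuffle traffic.
            local = defaultdict(list)
            for mk, mv in emitted:
                local[mk].append(mv)
            emitted = [p for mk, mvs in local.items()
                       for p in combine_fn(mk, mvs)]
        for mk, mv in emitted:
            groups[mk].append(mv)  # the "shuffle" phase: group by key
    out = []
    for k, vs in groups.items():
        out.extend(reduce_fn(k, vs))
    return dict(out)
```

For example, `run_mapreduce([("d1", "a b a"), ("d2", "b")], map_fn, reduce_fn, combine_fn)` counts word occurrences across both documents, yielding `{"a": 2, "b": 2}`.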
9
Q

How does MapReduce optimize data movement?

A

The framework asks HDFS for the locations of the input file's blocks, then assigns each block to a mapper on a machine where that data is stored on the local disk.
