MapReduce Flashcards

Question 1

Q

What does the map-step in MapReduce do?

Answer

A

Processes each input file and emits its data as key-value pairs (key, value).

Question 2

Q

What does the reduce-step in MapReduce do? (Combine and reduce)

Answer

A

Combine: Accepts the key-value pairs emitted by the map phase and pre-aggregates them locally, once per input file. In the highest-value example, the combiner scans all values from one file and keeps the key with the highest value in that file.

Reduce: Receives one (key, highest value) pair per file. To avoid redundancy, it checks all pairs and eliminates duplicate entries, if any. It then applies the same highest-value search across the pairs coming from the different input files to get the overall result.
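The highest-value example above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names, the file names, and the "key,value" line format are all assumptions made for the example.

```python
def map_file(lines):
    """Map step: parse 'key,value' lines (assumed format) into pairs."""
    pairs = []
    for line in lines:
        key, value = line.split(",")
        pairs.append((key.strip(), float(value)))
    return pairs

def combine(pairs):
    """Combine step: keep the pair with the highest value in one file."""
    return max(pairs, key=lambda kv: kv[1])

def reduce_max(per_file_maxima):
    """Reduce step: drop duplicate pairs, then search across files."""
    unique = set(per_file_maxima)
    return max(unique, key=lambda kv: kv[1])

# Hypothetical input: two small files held in memory.
files = {"a.txt": ["x,1", "y,7"], "b.txt": ["z,3", "y,7"]}
local_maxima = [combine(map_file(lines)) for lines in files.values()]
best = reduce_max(local_maxima)
print(best)
```

Running the combiner per file means only one pair per file has to be sent to the reducer.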

Question 3

Q

What is the general solution to the MapReduce problem?

Answer

A

map(k,v) -> list(k1,v1)

reduce(k1,list(v1)) -> (k1,v2)
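As a rough illustration of these two signatures, the sketch below runs map, group-by-key, and reduce in a single Python process. The driver `run_map_reduce` and the word-count functions are hypothetical names chosen for the example; a real framework distributes these steps across machines.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for k, v in records:
        for k1, v1 in mapper(k, v):        # map(k, v) -> list(k1, v1)
            groups[k1].append(v1)
    # reduce(k1, list(v1)) -> (k1, v2), one call per distinct key
    return [reducer(k1, values) for k1, values in sorted(groups.items())]

def word_count_map(line_id, line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    """Sum the counts collected for one word."""
    return (word, sum(counts))

lines = [(0, "a b a"), (1, "b c")]
result = run_map_reduce(lines, word_count_map, word_count_reduce)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```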

Question 4

Q

How can MapReduce solve the matrix-vector multiplication problem if …
a) M is huge, but v is small enough to be kept in RAM
b) Both M and v are too big to be kept in RAM
where M is an n x n matrix represented by a list of (i, j, m_ij) triplets and v is a vector of length n?

Answer

A

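a) map(line_id, (i, j, m_ij)) -> [(i, m_ij*v_j)]
reduce(i, [m_i1*v_1, m_i2*v_2, …]) -> (i, sum_j(m_ij*v_j))
b) Split M into vertical stripes of k columns and v into matching chunks of length k, so that each chunk fits in RAM. Apply the algorithm from a) to each stripe with its chunk, then sum the partial results per row i.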
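Both cases can be sketched in Python as follows. This is an in-memory illustration under stated assumptions: indices are zero-based, and `matvec`, `matvec_striped`, and the tiny 2x2 input are hypothetical names and data introduced for the example.

```python
from collections import defaultdict

def matvec_map(triplet, v):
    """map: (i, j, m_ij) -> [(i, m_ij * v_j)]; v must be available locally."""
    i, j, m_ij = triplet
    return [(i, m_ij * v[j])]

def matvec_reduce(i, products):
    """reduce: sum all products that belong to row i."""
    return (i, sum(products))

def matvec(triplets, v):
    """Case a): v fits in RAM, so every mapper can look up v[j]."""
    groups = defaultdict(list)
    for t in triplets:
        for key, value in matvec_map(t, v):
            groups[key].append(value)
    return dict(matvec_reduce(i, vals) for i, vals in groups.items())

def matvec_striped(triplets, v, k):
    """Case b): handle v in chunks of length k and sum partial results."""
    totals = defaultdict(float)
    for start in range(0, len(v), k):
        # Only this chunk of v is "in RAM"; keys are the column indices.
        chunk = {j: v[j] for j in range(start, min(start + k, len(v)))}
        stripe = [t for t in triplets if t[1] in chunk]
        for i, partial in matvec(stripe, chunk).items():
            totals[i] += partial
    return dict(totals)

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [5.0, 6.0]
full = matvec(M, v)
striped = matvec_striped(M, v, k=1)
print(full, striped)
```

Both calls give the same row sums; the striped version only ever needs one chunk of v at a time.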

Question 5

Q

What is the size of one element of M or v in the MapReduce matrix problems?

Answer

A

8 bytes (64 bits)

Question 6

Q

How can MapReduce solve the matrix-matrix multiplication problem? (input: M, i, j, m_ij | N, j, k, n_jk)

P = MN is given by:
p_ik = sum_j(m_ij*n_jk)

Answer

A

for each m_ij and n_jk:
map(("M", i, j, m_ij)) -> [(j, ("M", i, m_ij))]
map(("N", j, k, n_jk)) -> [(j, ("N", k, n_jk))]

reduce(j, [("M", i, m_ij), …, ("N", k, n_jk), …])
-> emit ((i, k), m_ij*n_jk) for every combination of an M entry and an N entry in the list

To get p_ik, run another MapReduce to sum up all the terms with key (i, k)!
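The two passes can be sketched in Python as below. This is a single-process illustration, assuming zero-based indices and records shaped like the input in the question; the helper names (`group_by_key`, `join_map`, `join_reduce`, `sum_reduce`, `matmul`) are hypothetical.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle phase: collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def join_map(record):
    """Pass 1 map: key every entry by the shared index j."""
    name, a, b, value = record
    if name == "M":                      # record is ("M", i, j, m_ij)
        return [(b, ("M", a, value))]
    return [(a, ("N", b, value))]        # record is ("N", j, k, n_jk)

def join_reduce(j, values):
    """Pass 1 reduce: emit ((i, k), m_ij * n_jk) for every M/N pair."""
    m_entries = [(i, m) for tag, i, m in values if tag == "M"]
    n_entries = [(k, n) for tag, k, n in values if tag == "N"]
    return [((i, k), m * n) for i, m in m_entries for k, n in n_entries]

def sum_reduce(key, terms):
    """Pass 2 reduce: sum all terms that share the key (i, k)."""
    return (key, sum(terms))

def matmul(records):
    joined = group_by_key(p for r in records for p in join_map(r))
    terms = [t for j, vals in joined.items() for t in join_reduce(j, vals)]
    grouped = group_by_key(terms)        # pass 2 uses an identity map
    return dict(sum_reduce(key, vals) for key, vals in grouped.items())

# M = [[1, 2], [3, 4]] and N = [[5, 6], [7, 8]] as tagged triplets.
records = [("M", 0, 0, 1), ("M", 0, 1, 2), ("M", 1, 0, 3), ("M", 1, 1, 4),
           ("N", 0, 0, 5), ("N", 0, 1, 6), ("N", 1, 0, 7), ("N", 1, 1, 8)]
P = matmul(records)
print(P)
```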

Question 7

Q

What is HDFS (Hadoop Distributed File System)?

Answer

A

An open-source distributed file system (DFS) used with Hadoop.

Question 8

Q

What is Hadoop?

Answer

A

An open-source implementation of MapReduce.

Question 9

Q

What is the probability of losing data in a distributed file system such as HDFS?

Answer

A

P(data loss) = k * p^r
k = number of partitions per node * number of nodes
p = probability of losing an individual node
r = replication factor
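To make the formula concrete, here is a small Python sketch that evaluates it as stated on this card. The function name and the input numbers are illustrative assumptions, not measurements from a real cluster.

```python
def data_loss_probability(partitions_per_node, nodes, p, r):
    """Evaluate P(data loss) = k * p^r with k = partitions per node * nodes."""
    k = partitions_per_node * nodes
    return k * p ** r

# Hypothetical cluster: 100 nodes, 10 partitions each, 1% node failure
# probability, replication factor 3.
example = data_loss_probability(partitions_per_node=10, nodes=100, p=0.01, r=3)
print(example)
```

With these assumed inputs, k = 1000 and p^r = 10^-6, so the result is about 0.001. Raising r by one multiplies the result by another factor of p.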
