MapReduce Flashcards

1
Q

What does the map-step in MapReduce do?

A

Processes each input file and emits the data as key-value pairs (k, v).

2
Q

What does the reduce-step in MapReduce do? (Combine and reduce)

A

Combine: Accepts the output of the Map phase as key-value pairs. Using a search technique, the combiner checks all the values to find the key corresponding to the highest value in each file.

Reduce: Receives, from each file, the key corresponding to the highest value. To avoid redundancy, it checks all the pairs and eliminates any duplicate entries. The same search is then applied across the pairs coming from the different input files.
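
The answer above describes a find-the-maximum job. A minimal in-memory sketch of that flow follows; the file names, keys, and values are made up for illustration, and the functions `combine` and `reduce_max` are hypothetical helpers, not part of any MapReduce framework.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, int]

def combine(pairs: List[Pair]) -> Pair:
    """Per-file combiner: return the pair with the highest value."""
    return max(pairs, key=lambda kv: kv[1])

def reduce_max(local_maxima: List[Pair]) -> Pair:
    """Reducer: drop duplicate pairs, then run the same max search."""
    unique = list(dict.fromkeys(local_maxima))  # keeps first-seen order
    return max(unique, key=lambda kv: kv[1])

# Hypothetical map output, one list of (key, value) pairs per input file.
files: Dict[str, List[Pair]] = {
    "file_a": [("alice", 30), ("bob", 45)],
    "file_b": [("carol", 52), ("dave", 12)],
    "file_c": [("bob", 45), ("erin", 7)],
}

local_maxima = [combine(pairs) for pairs in files.values()]
overall = reduce_max(local_maxima)
```

Here `local_maxima` holds one winner per file (with `("bob", 45)` appearing twice), and `overall` is the single pair with the highest value after deduplication.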

3
Q

What is the general form of a MapReduce solution?

A

map(k,v) -> list(k1,v1)

reduce(k1,list(v1)) -> (k1,v2)
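
As an illustration of these two signatures, here is a small word-count sketch. The driver `run_mapreduce` is a hypothetical single-process stand-in for the framework's shuffle step, and the input documents are invented.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

def map_fn(doc_id: str, text: str) -> List[Tuple[str, int]]:
    """map(k, v) -> list(k1, v1): emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]

def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    """reduce(k1, list(v1)) -> (k1, v2): sum the counts for one word."""
    return (word, sum(counts))

def run_mapreduce(inputs: Iterable[Tuple[str, str]],
                  mapper: Callable, reducer: Callable) -> list:
    """Toy driver: group mapper output by key, then call the reducer."""
    groups = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in mapper(k, v):
            groups[k1].append(v1)
    return [reducer(k1, vs) for k1, vs in sorted(groups.items())]

docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

A real framework would run the mappers and reducers on different machines; only the grouping-by-key contract is the same.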

4
Q

How can MapReduce solve the matrix-vector multiplication problem if …
a) M huge, v small enough to be kept in RAM
b) Both M and v too big to be kept in RAM
Where M is the nxn matrix represented by a list of
(i,j, m_ij) triplets and v is the n-length vector?

A

a) map(line_id, (i, j, m_ij)) -> [i, m_ij*v_j]
reduce(i, [m_i1*v_1, m_i2*v_2, …]) -> (i, sum_j(m_ij*v_j))
b) Split M into stripes of k columns and v into chunks of length k and apply algorithm from a), summing up the results
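
Both cases can be sketched in a single process as below. This assumes 0-based indices and simulates the stripes by filtering one in-memory triplet list, whereas a real job would store each stripe and chunk separately; the function names are hypothetical.

```python
from collections import defaultdict

def matvec_mapreduce(triplets, v):
    """Case a): v fits in memory. v may be a list or a dict indexed by j."""
    groups = defaultdict(list)
    for i, j, m_ij in triplets:          # map: emit (i, m_ij * v_j)
        groups[i].append(m_ij * v[j])
    # reduce: sum all terms that share the row index i
    return {i: sum(terms) for i, terms in groups.items()}

def matvec_striped(triplets, v, k):
    """Case b): process k columns of M with the matching chunk of v."""
    n = len(v)
    result = defaultdict(float)
    for start in range(0, n, k):
        chunk = {j: v[j] for j in range(start, min(start + k, n))}
        stripe = [(i, j, m) for i, j, m in triplets if j in chunk]
        for i, partial in matvec_mapreduce(stripe, chunk).items():
            result[i] += partial         # sum the per-stripe results
    return dict(result)

# M = [[1, 2], [3, 4]] as (i, j, m_ij) triplets, v = [10, 20]
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]
full = matvec_mapreduce(M, v)
striped = matvec_striped(M, v, k=1)
```

Both calls produce the same row sums; the striped version only ever needs one chunk of v at a time.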

5
Q

What is the size of one element of M or v in the MapReduce matrix problems?

A

8 bytes (64 bits)

6
Q

How can MapReduce solve the matrix-matrix multiplication problem? (input: M, i, j, m_ij | N, j, k, n_jk)

P = MN is given by:
p_ik = sum_j(m_ij*n_jk)

A

for each m_ij and n_jk:
map(("M", i, j, m_ij)) -> [j, ("M", i, m_ij)]
map(("N", j, k, n_jk)) -> [j, ("N", k, n_jk)]

reduce(j, [("M", i, m_ij), ("N", k, n_jk)])
-> emit([(i, k), m_ij*n_jk]), for all possible (i, k)

To get p_ik, run another MapReduce to sum up all the terms with key (i, k)!
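
The two passes can be sketched in one process as follows. This is an illustrative simulation with 0-based indices, not Hadoop code; `matmul_mapreduce` is a hypothetical name, and M and N are given as (i, j, m_ij) and (j, k, n_jk) triplets.

```python
from collections import defaultdict

def matmul_mapreduce(M, N):
    # Pass 1, map: key every element of M and N by the shared index j.
    by_j = defaultdict(lambda: ([], []))
    for i, j, m_ij in M:
        by_j[j][0].append((i, m_ij))
    for j, k, n_jk in N:
        by_j[j][1].append((k, n_jk))

    # Pass 1, reduce: for each j, emit ((i, k), m_ij * n_jk) for all pairs.
    products = []
    for j, (m_items, n_items) in by_j.items():
        for i, m_ij in m_items:
            for k, n_jk in n_items:
                products.append(((i, k), m_ij * n_jk))

    # Pass 2: group by (i, k) and sum the terms to obtain p_ik.
    sums = defaultdict(float)
    for key, term in products:
        sums[key] += term
    return dict(sums)

# M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]]
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
N = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
P = matmul_mapreduce(M, N)
```

The intermediate `products` list is what the first job would write out and the second job would read back in.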

7
Q

What is HDFS (Hadoop Distributed File System)?

A

An open-source distributed file system (DFS) used with Hadoop.

8
Q

What is Hadoop?

A

An open-source implementation of MapReduce.

9
Q

What is the probability of losing data in, for example, HDFS?

A

P(data loss) = k * p^r
k = number of partitions per node * number of nodes
p = probability of losing an individual node
r = replication factor
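
A quick worked example of the formula follows. The cluster numbers are invented for illustration, and the result is best read as a rough estimate for small p, since k * p^r can exceed 1 for large k.

```python
def data_loss_probability(partitions_per_node: int, nodes: int,
                          p: float, r: int) -> float:
    """Evaluate P(data loss) = k * p^r with k = partitions/node * nodes."""
    k = partitions_per_node * nodes
    return k * p ** r

# Hypothetical cluster: 100 nodes, 256 partitions each,
# node-loss probability 0.001, replication factor 3.
estimate = data_loss_probability(256, 100, p=0.001, r=3)

# Same cluster with replication factor 2, for comparison.
estimate_r2 = data_loss_probability(256, 100, p=0.001, r=2)
```

With these made-up numbers the estimate is roughly 2.56e-5 for r = 3 and roughly 0.0256 for r = 2, which shows how strongly the replication factor dominates.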
