Map Reduce & Spark Flashcards

1
Q

What is a Resilient Distributed Dataset (RDD)?

A
Fundamental abstraction in Spark. An RDD is a collection of
elements of the same type, partitioned and distributed across
several machines (if available).
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the two types of operations that can be performed on an RDD A?

A

• Narrow transformation: each partition of A contributes to
(at most) one partition of B, which is stored in the same machine.
No shuffling of data across machines is needed (⇒ maximum parallelism).

E.g. functions: mapToPair(f), flatMapToPair(f), mapValues(f), flatMapValues(f).

• Wide transformation: each partition of A may contribute to many partitions of B.
Hence, shuffling of data across machines may be required.

E.g. functions: groupByKey(), groupBy(f), reduceByKey(f).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How can you implement a MapReduce round in Spark?

A

• Map Phase: on the input RDD X invoke one of map
methods offered by Spark (narrow transformation) passing the
desired map function as argument.

• Reduce Phase: on the RDD X′ resulting from the Map Phase:
1. invoke one of grouping methods offered by Spark (wide
transformation) to group the key-value pairs into
key-ListOfValues pairs;
2. invoke one of map methods offered by Spark to apply the
desired reduce function to each key-ListOfValues pair.

      (the reduceByKey method allows you to do both steps at once)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

The analysis of an MR algorithm aims at estimating which key

performance indicators?

A

• Number of rounds R.

• Local space ML: maximum amount of main memory required by a single invocation of a map or reduce function, for storing the input and any data structure needed by the invocation. The maximum is
taken over all rounds and all invocations in each round.

• Aggregate space MA: maximum amount of (disk) space which is
occupied by the stored data at the beginning/end of a map or
reduce phase. The maximum is taken over all rounds.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the main design goals of MapReduce algorithms?

A
  • Few rounds (e.g., R = O (1));
  • Sublinear local space (e.g., ML = O (|input|^ϵ), with ϵ < 1);
  • Linear aggregate space (i.e., MA = O (|input|)), or only slightly superlinear;
  • Low complexity of each map or reduce function.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the main goal of the “partitioning” technique in the Map Reduce framework?

A

One of the main goal in the design of efficient MapReduce
algorithms is to avoid that individual reducers process large
amounts of data. To achieve the goal, the relevant data should be
partitioned, either at random or deterministically, so to make
reducers work separately in individual partitions (which may require more rounds).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

A typical way of partitioning data is by mapping each pair (K, V) to (x, (K, V)) where x is a RANDOM integer between [0, L) and then grouping by key; obtaining in expectation m = O(L) where m is the local space required by each reducer.

With what probability this is true?

A

P >= 1 - 1/N^5

Study the theorem on the slides!

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is a typical way of partitioning data if we have an RDD of indexed pairs (i, (K, V)) with i=0, …, n ?

A

By mapping each pair (i, (K, V)) to (i mod √N, (K, V)) with i=0, …, n and then grouping by key obtaining √N reducers each of size √N.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly