MapReduce/Spark Flashcards
MapReduce Intro
- Programming framework for big data processing on distributed platforms.
- It comprises:
  * A programming model
  * An implementation tailored towards a cluster-based computing environment.
MapReduce Features
- Data-centric view
- Inspired by functional programming (map and reduce functions)
- Messy implementation details hidden from the programmer
Typical cluster architecture
- Racks of 16-64 compute nodes connected by fast switches
- Distributed File System (DFS)
  * Files divided into chunks
  * Each chunk replicated on different nodes to ensure fault tolerance
MapReduce-Hadoop-Spark
* Apache Hadoop: a popular implementation of MapReduce. The Hadoop Distributed File System (HDFS) is still widely used.
* Apache Spark: supports the implementation of MapReduce computations, but provides a much richer programming environment. It can use HDFS for storage.
MapReduce computation
A MapReduce computation is a sequence of rounds. Each round transforms a set of key-value pairs into another set of key-value pairs through the following phases (see the sketch below):
1. Map phase: a user-specified map function is applied independently to each input pair, producing >= 0 intermediate key-value pairs.
2. Shuffle: the intermediate pairs are grouped by key: for each key k, a list Lk of the values of all intermediate pairs with key k is created.
3. Reduce phase: for each key k, a user-specified reduce function is applied to (k, Lk), producing >= 0 key-value pairs, which form the output of the round.
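A minimal sketch of one round in plain Python, using word count as the running example (all function names here are mine, not from the notes):

```python
from collections import defaultdict

# Hypothetical map function: emit one intermediate (word, 1) pair per word.
def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

# Hypothetical reduce function: sum the list of counts for each word.
def reduce_fn(key, values):
    yield (key, sum(values))

def run_round(pairs):
    # Map phase: apply map_fn independently to each input pair.
    intermediate = [p for k, v in pairs for p in map_fn(k, v)]
    # Shuffle: group intermediate pairs by key into (k, Lk).
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: apply reduce_fn to each (k, Lk).
    return [p for k, lk in groups.items() for p in reduce_fn(k, lk)]

docs = [(0, "the quick brown fox"), (1, "the quick dog")]
print(run_round(docs))  # [('the', 2), ('quick', 2), ('brown', 1), ('fox', 1), ('dog', 1)]
```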
Why key-value pairs?
MapReduce is data-centric and focuses on data transformations, which are independent of where the data actually reside. The keys are needed as addresses to reach the objects, and as labels to define the groups in the reduce phases.
Specification of a MapReduce algorithm
- The input and output of the algorithm are clearly defined.
- The sequence of rounds executed for any given input instance is unambiguously implied by the specification.
- For each round, the following aspects are clear:
  * input, intermediate, and output sets of key-value pairs
  * functions applied in the map and reduce phases
- Meaningful (exact or asymptotic) bounds for the key performance indicators (defined later) can be derived.
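As a worked example of such a specification (Word Count is a standard illustration; the exact wording is mine):
- Input: a set of (i, Di) pairs, where Di is a document with index i.
- Output: a set of (w, c(w)) pairs, where c(w) is the number of occurrences of word w across all documents.
- Round 1: Map: for each (i, Di), produce one intermediate pair (w, 1) for every occurrence of a word w in Di. Reduce: for each word w, output (w, sum of the values in Lw).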
Execution of a round on a distributed platform
- The user program is forked into a master process and several executor processes. The master assigns map and reduce tasks to the executors and monitors their status (see the simulation sketch after this list).
- Input and output files reside on a DFS, while intermediate data are stored in the executors' local memories.
- Map phase: each executor is assigned a subset of the input pairs and applies the map function to them sequentially, one at a time.
- Shuffle: the system collects the intermediate pairs from the executors' local spaces and groups them by key.
- Reduce phase: each executor is assigned a subset of the (k, Lk) pairs and applies the reduce function to them sequentially, one at a time.
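A toy simulation of the map phase of a round, with a process pool standing in for the executors (a sketch under my own naming; this is not how Hadoop or Spark are actually implemented):

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical map task: an executor applies the map function to its assigned
# subset of input pairs, sequentially, one at a time.
def map_task(chunk):
    return [(w, 1) for _, doc in chunk for w in doc.split()]

if __name__ == "__main__":
    docs = [(0, "the quick brown fox"), (1, "the lazy dog")]
    # The "master" assigns one chunk of input pairs to each "executor".
    chunks = [docs[:1], docs[1:]]
    with ProcessPoolExecutor(max_workers=2) as pool:
        intermediate = [p for part in pool.map(map_task, chunks) for p in part]
    print(intermediate)  # intermediate pairs, ready for the shuffle
```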
Performance analysis of a MapReduce algorithm
1. Number of rounds R: a rough estimate of the running time.
2. Local space ML: the maximum amount of main memory required, at any executor, by a single invocation of a map or reduce function, for storing its input and any data structures needed by the invocation.
3. Aggregate space MA: the maximum amount of (disk) space occupied by the stored data at the beginning/end of a map or reduce phase (i.e., the disk capacity the system, e.g., HDFS, must provide).
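As a quick worked example (again using Word Count; the numbers follow from the one-round algorithm sketched earlier): on an input of N total word occurrences, R = 1 and MA = O(N), but ML = O(N) in the worst case, since a single very frequent word w may produce a list Lw of length Θ(N) at one reducer.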
Design Goals for MapReduce
Observation: for every problem solvable by a sequential algorithm in space S, there exists a 1-round MapReduce algorithm with ML = MA = Θ(S): run the sequential algorithm on the whole input with one reducer. This is impractical for very large inputs for the following reasons:
* A platform with very large main memory is needed.
* No parallelism is exploited.
GOALS:
1. Few rounds (e.g., R = O(1));
2. Sublinear local space (e.g., ML = O(|input|^ε), with ε < 1);
3. Linear aggregate space (i.e., MA = O(|input|)), or only slightly superlinear;
4. Low complexity of each map or reduce function.
Conflicting goals:
- Goals 1 and 4 concern time efficiency: low communication (few rounds) vs. low computation (simple functions).
- Goals 2 and 3 concern space efficiency: low main memory vs. low data replication.
Pros and cons of MapReduce
Pros
- Data-centric view.
- Usability: the framework can be used as a black box.
- Portability/Adaptability: runs on different platforms.
- Cost: does not require expensive machines, and cloud providers offer support.
Cons
- Running time is only coarsely captured by R.
- Curse of the last reducer: a round's duration is dominated by its slowest reducer.
- Not suitable for applications that require very high performance.
Partitioning
A typical approach to attain the stated goal on local space is to:
- subdivide the relevant data into small partitions, either deterministically or randomly, and
- make the reduce functions work separately on individual partitions.
Observation: partitioning may increase the number of rounds and, more rarely, the aggregate space requirements. Hence, suitable tradeoffs must be sought.
Partitioning N objects into l groups:
- Deterministic partitioning: ML = O(max{N/l, l}).
- Random partitioning: ML = O(max{m, l}), where m is the maximum number of intermediate pairs that receive the same random key x. Note that m >= N/l; the hope is that m ≈ N/l.
Choosing l = √N balances the two terms N/l and l, giving ML = O(√N).
Theorem: fix l = √N and suppose that in Round 1 the keys are assigned to the intermediate pairs independently and with uniform probability from [0, √N). Then, with probability at least 1 − 1/N^5, m = O(√N). (Proof: slide 53; see the sketch below.)
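A sketch of random partitioning applied to Word Count, in plain Python (two rounds; all names are mine, and this is an illustration rather than the notes' own code):

```python
import math
import random
from collections import defaultdict

def group_by_key(pairs):
    # Shuffle: group (k, v) pairs into {k: Lk}.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def partitioned_word_count(docs):
    words = [w for _, doc in docs for w in doc.split()]
    N = len(words)
    l = max(1, math.isqrt(N))  # number of partitions, l = sqrt(N)

    # Round 1, map: attach a random key x in [0, l) to each (w, 1) pair.
    round1 = [(random.randrange(l), (w, 1)) for w in words]
    # Round 1, reduce: within each partition x, compute partial counts, so
    # each reducer only stores the pairs of one partition (O(max{m, l}) space).
    partials = [(w, sum(ones))
                for x, wpairs in group_by_key(round1).items()
                for w, ones in group_by_key(wpairs).items()]
    # Round 2: identity map; reduce sums the <= l partial counts per word.
    return {w: sum(cs) for w, cs in group_by_key(partials).items()}

docs = [(0, "the quick brown fox jumps over the lazy dog")]
print(partitioned_word_count(docs))
```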
Why do we need a framework like Spark?
- To pool the resources of many machines into one platform.
- To manage and coordinate the execution of big data jobs on that platform.
Spark Application
Driver
1. Creates the Spark Context: the channel to access Spark functionalities.
2. Distributes tasks to the executors.
3. Monitors the status of the execution.
Executors
1. Execute the tasks assigned to them and report their status to the driver.
Cluster manager
1. Controls the physical machines.
2. Allocates resources to applications.
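A minimal driver sketch in PySpark (assuming the pyspark package is available; the app name and master URL are mine):

```python
from pyspark import SparkConf, SparkContext

# "local[*]" selects local mode: driver and executors run as threads on this
# machine. On a cluster, the master URL would point at the cluster manager.
conf = SparkConf().setAppName("FlashcardDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(10)).sum())  # 45: a trivial job run by the executors
sc.stop()
```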
EXECUTION: when the application is run on a single machine (i.e., local mode), both the driver and the executors run on that machine as threads. If instead the application is run on a distributed platform (i.e., cluster mode), the driver and the executors can run on different machines.
MAP/REDUCE TASKS: if the application implements a MapReduce computation, a task assigned to an executor typically comprises running a map function on several key-value pairs, or a reduce function on several key-listOfValues pairs (see the sketch below).
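A hedged sketch of a MapReduce round expressed in PySpark (Word Count again; here flatMap/map play the role of the map phase, while reduceByKey performs the shuffle and reduce):

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("WordCount").setMaster("local[*]"))

docs = sc.parallelize(["the quick brown fox", "the lazy dog"])
counts = (docs.flatMap(lambda doc: doc.split())  # map phase: split into words
              .map(lambda w: (w, 1))             # emit intermediate (w, 1) pairs
              .reduceByKey(lambda a, b: a + b))  # shuffle + reduce: sum per word
print(counts.collect())  # e.g. [('the', 2), ('quick', 1), ...]
sc.stop()
```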