MapReduce/Spark Flashcards
MapReduce Intro
- Programming framework for big data processing on distributed platforms.
- It comprises:
  * A programming model
  * An implementation tailored towards a cluster-based computing environment.
MapReduce Features
- Data-centric view
- Inspired by functional programming (map and reduce functions)
- Messy implementation details hidden from the programmer
Typical cluster architecture
- Racks of 16-64 compute nodes connected by fast switches
- Distributed File System (DFS)
  * Files divided into chunks
  * Each chunk replicated on different nodes to ensure fault tolerance
MapReduce-Hadoop-Spark
* Apache Hadoop: a popular implementation of MapReduce. The Hadoop Distributed File System (HDFS) is still widely used.
* Apache Spark: supports the implementation of MapReduce computations, but provides a much richer programming environment. It can use HDFS for storage.
MapReduce computation
A MapReduce computation is a sequence of rounds. Each round transforms a set of key-value pairs into another set of key-value pairs through the following phases (see the sketch below):
1. Map phase: a user-specified map function is applied independently to each input pair, producing >= 0 intermediate key-value pairs.
2. Shuffle: the intermediate pairs are grouped by key: for each key k, a list Lk of the values of all intermediate pairs with key k is created.
3. Reduce phase: for each key k, a user-specified reduce function is applied to (k, Lk), producing >= 0 key-value pairs, which form the output of the round.
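A minimal sketch of one round in plain Python, using word count as the running example (all function names here are mine, not from the notes):

```python
from collections import defaultdict

# Hypothetical map function: emit one intermediate (word, 1) pair per word.
def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

# Hypothetical reduce function: sum the list of counts for each word.
def reduce_fn(key, values):
    yield (key, sum(values))

def run_round(pairs):
    # Map phase: apply map_fn independently to each input pair.
    intermediate = [p for k, v in pairs for p in map_fn(k, v)]
    # Shuffle: group intermediate pairs by key into (k, Lk).
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: apply reduce_fn to each (k, Lk).
    return [p for k, lk in groups.items() for p in reduce_fn(k, lk)]

docs = [(0, "the quick brown fox"), (1, "the quick dog")]
print(run_round(docs))  # [('the', 2), ('quick', 2), ('brown', 1), ('fox', 1), ('dog', 1)]
```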
Why key-value pairs?
MapReduce is data-centric and focuses on data transformations, which are independent of where the data actually reside. The keys are needed as addresses to reach the objects, and as labels to define the groups in the reduce phases.
Specification of a MapReduce algorithm
- The input and output of the algorithm are clearly defined.
- The sequence of rounds executed for any given input instance is unambiguously implied by the specification.
- For each round, the following aspects are clear:
  * input, intermediate, and output sets of key-value pairs
  * functions applied in the map and reduce phases
- Meaningful (exact or asymptotic) bounds for the key performance indicators (defined later) can be derived.
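As a worked example of such a specification (Word Count is a standard illustration; the exact wording is mine):
- Input: a set of (i, Di) pairs, where Di is a document with index i.
- Output: a set of (w, c(w)) pairs, where c(w) is the number of occurrences of word w across all documents.
- Round 1: Map: for each (i, Di), produce one intermediate pair (w, 1) for every occurrence of a word w in Di. Reduce: for each word w, output (w, sum of the values in Lw).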
Execution of a round on a distributed platform
- The user program is forked into a master process and several executor processes. The master assigns map and reduce tasks to the executors and monitors their status (see the simulation sketch after this list).
- Input and output files reside on a DFS, while intermediate data are stored in the executors' local memories.
- Map phase: each executor is assigned a subset of the input pairs and applies the map function to them sequentially, one at a time.
- Shuffle: the system collects the intermediate pairs from the executors' local spaces and groups them by key.
- Reduce phase: each executor is assigned a subset of the (k, Lk) pairs and applies the reduce function to them sequentially, one at a time.
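A toy simulation of the map phase of a round, with a process pool standing in for the executors (a sketch under my own naming; this is not how Hadoop or Spark are actually implemented):

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical map task: an executor applies the map function to its assigned
# subset of input pairs, sequentially, one at a time.
def map_task(chunk):
    return [(w, 1) for _, doc in chunk for w in doc.split()]

if __name__ == "__main__":
    docs = [(0, "the quick brown fox"), (1, "the lazy dog")]
    # The "master" assigns one chunk of input pairs to each "executor".
    chunks = [docs[:1], docs[1:]]
    with ProcessPoolExecutor(max_workers=2) as pool:
        intermediate = [p for part in pool.map(map_task, chunks) for p in part]
    print(intermediate)  # intermediate pairs, ready for the shuffle
```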
Performance analysis of a MapReduce algorithm
1. Number of rounds R: a rough estimate of the running time.
2. Local space ML: the maximum amount of main memory required, at any executor, by a single invocation of a map or reduce function, for storing its input and any data structures needed by the invocation.
3. Aggregate space MA: the maximum amount of (disk) space occupied by the stored data at the beginning/end of a map or reduce phase (i.e., the disk capacity the system, e.g., HDFS, must provide).
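As a quick worked example (again using Word Count; the numbers follow from the one-round algorithm sketched earlier): on an input of N total word occurrences, R = 1 and MA = O(N), but ML = O(N) in the worst case, since a single very frequent word w may produce a list Lw of length Θ(N) at one reducer.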
Design Goals for MapReduce
Observation: for every problem solvable by a sequential algorithm in space S, there exists a 1-round MapReduce algorithm with ML = MA = Θ(S): run the sequential algorithm on the whole input with one reducer. This is impractical for very large inputs for the following reasons:
* A platform with very large main memory is needed.
* No parallelism is exploited.
GOALS:
1. Few rounds (e.g., R = O(1));
2. Sublinear local space (e.g., ML = O(|input|^ε), with ε < 1);
3. Linear aggregate space (i.e., MA = O(|input|)), or only slightly superlinear;
4. Low complexity of each map or reduce function.
Conflicting goals:
- Goals 1 and 4 concern time efficiency: low communication (few rounds) vs. low computation (simple functions).
- Goals 2 and 3 concern space efficiency: low main memory vs. low data replication.
Pros and cons of MapReduce
Pros
- Data-centric view.
- Usability: the framework can be used as a black box.
- Portability/Adaptability: runs on different platforms.
- Cost: does not require expensive machines, and cloud providers offer support.
Cons
- Running time is only coarsely captured by R.
- Curse of the last reducer: a round's duration is dominated by its slowest reducer.
- Not suitable for applications that require very high performance.
Partitioning
A typical approach to attain the stated goal on local space is to:
- subdivide the relevant data into small partitions, either deterministically or randomly, and
- make the reduce functions work separately on individual partitions.
Observation: partitioning may increase the number of rounds and, more rarely, the aggregate space requirements. Hence, suitable tradeoffs must be sought.
Partitioning N objects into l groups:
- Deterministic partitioning: ML = O(max{N/l, l}).
- Random partitioning: ML = O(max{m, l}), where m is the maximum number of intermediate pairs that receive the same random key x. Note that m >= N/l; the hope is that m ≈ N/l.
Choosing l = √N balances the two terms N/l and l, giving ML = O(√N).
Theorem: fix l = √N and suppose that in Round 1 the keys are assigned to the intermediate pairs independently and with uniform probability from [0, √N). Then, with probability at least 1 − 1/N^5, m = O(√N). (Proof: slide 53; see the sketch below.)
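A sketch of random partitioning applied to Word Count, in plain Python (two rounds; all names are mine, and this is an illustration rather than the notes' own code):

```python
import math
import random
from collections import defaultdict

def group_by_key(pairs):
    # Shuffle: group (k, v) pairs into {k: Lk}.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def partitioned_word_count(docs):
    words = [w for _, doc in docs for w in doc.split()]
    N = len(words)
    l = max(1, math.isqrt(N))  # number of partitions, l = sqrt(N)

    # Round 1, map: attach a random key x in [0, l) to each (w, 1) pair.
    round1 = [(random.randrange(l), (w, 1)) for w in words]
    # Round 1, reduce: within each partition x, compute partial counts, so
    # each reducer only stores the pairs of one partition (O(max{m, l}) space).
    partials = [(w, sum(ones))
                for x, wpairs in group_by_key(round1).items()
                for w, ones in group_by_key(wpairs).items()]
    # Round 2: identity map; reduce sums the <= l partial counts per word.
    return {w: sum(cs) for w, cs in group_by_key(partials).items()}

docs = [(0, "the quick brown fox jumps over the lazy dog")]
print(partitioned_word_count(docs))
```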
Why do we need a framework like Spark?
- To pool the resources of many machines into one platform.
- To manage and coordinate the execution of big data jobs on that platform.
Spark Application
Driver
1. Creates the Spark Context: the channel to access Spark functionalities.
2. Distributes tasks to the executors.
3. Monitors the status of the execution.
Executors
1. Execute the tasks assigned to them and report their status to the driver.
Cluster manager
1. Controls the physical machines.
2. Allocates resources to applications.
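A minimal driver sketch in PySpark (assuming the pyspark package is available; the app name and master URL are mine):

```python
from pyspark import SparkConf, SparkContext

# "local[*]" selects local mode: driver and executors run as threads on this
# machine. On a cluster, the master URL would point at the cluster manager.
conf = SparkConf().setAppName("FlashcardDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(10)).sum())  # 45: a trivial job run by the executors
sc.stop()
```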
EXECUTION: when the application is run on a single machine (i.e., local mode), both the driver and the executors run on that machine as threads. If instead the application is run on a distributed platform (i.e., cluster mode), the driver and the executors can run on different machines.
MAP/REDUCE TASKS: if the application implements a MapReduce computation, a task assigned to an executor typically comprises running a map function on several key-value pairs, or a reduce function on several key-listOfValues pairs (see the sketch below).
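A hedged sketch of a MapReduce round expressed in PySpark (Word Count again; here flatMap/map play the role of the map phase, while reduceByKey performs the shuffle and reduce):

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("WordCount").setMaster("local[*]"))

docs = sc.parallelize(["the quick brown fox", "the lazy dog"])
counts = (docs.flatMap(lambda doc: doc.split())  # map phase: split into words
              .map(lambda w: (w, 1))             # emit intermediate (w, 1) pairs
              .reduceByKey(lambda a, b: a + b))  # shuffle + reduce: sum per word
print(counts.collect())  # e.g. [('the', 2), ('quick', 1), ...]
sc.stop()
```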