MapReduce/Spark Flashcards

1
Q

MapReduce Intro

A

- Programming framework for big data processing on distributed platforms.

It comprises:
* A programming model
* An implementation tailored to a cluster-based computing environment.

2
Q

MapReduce Features

A

- Data-centric view
- Inspired by functional programming (map and reduce functions)
- Messy implementation details (scheduling, communication, fault tolerance) are hidden from the programmer

3
Q

Typical cluster architecture

A

- Racks of 16-64 compute nodes connected by fast switches

- Distributed File System (DFS)
* Files are divided into chunks
* Each chunk is replicated on different nodes to ensure fault tolerance.

4
Q

MapReduce-Hadoop-Spark

A

* Apache Hadoop: a popular MapReduce implementation. Its Hadoop Distributed File System (HDFS) is still widely used.
* Apache Spark: supports the implementation of MapReduce computations, but provides a much richer programming environment. It can use HDFS for storage.

5
Q

MapReduce computation

A

A MapReduce computation is a sequence of rounds.

Each round transforms a set of key-value pairs into another set of key-value pairs through two phases:

1. Map phase: a user-specified map function is applied to each input pair and produces >= 0 intermediate key-value pairs.

Shuffle: the intermediate pairs are grouped by key: for each key k, a list Lk is created with the values of all intermediate pairs with key k.

2. Reduce phase: for each key k, a user-specified reduce function is applied to (k, Lk) and produces >= 0 key-value pairs, which form the output of the round.
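As a minimal illustration (plain Python, not the actual framework; `map_fn`, `reduce_fn` and `mapreduce_round` are hypothetical names), one round that computes the maximum temperature per city can be simulated like this:

```python
from collections import defaultdict

def map_fn(key, value):
    # Input pair: (sensor id, (city, temperature)). Emit one (city, temp) pair.
    city, temp = value
    return [(city, temp)]

def reduce_fn(key, values):
    # (city, Lk) -> (city, max temperature), the output of the round.
    return [(key, max(values))]

def mapreduce_round(pairs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input pair.
    intermediate = [p for k, v in pairs for p in map_fn(k, v)]
    # Shuffle: group intermediate values by key into lists Lk.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: apply reduce_fn to each (k, Lk) pair.
    return [p for k, lk in groups.items() for p in reduce_fn(k, lk)]

readings = [("s1", ("Rome", 30)), ("s2", ("Rome", 34)), ("s3", ("Oslo", 12))]
print(sorted(mapreduce_round(readings, map_fn, reduce_fn)))
# [('Oslo', 12), ('Rome', 34)]
```

Note how the shuffle is the only step that needs data from different map invocations; map and reduce apply to one pair at a time.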

6
Q

Why key-value pairs?

A

MapReduce is data-centric and focuses on data transformations that are independent of where the data actually reside.

The keys serve as addresses to reach the objects, and as labels to define the groups in the reduce phase.

7
Q

Specification of a MapReduce algorithm

A
  • The input and output of the algorithm are clearly defined.
  • The sequence of rounds executed for any given input instance is unambiguously implied by the specification.
  • For each round, the following aspects are clear:
    • the input, intermediate and output sets of key-value pairs;
    • the functions applied in the map and reduce phases.
  • Meaningful values or asymptotic bounds for the key performance indicators (defined later) can be derived.
8
Q

Execution of a round on a distributed platform

A

- The user program is forked into a master process and several executor processes. The master assigns map and reduce tasks to the executors and monitors their status.

- Input and output files reside on a DFS, while intermediate data are stored in the executors' local memories.

  • Map phase: each executor is assigned a subset of the input pairs and applies the map function to them sequentially, one at a time.

  • Shuffle: the system collects the intermediate pairs from the executors' local spaces and groups them by key.

  • Reduce phase: each executor is assigned a subset of the (k, Lk) pairs and applies the reduce function to them sequentially, one at a time.

9
Q

Performance analysis of a MapReduce algorithm

A

1. Number of rounds R: a rough estimate of the running time.

2. Local space ML: the maximum amount of main memory required at any executor by a single invocation of a map or reduce function, for storing its input and any data structures needed by the invocation.

3. Aggregate space MA: the maximum amount of (disk) space occupied by the stored data at the beginning/end of a map or reduce phase (i.e., the DFS capacity the system needs).
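As a toy illustration of the two space indicators (my own proxies, not the formal definitions: ML is approximated by the largest list Lk a single reducer must hold, MA by the total number of stored pairs), one can measure the shuffle output like this:

```python
from collections import defaultdict

def space_indicators(intermediate_pairs):
    # Group the intermediate pairs by key, as the shuffle would.
    groups = defaultdict(list)
    for k, v in intermediate_pairs:
        groups[k].append(v)
    # Local space proxy: size of the largest (k, Lk) pair one reducer receives.
    ml = max(len(lk) for lk in groups.values())
    # Aggregate space proxy: total number of pairs stored across the system.
    ma = sum(len(lk) for lk in groups.values())
    return ml, ma

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
print(space_indicators(pairs))  # (3, 5)
```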

10
Q

Design Goals for MapReduce

A

Observation: for every problem solvable by a sequential algorithm in space S, there exists a 1-round MapReduce algorithm with ML = MA = Θ(S): run the sequential algorithm on the whole input with one reducer.

This is impractical for very large inputs for the following reasons:
* A platform with a very large main memory is needed.
* No parallelism is exploited.

GOALS:
1. Few rounds (e.g., R = O(1));
2. Sublinear local space (e.g., ML = O(|input|^e), with e < 1);
3. Linear aggregate space (i.e., MA = O(|input|)), or only slightly superlinear;
4. Low complexity of each map or reduce function.

Conflicts:
* Goals 1 and 4 concern time efficiency: low communication vs. low computation.
* Goals 2 and 3 concern space efficiency: low main memory vs. low data replication.

11
Q

Pros and cons of MapReduce

A

Pros
- Data-centric view.
- Usability: the distributed platform is a black box to the programmer.
- Portability/Adaptability: runs on different platforms.
- Cost: runs on inexpensive commodity machines, and cloud providers offer support.

Cons
- Running time is only coarsely captured by R.
- Curse of the last reducer: a round is as slow as its slowest reducer.
- Not suitable for applications that require very high performance.

12
Q

Partitioning

A

A typical approach to attain the stated goal on local space is to:

  • subdivide the relevant data into small partitions, either deterministically or randomly, and
  • make the reduce functions work separately on the individual partitions.

Observation: partitioning may increase the number of rounds and, more rarely, the aggregate space requirements. Hence, suitable tradeoffs must be sought.

For N objects divided into l groups:

  • Deterministic partitioning: ML = O(max{N/l, l}).
  • Random partitioning: ML = O(max{m, l}), where m is the maximum number of intermediate pairs receiving the same random key x.

In general m >= N/l; the hope is that m ~ N/l.
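A quick plain-Python sketch of the two schemes (the choice l = sqrt(N) follows the next card; `random.randrange` stands in for the uniform random key assignment):

```python
import math
import random

random.seed(0)

N = 10_000
l = int(math.sqrt(N))  # number of partitions

# Deterministic partitioning: object i goes to group i % l,
# so every group holds exactly N/l objects.
det_sizes = [0] * l
for i in range(N):
    det_sizes[i % l] += 1

# Random partitioning: each object draws a key uniformly in [0, l),
# so group sizes fluctuate around N/l.
rand_sizes = [0] * l
for _ in range(N):
    rand_sizes[random.randrange(l)] += 1

print(max(det_sizes))   # exactly N/l = 100
print(max(rand_sizes))  # m: somewhat above N/l, but O(sqrt(N)) w.h.p.
```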

13
Q

Fix l = √N and suppose that in Round 1 the keys are assigned to the intermediate pairs independently and with uniform probability from [0, √N]. Then, with probability at least 1 − 1/N^5, m = O(√N).

A

Proof 53.
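A hedged reconstruction of the standard Chernoff-bound argument (not necessarily the proof on the slides; the constant 6 is illustrative):

```latex
\text{For a fixed key } x, \text{ let } m_x \text{ be the number of intermediate pairs with key } x. \\
\text{Each of the } N \text{ pairs picks key } x \text{ with probability } 1/\sqrt{N}, \text{ independently, so } \\
\mathbb{E}[m_x] = N \cdot \tfrac{1}{\sqrt{N}} = \sqrt{N}. \\
\text{By a Chernoff bound for sums of independent 0/1 variables, }
\Pr\!\left[m_x \ge 6\sqrt{N}\right] \le 2^{-6\sqrt{N}}. \\
\text{A union bound over the } \sqrt{N} \text{ possible keys gives } \\
\Pr\!\left[m \ge 6\sqrt{N}\right] \le \sqrt{N} \cdot 2^{-6\sqrt{N}} \le \tfrac{1}{N^5}
\text{ for } N \text{ sufficiently large,} \\
\text{hence } m = O(\sqrt{N}) \text{ with probability at least } 1 - 1/N^5.
```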

14
Q

Why do we need a framework like Spark?

A

- To pool the resources of many machines into one platform.
- To manage and coordinate the execution of big data jobs on that platform.

15
Q

Spark Application

A

Driver
1. Creates the Spark Context: the channel through which the application accesses Spark's functionalities.
2. Distributes tasks to the executors.
3. Monitors the status of the execution.

Executors
1. Execute the tasks assigned to them and report to the driver.

Cluster manager
1. Controls the physical machines.
2. Allocates resources to applications.

EXECUTION: when the application is run on a single machine (i.e., local mode), both the driver and the executors run on that machine as threads. If instead the application is run on a distributed platform (i.e., cluster mode), the driver and the executors can run on different machines.

MAP/REDUCE TASKS: if the application implements a MapReduce computation, a task assigned to an executor typically comprises running a map function on several key-value pairs, or a reduce function on several key-listOfValues pairs.

16
Q

Resilient Distributed Dataset RDD

A

An RDD is a collection of elements of the same type, partitioned and distributed across several machines.

  • RDDs are created:
    • from data in stable storage (e.g., HDFS), or
    • from other RDDs, through transformations.
  • RDDs are immutable (i.e., read-only).
  • RDDs are materialized only when needed (lazy evaluation).
  • Spark maintains the lineage of each RDD, namely the sequence of transformations that generated it, which enables Spark to materialize it, or to reconstruct it after a failure, starting from the data in stable storage.
17
Q

RDD: Partitioning

A

Each RDD is broken into chunks called partitions, distributed among the machines.

The programmer can specify the number of partitions for each RDD.

Partitions are created by default (using a HashPartitioner, based on the objects' hash codes) or through a custom partitioner.

Typical number of partitions: 2x-3x the number of cores, which helps balance the work.

Partitioning is exploited to enhance performance:

  • Spark creates map tasks so as to make each executor apply the map function to data from a locally stored partition (if possible).
  • To implement algorithms that require partitioning, the programmer can explicitly access RDD partitions (e.g., the mapPartitions method).
  • RDD partitions are exploited implicitly by some ready-made Spark aggregation primitives (e.g., the reduceByKey method).
18
Q

RDD: Operations

A

1. TRANSFORMATIONS

A transformation applied to an RDD A generates a new RDD B starting from the data in A.

  • Narrow: each partition of A contributes to at most one partition of B (e.g., map). No shuffling is required, and partitions are processed in parallel.
  • Wide: each partition of A may contribute to many partitions of B (e.g., groupByKey). Shuffling is required.

2. ACTIONS

An action is a computation on the elements of an RDD A which returns a value to the application.

LAZY EVALUATION: RDD A is materialized only when an action is performed on it.

Persistence: methods like persist or cache will keep the RDD's data in memory after the next action materializes it.
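Lazy evaluation can be mimicked in plain Python with generators (an analogy only, assuming nothing about Spark's internals; `my_map`, `my_filter` and `my_count` are made-up names): transformations build a pipeline without touching the data, and only the action at the end forces computation.

```python
# "Transformations": build generator pipelines; nothing is computed yet.
def my_map(f, data):
    return (f(x) for x in data)          # element by element, narrow-style

def my_filter(pred, data):
    return (x for x in data if pred(x))

# "Action": forces the whole pipeline to run and returns a value.
def my_count(data):
    return sum(1 for _ in data)

nums = range(10)
pipeline = my_map(lambda x: x * x, my_filter(lambda x: x % 2 == 0, nums))
# No squaring or filtering has happened yet; `pipeline` is just a recipe.
print(my_count(pipeline))  # 5  <- evaluation happens only here
```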

19
Q

Implementing a MapReduce round in Spark

A

* Map phase: on the input RDD X, invoke one of the map methods offered by Spark (a narrow transformation), passing the desired map function as an argument.
* Reduce phase: on the RDD X′ resulting from the map phase:
  * invoke one of the grouping methods offered by Spark (a wide transformation) to group the key-value pairs into key-ListOfValues pairs;
  * invoke one of the map methods offered by Spark to apply the desired reduce function to each key-ListOfValues pair.
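The recipe above can be sketched in plain Python standing in for Spark (`flat_map`, `group_by_key` and `map_values` are hypothetical stand-ins for Spark's flatMap, groupByKey and mapValues; this is an illustration, not Spark code):

```python
from collections import defaultdict

def flat_map(f, rdd):
    # Stand-in for Spark's flatMap (a narrow transformation).
    return [p for x in rdd for p in f(x)]

def group_by_key(rdd):
    # Stand-in for Spark's groupByKey (a wide transformation: the shuffle).
    groups = defaultdict(list)
    for k, v in rdd:
        groups[k].append(v)
    return list(groups.items())

def map_values(f, rdd):
    # Stand-in for Spark's mapValues, used to apply the reduce function.
    return [(k, f(lk)) for k, lk in rdd]

# One MapReduce round for word count, mirroring the recipe above.
docs = ["a b a", "b c"]
x1 = flat_map(lambda line: [(w, 1) for w in line.split()], docs)  # map phase
x2 = group_by_key(x1)                                            # grouping
counts = map_values(sum, x2)                                     # reduce
print(sorted(counts))  # [('a', 2), ('b', 2), ('c', 1)]
```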