Chapter 10: MapReduce Flashcards
Origins cont’d
- Task for processing large amounts of data
- Split data
- Forward data and code to participant nodes
- Check node state to react to errors
- Retrieve and reorganize results
- Need: Simple large-scale data processing
- Inspiration from functional programming
- Google published MapReduce paper in 2004
From Lisp to MapReduce - important features
- Two important concepts in functional programming
- Map: do something to everything in a list
- Fold: combine results of a list in some way
Map
- Map is a higher-order function (= takes one or more functions as arguments)
- How map works:
- Function is applied to every element in a list
- Result is a new list
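For example, in Python (values are illustrative):

```python
# map applies a function to every element of a list and
# returns a new list with the results
def square(x):
    return x * x

numbers = [1, 2, 3, 4]
squares = list(map(square, numbers))
print(squares)  # [1, 4, 9, 16]
```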

Fold
- Fold is also a higher-order function
- How fold works:
- Accumulator (generalization of a counter) set to initial value
- Function applied to list element and the accumulator
- Result stored in the accumulator
- Repeated for every item in the list
- Result is the final value in the accumulator
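A minimal Python sketch of a fold (Python calls it reduce):

```python
from functools import reduce

# fold: the accumulator starts at an initial value (0 here);
# the function combines it with each list element in turn
numbers = [1, 2, 3, 4]
total = reduce(lambda acc, x: acc + x, numbers, 0)
print(total)  # ((((0 + 1) + 2) + 3) + 4) = 10
```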

Lisp -> MapReduce
- Let’s assume a long list of records: imagine if…
- We can distribute the execution of map operations to multiple nodes
- We have a mechanism for bringing map results back together in the fold operation
- That’s MapReduce! (and Hadoop)
- Implicit parallelism:
- We can parallelize execution of map operations since they are isolated
- We can reorder folding if the fold function is commutative and associative
- Commutative -> change order of operands: x*y = y*x
- Associative -> change order of operations: (2+3)+4 = 2+(3+4) = 9
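A small Python sketch of why this matters: because + is commutative and associative, folding per-partition partial results gives the same answer as a sequential fold (the partitioning here is illustrative):

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# sequential fold
sequential = reduce(lambda a, b: a + b, numbers)

# "distributed" fold: fold each partition independently, then fold
# the partial results -- same answer, because + is commutative
# and associative
partitions = [numbers[:3], numbers[3:]]
partials = [reduce(lambda a, b: a + b, part) for part in partitions]
distributed = reduce(lambda a, b: a + b, partials)

assert sequential == distributed == 21
```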
MapReduce - basics
- Programming model & runtime system for processing large data-sets
- E.g., Google’s search algorithms (PageRank: 2005 – 200TB indexed)
- Goal: make it easy to use 1000s of CPUs and TBs of data
- Inspiration: Functional programming languages
- Programmer specifies only “what”
- System determines "how"
- Schedule, parallelism, locality, communication, …
- Ingredients:
- Automatic parallelization and distribution (across # nodes)
- Fault-tolerance
- I/O scheduling
- Status and monitoring
Simplified View of Architecture
- Input data is distributed to workers (i.e., available nodes)
- Workers perform data computation
- Master coordinates worker selection & fail-over
- Results stored in output data files

MapReduce Programming Model
- Input & output: each a set of key/value pairs
- Programmer specifies two functions:
- map(in_key, in_value) -> list(out_key, intermediate_value)
- Processes an input key/value pair
- Produces a set of intermediate pairs
- reduce(out_key, list(intermediate_value)) -> list(out_value)
- Combines all intermediate values for a particular key
- Produces a set of merged output values (usually just one)
- User also specifies I/O locations and tuning parameters
- partition(out_key, number of partitions) -> partition for out_key
Map() function
- Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs: e.g., (filename, line)
- Map() produces one or more intermediate values along with an output key from the input

Sort and shuffle
- The MapReduce framework
- Shuffles and sorts intermediate pairs based on key
- Assigns resulting streams to reducers
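A minimal in-memory sketch of the shuffle step (the pairs are made up):

```python
from collections import defaultdict

# intermediate (key, value) pairs as the mappers might emit them
intermediate = [("apple", 1), ("pear", 1), ("apple", 1)]

# group values by key, as the framework's shuffle step would
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# sort by key and hand each stream to a reducer
for key in sorted(groups):
    print(key, groups[key])  # apple [1, 1] / pear [1]
```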

Reduce()
- After the map phase is over, all the intermediate values for a given output key are combined together into a list
- Reduce() combines those intermediate values into one or more final values for that same output key
- (In practice, usually only one final value per key)

MapReduce execution stages
- Scheduling: assigns workers to map and reduce tasks
- Data distribution: moves processes to data (Map)
- Synchronization: gathers, sorts, and shuffles intermediate data (Reduce)
- Errors and faults: detects worker failures and restarts
MapReduce example: Word count
- Word count: Count the frequency of word appearances in a set of documents
- MAP: Each map task gets a document. The mapper generates a key/value pair (word, "1") each time a word appears
- INPUT: (fileName, fileContent), where fileName is the key and fileContent is the value
- OUTPUT: (word, wordAppearance), where word is the key and wordAppearance is the value
- REDUCE: Combines the values per key and computes the sum
- INPUT: (word, List), where word is the key and List is the value
- OUTPUT: (word, sum), where word is the key and sum is the value
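Putting the phases together, a self-contained Python sketch of word count (the document and file name are illustrative; a real framework would distribute these phases across nodes):

```python
from collections import defaultdict

def map_fn(file_name, file_content):
    # MAP: emit (word, 1) for every word occurrence
    for word in file_content.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # REDUCE: sum all counts for one word
    return (word, sum(counts))

documents = {"doc1.txt": "to be or not to be"}

# map phase
intermediate = []
for name, text in documents.items():
    intermediate.extend(map_fn(name, text))

# shuffle phase: group intermediate values by key
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# reduce phase
for word in sorted(groups):
    print(reduce_fn(word, groups[word]))  # ('be', 2), ('not', 1), ...
```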

Combiners
- Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
- E.g., popular words in Word Count
- Can save network time by pre-aggregating at mapper
- For associative ops. like sum, count, max
- Decreases size of intermediate data
- Example: local counting for word count:
- def combiner(key, values): output(key, sum(values))
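Expanding the pseudocode above into a runnable sketch (the pair data is illustrative):

```python
from collections import defaultdict

def combiner(pairs):
    # pre-aggregate (word, 1) pairs locally at the mapper before they
    # cross the network -- valid because + is associative
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

pairs = [("the", 1), ("the", 1), ("cat", 1), ("the", 1)]
print(combiner(pairs))  # [('the', 3), ('cat', 1)]: four pairs shrink to two
```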
Partition Function
- Inputs to map tasks are created by contiguous splits of the input file
- For reduce, we need to ensure that records with the same intermediate key end up at the same worker
- System uses a default partition function, e.g., hash(key) mod R (sketched below)
- Sometimes useful to override
- To balance the load
- When there are specific requirements on which key/value pairs should be in the same output files
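A sketch of the default-style partitioner described above (R and the keys are illustrative; note that Python's built-in hash is only stable within one process, whereas a real framework uses a deterministic hash):

```python
# R reducers; equal keys must always land on the same reducer
R = 4

def partition(out_key, num_partitions=R):
    # default-style partitioner: hash the key modulo the reducer count
    return hash(out_key) % num_partitions

# every occurrence of the same key maps to the same partition
assert partition("apple") == partition("apple")
print(partition("apple"), partition("pear"))
```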
Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently
- Bottleneck: reduce phase can’t start until map phase is completely finished.
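A toy illustration of parallel map tasks using Python's multiprocessing (the line data is made up; a real framework schedules tasks across machines rather than local processes):

```python
from multiprocessing import Pool

def map_task(line):
    # map tasks are isolated, so they can run on different cores/nodes
    return [(word, 1) for word in line.split()]

if __name__ == "__main__":
    lines = ["to be or not to be", "that is the question"]
    with Pool(processes=2) as pool:
        results = pool.map(map_task, lines)
    # the reduce phase could only begin once all map tasks are done
    print(results)
```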

Fault Tolerance & Optimizations
- In a cluster with 1000s of nodes, failures are common
- On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion committed through master
- Not dealing with master failures so far…
- Optimization for fault-tolerance and load-balancing
- Slow workers significantly lengthen completion time
- Due to other jobs consuming resources on the machine, disks with errors, caching issues, …
- Solution: Near end of phase, spawn backup copies of tasks
- Whichever one finishes first "wins"
Criticism
- Too low level
- Manual programming of per-record manipulation
- As opposed to a declarative model
- Nothing new
- Map and reduce are classical higher-order functions from Lisp
- Low per-node performance
- Due to replication and data transfer
MapReducable?
- One-iteration algorithms are perfect fits
- Multiple-iteration algorithms are good fits
- But small amounts of shared data have to be synchronized across iterations (typically through the filesystem)
- Some algorithms are not good for MapReduce framework
- Those algorithms typically require large shared data with a lot of synchronization (many machine learning algorithms)
- Many alternatives: Iterative MapReduce Frameworks or Bulk Synchronous Parallel (BSP) Frameworks

One-iteration: Inverted index
- List of documents that contain a certain term – Basic structure of modern search engines
- Map process (parser)
- Tokenize documents
- Create (term, document) pairs
- Combine process
- Locally combine pairs with same term to postings lists
- Reduce process (inverter)
- Combine all postings lists for a term
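A compact Python sketch of the inverted-index job (the two-document corpus is illustrative):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # parser: tokenize and emit (term, doc_id) pairs
    for term in set(text.lower().split()):
        yield (term, doc_id)

def reduce_fn(term, doc_ids):
    # inverter: merge all postings for one term into a postings list
    return (term, sorted(doc_ids))

docs = {"d1": "big data", "d2": "big ideas"}

postings = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_fn(doc_id, text):
        postings[term].append(d)

for term in sorted(postings):
    print(reduce_fn(term, postings[term]))  # ('big', ['d1', 'd2']), ...
```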
Iterative MapReduce
- Iterative algorithms: Repeat map and reduce jobs
- Examples: PageRank, K-means, SVM and many more
- In MapReduce, the only way to share data across jobs is stable storage -> slow!
- Some proposed solutions include: Twister and Spark

Spark basics
- Up to 100x faster than Hadoop MapReduce
- Several programming interfaces (Java, Scala, Python, R)
- Powers a stack of useful libraries (MLlib, GraphX, Spark Streaming)
- Fault tolerance (for crashes & stragglers)
- Extremely well documented
- Based on the concept of a resilient distributed dataset (RDD):
- Read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost
- Created by transforming data in stable storage using data flow operators (map, filter, group by, …)
- A handle to an RDD contains enough information to compute the RDD starting from data in reliable storage
- Can be cached across parallel operations
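A minimal PySpark sketch of these ideas, assuming PySpark is installed and a local file input.txt exists (the file name is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

# an RDD: read-only, partitioned, rebuildable from its lineage
lines = sc.textFile("input.txt")  # hypothetical input file

counts = (lines.flatMap(lambda line: line.split())   # transformations
               .map(lambda word: (word, 1))          # build lazily;
               .reduceByKey(lambda a, b: a + b))     # nothing runs yet

counts.cache()           # keep the RDD in memory across operations
print(counts.collect())  # action: triggers the actual computation
sc.stop()
```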
Bulk Synchronous Parallel (BSP) abstract computer
- BSP is a parallel programming model for a distributed-memory parallel computer
- Global vision of a number of processor/memory pairs interconnected by a communication network
- Offers several advantages over MapReduce and MPI
- Supports message passing paradigm style for application development
- Provides a flexible, simple, and easy-to-use small API
- Can perform better than MPI for communication-intensive applications: all communication actions are treated as a unit
- Guarantees impossibility of deadlocks or collisions in the communication mechanisms: Synchronization is part of the programming model
- Predictable performance
BSP model
- Processor-memory pairs
- Communications network that delivers messages in a point-to-point manner
- Mechanism for efficient barrier synchronization of all or a subset of the processes
- There are no special combining, replicating, or broadcasting facilities

BSP programs
- BSP programs are composed of supersteps
- In each superstep, processors execute computation steps using locally stored data, and can send and receive messages
- Processors synchronize at the end of each superstep (at which time all messages have been received)
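A toy Python sketch of supersteps using threads and a barrier (the processor count and data are made up; a real BSP system runs on distributed processor/memory pairs):

```python
import threading

P = 4
barrier = threading.Barrier(P)     # barrier synchronization mechanism
inbox = [[] for _ in range(P)]     # stand-in for the point-to-point network
data = [[1, 2], [3, 4], [5, 6], [7, 8]]

def worker(pid):
    # superstep 1: compute on locally stored data, send a message
    inbox[0].append(sum(data[pid]))
    barrier.wait()                 # superstep ends: all messages delivered
    # superstep 2: processor 0 combines the received partial sums
    if pid == 0:
        print("total:", sum(inbox[0]))  # 36

threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```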
What is MapReduce?
Programming model and runtime for processing large data-sets (in clusters).
How does MapReduce differ from message-passing interface (MPI) and remote procedure calls (RPC)?
It offers a higher layer of abstraction at the cost of generality.
What does MapReduce offer? Provide at least three bullet points
- Automatic parallelization and distribution
- Fault-tolerance
- I/O scheduling
- Status and monitoring
Describe the different stages in the execution framework of MapReduce (4 bullet points).
- Scheduling: assigns workers to map and reduce tasks
- Data distribution: moves processes to data (Map)
- Synchronization: gathers, sorts, and shuffles intermediate data (Reduce)
- Errors and faults: detects worker failures and restarts
What type of algorithms are not good for the MapReduce framework?
Algorithms that typically require large shared data with a lot of synchronization, e.g., SVM.
Word count: Count the frequency of word appearances in a set of documents.
MAP: Each map task gets a document. The mapper generates a key/value pair for each appearance of a word, i.e., (word, "1")
- Input: (fileName, fileContent), where fileName is the key and fileContent is the value
- Output: (word, wordAppearance)-pairs, where word is the key and wordAppearance is the value
REDUCE: Combines the values for a key and computes the sum
- Input: (word, List<wordAppearance>)
- Output: (word, sum(wordAppearance))
Search for a pattern: Data is a set of files containing lines of text. Output the file names that contain this pattern.
MAP: Given (filename, some text) and "pattern": if the text matches the pattern, output (filename, _)
- Input: The whole file, line by line, as (filename, some text) pairs, plus the pattern
- Output: (filename, _)-pairs, where filename is the key and the value is not relevant
REDUCE: Identity function. Data is not changed
- Input: (filename, _)
- Output: (filename, _)
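A runnable sketch of this job (the file contents and the pattern "error" are made up):

```python
def map_fn(filename, line, pattern="error"):
    # emit the filename if the line matches; the value is irrelevant
    if pattern in line:
        yield (filename, None)

def reduce_fn(filename, values):
    # identity function: data is not changed
    return (filename, None)

files = {"a.log": ["all good", "error here"], "b.log": ["fine"]}

intermediate = set()
for name, lines in files.items():
    for line in lines:
        intermediate.update(map_fn(name, line))

for filename, value in sorted(intermediate):
    print(reduce_fn(filename, [value]))  # ('a.log', None)
```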
The MapReduce formulation of K-means usually needs a driver or wrapper around the normal execution framework. What is the reason for this?
Because of the iterations required in the K-means procedure, we need to loop MapReduce. The MapReduce K-means algorithm can be formulated using the following components:
- Driver or wrapper
- Mapper
- Combiner
- Reducer
Consider that a single file contains the K predefined cluster centers and that the data points are distributed across several files. Provide a short description defining the task performed by each component of MapReduce K-means and define their input and output.
- Driver or wrapper
- Runs multiple iteration jobs using mapper+combiner+reducer (see the sketch after this list)
- Mapper
- Task: Assign each point to the closest centroid
- Configure: A single file containing the cluster centers
- Input: Input data points
- Output: (data id, cluster id)
- Combiner
- Input: (data id, cluster id)
- Output: (cluster id, (partial sum, number of points))
- Reducer
- Task: Update each centroid with its new location
- Input: (cluster id, (partial sum, number of points))
- Output: (cluster id, cluster centroid)
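A compact in-memory sketch of the driver loop; run_job here is a stand-in that simulates one mapper+combiner+reducer pass over 1-D points (all names and values are illustrative):

```python
def run_job(points, centroids):
    # one MapReduce pass, sketched in memory with 1-D points:
    # map = assign each point to its closest centroid,
    # combine/reduce = recompute each centroid as the mean of its points
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
        clusters[i].append(p)
    return [sum(c) / len(c) if c else centroids[i]
            for i, c in clusters.items()]

def driver(points, centroids, max_iters=20, tol=1e-4):
    # the driver loops full MapReduce jobs until the centroids converge
    for _ in range(max_iters):
        new = run_job(points, centroids)
        shift = max(abs(a - b) for a, b in zip(centroids, new))
        centroids = new
        if shift < tol:   # centroids barely moved: stop iterating
            break
    return centroids

print(driver([1.0, 1.1, 5.0, 5.2], centroids=[0.0, 6.0]))  # [1.05, 5.1]
```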
What characteristic of K-means could cause a large computation overhead?
Since we are looping MapReduce, every iteration reads from and writes to stable storage (disk I/O). If K-means requires a large number of iterations, this can be a major source of delay.