MapReduce - Week 7 Flashcards
Batch Processing
jobs that can run without end user interaction, or can be scheduled to run as resources permit
Examples of batch processing for data sets that build over time
Web crawling
Transaction logs, for analysing trends
Equipment logs, for predicting faults
Huge data sets that may need to be processed on parallel architectures
Who originally developed map reduce
What two functions make up map reduce?
map and reduce
MapReduce - map function definition
map(key1, value1) -> [(key2, value2)]
Given a key and a value, generates a collection of key value pairs
MapReduce - reduce function definition
reduce(key2, [value2]) -> [(key3,value3)]
given a key key2 output by map, and a collection of all the values value2 associated with that key, return a new collection of key-value pairs
Word count with map reduce - what do the two functions do?
Map takes a document, and returns a set of word counts for that document.
e.g.
“the map operation given…” -> {“the”: 1, “map”:1, …}
Reduce takes outputs from map and collates them into one thing
{“the”: [1,1], “map”:[1,1} -> {“the”: 2, “map”: 2}
MapReduce provider, extensions and competitors
Hadoop, …
Extensions: Cloudera
Competitors: Apache Spark
AWS EC2
Purchase of virtual machines of different capabilities, with different operating systems and for different periods
IaaS
AWS S3
Purchase of storage that is accessed through a simple file system style interface
IaaS
EMR (Elastic Map Reduce)
The ability to run scalable applications written using the map reduce programming model over EC2 and S3 infrastructure
PaaS
How is S3 used for AWS MapReduce?
The input to the map/reduce problem
The Jar that contains the program
The output from the execution of the program
Logging information
Use Map reduce or RDB for single batch tasks?
MapReduce, perhaps the effort of loading the data into a relational database isn’t worth it
Use Map Reduce or RDB for data using online transactional processing and analytical tasks?
Map reduce won’t help with the OLTP tasks.
A relational database is more flexible and may be able to handle both, though often different systems are used for OLTP and analytics to avoid contention for resources.
Use Map Reduce or RDB for data that needs fine-grained access control
MapReduce itself doesn’t provide much in the way of security - the hosting environment does that.
Certain relational databases will provide fine-grained access control
MapReduce - what is a job?
The unit of work to be performed (the data and the program)
Can consist of several map and reduce tasks
MapReduce - what is a split?
A part of the input (e.g. a 64mb filesystem block)
MapReduce - what is a task?
Map or reduce functions created and run for each split/partition
MapReduce - what is a task tracker?
Tracks the progress of each of the map or reduce tasks on a node
Keeps the job tracker informed of progress
MapReduce - what is a job tracker?
Coordinates the different tasks comprising a job
MapReduce - How many map tasks are created?
One for each split (each part of the input, e.g. a 64mb filesystem block)
MapReduce - What are the steps of the map function?
- The map task runs on the split, creating key-value pairs
- The output of the map is partitioned into groups, for sending to reduce functions, typically by hashing.
- The partitions are sorted by key
- The outputs are written to the local file system
- The task manager notifies the job manager that the task is completed
MapReduce - What are the steps of the reduce function?
- Relevant map partitions are copied to the associated reduce nodes
- Data from different maps is merged to produce the inputs for individual reduce operations
- The reduce task is run
- Outputs are written to the distributed filesystem
What are the MapReduce performance issues to look out for?
Memory usage - The space required by the code within map and reduce
Skew - The likelihood that data is not distributed evenly across reduce nodes
Intermediate result size - The amount of data that is produced by map or reduce compared to the size of their inputs
Write the Basket analysis average problem as MapReduce pseudocode
A list of the prices of the baskets a user has purchased, e.g.
001 - £26
002 - £30
001 - £40
002 - £35
A simple analysis involves getting the average a user has spent
Map receives a key (identifying the location of an input split) and a value (a section of memory, a list of the users and values)
It emits those user and value pairs
The reduce function gets a customer id and a list of basket values for that customer from the map function, it emits the average of these values with the key of the customer id it received
MapReduce - summarisation patterns
Summarization aims to send as little information to reducers as possible, e.g.
“hello hello hello” -> {“hello”, 3}
instead of
“hello hello hello” -> {“hello” 1, “hello” 1, “hello” 1}
To do this numerically the operation must be both:
associative - (axb)xc = ax(bxc)
commutative - axb = bxa
MapReduce - combiners
An optional reducer that is local to a map, and that summarises its results in some way
Must have inputs and outputs that are semantically compatible with the output of a map
In word count, the normal reducer can be used as a combiner, in this case it has the same effect as summarisation
MapReduce - inverted index pattern
supports the construction of an index from the contents of a document to the document
Map returns emits a key for each token, and a value of the document identifier
Reduce is application specific, might compute tf-idf or write to an index structure
MapReduce - Filtering pattern
Discard some of the information in the input
Sampling - producing representative examples
Top-k - choosing the best examples
Distinct - removing duplicates
For word count - may only be interested in counting occurrences of dictionary words
MapReduce - Join Pattern
Implements a relational join
In Hadoop a job can be associated with two mappers that read from different inputs; then the infrastructure does most of the join
Need to label tuples with their table table, so we can distinguish them in the reduce.
For full explanation see “Join Pattern” slide in the “Map Reduce Programming - Week 7” notes