MapReduce - Week 7 Flashcards
Batch Processing
Jobs that can run without end-user interaction, or that can be scheduled to run as resources permit
Examples of batch processing for data sets that build over time
Web crawling
Transaction logs, for analysing trends
Equipment logs, for predicting faults
Huge data sets that may need to be processed on parallel architectures
Who originally developed MapReduce?
Google
What two functions make up MapReduce?
map and reduce
MapReduce - map function definition
map(key1, value1) -> [(key2, value2)]
Given a key and a value, generates a collection of key value pairs
MapReduce - reduce function definition
reduce(key2, [value2]) -> [(key3,value3)]
Given a key key2 output by map, and the collection of all the values value2 associated with that key, returns a new collection of key-value pairs
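The two signatures above can be tried out with a small single-process driver. This is only a sketch in Python: run_mapreduce, map_max_temp, reduce_max_temp and the sample readings are hypothetical names and data invented for illustration, and a real framework such as Hadoop would distribute the map, grouping and reduce phases across machines.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every (key1, value1), group by key2, then reduce each group."""
    groups = defaultdict(list)
    for key1, value1 in inputs:
        # map(key1, value1) -> [(key2, value2)]
        for key2, value2 in map_fn(key1, value1):
            groups[key2].append(value2)
    output = []
    for key2 in sorted(groups):
        # reduce(key2, [value2]) -> [(key3, value3)]
        output.extend(reduce_fn(key2, groups[key2]))
    return output

def map_max_temp(line_number, line):
    # Hypothetical input format: "city,temperature"
    city, temp = line.split(",")
    return [(city, int(temp))]

def reduce_max_temp(city, temps):
    return [(city, max(temps))]

readings = [(1, "leeds,12"), (2, "york,9"), (3, "leeds,15")]
result = run_mapreduce(readings, map_max_temp, reduce_max_temp)
```

Here key1 is a line number that map ignores, key2 is the city, and reduce sees every temperature recorded for one city at once.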
Word count with map reduce - what do the two functions do?
Map takes a document, and returns a set of word counts for that document.
e.g.
“the map operation given…” -> {“the”: 1, “map”:1, …}
Reduce takes the counts that map output for each word, across all documents, and sums them into a single total per word
{“the”: [1,1], “map”: [1,1]} -> {“the”: 2, “map”: 2}
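A minimal Python sketch of the word count card, run in a single process. The function names, the whitespace tokenisation and the two sample documents are assumptions made for illustration; the grouping step in word_count stands in for the shuffle that a MapReduce framework would perform between the two phases.

```python
from collections import Counter, defaultdict

def map_word_count(doc_id, text):
    # Returns the word counts for one document as (word, count) pairs.
    counts = Counter(text.lower().split())
    return list(counts.items())

def reduce_word_count(word, counts):
    # Sums the per-document counts collected for one word.
    return [(word, sum(counts))]

def word_count(documents):
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, n in map_word_count(doc_id, text):
            grouped[word].append(n)  # e.g. "the" -> [1, 1]
    totals = {}
    for word, counts in grouped.items():
        for key, total in reduce_word_count(word, counts):
            totals[key] = total
    return totals

docs = {"d1": "the map operation", "d2": "the map function"}
totals = word_count(docs)
```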
MapReduce provider, extensions and competitors
Hadoop, …
Extensions: Cloudera
Competitors: Apache Spark
AWS EC2
Purchase of virtual machines of different capabilities, with different operating systems and for different periods
IaaS
AWS S3
Purchase of storage that is accessed through a simple file system style interface
IaaS
EMR (Elastic MapReduce)
The ability to run scalable applications written using the MapReduce programming model over EC2 and S3 infrastructure
PaaS
How is S3 used for AWS MapReduce?
The input to the map/reduce problem
The Jar that contains the program
The output from the execution of the program
Logging information
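The four roles S3 plays can be seen in how a job request might be laid out. The Python sketch below only builds a dictionary and sends nothing to AWS. The bucket name, folder layout and build_emr_job_request are hypothetical. The field names (LogUri, Steps, HadoopJarStep, Jar, Args) follow the shape of the boto3 EMR run_job_flow request as I understand it, and a real request would also need instance, release and role settings.

```python
def build_emr_job_request(bucket, job_name):
    """Lay out the S3 locations for input, jar, output and logs (hypothetical layout)."""
    base = f"s3://{bucket}/{job_name}"
    return {
        "Name": job_name,
        "LogUri": f"{base}/logs/",                # logging information
        "Steps": [
            {
                "Name": "run-mapreduce-jar",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": f"{base}/program/job.jar",  # the jar that contains the program
                    "Args": [
                        f"{base}/input/",          # input to the map/reduce problem
                        f"{base}/output/",         # output from the execution
                    ],
                },
            }
        ],
    }

request = build_emr_job_request("example-bucket", "wordcount")
```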
Use MapReduce or an RDB for single batch tasks?
MapReduce. For a one-off task, the effort of loading the data into a relational database may not be worth it.
Use MapReduce or an RDB for data that needs both online transaction processing (OLTP) and analytical tasks?
MapReduce won’t help with the OLTP tasks.
A relational database is more flexible and may be able to handle both, though often different systems are used for OLTP and analytics to avoid contention for resources.
Use MapReduce or an RDB for data that needs fine-grained access control?
MapReduce itself doesn’t provide much in the way of security - the hosting environment does that.
Certain relational databases will provide fine-grained access control