MapReduce Flashcards
What is MapReduce?
A programming paradigm (a way of structuring programs) that processes data in parallel using the computational resources of a cluster.
What does distributed computing need?
Storage and computation.
What is MapReduce a programming model for, and which languages can implement it?
Processing large data sets. Any programming language can implement it.
What is the Map Phase?
1) Divide the data set into chunks.
2) Have a separate process work on each chunk.
What’s another name for chunks?
Input splits.
What’s another name for the process working on the chunks/input splits?
Mappers.
What are some qualities of a mapper?
1) Each mapper processes one record at a time.
2) Each mapper executes the same set of code on each record.
3) The output of the mapper will be a key-value pair.
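Sketch (Python, illustrative only): the three qualities above, shown with a word-count mapper. The names word_count_mapper and run_mapper are hypothetical and are not part of Hadoop.

```python
from typing import Iterable, Iterator, List, Tuple


def word_count_mapper(record: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) key-value pair for every word in a single record."""
    for word in record.lower().split():
        yield (word, 1)


def run_mapper(records: Iterable[str]) -> List[Tuple[str, int]]:
    """Run the same mapper code on each record, one record at a time."""
    output: List[Tuple[str, int]] = []
    for record in records:
        output.extend(word_count_mapper(record))
    return output
```

Calling run_mapper(["to be", "or not to be"]) yields one (word, 1) pair per word, in input order.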
What are some features of an input split?
1) Input split respects logical record boundaries.
2) An input split is a logical abstraction (a Java class that works behind the scenes, holding pointers to start and end locations within blocks).
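Sketch (Python, illustrative only): one way to picture a split as a pair of start and end offsets that never cuts a record in half. This is a simplification that assumes newline-terminated records; logical_splits is a hypothetical helper, not Hadoop's actual implementation.

```python
from typing import List, Tuple


def logical_splits(data: bytes, block_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets covering data, each ending on a record boundary."""
    splits: List[Tuple[int, int]] = []
    start = 0
    total = len(data)
    while start < total:
        end = min(start + block_size, total)
        if end < total:
            # Extend past the physical block boundary until the record is complete.
            newline = data.find(b"\n", end - 1)
            end = total if newline == -1 else newline + 1
        splits.append((start, end))
        start = end
    return splits
```

The splits hold only offsets; the data itself is not copied, which mirrors the idea of pointers into blocks.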
What is a mapper?
A program that the Hadoop framework invokes once for every record in the input split. The output of the mapper should be a key-value pair.
Example: an input split with 10 records means the mapper is invoked 10 times.
What is the Reduce Phase?
The reducers work on the output of the mappers.
The output of the individual mappers is grouped by key and passed to the reducer.
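Sketch (Python, illustrative only): grouping mapper output by key and handing each group to a reducer. The names group_by_key and sum_reducer are hypothetical, and summing is just an assumed example of reducer logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect every value emitted for the same key into one list."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def sum_reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce all values for one key to a single total."""
    return (key, sum(values))


def run_reduce(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Group the mapper output, then call the reducer once per key in sorted order."""
    grouped = group_by_key(pairs)
    return [sum_reducer(key, grouped[key]) for key in sorted(grouped)]
```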
What is the shuffle phase?
The process in which the output of the mappers is transferred to the reducers.
What is the shuffle phase sort?
In the map phase, each key is assigned to a partition by a class called the partitioner. Within each partition, the key-value pairs are sorted by key.
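Sketch (Python, illustrative only): assigning keys to partitions with a hash, then sorting inside each partition. The function names are hypothetical, and zlib.crc32 is an assumed stand-in for whatever hash a real partitioner uses; it is chosen here only because it gives the same result on every run.

```python
import zlib
from typing import Dict, Iterable, List, Tuple


def partition_for(key: str, num_reducers: int) -> int:
    """Map a key to a partition number in the range 0..num_reducers-1."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_and_sort(
    pairs: Iterable[Tuple[str, int]], num_reducers: int
) -> Dict[int, List[Tuple[str, int]]]:
    """Bucket pairs by partition, then sort each bucket by key."""
    partitions: Dict[int, List[Tuple[str, int]]] = {
        p: [] for p in range(num_reducers)
    }
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    for bucket in partitions.values():
        bucket.sort(key=lambda kv: kv[0])
    return partitions
```

Because the partition depends only on the key, every pair with the same key lands in the same partition no matter which mapper produced it.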
What is the shuffle phase copy?
Once the key-value pairs are sorted, they are copied to the appropriate reducer based on the partition they belong to.
One partition == one reducer.
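Sketch (Python, illustrative only): the copy step pictured as each reducer collecting its own partition from every mapper. The name copy_to_reducers and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def copy_to_reducers(
    mapper_outputs: List[Dict[int, List[Pair]]], num_reducers: int
) -> Dict[int, List[List[Pair]]]:
    """Give reducer r the sorted run for partition r from every mapper."""
    reducer_inputs: Dict[int, List[List[Pair]]] = {
        r: [] for r in range(num_reducers)
    }
    for partitions in mapper_outputs:
        for partition_id, sorted_run in partitions.items():
            # Partition number and reducer number are the same thing here.
            reducer_inputs[partition_id].append(sorted_run)
    return reducer_inputs
```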
What is Shuffle Merge?
The merging of sorted key-value pairs from different mappers into a single stream that preserves the sort order.
Each key is handled by exactly one reducer.
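Sketch (Python, illustrative only): merging already-sorted runs from several mappers without re-sorting everything, using the standard library's heapq.merge. The function names are hypothetical.

```python
import heapq
from itertools import groupby
from typing import List, Tuple

Pair = Tuple[str, int]


def merge_sorted_runs(runs: List[List[Pair]]) -> List[Pair]:
    """Merge runs that are each sorted by key into one list sorted by key."""
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))


def group_merged(merged: List[Pair]) -> List[Tuple[str, List[int]]]:
    """Group adjacent pairs with the same key, as a reducer would receive them."""
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=lambda kv: kv[0])
    ]
```

Because the merged stream stays sorted, all values for a key sit next to each other and can be grouped in a single pass.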
What is the combiner?
An optional step in the map phase that reduces the amount of data sent to the reducers.
It acts as a mini-reducer that runs after the mapper and before the reducer.
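Sketch (Python, illustrative only): a combiner that pre-aggregates one mapper's output before it is sent on. The name combine is hypothetical, and summing counts is an assumed example; this kind of local pre-aggregation is safe for operations such as sums, where combining partial results gives the same final answer.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def combine(mapper_output: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum the values per key locally so fewer pairs are sent to the reducers."""
    totals: Counter = Counter()
    for key, value in mapper_output:
        totals[key] += value
    return sorted(totals.items())
```

For example, three pairs ("a", 1), ("b", 1), ("a", 1) shrink to two pairs, ("a", 2) and ("b", 1), before the shuffle.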