MapReduce Flashcards
What is MapReduce?
MapReduce is a programming model and implementation for processing and generating large data sets, where users specify a map function to generate intermediate key/value pairs and a reduce function to merge intermediate values by key.
What is the goal of MapReduce?
The goal is to support analytical jobs over large datasets by iterating over records, extracting data of interest, aggregating results, and saving the output to a distributed file system.
What does the map function do?
The map function processes input data by applying a function to each element, emitting key-value pairs as intermediate output.
What does the reduce function do?
The reduce function aggregates all values associated with the same key, combining them to produce the final output.
What are the origins of MapReduce?
MapReduce has its roots in functional programming, where the map and reduce (or fold) functions are used for processing collections.
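As a small illustration of the functional-programming roots, the sketch below uses Python's built-in `map` and `functools.reduce` on an in-memory list. The list `numbers` and the lambdas are made up for the example; this shows only the two primitives, not a distributed job.

```python
from functools import reduce

numbers = [1, 2, 3, 4]  # illustrative input

# map: apply a function to every element independently
squares = list(map(lambda x: x * x, numbers))

# reduce (fold): combine the elements into one value with a binary function
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```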
How is the input data split for MapReduce?
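The input is divided into fixed-size splits, one per map task. A split size equal to the DFS block size is usually ideal: a whole split can then be read from a single node, which preserves data locality. Smaller splits improve load balancing but add per-task overhead.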
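The arithmetic behind fixed-size splits can be sketched as follows. The helper `compute_splits` and the 128 MB block size are assumptions chosen for illustration; real frameworks apply additional rules (for example around record boundaries) that are not modelled here.

```python
def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed DFS block size, for illustration only

splits = compute_splits(300 * MB, BLOCK_SIZE)
for offset, length in splits:
    print(offset // MB, length // MB)
# prints:
# 0 128
# 128 128
# 256 44
```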
What happens during the shuffling phase?
The shuffling phase redistributes intermediate key-value pairs from map tasks to reduce tasks, ensuring all values for a given key are brought together.
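A single-process sketch of what shuffling achieves: the outputs of several map tasks are regrouped so that each key ends up with all of its values. The sample data and the function name `shuffle` are hypothetical, and no network transfer is simulated.

```python
from collections import defaultdict

# Hypothetical outputs of two map tasks: lists of (key, value) pairs
map_outputs = [
    [("apple", 1), ("pear", 1), ("apple", 1)],
    [("pear", 1), ("fig", 1)],
]

def shuffle(map_outputs):
    """Group values from all map tasks by key."""
    grouped = defaultdict(list)
    for task_output in map_outputs:
        for key, value in task_output:
            grouped[key].append(value)
    return dict(grouped)

grouped = shuffle(map_outputs)
print(grouped)  # {'apple': [1, 1], 'pear': [1, 1], 'fig': [1]}
```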
What is the purpose of sorting in MapReduce?
Sorting orders key-value pairs by key locally on each node during the shuffle phase, so that all values for a given key are adjacent when the reduce function consumes them.
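Why sorting helps can be seen in a short sketch: once pairs are sorted by key, equal keys form contiguous runs that can be handed to a reducer one run at a time. The sample pairs are made up; `itertools.groupby` only groups adjacent items, which is exactly why the sort comes first.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical unsorted intermediate pairs
pairs = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("pear", 1)]

# Sort by key so that equal keys become adjacent
pairs_sorted = sorted(pairs, key=itemgetter(0))

# Each run of equal keys becomes one (key, values) group for a reducer
runs = [
    (key, [value for _, value in group])
    for key, group in groupby(pairs_sorted, key=itemgetter(0))
]
print(runs)  # [('apple', [1, 1]), ('fig', [1]), ('pear', [1, 1])]
```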
What is a combiner in MapReduce?
A combiner is a mini-reducer that performs local aggregation on map outputs to reduce the volume of data transferred during shuffling, minimizing network traffic.
When can a combiner be used?
A combiner can be used if the reduce function is associative and commutative, allowing some reduction to occur on the map side.
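The sketch below illustrates both points under simple assumptions: summing counts can be done partially on the map side, while averaging cannot be applied piecewise without changing the result. The helpers `map_words` and `combine` are hypothetical names, not part of any framework.

```python
from collections import Counter
from statistics import mean

def map_words(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Locally sum values per key before anything is sent to reducers."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

raw = map_words("to be or not to be")
combined = combine(raw)
print(len(raw), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]

# Counter-example: a mean of partial means differs from the true mean,
# so a plain "mean" reducer cannot be reused as a combiner.
true_mean = mean([1, 2, 3, 4, 5])                       # 3
mean_of_means = mean([mean([1, 2]), mean([3, 4, 5])])   # 2.75
```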
What is a partitioner in MapReduce?
A partitioner assigns each key-value pair from the map output to a specific reducer, based on a partitioning function (e.g., a hash function).
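A hash partitioner can be sketched in a few lines. The function name `partition` and the choice of `zlib.crc32` are assumptions for this example; CRC32 is used because Python's built-in `hash()` for strings is randomized between runs, whereas a partitioner must send the same key to the same reducer every time.

```python
import zlib

def partition(key, num_reducers):
    """Map a string key to a reducer index in [0, num_reducers)."""
    # crc32 is deterministic across processes and runs
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for word in ["apple", "pear", "fig", "apple"]:
    print(word, partition(word, 4))
```

Both occurrences of "apple" print the same index, which is the property the reduce phase relies on.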
How is the number of reducers determined?
The number of reducers can be set by the user or chosen by the system. The choice balances the parallelism that the available resources allow against the per-task overhead of running more reducers.
What is the Word Count problem in MapReduce?
It involves counting the number of occurrences of each word in a collection of documents using a map function to emit words with a value of 1 and a reduce function to sum counts for each word.
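A minimal in-memory sketch of Word Count follows. The names `map_fn`, `reduce_fn`, and `word_count` are illustrative, the grouping step stands in for the shuffle, and lowercasing the input is an added assumption rather than part of the problem statement.

```python
from collections import defaultdict

def map_fn(document):
    """Emit (word, 1) for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return (word, sum(counts))

def word_count(documents):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for doc in documents:
        for word, one in map_fn(doc):
            grouped[word].append(one)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

counts = word_count(["the quick fox", "the lazy dog", "The fox"])
print(counts)  # {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```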
What is the purpose of a two-stage MapReduce job?
A two-stage (or multi-stage) job breaks a complex calculation into a chain of MapReduce steps, where the output of one stage serves as the input to the next. This allows processing that a single map and reduce pass cannot express.
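Chaining can be sketched with a tiny single-process driver. Everything here is hypothetical: `run_job` is a stand-in for a framework, stage 1 counts words, and stage 2 takes the (word, count) pairs from stage 1 and groups words by their count.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Run one map/shuffle/reduce pass in memory."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

# Stage 1: count words
def count_map(line):
    return [(word, 1) for word in line.split()]

def count_reduce(word, ones):
    return (word, sum(ones))

# Stage 2: group words by how often they occur
def invert_map(pair):
    word, count = pair
    return [(count, word)]

def invert_reduce(count, words):
    return (count, sorted(words))

stage1 = run_job(["a b a", "b a c"], count_map, count_reduce)
stage2 = run_job(stage1, invert_map, invert_reduce)
print(stage1)  # [('a', 3), ('b', 2), ('c', 1)]
print(stage2)  # [(1, ['c']), (2, ['b']), (3, ['a'])]
```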
What challenges arise in partitioning MapReduce output?
Challenges include balancing data evenly across reducers and handling skewed distributions where some keys have significantly more values than others.
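Skew can be made visible by counting how many pairs each reducer would receive. The data set below is synthetic (one very frequent key plus many rare ones), and the CRC32-based `partition` function is an assumed stand-in for a framework's hash partitioner.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Deterministic hash partitioner (illustrative)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(pairs, num_reducers):
    """Count how many pairs land on each reducer."""
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key, _ in pairs:
        loads[partition(key, num_reducers)] += 1
    return loads

# Synthetic skewed input: "the" accounts for 900 of 1000 pairs
pairs = [("the", 1)] * 900 + [(f"word{i}", 1) for i in range(100)]
loads = reducer_loads(pairs, 4)

for reducer, load in sorted(loads.items()):
    print(reducer, load)
```

Because every pair with the key "the" must go to the same reducer, one reducer receives at least 900 of the 1000 pairs, no matter how well the hash function spreads the remaining keys.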