MapReduce Flashcards

Question 1

Q

What is MapReduce?

Answer

A

MapReduce is a programming model and implementation for processing and generating large data sets, where users specify a map function to generate intermediate key/value pairs and a reduce function to merge intermediate values by key.

Question 2

Q

What is the goal of MapReduce?

Answer

A

The goal is to support analytical jobs over large datasets by iterating over records, extracting data of interest, aggregating results, and saving the output to a distributed file system.

Question 3

Q

What does the map function do?

Answer

A

The map function is applied to each input record independently and emits zero or more intermediate key-value pairs.

Question 4

Q

What does the reduce function do?

Answer

A

The reduce function aggregates all values associated with the same key, combining them to produce the final output.

Question 5

Q

What are the origins of MapReduce?

Answer

A

MapReduce has its roots in functional programming, where the map and reduce (or fold) functions are used for processing collections.
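
As a rough illustration of the functional-programming roots, here is a minimal sketch using Python's built-in `map` and `functools.reduce`. The list `numbers` and the variable names are made up for this example; this is plain single-machine code, not a distributed job.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply a function to every element independently
squares = list(map(lambda x: x * x, numbers))

# reduce (fold): combine all elements into one value, starting from 0
total = reduce(lambda acc, x: acc + x, squares, 0)
```

MapReduce borrows the two ideas but adds keys: map emits key-value pairs, and reduce runs once per key.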

Question 6

Q

How is the input data split for MapReduce?

Answer

A

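The input is divided into fixed-size splits, and one map task processes each split. A split size equal to the DFS block size is a common choice: a map task can then read its whole split from a single block, ideally stored on the same node, and splits that are small relative to the whole input make load balancing easier.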

Question 7

Q

What happens during the shuffling phase?

Answer

A

The shuffling phase redistributes intermediate key-value pairs from map tasks to reduce tasks, ensuring all values for a given key are brought together.
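
The grouping effect of the shuffle can be sketched in a few lines. This is an in-memory stand-in under stated assumptions: the function name `shuffle` and the input layout (one list of pairs per map task) are hypothetical, and a real framework moves the data across the network instead.

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group intermediate (key, value) pairs from all map tasks by key."""
    grouped = defaultdict(list)
    for pairs in map_outputs:          # one list of pairs per map task
        for key, value in pairs:
            grouped[key].append(value)
    # Sort by key so each reducer sees its keys in order
    return dict(sorted(grouped.items()))

grouped = shuffle([[("a", 1), ("b", 1)], [("a", 1)]])
```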

Question 8

Q

What is the purpose of sorting in MapReduce?

Answer

A

Sorting organizes key-value pairs by key locally on each node during the shuffle phase to prepare for the reduce function’s aggregation step.

Question 9

Q

What is a combiner in MapReduce?

Answer

A

A combiner is a mini-reducer that performs local aggregation on map outputs to reduce the volume of data transferred during shuffling, minimizing network traffic.
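
A minimal sketch of the idea, assuming a word-count style job. The helper names `map_words` and `combine` are hypothetical; the point is only that the combiner shrinks one map task's output before it is shuffled.

```python
from collections import Counter

def map_words(line):
    """Map step: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum the values per key within one map task's output."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

pairs = map_words("a b a a")      # 4 pairs before combining
combined = combine(pairs)         # 2 pairs after combining
```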

Question 10

Q

When can a combiner be used?

Answer

A

A combiner can be used if the reduce function is associative and commutative, allowing some reduction to occur on the map side.

Question 11

Q

What is a partitioner in MapReduce?

Answer

A

A partitioner assigns each key-value pair from the map output to a specific reducer, based on a partitioning function (e.g., a hash function).
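
A hash partitioner might look like the sketch below. The function name `partition` is hypothetical, and `zlib.crc32` is used here only because Python's built-in `hash` for strings varies between processes; a real framework supplies its own hash.

```python
import zlib

def partition(key, num_reducers):
    """Return the index of the reducer responsible for this key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

first = partition("apple", 4)
second = partition("apple", 4)   # same key always maps to the same reducer
```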

Question 12

Q

How is the number of reducers determined?

Answer

A

The number of reducers can be set by the user or chosen by the system; the choice is a trade-off between the resources available and the efficiency of the job.

Question 13

Q

What is the Word Count problem in MapReduce?

Answer

A

It involves counting the number of occurrences of each word in a collection of documents using a map function to emit words with a value of 1 and a reduce function to sum counts for each word.
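
The whole word-count flow can be sketched on a single machine as follows. The names `map_fn`, `reduce_fn`, and `word_count` are illustrative, and the grouping loop stands in for the shuffle that a real framework would perform.

```python
from collections import defaultdict

def map_fn(document):
    """Emit (word, 1) for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Sum all the 1s emitted for a single word."""
    return (word, sum(counts))

def word_count(documents):
    groups = defaultdict(list)
    for document in documents:
        for word, one in map_fn(document):
            groups[word].append(one)       # simulated shuffle: group by key
    return dict(reduce_fn(word, counts) for word, counts in groups.items())

counts = word_count(["the cat", "the dog the"])
```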

Question 14

Q

What is the purpose of a two-stage MapReduce job?

Answer

A

Complex MapReduce calculations can be broken down into multiple stages, where the output of one stage serves as input for the next, allowing for more advanced data processing workflows.
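
As a sketch of chaining, the example below feeds the output of a word-count stage into a second stage that groups words by how often they occur. The driver `run_job` and all function names are hypothetical, in-memory stand-ins for a real framework.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Run one simulated map -> shuffle -> reduce stage in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

# Stage 1: count the occurrences of each word
def count_map(line):
    return [(word, 1) for word in line.split()]

def count_reduce(word, ones):
    return (word, sum(ones))

# Stage 2: group words by their count
def freq_map(pair):
    word, count = pair
    return [(count, word)]

def freq_reduce(count, words):
    return (count, sorted(words))

stage1 = run_job(["a b a", "b a c"], count_map, count_reduce)
stage2 = run_job(stage1, freq_map, freq_reduce)   # stage 1 output is stage 2 input
```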

Question 15

Q

What challenges arise in partitioning MapReduce output?

Answer

A

Challenges include balancing data evenly across reducers and handling skewed distributions where some keys have significantly more values than others.

Question 16

Q

What is non-commutativity in MapReduce?

Answer

A

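Non-commutativity means that swapping the order of the operands changes the result of the reduce function (for example, subtraction or string concatenation), so the output depends on the order in which values arrive.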

Question 17

Q

What is non-associativity in MapReduce?

Answer

A

Non-associativity means that the way values are grouped when the reduce function is applied affects the result. Averages are a typical case: the average of partial averages generally differs from the average of all the values.
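
The averaging case can be checked with a small worked example. The numbers and the helper names `mean` and `merge` are made up for illustration; the (sum, count) representation is one common workaround, shown here as a sketch.

```python
def mean(values):
    return sum(values) / len(values)

def merge(a, b):
    """Merge two (sum, count) partial results; grouping does not matter here."""
    return (a[0] + b[0], a[1] + b[1])

# Averaging everything at once
direct = mean([1, 2, 3, 4, 100])

# Averaging partial averages gives a different answer
mean_of_means = mean([mean([1, 2, 3]), mean([4, 100])])

# Carrying (sum, count) pairs and dividing at the end restores the right answer
total, count = merge((sum([1, 2, 3]), 3), (sum([4, 100]), 2))
combined = total / count
```

Here `direct` and `combined` are both 22.0, while `mean_of_means` is 27.0.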

Question 18

Q

How do combiners optimize MapReduce?

Answer

A

Combiners reduce the size of intermediate data by pre-aggregating on the map side, thus decreasing the amount of data sent during the shuffle phase.

Question 19

Q

What is the general pattern for a summarization algorithm in MapReduce?

Answer

A

The map function emits the grouping key together with the value to be aggregated, and the reduce function aggregates the values for each key (e.g., finding the maximum, sum, or average).

Question 20

Q

What is the join pattern in MapReduce?

Answer

A

The join pattern involves emitting key-value pairs with flags to distinguish different datasets in the map phase and then matching records by key in the reduce phase to produce combined results.
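
A reduce-side join along these lines might be sketched as below. The two datasets (users and orders), the tags "U" and "O", and every function name are assumptions made for this example; the grouping inside `join` stands in for the shuffle.

```python
from collections import defaultdict

def map_user(record):
    user_id, name = record
    return (user_id, ("U", name))      # tag marks the users dataset

def map_order(record):
    user_id, item = record
    return (user_id, ("O", item))      # tag marks the orders dataset

def reduce_join(user_id, tagged_values):
    """Pair every user record with every order record that shares the key."""
    names = [value for tag, value in tagged_values if tag == "U"]
    items = [value for tag, value in tagged_values if tag == "O"]
    return [(user_id, name, item) for name in names for item in items]

def join(users, orders):
    groups = defaultdict(list)
    for key, value in map(map_user, users):
        groups[key].append(value)
    for key, value in map(map_order, orders):
        groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_join(key, groups[key]))
    return results

joined = join([(1, "ann"), (2, "bob")], [(1, "book"), (1, "pen"), (3, "cup")])
```

This sketch behaves like an inner join: keys that appear in only one dataset produce no output.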

(20 cards)