17. MapReduce and Spark Flashcards
MapReduce and Spark - where is the initial data stored?
a file system in which large files (TBs, PBs) are partitioned into smaller files called chunks (usually 64MB) and then distributed and replicated several times on different nodes for fault tolerance. This is called a distributed file system (DFS), which has had many implementations, from Google’s proprietary GFS to Hadoop’s open-source HDFS.
MapReduce - input, output, what should be programmed?
The data model in MapReduce treats files as bags of (key, value) pairs. A MapReduce program takes as input a bag of (input_key, value) pairs and outputs a bag of (output_key, value) pairs (output_key is optional). The user must provide two stateless functions, map() and reduce(), to define how the input pairs are transformed into the output pairs.
MapReduce - describe the Map phase
The first part of MapReduce – the Map phase – applies a user-provided Map function in parallel to an input of (key1, value1) pairs and outputs a bag of (key2, value2) pairs. The types and values of key1, value1, key2, and value2 are independent of each other. The (key2, value2) pairs serve as intermediate tuples in the MapReduce process.
func map(key1, value1) -> bag((key2, value2))
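For concreteness, here is a minimal word-count mapper sketched in plain Python (the function name and inputs are illustrative, not part of any particular MapReduce framework); key1 is a document name and value1 its text:

# Word-count mapper: emits one (word, 1) intermediate pair per word occurrence.
def map_wordcount(key1, value1):
    for word in value1.split():
        yield (word, 1)

# list(map_wordcount("doc1", "to be or not to be"))
# -> [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]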
MapReduce - describe the Reduce phase
The second part of MapReduce – the Reduce phase – groups all intermediate pairs output by the Map phase that share the same key2 and processes each group into a single bag of output values, as defined by a user-provided Reduce function. This function also runs in parallel, with a machine or worker thread processing a single group of pairs sharing the same key2. The (key3, value3) pairs in this function serve as the output tuples of the MapReduce process.
func reduce(key2, bag(value2)) -> bag((key3, value3))
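Continuing the hypothetical word-count example, the matching reducer sums the counts emitted for each word:

# Word-count reducer: key2 is a word, values is the bag of counts emitted for it.
def reduce_wordcount(key2, values):
    yield (key2, sum(values))

# list(reduce_wordcount("to", [1, 1])) -> [("to", 2)]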
The role of workers in MapReduce
MapReduce processes are split up and run by workers, which are processes that execute one task at a time. With one worker per core, we can have multiple workers per node in the system.
Both the Map and Reduce phases can be split among different workers on different machines, with workers performing independent tasks in parallel.
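As a rough sketch of this idea (one worker process per core, each running an independent map task; not how a real MapReduce runtime is implemented), in Python:

from multiprocessing import Pool, cpu_count

def run_map_task(split):
    # Each worker applies a word-count map function to one input split.
    out = []
    for key, value in split:
        for word in value.split():
            out.append((word, 1))
    return out

if __name__ == "__main__":
    splits = [[("doc1", "to be or not to be")], [("doc2", "to be is to do")]]
    with Pool(processes=cpu_count()) as pool:   # one worker per core
        intermediate = pool.map(run_map_task, splits)
    print(intermediate)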
The role of a leader in MapReduce
MapReduce designates one machine as the leader node. Its role is to accept new MapReduce jobs and assign their map and reduce tasks to workers.
The leader begins by partitioning the input file into M splits by key and assigning workers to M map tasks, keeping track of progress as workers perform their tasks.
Map workers write their output to local disk, partitioned into R regions. The leader then assigns workers to the R reduce tasks, which write the final output to disk once complete.
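A minimal sketch of how a map worker's output might be split into R regions, assuming a hash-partitioning scheme like hash(key) mod R (the pairs and the value of R here are made up for illustration):

# Partition intermediate (key, value) pairs into R regions, one per reduce task.
R = 4
regions = [[] for _ in range(R)]
for key, value in [("to", 2), ("be", 2), ("or", 1), ("not", 1)]:
    regions[hash(key) % R].append((key, value))
# Reduce task r later reads region r from every map worker's local disk.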
Main improvements of Spark vs. MapReduce
- Consists of multiple steps instead of 1 mapper + 1 reducer.
- Stores intermediate results in main memory.
- Resembles relational algebra more closely.
Spark - explain what is RDD
Spark’s data model consists of semi-structured data objects called Resilient Distributed Datasets (RDDs). These objects, which can be anything from key-value pairs to objects of a particular type, are immutable datasets that can be distributed across multiple machines. For faster execution, RDDs are not written to disk between intermediate steps but are instead kept in main memory. Since intermediate results would be lost if a node crashes, each RDD’s lineage is tracked so that lost RDDs can be recomputed.
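A minimal PySpark sketch (assuming a local Spark installation; the app name and data are illustrative) showing that transformations build up a lineage, which Spark can replay to recompute lost partitions:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lineage-demo")
nums = sc.parallelize(range(10))                   # base RDD distributed across partitions
pairs = nums.map(lambda x: (x % 3, 1))             # transformation: records lineage, no disk writes
counts = pairs.reduceByKey(lambda a, b: a + b)     # another transformation in the lineage
print(counts.toDebugString())                      # shows the lineage Spark would replay on failure
print(counts.collect())                            # action: computes the result in memory
sc.stop()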
Spark - 2 types of operators
transformations and actions.
Actions are eager operators, meaning they are executed immediately when called. These include operators such as count, reduce, and save.
Transformations, on the other hand, are evaluated lazily: they are not executed immediately but are instead recorded in the lineage log. An operator tree, similar to a relational algebra tree, is constructed in memory.
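A short PySpark sketch of the distinction (again assuming a local Spark installation; names are illustrative): the transformations below only extend the operator tree, and nothing executes until the collect() action is called:

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-vs-eager-demo")
lines = sc.parallelize(["spark is lazy", "actions are eager"])
words = lines.flatMap(lambda line: line.split())   # transformation: recorded, not executed
ones = words.map(lambda w: (w, 1))                 # transformation: recorded, not executed
counts = ones.reduceByKey(lambda a, b: a + b)      # transformation: recorded, not executed
result = counts.collect()                          # action: triggers execution of the whole tree
print(result)
sc.stop()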