17. MapReduce and Spark Flashcards

1
Q

MapReduce and Spark - where is the initial data stored?

A

The initial data is stored in a distributed file system (DFS): a file system in which large files (TBs, PBs) are partitioned into smaller files called chunks (usually 64 MB) that are then distributed and replicated several times across different nodes for fault tolerance. The DFS concept has seen many implementations, from Google’s proprietary GFS to Hadoop’s open-source HDFS.
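
As a toy illustration (plain Python; the node names and round-robin placement policy are hypothetical, not actual GFS/HDFS internals), chunking and replication might look like this:

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in the card above
REPLICAS = 3                     # each chunk copied to 3 different nodes
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_chunks(file_size):
    n_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
    # Hypothetical round-robin placement: each chunk lands on REPLICAS distinct nodes.
    return {f"chunk-{i}": [NODES[(i + r) % len(NODES)] for r in range(REPLICAS)]
            for i in range(n_chunks)}

print(place_chunks(200 * 1024 * 1024))  # a 200 MB file -> 4 chunks, 3 replicas each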

2
Q

MapReduce - input, output, what should be programmed?

A

The data model in MapReduce treats files as bags of (key, value) pairs. A MapReduce program takes as input a bag of (input_key, value) pairs and outputs a bag of (output_key, value) pairs (the output_key is optional). The user must provide two stateless functions – map() and reduce() – that define how the input pairs are transformed into the output pair domain.
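
As a minimal sketch (plain Python word count; function and variable names are illustrative, not framework code), the user supplies only the two stateless functions, while the framework – simulated here – handles grouping by key:

from collections import defaultdict

def map_fn(doc_id, text):
    # (input_key, value) -> bag of intermediate (key, value) pairs
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # (key, bag of values) -> bag of output (key, value) pairs
    yield (word, sum(counts))

docs = {"d1": "the cat sat", "d2": "the dog sat"}

groups = defaultdict(list)          # the framework's shuffle: group by key
for doc_id, text in docs.items():
    for key, value in map_fn(doc_id, text):
        groups[key].append(value)

output = [pair for key, values in groups.items() for pair in reduce_fn(key, values)]
print(sorted(output))  # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]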

3
Q

MapReduce - describe the Map phase

A

The first part of MapReduce – the Map phase – applies a user-provided Map function in parallel to an input of (key1, value1) pairs and outputs a bag of (key2, value2) pairs. The types and values of key1, value1, key2, and value2 are independent of each other. The (key2, value2) pairs serve as intermediate tuples in the MapReduce process.

func map(key1, value1) -> bag((key2, value2))
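
For instance, a word-count mapper fits this signature (an illustrative Python sketch, not framework code): key1 is a document id, value1 is its text, and the output bag contains (word, 1) pairs.

def map_fn(doc_id, text):
    # Emits one intermediate (key2, value2) pair per word occurrence.
    for word in text.split():
        yield (word, 1)

print(list(map_fn("d1", "the cat sat on the mat")))
# [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1)]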

4
Q

MapReduce - describe the Reduce phase

A

The second part of MapReduce – the Reduce phase – groups all intermediate pairs emitted by the Map phase by their key2 and processes each group into a single bag of output values, as defined by a user-provided Reduce function. This function also runs in parallel, with each machine or worker thread processing a single group of pairs sharing the same key2. The (key3, value3) pairs serve as the output tuples of the MapReduce process.

func reduce(key2, bag(value2)) -> bag((key3, value3))
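
Continuing the word-count illustration (a Python sketch; names are hypothetical): the reducer receives one word together with the bag of all its intermediate counts.

def reduce_fn(word, counts):
    # Collapses the bag of intermediate values into output (key3, value3) pairs.
    yield (word, sum(counts))

print(list(reduce_fn("the", [1, 1, 1])))  # [('the', 3)]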

5
Q

The role of workers in MapReduce

A

MapReduce jobs are split up and run by workers: processes that execute one task at a time. With one worker per core, we can have multiple workers per node in our system.

Both the Map and Reduce phases can be split among different workers on different machines, with workers performing independent tasks in parallel.
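
A rough sketch of this worker model using Python's multiprocessing (the task itself is a hypothetical word count; a real framework schedules tasks across machines, not just cores):

from multiprocessing import Pool, cpu_count

def map_task(split):
    # One map task: count word occurrences in a single input split.
    counts = {}
    for word in split.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

if __name__ == "__main__":
    splits = ["the cat sat", "the dog ran", "a cat ran"]
    with Pool(processes=cpu_count()) as pool:    # one worker per core
        partial_counts = pool.map(map_task, splits)  # independent tasks in parallel
    print(partial_counts)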

6
Q

The role of a leader in MapReduce

A

MapReduce designates one machine as the leader node. Its role is to accept new MapReduce tasks and assign them to workers.

The leader begins by partitioning the input file into M splits by key and assigning workers to the M map tasks, keeping track of progress as workers complete them.

Map workers write their output to local disk, partitioned into R regions. The leader then assigns workers to the R reduce tasks, which write the final output to disk once complete.
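
A sketch of how a map worker might partition its intermediate output into R regions (hash(key) mod R is the default partitioning function in the original MapReduce paper; the data here is illustrative):

R = 4
regions = [[] for _ in range(R)]
intermediate = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]

for key, value in intermediate:
    # All pairs with the same key hash to the same region,
    # so they end up at the same reduce task.
    regions[hash(key) % R].append((key, value))

for i, region in enumerate(regions):
    print(f"region {i} -> reduce task {i}: {region}")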

7
Q

Main improvements of Spark vs. MapReduce

A
  • Consists of multiple steps instead of 1 mapper + 1 reducer.
  • Stores intermediate results in main memory.
  • Resembles relational algebra more closely (see the sketch after this list).
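
A hedged PySpark sketch of these points (assumes a local pyspark installation; the data and app name are illustrative): four chained transformations instead of one mapper + one reducer, intermediate results kept in memory, and a filter step that mirrors relational selection.

from pyspark import SparkContext

sc = SparkContext("local", "multi-step-example")
lines = sc.parallelize(["the cat sat", "the dog sat"])
counts = (lines.flatMap(lambda line: line.split())  # step 1: tokenize
               .map(lambda word: (word, 1))         # step 2: pair up
               .reduceByKey(lambda a, b: a + b)     # step 3: aggregate
               .filter(lambda kv: kv[1] > 1))       # step 4: select, like relational sigma
print(counts.collect())  # e.g. [('the', 2), ('sat', 2)] (order may vary)
sc.stop()
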
8
Q

Spark - explain what is RDD

A

Spark’s data model consists of semi-structured data objects called Resilient Distributed Datasets (RDDs). These objects, which can be anything from key-value pairs to objects of a certain type, are immutable datasets that can be distributed across multiple machines. For faster execution, RDDs are not written to disk between intermediate steps but are instead stored in main memory. Since this means intermediate results are lost if a node crashes, each RDD’s lineage – the sequence of operations that produced it – is tracked so that lost RDDs can be recomputed.
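
A hedged PySpark sketch (assumes a local pyspark installation): cache() keeps the RDD in main memory, and toDebugString() shows the lineage Spark would replay to recompute lost partitions.

from pyspark import SparkContext

sc = SparkContext("local", "rdd-example")
nums = sc.parallelize(range(10))             # an immutable, distributed dataset
squares = nums.map(lambda x: x * x).cache()  # kept in main memory, not on disk
print(squares.sum())                         # 285
print(squares.toDebugString().decode())      # lineage: parallelize -> map
sc.stop()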

9
Q

Spark - 2 types of operators

A

Transformations and actions.

Actions are eager operators, meaning they are executed immediately when called. These include operators such as count, reduce, and save.

Transformations, on the other hand, are evaluated lazily: they are not executed immediately but are instead recorded in the lineage log. An operator tree, similar to a relational algebra tree, is constructed in memory.
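
A hedged PySpark sketch (assumes a local pyspark installation): the two transformations only extend the operator tree, and no work happens until the action count() is called.

from pyspark import SparkContext

sc = SparkContext("local", "lazy-example")
rdd = sc.parallelize(range(1_000_000))
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: lazy, just recorded
doubled = evens.map(lambda x: 2 * x)      # transformation: still nothing executed
print(doubled.count())                    # action: triggers execution -> 500000
sc.stop()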
