Weeks 3-4: MapReduce Flashcards

1
Q

Distributed File System (DFS)

A

This system stores very large files, provides replication, and protects against failures. It’s used when files are huge (hundreds of GB or more), existing files are rarely updated, and reads and appends are common.

Examples: Google GFS, Hadoop HDFS, CloudStore

2
Q

Software Stack

A

It consists of a DFS and a programming model that performs large-scale, distributed computations efficiently while tolerating hardware failures.

3
Q

Chunk Servers

A

A DFS divides files into chunks and stores them on chunk servers; each chunk is replicated on different machines or racks (to protect against server or rack failures).

4
Q

Master Node

A

It stores the metadata used to find the chunks of each file. The master node is also replicated, as is the directory of the file system.

5
Q

Client Library

A

It accesses files for the client: it communicates with the master node to locate chunks and then connects directly to the chunk servers.

6
Q

Hadoop HDFS

A

The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop.

7
Q

HDFS Read Path

A
  1. Obtain the locations of the blocks: the client requests the block locations before reading the file.
  2. Check if the file exists.
  3. Check if the client has read permission.
  4. Sort the block (chunk) locations by distance to the client.
  5. The datanodes (chunk servers) stream data. If the server becomes unavailable, the client can read another replica of the file on a different server.
8
Q

HDFS Write Path

A
  1. Create the file request.
  2. Check if the file exists.
  3. Check if the client has write permission.
  4. The outputstream object is created.
  5. Split the data into packets and add them into the queue.
  6. Ask the namenode (master node) to allocate new blocks (chunks).
  7. Make connections with the datanodes (chunk servers).
  8. Datanode 1 consumes data from the queue and writes it to Datanode 2 (for replication), which sends an acknowledgement back to Datanode 1, and so on down the pipeline; finally, Datanode 1 sends an acknowledgement back to the client.
  9. The client closes the outputstream and sends the request to close the file. This is when the master node is updated.
9
Q

MapReduce

A

This is a programming model for handling large datasets.

Workflow:
1. The input data is split into pieces and multiple instances of the programme begin.
2. One instance is elected as the master, and the rest are workers.
3. A worker with a map task reads its input split, parses it into key-value pairs, and passes each pair to the user’s map function. The output pairs are buffered in memory.
4. Buffered pairs are periodically written to the local disk and partitioned into regions. The locations of the pairs on disk are passed to the master.
5. When a reduce worker is notified by the master of these locations, it reads the buffered pairs from the map workers’ disks and groups them by key into (key, list-of-values) pairs.
6. The reduce worker passes each key and its list of values to the user’s reduce function, whose output is appended to the final output file.
7. When all map and reduce tasks are complete, control returns to the user programme. The programme may use output as input to another MapReduce task.

10
Q

Map Task

A

The purpose of this task is to extract something of interest. It obtains chunks from the DFS and converts them into key-value pairs according to user-written code.

11
Q

Shuffle and Sort Task

A

This is done by the system automatically. Data from all mappers are grouped by key, split among reducers, and sorted by key. Each reducer gets all values associated with the same key.

12
Q

Reduce Task

A

This task aggregates, summarises, filters, or transforms the data. Each reducer obtains all values associated with the same key, and user-written code determines what key-value pairs to create.

13
Q

Map Function

A

Input: a key-value pair
Output: a list of key-value pairs (can be empty)
The map function is called once for each input key-value pair.

14
Q

Reduce Function

A

Input: a key and the list of values associated with that key
Output: a list of key-value pairs (can be empty)
The reduce function is called once for each distinct key.
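
Below is a minimal, framework-free sketch of the two signatures, using word counting as the example (the names map_fn and reduce_fn are just illustrative):

def map_fn(key, value):
    # Called once per input key-value pair; may emit zero or more pairs.
    # Here the key is ignored and the value is a line of text.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Called once per distinct key, with all the values for that key.
    yield key, sum(values)

print(list(map_fn(None, "to be or not to be")))   # [('to', 1), ('be', 1), ...]
print(list(reduce_fn("to", [1, 1])))              # [('to', 2)]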

15
Q

WordCount

A

The objective of WordCount is to return key-value pairs, with the keys being words and the values being the counts of the words.

Input: text file
1. The text file is parsed, so that there is one key-value pair for each word.
2. The key-value pairs are grouped by key.
3. The key-value pairs are reduced, so that there’s one key-value pair for each unique word, with its corresponding count.

Note that the key-value pairs are ordered by how early the first instance of the word appears in the text file.
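
A small pure-Python simulation of the three steps above (the sample sentence is made up; this is not framework code):

from collections import defaultdict

text = "the quick brown fox jumps over the lazy dog the end"

# 1. Parse: one (word, 1) pair per word.
pairs = [(word, 1) for word in text.split()]

# 2. Group the pairs by key (word).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# 3. Reduce: one (word, total-count) pair per unique word.
counts = [(word, sum(values)) for word, values in groups.items()]
print(counts)   # [('the', 3), ('quick', 1), ('brown', 1), ...]

Because Python dictionaries preserve insertion order, the counts come out ordered by each word’s first appearance, matching the note above.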

16
Q

Master Failure

A

In this case, the MapReduce task is aborted and must restart.

17
Q

Map Worker Failure

A

The map tasks completed or in-progress at the worker are reset to idle. Reduce workers are notified when the task is rescheduled to another worker.

18
Q

Reduce Worker Failure

A

Only in-progress tasks are reset to idle. The reduce task is rescheduled to start later.

19
Q

Map to Reduce Ratio

A

Usually, there are many more map tasks than reduce tasks. This improves dynamic load balancing, speeds up recovery from worker failures, and can pipeline shuffling with map execution. Usually, the number of reduce tasks is a small multiple of the number of nodes, as output is spread across reduce files.

20
Q

Backup Tasks

A

This deals with the issue of slow workers, which may be caused by multiple jobs running on the same machine, bad disks, or bugs. The solution is to spawn backup copies of the in-progress tasks when a MapReduce operation is close to completion; whichever copy finishes first is used. This dramatically shortens job completion time.

21
Q

Combiners

A

Often the map task will produce many pairs with the same key (e.g. in WordCount). Combiners save network time by pre-aggregating values in the mapper. A combiner is usually the same as the reduce function, but this only works for functions that are both commutative and associative.
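
A sketch of a per-word count with a combiner, written with the mrjob library that appears later in this deck (the class name here is just illustrative):

from mrjob.job import MRJob

class MRWordCountWithCombiner(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Pre-aggregate counts on the mapper's machine to save network time.
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCountWithCombiner.run()

Summing is commutative and associative, so reusing the reduce logic as the combiner is safe here.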

22
Q

Partitioner

A

This divides the intermediate (key, value) pairs produced by the Map tasks among the Reduce tasks. Partitioners ensure that the same key, which can be output by multiple mappers, goes to the same reducer.

By default: hash(key) mod R

This aims to create “well-balanced” partitions. It’s sometimes useful to override the hash function.

Example: hash(hostname(URL)) mod R puts all URLs from the same host in the same output file.
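
A small Python illustration of both partition functions; zlib.crc32 stands in for the framework’s hash function, and R and the example URLs are made up:

from urllib.parse import urlparse
import zlib

R = 4  # number of reduce tasks (assumed)

def default_partition(key):
    # Default partitioner: hash(key) mod R.
    return zlib.crc32(str(key).encode()) % R

def host_partition(url):
    # Override: hash(hostname(URL)) mod R, so every URL from the same host
    # lands in the same partition (and hence the same output file).
    return zlib.crc32(urlparse(url).hostname.encode()) % R

print(host_partition("http://example.com/page1"))
print(host_partition("http://example.com/page2"))  # same partition as above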

23
Q

mrjob

A

This is a Python library for writing and running MapReduce jobs.

24
Q

WordCount Code

A

File: mr_word_count.py

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Note that the mapper receives an empty key and a line of text as its value, that the mapper yields three key-value pairs per line, and that the reducer returns the sum of the values for each key.
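
Assuming mrjob is installed (pip install mrjob), the job can be run locally with python mr_word_count.py input.txt, where input.txt is any local text file; the same script can also target Hadoop or Amazon EMR via mrjob’s -r hadoop or -r emr options.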

25
Q

Map-Reduce Join

A

If you need to join tuples of the form R(a,b) and S(b,c) into tuples of the form (a,c), the map process turns each input tuple R(a,b) into the key-value pair (b,(a,R)) and each input tuple S(b,c) into the key-value pair (b,(c,S)). For each key b, the reduce process then pairs every (a,R) value with every (c,S) value and outputs (a,c).
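
A pure-Python sketch of this join; the relations R and S below are made-up examples, and each value is tagged with its relation name so the reducer can tell them apart:

from itertools import groupby
from operator import itemgetter

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]   # tuples (a, b)
S = [("b1", "c1"), ("b2", "c2")]                 # tuples (b, c)

def map_phase():
    # Key every tuple by b and tag it with its relation.
    for a, b in R:
        yield b, ("R", a)
    for b, c in S:
        yield b, ("S", c)

def reduce_phase(pairs):
    # Group by key b, then pair every a from R with every c from S.
    for b, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        values = [value for _, value in group]
        a_values = [v for tag, v in values if tag == "R"]
        c_values = [v for tag, v in values if tag == "S"]
        for a in a_values:
            for c in c_values:
                yield a, c

print(list(reduce_phase(map_phase())))   # [('a1', 'c1'), ('a2', 'c1'), ('a3', 'c2')]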

26
Q

Communication Cost

A

This cost measure sums up all the I/O of all processes. Often, communication is slower than computation, at least for simple algorithms. This is because simple computations still require lots of reading and writing of data.

Communication cost = input file size + 2 x (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
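
A tiny worked example of the formula, with made-up sizes (say, in GB):

input_size = 100          # total size read by the Map processes (assumed)
intermediate_size = 80    # total size of files passed from Map to Reduce (assumed)
output_size = 10          # total size of the Reduce outputs (assumed)

# Intermediate files are counted twice: once written by Map, once read by Reduce.
communication_cost = input_size + 2 * intermediate_size + output_size
print(communication_cost)   # 270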

27
Q

Elapsed Communication Cost

A

This cost measure finds the maximum of the I/O along any path.

28
Q

Elapsed Computation Cost

A

This cost measure is similar to the elapsed communication cost, but it counts the running time of processes as well. It corresponds to wall-clock time, taking parallelism into account.

Elapsed communication cost = the largest input + output (for any map process) + the largest input + output (for any reduce process)

29
Q

Cost Measure

A

Either the I/O (communication) cost or the processing (computation) cost dominates. The total cost indicates how much it would cost to run the job on a cloud service.

30
Q

Key Complexity

A

This is a complexity measure for MapReduce. Over all key-value pairs input to or output by any Mapper or Reducer, take the maximum size, maximum running time, and maximum memory used.

31
Q

Sequential Complexity

A

This is a complexity measure where, over all Mappers and Reducers, you sum up the sizes of all key-value pairs input and output, as well as the total running time of all Mappers and Reducers.

32
Q

Master MapReduce Instance

A

This instance is established as the master by election, with the rest of the MapReduce instances becoming workers. The master periodically pings workers to detect failures, and it assigns map and reduce tasks to idle workers.

33
Q

MapReduce Drawbacks

A
  1. It’s slower than alternatives such as Spark.
  2. Its main implementation, Hadoop, is inefficient because results are written back to the distributed file system between jobs.
  3. It’s inefficient for applications that reuse intermediate results across multiple computations (e.g. PageRank, k-means clustering).
  4. It’s inefficient for multiple ad-hoc queries on the same data (this is common in data mining).