Weeks 3-4: MapReduce Flashcards

1
Q

Distributed File System (DFS)

A

This system stores very large files, provides replication, and protects against failures. It’s used when files are huge (hundreds of GB or more), existing files are rarely updated, and reads and appends are common.

Examples: Google GFS, Hadoop HDFS, CloudStore

2
Q

Software Stack

A

It consists of a DFS and a programming model that performs large-scale, distributed computations efficiently while tolerating hardware failures.

3
Q

Chunk Servers

A

A DFS divides files into chunks and stores them on chunk servers; each chunk is replicated on different machines or racks (to protect against server or rack failures).

4
Q

Master Node

A

It stores the metadata used to find the chunks of each file. The master node is also replicated, as is the directory of the file system.

5
Q

Client Library

A

It accesses files for the client: it communicates with the master node to locate chunks and then connects directly to the chunk servers.

6
Q

Hadoop HDFS

A

The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop.

7
Q

HDFS Read Path

A
  1. Obtain the locations of the blocks: the client requests the block locations before reading the file.
  2. Check if the file exists.
  3. Check if the client has read permission.
  4. Sort the block (chunk) locations by distance to the client.
  5. The datanodes (chunk servers) stream data. If the server becomes unavailable, the client can read another replica of the file on a different server.
8
Q

HDFS Write Path

A
  1. Create the file request.
  2. Check if the file exists.
  3. Check if the client has write permission.
  4. The outputstream object is created.
  5. Split the data into packets and add them into the queue.
  6. Ask the namenode (master node) to allocate new blocks (chunks).
  7. Make connections with the datanodes (chunk servers).
  8. Datanode 1 consumes data from the queue and writes it to Datanode 2 (for replication), which sends an acknowledgement back to Datanode 1, and so on down the pipeline; finally, Datanode 1 sends an acknowledgement back to the client.
  9. The client closes the outputstream and sends the request to close the file. This is when the master node is updated.
9
Q

MapReduce

A

This is a programming model for handling large datasets.

Workflow:
1. The input data is split into pieces and multiple instances of the programme begin.
2. One instance is elected as the master, and the rest are workers.
3. A worker with a map task reads its input split, parses it into key-value pairs, and passes each pair to the user’s map function. The output pairs are buffered in memory.
4. Buffered pairs are periodically written to the local disk and partitioned into regions. The locations of the pairs on disk are passed to the master.
5. When a reduce worker is notified by the master of these locations, it reads the buffered pairs from the map workers’ disks and groups them by key into (key, list-of-values) pairs.
6. The reduce worker passes each key and its list of values to the user’s reduce function, whose output is appended to the final output file.
7. When all map and reduce tasks are complete, control returns to the user programme. The programme may use output as input to another MapReduce task.

10
Q

Map Task

A

The purpose of this task is to extract something of interest. It obtains chunks from the DFS and converts them into key-value pairs according to user-written code.

11
Q

Shuffle and Sort Task

A

This is done by the system automatically. Data from all mappers are grouped by key, split among reducers, and sorted by key. Each reducer gets all values associated with the same key.

12
Q

Reduce Task

A

This task aggregates, summarises, filters, or transforms the data. Each reducer obtains all values associated with the same key, and user-written code determines what key-value pairs to create.

13
Q

Map Function

A

Input: a key-value pair
Output: a list of key-value pairs (can be empty)
The map function is called once for each input key-value pair.

14
Q

Reduce Function

A

Input: a key and the list of values associated with that key
Output: a list of key-value pairs (can be empty)
The reduce function is called once for each distinct key.
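
Below is a minimal, framework-free sketch of the two signatures, using word counting as the example (the names map_fn and reduce_fn are just illustrative):

def map_fn(key, value):
    # Called once per input key-value pair; may emit zero or more pairs.
    # Here the key is ignored and the value is a line of text.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Called once per distinct key, with all the values for that key.
    yield key, sum(values)

print(list(map_fn(None, "to be or not to be")))   # [('to', 1), ('be', 1), ...]
print(list(reduce_fn("to", [1, 1])))              # [('to', 2)]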

15
Q

WordCount

A

The objective of WordCount is to return key-value pairs, with the keys being words and the values being the counts of the words.

Input: text file
1. The text file is parsed, so that there is one key-value pair for each word.
2. The key-value pairs are grouped by key.
3. The key-value pairs are reduced, so that there’s one key-value pair for each unique word, with its corresponding count.

Note that the key-value pairs are ordered by how early the first instance of the word appears in the text file.
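
A small pure-Python simulation of the three steps above (the sample sentence is made up; this is not framework code):

from collections import defaultdict

text = "the quick brown fox jumps over the lazy dog the end"

# 1. Parse: one (word, 1) pair per word.
pairs = [(word, 1) for word in text.split()]

# 2. Group the pairs by key (word).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# 3. Reduce: one (word, total-count) pair per unique word.
counts = [(word, sum(values)) for word, values in groups.items()]
print(counts)   # [('the', 3), ('quick', 1), ('brown', 1), ...]

Because Python dictionaries preserve insertion order, the counts come out ordered by each word’s first appearance, matching the note above.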

16
Q

Master Failure

A

In this case, the MapReduce task is aborted and must restart.

17
Q

Map Worker Failure

A

The map tasks completed or in-progress at the worker are reset to idle. Reduce workers are notified when the task is rescheduled to another worker.

18
Q

Reduce Worker Failure

A

Only in-progress tasks are reset to idle. The reduce task is rescheduled to start later.

19
Q

Map to Reduce Ratio

A

Usually, there are many more map tasks than reduce tasks. This improves dynamic load balancing, speeds up recovery from worker failures, and can pipeline shuffling with map execution. Usually, the number of reduce tasks is a small multiple of the number of nodes, as output is spread across reduce files.

20
Q

Backup Tasks

A

This deals with the issue of slow workers, which may be caused by multiple jobs running on the same machine, bad disks, or bugs. The solution is to spawn backup copies of the in-progress tasks when a MapReduce operation is close to completion; whichever copy finishes first is used. This dramatically shortens job completion time.

21
Q

Combiners

A

Often the map task will produce many pairs with the same key (e.g. in WordCount). Combiners save network time by pre-aggregating values in the mapper. A combiner is usually the same as the reduce function, but this only works for functions that are both commutative and associative.
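
A sketch of a per-word count with a combiner, written with the mrjob library that appears later in this deck (the class name here is just illustrative):

from mrjob.job import MRJob

class MRWordCountWithCombiner(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Pre-aggregate counts on the mapper's machine to save network time.
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCountWithCombiner.run()

Summing is commutative and associative, so reusing the reduce logic as the combiner is safe here.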

22
Q

Partitioner

A

This divides the intermediate (key, value) pairs produced by the Map tasks among the Reduce tasks. Partitioners ensure that the same key, which can be output by multiple mappers, goes to the same reducer.

By default: hash(key) mod R

This aims to create “well-balanced” partitions. It’s sometimes useful to override the hash function.

Example: hash(hostname(URL)) mod R puts all URLs from the same host in the same output file.
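
A small Python illustration of both partition functions; zlib.crc32 stands in for the framework’s hash function, and R and the example URLs are made up:

from urllib.parse import urlparse
import zlib

R = 4  # number of reduce tasks (assumed)

def default_partition(key):
    # Default partitioner: hash(key) mod R.
    return zlib.crc32(str(key).encode()) % R

def host_partition(url):
    # Override: hash(hostname(URL)) mod R, so every URL from the same host
    # lands in the same partition (and hence the same output file).
    return zlib.crc32(urlparse(url).hostname.encode()) % R

print(host_partition("http://example.com/page1"))
print(host_partition("http://example.com/page2"))  # same partition as above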

23
Q

mrjob

A

This is a Python library for writing and running MapReduce jobs.

24
Q

WordCount Code

A

File: mr_word_count.py

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Note that the mapper receives an empty key and a line of text as its value, that the mapper yields three key-value pairs per line, and that the reducer returns the sum of the values for each key.
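
Assuming mrjob is installed (pip install mrjob), the job can be run locally with python mr_word_count.py input.txt, where input.txt is any local text file; the same script can also target Hadoop or Amazon EMR via mrjob’s -r hadoop or -r emr options.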

25
Q

Map-Reduce Join

A

If you need to join tuples of the form R(a,b) and S(b,c) into tuples of the form (a,c), the map process turns each input tuple R(a,b) into the key-value pair (b,(a,R)) and each input tuple S(b,c) into the key-value pair (b,(c,S)). For each key b, the reduce process then pairs every (a,R) value with every (c,S) value and outputs (a,c).
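
A pure-Python sketch of this join; the relations R and S below are made-up examples, and each value is tagged with its relation name so the reducer can tell them apart:

from itertools import groupby
from operator import itemgetter

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]   # tuples (a, b)
S = [("b1", "c1"), ("b2", "c2")]                 # tuples (b, c)

def map_phase():
    # Key every tuple by b and tag it with its relation.
    for a, b in R:
        yield b, ("R", a)
    for b, c in S:
        yield b, ("S", c)

def reduce_phase(pairs):
    # Group by key b, then pair every a from R with every c from S.
    for b, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        values = [value for _, value in group]
        a_values = [v for tag, v in values if tag == "R"]
        c_values = [v for tag, v in values if tag == "S"]
        for a in a_values:
            for c in c_values:
                yield a, c

print(list(reduce_phase(map_phase())))   # [('a1', 'c1'), ('a2', 'c1'), ('a3', 'c2')]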

26
Q

Communication Cost

A

This cost measure sums up all the I/O of all processes. Often, communication is slower than computation, at least for simple algorithms. This is because simple computations still require lots of reading and writing of data.

Communication cost = input file size + 2 x (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
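
A tiny worked example of the formula, with made-up sizes (say, in GB):

input_size = 100          # total size read by the Map processes (assumed)
intermediate_size = 80    # total size of files passed from Map to Reduce (assumed)
output_size = 10          # total size of the Reduce outputs (assumed)

# Intermediate files are counted twice: once written by Map, once read by Reduce.
communication_cost = input_size + 2 * intermediate_size + output_size
print(communication_cost)   # 270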

27
Q

Elapsed Communication Cost

A

This cost measure finds the maximum of the I/O along any path.

28
Q

Elapsed Computation Cost

A

This cost measure is similar to the elapsed communication cost, but it counts the running time of processes as well. It corresponds to wall-clock time, taking parallelism into account.

Elapsed communication cost = the largest input + output (for any map process) + the largest input + output (for any reduce process)

29
Q

Cost Measure

A

Either the I/O (communication) cost or the processing (computation) cost dominates. The total cost indicates how much it would cost to run the job on a cloud service.

30
Q

Key Complexity

A

This is a complexity measure for MapReduce. Over all key-value pairs input to or output by any Mapper or Reducer, take the maximum size, maximum running time, and maximum memory used.

31
Q

Sequential Complexity

A

This is a complexity measure where, over all Mappers and Reducers, you sum up the sizes of all key-value pairs input and output, as well as the total running time of all Mappers and Reducers.

32
Q

Master MapReduce Instance

A

This instance is established as the master by election, with the rest of the MapReduce instances becoming workers. The master periodically pings workers to detect failures, and it assigns map and reduce tasks to idle workers.

33
Q

MapReduce Drawbacks

A
  1. It’s slower than alternatives such as Spark.
  2. Its main implementation, Hadoop, is inefficient because results are written back to the distributed file system between jobs.
  3. It’s inefficient for applications that reuse intermediate results across multiple computations (e.g. PageRank, k-means clustering).
  4. It’s inefficient for multiple ad-hoc queries on the same data (this is common in data mining).