Weeks 3-4: MapReduce Flashcards
Distributed File System (DFS)
This system stores very large files across many machines, replicates the data, and protects against failures. It’s used when files are huge (hundreds of GB or more), existing files are rarely updated, and reads and appends are the common operations.
Examples: Google GFS, Hadoop HDFS, CloudStore
Software Stack
The stack pairs a DFS with a programming model (such as MapReduce) that carries out large-scale distributed computations efficiently while tolerating hardware failures.
Chunk Servers
A DFS divides files into chunks and stores them on chunk servers; each chunk is replicated on different machines, or on different racks, to protect against server and rack failures.
Master Node
It stores the metadata used to locate the chunks of each file. The master node itself is replicated, as is the directory for the file system.
Client Library
It accesses files by contacting the master node for metadata and then connecting directly to the chunk servers for the data.
Hadoop HDFS
The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop.
HDFS Read Path
- The client requests the block (chunk) locations from the namenode before reading the file.
- Check if the file exists.
- Check if the client has read permission.
- Sort the block (chunk) locations by distance to the client.
- The datanodes (chunk servers) stream the data. If a server becomes unavailable, the client reads another replica of the block from a different server (see the sketch after this list).
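A runnable toy model of this read path might look like the sketch below. All class and method names are illustrative stand-ins, not the real HDFS client API:

```python
# Toy model of the HDFS read path (illustrative names, not the HDFS API).
from dataclasses import dataclass, field

@dataclass
class Datanode:
    name: str
    rack: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes
    alive: bool = True

    def distance_to(self, rack: str) -> int:
        # Same rack counts as "closer" than a remote rack.
        return 0 if self.rack == rack else 1

    def stream_block(self, block_id: int) -> bytes:
        if not self.alive:
            raise ConnectionError(self.name)
        return self.blocks[block_id]

@dataclass
class Namenode:
    metadata: dict  # path -> list of (block_id, [replica datanodes]), in file order

    def get_block_locations(self, path: str):
        if path not in self.metadata:  # does the file exist?
            raise FileNotFoundError(path)
        return self.metadata[path]     # (permission check omitted)

def read_file(namenode: Namenode, path: str, client_rack: str) -> bytes:
    data = bytearray()
    for block_id, replicas in namenode.get_block_locations(path):
        # Sort the replicas by distance to the client, nearest first.
        for dn in sorted(replicas, key=lambda d: d.distance_to(client_rack)):
            try:
                data += dn.stream_block(block_id)  # stream from this replica
                break
            except ConnectionError:
                continue  # datanode is down: fall back to the next replica
        else:
            raise IOError(f"all replicas of block {block_id} unavailable")
    return bytes(data)

# Two replicas of one block on different racks; the nearer one has failed.
dn1 = Datanode("dn1", rack="r1", blocks={0: b"hello"}, alive=False)
dn2 = Datanode("dn2", rack="r2", blocks={0: b"hello"})
nn = Namenode({"/f.txt": [(0, [dn1, dn2])]})
print(read_file(nn, "/f.txt", client_rack="r1"))  # b'hello', read from dn2
```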
HDFS Write Path
- Create the file request.
- Check if the file exists.
- Check if the client has write permission.
- An output stream object is created.
- The data is split into packets, which are added to a queue.
- The client asks the namenode (master node) to allocate new blocks (chunks).
- Connections are made with the datanodes (chunk servers).
- Datanode 1 consumes packets from the queue and writes each one to Datanode 2 (for replication), which writes to the next datanode, and so on; acknowledgements flow back along the pipeline, and Datanode 1 sends the final acknowledgement to the client (see the sketch after this list).
- The client closes the output stream and sends the request to close the file. Only at this point is the master node updated.
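The write pipeline can be modelled in the same toy style. The sketch below simplifies heavily (every name is assumed, and acknowledgements are counted rather than sent as messages), but it shows the packet queue and the replication pipeline:

```python
# Toy model of the HDFS write pipeline: split data into packets, queue them,
# and push each packet down a chain of replicas. Not Hadoop code.
from collections import deque

def write_file(data: bytes, pipeline: list, packet_size: int = 4) -> int:
    # Split the data into packets and add them to the queue.
    queue = deque(data[i:i + packet_size] for i in range(0, len(data), packet_size))
    acked = 0
    while queue:
        packet = queue.popleft()
        # The packet travels down the replication pipeline: each "datanode"
        # (a plain list standing in for its disk) stores a copy.
        for datanode in pipeline:
            datanode.append(packet)
        acked += 1  # the acknowledgement for this packet reaches the client
    return acked

replica1, replica2, replica3 = [], [], []  # three datanodes' block storage
n = write_file(b"to be or not to be", [replica1, replica2, replica3])
print(n, b"".join(replica1) == b"".join(replica3))  # 5 True
```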
MapReduce
This is a programming model for processing large datasets in parallel across a cluster of machines.
Workflow:
1. The input data is split into pieces, and multiple instances of the program start running.
2. One instance is elected as the master, and the rest are workers.
3. A worker assigned a map task reads its input split, parses it into key-value pairs, and passes each pair to the user’s map function. The intermediate pairs the map function produces are buffered in memory.
4. The buffered pairs are periodically written to the local disk and partitioned into regions (one per reducer). The locations of the pairs on disk are passed back to the master.
5. When a reduce worker is notified by the master about these locations, it reads the buffered pairs from the map workers’ disks and groups them by key into (key, list-of-values) pairs.
6. The reduce worker passes each key and its list of values to the user’s reduce function, whose output is appended to the final output file.
7. When all map and reduce tasks are complete, control returns to the user program. The program may use the output as input to another MapReduce task (a minimal sketch of the whole workflow follows this list).
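As a minimal single-machine sketch of steps 1-6 (no distribution or fault tolerance; `map_fn` and `reduce_fn` stand for the user-written functions):

```python
# Single-machine sketch of the MapReduce workflow; helper names are assumed.
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Steps 1 and 3: run the user's map function over every input pair
    # and buffer the intermediate pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Step 5: group the intermediate pairs by key (the shuffle).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Step 6: run the user's reduce function on each (key, list-of-values) group.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Word count in one line each for map and reduce:
print(map_reduce([("doc1", "a b a")],
                 lambda k, v: [(w, 1) for w in v.split()],
                 lambda k, vs: [(k, sum(vs))]))  # [('a', 2), ('b', 1)]
```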
Map Task
The purpose of this task is to extract something of interest. It obtains chunks from the DFS and converts them into key-value pairs according to user-written code.
Shuffle and Sort Task
This is done automatically by the system. The data from all mappers are grouped by key, split among the reducers, and sorted by key, so that each reducer receives all values associated with a given key (modelled in the sketch below).
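One way to model this step is to sort the mappers’ intermediate pairs by key and then group adjacent pairs, as in this small sketch (plain Python, not Hadoop code):

```python
# Shuffle and sort, modelled as: sort all intermediate pairs by key,
# then group adjacent pairs with the same key.
from itertools import groupby
from operator import itemgetter

pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]  # from all mappers
pairs.sort(key=itemgetter(0))                               # sort by key
grouped = [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]
print(grouped)  # [('a', [1, 1]), ('b', [1, 1]), ('c', [1])]
```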
Reduce Task
This task aggregates, summarises, filters, or transforms the data. Each reducer obtains all values associated with the same key, and user-written code determines which key-value pairs to emit.
Map Function
Input: key-value pair
Output: a list of key-value pairs (can be empty)
The map function is called once for each input key-value pair.
Reduce Function
Input: a key and the list of all values associated with it
Output: a list of key-value pairs (can be empty)
The reduce function is called once for each distinct key.
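As a sketch, the two user-written functions might look like the following toy job that keeps only even numbers (all names and types here are assumptions, chosen to show that the map output may be empty):

```python
from typing import Iterable, List, Tuple

def map_fn(key: str, value: int) -> List[Tuple[int, int]]:
    # Called once per input key-value pair; the output list may be empty.
    return [(value, 1)] if value % 2 == 0 else []

def reduce_fn(key: int, values: Iterable[int]) -> List[Tuple[int, int]]:
    # Called once per distinct key, with every value for that key.
    return [(key, sum(values))]
```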
WordCount
The objective of WordCount is to return key-value pairs, with the keys being words and the values being the counts of the words.
Input: text file
1. The text file is parsed, so that there is one key-value pair for each word.
2. The key-value pairs are grouped by key.
3. The key-value pairs are reduced, so that there’s one key-value pair for each unique word, with its corresponding count.
Note that the key-value pairs are ordered by how early the first instance of each word appears in the text file.
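A small end-to-end sketch of these three steps (plain Python with assumed function names; a Python dict preserves insertion order, which reproduces the first-appearance ordering noted above):

```python
# WordCount: one (word, 1) pair per word, grouped by key, then summed.
from collections import defaultdict

def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]  # step 1: one pair per word

def wc_reduce(word, counts):
    return [(word, sum(counts))]                 # step 3: sum each word's counts

pairs = wc_map("doc1", "to be or not to be")
groups = defaultdict(list)                       # step 2: group pairs by key;
for word, count in pairs:                        # keys keep first-seen order
    groups[word].append(count)
result = [kv for word, counts in groups.items() for kv in wc_reduce(word, counts)]
print(result)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```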