2. Map Reduce Flashcards
What is a distributed file system?
Long-term storage that can hold very large amounts of information and allow multiple processes to access it concurrently
How are files stored in a DFS?
Files are split into chunks and the chunks are stored separately. Typically, chunks are replicated and kept on different racks for fault tolerance
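As a rough illustration of chunking and rack-aware replication (a minimal sketch with made-up helper names, not a real DFS API):

```python
import math

CHUNK_SIZE_MB = 64        # a common chunk size
REPLICAS = 3              # a typical replication factor

def plan_chunks(file_size_mb, racks, chunk_size_mb=CHUNK_SIZE_MB, replicas=REPLICAS):
    """Return one entry per chunk listing the racks that hold a replica of it."""
    assert replicas <= len(racks), "need a distinct rack for each replica"
    num_chunks = math.ceil(file_size_mb / chunk_size_mb)
    plan = []
    for chunk_id in range(num_chunks):
        # Spread replicas over different racks so a whole-rack failure
        # destroys at most one copy of any chunk.
        targets = [racks[(chunk_id + i) % len(racks)] for i in range(replicas)]
        plan.append((chunk_id, targets))
    return plan

# A 200 MB file becomes 4 chunks, each replicated on 3 distinct racks.
for chunk_id, targets in plan_chunks(200, ["rack-1", "rack-2", "rack-3", "rack-4"]):
    print(chunk_id, targets)
```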
What are the advantages of a DFS?
- Allows for data scalability
- Provides fault tolerance
- High concurrency
What is the cluster architecture for a DFS?
Nodes, each made up of memory, a CPU, and a disk, are the basic building block
Nodes are organized into racks
Nodes within a rack are linked by a rack switch
Rack switches are linked by backbone switches, providing connectivity and fault tolerance between racks
What are the speeds of switches in a cluster architecture?
Rack switch has 1 Gbps between any pair of nodes in a rack
Backbone switch has 2-10 Gbps between racks
What is a commodity cluster?
Low-cost, less-specialized distributed computers that are connected to form the cluster architecture; they trade specialized hardware for affordability
What are some common failures in commodity clusters?
- Node failure
- Link failure
- Rack failure
- Failure of the connection between two nodes
How can we solve the issue of network bottlenecks when using commodity clusters?
- Store files multiple times for reliability
- Bring computation close to the data
What is a big data programming model?
Programmability on top of distributed file systems
What are the requirements of a big data programming model?
1. Supports big data operations: fast access and distributing computation to the nodes
2. Handles fault tolerance: replicates data partitions and recovers files when needed
3. Enables scaling out by adding more racks
What is map reduce?
A big data programming model that applies an operation to every input element to produce key-value pairs (map) and then performs a summarizing operation on the values that share a key (reduce)
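A minimal in-memory sketch of the model (not a real MapReduce framework; the function names are invented for illustration): map emits key-value pairs, the framework groups values by key, and reduce summarizes each group.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user's map function to every record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def reduce_phase(pairs, reduce_fn):
    """Group values by key, then apply the user's reduce function to each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: total sales per region.
sales = [("east", 10), ("west", 5), ("east", 7)]
pairs = map_phase(sales, lambda rec: [(rec[0], rec[1])])     # map: emit (region, amount)
print(reduce_phase(pairs, lambda key, values: sum(values)))  # reduce: sum per region
# -> {'east': 17, 'west': 5}
```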
What are the challenges of big data programming models that map reduce overcomes?
- Storing data redundantly on multiple nodes
- Moving computation close to data to minimize expensive movement
- Hiding this complexity behind a simple programming model
Describe how the map reduce algorithm performs the word count task
- Each map node has a chunk of a file.
- Each map node generates key-value pairs of the form (word, 1)
- Pairs are shuffled and sorted so that all pairs with the same key are sent to the same reduce node
- Each reduce node adds up the values for each of its keys to get the word counts
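The steps above can be sketched in Python. In a real job the map and reduce steps run as separate tasks on different nodes; here a single sorted() call stands in for the shuffle/sort between them.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map step: emit a ("word", 1) pair for every word in the node's chunk.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce step: pairs arrive grouped by key, so equal words are adjacent;
    # summing the 1s for each run of identical words gives the count.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    shuffled = sorted(mapper(sys.stdin))   # stands in for the shuffle/sort
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")
```

For example, `echo "to be or not to be" | python wordcount.py` prints counts of 2 for "to" and "be" and 1 for the rest.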
What is map reduce a bad tool for?
- Frequently changing data
- Dependent tasks
- Interactive analysis
What are the components of a distributed file system?
- Chunk servers (data nodes)
- Master node (name node)
- Client library
What is the master node (name node)?
- Stores metadata about where files are stored
- May be replicated for fault tolerance
- Used by processes to find files
What is the client library?
- Talks to the master node to find chunk servers
- Connects directly to chunk servers to access the data
- Essentially the process that gets close to the data to do work
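A hypothetical sketch of the read path described by the master node and client library cards (class and method names are invented, not a real HDFS/GFS API): the client asks the master where each chunk lives, then fetches the bytes directly from a chunk server.

```python
class ChunkServer:
    def __init__(self, chunks):
        self.chunks = chunks                 # chunk_id -> bytes held on this node

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]

class MasterNode:
    def __init__(self, metadata):
        self.metadata = metadata             # path -> [(chunk_id, [replica servers]), ...]

    def locate_chunks(self, path):
        return self.metadata[path]

class ClientLibrary:
    def __init__(self, master):
        self.master = master

    def read(self, path):
        data = b""
        # Ask the master only for metadata: which servers hold each chunk.
        for chunk_id, servers in self.master.locate_chunks(path):
            # Fetch the bytes directly from a chunk server, so bulk data
            # never flows through the master.
            data += servers[0].read_chunk(chunk_id)
        return data

# One file split into two chunks, each replicated on both servers.
s1 = ChunkServer({0: b"hello ", 1: b"world"})
s2 = ChunkServer({0: b"hello ", 1: b"world"})
master = MasterNode({"/logs/a.txt": [(0, [s1, s2]), (1, [s2, s1])]})
print(ClientLibrary(master).read("/logs/a.txt"))   # b'hello world'
```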
What is a chunk server?
A node (data node) that stores replicated file chunks, typically 16-64 MB in size. Replicas of a chunk are kept on different racks for fault tolerance
What does the map reduce environment take care of?
- Partitioning input data
- Scheduling program execution across a set of machines
- Performing the group by key step
- Handling machine failures
- Managing required inter-machine communication
Where is data stored during the map reduce process?
- Input and output data are stored on the distributed file system
- Intermediate results are stored on the local file system of the map and reduce nodes
What are the possible states a task can be in?
- Idle (waiting to be scheduled by the master as workers become available)
- In-progress
- Completed
What happens when a map task completes its work?
- Sends the master node the locations and sizes of its intermediate files, one per reducer
- The master pushes this information to the reducers
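A sketch of how a map task commonly splits its output into one intermediate bucket per reducer; the hash partitioning shown here is the usual convention and is assumed rather than stated on the card.

```python
from collections import defaultdict

NUM_REDUCERS = 4  # R, chosen by the job configuration

def partition(key, num_reducers=NUM_REDUCERS):
    # Route every occurrence of the same key to the same reducer.
    # A real framework would use a deterministic hash across machines.
    return hash(key) % num_reducers

def write_intermediate(pairs):
    """Bucket map output into R in-memory 'files', one per reducer."""
    files = defaultdict(list)
    for key, value in pairs:
        files[partition(key)].append((key, value))
    # The map task would now report each bucket's location and size to the
    # master, which forwards that information to the corresponding reducer.
    return files

print(write_intermediate([("apple", 1), ("pear", 1), ("apple", 1)]))
```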
How are node failures detected?
The master node pings workers periodically
How are failures of map nodes handled?
- Map tasks that were completed or in progress are reset to idle, since completed map output is stored on the failed node's local disk
- The idle tasks are eventually rescheduled on other workers
How are failures of reduce nodes handled?
- Only in-progress tasks are reset to idle, since completed reduce output is already stored on the distributed file system
- Idle reduce tasks are restarted on other workers
How is failure of a master node handled?
The entire map reduce job is aborted and the client is notified
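The reset rules from the three failure cards above, summarized as a sketch (the task records and field names are invented for illustration):

```python
def handle_worker_failure(tasks, failed_worker):
    """Apply the reset rules when a pinged worker stops responding."""
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" and task["state"] in ("in-progress", "completed"):
            # Completed map output lives on the failed worker's local disk, so redo it too.
            task["state"], task["worker"] = "idle", None
        elif task["kind"] == "reduce" and task["state"] == "in-progress":
            # Completed reduce output is already on the DFS, so only in-progress work is redone.
            task["state"], task["worker"] = "idle", None
    return tasks

tasks = [
    {"kind": "map", "state": "completed", "worker": "w1"},
    {"kind": "reduce", "state": "in-progress", "worker": "w1"},
    {"kind": "reduce", "state": "completed", "worker": "w1"},
]
print(handle_worker_failure(tasks, "w1"))
```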
What is the rule of thumb for how many map tasks to make?
Make it much larger than the number of nodes in the cluster. One DFS chunk per map is common
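A worked example of the rule of thumb with illustrative numbers (not taken from the cards):

```python
import math

input_size_gb = 640            # illustrative input size
chunk_size_mb = 64             # one map task per 64 MB DFS chunk
num_nodes = 100                # illustrative cluster size

num_map_tasks = math.ceil(input_size_gb * 1024 / chunk_size_mb)
print(num_map_tasks)                 # 10240 map tasks
print(num_map_tasks / num_nodes)     # ~102 tasks per node, far more than one each
```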