2. Map Reduce Flashcards
What is a distributed file system?
A long-term information storage which enables storing large amounts of info and access of multiple processes
How are files stored in a DFS?
Files are split into chunks and the chunks are stored separately. Typically, chunks are replicated and kept on different racks for fault tolerance
What are the advantages of a DFS?
- Allows for data scalability
- Provides fault tolerance
- High concurrency
What is the cluster architecture for a DFS?
There are nodes made up of memory, a CPU, and a disk
Nodes are organized into racks
Several racks are linked by a switch to provide fault tolerance between racks
Several switches are linked by a backbone switch to provide fault tolerance between switches
What are the speeds of switches in a cluster architecture?
Rack switch has 1 Gbps between any pair of nodes in a rack
Backbone switch has 2-10 Gbps between racks
What is a commodity cluster?
Low cost distributed computers that allow for cluster architecture. They are less specialized but are affordable
What are some common failures in commodity clusters?
- Node failure
- Link failure
- Rack failure
- Two-node connection failure
How can we solve the issue of network bottlenecks when using commodity clusters?
- Store files multiple times for readability
- Bring computation close to the data
What is a big data programming model?
Programmability on top of distributed file systems
What are the requirements of a big data programming model?
1, Must support big data operations: fast access, distribute computation to nodes
2. Handles fault tolerance: replicates partitions, recovers files when needed
3. Enables adding more racks
What is map reduce?
A big data programming model that applies an operation to all elements (map) and then performs a summarizing operation on the elements (reduce)
What are the challenges of big data programming models that map reduce overcomes?
- Storing data redundantly on multiple nodes
- Moving computation close to data to minimize expensive movement
- Simple programming model
Describe how the map reduce algorithm performs the word count task
- Each map node has a chunk of a file.
- Each map node generates key value pairs of the form (word, 1)
- Data is sorted to reduce nodes by sending pairs with the same key to the same node
- Values for the same keys are added together in the reducer node to get the count
What is map reduce a bad tool for?
- Frequently changing data
- Dependent tasks
- Interactive analysis
What are the components of a distributed file system?
- Chunk servers
- Master node (Name node)
- Client library (Data node)
What is the master node (name node)?
- Stores metadata about where files are stored
- May be replicated for fault tolerance
- Used by processes to find files
What is a client library (data node)?
- Talks to the master node to find chunk servers
- Connects directly to chunk servers to access the data
- Essentially the process that gets close to the data to do work
What is a chunk server?
A node that stores replicated file chunks typically 16-64MB in size. Replicas of chunks should be kept in different racks
What does the map reduce environment take care of?
- Partitioning input data
- Scheduling program execution across a set of machines
- Performing the group by key step
- Handling machine failures
- Managing required inter-machine communication
Where is data stored during the map reduce process?
- Input and output data are stored on the distributed file system
- Intermediate results are stored on the local file system of the map and reduce nodes
What are the possible states a task can be in?
- Idle tasks which get scheduled by the master as workers become available
- In-progress
- Completed
What happens when a map task completes its work?
- Sends master node location and sizes of its intermediate files, one for each reducer
- Master pushes the info to reducers
How are node failures detected?
The master node pings workers periodically
How are failures of map nodes handled?
- Map tasks that were completed or in progress are reset to idle
- The idle tasks are eventually rescheduled on other workers