2. Map Reduce Flashcards
What is a distributed file system?
Long-term information storage that holds large amounts of data and allows multiple processes to access it concurrently
How are files stored in a DFS?
Files are split into chunks and the chunks are stored separately. Typically, chunks are replicated and kept on different racks for fault tolerance
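The chunking-and-replication idea above can be sketched as follows. This is a minimal illustration, not a real DFS: the chunk size, rack layout, and node names are all made up for the example (real systems use chunks of roughly 64-128 MB).

```python
CHUNK_SIZE = 4    # bytes per chunk (illustrative; real DFSs use ~64-128 MB)
REPLICATION = 3   # copies kept of each chunk

# Hypothetical cluster: rack id -> nodes in that rack
racks = {0: ["node-a", "node-b"], 1: ["node-c", "node-d"], 2: ["node-e"]}

def split_into_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split a byte string into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(chunk_index: int, n: int = REPLICATION):
    """Place each replica of a chunk on a different rack for fault tolerance."""
    rack_ids = sorted(racks)
    return [racks[rack_ids[(chunk_index + r) % len(rack_ids)]][0]
            for r in range(n)]

chunks = split_into_chunks(b"hello world!")
placement = {i: place_replicas(i) for i in range(len(chunks))}
```

Because replicas land on different racks, losing a whole rack still leaves two copies of every chunk reachable.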
What are the advantages of a DFS?
- Allows for data scalability
- Provides fault tolerance
- High concurrency
What is the cluster architecture for a DFS?
There are nodes made up of memory, a CPU, and a disk
Nodes are organized into racks
Nodes within a rack are connected by a rack switch, enabling communication within the rack
Rack switches are linked by a backbone switch, enabling communication between racks
What are the speeds of switches in a cluster architecture?
The rack switch provides ~1 Gbps bandwidth between any pair of nodes in a rack
The backbone switch provides 2-10 Gbps bandwidth between racks
What is a commodity cluster?
Low-cost, less-specialized computers used as the building blocks of a cluster architecture; they are affordable and easy to replace
What are some common failures in commodity clusters?
- Node failure
- Link failure
- Rack failure
- Two-node connection failure
How can we solve the issue of network bottlenecks when using commodity clusters?
- Store files multiple times (replication) for reliability
- Bring computation close to the data
What is a big data programming model?
Programmability on top of distributed file systems
What are the requirements of a big data programming model?
1. Must support big data operations: fast access, distributing computation to nodes
2. Handles fault tolerance: replicates partitions, recovers files when needed
3. Enables adding more racks
What is map reduce?
A big data programming model that applies an operation to all elements (map) and then performs a summarizing operation on the elements (reduce)
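The two phases can be seen in miniature with Python's built-in `map` and `functools.reduce` (a single-machine analogy, not a distributed implementation):

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map phase: apply an operation to every element
squared = list(map(lambda x: x * x, numbers))   # [1, 4, 9, 16]

# Reduce phase: summarize the mapped elements into one value
total = reduce(lambda a, b: a + b, squared)     # 30
```

In a real MapReduce system the map calls run in parallel on many nodes, and the reduce step aggregates their outputs.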
What are the challenges of big data programming models that map reduce overcomes?
- Storing data redundantly on multiple nodes
- Moving computation close to data to minimize expensive movement
- Simple programming model
Describe how the map reduce algorithm performs the word count task
- Each map node has a chunk of a file.
- Each map node generates key value pairs of the form (word, 1)
- Pairs are shuffled and sorted by key, so all pairs with the same key are sent to the same reduce node
- Values for the same keys are added together in the reducer node to get the count
What is map reduce a bad tool for?
- Frequently changing data
- Dependent tasks
- Interactive analysis
What are the components of a distributed file system?
- Chunk servers (Data nodes)
- Master node (Name node)
- Client library