Lecture 5 Flashcards
1
Q
What is Map Reduce?
A
- A data processing paradigm for condensing large amounts of data into useful aggregated results, implemented as a library that handles Parallelisation, Fault Tolerance, Data Distribution and Load Balancing
2
Q
What is a programming model and library?
A
- Abstraction to express simple computations
- Library takes care of: Parallelisation, Fault Tolerance, Data Distribution and Load balancing
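A minimal sketch of the programming model (a hypothetical word-count example, not from the lecture; a real library would distribute the map, shuffle and reduce phases across machines):

```python
from collections import defaultdict

def map_fn(_, line):
    # Emit (word, 1) for every word in the input line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all counts emitted for one word
    yield word, sum(counts)

def run_mapreduce(records):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce each group of values
    return dict(kv for k, vs in sorted(groups.items())
                for kv in reduce_fn(k, vs))

result = run_mapreduce([(0, "the cat"), (1, "the dog")])
```

The user only writes `map_fn` and `reduce_fn`; everything else is the library's job.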
3
Q
What is Batch Processing?
A
- Refers to collecting large amounts of input data until a certain count or time threshold is reached, then processing the accumulated data together as a 'batch'.
4
Q
What is GFS?
A
- A file is divided into several chunks of a predefined size
- The system replicates each chunk a fixed number of times (typically three)
- This achieves fault tolerance, availability and reliability
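A toy sketch of the chunking and replication idea (hypothetical node names and round-robin placement; real GFS uses a smarter placement policy):

```python
def chunk_file(data: bytes, chunk_size: int = 64 * 2**20,
               replicas: int = 3, nodes=None):
    """Split data into fixed-size chunks and assign each chunk to `replicas` nodes."""
    nodes = nodes or ["node-a", "node-b", "node-c", "node-d"]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    placement = {}
    for idx in range(len(chunks)):
        # Round-robin placement: each chunk stored on `replicas` distinct nodes
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
    return chunks, placement
```

Losing any single node still leaves two live replicas of every chunk.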
5
Q
What are the two types of failures?
A
- Worker failures:
- The master sends heartbeat messages – if a worker does not respond within a certain time, the worker is marked as dead
- Both in-progress and completed map tasks on the dead worker are re-scheduled (completed map output lives on the failed worker's local disk)
- Workers executing reduce tasks that depend on the failed map tasks are notified of the re-scheduling
- Master failure:
- Rare, since there is only a single master
- Can be recovered using periodic checkpoints of the master's state
- Simplest solution: abort the MapReduce computation and start it again
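The worker-failure handling above can be sketched as follows (hypothetical data shapes and timeout value, just to illustrate the logic):

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value, not from the lecture

def find_dead_workers(last_heartbeat: dict, now: float) -> list:
    # A worker is dead if no heartbeat arrived within the timeout window
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

def reschedule(task_state: dict, dead_workers: list) -> None:
    # In-progress AND completed map tasks on a dead worker are reset to idle,
    # because completed map output lives on the dead worker's local disk
    for task, (worker, state) in task_state.items():
        if worker in dead_workers:
            task_state[task] = (None, "idle")
```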
6
Q
Why is Disk Locality good?
A
- Assumes local disk bandwidth exceeds network bandwidth, so reading input from local disk reduces latency
- Goal – conserve scarce network bandwidth by scheduling map tasks on (or near) machines that already hold the input data
- Use of GFS that stores typically three copies of the data block on different machines
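A minimal sketch of locality-aware scheduling under those assumptions (hypothetical function and data shapes):

```python
def pick_worker(split_replicas: list, idle_workers: set):
    """Prefer an idle worker that already stores a replica of the input split."""
    for node in split_replicas:
        if node in idle_workers:
            return node  # data-local: read from local disk, no network transfer
    # Fall back to any idle worker; the data must then be fetched over the network
    return next(iter(idle_workers), None)
```

With three replicas per block, the master usually finds at least one idle machine holding the data.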
7
Q
Why is Task Granularity good?
A
- Having many more map tasks (M) than worker nodes gives:
- Better load balancing
- Better (faster) recovery, since a failed worker's many small tasks can be spread across the others
- But this increases load on the master:
- More scheduling decisions
- More state to be saved
- M is chosen with respect to the block size of the underlying file system, to preserve locality properties
- R is usually specified by users
- Each reduce task produces one output file
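Choosing M from the block size can be sketched in one line (64 MB block size is illustrative):

```python
def choose_m(input_bytes: int, block_bytes: int = 64 * 2**20) -> int:
    # One map task per input block, so each task can read one whole block locally
    return max(1, -(-input_bytes // block_bytes))  # ceiling division

# e.g. 1 TB of input with 64 MB blocks gives 16384 map tasks
```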
8
Q
What are Stragglers?
A
Slow workers delay the overall completion time; causes include:
- Bad disks with soft errors
- Other tasks competing for resources
- Machine configuration problems
Close to the end of a MapReduce operation, the master schedules backup executions of the remaining in-progress tasks
- A task is marked as complete whenever either the primary or the backup execution completes.
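The backup-task mechanism can be sketched as follows (hypothetical data shapes; a real master tracks far more state per task):

```python
def schedule_backups(tasks: dict, idle_workers: list) -> dict:
    """Near the end of the job, launch a backup copy of each in-progress task.

    `tasks` maps task id -> state ("in-progress" or "complete").
    Returns a mapping of task id -> worker chosen for the backup copy.
    """
    backups = {}
    for task, state in tasks.items():
        if state == "in-progress" and idle_workers:
            backups[task] = idle_workers.pop()
    return backups

def mark_complete(tasks: dict, task: str) -> None:
    # Whichever copy (primary or backup) finishes first wins; later
    # completions of the same task are ignored
    tasks[task] = "complete"
```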
9
Q
What are the 7 Steps to Execution flow?
A
- The MapReduce library splits the input into M pieces of typically 16-64 MB each and starts up copies of the program on a cluster of machines.
- One special copy, the master, assigns tasks to the workers: it picks idle workers and assigns each one a map task or a reduce task.
- A worker assigned a map task reads the contents of its input split, parses out key/value pairs, and passes each pair to the user's Map function; the intermediate pairs it produces are buffered in memory.
- Periodically, the buffered pairs are written to local disk, partitioned into R regions; the locations of these regions are passed back to the master, which forwards them to the reduce workers.
- A reduce worker, notified by the master of the locations, reads the buffered data from the map workers' local disks via remote procedure calls and sorts it by intermediate key (using an external sort if the intermediate data is too large for memory).
- The reduce worker iterates over the sorted intermediate data and, for each unique key, passes the key and its list of values to the user's Reduce function; the output is appended to the final output file for that partition.
- When all map and reduce tasks are complete, the master WAKES UP the user program and the MapReduce call returns to user code.
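The seven steps above can be walked through in a toy single-process sketch (everything runs in one loop here; in the real system each split and each partition is handled by a separate worker machine):

```python
from collections import defaultdict

def run(inputs, map_fn, reduce_fn, R=2):
    """Toy single-process walk-through of the execution flow."""
    # Steps 1-3: each 'map worker' reads one split and buffers intermediate pairs
    partitions = [defaultdict(list) for _ in range(R)]
    for split in inputs:  # each element plays the role of one of the M splits
        for k, v in map_fn(split):
            # Step 4: partition intermediate pairs into R regions by hash(key)
            partitions[hash(k) % R][k].append(v)
    # Steps 5-6: each 'reduce worker' sorts its region by key and applies reduce_fn
    outputs = []
    for region in partitions:
        outputs.append({k: reduce_fn(k, vs) for k, vs in sorted(region.items())})
    # Step 7: the R output files are returned to the caller
    return outputs

out = run(["a b", "b c"],
          lambda s: [(w, 1) for w in s.split()],
          lambda k, vs: sum(vs))
```

Merging the R output dictionaries gives the full word count, mirroring how users often combine the R output files of a real job.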
10
Q
What two programs were the performance measurements run on?
A
Two programs run on a large cluster, each processing 1 TB of data:
- Grep: scans 10^10 100-byte records, searching for a rare three-character pattern
- Sort: sorts 10^10 100-byte records