Week 6 - Data Management in MapReduce Systems Flashcards
What is Map Reduce (3 items)
1) A programming model for distributed computing
2) Map: converts input data into key/value pairs
3) Reduce: combines those pairs into a smaller set
Map Reduce - Map
Map takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs)
Map Reduce - Reduce
Reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples
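The two cards above can be illustrated with a word count, the usual teaching example. This is a minimal single-machine sketch, not Hadoop code: the function names map_words and reduce_counts and the sample lines are assumptions made up for illustration.

```python
from collections import defaultdict

def map_words(line):
    """Map: turn one line of text into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(pairs):
    """Reduce: combine pairs that share a key into one total per key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input; in a real job each line could live on a different machine.
lines = ["big data big clusters", "data on clusters"]
mapped = [pair for line in lines for pair in map_words(line)]
result = reduce_counts(mapped)
print(result)
```

Seven mapped pairs go in and four tuples come out, which is the "smaller set of tuples" the Reduce card refers to.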
What is Hadoop
1) open-source framework
2) process big data
3) Runs on clusters of computers
4) simple programming models.
What are the two main phases of Map Reduce
1) Map Phase
2) Reduce Phase
What does the master node / master process do?
1) Keeps track of the state of the cluster
2) Addresses the local (worker) machines
3) Decides which machines run what for each phase
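As a rough illustration of item 3, the sketch below hands out tasks to workers for each phase. The round-robin policy, the function name assign_tasks, and the worker and task names are all assumptions for illustration; a real master also weighs things like where the data is stored and which machines have failed.

```python
from itertools import cycle

def assign_tasks(tasks, workers):
    """Assign tasks to workers in round-robin order (hypothetical policy)."""
    assignment = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        assignment[worker].append(task)
    return assignment

# Hypothetical cluster with two workers.
workers = ["worker-1", "worker-2"]

# The master decides which machine runs what, once per phase.
map_plan = assign_tasks(["split-0", "split-1", "split-2"], workers)
reduce_plan = assign_tasks(["partition-0", "partition-1"], workers)
print(map_plan)
print(reduce_plan)
```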
Where does each worker store its results?
On its local disk
What is the hidden phase between the Map phase and Reduce phase?
Data shuffle phase or data transfer phase (lot of data transfer)
Can the Reduce phase happen while the Map phase is still going on?
No, every Mapper has to finish first
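The shuffle phase can be sketched as grouping every mapper's output by key and routing each key to exactly one reducer. This is a simplified in-memory sketch: the names partition and shuffle, the CRC32-based hash, and the sample data are assumptions, and in a real cluster this step moves data between machines over the network.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key; a deterministic hash keeps each key on one reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Group values by key and place each key in one reducer's bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for worker_output in map_outputs:
        for key, value in worker_output:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets

# Hypothetical output of two finished mappers.
map_outputs = [
    [("big", 1), ("data", 1), ("big", 1)],
    [("data", 1), ("clusters", 1)],
]
buckets = shuffle(map_outputs, num_reducers=2)

# Each reducer needs all values for its keys, so it runs only after shuffling.
reduced = {key: sum(values) for bucket in buckets for key, values in bucket.items()}
print(reduced)
```

Because a reducer needs every value for a key, and the last mapper to finish might still emit that key, the reduce function cannot produce a final answer until all mappers are done.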
Hadoop is what
An open-source implementation of the MapReduce paradigm
Hadoop file system is called
HDFS
Hadoop Distributed File System
What does HDFS provide
1) Single name space for entire cluster
2) Data replicated 3x for fault tolerance
Two items for MapReduce Framework
1) Executes user jobs as “map” and “reduce” functions
2) Manages work distribution and fault tolerance
HDFS is made up of these two elements
1) NameNode
2) DataNode
HDFS uses what block size?
128 MB (default)
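Combining the 128 MB block size with the 3x replication from the earlier card gives a quick way to estimate how a file is laid out. This is a back-of-the-envelope sketch: the function name hdfs_footprint is an assumption, and it assumes the last block occupies only the space its data needs.

```python
import math

BLOCK_SIZE_MB = 128   # block size from the card above
REPLICATION = 3       # replication factor from the HDFS card

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

print(hdfs_footprint(1024))  # a 1 GB file
```

For a 1 GB (1024 MB) file this gives 8 blocks, 24 block replicas spread across the DataNodes, and 3072 MB of raw disk used.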