Week 6 - Data Management in MapReduce Systems Flashcards
What is Map Reduce (3 items)
1) Distributed computing
2) Map
3) Reduce
(Both Map and Reduce operate on key/value pairs.)
Map Reduce - Map
Map takes a set of data and converts it into another set of data, in which individual elements are broken down into key/value pairs (tuples)
Map Reduce - Reduce
Reduce takes the output from a map as its input and combines those data tuples into a smaller set of tuples
What is Hadoop
1) Open-source framework
2) Processes big data
3) Runs on clusters of computers
4) Uses simple programming models
What are the two main phases of Map Reduce
1) Map Phase
2) Reduce Phase
What does the master node / master process do?
1) Keeps track of the whole cluster
2) Addresses the local machines
3) Decides which machines run what for each phase
Where does each worker store its results?
On its local disk
What is the hidden phase between the Map phase and Reduce phase?
Data shuffle phase or data transfer phase (a lot of data is transferred)
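A minimal sketch of what the shuffle does, assuming mapper output is a list of (key, value) pairs and that keys are assigned to reducers by a simple hash. The function name `shuffle` and the byte-sum partitioner are hypothetical, chosen only so the example is deterministic; they are not Hadoop's actual partitioner.

```python
from collections import defaultdict

def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs by key and send each key to exactly one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for mapper_output in map_outputs:            # one list of pairs per mapper
        for key, value in mapper_output:
            # Hypothetical partitioner: sum of the key's bytes, modulo reducer count.
            idx = sum(key.encode()) % num_reducers
            partitions[idx][key].append(value)
    return [dict(p) for p in partitions]

mapper_a = [("cat", 1), ("dog", 1)]
mapper_b = [("cat", 1), ("emu", 1)]
parts = shuffle([mapper_a, mapper_b], num_reducers=2)
```

Every value for a given key ends up at the same reducer, which is why this phase moves so much data between machines.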
Can the Reduce phase happen while the Map phase is still going on?
No, every mapper has to finish before the reduce function can run
Hadoop is what
An open-source implementation of the MapReduce paradigm
Hadoop file system is called
HDFS
Hadoop Distributed File System
What does HDFS provide
1) Single name space for entire cluster
2) Data replicated 3x for fault tolerance
Two items for MapReduce Framework
1) Executes user jobs as “map” and “reduce” functions
2) Manages work distribution and fault tolerance
HDFS is made up of these two elements
1) NameNode
2) DataNode
HDFS: what size are the blocks?
128 MB (the default)
HDFS: are blocks replicated?
Yes, over several DataNodes
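A worked sketch of the block arithmetic, assuming the 128 MB block size and 3x replication from the cards above. The round-robin placement and the DataNode names are assumptions made for illustration; real HDFS placement also takes racks into account.

```python
BLOCK_SIZE_MB = 128   # block size from the card above
REPLICATION = 3       # replication factor from the "3x" card

def split_into_blocks(file_size_mb):
    """Return the size in MB of each block a file is cut into."""
    full, rest = divmod(file_size_mb, BLOCK_SIZE_MB)
    return [BLOCK_SIZE_MB] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes):
    """Simplified round-robin placement; needs at least REPLICATION DataNodes."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(300)   # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

A 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each block is stored on three different DataNodes.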
HDFS: optimized for large or small files?
Large files with sequential reads
HDFS: are files read/write?
No, append-only
HDFS DataNodes are what
Each DataNode is a machine that stores data blocks
HDFS NameNode
Stores the metadata about machines and block locations
Centralized NameNode contains
1) Filename
2) number of Replicas
3) Block-ids
4) More
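The NameNode metadata listed above can be pictured as a small lookup table. This is a sketch only: the class names `FileEntry` and `NameNodeSketch`, the path, and the block ids are hypothetical and do not mirror Hadoop's internal classes.

```python
from dataclasses import dataclass, field

@dataclass
class FileEntry:
    """Per-file metadata: the fields listed on the card above."""
    filename: str
    num_replicas: int
    block_ids: list = field(default_factory=list)

class NameNodeSketch:
    """Hypothetical in-memory stand-in for the centralized NameNode."""
    def __init__(self):
        self.files = {}             # filename -> FileEntry
        self.block_locations = {}   # block id -> list of DataNode names

    def add_file(self, filename, blocks, num_replicas=3):
        # blocks maps each block id to the DataNodes holding a replica.
        self.files[filename] = FileEntry(filename, num_replicas, list(blocks))
        self.block_locations.update(blocks)

    def locate(self, filename):
        """Return (block id, DataNodes) pairs in file order."""
        entry = self.files[filename]
        return [(b, self.block_locations[b]) for b in entry.block_ids]

nn = NameNodeSketch()
nn.add_file("/logs/day1.txt", {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
})
```

A client asks the NameNode where the blocks are, then reads the data itself from the DataNodes.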
What do you need to write a program in MapReduce
1) Data type
2) Map Function
3) Reduce Function
MapReduce - Data type
Key-value Records
MapReduce - Map Function
(Key1, VALUE1) ->
list(Key2, VALUE2)
MapReduce - Reduce Function
(Key2, list(VALUE2)) ->
list(Key3, VALUE3)
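The two signatures above can be illustrated with word count, run locally in plain Python. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical; `run_job` only imitates what the framework would do across many machines.

```python
from collections import defaultdict

def map_fn(key, value):
    """(Key1, VALUE1) -> list(Key2, VALUE2): a line of text becomes (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """(Key2, list(VALUE2)) -> list(Key3, VALUE3): sum the counts for one word."""
    return [(key, sum(values))]

def run_job(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):      # keys reach the reducers in sorted order
        output.extend(reduce_fn(k2, grouped[k2]))
    return output

lines = [(0, "the cat sat"), (12, "the dog sat")]   # (byte offset, line text)
result = run_job(lines, map_fn, reduce_fn)
```

Here Key1 is a line offset, Key2 and Key3 are both the word, and VALUE3 is that word's total count.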
MapReduce - Can Reduce run in parallel and independently?
Yes
But reducers share no data with each other; map functions work the same way
Should Mappers be placed on the same node or Rack as their input block?
Yes, it minimizes network use
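A sketch of locality-aware placement, assuming the scheduler knows which nodes hold a replica of the input block and which rack each node is in. The function `pick_node` and the node and rack names are hypothetical, not a Hadoop API.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Prefer a node holding the block, then a node in the same rack, then any free node."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    return free_nodes[0], "off-rack"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
choice = pick_node({"n1"}, ["n2", "n3"], rack_of)
```

With the block on n1 and only n2 and n3 free, the mapper goes to n2, because it shares a rack with the data.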
Where do the mappers save their output
Local disk (mainly for fault tolerance and recovery)
Advantage of storing Mapper output on local disks.
1) allows having more reducers than nodes
2) Allows recovery if a reducer crashes
In Hadoop, if a task crashes (map or reduce)
1) Retry on another node
2) OK for a map because it has no dependencies
3) OK for a reduce because map outputs are on disk
What if a machine (node) crashes
1) Re-launch its current tasks on other nodes (machines)
2) Re-launch any maps the node ran previously (necessary because their output files were lost at the same time)
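The two rules above can be written as a small decision function. This is a sketch under assumptions: the task records are hypothetical dictionaries, and finished reduce tasks are assumed to be safe because their output is not kept on the failed node's local disk.

```python
def tasks_to_relaunch(tasks, failed_node):
    """Return ids of tasks that must run again after failed_node crashes."""
    relaunch = []
    for t in tasks:
        if t["node"] != failed_node:
            continue
        if t["state"] == "running":
            relaunch.append(t["id"])     # rule 1: current tasks, map or reduce
        elif t["kind"] == "map" and t["state"] == "done":
            relaunch.append(t["id"])     # rule 2: map output on local disk is lost
    return relaunch

tasks = [
    {"id": "m1", "kind": "map", "node": "n1", "state": "done"},
    {"id": "m2", "kind": "map", "node": "n2", "state": "done"},
    {"id": "r1", "kind": "reduce", "node": "n1", "state": "running"},
    {"id": "r2", "kind": "reduce", "node": "n1", "state": "done"},
]
lost = tasks_to_relaunch(tasks, "n1")
```

Only m1 and r1 are rerun: m2 ran on a healthy node, and r2 already finished.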
Do you always need the reduce phase?
No, sometimes the task is simple enough for a map-only job, which still takes advantage of parallelism (reading files in parallel)
Can you use projection in Hadoop?
Yes, you can select only a few columns if that is what you want (the reduce phase might not be necessary)
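Projection as a map-only job might look like the sketch below. It assumes each record is a dictionary of column values; `project_map` and `run_map_only` are hypothetical names, and there is no reduce step at all.

```python
def project_map(key, row, columns):
    """Map function that keeps only the requested columns of one record."""
    return [(key, {c: row[c] for c in columns})]

def run_map_only(records, columns):
    """Apply the map function to every record; the output is final."""
    output = []
    for key, row in records:
        output.extend(project_map(key, row, columns))
    return output

rows = [
    (0, {"name": "Ann", "age": 31, "city": "Oslo"}),
    (1, {"name": "Bob", "age": 45, "city": "Rome"}),
]
projected = run_map_only(rows, ["name", "city"])
```

Each record is handled on its own, so the work can be spread over as many mappers as there are input blocks.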
Two of the best operators in MapReduce
1) Sorting
2) Group by
How to calculate statistics in Hadoop
1) reduce phase
2) Group by
3) count
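A sketch of a group-by statistic: the map step emits the grouping column as the key, and the reduce step counts and averages each group. The column names `dept` and `salary` and all function names are made up for the example.

```python
from collections import defaultdict

def stats_map(row):
    """Emit (group key, value) so the shuffle performs the group by."""
    return [(row["dept"], row["salary"])]

def stats_reduce(key, values):
    """Compute the count and the mean for one group."""
    return key, {"count": len(values), "mean": sum(values) / len(values)}

def group_by_stats(rows):
    """Single-process stand-in for map, shuffle, and reduce."""
    grouped = defaultdict(list)
    for row in rows:
        for k, v in stats_map(row):
            grouped[k].append(v)
    return dict(stats_reduce(k, vs) for k, vs in grouped.items())

rows = [
    {"dept": "ops", "salary": 50},
    {"dept": "ops", "salary": 70},
    {"dept": "dev", "salary": 90},
]
stats = group_by_stats(rows)
```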
What is an equi-join
An equijoin returns only the rows that have equivalent values for the specified columns.
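One common way to run an equi-join in MapReduce is a reduce-side join: rows from both tables are emitted under the join key, and rows that share a key are paired in the reduce step. The sketch below imitates that in one process; the table contents and the name `equi_join` are hypothetical.

```python
from collections import defaultdict

def equi_join(left, right, left_key, right_key):
    """Return merged rows whose join columns hold equal values."""
    # "Map" step: file each row under its join key, tagged by source table.
    grouped = defaultdict(lambda: ([], []))
    for row in left:
        grouped[row[left_key]][0].append(row)
    for row in right:
        grouped[row[right_key]][1].append(row)
    # "Reduce" step: per key, pair every left row with every right row.
    joined = []
    for key in sorted(grouped):
        left_rows, right_rows = grouped[key]
        for l in left_rows:
            for r in right_rows:
                joined.append({**l, **r})
    return joined

users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [
    {"user_id": 1, "item": "pen"},
    {"user_id": 1, "item": "ink"},
    {"user_id": 3, "item": "cup"},
]
matches = equi_join(users, orders, "id", "user_id")
```

Bob has no orders and the order for user 3 has no user, so neither appears in the result.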