Week 6 - Data Management in MapReduce Systems Flashcards by Adam Cadiedux

What is Map Reduce (3 items)

1) Distributed computing
2) Map
3) Reduce

Data is processed as key/value pairs.

Map Reduce - Map

Map takes a set of data and converts it into another set of data

Map Reduce - Reduce

Reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples

What is Hadoop

1) Open-source framework
2) Processes big data
3) Runs on clusters of computers
4) Uses simple programming models

What are the two main phases of Map Reduce

1) Map Phase

2) Reduce Phase

What does the master node / master process do?

1) Keeps track of the state of the cluster
2) Addresses local machines
3) Decides which machines run what for each phase

Where does each worker store its results?

On its local disk

What is the hidden phase between the Map phase and Reduce phase?

Data shuffle phase or data transfer phase (lot of data transfer)
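
A minimal sketch of what the shuffle does, in plain Python rather than Hadoop's API: it gathers the (key, value) pairs produced by every mapper and groups the values by key, so each reducer sees all values for its keys. The function name `shuffle` and the sample mapper outputs are assumptions made for illustration.

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group (key, value) pairs from all mappers by key."""
    grouped = defaultdict(list)
    for pairs in map_outputs:            # one list of pairs per mapper
        for key, value in pairs:
            grouped[key].append(value)   # in a real cluster this is a network transfer
    return dict(grouped)

# Hypothetical outputs of two mappers
mapper_a = [("cat", 1), ("dog", 1)]
mapper_b = [("cat", 1)]

shuffled = shuffle([mapper_a, mapper_b])
```

In a real cluster each append corresponds to data moving between machines, which is why this phase involves a lot of data transfer.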

Can the Reduce phase happen while the Map phase is still going on?

No, the Mapper has to finish first

Hadoop is what

An open-source implementation of the MapReduce paradigm

Hadoop file system is called

HDFS

Hadoop Distributed File System

What does HDFS provide

1) Single name space for entire cluster

2) Data replicated 3x for fault tolerance

Two items for MapReduce Framework

1) Executes user jobs as “map” and “reduce” functions

2) Manages work distribution and fault tolerance

HDFS is made up of these two elements

1) NameNode

2) DataNode

HDFS: what size are the blocks?

128 MB
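
A small worked example of the arithmetic, assuming the 128 MB block size from this card and the 3x replication mentioned for HDFS earlier in the deck. The 1000 MB file size and the function names are made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128   # block size from the card
REPLICATION = 3       # replication factor from the HDFS card

def num_blocks(file_size_mb):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def raw_storage_mb(file_size_mb):
    """Total space used across the cluster once every block is replicated."""
    return file_size_mb * REPLICATION

blocks = num_blocks(1000)       # hypothetical 1000 MB file
storage = raw_storage_mb(1000)
```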

HDFS: are blocks replicated?

Yes, over several DataNodes

HDFS: optimized for large or small files?

Large files (large sequential reads)

HDFS: are files read and write?

No, append-only

HDFS: what are DataNodes?

Each DataNode is a machine

HDFS NameNode

Stores the metadata about machines and locations

Centralized NameNode contains

1) Filename
2) number of Replicas
3) Block-ids
4) More
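
One way to picture this metadata is as a lookup table. The sketch below is only an illustration of the fields listed on the card, not the real NameNode data structures. The class name `FileEntry`, the file path, the block ids, and the DataNode names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FileEntry:
    filename: str
    replicas: int                               # number of replicas per block
    block_ids: list = field(default_factory=list)

# Hypothetical metadata: filename -> entry, and block id -> DataNodes holding a copy
namenode = {
    "/logs/day1.txt": FileEntry("/logs/day1.txt", replicas=3, block_ids=["b1", "b2"]),
}
block_locations = {
    "b1": ["datanode1", "datanode2", "datanode3"],
    "b2": ["datanode2", "datanode3", "datanode4"],
}

def locate(filename):
    """Return (block id, DataNodes) pairs for a file, in block order."""
    entry = namenode[filename]
    return [(block_id, block_locations[block_id]) for block_id in entry.block_ids]

locations = locate("/logs/day1.txt")
```

The NameNode only answers "where are the blocks?"; the block contents themselves sit on the DataNodes.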

What do you need to write a program in MapReduce

1) Data type
2) Map Function
3) Reduce Function

MapReduce - Data type

Key-value Records

MapReduce - Map Function

(Key1, VALUE1) -> list(Key2, VALUE2)

MapReduce - Reduce Function

(Key2, list(VALUE2)) -> list(Key3, VALUE3)
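
The two signatures can be made concrete with the classic word count, sketched here in plain Python rather than the Hadoop API. The names `map_fn`, `reduce_fn`, and `run_job`, and the choice of a line number as Key1, are assumptions for illustration. `run_job` stands in for the framework on a single machine.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """(Key1, VALUE1) -> list(Key2, VALUE2): line number and text -> (word, 1) pairs."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """(Key2, list(VALUE2)) -> list(Key3, VALUE3): word and its 1s -> (word, total)."""
    yield (key, sum(values))

def run_job(records):
    """Single-machine stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    result = []
    for k2 in sorted(grouped):
        result.extend(reduce_fn(k2, grouped[k2]))
    return result

counts = run_job([(0, "a b a"), (1, "b c")])
```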

MapReduce - Can Reduce run in parallel and independently?

Yes, but without sharing data. Map functions work the same way.
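
Because each reduce call needs only its own key and values, the calls can run side by side. The sketch below uses a local thread pool as a stand-in for separate reducer machines. The sample data and the name `reduce_sum` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def reduce_sum(item):
    """Reduce one key; it reads nothing but its own key and values."""
    key, values = item
    return key, sum(values)

# Hypothetical grouped map output: key -> list of values
grouped = {"a": [1, 1], "b": [1], "c": [1, 1, 1]}

# Each key is reduced independently, so the order of execution does not matter
with ThreadPoolExecutor(max_workers=3) as pool:
    totals = dict(pool.map(reduce_sum, grouped.items()))
```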

Should mappers be placed on the same node or rack as their input block?

Yes, it minimizes network use

Where do the mappers save their output

Local disk (mainly for fault tolerance and recovery)

Advantage of storing Mapper output on local disks.

1) Allows having more reducers than nodes
2) Allows recovery if a reducer crashes

In Hadoop, what if a task crashes (map or reduce)?

1) Retry on another node
2) OK for a map because it has no dependencies
3) OK for a reduce because map outputs are on disk

What if a machine(node) crashes

1) Re-launch its current tasks on other nodes (machines)
2) Re-launch any maps the node ran previously (necessary because their output files were lost at the same time)
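
The two rules can be sketched as a small bookkeeping function. This is an illustration only, not Hadoop's scheduler. The function name, the task names, and the two dictionaries are hypothetical.

```python
def tasks_to_relaunch(failed_node, running, completed_maps):
    """Pick the tasks that must run again after a node crash.

    running:        task name -> node currently executing it
    completed_maps: map task name -> node whose local disk holds its output
    """
    # Rule 1: tasks that were in progress on the failed node
    current = [task for task, node in running.items() if node == failed_node]
    # Rule 2: finished maps whose output files were on the failed node's disk
    lost_maps = [task for task, node in completed_maps.items() if node == failed_node]
    return sorted(current + lost_maps)

# Hypothetical cluster state
running = {"map-3": "node2", "reduce-1": "node1"}
completed_maps = {"map-1": "node2", "map-2": "node1"}

relaunch = tasks_to_relaunch("node2", running, completed_maps)
```

Here `map-1` had already finished, but it has to run again because its output lived only on the crashed node's local disk.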

Do you always need the reduce phase?

No. Some simple tasks need only the map phase, which still takes advantage of parallelism (reading files in parallel)

Can you use projection in Hadoop?

Yes, you can select only a few columns if that is what you want (the reduce phase might not be necessary)
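
A projection fits entirely in the map function, since each record is handled on its own. The sketch below is plain Python, not Hadoop code. The column names and rows are made up.

```python
def project_map(record, columns):
    """Map-only projection: keep just the requested columns of one record."""
    return {column: record[column] for column in columns}

# Hypothetical input rows
rows = [
    {"id": 1, "name": "Ann", "city": "Oslo", "age": 30},
    {"id": 2, "name": "Bob", "city": "Rome", "age": 41},
]

# No grouping is needed, so there is no reduce step
projected = [project_map(row, ["name", "city"]) for row in rows]
```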

Two of the best operators in MapReduce

1) Sorting
2) Group by
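
The two operators are closely related: once the pairs are sorted by key, equal keys sit next to each other and grouping is a single pass. A minimal sketch with made-up pairs, using only the Python standard library:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical map output
pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# Sorting: order the pairs by key (Python's sort is stable, so values keep their order)
pairs.sort(key=itemgetter(0))

# Group by: equal keys are now adjacent, so one pass collects each group
groups = {key: [value for _, value in group]
          for key, group in groupby(pairs, key=itemgetter(0))}
```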

How to calculate statistics in Hadoop

1) Reduce phase
2) Group by
3) Count
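
As a sketch of how the three pieces fit together: the map emits the grouping key, the framework groups by it, and the reduce counts (and here also averages) each group. The column names `dept` and `salary` and the function names are assumptions for illustration.

```python
from collections import defaultdict

def map_fn(row):
    """Emit (group key, value) for one input row."""
    yield (row["dept"], row["salary"])

def reduce_fn(key, values):
    """Compute count and mean for one group."""
    return key, len(values), sum(values) / len(values)

def group_stats(rows):
    grouped = defaultdict(list)          # stands in for the group-by done by the shuffle
    for row in rows:
        for key, value in map_fn(row):
            grouped[key].append(value)
    return [reduce_fn(key, grouped[key]) for key in sorted(grouped)]

# Hypothetical input rows
rows = [
    {"dept": "A", "salary": 10},
    {"dept": "B", "salary": 30},
    {"dept": "A", "salary": 20},
]
stats = group_stats(rows)
```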

What is an equi-join

An equijoin returns only the rows that have equivalent values for the specified columns.
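
A minimal sketch of an equi-join done the MapReduce way (often called a reduce-side join): rows from both tables are grouped by the join column, then every left row is paired with every right row in the same group. This is plain Python under stated assumptions. The function name, the tables, and the column names are hypothetical.

```python
from collections import defaultdict

def equi_join(left, right, key):
    """Join two lists of dict rows on equal values of `key`."""
    buckets = defaultdict(lambda: ([], []))   # join value -> (left rows, right rows)
    for row in left:
        buckets[row[key]][0].append(row)      # "map": tag the row as coming from the left
    for row in right:
        buckets[row[key]][1].append(row)      # "map": tag the row as coming from the right
    joined = []
    for value in sorted(buckets):             # "reduce": one call per join value
        left_rows, right_rows = buckets[value]
        for l in left_rows:
            for r in right_rows:
                joined.append({**l, **r})
    return joined

# Hypothetical tables
users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "item": "pen"}, {"id": 3, "item": "ink"}]

result = equi_join(users, orders, "id")
```

Only id 1 appears in both tables, so Bob and the order with id 3 produce no output rows.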

Week 6 - Data Management in MapReduce Systems Flashcards

(36 cards)