Week 6 - Data Management in MapReduce Systems Flashcards

1
Q

What is MapReduce? (3 items)

A

1) A distributed computing model
2) Map
3) Reduce

(Data is processed as key/value pairs.)

2
Q

Map Reduce - Map

A

Map takes a set of data and converts it into another set of data, in which individual elements are broken down into key/value pairs

3
Q

Map Reduce - Reduce

A

Reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples

4
Q

What is Hadoop?

A

1) An open-source framework
2) Processes big data
3) Runs on clusters of computers
4) Uses simple programming models

5
Q

What are the two main phases of MapReduce?

A

1) Map Phase

2) Reduce Phase

6
Q

What does the master node / master process do?

A

1) Keeps track of the cluster
2) Addresses the worker machines
3) Decides which machines run what for each phase

7
Q

Where does each worker store its results?

A

On its local disk

8
Q

What is the hidden phase between the Map phase and Reduce phase?

A

The data shuffle phase, also called the data transfer phase (it moves a lot of data between machines)
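
A minimal sketch of what the shuffle does, assuming a hash-based partitioner: every (key, value) pair emitted by any mapper is routed to the reducer chosen for that key, so all values for one key end up in the same place. The names `partition` and `shuffle`, and the use of CRC32 as the hash, are illustrative assumptions and not Hadoop's actual API.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; the same key always gets the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route (key, value) pairs from every mapper into per-reducer groups."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:            # one list of pairs per mapper
        for key, value in pairs:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers

# Hypothetical output of two mappers.
map_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
]
grouped = shuffle(map_outputs, num_reducers=2)
for reducer_id, groups in enumerate(grouped):
    print(reducer_id, dict(groups))
```

Both values for key "a" come from different mappers but land on one reducer, which is why this phase involves so much data transfer.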

9
Q

Can the Reduce phase happen while the Map phase is still going on?

A

No, all the mappers have to finish before the reduce function can run

10
Q

What is Hadoop in relation to MapReduce?

A

An open-source implementation of the MapReduce paradigm

11
Q

What is the Hadoop file system called?

A

HDFS (Hadoop Distributed File System)

12
Q

What does HDFS provide?

A

1) A single namespace for the entire cluster

2) Data replicated 3x for fault tolerance

13
Q

What two things does the MapReduce framework do?

A

1) Executes user jobs as "map" and "reduce" functions

2) Manages work distribution and fault tolerance

14
Q

HDFS is made up of which two elements?

A

1) NameNode

2) DataNode

15
Q

HDFS: what size are the blocks?

A

128 MB (the default; it is configurable)

16
Q

HDFS: are blocks replicated?

A

Yes, over several DataNodes
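
A small worked calculation, assuming the 128 MB block size and 3x replication stated in these cards, of how many blocks and block replicas a file occupies. The function name `block_layout` and the example file sizes are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # block size from the cards (assumed default)
REPLICATION = 3       # replication factor from the cards

def block_layout(file_size_mb: float):
    """Return (number of blocks, total block replicas stored in the cluster)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

for size_mb in (1024, 130):
    blocks, replicas = block_layout(size_mb)
    print(f"{size_mb} MB file: {blocks} blocks, {replicas} replicas")
```

A 1024 MB file splits into 8 blocks, stored as 24 replicas spread over several DataNodes; a 130 MB file still needs 2 blocks because it does not fit in one.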

17
Q

Is HDFS optimized for large or small files?

A

Large files, read sequentially

18
Q

HDFS: can files be both read and written?

A

No, files are append-only

19
Q

What are HDFS DataNodes?

A

Each DataNode is a machine in the cluster that stores data blocks

20
Q

What does the HDFS NameNode do?

A

Stores the metadata: which machines hold which blocks, and where

21
Q

What does the centralized NameNode contain?

A

1) Filename
2) Number of replicas
3) Block IDs
4) More

22
Q

What do you need to write a program in MapReduce?

A

1) Data type
2) Map Function
3) Reduce Function

23
Q

MapReduce - Data type

A

Key-value Records

24
Q

MapReduce - Map Function

A

(Key1, VALUE1) -> list(Key2, VALUE2)

25
Q

MapReduce - Reduce Function

A

(Key2, list(VALUE2)) -> list(Key3, VALUE3)
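
The two signatures, map as (Key1, VALUE1) -> list(Key2, VALUE2) and reduce as (Key2, list(VALUE2)) -> list(Key3, VALUE3), can be sketched with a word count. This is a single-process illustration under assumed names (`map_fn`, `reduce_fn`, `run_job`), not Hadoop's API; the grouping step in `run_job` stands in for the shuffle.

```python
from collections import defaultdict

def map_fn(key, value):
    """(Key1, VALUE1) -> list(Key2, VALUE2): line number and text in, (word, 1) pairs out."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """(Key2, list(VALUE2)) -> list(Key3, VALUE3): a word and its 1s in, (word, total) out."""
    return [(key, sum(values))]

def run_job(records, map_fn, reduce_fn):
    """Run map, group values by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):            # keys reach the reducer in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output

records = [(0, "the quick fox"), (1, "the lazy dog")]
result = run_job(records, map_fn, reduce_fn)
print(result)
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```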

26
Q

MapReduce - Can Reduce run in parallel and independently?

A

Yes, but reducers cannot share data with each other. Map functions work the same way.

27
Q

Should mappers be placed on the same node or rack as their input block?

A

Yes, it minimizes network use

28
Q

Where do the mappers save their output?

A

Local disk (mainly for fault tolerance and recovery)

29
Q

What are the advantages of storing mapper output on local disks?

A

1) Allows having more reducers than nodes

2) Allows recovery if a reducer crashes

30
Q

In Hadoop, what happens if a task (map or reduce) crashes?

A

1) Retry on another node
2) OK for a map because it has no dependencies
3) OK for a reduce because map outputs are on disk

31
Q

What happens if a machine (node) crashes?

A

1) Re-launch its current tasks on other nodes (machines)

2) Re-launch any maps the node ran previously (necessary because their output files were lost at the same time)

32
Q

Do you always need the reduce phase?

A

No. Some tasks are simple enough to need only the map phase, and they still take advantage of parallelism (for example, reading files in parallel)

33
Q

Can you use projection in Hadoop?

A

Yes, you can select only a few columns if that is what you want (the reduce phase might not be necessary)
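
A sketch of projection as a map-only job, assuming records arrive as dictionaries: each mapper keeps only the wanted columns and emits the row, so no reduce phase is needed. The function `project_map`, the column names, and the sample rows are all hypothetical.

```python
def project_map(key, row, columns=("name", "city")):
    """Map-only projection: keep the selected columns, drop the rest."""
    return [(key, {col: row[col] for col in columns})]

# Hypothetical input records, keyed by row number.
rows = [
    (0, {"name": "Ada", "age": 36, "city": "London"}),
    (1, {"name": "Linus", "age": 54, "city": "Portland"}),
]

# Each row is handled independently, so mappers could run this in parallel.
projected = [out for key, row in rows for out in project_map(key, row)]
print(projected)
```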

34
Q

What are two of the operators MapReduce handles best?

A

1) Sorting

2) Group by

35
Q

How do you calculate statistics in Hadoop?

A

1) Use the reduce phase
2) Group by
3) Count
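
A sketch of a group-by statistic, assuming (department, salary) input records: the map emits the grouping column as the key, the grouping step collects values per key, and the reduce computes the count and mean for each group. All names and the sample data are illustrative.

```python
from collections import defaultdict

def map_fn(_, record):
    """Emit the group-by column as the key and the measured value as the value."""
    dept, salary = record
    return [(dept, salary)]

def reduce_fn(dept, salaries):
    """Compute per-group statistics from all values that share a key."""
    return [(dept, {"count": len(salaries), "mean": sum(salaries) / len(salaries)})]

# Hypothetical input records, keyed by row number.
records = [(0, ("eng", 100)), (1, ("eng", 120)), (2, ("ops", 90))]

groups = defaultdict(list)               # stands in for the shuffle
for key, record in records:
    for dept, salary in map_fn(key, record):
        groups[dept].append(salary)

stats = dict(out for dept in sorted(groups) for out in reduce_fn(dept, groups[dept]))
print(stats)
# {'eng': {'count': 2, 'mean': 110.0}, 'ops': {'count': 1, 'mean': 90.0}}
```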

36
Q

What is an equi-join?

A

An equijoin returns only the rows that have equivalent values for the specified columns.
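
One common way to express an equi-join in MapReduce is a reduce-side join, sketched below under assumed table and function names: each mapper tags its rows with the source table and emits the join column as the key, and the reducer pairs up rows from both tables that share that key. Keys present in only one table produce no output.

```python
from collections import defaultdict
from itertools import product

# Hypothetical tables: users(user_id, name) and orders(user_id, item).
users = [(1, "Ada"), (2, "Linus"), (3, "Grace")]
orders = [(1, "book"), (1, "pen"), (3, "laptop"), (4, "desk")]

def map_users(row):
    user_id, name = row
    return [(user_id, ("U", name))]      # tag marks the source table

def map_orders(row):
    user_id, item = row
    return [(user_id, ("O", item))]

def reduce_join(key, tagged_values):
    """Pair every users row with every orders row that has the same key."""
    left = [v for tag, v in tagged_values if tag == "U"]
    right = [v for tag, v in tagged_values if tag == "O"]
    return [(key, name, item) for name, item in product(left, right)]

groups = defaultdict(list)               # stands in for the shuffle
for row in users:
    for k, v in map_users(row):
        groups[k].append(v)
for row in orders:
    for k, v in map_orders(row):
        groups[k].append(v)

joined = [out for k in sorted(groups) for out in reduce_join(k, groups[k])]
print(joined)
# [(1, 'Ada', 'book'), (1, 'Ada', 'pen'), (3, 'Grace', 'laptop')]
```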