2. Map Reduce Flashcards

1
Q

What is a distributed file system?

A

Long-term information storage that can hold large amounts of data and be accessed by multiple processes

2
Q

How are files stored in a DFS?

A

Files are split into chunks, and the chunks are stored separately. Chunks are typically replicated, with replicas kept on different racks for fault tolerance
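
A minimal sketch of chunking and rack-aware replica placement, assuming a fixed chunk size and a simple round-robin rack assignment (both hypothetical simplifications):

```python
# Hypothetical sketch: split data into fixed-size chunks and assign each
# chunk's replicas to distinct racks, so one rack failure loses no chunk.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, at the large end of typical chunk sizes
REPLICATION = 3                # a common replication factor

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield successive fixed-size chunks of the input."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

def place_replicas(num_chunks: int, racks: list, replication: int = REPLICATION):
    """Round-robin each chunk's replicas onto distinct racks."""
    return {
        i: [racks[(i + r) % len(racks)] for r in range(replication)]
        for i in range(num_chunks)
    }

# Example: 2 chunks, 3 replicas each, spread over 4 racks
print(place_replicas(2, ["rack-0", "rack-1", "rack-2", "rack-3"]))
# {0: ['rack-0', 'rack-1', 'rack-2'], 1: ['rack-1', 'rack-2', 'rack-3']}
```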

3
Q

What are the advantages of a DFS?

A
  1. Allows for data scalability
  2. Provides fault tolerance
  3. Supports high concurrency

4
Q

What is the cluster architecture for a DFS?

A

Nodes are made up of a CPU, memory, and a disk
Nodes are organized into racks
Racks are linked by switches, so nodes in different racks can communicate
Switches are linked by a backbone switch, connecting racks across the cluster

5
Q

What are the speeds of switches in a cluster architecture?

A

The rack switch provides roughly 1 Gbps of bandwidth between any pair of nodes in a rack
The backbone switch provides 2-10 Gbps between racks

6
Q

What is a commodity cluster?

A

A cluster built from low-cost, less specialized computers. They trade specialized hardware for affordability, which makes large cluster architectures practical

7
Q

What are some common failures in commodity clusters?

A
  1. Node failure
  2. Link failure
  3. Rack failure
  4. Failure of the connection between two nodes

8
Q

How can we solve the issue of network bottlenecks when using commodity clusters?

A
  1. Store files multiple times for reliability
  2. Bring computation close to the data

9
Q

What is a big data programming model?

A

Programmability on top of distributed file systems

10
Q

What are the requirements of a big data programming model?

A

1. Supports big data operations: fast access to data and distribution of computation to nodes
2. Handles fault tolerance: replicates data partitions and recovers files when needed
3. Enables scaling out by adding more racks

11
Q

What is map reduce?

A

A big data programming model that applies an operation to all input elements (map), groups the intermediate results by key, and then performs a summarizing operation on each group (reduce)
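
As a toy single-machine illustration of the pattern (not the distributed framework itself), using Python's built-in map and functools.reduce:

```python
# Toy illustration of the map/reduce pattern: apply an operation to every
# element (map), then summarize the results (reduce).
from functools import reduce

values = [1, 2, 3, 4]
squared = map(lambda x: x * x, values)       # map: 1, 4, 9, 16
total = reduce(lambda a, b: a + b, squared)  # reduce: 1 + 4 + 9 + 16
print(total)  # 30
```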

12
Q

What big data challenges does map reduce overcome, and how?

A
  1. Node failures, addressed by storing data redundantly on multiple nodes
  2. Expensive data movement, minimized by moving computation close to the data
  3. The difficulty of distributed programming, hidden behind a simple programming model

13
Q

Describe how the map reduce algorithm performs the word count task

A
  1. Each map node reads a chunk of the file.
  2. Each map node emits key-value pairs of the form (word, 1).
  3. The pairs are shuffled so that all pairs with the same key go to the same reduce node.
  4. Each reduce node sums the values for each key to produce the word counts (see the sketch below).
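
A single-machine sketch of these four steps, with hypothetical map_fn/shuffle/reduce_fn names standing in for what the framework runs across nodes:

```python
# Single-machine sketch of MapReduce word count; the shuffle is simulated
# with an in-memory dict instead of network transfer between nodes.
from collections import defaultdict

def map_fn(chunk: str):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    for word in chunk.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce step: sum the counts for one word."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog and the fox"]
pairs = [pair for chunk in chunks for pair in map_fn(chunk)]
counts = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 3, 'quick': 1, 'fox': 2, ...}
```

In the real framework the shuffle moves pairs over the network to the reduce nodes, but the grouping logic is the same.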
14
Q

What is map reduce a bad tool for?

A
  1. Frequently changing data
  2. Dependent tasks
  3. Interactive analysis

15
Q

What are the components of a distributed file system?

A
  1. Chunk servers (data nodes in HDFS)
  2. Master node (name node in HDFS)
  3. Client library for file access

16
Q

What is the master node (name node)?

A
  1. Stores metadata about where file chunks are stored
  2. May be replicated for fault tolerance
  3. Used by client processes to find files

17
Q

What is the client library?

A
  1. Talks to the master node to find the chunk servers that hold a file's chunks
  2. Connects directly to those chunk servers to access the data
  3. Essentially the part of a client process that gets close to the data to do work (see the sketch below)
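
A hypothetical sketch of that access path; the MasterNode and ChunkServer classes and their methods are invented for illustration and are not a real HDFS or GFS API:

```python
# Hypothetical sketch of a client library's read path: one metadata lookup
# at the master (name) node, then direct reads from the chunk servers.

class MasterNode:
    """Stores only metadata: filename -> [(chunk_server, chunk_id), ...]."""
    def __init__(self, metadata):
        self.metadata = metadata

    def locate(self, filename):
        return self.metadata[filename]

class ChunkServer:
    """Stores the actual chunk bytes, keyed by (filename, chunk_id)."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, filename, chunk_id):
        return self.chunks[(filename, chunk_id)]

def read_file(master, servers, filename):
    """Client library: ask the master where the chunks are, then fetch them directly."""
    return b"".join(
        servers[addr].read(filename, cid)
        for addr, cid in master.locate(filename)
    )

# Example: a two-chunk file spread over two chunk servers
servers = {
    "cs-0": ChunkServer({("log.txt", 0): b"hello "}),
    "cs-1": ChunkServer({("log.txt", 1): b"world"}),
}
master = MasterNode({"log.txt": [("cs-0", 0), ("cs-1", 1)]})
print(read_file(master, servers, "log.txt"))  # b'hello world'
```

The design point: the master handles only the small metadata lookup, while the bulk data flows directly between the client and the chunk servers.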
18
Q

What is a chunk server?

A

A node that stores file chunks, typically 16-64 MB each. Chunks are replicated, and the replicas should be kept on different racks

19
Q

What does the map reduce environment take care of?

A
  1. Partitioning input data
  2. Scheduling program execution across a set of machines
  3. Performing the group by key step
  4. Handling machine failures
  5. Managing required inter-machine communication

20
Q

Where is data stored during the map reduce process?

A
  1. Input and output data are stored on the distributed file system
  2. Intermediate results are stored on the local file system of the map and reduce nodes

21
Q

What are the possible states a task can be in?

A
  1. Idle (waiting to be scheduled by the master as workers become available)
  2. In progress
  3. Completed

22
Q

What happens when a map task completes its work?

A
  1. It sends the master node the locations and sizes of its intermediate files, one per reducer
  2. The master pushes this information to the reducers

23
Q

How are node failures detected?

A

The master node pings workers periodically

24
Q

How are failures of map nodes handled?

A
  1. Map tasks that were completed or in progress on the failed node are reset to idle (completed tasks too, because their output sits on the failed node's local disk)
  2. The idle tasks are eventually rescheduled on other workers

25
Q

How are failures of reduce nodes handled?

A
  1. Only in-progress tasks are reset to idle (completed reduce output is already safe on the distributed file system)
  2. The idle reduce tasks are restarted on other workers (see the sketch below)
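
A compact sketch combining this rule with the map-failure rule from the previous card, using hypothetical task records and the idle/in-progress/completed states described earlier:

```python
# Hypothetical sketch of the master's bookkeeping when a worker fails.
# Map tasks lose even completed work (output is on the worker's local disk);
# reduce tasks keep completed work (output is on the distributed file system).

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

def handle_worker_failure(tasks, failed_worker):
    """Reset affected tasks to idle so the master can reschedule them."""
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" and task["state"] in (IN_PROGRESS, COMPLETED):
            task["state"], task["worker"] = IDLE, None
        elif task["kind"] == "reduce" and task["state"] == IN_PROGRESS:
            task["state"], task["worker"] = IDLE, None

tasks = [
    {"kind": "map", "state": COMPLETED, "worker": "w1"},
    {"kind": "reduce", "state": IN_PROGRESS, "worker": "w1"},
    {"kind": "reduce", "state": COMPLETED, "worker": "w1"},
]
handle_worker_failure(tasks, "w1")
print([t["state"] for t in tasks])  # ['idle', 'idle', 'completed']
```

Completed map output lives on the failed worker's local disk, so it is lost and must be redone; completed reduce output is already on the distributed file system, so it survives.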
26
Q

How is failure of a master node handled?

A

The entire map reduce job is aborted and the client is notified

27
Q

What is the rule of thumb for how many map tasks to make?

A

Make the number of map tasks much larger than the number of nodes in the cluster. One map task per DFS chunk is common