2. Map Reduce Flashcards
What is a distributed file system?
Long-term storage that can hold very large amounts of information and allow multiple processes to access it concurrently
How are files stored in a DFS?
Files are split into chunks and the chunks are stored separately. Typically, chunks are replicated and kept on different racks for fault tolerance
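As a rough illustration of chunking and rack-aware replication (a minimal sketch with made-up helper names, not a real DFS API):

```python
import math

CHUNK_SIZE_MB = 64        # a common chunk size
REPLICAS = 3              # a typical replication factor

def plan_chunks(file_size_mb, racks, chunk_size_mb=CHUNK_SIZE_MB, replicas=REPLICAS):
    """Return one entry per chunk listing the racks that hold a replica of it."""
    assert replicas <= len(racks), "need a distinct rack for each replica"
    num_chunks = math.ceil(file_size_mb / chunk_size_mb)
    plan = []
    for chunk_id in range(num_chunks):
        # Spread replicas over different racks so a whole-rack failure
        # destroys at most one copy of any chunk.
        targets = [racks[(chunk_id + i) % len(racks)] for i in range(replicas)]
        plan.append((chunk_id, targets))
    return plan

# A 200 MB file becomes 4 chunks, each replicated on 3 distinct racks.
for chunk_id, targets in plan_chunks(200, ["rack-1", "rack-2", "rack-3", "rack-4"]):
    print(chunk_id, targets)
```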
What are the advantages of a DFS?
- Allows for data scalability
- Provides fault tolerance
- High concurrency
What is the cluster architecture for a DFS?
Nodes, each made up of memory, a CPU, and a disk, are the basic building block
Nodes are organized into racks
Nodes within a rack are linked by a rack switch
Rack switches are linked by backbone switches, providing connectivity and fault tolerance between racks
What are the speeds of switches in a cluster architecture?
Rack switch has 1 Gbps between any pair of nodes in a rack
Backbone switch has 2-10 Gbps between racks
What is a commodity cluster?
Low-cost, less-specialized distributed computers that are connected to form the cluster architecture; they trade specialized hardware for affordability
What are some common failures in commodity clusters?
- Node failure
- Link failure
- Rack failure
- Failure of the connection between two nodes
How can we solve the issue of network bottlenecks when using commodity clusters?
- Store files multiple times for reliability
- Bring computation close to the data
What is a big data programming model?
Programmability on top of distributed file systems
What are the requirements of a big data programming model?
1. Supports big data operations: fast access and distributing computation to the nodes
2. Handles fault tolerance: replicates data partitions and recovers files when needed
3. Enables scaling out by adding more racks
What is map reduce?
A big data programming model that applies an operation to every input element to produce key-value pairs (map) and then performs a summarizing operation on the values that share a key (reduce)
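A minimal in-memory sketch of the model (not a real MapReduce framework; the function names are invented for illustration): map emits key-value pairs, the framework groups values by key, and reduce summarizes each group.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user's map function to every record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def reduce_phase(pairs, reduce_fn):
    """Group values by key, then apply the user's reduce function to each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: total sales per region.
sales = [("east", 10), ("west", 5), ("east", 7)]
pairs = map_phase(sales, lambda rec: [(rec[0], rec[1])])     # map: emit (region, amount)
print(reduce_phase(pairs, lambda key, values: sum(values)))  # reduce: sum per region
# -> {'east': 17, 'west': 5}
```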
What are the challenges of big data programming models that map reduce overcomes?
- Storing data redundantly on multiple nodes
- Moving computation close to data to minimize expensive movement
- Hiding this complexity behind a simple programming model
Describe how the map reduce algorithm performs the word count task
- Each map node has a chunk of a file.
- Each map node generates key-value pairs of the form (word, 1)
- Pairs are shuffled and sorted so that all pairs with the same key are sent to the same reduce node
- Each reduce node adds up the values for each of its keys to get the word counts
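The steps above can be sketched in Python. In a real job the map and reduce steps run as separate tasks on different nodes; here a single sorted() call stands in for the shuffle/sort between them.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map step: emit a ("word", 1) pair for every word in the node's chunk.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce step: pairs arrive grouped by key, so equal words are adjacent;
    # summing the 1s for each run of identical words gives the count.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    shuffled = sorted(mapper(sys.stdin))   # stands in for the shuffle/sort
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")
```

For example, `echo "to be or not to be" | python wordcount.py` prints counts of 2 for "to" and "be" and 1 for the rest.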
What is map reduce a bad tool for?
- Frequently changing data
- Dependent tasks
- Interactive analysis
What are the components of a distributed file system?
- Chunk servers (data nodes)
- Master node (name node)
- Client library
What is the master node (name node)?
- Stores metadata about where files are stored
- May be replicated for fault tolerance
- Used by processes to find files
What is the client library?
- Talks to the master node to find chunk servers
- Connects directly to chunk servers to access the data
- Essentially the process that gets close to the data to do work
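A hypothetical sketch of the read path described by the master node and client library cards (class and method names are invented, not a real HDFS/GFS API): the client asks the master where each chunk lives, then fetches the bytes directly from a chunk server.

```python
class ChunkServer:
    def __init__(self, chunks):
        self.chunks = chunks                 # chunk_id -> bytes held on this node

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]

class MasterNode:
    def __init__(self, metadata):
        self.metadata = metadata             # path -> [(chunk_id, [replica servers]), ...]

    def locate_chunks(self, path):
        return self.metadata[path]

class ClientLibrary:
    def __init__(self, master):
        self.master = master

    def read(self, path):
        data = b""
        # Ask the master only for metadata: which servers hold each chunk.
        for chunk_id, servers in self.master.locate_chunks(path):
            # Fetch the bytes directly from a chunk server, so bulk data
            # never flows through the master.
            data += servers[0].read_chunk(chunk_id)
        return data

# One file split into two chunks, each replicated on both servers.
s1 = ChunkServer({0: b"hello ", 1: b"world"})
s2 = ChunkServer({0: b"hello ", 1: b"world"})
master = MasterNode({"/logs/a.txt": [(0, [s1, s2]), (1, [s2, s1])]})
print(ClientLibrary(master).read("/logs/a.txt"))   # b'hello world'
```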
What is a chunk server?
A node (data node) that stores replicated file chunks, typically 16-64 MB in size. Replicas of a chunk are kept on different racks for fault tolerance
What does the map reduce environment take care of?
- Partitioning input data
- Scheduling program execution across a set of machines
- Performing the group by key step
- Handling machine failures
- Managing required inter-machine communication
Where is data stored during the map reduce process?
- Input and output data are stored on the distributed file system
- Intermediate results are stored on the local file system of the map and reduce nodes
What are the possible states a task can be in?
- Idle (waiting to be scheduled by the master as workers become available)
- In-progress
- Completed
What happens when a map task completes its work?
- Sends the master node the locations and sizes of its intermediate files, one per reducer
- The master pushes this information to the reducers
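A sketch of how a map task commonly splits its output into one intermediate bucket per reducer; the hash partitioning shown here is the usual convention and is assumed rather than stated on the card.

```python
from collections import defaultdict

NUM_REDUCERS = 4  # R, chosen by the job configuration

def partition(key, num_reducers=NUM_REDUCERS):
    # Route every occurrence of the same key to the same reducer.
    # A real framework would use a deterministic hash across machines.
    return hash(key) % num_reducers

def write_intermediate(pairs):
    """Bucket map output into R in-memory 'files', one per reducer."""
    files = defaultdict(list)
    for key, value in pairs:
        files[partition(key)].append((key, value))
    # The map task would now report each bucket's location and size to the
    # master, which forwards that information to the corresponding reducer.
    return files

print(write_intermediate([("apple", 1), ("pear", 1), ("apple", 1)]))
```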
How are node failures detected?
The master node pings workers periodically
How are failures of map nodes handled?
- Map tasks that were completed or in progress are reset to idle, since completed map output is stored on the failed node's local disk
- The idle tasks are eventually rescheduled on other workers
How are failures of reduce nodes handled?
- Only in-progress tasks are reset to idle, since completed reduce output is already stored on the distributed file system
- Idle reduce tasks are restarted on other workers
How is failure of a master node handled?
The entire map reduce job is aborted and the client is notified
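The reset rules from the three failure cards above, summarized as a sketch (the task records and field names are invented for illustration):

```python
def handle_worker_failure(tasks, failed_worker):
    """Apply the reset rules when a pinged worker stops responding."""
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" and task["state"] in ("in-progress", "completed"):
            # Completed map output lives on the failed worker's local disk, so redo it too.
            task["state"], task["worker"] = "idle", None
        elif task["kind"] == "reduce" and task["state"] == "in-progress":
            # Completed reduce output is already on the DFS, so only in-progress work is redone.
            task["state"], task["worker"] = "idle", None
    return tasks

tasks = [
    {"kind": "map", "state": "completed", "worker": "w1"},
    {"kind": "reduce", "state": "in-progress", "worker": "w1"},
    {"kind": "reduce", "state": "completed", "worker": "w1"},
]
print(handle_worker_failure(tasks, "w1"))
```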
What is the rule of thumb for how many map tasks to make?
Make it much larger than the number of nodes in the cluster. One DFS chunk per map is common
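A worked example of the rule of thumb with illustrative numbers (not taken from the cards):

```python
import math

input_size_gb = 640            # illustrative input size
chunk_size_mb = 64             # one map task per 64 MB DFS chunk
num_nodes = 100                # illustrative cluster size

num_map_tasks = math.ceil(input_size_gb * 1024 / chunk_size_mb)
print(num_map_tasks)                 # 10240 map tasks
print(num_map_tasks / num_nodes)     # ~102 tasks per node, far more than one each
```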