MapReduce Flashcards
How does MapReduce work?
1. Map: each worker applies a user-supplied map function to its input split and emits intermediate (key, value) pairs.
2. Shuffle: the framework groups all intermediate values by key and routes each key to one reducer.
3. Reduce: each worker applies a user-supplied reduce function to a key and its grouped values, writing the final output.
How does a BSP superstep work?
In each superstep, every worker (1) performs local computation on its own partition of the data, (2) sends messages to other workers, and (3) waits at a global synchronization barrier; messages sent during one superstep are delivered at the start of the next. The computation ends when all workers vote to halt.
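A minimal sketch of that cycle in plain Python (the setup is mine, not from the course): three workers compute locally, mail a value to a neighbor, and meet at a barrier whose action swaps mailbox generations, so messages become readable only in the next superstep.

```python
import threading

NUM_WORKERS = 3
NUM_SUPERSTEPS = 4

# Two mailbox generations: workers read from 'current', write into 'nxt'.
current = [[] for _ in range(NUM_WORKERS)]
nxt = [[] for _ in range(NUM_WORKERS)]

def deliver():
    # Barrier action: runs in exactly one thread while all workers are
    # parked, so mail sent this superstep becomes next superstep's input.
    global current, nxt
    current, nxt = nxt, [[] for _ in range(NUM_WORKERS)]

barrier = threading.Barrier(NUM_WORKERS, action=deliver)

def worker(w):
    value = w + 1
    for _ in range(NUM_SUPERSTEPS):
        value += sum(current[w])                  # 1. local computation
        nxt[(w + 1) % NUM_WORKERS].append(value)  # 2. message a neighbor
        barrier.wait()                            # 3. global barrier
    print(f"worker {w} finished with {value}")

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```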
What’s wrong with MapReduce for graphs?
Graph algorithms such as PageRank are iterative, but MapReduce has no notion of iteration: every pass must run as a separate job, so the entire graph state is re-read from and re-written to the distributed file system between iterations, and nothing is kept in memory across jobs. The I/O and job-startup overhead dominates, which is exactly what BSP avoids by holding the partitioned graph in memory across supersteps.
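To make that overhead concrete, here is a runnable toy (all names, paths, and data are illustrative, not a real framework) in which every PageRank iteration is its own "job" that serializes the full rank state to storage and reads it back:

```python
import json, os, tempfile

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy adjacency lists
storage = tempfile.mkdtemp()                        # stand-in for HDFS/GFS

def write_state(path, ranks):
    with open(path, "w") as f:
        json.dump(ranks, f)

def read_state(path):
    with open(path) as f:
        return json.load(f)

write_state(os.path.join(storage, "iter_0"), {v: 1 / len(graph) for v in graph})

for i in range(10):   # one whole MapReduce job per PageRank iteration
    ranks = read_state(os.path.join(storage, f"iter_{i}"))     # re-read ALL state
    contrib = {v: 0.0 for v in graph}
    for v, outlinks in graph.items():                          # "map": spread rank mass
        for u in outlinks:
            contrib[u] += ranks[v] / len(outlinks)
    ranks = {v: 0.15 / len(graph) + 0.85 * c for v, c in contrib.items()}  # "reduce"
    write_state(os.path.join(storage, f"iter_{i + 1}"), ranks) # re-write ALL state

print(read_state(os.path.join(storage, "iter_10")))
```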
Name two uses of dataflow that you learned from the course (we covered three).
a. Pig Latin, for MapReduce tasks
b. TensorFlow, for neural network architectures
Data can also be processed in multiple stages (pipelining). Name two architectures we covered that permit such multistage data analysis.
a. YARN, i.e., MapReduce v2
b. BSP
Motivation of MapReduce
- MapReduce allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system: they write only a map and a reduce function, and the runtime takes care of partitioning the input, scheduling execution across machines, inter-machine communication, and machine failures.
Example: Count word occurrences
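A minimal in-memory sketch of the classic job (function names are mine, not a framework's): the map phase emits a (word, 1) pair per word, the shuffle groups pairs by key, and the reduce phase sums each group.

```python
# Word count as map -> shuffle -> reduce, simulated in plain Python.
from collections import defaultdict

def mapper(document):
    # Emit one intermediate (key, value) pair per word.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum every value that arrived for one key.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group every intermediate value by its key.
groups = defaultdict(list)
for doc in documents:
    for word, count in mapper(doc):
        groups[word].append(count)

result = dict(reducer(w, c) for w, c in groups.items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```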
Visual diagram of MapReduce execution
What is a combiner?
Optionally, before map output is forwarded to the shufflers, a ‘combiner’ operation can be set up on each node to perform a local per-key reduction; if specified, this would be ‘step 1.5’ in the workflow above.
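Sticking with word count, a hedged sketch of that step 1.5 (names and data are illustrative): each node sums its own map output per key before anything is shuffled, so far fewer pairs cross the network.

```python
# Combiner sketch: local per-key reduction on each node before the shuffle.
from collections import Counter, defaultdict

def combine(pairs):
    # Pre-aggregate one node's map output; far fewer pairs leave the node.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local.items()

node_outputs = [
    [("the", 1), ("fox", 1), ("the", 1)],   # map output on node 1
    [("the", 1), ("dog", 1)],               # map output on node 2
]

shuffled = defaultdict(list)
for pairs in node_outputs:
    for word, count in combine(pairs):      # combiner runs before forwarding
        shuffled[word].append(count)

counts = {word: sum(vals) for word, vals in shuffled.items()}
print(counts)   # {'the': 3, 'fox': 1, 'dog': 1}
```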
What is functional programming?
In functional programming, functions are first-class objects: they can be passed into other functions as arguments, and a function can also be returned from a function as output. JavaScript and Python, among others, support this style.
Examples of functions as input
map(), filter(), and reduce()
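In Python, for instance, each of these takes a function as an argument, and a closure shows a function being returned as output (note that reduce() lives in functools in Python 3):

```python
from functools import reduce  # reduce() moved to functools in Python 3

nums = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, nums))         # function passed into map()
evens = list(filter(lambda x: x % 2 == 0, nums))   # predicate passed into filter()
total = reduce(lambda acc, x: acc + x, nums, 0)    # combining fn passed into reduce()

def make_adder(n):
    # A function returned from a function: the other half of "first-class".
    return lambda x: x + n

add3 = make_adder(3)
print(squares, evens, total, add3(4))   # [1, 4, 9, 16, 25] [2, 4] 15 7
```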
MapReduce use cases
Processing large amounts of raw data, such as crawled documents and web request logs, to compute various kinds of derived data, for example:
a. inverted indices
b. various representations of the graph structure of web documents
c. summaries of the number of pages crawled per host
d. the set of most frequent queries in a given day
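As a concrete instance from that list, an inverted index falls out of a single map/shuffle/reduce pass: map each document to (word, doc_id) pairs, then collect the set of documents per word. A minimal sketch (data and names are mine):

```python
# Inverted index via one map/shuffle/reduce pass, simulated in Python.
from collections import defaultdict

docs = {1: "mapreduce simplifies big data", 2: "gfs stores big files"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():      # map: emit (word, doc_id)
        index[word].add(doc_id)    # shuffle + reduce: group doc ids per word

print(dict(index))   # e.g. {'big': {1, 2}, 'mapreduce': {1}, ...}
```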
What is GFS?
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
Why was GFS needed?
Google had to store and scan very large files across thousands of inexpensive commodity machines, where component failures are the norm rather than the exception; existing file systems were not designed for that scale or failure model.
Why Hadoop? What was the need?
It all started in 2002 with the Apache Nutch project: Doug Cutting and Mike Cafarella were building a web search engine that would crawl and index websites. The project proved too expensive to be feasible for indexing billions of webpages, so they went looking for a cheaper solution. They needed a way to store very large files and to process that data efficiently; Google's GFS and MapReduce papers described exactly that, and Hadoop grew out of implementing those ideas for Nutch.