10. MapReduce & Hadoop Flashcards
What is MapReduce
The MapReduce paradigm offers the means to break a large task into smaller tasks, run tasks in parallel, and consolidate the outputs of the individual tasks into the final output.
This makes it very scalable
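The classic illustration of this break-apart/run-in-parallel/consolidate idea is word counting. Below is a minimal single-process Python sketch of the paradigm (not Hadoop code): each line could be mapped on a different machine, and the reduce step consolidates the intermediate outputs.

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in text.lower().split()]

def reduce_phase(pairs):
    """Reduce: consolidate the intermediate pairs into final counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each line could be handled by a different mapper in parallel;
# the reducer consolidates all of the intermediate output.
lines = ["the cat sat", "the cat ran"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(intermediate)
```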
What is Hadoop
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop is very good at handling unstructured data, and it stores data in a distributed system. It is written in Java classes - a class is, loosely, a "function" or "program". Hadoop is cost-effective, scalable, more efficient, and offers higher throughput.
What happens in the map phase
Applies an operation to a piece of data
Provides some intermediate output
What happens in the reduce phase
Consolidates the intermediate outputs from the map steps
Provides the final output
What is a key value pair
Each step uses key/value pairs, denoted as &lt;key, value&gt;, as input and output. It is useful to think of the key/value pairs as a simple ordered pair. However, the pairs can take fairly complex forms. For example, the key could be a filename, and the value could be the entire contents of the file.
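Both forms mentioned above can be sketched as plain tuples (illustrative Python only; the filename and contents are made up for the example):

```python
# Simple ordered pair: key is a word, value is a count.
simple_pair = ("hadoop", 1)

# More complex pair: key is a filename, value is the entire file contents.
complex_pair = ("notes.txt", "MapReduce splits work across machines.\n")

key, value = complex_pair
```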
What is the HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
Breaks data into 64 MB chunks (with some remainder chunks <64 MB).
Generates 3 redundant copies of each chunk to guard against failure.
It tries to distribute the chunks across multiple computers/servers
Rack aware – knows what servers are physically next to each other
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
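The chunking and replication arithmetic above can be worked through directly. A quick sketch, assuming the 64 MB chunk size and replication factor of 3 stated in these cards:

```python
import math

CHUNK_MB = 64      # HDFS block size used in these notes
REPLICATION = 3    # default replication factor

def hdfs_blocks(file_mb):
    """Return (number of chunks, total stored copies) for a file."""
    chunks = math.ceil(file_mb / CHUNK_MB)
    return chunks, chunks * REPLICATION

# A 200 MB file: three full 64 MB chunks plus one 8 MB remainder chunk,
# each replicated 3 times across the cluster to guard against failure.
chunks, copies = hdfs_blocks(200)
```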
What are NameNodes and DataNodes
An HDFS cluster consists of a single NameNode - a master server that manages the file system namespace* and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
What happens to a file being processed through the NameNodes and DataNodes
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
What does a NameNode do
Identifies, provides the locations of, and tracks where the various data chunks are stored.
If a data chunk is damaged or inaccessible, NameNode can replicate a redundant chunk on another server.
“The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Usually, the replication factor is 3.”
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
What does a DataNode do
Manages the data chunks
Can identify corrupted or inaccessible data chunks and make reports to send to NameNode.
DataNodes manage storage attached to the nodes that they run on.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
What does YARN stand for, and what does it comprise
Yet Another Resource Negotiator (YARN)
NameNode - knows the plan
ResourceManager - allocates resources
Scheduler (Apps Manager) - handles communications
NodeManager - coordinates the resources on the DataNode
AppMaster - runs tasks and requests resources on the DataNode; deals with starting and ending the reduce part of the job
This is a more distributed way of operating
What does the application manager do
The application manager is responsible for maintaining a list of submitted applications. After an application is submitted by the client, the application manager first validates whether the application's resource requirements for its application master can be satisfied. If enough resources are available, it forwards the application to the scheduler; otherwise the application is rejected. It also makes sure that no other application has been submitted with the same application ID.
What does the applications master do
The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Scheduler (Resource Manager) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The Application Master knows the application logic and thus it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.
What is pig
Pig: Provides a high-level data-flow programming language - very close to the data. A replacement for writing MapReduce Java code
What is Hive
Hive: Provides SQL-like access - further away from the data (HiveQL is Hive's SQL-like query language)
What is Mahout
Mahout: Provides analytical tools - a library
What is HBase
HBase: Provides real-time reads and writes - entire overlaid system
What is Sqoop
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop
What is Apache Spark
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
What is Apache Flume
Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
What is NoSQL
NoSQL (Not only Structured Query Language) is a term used to describe those data stores that are applied to unstructured data.
As described earlier, HBase is such a tool that is ideal for storing key/values in column families.
In general, the power of NoSQL data stores is that as the size of the data grows, the implemented solution can scale by simply adding additional machines to the distributed system.
What are the four NoSQL database types
Document Databases - JSON
Graph Databases - nodes
Key-Value Databases - pairs
Wide Column Stores - table style
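The same record can be sketched in each of the four shapes. These are illustrative Python structures only (real stores such as MongoDB, Redis, HBase, or Neo4j have their own APIs; the field names are made up):

```python
# Document database: a self-contained JSON-style document.
document = {"id": 1, "name": "Ada", "tags": ["admin", "dev"]}

# Key-value database: an opaque value looked up by key.
key_value = {"user:1": "Ada"}

# Wide column store: rows grouped into column families, table style.
wide_column = {"user:1": {"profile": {"name": "Ada"},
                          "activity": {"logins": 42}}}

# Graph database: nodes with edges between them.
graph = {"Ada": ["Bob"], "Bob": []}
```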
What is SFUNC
SFUNC = State transition function
What is PREFUNC
PREFUNC = User-defined preliminary aggregate function
For which type of tasks is MapReduce best suited
Embarrassingly parallel
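"Embarrassingly parallel" means the pieces of work need no communication with each other, so they can simply be farmed out. A small sketch using Python's standard thread pool (not Hadoop; the line data is made up):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(line):
    """Each line is processed independently - no coordination needed."""
    return len(line.split())

lines = ["the cat sat", "on the mat", "alone"]

# Map the same function over independent pieces of data in parallel,
# then consolidate the per-piece results.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(count_words, lines))
```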
MapReduce is good for
Text analysis
Where data is streaming in
What are the stages of the MapReduce process
Input → Input Splits (64 MB) → Mapping (determining the key/value pairs) → Shuffling → Reducing → Final Output
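The stages above can be simulated end to end, with the shuffle made explicit as a sort-and-group-by-key step between the mappers and the reducers. A single-process Python sketch, not Hadoop code:

```python
from itertools import groupby

def mapper(split):
    """Mapping: determine the intermediate key/value pairs."""
    return [(word, 1) for word in split.split()]

def reducer(key, values):
    """Reducing: consolidate all values seen for one key."""
    return (key, sum(values))

splits = ["big data big", "data lake"]                 # input splits

intermediate = [pair for s in splits for pair in mapper(s)]
shuffled = sorted(intermediate)                        # shuffle & sort by key
final = [reducer(k, (n for _, n in group))             # final output
         for k, group in groupby(shuffled, key=lambda kv: kv[0])]
```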
What are the two key types of programs in Hadoop
Storage
MapReduce
What are three example Java Daemons (basic set up)
NameNode - master plan (on the namenode)
DataNode - slave workers (these are the datanodes)
JobTracker - communication (on the namenode)
TaskTracker - communication (on the datanodes)
What are the three classes of Java classes (programs)
Driver - configures the job and submits it to the cluster
Mapper - logic commands for the map phase
Reducer - logic commands to be processed in the reduce phase
What does a combiner do in MapReduce
Steps in before the shuffle and sort, then combines the key value pairs earlier to hopefully speed up the later stages
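The pre-shuffle combining can be sketched as local aggregation on the map side: the mapper's pairs are merged on its own node, so fewer records travel to the reducers. An illustrative Python sketch, not Hadoop code:

```python
from collections import Counter

def combiner(pairs):
    """Local pre-aggregation on the map side, before shuffle and sort."""
    merged = Counter()
    for key, n in pairs:
        merged[key] += n
    return list(merged.items())

# One mapper's intermediate output for the line "to be or not to be":
intermediate = [("to", 1), ("be", 1), ("or", 1),
                ("not", 1), ("to", 1), ("be", 1)]
combined = combiner(intermediate)   # 6 pairs shrink to 4 before the shuffle
```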
What does a partitioner do in MapReduce
Splits out the data into streams which can be acted on in different ways
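The splitting is typically done by hashing the key, so every occurrence of the same key lands in the same stream. A toy deterministic hash is used below (Python's built-in `hash()` is salted per process); this is an illustrative sketch, not Hadoop's partitioner:

```python
def partition(key, num_reducers):
    """Route a key to one of num_reducers streams; the same key
    always goes to the same stream."""
    return sum(ord(c) for c in key) % num_reducers   # toy stable hash

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
streams = {r: [] for r in range(2)}
for key, value in pairs:
    streams[partition(key, 2)].append((key, value))
```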
What is ETL
Extract Transform Load
batch process
automated
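The three ETL steps can be sketched as a tiny automated batch pipeline using only the Python standard library (the CSV data and table name are made up for the example):

```python
import csv
import io
import sqlite3

# Extract: read raw records from a CSV source.
raw = io.StringIO("name,score\nAda,91\nBob,78\n")
rows = list(csv.DictReader(raw))

# Transform: clean the batch and convert types.
records = [(r["name"], int(r["score"])) for r in rows]

# Load: write into the target store (an in-memory SQLite table here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
db.executemany("INSERT INTO scores VALUES (?, ?)", records)
total = db.execute("SELECT SUM(score) FROM scores").fetchone()[0]
```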
Hadoop - Workflow Management & Scheduling
Oozie, Ambari, Zookeeper, Azkaban
Hadoop - Streaming / Migration
Flume, Sqoop, Storm
Hadoop - Library
Mahout
Hadoop - Resource Management
YARN
Hadoop - Data Management & Storage
HDFS, HBase, Cassandra, Voldemort
Hadoop - Data Flow / Data Access
Pig, HIVE