10. MapReduce & Hadoop Flashcards
What is MapReduce
The MapReduce paradigm offers the means to break a large task into smaller tasks, run tasks in parallel, and consolidate the outputs of the individual tasks into the final output.
This makes it very scalable
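The classic illustration of this break-apart/run-in-parallel/consolidate idea is word counting. Below is a minimal single-process Python sketch of the paradigm (not Hadoop code): each line could be mapped on a different machine, and the reduce step consolidates the intermediate outputs.

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in text.lower().split()]

def reduce_phase(pairs):
    """Reduce: consolidate the intermediate pairs into final counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each line could be handled by a different mapper in parallel;
# the reducer consolidates all of the intermediate output.
lines = ["the cat sat", "the cat ran"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(intermediate)
```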
What is Hadoop
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop is very good at handling unstructured data, and it stores data in a distributed system. It is written in Java classes - a class is, loosely, a "function" or "program". Hadoop is cost-effective, scalable, more efficient, and offers higher throughput.
What happens in the map phase
Applies an operation to a piece of data
Provides some intermediate output
What happens in the reduce phase
Consolidates the intermediate outputs from the map steps
Provides the final output
What is a key value pair
Each step uses key/value pairs, denoted as &lt;key, value&gt;, as input and output. It is useful to think of the key/value pairs as a simple ordered pair. However, the pairs can take fairly complex forms. For example, the key could be a filename, and the value could be the entire contents of the file.
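Both forms mentioned above can be sketched as plain tuples (illustrative Python only; the filename and contents are made up for the example):

```python
# Simple ordered pair: key is a word, value is a count.
simple_pair = ("hadoop", 1)

# More complex pair: key is a filename, value is the entire file contents.
complex_pair = ("notes.txt", "MapReduce splits work across machines.\n")

key, value = complex_pair
```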
What is the HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
Breaks data into 64 MB chunks (with some remainder chunks <64 MB).
Generates 3 redundant copies of each chunk to guard against failure.
It tries to distribute the chunks across multiple computers/servers
Rack aware – knows what servers are physically next to each other
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
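The chunking and replication arithmetic above can be worked through directly. A quick sketch, assuming the 64 MB chunk size and replication factor of 3 stated in these cards:

```python
import math

CHUNK_MB = 64      # HDFS block size used in these notes
REPLICATION = 3    # default replication factor

def hdfs_blocks(file_mb):
    """Return (number of chunks, total stored copies) for a file."""
    chunks = math.ceil(file_mb / CHUNK_MB)
    return chunks, chunks * REPLICATION

# A 200 MB file: three full 64 MB chunks plus one 8 MB remainder chunk,
# each replicated 3 times across the cluster to guard against failure.
chunks, copies = hdfs_blocks(200)
```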
What are NameNodes and DataNodes
An HDFS cluster consists of a single NameNode - a master server that manages the file system namespace* and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
What happens to a file being processed through the NameNodes and DataNodes
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
What does a NameNode do
Identifies, provides the locations of, and tracks where the various data chunks are stored.
If a data chunk is damaged or inaccessible, NameNode can replicate a redundant chunk on another server.
“The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Usually, the replication factor is 3.”
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
What does a DataNode do
Manages the data chunks
Can identify corrupted or inaccessible data chunks and make reports to send to NameNode.
DataNodes manage storage attached to the nodes that they run on.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
What does YARN stand for, and what does it comprise
Yet Another Resource Negotiator (YARN)
NameNode - knows the plan
ResourceManager - allocates resources
Scheduler (Apps Manager) - handles communications
NodeManager - coordinates the resources on the DataNode
AppMaster - runs tasks and requests resources on the DataNode; deals with starting and ending the reduce part of the job
This is a more distributed way of operating
What does the application manager do
The application manager is responsible for maintaining a list of submitted applications. After an application is submitted by the client, the application manager first validates whether the application's resource requirements for its application master can be satisfied. If enough resources are available, it forwards the application to the scheduler; otherwise the application is rejected. It also makes sure that no other application has been submitted with the same application ID.
What does the applications master do
The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Scheduler (Resource Manager) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The Application Master knows the application logic and thus it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.
What is pig
Pig: Provides a high-level data-flow programming language - very close to the data. A replacement for writing MapReduce Java code
What is Hive
Hive: Provides SQL-like access - further away from the data (HiveQL is Hive's SQL-like query language)
What is Mahout
Mahout: Provides analytical tools - a library
What is HBase
HBase: Provides real-time reads and writes - entire overlaid system
What is Sqoop
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop
What is Apache Spark
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
What is Apache Flume
Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
What is NoSQL
NoSQL (Not only Structured Query Language) is a term used to describe those data stores that are applied to unstructured data.
As described earlier, HBase is such a tool that is ideal for storing key/values in column families.
In general, the power of NoSQL data stores is that as the size of the data grows, the implemented solution can scale by simply adding additional machines to the distributed system.
What are the four NoSQL database types
Document Databases - JSON
Graph Databases - nodes
Key-Value Databases - pairs
Wide Column Stores - table style
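The same record can be sketched in each of the four shapes. These are illustrative Python structures only (real stores such as MongoDB, Redis, HBase, or Neo4j have their own APIs; the field names are made up):

```python
# Document database: a self-contained JSON-style document.
document = {"id": 1, "name": "Ada", "tags": ["admin", "dev"]}

# Key-value database: an opaque value looked up by key.
key_value = {"user:1": "Ada"}

# Wide column store: rows grouped into column families, table style.
wide_column = {"user:1": {"profile": {"name": "Ada"},
                          "activity": {"logins": 42}}}

# Graph database: nodes with edges between them.
graph = {"Ada": ["Bob"], "Bob": []}
```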
What is SFUNC
SFUNC = State transition function
What is PREFUNC
PREFUNC = User-defined preliminary aggregate function
For which type of tasks is MapReduce best suited
Embarrassingly parallel
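"Embarrassingly parallel" means the pieces of work need no communication with each other, so they can simply be farmed out. A small sketch using Python's standard thread pool (not Hadoop; the line data is made up):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(line):
    """Each line is processed independently - no coordination needed."""
    return len(line.split())

lines = ["the cat sat", "on the mat", "alone"]

# Map the same function over independent pieces of data in parallel,
# then consolidate the per-piece results.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(count_words, lines))
```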
MapReduce is good for
Text analysis
Where data is streaming in
What are the stages of the MapReduce process
Input → Input Splits (64 MB) → Mapping (determining the key/value pairs) → Shuffling → Reducing → Final Output
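The stages above can be simulated end to end, with the shuffle made explicit as a sort-and-group-by-key step between the mappers and the reducers. A single-process Python sketch, not Hadoop code:

```python
from itertools import groupby

def mapper(split):
    """Mapping: determine the intermediate key/value pairs."""
    return [(word, 1) for word in split.split()]

def reducer(key, values):
    """Reducing: consolidate all values seen for one key."""
    return (key, sum(values))

splits = ["big data big", "data lake"]                 # input splits

intermediate = [pair for s in splits for pair in mapper(s)]
shuffled = sorted(intermediate)                        # shuffle & sort by key
final = [reducer(k, (n for _, n in group))             # final output
         for k, group in groupby(shuffled, key=lambda kv: kv[0])]
```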
What are the two key types of programs in Hadoop
Storage
MapReduce
What are three example Java Daemons (basic set up)
NameNode - master plan (on the namenode)
DataNode - slave workers (these are the datanodes)
JobTracker - communication (on the namenode)
TaskTracker - communication (on the datanodes)
What are the three classes of Java classes (programs)
Driver - configures the job and submits it to the cluster
Mapper - logic commands for the map phase
Reducer - logic commands to be processed in the reduce phase
What does a combiner do in MapReduce
Steps in before the shuffle and sort, then combines the key value pairs earlier to hopefully speed up the later stages
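The pre-shuffle combining can be sketched as local aggregation on the map side: the mapper's pairs are merged on its own node, so fewer records travel to the reducers. An illustrative Python sketch, not Hadoop code:

```python
from collections import Counter

def combiner(pairs):
    """Local pre-aggregation on the map side, before shuffle and sort."""
    merged = Counter()
    for key, n in pairs:
        merged[key] += n
    return list(merged.items())

# One mapper's intermediate output for the line "to be or not to be":
intermediate = [("to", 1), ("be", 1), ("or", 1),
                ("not", 1), ("to", 1), ("be", 1)]
combined = combiner(intermediate)   # 6 pairs shrink to 4 before the shuffle
```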
What does a partitioner do in MapReduce
Splits out the data into streams which can be acted on in different ways
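The splitting is typically done by hashing the key, so every occurrence of the same key lands in the same stream. A toy deterministic hash is used below (Python's built-in `hash()` is salted per process); this is an illustrative sketch, not Hadoop's partitioner:

```python
def partition(key, num_reducers):
    """Route a key to one of num_reducers streams; the same key
    always goes to the same stream."""
    return sum(ord(c) for c in key) % num_reducers   # toy stable hash

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
streams = {r: [] for r in range(2)}
for key, value in pairs:
    streams[partition(key, 2)].append((key, value))
```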
What is ETL
Extract Transform Load
batch process
automated
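The three ETL steps can be sketched as a tiny automated batch pipeline using only the Python standard library (the CSV data and table name are made up for the example):

```python
import csv
import io
import sqlite3

# Extract: read raw records from a CSV source.
raw = io.StringIO("name,score\nAda,91\nBob,78\n")
rows = list(csv.DictReader(raw))

# Transform: clean the batch and convert types.
records = [(r["name"], int(r["score"])) for r in rows]

# Load: write into the target store (an in-memory SQLite table here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
db.executemany("INSERT INTO scores VALUES (?, ?)", records)
total = db.execute("SELECT SUM(score) FROM scores").fetchone()[0]
```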
Hadoop - Workflow Management & Scheduling
Oozie, Ambari, Zookeeper, Azkaban
Hadoop - Streaming / Migration
Flume, Sqoop, Storm
Hadoop - Library
Mahout
Hadoop - Resource Management
YARN
Hadoop - Data Management & Storage
HDFS, HBase, Cassandra, Voldemort
Hadoop - Data Flow / Data Access
Pig, HIVE