10. MapReduce & Hadoop Flashcards
What is MapReduce
The MapReduce paradigm offers the means to break a large task into smaller tasks, run tasks in parallel, and consolidate the outputs of the individual tasks into the final output.
This makes it highly scalable.
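The paradigm can be sketched in plain Python as a toy word count — this is an illustration of the idea, not Hadoop's actual API, and the input text is made up:

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit an intermediate (word, 1) pair for each word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Consolidate intermediate pairs into the final (word, count) output.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The large task (one big text) is broken into smaller chunks; each
# chunk could be mapped on a different machine in parallel.
chunks = ["the cat sat", "the dog sat"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(intermediate)
print(result)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

Because each chunk is mapped independently, adding machines lets you process more chunks at once — which is where the scalability comes from.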
What is Hadoop
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
Hadoop is very good at handling unstructured data, and it stores data in a distributed system.
Hadoop is written in Java classes - a class is roughly a "function" or "program".
It is cost-effective, scalable, more efficient, and offers higher throughput.
What happens in the map phase
Applies an operation to a piece of data
Provides some intermediate output
What happens in the reduce phase
Consolidates the intermediate outputs from the map steps
Provides the final output
What is a key value pair
Each step uses key/value pairs, denoted as <key, value>, as input and output. It is useful to think of the key/value pairs as a simple ordered pair. However, the pairs can take fairly complex forms. For example, the key could be a filename, and the value could be the entire contents of the file.
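As a small sketch (the filename and contents are invented), the input pair's key can be a filename and its value the whole file's contents, while the output pairs are simple (word, 1) tuples:

```python
# Input key/value pair: key = filename, value = entire file contents.
input_pair = ("notes.txt", "hadoop stores data hadoop scales")

def map_fn(key, value):
    # The mapper ignores the filename key here and emits one
    # intermediate (word, 1) pair per word in the value.
    return [(word, 1) for word in value.split()]

pairs = map_fn(*input_pair)
print(pairs[:2])  # [('hadoop', 1), ('stores', 1)]
```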
What is the HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
Breaks data into 64 MB chunks (with some remainder chunks < 64 MB).
Generates 3 redundant copies of each chunk to guard against failure.
It tries to distribute the chunks across multiple computers/servers.
Rack-aware - knows which servers are physically next to each other.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
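The chunking and replication arithmetic above can be sketched as follows (the 200 MB file size is a made-up example):

```python
CHUNK_MB = 64       # HDFS chunk (block) size used in this sketch
REPLICATION = 3     # redundant copies of each chunk

file_mb = 200       # hypothetical file size
full_chunks, remainder = divmod(file_mb, CHUNK_MB)
n_chunks = full_chunks + (1 if remainder else 0)

print(n_chunks)                # 4 chunks: three 64 MB + one 8 MB remainder
print(n_chunks * REPLICATION)  # 12 chunk copies spread across the cluster
print(file_mb * REPLICATION)   # 600 MB of raw storage consumed
```

The trade-off is clear from the numbers: a replication factor of 3 triples the storage cost in exchange for fault tolerance on low-cost hardware.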
What are NameNodes and DataNodes
An HDFS cluster consists of a single NameNode - a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
What happens to a file being processed through the NameNodes and DataNodes
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
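A toy sketch of the NameNode's role as the block-to-DataNode map (node names and the round-robin placement are invented for illustration; real HDFS placement is rack-aware and this is not its actual API):

```python
import itertools

class NameNode:
    """Tracks which DataNodes hold each block (the block mapping)."""
    def __init__(self, datanodes, replication=3):
        self.replication = replication
        self.block_map = {}              # block id -> list of DataNode names
        self._next_node = itertools.cycle(datanodes)

    def add_file(self, name, n_blocks):
        # Assign each block to `replication` DataNodes, round-robin.
        for i in range(n_blocks):
            block = f"{name}#blk{i}"
            self.block_map[block] = [next(self._next_node)
                                     for _ in range(self.replication)]

    def locate(self, block):
        # Clients ask the NameNode where a block lives, then read or
        # write directly against those DataNodes.
        return self.block_map[block]

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("logs.txt", n_blocks=2)
print(nn.locate("logs.txt#blk0"))  # ['dn1', 'dn2', 'dn3']
```

Note that file data never flows through the NameNode — it only answers "where is this block?", keeping it out of the read/write data path.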
What does a NameNode do
Identifies and tracks where the various data chunks are stored, and provides their locations.
If a data chunk is damaged or inaccessible, NameNode can replicate a redundant chunk on another server.
“The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode. Usually, the replication factor is 3.”
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
What does a DataNode do
Manages the data chunks
Can identify corrupted or inaccessible data chunks and make reports to send to NameNode.
DataNodes manage storage attached to the nodes that they run on.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
What does YARN stand for, and what does it comprise
Yet Another Resource Negotiator (YARN)
NameNode - knows the plan
ResourceManager - allocates resources
Scheduler (Applications Manager) - handles communications
NodeManager - coordinates the resources on the DataNode
AppMaster - runs tasks and requests resources on the DataNode; deals with starting and ending the reduce part of the job
This is a more distributed way of operating
What does the application manager do
The Applications Manager is responsible for maintaining a list of submitted applications. After an application is submitted by the client, the Applications Manager first validates whether the resource requirements of its Application Master can be satisfied. If enough resources are available, it forwards the application to the scheduler; otherwise, the application is rejected. It also makes sure that no other application has been submitted with the same application ID.
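The admission checks described above can be sketched as follows (the application IDs and memory figures are invented, and this is a simplification of the real logic):

```python
def submit_application(app_id, required_mb, submitted, available_mb):
    """Sketch of the application manager's admission checks."""
    # Reject a duplicate application ID.
    if app_id in submitted:
        return "rejected: duplicate ID"
    # Reject if the Application Master's resource needs can't be met.
    if required_mb > available_mb:
        return "rejected: insufficient resources"
    # Otherwise record the application and forward it to the scheduler.
    submitted.add(app_id)
    return "forwarded to scheduler"

apps = set()
print(submit_application("app_001", 512, apps, available_mb=2048))
# forwarded to scheduler
print(submit_application("app_001", 512, apps, available_mb=2048))
# rejected: duplicate ID
print(submit_application("app_002", 4096, apps, available_mb=2048))
# rejected: insufficient resources
```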
What does the applications master do
The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Scheduler (Resource Manager) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The Application Master knows the application logic and thus it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.
What is pig
Pig: Provides a high-level data-flow programming language (Pig Latin) - very close to the data. It is a replacement for writing MapReduce jobs in Java by hand.
What is Hive
Hive: Provides SQL-like access - further away from the data. HiveQL is Hive's SQL-like query language.