Hadoop and Pig Flashcards
What specific concept is HDFS designed to support?
HDFS is designed to support high streaming read performance and follows the concept of “write once and read many times.” It doesn’t encourage frequent updates on files.
How does HDFS ensure fault tolerance in its data storage?
HDFS ensures fault tolerance by storing data in fixed-size blocks on distributed nodes called DataNodes and replicating those blocks to handle potential failures.
How does Hadoop achieve its purpose?
Hadoop scales from a single server to thousands of machines, making it practical to process very large datasets. It is designed to detect and handle failures so that the service stays available even when individual machines fail, delivering reliability on top of commodity hardware.
What is the significance of heartbeats in HDFS, and how does the NameNode use them to manage DataNodes?
Heartbeats in HDFS are signals sent by DataNodes to the NameNode. The NameNode uses heartbeats to track individual DataNodes and manage the overall health of the system.
What happens if the NameNode doesn’t receive a response from a specific DataNode in HDFS?
If the NameNode stops receiving heartbeats from a DataNode, it marks that DataNode as failed and re-replicates the blocks it held onto other DataNodes to maintain system integrity and fault tolerance.
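As a rough sketch (assuming the Hadoop client libraries are on the classpath; in a real cluster these values normally live in hdfs-site.xml), the heartbeat cadence and the NameNode's liveness check are controlled by two properties:

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // DataNodes send a heartbeat to the NameNode every 3 seconds by default.
        conf.setLong("dfs.heartbeat.interval", 3);

        // The NameNode re-checks DataNode liveness on this interval (milliseconds).
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000);

        // With the defaults, a silent DataNode is declared dead after roughly
        // 2 * recheck-interval + 10 * heartbeat interval (about 10.5 minutes).
        System.out.println("Heartbeat interval: "
                + conf.getLong("dfs.heartbeat.interval", 3) + " s");
    }
}
```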
What is the role of the NameNode in the File System Namespace?
The NameNode manages and maintains the FileSystem Namespace, keeping track of the directory tree, file metadata, and other namespace properties.
What is the default size of data blocks in HDFS, and why is it designed to be much larger than the standard file block size?
The default size of data blocks in HDFS is 128 MB, significantly larger than the standard file system block size of 512 bytes. The large block size keeps the number of blocks per file, and therefore the metadata the NameNode must track, small, which suits HDFS's streaming read pattern; fault tolerance and availability come from replicating those blocks.
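As a minimal sketch (assuming the Hadoop client libraries are on the classpath), the block size is exposed as the dfs.blocksize property; in a real cluster it is normally set in hdfs-site.xml rather than in code:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // dfs.blocksize controls the HDFS block size for newly written files.
        // 134217728 bytes = 128 MB, the default mentioned above.
        conf.setLong("dfs.blocksize", 134217728L);

        System.out.println("Configured block size: "
                + conf.getLong("dfs.blocksize", 134217728L) + " bytes");
    }
}
```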
What potential issue arises from using small blocks in HDFS, and what does this issue result in?
Using small blocks in HDFS produces a very large number of blocks, causing considerable interaction between the NameNode and DataNodes. This interaction creates overhead for the entire process.
What are the two main components of an HDFS cluster, and what are their respective roles?
An HDFS cluster includes a single NameNode (managing metadata) and multiple DataNodes (managing storage)
How does a file get stored in an HDFS cluster, and what role does the NameNode play in this process?
Files in HDFS are split into blocks, stored in DataNodes; the NameNode maps blocks to DataNodes.
How does HDFS handle data block storage, and what triggers communication between DataNodes and the NameNode?
During a write, data is first staged locally on the client; once a full block accumulates, the client contacts the NameNode to learn which DataNode should store the next block.
What is the role of the replication factor in HDFS, and when is it determined?
The replication factor, set during file creation, ensures fault tolerance by placing copies of each block on different DataNodes. If not specified, the default is three.
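A minimal sketch of setting the replication factor at file-creation time through the HDFS Java API; the path and data are hypothetical, and in practice the dfs.replication property supplies the default of three:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration.
        Path path = new Path("/user/example/data.txt");

        // Create the file with an explicit replication factor of 3
        // (the HDFS default) and a 128 MB block size.
        FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(path, (short) 2);
        fs.close();
    }
}
```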
How does Hadoop’s rack awareness enhance fault tolerance, and what’s a drawback associated with it?
Rack awareness improves fault tolerance by distributing replicas across racks, so the failure of an entire rack does not lose every copy of a block. The drawback is higher I/O cost, because blocks must be transferred between racks.
What happens when a DataNode fails health checks in HDFS, and how does the system handle block replication?
When a DataNode fails its health checks, the NameNode removes it from the write pipeline and re-replicates the affected blocks onto different DataNodes.
What model does HDFS adopt and what is its single point of failure?
HDFS adopts a master/slave model in which the master (the NameNode) is a single point of failure.
Give a solution to recovering data when a NameNode fails in HDFS
Use two NameNodes, one active and one standby. The standby node takes over as the active node if the active NameNode fails.
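A hedged sketch of the standard HDFS high-availability settings expressed through the Java Configuration API; the nameservice id, NameNode ids, and hostnames are hypothetical, and in a real deployment these properties live in hdfs-site.xml and core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeHaSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Hypothetical nameservice with two NameNodes, nn1 (active) and nn2 (standby).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Automatic failover promotes the standby when the active NameNode dies;
        // a ZooKeeper quorum coordinates the election.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum",
                "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
    }
}
```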
What is the performance trade-off in MapReduce
Increasing the number of computing nodes speeds up the computation in a distributed system. However, the improvement comes at the cost of a larger volume of intermediate data that must be exchanged among the nodes.
When not to use MapReduce
- Small datasets
- Need to process data in real time
- Processing data streams
Give a brief description of each of the Pig Components
- The Parser processes Pig Scripts, ensuring correct syntax, performing type checks, and generating a directed acyclic graph (DAG) to represent Pig Latin statements and their logical flow.
- The Optimizer refines the logical plan (DAG) by executing optimizations like projection and pushdown, enhancing the efficiency of subsequent processing.
- The Compiler translates the optimized logical plan into a series of MapReduce jobs, facilitating the execution of data processing tasks.
- The Execution Engine submits and manages the execution of MapReduce jobs on Hadoop, orchestrating the processing of data to produce the desired results.
What are some of the advantages of using Pig over MapReduce
- High-level language
- Ease of use
- Significant reduction in code length (see the embedded-Pig sketch after this list)
- No need for compilation by the user; Pig translates scripts into MapReduce jobs internally
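To make the code-length point concrete, here is a hedged sketch of a word count written as Pig Latin statements submitted through Pig's Java PigServer API; the input and output paths are hypothetical. Compare its length with the full MapReduce job shown later in these cards.

```java
import org.apache.pig.PigServer;

public class PigWordCountSketch {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; "mapreduce" would target a Hadoop cluster instead.
        PigServer pig = new PigServer("local");

        // A handful of Pig Latin statements replace a hand-written MapReduce job.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Triggers compilation of the logical plan into MapReduce jobs and runs them.
        pig.store("counts", "wordcount-output");
        pig.shutdown();
    }
}
```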
What does Pig’s infrastructure layer consist of, and how does it leverage existing large-scale parallel implementations?
Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, leveraging existing large-scale parallel implementations such as Hadoop.
What is MapReduce
A programming paradigm for processing big datasets in a distributed fashion
What are steps in MapReduce
- Prepare the input to Map by selecting a key
- Execute Map in each node and generate output based on another key
- Shuffle the output from Map to the Reduce nodes
- Execute Reduce
- Produce the final output (a word-count sketch in Java follows this list)
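These steps map directly onto Hadoop's Java API. Below is the classic word-count job as a minimal sketch: the Mapper emits (word, 1) pairs keyed by word, the framework shuffles and groups them by key, and the Reducer sums the counts to produce the final output.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle groups values by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```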
What is the salient (key) property of Pig
The structure of Pig programs is amenable to substantial parallelization, which makes them well suited to handling very large datasets in a distributed computing environment.
What is the application workflow in YARN
- Client submits an application
- The Resource Manager allocates a container to start the Application Master
- The Application Master registers itself with the Resource Manager
- The Application Master negotiates containers from the Resource Manager
- The Application Master notifies the Node Manager to launch containers
- Application code is executed in the containers
- The client contacts the Resource Manager/Application Master to monitor the application’s status
- Once processing is complete, the Application Master unregisters with the Resource Manager (a client-side submission sketch follows this list)
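A hedged sketch of the client side of this workflow using YARN's YarnClient API; the application name is hypothetical, and a real submission would also have to build the Application Master's ContainerLaunchContext (command, resources, environment) before calling submitApplication.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The client talks to the Resource Manager through YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("example-app"); // hypothetical name
        // ... the Application Master launch context (command, resources, environment)
        // would be filled in here before a real submission ...

        // Submit; the Resource Manager allocates a container for the Application Master.
        ApplicationId appId = yarnClient.submitApplication(appContext);

        // The client polls the Resource Manager for the application's status.
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println("Application state: " + report.getYarnApplicationState());

        yarnClient.stop();
    }
}
```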