Hadoop and Pig Flashcards
What specific concept is HDFS designed to support?
HDFS is designed to support high streaming read performance and follows the concept of “write once and read many times.” It doesn’t encourage frequent updates on files.
How does HDFS ensure fault tolerance in its data storage?
HDFS ensures fault tolerance by storing data in fixed-size blocks on distributed nodes called DataNodes and replicating each block across multiple nodes, so the data survives individual node failures.
How does Hadoop achieve its purpose?
Hadoop scales from a single server to thousands of machines, each contributing local computation and storage, which makes processing very large datasets practical. It is designed to detect and handle failures at the application layer, so it remains available even when individual machines fail, delivering reliability on commodity hardware.
What is the significance of heartbeats in HDFS, and how does the NameNode use them to manage DataNodes?
Heartbeats in HDFS are signals sent by DataNodes to the NameNode. The NameNode uses heartbeats to track individual DataNodes and manage the overall health of the system.
What happens if the NameNode doesn’t receive a response from a specific DataNode in HDFS?
If the NameNode stops receiving heartbeats from a DataNode within the configured timeout, it marks that DataNode as failed or faulty and schedules the blocks it held for re-replication on healthy nodes, maintaining system integrity and fault tolerance.
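The heartbeat-timeout logic in the two cards above can be sketched as a toy model. The `NameNodeMonitor` class and node names are purely illustrative; in real HDFS, DataNodes send heartbeats roughly every 3 seconds, and a node is declared dead only after about 10 minutes of silence.

```python
# Hedged sketch of heartbeat tracking: the NameNode remembers when each
# DataNode last reported in, and any node silent past the timeout is
# treated as failed. All names here are illustrative, not HDFS APIs.
HEARTBEAT_TIMEOUT = 600  # seconds of silence before a node is considered dead

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> timestamp of last heartbeat

    def receive_heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def dead_nodes(self, now):
        # Any DataNode that has not reported within the timeout is "dead".
        return [dn for dn, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

monitor = NameNodeMonitor()
monitor.receive_heartbeat("dn1", now=0)
monitor.receive_heartbeat("dn2", now=0)
monitor.receive_heartbeat("dn1", now=500)   # dn1 keeps reporting in
print(monitor.dead_nodes(now=700))          # dn2 missed its window
```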
What is the role of the NameNode in the File System Namespace?
The NameNode manages and maintains the file system namespace, keeping track of all namespace properties and metadata, such as the directory tree and the mapping of files to blocks.
What is the default size of data blocks in HDFS, and why is it designed to be much larger than the standard file block size?
The default size of data blocks in HDFS is 128 MB, far larger than the 512-byte sectors or few-kilobyte blocks of a local file system. The larger size keeps the NameNode's metadata small (fewer blocks to track per file) and favors large sequential reads over the seek overhead of many tiny blocks.
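A quick way to see the metadata effect: the number of blocks the NameNode must track grows with file size divided by block size. A minimal sketch (the helper name is illustrative):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def split_into_blocks(file_size):
    """Number of HDFS blocks a file of `file_size` bytes occupies."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

# A 500 MB file spans 4 blocks: three full 128 MB blocks plus one
# final 116 MB block. With 4 KB blocks, the same file would need
# 128,000 metadata entries instead of 4.
print(split_into_blocks(500 * 1024 * 1024))  # 4
```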
What potential issue arises from using small blocks in HDFS, and what does this issue result in?
Using small blocks in HDFS produces a very large number of blocks per file, inflating the NameNode's metadata and causing considerable interaction between the NameNode and DataNodes. This interaction creates overhead for the entire process.
What are the two main components of an HDFS cluster, and what are their respective roles?
An HDFS cluster includes a single NameNode, which manages the file system metadata, and multiple DataNodes, which store the actual data blocks.
How does a file get stored in an HDFS cluster, and what role does the NameNode play in this process?
Files in HDFS are split into blocks, stored in DataNodes; the NameNode maps blocks to DataNodes.
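The mapping the NameNode maintains can be pictured as a nested lookup: file path to an ordered list of blocks, and each block to the DataNodes holding its replicas. This is only a toy shape, not the real metadata format, and every name below is invented:

```python
# Hedged sketch of a NameNode block map: file -> blocks -> replica locations.
block_map = {
    "/logs/app.log": {
        "blk_0001": ["dn1", "dn2", "dn3"],  # block -> DataNodes with a copy
        "blk_0002": ["dn2", "dn4", "dn5"],
    }
}

def locate(path):
    """Return (block id, replica locations) pairs for a file, in order."""
    return list(block_map[path].items())

for blk, nodes in locate("/logs/app.log"):
    print(blk, nodes)
```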
How does HDFS handle data block storage, and what triggers communication between DataNodes and the NameNode?
During a write, data is first accumulated locally; when a full block's worth has been written, the client contacts the NameNode to obtain the next set of DataNodes that will store the following block.
What is the role of the replication factor in HDFS, and when is it determined?
The replication factor, set at file creation time, ensures fault tolerance by placing copies of each block on different DataNodes. If not specified, it defaults to three.
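A minimal sketch of choosing distinct DataNodes for a block's copies, assuming a flat node list. Real HDFS placement also weighs rack topology, free space, and load; the function and node names here are illustrative:

```python
DEFAULT_REPLICATION = 3  # the HDFS default when no factor is specified

def place_replicas(datanodes, replication=DEFAULT_REPLICATION):
    """Pick `replication` distinct DataNodes to hold a block's copies."""
    if len(datanodes) < replication:
        raise ValueError("not enough DataNodes for the replication factor")
    return datanodes[:replication]

print(place_replicas(["dn1", "dn2", "dn3", "dn4"]))  # ['dn1', 'dn2', 'dn3']
```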
How does Hadoop’s rack awareness enhance fault tolerance, and what’s a drawback associated with it?
Rack awareness enhances fault tolerance by distributing a block's replicas across racks, so the failure of an entire rack does not lose every copy. The drawback is higher write I/O cost, since blocks must be transferred between racks.
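HDFS's default rack-aware policy places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that second rack. A toy sketch, with an invented topology:

```python
# Hedged sketch of rack-aware replica placement; real HDFS also checks
# node health, space, and load. The topology below is invented.
topology = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}

def rack_aware_placement(writer, topology):
    first = writer
    # Second replica: any node on a different rack than the writer.
    second = next(dn for dn, rack in topology.items()
                  if rack != topology[writer])
    # Third replica: a different node on the second replica's rack.
    third = next(dn for dn, rack in topology.items()
                 if rack == topology[second] and dn != second)
    return [first, second, third]

print(rack_aware_placement("dn1", topology))  # ['dn1', 'dn3', 'dn4']
```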
What happens when a DataNode fails health checks in HDFS, and how does the system handle block replication?
When a DataNode fails its health checks, the NameNode removes that DataNode from the write pipeline and re-replicates the now under-replicated blocks it held onto other healthy DataNodes.
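The re-replication step can be sketched as: for each block the dead node held, copy it to surviving nodes that lack it until the target replication is restored. All names are illustrative:

```python
# Hedged sketch of re-replication after a DataNode failure.
def re_replicate(block_replicas, dead_node, live_nodes, target=3):
    """Return an updated block -> replica-list map after losing `dead_node`."""
    repaired = {}
    for blk, nodes in block_replicas.items():
        nodes = [n for n in nodes if n != dead_node]   # drop the dead replica
        for candidate in live_nodes:                   # top up to target
            if len(nodes) >= target:
                break
            if candidate not in nodes:
                nodes.append(candidate)
        repaired[blk] = nodes
    return repaired

blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn4", "dn5"]}
print(re_replicate(blocks, dead_node="dn2",
                   live_nodes=["dn1", "dn3", "dn4", "dn5"]))
```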