Hadoop Ecosystem Fundamentals of Distributed Systems continued Flashcards
What is Hadoop Distributed File System (HDFS)?
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop for storing large datasets across clusters of commodity hardware.
It is designed for reliability, scalability, and fault tolerance, with data distributed across multiple nodes in the cluster.
What is MapReduce in Hadoop?
MapReduce is a programming model and processing engine used by Hadoop for distributed data processing.
It consists of two main phases: the Map phase for processing input data in parallel across multiple nodes, and the Reduce phase for aggregating and processing intermediate results.
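The two phases can be illustrated with a word count that runs in a single process. This is only a sketch of the programming model, not Hadoop's API: the function names are hypothetical, and the explicit shuffle step stands in for the grouping by key that the framework performs between Map and Reduce.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

def word_count(lines):
    # On a cluster, each line (or input split) would be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Because each call to map_phase depends only on its own line, the Map work can be spread across nodes; only the grouped intermediate results need to be brought together for Reduce.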
What is YARN in Hadoop?
YARN (Yet Another Resource Negotiator) is the resource management and job scheduling framework in Hadoop.
It decouples resource management from job scheduling, allowing multiple data processing engines (such as MapReduce, Apache Spark, and Apache Flink) to run on the same cluster.
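As a rough illustration of several engines sharing one cluster, the sketch below models a resource manager that hands out memory containers to whichever application asks. The class names, the "most free memory" rule, and the application labels are assumptions made for the example; real YARN schedulers are considerably more elaborate.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_mb: int

@dataclass
class ToyResourceManager:
    nodes: list
    allocations: list = field(default_factory=list)

    def allocate(self, app, mb):
        """Grant one container of `mb` memory on the node with the most free memory."""
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mb:
            return None  # a real scheduler would queue the request instead
        node.free_mb -= mb
        self.allocations.append((app, node.name, mb))
        return node.name

rm = ToyResourceManager([Node("node-1", 4096), Node("node-2", 3072)])
first = rm.allocate("mapreduce-job", 2048)
second = rm.allocate("spark-job", 2048)
third = rm.allocate("flink-job", 4096)  # no node has this much memory left
print(first, second, third)
```

The resource manager never inspects what the application does with a container, which is what lets different engines coexist on the same nodes.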
What is the Hadoop ecosystem?
The Hadoop ecosystem refers to the collection of open-source projects and tools built around the Hadoop core components, including HDFS, MapReduce, and YARN.
It includes projects for data ingestion, processing, storage, querying, and visualization, such as Apache Hive, Apache HBase, Apache Spark, Apache Pig, and Apache Kafka.
What are the key components of the Hadoop architecture?
The key components of the Hadoop architecture include HDFS for storage, MapReduce for processing, and YARN for resource management.
Additional components in the Hadoop ecosystem provide functionalities such as SQL querying (Apache Hive), real-time event streaming (Apache Kafka), NoSQL databases (Apache HBase), and machine learning (Apache Spark MLlib).
What are nodes in Hadoop Distributed File System (HDFS)?
Nodes in HDFS refer to the individual machines that make up the Hadoop cluster.
Each node typically consists of commodity hardware and serves a specific role in the cluster, such as NameNode or DataNode.
What is the NameNode in HDFS?
The NameNode is the master node in HDFS that manages the file system namespace and controls access to files by clients.
It stores metadata about files and directories, such as file permissions, ownership, and the mapping of files to blocks, and it tracks which DataNodes hold each block based on the reports they send.
What is a DataNode in HDFS?
DataNodes are worker nodes in HDFS responsible for storing and managing the actual data blocks of files.
They store data on the local filesystem and communicate with the NameNode to report storage capacity and block status.
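The reporting relationship can be sketched as a small in-memory model. Everything here (the class, method names, and block IDs) is hypothetical and only shows the idea that the NameNode's view of block locations is rebuilt from what DataNodes report.

```python
from collections import defaultdict

class NameNodeView:
    """Hypothetical in-memory view of which DataNodes hold which blocks."""

    def __init__(self):
        self.block_locations = defaultdict(set)  # block id -> DataNode names
        self.free_bytes = {}                     # DataNode name -> reported free space

    def receive_block_report(self, datanode, block_ids, free_bytes):
        # Drop stale entries for this DataNode, then record what it reports now.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)
        self.free_bytes[datanode] = free_bytes

    def locations(self, block_id):
        return sorted(self.block_locations.get(block_id, ()))

nn = NameNodeView()
nn.receive_block_report("dn1", ["blk_1", "blk_2"], free_bytes=500)
nn.receive_block_report("dn2", ["blk_1"], free_bytes=800)
nn.receive_block_report("dn1", ["blk_2"], free_bytes=600)  # dn1 no longer holds blk_1
print(nn.locations("blk_1"), nn.locations("blk_2"))
```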
What is the Secondary NameNode in HDFS?
The Secondary NameNode is a helper node in HDFS that performs periodic checkpoints by merging the NameNode’s FSImage with its Edit Log.
It is not a standby and cannot take over if the NameNode fails, but its checkpoints keep the Edit Log short, which reduces the time the NameNode needs to restart and recover.
How are data blocks distributed across DataNodes in HDFS?
Data blocks are replicated and distributed across multiple DataNodes in the HDFS cluster to ensure fault tolerance and high availability.
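The replication factor determines the number of copies of each block. With rack awareness enabled, replicas are spread over more than one rack so that the loss of a whole rack does not lose the block, while some replicas share a rack to limit traffic between racks.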
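A simplified version of rack-aware placement for a replication factor of 3 might look like the sketch below: the first replica goes on the writer's node, the second on a different rack, and the third on another node of that second rack. The function name and topology are hypothetical, and the choice is deterministic here for readability, whereas HDFS picks among eligible nodes with randomness and load checks.

```python
def place_replicas(writer, topology, replication=3):
    """Pick DataNodes for one block.

    topology maps a rack name to the list of DataNode names on that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [writer]  # 1st replica: the writer's own node
    # 2nd replica: first node found on a different rack than the writer
    remote = [n for n in rack_of if rack_of[n] != rack_of[writer]]
    if remote and replication >= 2:
        chosen.append(remote[0])
    # 3rd replica: another node on the same rack as the 2nd
    if len(chosen) == 2 and replication >= 3:
        same_rack = [n for n in topology[rack_of[chosen[1]]] if n not in chosen]
        if same_rack:
            chosen.append(same_rack[0])
    # Any remaining replicas: unused nodes in declaration order
    for n in rack_of:
        if len(chosen) >= replication:
            break
        if n not in chosen:
            chosen.append(n)
    return chosen[:replication]

topology = {"rack-a": ["dn1", "dn2"], "rack-b": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
print(replicas)
```

With this layout the block survives the loss of either rack, and only one copy of the data has to cross the link between racks during the write.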
What is the Edit Log in Hadoop’s HDFS?
The Edit Log is a file maintained by the NameNode that records all modifications made to the file system namespace metadata, such as file creations, deletions, and modifications.
It acts as a persistent transaction log, allowing the system to recover the file system’s state in the event of a failure.
What is the FSImage in Hadoop’s HDFS?
The FSImage is a snapshot of the file system namespace metadata at a particular point in time in Hadoop’s HDFS.
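It contains information about the directory structure, file permissions, ownership, and the blocks that make up each file. The DataNode locations of those blocks are not stored in the FSImage; the NameNode rebuilds them from DataNode block reports.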
How are Edit Log and FSImage used in checkpointing?
In Hadoop’s HDFS, the Edit Log and FSImage are used together in a process called checkpointing.
During checkpointing, the most recent FSImage is merged with the transactions in the Edit Log to produce a new FSImage, after which the already-merged Edit Log entries are no longer needed and the log starts over.
How are Edit Log and FSImage used in recovery?
In the event of a failure, Hadoop’s HDFS uses the FSImage and the Edit Log to recover the file system’s state.
The FSImage provides a consistent snapshot of the file system metadata, while the Edit Log contains a record of all transactions since the last checkpoint.
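Checkpointing and recovery both come down to replaying logged transactions on top of a snapshot. The sketch below models that with plain dictionaries and tuples; the operation names and record layout are assumptions for illustration and do not reflect the on-disk formats HDFS uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> metadata)."""
    op, path, *rest = edit
    if op == "create":
        namespace[path] = {"owner": rest[0]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "chown":
        namespace[path]["owner"] = rest[0]
    return namespace

def replay(fsimage, edit_log):
    """Recovery: start from the snapshot and replay every later transaction."""
    namespace = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace

def checkpoint(fsimage, edit_log):
    """Checkpoint: fold the edits into a new snapshot and start an empty log."""
    return replay(fsimage, edit_log), []

fsimage = {"/logs": {"owner": "hdfs"}}
edit_log = [("create", "/data.csv", "alice"), ("chown", "/logs", "bob")]

recovered = replay(fsimage, edit_log)
new_image, new_log = checkpoint(fsimage, edit_log)
new_log.append(("delete", "/data.csv"))
after_restart = replay(new_image, new_log)
print(recovered, after_restart)
```

The shorter the log since the last checkpoint, the fewer transactions replay has to process, which is why regular checkpoints shorten recovery.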
What is the impact of Edit Log and FSImage on HDFS performance?
The frequent writing of transactions to the Edit Log can impact HDFS performance due to disk I/O.
However, periodic checkpointing and optimizations in the way transactions are written can help mitigate this impact.
What is the NameNode in Hadoop’s HDFS?
The NameNode is the HDFS service responsible for managing the metadata of the file system.
It maintains the namespace hierarchy, file permissions, and mapping of data blocks to DataNodes.
With only one NameNode, it is a single point of failure in the HDFS architecture; high-availability deployments address this with a standby NameNode and automatic failover.
How does a write occur in HDFS?
A write in HDFS begins with a client sending a request to the NameNode to create the file. Only metadata is exchanged with the NameNode; the file's data is never sent to it.
What happens after the client sends a write request to the NameNode in HDFS?
Upon receiving the write request, the NameNode checks that the client has permission and that the file does not already exist, then records the new file in the namespace.
For each block, it returns a list of suitable DataNodes, chosen based on factors such as availability, proximity, and load.
How does the client write data to HDFS after block allocation?
Once the block allocation is complete, the client connects to the first DataNode in the list, and the chosen DataNodes form a pipeline.
The client streams the data to the first DataNode in the form of data packets, and each DataNode forwards the packets to the next one in the pipeline.
How are the data blocks replicated to other DataNodes during a write?
Replication happens while the data is being written: each DataNode in the pipeline stores the packets it receives and forwards them to the next DataNode, until the number of copies matches the configured replication factor.
Replication ensures fault tolerance and data availability by storing multiple copies of each block on different nodes in the cluster.
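The cost of replication is easy to estimate. The sketch below assumes a 128 MiB block size and a replication factor of 3; both values are configurable, so treat them as illustrative defaults rather than fixed properties of HDFS.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MiB)
REPLICATION = 3                 # assumed replication factor

def storage_footprint(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return how many blocks and how much raw disk a file of file_size bytes needs."""
    blocks = math.ceil(file_size / block_size)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block only occupies as much disk as it has data,
        # so raw usage is the file size times the replication factor.
        "raw_bytes": file_size * replication,
    }

footprint = storage_footprint(1024 ** 3)  # a 1 GiB file
print(footprint)
```

Under these assumptions a 1 GiB file is split into 8 blocks, stored as 24 block replicas, and consumes 3 GiB of raw disk across the cluster.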
How does the client know the write operation was successful in HDFS?
Each DataNode in the pipeline acknowledges every packet it stores, and the acknowledgments travel back through the pipeline to the client.
Once all packets are acknowledged, the client closes the file and notifies the NameNode, which treats the write as complete when each block has reached the minimum required number of replicas.
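The acknowledgment flow can be modelled with a short simulation. The function names, the storage dictionary, and the min_replicas threshold are all hypothetical. The failure handling is also deliberately simplified: here a failed node just cuts off the rest of the pipeline, whereas a real client would rebuild the pipeline from the remaining DataNodes.

```python
def write_packet(packet, pipeline, storage, failed=frozenset()):
    """Send one packet down a DataNode pipeline.

    Returns the DataNodes that stored the packet, in the order their
    acknowledgments reach the client (last node in the pipeline first).
    """
    acked = []
    for node in pipeline:
        if node in failed:
            break  # downstream nodes never see the packet in this simplified model
        storage.setdefault(node, []).append(packet)  # store, then forward
        acked.append(node)
    return acked[::-1]

def write_succeeded(acks, min_replicas=1):
    """The client treats the packet as written once enough DataNodes acknowledged it."""
    return len(acks) >= min_replicas

storage = {}
acks = write_packet(b"hello", ["dn1", "dn3", "dn4"], storage)
partial = write_packet(b"world", ["dn1", "dn3", "dn4"], storage, failed={"dn3"})
print(acks, partial, write_succeeded(partial))
```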
What is YARN in Hadoop 2.x architecture?
YARN is a resource management and job scheduling framework introduced in Hadoop 2.x.
It takes over the resource management and job scheduling functions that the MapReduce JobTracker handled in Hadoop 1.x, so MapReduce becomes one of several data processing frameworks that can run on the same Hadoop cluster.