Storage Flashcards
What is the Hadoop Distributed File System (HDFS)?
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It is optimized for high throughput and batch processing rather than low-latency operations.
What are the primary goals of a Distributed File System (DFS)?
The goals include abstraction (treating multiple disks as one unified disk), big data support (splitting files into blocks stored on different machines), and fault tolerance (replicating data to prevent data loss).
Why are files split into blocks in HDFS?
Files are split into blocks to handle sizes larger than a single disk can store, simplify storage management, and make replication easier.
What are the typical sizes for HDFS blocks?
HDFS blocks are typically configured between 64MB and 1GB, with a default size of 128MB, much larger than the few-kilobyte blocks of a local filesystem. The large block size keeps the number of seeks (and the amount of block metadata) small when reading large files.
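As a rough sketch of how block size shapes a file's layout (assuming a standard Hadoop client on the classpath and a hypothetical file path), the snippet below asks HDFS for a file's block size and works out how many blocks the file occupies; a 1GB file with 128MB blocks, for instance, spans 8 blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/logs/2024.log");   // hypothetical path

            FileStatus status = fs.getFileStatus(file);
            long blockSize = status.getBlockSize();        // e.g. 134217728 bytes (128 MB)
            long length = status.getLen();

            // ceil(length / blockSize): a 1 GB file with 128 MB blocks -> 8 blocks
            long numBlocks = (length + blockSize - 1) / blockSize;
            System.out.println(length + " bytes -> " + numBlocks + " blocks of " + blockSize + " bytes");
        }
    }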
What roles do Namenodes (NN) and Datanodes (DN) play in HDFS?
The Namenode acts as the master, maintaining the filesystem tree and all metadata, while Datanodes store and retrieve blocks and report back to the Namenode through periodic heartbeats (proving they are alive) and block reports (listing the blocks they hold).
Why is the Namenode (NN) a single point of failure (SPoF)?
If the Namenode fails, the metadata needed to reassemble files from the blocks on the Datanodes is lost, so the entire filesystem becomes unusable. This makes it crucial to protect the NN, for example by writing its metadata to remote backup storage, running a Secondary Namenode that periodically checkpoints the namespace, or configuring a standby Namenode for high availability.
How does HDFS ensure high availability (HA)?
HDFS supports HA by configuring two separate Namenodes, one active and one standby. The standby NN stays up-to-date with the active NN’s metadata through shared edit logs and reports from Datanodes.
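The sketch below illustrates the shape of an HA client configuration; the nameservice name "mycluster", the Namenode IDs nn1/nn2, and the host names are made up for the example, and in a real deployment these settings normally live in hdfs-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientConfig {
        public static void main(String[] args) throws Exception {
            // Illustrative HA settings for a hypothetical nameservice "mycluster".
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://mycluster");
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                     "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            // The client addresses the nameservice, not a specific NN, so failover is transparent.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Connected to " + fs.getUri());
        }
    }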
What is HDFS federation?
HDFS federation uses multiple independent Namenodes, each managing a portion of the filesystem namespace. This improves performance, availability, scalability, and flexibility by avoiding a single-NN bottleneck.
What is the default replication factor in HDFS and how does it work?
The default replication factor is 3. The first replica is stored on the node issuing the write (if within the cluster), the second on a different rack, and the third on a different node in the same rack as the second.
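As a small sketch (hypothetical path, standard Hadoop client API), the snippet below asks for 3 replicas of a file's blocks; the actual placement of those replicas is decided by the Namenode using the rack-aware policy described above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/important.csv");   // hypothetical path

            // Request 3 replicas per block (the HDFS default); the Namenode
            // chooses which Datanodes and racks actually hold them.
            fs.setReplication(file, (short) 3);

            short actual = fs.getFileStatus(file).getReplication();
            System.out.println("Replication factor: " + actual);
        }
    }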
What is Erasure Coding (EC) in HDFS?
EC is an alternative to replication that reduces data redundancy by creating parity blocks for data blocks. It splits data into stripes distributed across nodes, reducing storage overhead from 200% to 50% in typical setups.
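The 200% and 50% figures follow from simple arithmetic; the sketch below works it through for 3x replication versus a Reed-Solomon RS(6,3) layout (6 data cells plus 3 parity cells), one common EC policy.

    public class EcOverhead {
        public static void main(String[] args) {
            // 3x replication: every block is stored 3 times,
            // i.e. 2 extra copies per block of data -> 200% overhead.
            double replicationOverhead = (3.0 - 1.0) * 100;

            // RS(6,3): 6 data cells protected by 3 parity cells,
            // i.e. 3 extra cells per 6 cells of data -> 50% overhead.
            double ecOverhead = 3.0 / 6.0 * 100;

            System.out.println("3x replication overhead: " + replicationOverhead + "%");
            System.out.println("RS(6,3) overhead:        " + ecOverhead + "%");
        }
    }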
What are the advantages and disadvantages of Erasure Coding?
Advantages include reduced storage redundancy and efficient distribution. Disadvantages include higher CPU costs, longer recovery times, and loss of data locality due to distributed stripes.
What are the limitations of HDFS?
HDFS is not ideal for applications requiring low-latency access (e.g., sub-millisecond access) or for handling billions of small files, because the Namenode keeps the metadata for every file, directory, and block in memory and eventually runs out of heap.
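A commonly cited rule of thumb, an estimate rather than an exact figure, is that each file, directory, and block costs on the order of 150 bytes of Namenode heap; the sketch below shows why a billion small files becomes a problem.

    public class NamenodeHeapEstimate {
        public static void main(String[] args) {
            long files = 1_000_000_000L;   // one billion small files
            long objectsPerFile = 2;       // roughly one inode + one block per small file
            long bytesPerObject = 150;     // rough, commonly cited estimate of NN heap per object

            long heapBytes = files * objectsPerFile * bytesPerObject;
            System.out.printf("~%.0f GB of Namenode heap just for metadata%n",
                              heapBytes / (1024.0 * 1024 * 1024));
        }
    }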
How does data transfer work in HDFS during read operations?
A client contacts the Namenode to find the block locations, and then directly retrieves the data from the relevant Datanodes, ensuring that data transfer bypasses the Namenode for efficiency.
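A minimal read sketch (hypothetical path, standard Hadoop client API): the getFileBlockLocations call is answered from the Namenode's metadata, while open/read streams the bytes directly from the Datanodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadPath {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/events.log");   // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // Metadata lookup: the Namenode answers with the Datanodes holding each block.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + loc.getOffset() + " on " + String.join(",", loc.getHosts()));
            }

            // Data transfer: bytes are streamed from the Datanodes, not through the Namenode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println("Read " + n + " bytes from the first block");
            }
        }
    }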
What types of file formats are used in big data infrastructure?
Standard formats include structured (e.g., CSV), semi-structured (e.g., JSON), and specialized big data formats like column-oriented (e.g., Parquet) and row-oriented (e.g., SequenceFiles).
What are the benefits of column-oriented file formats for big data?
They offer better compression, reduced I/O for analytical queries (only relevant columns are read), and optimized storage through type-specific encodings and skipping unnecessary deserialization.
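As one illustration of column pruning, the sketch below assumes a Spark runtime (not part of these cards) and a hypothetical Parquet dataset; selecting two columns lets the reader skip the I/O and decoding for every other column.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ColumnPruning {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("column-pruning-sketch")
                    .getOrCreate();

            // Hypothetical Parquet dataset with many columns; only the two selected
            // columns are read from storage thanks to the columnar layout.
            Dataset<Row> tx = spark.read().parquet("hdfs:///data/transactions.parquet");
            tx.select("user_id", "amount").show(10);

            spark.stop();
        }
    }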