Storage Flashcards

1
Q

What is the Hadoop Distributed File System (HDFS)?

A

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It is optimized for high throughput and batch processing rather than low-latency operations.

2
Q

What are the primary goals of a Distributed File System (DFS)?

A

The goals include abstraction (treating multiple disks as one unified disk), big data support (splitting files into blocks stored on different machines), and fault tolerance (replicating data to prevent data loss).

3
Q

Why are files split into blocks in HDFS?

A

Files are split into blocks to handle sizes larger than a single disk can store, simplify storage management, and make replication easier.

4
Q

What are the typical sizes for HDFS blocks?

A

HDFS block sizes are typically configured between 64MB and 1GB, with a default of 128MB (64MB in older Hadoop 1.x releases). This large block size reduces the number of seeks required when reading large files.
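
As a rough illustration of why this matters (a Python sketch, not HDFS's actual bookkeeping), the number of blocks the Namenode must track is the file size divided by the block size:

import math

BLOCK_SIZE = 128 * 1024 * 1024       # default HDFS block size: 128MB
file_size = 1 * 1024 * 1024 * 1024   # a 1GB file

num_blocks = math.ceil(file_size / BLOCK_SIZE)
print(num_blocks)  # 8 blocks -> 8 metadata entries, at most 8 seeks

With a 4KB block size, as on many local filesystems, the same file would need 262,144 blocks, which is why HDFS favors much larger blocks.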

5
Q

What roles do Namenodes (NN) and Datanodes (DN) play in HDFS?

A

The Namenode acts as the master, maintaining the filesystem tree and all metadata, while Datanodes store and retrieve the actual blocks and report back to the Namenode through periodic heartbeats (liveness) and block reports (which blocks they hold).

6
Q

Why is the Namenode (NN) a single point of failure (SPoF)?

A

If the Namenode fails, the block-to-file mapping is lost and the filesystem cannot reconstruct files from the blocks stored on the Datanodes. It is therefore crucial to protect the NN with measures such as writing its metadata to multiple locations and checkpointing it via a Secondary Namenode (a checkpoint helper, not a hot standby).

7
Q

How does HDFS ensure high availability (HA)?

A

HDFS supports HA by configuring two separate Namenodes, one active and one standby. The standby NN stays up-to-date with the active NN’s metadata through shared edit logs and reports from Datanodes.

8
Q

What is HDFS federation?

A

HDFS federation uses multiple independent Namenodes, each managing a portion of the filesystem. This improves performance, availability, scalability, and flexibility by avoiding a single NN bottleneck.

9
Q

What is the default replication factor in HDFS and how does it work?

A

The default replication factor is 3. The first replica is stored on the node issuing the write (if the client runs on a Datanode in the cluster; otherwise a random node), the second on a node in a different rack, and the third on a different node in the same rack as the second.
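
A minimal Python sketch of this placement policy (a toy model with invented names; real HDFS also weighs node load, free space, and network topology):

import random

def place_replicas(writer_node, cluster):
    """cluster: dict mapping rack id -> list of Datanode names."""
    # 1st replica: the node issuing the write
    writer_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    # 2nd replica: a node on a different rack
    other_rack = random.choice([r for r in cluster if r != writer_rack])
    second = random.choice(cluster[other_rack])
    # 3rd replica: a different node on the same rack as the 2nd
    third = random.choice([n for n in cluster[other_rack] if n != second])
    return [writer_node, second, third]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", cluster))  # e.g. ['dn1', 'dn4', 'dn3']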

10
Q

What is Erasure Coding (EC) in HDFS?

A

EC is an alternative to replication that keeps data fault tolerant with less storage: instead of full copies, it stores parity blocks computed over stripes of data blocks distributed across nodes. A typical RS(6,3) setup cuts storage overhead from 200% (3x replication) to 50%.
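
The overhead figures follow directly from the scheme. For example, with the common Reed-Solomon RS(6,3) policy:

data_blocks, parity_blocks = 6, 3  # RS(6,3): 6 data + 3 parity per stripe

ec_overhead = parity_blocks / data_blocks  # 0.5 -> 50% extra storage
rep_overhead = (3 - 1) / 1                 # 3x replication -> 200% extra storage

print(f"EC: {ec_overhead:.0%}, replication: {rep_overhead:.0%}")

RS(6,3) survives the loss of any 3 blocks per stripe, while 3x replication survives the loss of 2 copies.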

11
Q

What are the advantages and disadvantages of Erasure Coding?

A

Advantages include reduced storage redundancy and efficient distribution. Disadvantages include higher CPU costs, longer recovery times, and loss of data locality due to distributed stripes.

12
Q

What are the limitations of HDFS?

A

HDFS is not ideal for applications requiring low-latency access (e.g., sub-millisecond access) or handling billions of small files due to the Namenode’s memory limitations.
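
The small-files problem is easy to quantify. Using the common rule of thumb that each file and each block costs roughly 150 bytes of Namenode heap (an estimate, not an exact constant):

BYTES_PER_OBJECT = 150          # rough per-object Namenode heap cost

small_files = 1_000_000_000     # one billion small files
objects = small_files * 2       # ~1 file object + 1 block object each
heap_gb = objects * BYTES_PER_OBJECT / 1024**3
print(f"~{heap_gb:.0f} GB of Namenode heap")  # ~279 GB, all of it in RAM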

13
Q

How does data transfer work in HDFS during read operations?

A

A client contacts the Namenode to find the block locations, and then directly retrieves the data from the relevant Datanodes, ensuring that data transfer bypasses the Namenode for efficiency.
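
A toy Python model of this protocol (invented data; a real client uses Hadoop RPC and reads from the closest replica):

# Namenode: metadata only -- which Datanodes hold each block of a file
namenode_metadata = {
    "/logs/day1": {"blk_1": ["dn1", "dn3"], "blk_2": ["dn2", "dn3"]},
}
# Datanodes: the actual block contents
datanode_storage = {
    "dn1": {"blk_1": b"part-one "},
    "dn2": {"blk_2": b"part-two"},
    "dn3": {"blk_1": b"part-one ", "blk_2": b"part-two"},
}

def read_file(path):
    data = b""
    for block, locations in namenode_metadata[path].items():  # 1. ask NN for locations
        dn = locations[0]                                     # 2. pick a Datanode
        data += datanode_storage[dn][block]                   # 3. stream the block from it
    return data

print(read_file("/logs/day1"))  # block data never flows through the Namenode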

14
Q

What types of file formats are used in big data infrastructure?

A

Standard formats include structured (e.g., CSV), semi-structured (e.g., JSON), and specialized big data formats like column-oriented (e.g., Parquet) and row-oriented (e.g., SequenceFiles).
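
For instance, with pandas the same small table can be written in each family (the Parquet step assumes pyarrow is installed):

import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob"], "clicks": [3, 7]})
df.to_csv("events.csv", index=False)         # structured text, row-oriented
df.to_json("events.json", orient="records")  # semi-structured
df.to_parquet("events.parquet")              # column-oriented binary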

15
Q

What are the benefits of column-oriented file formats for big data?

A

They offer better compression, reduced I/O for analytical queries (only relevant columns are read), and optimized storage through type-specific encodings and skipping unnecessary deserialization.
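
With pyarrow, for example, column pruning is explicit; reusing the hypothetical events.parquet file from the previous card, only the requested column is read and deserialized:

import pyarrow.parquet as pq

table = pq.read_table("events.parquet", columns=["clicks"])
print(table.to_pandas())  # the 'user' column is never touched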

16
Q

What is Apache Parquet and its main use?

A

Apache Parquet is a columnar storage format designed for efficient querying of nested data structures. It supports schema evolution and is interoperable with multiple big data frameworks.
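
A minimal pyarrow sketch writing nested data (invented file name; each leaf field becomes its own column on disk):

import pyarrow as pa
import pyarrow.parquet as pq

# A column of structs containing a nested list
table = pa.table({
    "user": [
        {"name": "ann", "tags": ["admin", "dev"]},
        {"name": "bob", "tags": []},
    ]
})
pq.write_table(table, "users.parquet")
print(pq.read_table("users.parquet").schema)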

17
Q

What is the purpose of repetition and definition levels in Parquet?

A

These levels help map and reconstruct nested data structures from flat columnar storage, indicating when new lists start (repetition level) and whether a value is null or defined (definition level).
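
A worked example for the simplest case, a single repeated int32 column (max repetition and definition level are both 1). Three records [1, 2], [], [3] are stored as:

# (value, repetition level, definition level) per flattened entry
entries = [
    (1,    0, 1),  # rep=0: starts a new record; def=1: value present
    (2,    1, 1),  # rep=1: continues the current record's list
    (None, 0, 0),  # new record; def=0: the list is empty, no value
    (3,    0, 1),  # new record with a single value
]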

18
Q

What is dictionary encoding in Parquet?

A

It is a compression technique that maps each distinct column value to a small integer key and stores the keys plus a per-column dictionary instead of repeated string values. It saves space whenever a column has a limited number of distinct values.
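
A minimal Python sketch of the idea (Parquet's real encoding also run-length encodes the keys):

column = ["US", "DE", "US", "US", "FR", "DE"]

dictionary = sorted(set(column))               # ['DE', 'FR', 'US']
keys = [dictionary.index(v) for v in column]   # [2, 0, 2, 2, 1, 0]

# six small integers plus a 3-entry dictionary replace six repeated strings
print(dictionary, keys)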