HDFS Flashcards

1
Q

What is HDFS?

A

It is the Hadoop Distributed File System, which lets you store very large files across a cluster of computers built from low-cost hardware. It is part of the Hadoop framework.

2
Q

What are the HDFS components?

A

Name Node and Data Node.

There are three main types of NameNode:
Active, which is responsible for storing metadata such as file names, permissions, and the locations of blocks. It holds this data in memory, but the data is also written to disk for persistent storage.
Secondary, which is responsible for merging the edit logs (the sequence of changes made to the filesystem after the NameNode started) into the fsimage (the snapshot of the filesystem when the NameNode started). It is also known as the Checkpoint Node.
Standby, which is similar to the Secondary but adds failover capability: it acts as a hot standby, keeping its state synchronised with the Active NameNode.

DataNodes hold the actual data blocks and are responsible for serving read and write requests from clients.
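
To make the split of responsibilities concrete, here is a minimal Java sketch (the path /data/example.txt is hypothetical; the standard org.apache.hadoop.fs.FileSystem client is used) that asks the NameNode for a file's block locations, which is pure metadata; the blocks themselves live on the DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt"); // hypothetical path

        // The NameNode answers this query from its in-memory metadata.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each entry names the DataNodes that actually hold a block replica.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}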

3
Q

What is fencing in HDFS?

A

The process of ensuring that only one NameNode remains active at any time, so that two nodes never act as the Active NameNode simultaneously (a split-brain scenario) during failover.

4
Q

What is Checkpointing in HDFS?

A

The process that merges the edit logs into the fsimage, so that the NameNode can reduce its startup time when it needs to restart, for example.

5
Q

What are the default block size and replication factor in HDFS, and how does a client perform read and write operations?

A

Files are split into 128 MB blocks, each replicated (3 replicas by default) onto different nodes for reliability and availability.

To read from HDFS, the client first asks the NameNode where the file is stored; the NameNode replies with the DataNodes holding each block, and the client then reads the blocks directly from those DataNodes.

To write to HDFS, the client first contacts the NameNode, which records the metadata and returns target DataNodes; the client then streams the actual blocks of data to those DataNodes.

(Might want to review this one)
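
A minimal Java sketch of both paths, assuming a hypothetical path and payload (the FileSystem client carries out the NameNode and DataNode conversations described above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/demo.txt"); // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the block data to them directly.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the client asks the NameNode which DataNodes hold the
        // blocks, then fetches the bytes from those DataNodes directly.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}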

6
Q

What is the small file problem in Hadoop?

A

HDFS cannot handle lots of small files well because the NameNode keeps metadata about every file and block in memory. The memory required to store this metadata grows with the number of objects and cannot scale beyond a limit.

Too many small files = too many blocks
Too many blocks = too much metadata
Too much metadata = RAM is exhausted and seek costs increase

A Hadoop Archive (HAR) packs many small files into a single archive file to mitigate this.

MapReduce processes one block of data at a time. Many small files mean many blocks, which means many tasks, reducing overall performance; see the worked example below.
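
A rough worked example, assuming the commonly cited figure of about 150 bytes of NameNode heap per namespace object: 10 million files of 1 MB each produce roughly 10 million file objects plus 10 million block objects, i.e. about 20 million objects × 150 bytes ≈ 3 GB of heap. The same 10 TB of data stored in 128 MB blocks would need only about 80,000 blocks.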

7
Q

What is Data Locality?

A

Move computation to the data: instead of moving data to the code, we move the code to the data. Tasks (Map, Reduce) are shipped to the nodes holding the relevant blocks; the data is not copied to the tasks. This minimises network I/O, since tasks read from local disk instead of over the network.

8
Q

What are the key features of HDFS?

A

It is highly fault-tolerant (due to block replication) and provides high throughput (due to parallel processing), making it suitable for applications with large data sets.

9
Q

What is a block and block scanner in HDFS?

A

The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The Block Scanner runs on each DataNode, tracking the list of blocks stored there and periodically verifying them to detect checksum errors.

10
Q

What is a job tracker?

A

It is a daemon (a service or process that runs in the background) that runs on the master node (often co-located with the NameNode) for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different TaskTrackers.

11
Q

What is a task tracker?

A

The TaskTracker is a daemon that runs on the DataNodes and manages the execution of the individual tasks assigned to it by the JobTracker.

12
Q

Can you change files at arbitrary locations in HDFS?

A

No. HDFS is append-only: writes to a file are always made at the end of the file, and existing bytes cannot be modified in place.
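
A minimal Java sketch of the append path (hypothetical file; append support is assumed, which modern Hadoop versions provide):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/log.txt"); // hypothetical existing file

        // Allowed: append() always writes at the end of the file.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("new record\n");
        }

        // There is no API for overwriting bytes in the middle of a file;
        // changing earlier content means rewriting the whole file.
        fs.close();
    }
}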

13
Q

Suppose there is a file of size 514 MB stored in HDFS using default block size configuration and default replication factor. Then, how many blocks will be created in total and what will be the size of each block?

A

Default block size = 128 MB
514 MB / 128 MB = 4.015…, so 5 blocks: four full blocks of 128 MB and one final block of 2 MB (a block only occupies the space it actually needs).

Replication factor = 3
Total blocks = 5 × 3 = 15
Total size = 514 MB × 3 = 1542 MB

14
Q

Can multiple clients write into an HDFS file concurrently?

A

HDFS follows a single-writer, multiple-reader model.

The client that opens a file for writing is granted a lease on that file by the NameNode.
The NameNode rejects write requests from other clients for a file that is currently being written by someone else, as the sketch below illustrates.
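
A hedged Java sketch of the lease in action (hypothetical path; in practice the second writer would be a separate client process, and the exact exception type can vary, e.g. AlreadyBeingCreatedException):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/shared.txt"); // hypothetical path

        // First writer: the NameNode grants this client the lease.
        FSDataOutputStream writer = fs.create(file);
        writer.writeBytes("first writer holds the lease\n");

        // Second write attempt while the lease is held: the NameNode
        // rejects it (in practice this would be another client process).
        try (FSDataOutputStream second = fs.append(file)) {
            second.writeBytes("never reached\n");
        } catch (IOException expected) {
            System.out.println("Rejected: " + expected.getMessage());
        }

        writer.close();
        fs.close();
    }
}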

15
Q

What do you mean by High Availability of a NameNode? How is it achieved?

A

By using a Standby NameNode that is kept in sync with the Active NameNode, so failover is automatic. You could achieve something similar with the Secondary NameNode, but that requires manual intervention, which means there will be downtime.
