Big Data Lecture 04 Distributed File Systems Flashcards

1
Q

What are two different types of data by source?

A

- Raw data: sensors, events, logs, ...
- Derived data: aggregated data, intermediate data, ...

2
Q

What are the two types of big data, and which is handled by HDFS and which by cloud storage?

A

- A huge number (billions) of large (<TB) files, handled by cloud storage such as S3 or Azure,
- or a large number (millions) of huge (<PB) files, handled by HDFS.

3
Q

How are files organized in HDFS?

A

- A file structure,
- underneath which is block storage.

4
Q

Why do we require fault tolerance in big file systems?

A

With a large number of nodes, individual node failures are statistically guaranteed to happen from time to time, so the system must keep working despite them.

5
Q

What is the file reading and updating model in DFS?

A

- Reading: we scan the file sequentially; random access is not needed.
- Updating: we append to the file, possibly from many concurrent clients, and appends must be atomic.

6
Q

What are the performance requirements of a DFS?

A

We optimize for throughput, not latency.

That is, we do not want to spend time seeking and waiting before data goes onto the network; we want to be limited by the network speed. That is why we prefer bigger blocks and fetch whole blocks at a time.

7
Q

How do we bridge the gap between capacity and throughput, and between throughput and latency?

A

- Parallelization (capacity vs. throughput),
- batch processing (throughput vs. latency).

8
Q

What is the block size in HDFS?

A

64 MB in older Hadoop versions, 128 MB by default in Hadoop 2 and later.

9
Q

What is the network architecture of HDFS?

A

One coordinator node (the NameNode); the rest are worker nodes (DataNodes), which can still talk to each other (e.g. to replicate blocks).

10
Q

How are files treated in HDFS?

A

They are split into blocks, which are replicated over worker nodes.

11
Q

What does the coordinator node hold in HDFS?

A
1. The file namespace (the directory structure of the file system),
2. the file-to-block mapping,
3. the block locations (which DataNodes hold which blocks).

12
Q

What if a file's size is not an exact multiple of 128 MB?

A

The final block only takes up the space that is actually needed, not a full block. For example, a 200 MB file is stored as one 128 MB block plus one 72 MB block, not as two full 128 MB blocks.

13
Q

How does the NameNode talk to the client?

A

The client sends requests, and the NameNode replies with the block IDs and the DataNode locations holding them. The client then uses the HDFS Java API, which ties this together and downloads the actual data directly from the DataNodes.
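A minimal sketch of this flow using the Hadoop Java API (the NameNode address and file path below are placeholders, not from the lecture):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (placeholder host and port).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf);
             // open() asks the NameNode for block IDs and DataNode locations;
             // the returned stream then reads the blocks from the DataNodes.
             FSDataInputStream in = fs.open(new Path("/data/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```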

14
Q

How do the NameNode and the DataNodes communicate?

A

Each DataNode sends a heartbeat every 3 s, saying it is alive and reporting any newly received blocks; in response, the NameNode sends commands back.

Every 6 hours each DataNode also sends a block report describing all the blocks it stores.

15
Q

What does downloading a file over HDFS look like?

A

The client streams blocks from different DataNodes at the same time and then reassembles them into the file.

16
Q

How does a user add a file to HDFS?

A

The client first talks to the NameNode, which tells it, per block, which DataNodes to upload to. The client then streams the data in 64 kB packets through the pipeline of DataNodes; for each packet it receives an acknowledgement once it has been written and replicated on all the target nodes.

Finally, the acknowledgement and the release of the lock come from the NameNode.
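A minimal sketch of this write path with the Hadoop Java API (the NameNode address and path are placeholders): create() obtains the target DataNodes from the NameNode, and closing the stream completes the file.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode where each block should go; the stream
             // then pipelines the data in small packets to those DataNodes.
             FSDataOutputStream out = fs.create(new Path("/data/new-file.txt"))) {
            out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
        } // closing the stream flushes the last packet and completes the file at the NameNode
    }
}
```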

17
Q

How are replicas placed in HDFS?

A
1. First replica: the same node as the client (cheap replication, no network hop),
2. second replica: a node in a different rack B,
3. third replica: another node in the same rack B.
Further replicas: random racks, but if possible at most one replica per node and at most two replicas per rack.

18
Q

What is the weak point of HDFS?

A

The NameNode: there is only a single one, and if it fails, the whole file system becomes unavailable.

19
Q

What data is backed up from the NameNode, and how?

A

The directory tree and the file-to-block mapping are saved.

We do not save which DataNodes hold which blocks, because that can easily be recovered from the block reports sent every 6 hours.

The state is stored (on a separate drive, or in the cloud) in a namespace file, on top of which we keep an edit log of recent changes; the edit log is merged into the namespace file from time to time.

20
Q

How long does it take to restart the NameNode and why?

A

We need to replay the edit log on top of the namespace file, which takes around 30 minutes.

21
Q

How can we achieve High Availability (HA)?

A

We need to make the NameNode redundant. This can be done in multiple ways:
- Standby NameNodes,
- Observer NameNodes,
- federated DFS.

22
Q

What are Standby NameNodes? What is Quorum Journal Manager?

A

A Standby NameNode is a running duplicate of the NameNode that tails the edit log and copies it; it also receives block reports from the DataNodes.

With the Quorum Journal Manager, the edit log is replicated over several journal nodes, and as long as a majority of them agree, the system can be kept alive and running.

23
Q

What is Observer NameNode?

A

A Standby NameNode that, in addition, can serve read requests, directing clients to the data on the DataNodes.

24
Q

What is Federated DFS?

A

Different parts of the directory tree are handled by different NameNodes; additionally, the NameNodes can duplicate each other's state, so one can act as a standby for another.

25
Q

How to use HDFS locally?

A

From the command line, one can use

hadoop fs <args>

which works with basically any storage system, given the correct scheme (the local file system via the file scheme, or S3 and HDFS).

As arguments, we can give commands such as -ls, -cat, -rm, -mkdir, -copyFromLocal, -copyToLocal.
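A few example invocations (all paths here are made up for illustration):

```
# list a directory in HDFS
hadoop fs -ls /user/alice

# copy a local file into HDFS and back
hadoop fs -copyFromLocal data.csv /user/alice/data.csv
hadoop fs -copyToLocal /user/alice/data.csv data_copy.csv

# print a file and then remove it
hadoop fs -cat /user/alice/data.csv
hadoop fs -rm /user/alice/data.csv

# the same tool works on the local file system via a file:// URI
hadoop fs -ls file:///tmp
```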

26
Q

What is Apache Flume?

A

A tool that collects, aggregates, and moves log data into HDFS.

27
Q

What is Apache Sqoop?

A

A tool that imports data from relational databases into HDFS.
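A typical invocation looks roughly like this (the JDBC connection string, credentials, table, and target directory are placeholders):

```
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst --password secret \
  --table orders \
  --target-dir /data/orders
```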

28
Q

How to create a directory in HDFS?

A

Talk to the NameNode: creating a directory only changes the namespace (metadata), so no DataNodes are contacted and no blocks are allocated until files are written into it.
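A minimal sketch with the Java API (the address and path are placeholders); mkdirs() is a pure metadata call to the NameNode:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMkdirSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address
        try (FileSystem fs = FileSystem.get(conf)) {
            // Only the NameNode's namespace is updated; no DataNodes are contacted.
            boolean created = fs.mkdirs(new Path("/user/alice/new-dir"));
            System.out.println("created: " + created);
        }
    }
}
```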

29
Q

Is DynamoDB consistent in the CAP sense?

A

No, it is not consistent in time: it reconciles concurrent versions with vector clocks. It is AP (available and partition-tolerant), though.