Big Data Lecture 04 Distributed File Systems Flashcards

1
Q

What are two different types of data by source?

A

- Raw data: sensors, events, logs, ...
- Derived data: aggregated data, intermediate data, ...

2
Q

What are the two types of big data, and which is handled by HDFS and which by cloud storage?

A

- A huge number (billions) of large (<TB) files, handled by cloud storage such as S3 or Azure,
- or a large number (millions) of huge (<PB) files, handled by HDFS.

3
Q

How are files organized in HDFS?

A

- A file structure,
- underneath which is block storage.

4
Q

Why do we require fault tolerance in big file systems?

A

With a large number of nodes, individual node failures are statistically guaranteed to happen from time to time, so the system must keep working despite them.

5
Q

What is the file reading and updating model in DFS?

A

- Reading: we scan the file sequentially; random access is not needed.
- Updating: we append to the file, possibly from many concurrent clients, and appends must be atomic.

6
Q

What are the performance requirements of a DFS?

A

We optimize for throughput, not latency.

That is, we do not want to spend time seeking and waiting before data goes onto the network; we want to be limited by the network speed. That is why we prefer bigger blocks and fetch whole blocks at a time.

7
Q

How do we bridge the gap between capacity and throughput, and between throughput and latency?

A

- Parallelization (capacity vs. throughput),
- batch processing (throughput vs. latency).

8
Q

What is the block size in HDFS?

A

64 MB in older Hadoop versions, 128 MB by default in Hadoop 2 and later.

9
Q

What is the network architecture of HDFS?

A

One coordinator node (the NameNode); the rest are worker nodes (DataNodes), which can still talk to each other (e.g. to replicate blocks).

10
Q

How are files treated in HDFS?

A

They are split into blocks, which are replicated over worker nodes.

11
Q

What does the coordinator node hold in HDFS?

A
1. The file namespace (the directory structure of the file system),
2. the file-to-block mapping,
3. the block locations (which DataNodes hold which blocks).

12
Q

What if a file's size is not an exact multiple of 128 MB?

A

The final block only takes up the space that is actually needed, not a full block. For example, a 200 MB file is stored as one 128 MB block plus one 72 MB block, not as two full 128 MB blocks.

13
Q

How does the NameNode talk to the client?

A

The client sends requests, and the NameNode replies with the block IDs and the DataNode locations holding them. The client then uses the HDFS Java API, which ties this together and downloads the actual data directly from the DataNodes.
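A minimal sketch of this flow using the Hadoop Java API (the NameNode address and file path below are placeholders, not from the lecture):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (placeholder host and port).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf);
             // open() asks the NameNode for block IDs and DataNode locations;
             // the returned stream then reads the blocks from the DataNodes.
             FSDataInputStream in = fs.open(new Path("/data/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```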

14
Q

How do the NameNode and the DataNodes communicate?

A

Each DataNode sends a heartbeat every 3 s, saying it is alive and reporting any newly received blocks; in response, the NameNode sends commands back.

Every 6 hours each DataNode also sends a block report describing all the blocks it stores.

15
Q

What does downloading a file over HDFS look like?

A

The client streams blocks from different DataNodes at the same time and then reassembles them into the file.

16
Q

How does a user add a file to HDFS?

A

The client first talks to the NameNode, which tells it, per block, which DataNodes to upload to. The client then streams the data in 64 kB packets through the pipeline of DataNodes; for each packet it receives an acknowledgement once it has been written and replicated on all the target nodes.

Finally, the acknowledgement and the release of the lock come from the NameNode.
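A minimal sketch of this write path with the Hadoop Java API (the NameNode address and path are placeholders): create() obtains the target DataNodes from the NameNode, and closing the stream completes the file.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode where each block should go; the stream
             // then pipelines the data in small packets to those DataNodes.
             FSDataOutputStream out = fs.create(new Path("/data/new-file.txt"))) {
            out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
        } // closing the stream flushes the last packet and completes the file at the NameNode
    }
}
```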

17
Q

How are replicas placed in HDFS?

A
1. First replica: the same node as the client (cheap replication, no network hop),
2. second replica: a node in a different rack B,
3. third replica: another node in the same rack B.
Further replicas: random racks, but if possible at most one replica per node and at most two replicas per rack.

18
Q

What is the weak point of HDFS?

A

The NameNode: there is only a single one, and if it fails, the whole file system becomes unavailable.

19
Q

What data is backed up from the NameNode, and how?

A

The directory tree and the file-to-block mapping are saved.

We do not save which DataNodes hold which blocks, because that can easily be recovered from the block reports sent every 6 hours.

The state is stored (on a separate drive, or in the cloud) in a namespace file, on top of which we keep an edit log of recent changes; the edit log is merged into the namespace file from time to time.

20
Q

How long does it take to restart the NameNode and why?

A

We need to replay the edit log on top of the namespace file, which takes around 30 minutes.

21
Q

How can we achieve High Availability (HA)?

A

We need to make the NameNode redundant. This can be done in multiple ways:
- Standby NameNodes,
- Observer NameNodes,
- federated DFS.

22
Q

What are Standby NameNodes? What is Quorum Journal Manager?

A

A Standby NameNode is a running duplicate of the NameNode that tails the edit log and copies it; it also receives block reports from the DataNodes.

With the Quorum Journal Manager, the edit log is replicated over several journal nodes, and as long as a majority of them agree, the system can be kept alive and running.

23
Q

What is Observer NameNode?

A

A Standby NameNode that, in addition, can serve read requests, directing clients to the data on the DataNodes.

24
Q

What is Federated DFS?

A

Different parts of the directory tree are handled by different NameNodes; additionally, the NameNodes can duplicate each other's state, so one can act as a standby for another.

25
Q

How to use HDFS locally?

A

From the command line, one can use

hadoop fs <args>

which works with basically any storage system, given the correct scheme (the local file system via the file scheme, or S3 and HDFS).

As arguments, we can give commands such as -ls, -cat, -rm, -mkdir, -copyFromLocal, -copyToLocal.
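A few example invocations (all paths here are made up for illustration):

```
# list a directory in HDFS
hadoop fs -ls /user/alice

# copy a local file into HDFS and back
hadoop fs -copyFromLocal data.csv /user/alice/data.csv
hadoop fs -copyToLocal /user/alice/data.csv data_copy.csv

# print a file and then remove it
hadoop fs -cat /user/alice/data.csv
hadoop fs -rm /user/alice/data.csv

# the same tool works on the local file system via a file:// URI
hadoop fs -ls file:///tmp
```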

26
Q

What is Apache Flume?

A

A tool that collects, aggregates, and moves log data into HDFS.

27
Q

What is Apache Sqoop?

A

A tool that imports data from relational databases into HDFS.
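A typical invocation looks roughly like this (the JDBC connection string, credentials, table, and target directory are placeholders):

```
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst --password secret \
  --table orders \
  --target-dir /data/orders
```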

28
Q

How to create a directory in HDFS?

A

Talk to the NameNode: creating a directory only changes the namespace (metadata), so no DataNodes are contacted and no blocks are allocated until files are written into it.
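A minimal sketch with the Java API (the address and path are placeholders); mkdirs() is a pure metadata call to the NameNode:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMkdirSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder address
        try (FileSystem fs = FileSystem.get(conf)) {
            // Only the NameNode's namespace is updated; no DataNodes are contacted.
            boolean created = fs.mkdirs(new Path("/user/alice/new-dir"));
            System.out.println("created: " + created);
        }
    }
}
```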

29
Q

Is DynamoDB consistent in the CAP sense?

A

No, it is not consistent in time: it reconciles concurrent versions with vector clocks. It is AP (available and partition-tolerant), though.