Big Data Lecture 04 Distributed File Systems Flashcards
What are two different types of data by source?
<ul><li>Raw data: sensors, events, logs...</li><li>Derived data: aggregated data, intermediate data...</li></ul>
What are the two types of big data, and which is handled by HDFS and which by cloud storage?
<ul><li>A huge number (billions) of large (&lt;TB) files, handled by cloud object storage such as S3 or Azure,</li><li>or a large number (millions) of huge (&lt;PB) files, handled by HDFS.</li></ul>
How are files organized in HDFS?
<ul><li>A file hierarchy (directory structure),</li><li>underneath which is block storage.</li></ul>
Why do we require fault tolerance in big file systems?
With a large number of nodes, some node or disk is statistically guaranteed to fail from time to time, so the system must keep working through failures.
What is the file reading and updating model in DFS?
<ul><li>Reading: we scan the file sequentially; we do not need random access,</li><li>update: we only append to the file, possibly from many different clients, and each append must be atomic.</li></ul>
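The read/append model can be sketched as a toy in-process analogue (not real DFS client code): reading is a sequential scan, and concurrent appends are made atomic with a lock.

```python
# Toy sketch of the DFS update model: clients only append, never
# overwrite, and each append is atomic even with many concurrent
# writers. A lock stands in for the DFS's append coordination.
import threading

class AppendOnlyLog:
    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record):
        with self._lock:          # one writer at a time -> atomic append
            self._records.append(record)

    def scan(self):
        """Reading is a sequential scan, not random access."""
        return list(self._records)

log = AppendOnlyLog()
threads = [threading.Thread(target=log.append, args=(f"event-{i}",))
           for i in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(len(log.scan()))  # 100 -- no appends lost
```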
What are the performance requirements of a DFS?
We want the bottleneck to be throughput, not latency.<br></br><br></br>That is, we do not want to spend most of the time issuing requests and waiting for responses; we want to be limited only by network speed. That is why we prefer bigger blocks: fewer requests, so the per-request latency is amortized.
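The latency-amortization argument can be made concrete with a back-of-the-envelope calculation. The numbers below (10 ms per request, 100 MB/s bandwidth, a 1 GB file) are illustrative assumptions, not from the lecture:

```python
# Why big blocks: the fixed per-request latency is amortized over
# more bytes. All numbers are assumed for illustration.
LATENCY_S = 0.010        # fixed cost per request (assumed)
BANDWIDTH_MBPS = 100     # MB transferred per second (assumed)
FILE_MB = 1024           # total file size

def transfer_time(block_mb):
    """Total seconds to fetch the file in block_mb-sized requests."""
    requests = FILE_MB // block_mb
    return requests * LATENCY_S + FILE_MB / BANDWIDTH_MBPS

print(f"1 MB blocks:   {transfer_time(1):.2f} s")    # 1024 requests
print(f"128 MB blocks: {transfer_time(128):.2f} s")  # 8 requests
```

With 1 MB blocks the 1024 request latencies add 10.24 s on top of the 10.24 s of pure transfer; with 128 MB blocks the latency overhead shrinks to 0.08 s.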
How do we bridge the discrepancy between capacity and throughput, and between throughput and latency?
<ul><li><span>Parallelization,</span></li><li><span>batch processing.</span></li></ul>
What is the size of the block in HDFS?
64/128 MB
What is the network architecture of HDFS?
One coordinator node (the NameNode); the rest are worker nodes (DataNodes). The workers can still talk to each other.
How are files treated in HDFS?
They are split into blocks, which are replicated over worker nodes.
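Block splitting and replication can be sketched as follows. The round-robin placement and the worker names are made up for illustration; real HDFS placement also considers rack topology.

```python
# Toy sketch of block replication: each 128 MB block of a file is
# copied to REPLICATION distinct worker nodes. Round-robin placement
# is a simplification; HDFS is also rack-aware.
BLOCK_MB = 128
REPLICATION = 3
WORKERS = ["worker-0", "worker-1", "worker-2", "worker-3", "worker-4"]

def place_blocks(file_mb):
    n_blocks = -(-file_mb // BLOCK_MB)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        # three distinct workers per block
        placement[b] = [WORKERS[(b + r) % len(WORKERS)]
                        for r in range(REPLICATION)]
    return placement

for block, nodes in place_blocks(300).items():
    print(block, nodes)
```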
What does coordinator node hold in HDFS?
1. File namespace (the directory structure of the file system),<br></br>2. file-to-block mapping,<br></br>3. block locations.
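The three metadata structures can be sketched as plain dictionaries. The paths, block IDs, and node names below are made up for illustration:

```python
# The three metadata structures the coordinator (NameNode) keeps,
# sketched as dicts. All identifiers are hypothetical.

# 1. File namespace: the directory tree.
namespace = {"/": ["logs"], "/logs": ["events.log"]}

# 2. File -> block mapping: which block IDs make up each file.
file_to_blocks = {"/logs/events.log": ["blk_001", "blk_002"]}

# 3. Block locations: which DataNodes hold a replica of each block.
block_locations = {
    "blk_001": ["datanode-1", "datanode-2", "datanode-3"],
    "blk_002": ["datanode-2", "datanode-3", "datanode-4"],
}

# Resolving a read: path -> blocks -> locations.
for blk in file_to_blocks["/logs/events.log"]:
    print(blk, "->", block_locations[blk])
```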
What if a file is not exactly a multiple of 128 MB sized?
The final block only occupies the space that is actually needed; it is not padded to a full 128 MB.
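The arithmetic for a file that is not an exact multiple of the block size, as a small sketch:

```python
# A file that is not a multiple of the block size: the last block
# only holds the remaining bytes (sizes in MB, for illustration).
BLOCK_MB = 128

def block_sizes(file_mb):
    full, rest = divmod(file_mb, BLOCK_MB)
    return [BLOCK_MB] * full + ([rest] if rest else [])

print(block_sizes(300))  # [128, 128, 44]
print(block_sizes(256))  # [128, 128] -- exact multiple, no tail block
```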
How does the NameNode talk to the client?
The client sends a request, and the NameNode replies with the block IDs and the locations of the DataNodes that hold them. The client then uses the HDFS Java API, which takes care of actually downloading the data from the DataNodes.
How does the NameNode talk to a DataNode?
The DataNode sends a heartbeat every 3 s, saying it is alive and reporting any newly received blocks; the NameNode sends commands back in its replies.<br></br><br></br>Every 6 hours the DataNode also sends a block report covering its whole storage.
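The heartbeat mechanism can be sketched with simulated time. The 3 s heartbeat and 6 h block report intervals come from the lecture; the 30 s dead-node timeout is an assumption for the sketch:

```python
# Toy sketch of the heartbeat protocol: DataNodes report every 3 s;
# the NameNode marks a node dead after prolonged silence. Time is
# simulated with a loop; no real networking.
HEARTBEAT_S = 3
BLOCK_REPORT_S = 6 * 3600   # full block report interval (not exercised here)
DEAD_AFTER_S = 30           # assumed timeout, not from the lecture

last_heartbeat = {"datanode-1": 0, "datanode-2": 0}

def tick(now, reporting_nodes):
    """Record heartbeats at time `now`; return nodes presumed dead."""
    for node in reporting_nodes:
        last_heartbeat[node] = now
    return [n for n, t in last_heartbeat.items() if now - t > DEAD_AFTER_S]

# datanode-2 stops reporting after t=0.
dead = []
for now in range(0, 60, HEARTBEAT_S):
    dead = tick(now, ["datanode-1"])
print(dead)  # ['datanode-2']
```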
What does downloading a file over HDFS look like?
The client streams blocks from different DataNodes at the same time, then reassembles the blocks in order.
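The parallel-fetch-then-reassemble pattern can be sketched with a thread pool. `fetch_block` is a stub standing in for a network call to a DataNode; the payloads and node names are invented:

```python
# Sketch of a client-side read: fetch blocks from different nodes in
# parallel, then reassemble them in block order.
from concurrent.futures import ThreadPoolExecutor

# block id -> (holding node, payload); stand-in for real block data
blocks = {0: ("worker-1", b"Hello, "),
          1: ("worker-2", b"HDFS"),
          2: ("worker-3", b"!")}

def fetch_block(block_id):
    node, payload = blocks[block_id]
    return block_id, payload      # pretend this streamed from `node`

with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(fetch_block, blocks))

# Reassemble in block order, regardless of arrival order.
data = b"".join(results[i] for i in sorted(results))
print(data.decode())  # Hello, HDFS!
```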