HDFS & RDD Flashcards
What are distributed filesystems and why do we need them?
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems.
Name the three (plus two) V’s of big data.
High:
- Volume: The quantity of generated and stored data.
- Velocity: The speed at which data is generated and processed.
- Variety: The type and nature of the data.
- Variability: Inconsistency of the dataset.
- Veracity: The quality of captured data.
Name 5 desirable aspects of a big data storage layer
- Scalability: Handle the ever-increasing data sizes
- Simplicity: Hide complexity from the developers
- Efficiency: Fast access to data
- Fault-tolerance: Failures do not lead to loss of data
- Hot-data access: Fast access from RAM for hot data
What is meant by a block in the context of HDFS? What are the benefits?
Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage (e.g. a 1MB file stored with a block size of 128MB uses 1MB of disk space, not 128MB). The block size in HDFS is typically 128MB.
The main benefits are:
1. A file can be larger than any single disk in the cluster, because it is split into multiple smaller pieces that can be stored and processed separately.
2. Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
3. Blocks fit well with replication for providing fault tolerance and availability.
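To make the block arithmetic concrete, here is a minimal sketch of how a file size maps to block sizes. The function name split_into_blocks is hypothetical and the 128MB block size is an assumption taken from the typical value above; this is a toy calculation, not HDFS code.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size, matching the typical value above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block needed to store a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only takes as much space as the data it holds.
        sizes.append(remainder)
    return sizes


small = split_into_blocks(1 * MB)    # a file smaller than one block
large = split_into_blocks(300 * MB)  # a file spanning several blocks

print(len(small), sum(small) // MB)          # 1 1
print(len(large), [s // MB for s in large])  # 3 [128, 128, 44]
```

The 1MB file takes one block but only 1MB of storage, and the 300MB file becomes two full blocks plus a 44MB block.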
What are namenodes and datanodes? What are the differences?
An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
- The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located.
- Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
Give a step-by-step description of how HDFS stores data in a distributed manner. What happens if a datanode fails?
- Hadoop partitions a file into blocks (of 128MB for example)
- Each block is replicated to multiple datanodes
- If a datanode fails, the namenode knows which blocks were stored on that datanode and can have new replicas of those blocks created on other datanodes.
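The steps above can be sketched with a small in-memory model. Everything here is hypothetical: the class ToyNamenode, its method names, the replication factor of 3, and the "least-loaded datanode first" placement rule are assumptions for illustration, not how HDFS actually chooses datanodes.

```python
class ToyNamenode:
    """Hypothetical in-memory model of the namenode's block map."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("not enough datanodes for the replication factor")
        self.replication = replication
        # datanode name -> set of block ids stored on that datanode
        self.blocks_on = {name: set() for name in datanodes}

    def _least_loaded(self, exclude, count):
        # Assumed placement rule: prefer datanodes holding the fewest blocks.
        candidates = [n for n in self.blocks_on if n not in exclude]
        candidates.sort(key=lambda n: (len(self.blocks_on[n]), n))
        return candidates[:max(0, count)]

    def store_file(self, filename, num_blocks):
        # Steps 1 and 2: split into blocks, then replicate each block.
        for i in range(num_blocks):
            block = f"{filename}#blk{i}"
            for node in self._least_loaded(set(), self.replication):
                self.blocks_on[node].add(block)

    def locations(self, block):
        return sorted(n for n, blocks in self.blocks_on.items() if block in blocks)

    def handle_failure(self, failed):
        # Step 3: look up what the failed node held and re-replicate it.
        lost = self.blocks_on.pop(failed)
        for block in sorted(lost):
            holders = set(self.locations(block))
            missing = self.replication - len(holders)
            for node in self._least_loaded(holders, missing):
                self.blocks_on[node].add(block)
        return lost


nn = ToyNamenode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.store_file("data.csv", num_blocks=2)
nn.handle_failure("dn1")
for block in ("data.csv#blk0", "data.csv#blk1"):
    print(block, nn.locations(block))
```

After dn1 fails, every block is back at three replicas on the surviving datanodes, so no data is lost.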
How does HDFS incorporate scalability, simplicity, efficiency and fault tolerance (if at all)?
- Scalability: Add more nodes
- Simplicity: Developers do not need to know where blocks are stored
- Efficiency: HDFS is comparatively slow, because every read and write goes through disk (HDD) I/O.
- Fault-tolerance: Data is replicated over multiple datanodes. If one fails, data is not lost.
How do Spark’s RDDs solve the shortcomings of HDFS?
Spark’s RDDs add an intermediate storage layer, RAM, between processing and the hard disk. ‘Hot’ data, meaning data that is recent or about to be processed, is kept in RAM until RAM is full. Data that does not fit in RAM is kept on disk. This is similar to the caching mechanisms used in other contexts, such as web browsing.
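The caching idea can be illustrated with a small two-tier store. This is a toy sketch, not Spark code: the class TwoTierStore, its method names, and the least-recently-used eviction rule are assumptions chosen for illustration.

```python
from collections import OrderedDict


class TwoTierStore:
    """Hypothetical store: a small in-memory tier in front of a 'disk' tier."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()  # hot data, most recently used last
        self.disk = {}               # stands in for slow HDD storage
        self.disk_reads = 0          # counts how often we had to go to disk

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        self.disk.pop(key, None)     # drop any stale copy on disk
        self._spill_if_full()

    def get(self, key):
        if key in self.memory:
            # Fast path: the data is hot and served from memory.
            self.memory.move_to_end(key)
            return self.memory[key]
        # Slow path: read from disk and promote the data to memory.
        self.disk_reads += 1
        value = self.disk.pop(key)
        self.memory[key] = value
        self._spill_if_full()
        return value

    def _spill_if_full(self):
        # Assumed policy: move the least recently used entries to disk.
        while len(self.memory) > self.memory_capacity:
            cold_key, cold_value = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_value


store = TwoTierStore(memory_capacity=2)
for key in ("a", "b", "c"):
    store.put(key, key.upper())
print(sorted(store.memory), sorted(store.disk))  # ['b', 'c'] ['a']

store.get("a")           # "a" is cold, so this read goes to disk
print(store.disk_reads)  # 1
```

Only the read of the cold key touches the disk tier; reads of hot keys are served from memory.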
What are Spark RDDs? What are the 5 core properties of them?
RDD stands for Resilient Distributed Dataset. Like HDFS, RDDs are distributed and fault tolerant, and their elements can be operated on in parallel.
RDDs can be created either by loading data from stable storage, for example from HDFS, or by transforming existing RDDs.
5 properties:
- Immutable: read only, once created, they cannot be changed
- Distributed: The data is partitioned and can be operated on in parallel.
- Lazily evaluated: Transformations on an RDD only define the processing steps. The chain of steps is executed when an action (for example collect or count) asks for a result.
- Cacheable: RDDs can be kept in memory across operations by calling cache() or persist(). They are not persisted by default, and cache() uses the memory-only storage level.
- Replicated: Partitions can be replicated across multiple machines (for example through the replicated HDFS blocks they are loaded from), which, together with lineage information, makes them fault-tolerant.
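Immutability and lazy evaluation can be shown with a toy class. LazyDataset is a hypothetical stand-in written in plain Python, not Spark's RDD API; it only borrows the method names map, filter and collect.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: immutable and lazily evaluated."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # never modified after creation
        self._steps = tuple(steps)    # recorded transformations, not yet run

    def map(self, fn):
        # Returns a new dataset; the original is left untouched.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # The "action": only now is the recorded chain of steps executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # record every time the function really runs
    return x * x


numbers = LazyDataset(range(1, 6))
odd_squares = numbers.map(square).filter(lambda x: x % 2 == 1)

print(len(calls))             # 0, nothing has been computed yet
print(odd_squares.collect())  # [1, 9, 25]
print(len(calls))             # 5, the work happened inside collect()
```

Defining the chain costs nothing; the squaring function only runs once collect() is called, and numbers itself is never changed.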
Explain the types of content that RDDs can have and how RDDs can be restored if data is lost.
- Details about the data, such as the location of the data or the data itself.
- Lineage information: How this RDD was created, which other RDDs it depends on, and which transformations were applied. If we lose data, we can recompute it by replaying the lineage.
Example of lineage information:
RDD2 = RDD1.filter(…)
RDD3 = RDD2.map(…)
Now we know how RDD3 was created.
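Recovery through lineage can be sketched in plain Python. ToyRDD and its methods are hypothetical, and the filter and map functions below (keep even numbers, multiply by 10) are made up for illustration; they are not the transformations elided in the example above.

```python
class ToyRDD:
    """Hypothetical RDD model: keeps its lineage and may lose its data."""

    def __init__(self, data=None, parent=None, op=None, name=""):
        self.name = name
        self.parent = parent  # the RDD this one was derived from
        self.op = op          # function that turns the parent's data into ours
        self._data = data     # materialized data, or None if missing

    def filter(self, pred, name=""):
        return ToyRDD(parent=self, name=name,
                      op=lambda rows: [r for r in rows if pred(r)])

    def map(self, fn, name=""):
        return ToyRDD(parent=self, name=name,
                      op=lambda rows: [fn(r) for r in rows])

    def lineage(self):
        # Walk back through the parents to list how this RDD was built.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain[::-1]

    def compute(self):
        if self._data is None:
            if self.parent is None:
                raise RuntimeError("source data lost and no lineage to rebuild it")
            # Rebuild from the parent, which may itself need rebuilding.
            self._data = self.op(self.parent.compute())
        return self._data

    def lose_data(self):
        self._data = None  # simulate losing the materialized data


rdd1 = ToyRDD(data=[1, 2, 3, 4, 5, 6], name="RDD1")
rdd2 = rdd1.filter(lambda x: x % 2 == 0, name="RDD2")
rdd3 = rdd2.map(lambda x: x * 10, name="RDD3")

first = rdd3.compute()
rdd2.lose_data()
rdd3.lose_data()
rebuilt = rdd3.compute()  # recomputed from RDD1 via the recorded lineage

print(rdd3.lineage())     # ['RDD1', 'RDD2', 'RDD3']
print(first == rebuilt)   # True
```

Because each derived RDD remembers its parent and the operation applied, losing the data of RDD2 and RDD3 is harmless as long as RDD1 can still be read.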