Storage: DFS Flashcards
Distributed File System
abstraction for data stored across multiple machines to appear as a unified storage system
Goals of DFS
- hide complexity (abstraction)
- split large files into blocks (big data support)
- fault tolerance
Hadoop Distributed File System
One of the most widely used DFSs. It runs on clusters of commodity hardware, handles up to petabytes of data, and is designed for high-throughput batch processing rather than low-latency access
Blocks
smallest units of storage that can be read or written; the default size is 128 MB (64 MB in older Hadoop versions)
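As a worked example (a minimal Python sketch using the 128 MB default), the number of blocks a file occupies is just the ceiling of its size over the block size:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 GB file occupies 8 blocks; a 1 KB file still occupies 1 block.
print(num_blocks(1024 * 1024 * 1024))  # 8
print(num_blocks(1024))                # 1
```

Note a small file does not waste a full block of disk, but it still costs one full metadata entry on the namenode.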
How does HDFS ensure data consistency?
write-once, read-many model: once written, files cannot be modified (only appended to), which sidesteps concurrent-write conflicts
Fault tolerance
hardware failure is the norm; each block is replicated with a default factor of three so data remains accessible if one machine fails
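A toy sketch of why three replicas tolerate a failure (real HDFS placement is rack-aware; this just samples nodes uniformly, and the node names are made up):

```python
import random

REPLICATION = 3  # HDFS default replication factor

def place_replicas(nodes: list[str]) -> set[str]:
    """Place replicas of one block on three distinct datanodes.
    (Simplified: real HDFS also spreads replicas across racks.)"""
    return set(random.sample(nodes, REPLICATION))

nodes = [f"dn{i}" for i in range(10)]
replicas = place_replicas(nodes)

failed = replicas.pop()   # any single datanode dies
survivors = replicas      # two readable copies remain
print(len(survivors))     # 2
```

Because replicas always land on distinct nodes, no single machine failure can take out all copies of a block.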
Namenode (NN)
The master node in the master-slave architecture; stores metadata about the location of each block and controls client access to files
Datanodes (DN)
The slaves, because they store the actual data blocks and serve reads and writes. They send periodic "heartbeats" (and block reports) to the master
What is the catch with master-slave architecture?
Single Point of Failure (SPOF): there is only one namenode, which maintains the entire filesystem tree
High Availability
a system that can tolerate faults; achieved by configuring two separate machines as NNs (one in active state, one in standby state) with shared storage
another reason it's beneficial to configure additional namenodes
the number of blocks is limited by the NN's memory size, since NNs store all block metadata in memory
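The memory limit can be sketched with back-of-the-envelope arithmetic (the 150 bytes per block is a commonly cited approximation, used here only for illustration):

```python
# Rule of thumb often cited for HDFS: each block object consumes roughly
# 150 bytes of namenode heap (an approximation, assumed for this sketch).
BYTES_PER_BLOCK_META = 150
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size

def max_addressable_bytes(nn_heap_bytes: int) -> int:
    """Approximate amount of file data a single namenode heap can track."""
    max_blocks = nn_heap_bytes // BYTES_PER_BLOCK_META
    return max_blocks * BLOCK_SIZE

# A 64 GB heap can track ~458 million blocks, i.e. tens of petabytes.
print(max_addressable_bytes(64 * 1024**3) / 1024**5, "PiB")
```

This is also why HDFS prefers a few huge files over millions of tiny ones: every block costs NN memory regardless of how full it is.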
Erasure Coding
A way to store less redundant data: files are split into smaller "data cells," grouped into stripes, and stored along with "parity cells" that serve as backup pieces for recovery. If you lose some data cells, you can still rebuild them from the parity cells. Data cells are, of course, spread across nodes.
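The recovery idea can be sketched with the simplest possible code: a single XOR parity cell (HDFS actually uses Reed-Solomon, which tolerates multiple losses, but XOR parity is the same principle in miniature):

```python
from functools import reduce

def make_parity(cells: list[bytes]) -> bytes:
    """One parity cell as the bytewise XOR of all equal-sized cells."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), cells)

def recover(surviving_cells: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing cell: XOR the parity with all survivors."""
    return make_parity(surviving_cells + [parity])

data = [b"cell-one", b"cell-two", b"cellthre"]  # equal-sized data cells
parity = make_parity(data)

lost = data[1]                                  # pretend that node failed
rebuilt = recover([data[0], data[2]], parity)
print(rebuilt == lost)  # True
```

With Reed-Solomon, e.g. 6 data + 3 parity cells, any 3 of the 9 cells can be lost and the stripe is still recoverable.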
pros of erasure coding (2)
reduces storage overhead from 200% (3x replication) to 50% (e.g., with the RS(6,3) policy)
faster writes (stripes are distributed, smaller than full replication)
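The 200% vs. 50% figures follow directly from the ratio of extra bytes to data bytes (RS(6,3) being a common HDFS erasure-coding policy):

```python
def overhead_pct(extra_units: int, data_units: int) -> float:
    """Storage overhead: redundant bytes as a percentage of data bytes."""
    return 100 * extra_units / data_units

# 3x replication: 2 extra copies per original copy -> 200% overhead
print(overhead_pct(2, 1))  # 200.0
# RS(6,3): 3 parity cells per 6 data cells -> 50% overhead
print(overhead_pct(3, 6))  # 50.0
```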
cons of erasure coding (3)
higher cpu cost for reads/writes
longer recovery time in case of failure
loss of data locality (too much splitting!!)
where does erasure coding work best (type of dataset)
datasets with low I/O activity (cold data, not hot or interactive)