Storage: DFS Flashcards
Distributed File System
An abstraction that makes data stored across multiple machines appear as a single, unified storage system
Goals of DFS
- Split files
- Hide complexity
- Provide fault tolerance
Hadoop Distributed File System (4)
- Most widely used DFS
- Clusters of commodity hardware
- Handling up to petabytes of data
- Designed for high-throughput batch processing (not low-latency access)
Blocks
Smallest unit of storage that can be read or written
- default size: 64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x and later
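A minimal sketch of how a file maps onto blocks, assuming a 128 MB block size. The function name `split_into_blocks` is hypothetical and only illustrates the arithmetic; it is not part of any Hadoop API.

```python
BLOCK_SIZE_MB = 128  # assumed block size for this illustration


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover data, not a full block.
        sizes.append(remainder)
    return sizes


# A 300 MB file needs two full blocks plus one partial block.
blocks = split_into_blocks(300)
print(len(blocks), blocks)
```

Large blocks keep the number of blocks per file small, which suits the high-throughput batch workloads HDFS is designed for.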
How does HDFS ensure data consistency?
Write-once, Read-many model
Hadoop default Fault-tolerance
Each block is replicated with a replication factor of three, so data remains accessible if a machine fails (hardware failure is the NORM!)
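A toy sketch of why three replicas keep data readable after a failure. The round-robin placement, the node names, and the helpers `place_replicas` and `readable_blocks` are assumptions made for illustration; real HDFS uses its own placement policy.

```python
REPLICATION_FACTOR = 3


def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names
placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes)

# One failed machine: every block still has a live replica.
print(readable_blocks(placement, failed_nodes={"dn2"}))
```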
Namenode (NN) (2)
Master in master-slave architecture
1. Store metadata about the location of specific blocks.
2. Control client access to data
Datanodes (DN)
The slaves: they store and process the actual data, and send periodic “heartbeats” to update the master
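A minimal sketch of master-side heartbeat tracking. The class `HeartbeatMonitor`, the 30-second timeout, and the explicit `now` timestamps are assumptions chosen to keep the example deterministic; they do not reflect HDFS's actual intervals or classes.

```python
class HeartbeatMonitor:
    """Toy master-side tracker of the last heartbeat seen from each datanode."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node id -> timestamp of the latest heartbeat

    def heartbeat(self, node_id, now):
        """Record a heartbeat from `node_id` at time `now` (seconds)."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Nodes whose latest heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30)  # illustrative timeout
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)  # dn1 keeps reporting, dn2 goes silent

print(monitor.dead_nodes(now=40))
```

Once a node is considered dead, the master knows which block replicas were lost and can have them copied elsewhere.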
The big issue with Master-Slave architecture
Single Point of Failure (SPOF): there is only one namenode that maintains the filesystem tree
High Availability and how it’s achieved with Master-Slave Architecture
A system that can tolerate faults
Two separate machines configured as NNs:
- 1 in Active state
- 1 in Standby state
Besides HA, another reason it’s beneficial to configure additional name nodes
The number of blocks in the system is limited by the RAM of the NN, since it stores metadata about blocks in memory
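A rough back-of-the-envelope sketch of this limit. The figure of about 150 bytes of namenode memory per block is a commonly quoted rule of thumb, used here purely as an assumption. The helper names are hypothetical, and the scaling with several namenodes assumes each one manages its own share of the blocks.

```python
BYTES_PER_BLOCK_OBJECT = 150  # assumed rule of thumb, not a measured value


def max_blocks(heap_bytes, bytes_per_block=BYTES_PER_BLOCK_OBJECT):
    """Approximate number of blocks one namenode can track in memory."""
    return heap_bytes // bytes_per_block


def cluster_max_blocks(heap_bytes, namenodes):
    """Assumes each namenode holds metadata for a separate share of the blocks."""
    return namenodes * max_blocks(heap_bytes)


heap = 64 * 1024**3  # hypothetical 64 GiB of namenode memory
print(max_blocks(heap))
print(cluster_max_blocks(heap, namenodes=3))
```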
Erasure Coding
A way to store less redundant data by splitting files into small data cells, grouped into “stripes” together with parity cells that act as backup pieces to help recover data.
If you lose some data cells, you can still rebuild them thanks to the parity cells. (Data cells are, of course, split across nodes.)
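A minimal sketch of rebuilding a lost data cell from parity. It uses a single XOR parity cell, which is a deliberately simplified stand-in: it can recover only one lost cell per stripe, whereas production schemes such as Reed-Solomon tolerate several. The cell contents and helper names are made up for illustration.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(cells):
    """Parity cell = XOR of all data cells in the stripe."""
    return reduce(xor_bytes, cells)


def recover(cells, parity, lost_index):
    """Rebuild the cell at `lost_index` from the surviving cells and the parity."""
    survivors = [cell for i, cell in enumerate(cells) if i != lost_index]
    return reduce(xor_bytes, survivors, parity)


data_cells = [b"AAAA", b"BBBB", b"CCCC"]  # one stripe of equal-sized cells
parity = make_parity(data_cells)

# Pretend the node holding cell 1 failed, then rebuild it.
rebuilt = recover(data_cells, parity, lost_index=1)
print(rebuilt == data_cells[1])
```

The rebuild has to read every surviving cell in the stripe, which hints at why recovery is slower and more CPU-intensive than copying a replica.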
Pros of erasure coding (2)
- Reduce data redundancy from 200% to 50%
- Faster writes without replicating
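The 200% and 50% figures can be checked with simple arithmetic. The sketch below assumes the erasure-coded layout uses 6 data cells and 3 parity cells per stripe, which is one layout that yields 50%; `overhead_pct` is a hypothetical helper.

```python
def overhead_pct(data_units, redundant_units):
    """Extra storage used for redundancy, as a percentage of the original data."""
    return 100 * redundant_units / data_units


# 3x replication: 1 original copy plus 2 extra copies.
replication_overhead = overhead_pct(data_units=1, redundant_units=2)

# Assumed erasure-coded stripe: 6 data cells plus 3 parity cells.
ec_overhead = overhead_pct(data_units=6, redundant_units=3)

print(replication_overhead, ec_overhead)
```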
Cons of Erasure Coding (3)
- Higher CPU cost for reads and writes
- Longer recovery time in case of failure
- Loss of data locality (too much splitting!!)
For which types of datasets does erasure coding work best?
Those with low I/O activity (not HOT or interactive)