Storage: DFS Flashcards
Distributed File System
An abstraction that makes data stored across multiple machines appear as a single, unified storage system
Goals of DFS
- Split files across machines
- Hide complexity from the user
- Provide fault tolerance
Hadoop Distributed File System (4)
- Most widely used DFS
- Runs on clusters of commodity hardware
- Handles up to petabytes of data
- Designed for high-throughput batch processing (not low-latency access)
Blocks
Smallest unit of storage that can be read or written
- Default size: 64 MB (Hadoop 1) / 128 MB (Hadoop 2+)
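Not a real HDFS client, just the block idea in a minimal Python sketch (the file path is hypothetical; 128 MB matches the Hadoop 2+ default):

```python
# Sketch: how HDFS conceptually splits a file into fixed-size blocks.
# The path is illustrative; this is not the actual HDFS write path.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2+ default

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_index, chunk) pairs of at most block_size bytes."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(block_size):
            yield index, chunk
            index += 1

# A 300 MB file yields blocks 0 and 1 of 128 MB each, plus block 2 of 44 MB.
```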
How does HDFS ensure data consistency?
Write-once, read-many model: a file is not modified in place once written
Hadoop default fault tolerance
Blocks are replicated at a factor of three so that data remains accessible if one machine fails (hardware failure = the NORM!)
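A toy sketch of the idea (hypothetical datanode names; real HDFS placement is rack-aware, round-robin here is only for illustration):

```python
# Toy sketch of replication factor 3: each block lands on 3 of N datanodes.
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical cluster
REPLICATION = 3

def place_block(block_index):
    """Return the datanodes that hold replicas of this block."""
    n = len(DATANODES)
    return [DATANODES[(block_index + r) % n] for r in range(REPLICATION)]

print(place_block(0))  # ['dn1', 'dn2', 'dn3'] -> one node can fail, two copies remain
```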
Namenode (NN) (2)
Master in master-slave architecture
1. Stores metadata about the location of each block
2. Controls client access to data
Datanodes (DN)
The slaves, because they store and serve the actual data blocks. They send periodic “heartbeats” to update the master
The big issue with Master-Slave architecture
Single Point of Failure (SPOF): there is only one namenode that maintains the filesystem tree
High Availability and how it’s achieved with Master-Slave Architecture
A system that can keep operating in spite of faults. Achieved by running two separate machines as NNs:
- 1 in Active state
- 1 in Standby state (takes over if the active NN fails)
Besides HA, another reason it’s beneficial to configure additional namenodes
The number of blocks in the system is limited by the RAM of the NN, since it stores metadata about blocks in memory
Erasure Coding
A way to store less redundant data: data is split into smaller data cells, grouped into “stripes”, with parity cells as backup pieces to help recover data. If you lose some data cells, you can still rebuild them thanks to the parity cells. (Data cells are, of course, split across nodes.)
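A tiny illustration of the parity idea with a single XOR parity cell. HDFS actually uses Reed-Solomon codes (e.g. RS(6,3)), which tolerate multiple losses; XOR is just the simplest special case:

```python
# Intuition for parity cells: with one XOR parity cell, any single lost
# data cell can be rebuilt from the survivors plus the parity cell.
from functools import reduce

def xor_cells(cells):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*cells))

data_cells = [b"cell-A", b"cell-B", b"cell-C"]  # striped across nodes
parity = xor_cells(data_cells)                  # stored on another node

# Suppose cell-B is lost: XOR of the survivors and the parity rebuilds it.
rebuilt = xor_cells([data_cells[0], data_cells[2], parity])
assert rebuilt == b"cell-B"
```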
Pros of erasure coding (2)
- Reduces data redundancy from 200% to 50% (e.g., RS(6,3): 3 parity cells per 6 data cells = 1.5x stored, vs. 3x for replication)
- Faster writes, since blocks are not replicated
Cons of Erasure Coding (3)
- Higher CPU cost for reads and writes
- Longer recovery time in case of failure
- Loss of data locality (too much splitting!!)
For which types of datasets does erasure coding work best?
Those with low I/O activity (cold data, not HOT or interactive)
HDFS cons in general (2)
- High latency
- Not good if there are many small files: the number of files is limited by memory in the NN (rough math in the sketch below)
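A back-of-the-envelope for the small-files problem. The ~150 bytes of NN heap per namespace object is a commonly cited rule of thumb, not an exact figure:

```python
# Rough estimate of NameNode memory pressure from small files.
BYTES_PER_OBJECT = 150               # ~150 B of heap per file/block (rule of thumb)
small_files = 100_000_000            # 100 million small files
objects = small_files * 2            # roughly 1 file entry + 1 block each
heap_gb = objects * BYTES_PER_OBJECT / 1e9
print(f"~{heap_gb:.0f} GB of NameNode heap")  # ~30 GB just for metadata
```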
Standard data formats (3)
- (Structured) CSV files
- (Semi-structured) JSON files
- (Unstructured) textual files
3 benefits of big-data-specific file formats (BCS)
- Binary serialization
- Compression
- Splittability
Row-oriented file format
Stores data in rows, so it’s easy to read or write a full record of fields.
Good application for Row oriented format
OLTP
Updating a customer’s profile, adding a new order (access to all fields of the record)
Column-oriented file format
Stores data by columns, which makes compression easier because values of the same data type are stored together (type-specific encodings).
Easier to access specific fields of the data for aggregation (reduced I/O for analytical queries)
Good application for column oriented format
OLAP
Average salary of employees by department (get all values of one field with one query!)
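A small sketch contrasting the two layouts (records and fields are made up): the columnar layout answers the salary aggregation by touching one list instead of every record.

```python
# The same records in row-oriented vs column-oriented layout.
rows = [
    {"id": 1, "dept": "eng",   "salary": 90_000},
    {"id": 2, "dept": "sales", "salary": 70_000},
    {"id": 3, "dept": "eng",   "salary": 95_000},
]

# Column-oriented: one sequence per field (same type together -> compresses well).
columns = {
    "id":     [1, 2, 3],
    "dept":   ["eng", "sales", "eng"],
    "salary": [90_000, 70_000, 95_000],
}

record = rows[1]  # OLTP-style access: every field of one record at once

# OLAP-style aggregation only needs one column -> read just that field:
avg_salary = sum(columns["salary"]) / len(columns["salary"])
```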
Row-oriented file format examples (3)
SequenceFiles (key-value pairs, designed for MapReduce)
Apache Thrift (efficient cross-language communication between programs)
Apache Avro (schema evolution)
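A sketch of what “schema evolution” means in Avro, using the fastavro library (assumed installed: pip install fastavro); the User record and its fields are made up for illustration:

```python
# Avro schema evolution: a field with a default can be added to the reader
# schema, and files written with the old schema still resolve correctly.
import io
from fastavro import writer, reader, parse_schema

schema_v1 = parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "name", "type": "string"},
    ],
})
schema_v2 = parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int", "default": -1},  # new field, with default
    ],
})

buf = io.BytesIO()
writer(buf, schema_v1, [{"name": "ada"}])  # data written with the OLD schema
buf.seek(0)

# Old data read with the NEW schema: the missing field is filled from its default.
for user in reader(buf, reader_schema=schema_v2):
    print(user)  # {'name': 'ada', 'age': -1}
```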
Column-oriented file format examples (2)
ORC (first on Hadoop, designed for Hive - SQL on Hadoop)
Apache Parquet (schema evolution)
Parquet
Columnar storage format that stores nested data structures in a flat columnar layout, for efficient querying
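A sketch using pyarrow (assumed installed: pip install pyarrow); the table, file name, and nested address column are made up to show the two Parquet features on this card, nested data and column pruning:

```python
# Write a table with a nested column to Parquet, then read back one column.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "dept":    ["eng", "sales"],
    "salary":  [90_000, 70_000],
    "address": [{"city": "Oslo",   "zip": "0150"},   # nested structure,
                {"city": "Bergen", "zip": "5003"}],  # stored flat as columns
})
pq.write_table(table, "employees.parquet")

# Column pruning: only the 'salary' column is read from the file.
salaries = pq.read_table("employees.parquet", columns=["salary"])
print(salaries["salary"].to_pylist())  # [90000, 70000]
```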