HDFS Flashcards
WHAT IS HDFS
HDFS is a distributed file system that stores and serves large data sets on a cluster of commodity hardware. It is designed for high-throughput batch-processing applications, with a write-once-read-many access model that prioritizes streaming data access over low latency.
what's the architecture of HDFS
Master-slave architecture: the NameNode is the master (stores the metadata: the file system tree and the locations of the blocks); the DataNodes are the slaves, which actually store and retrieve the blocks.
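A minimal sketch of that split, as a toy in-memory model rather than the real Hadoop API. The class names, the round-robin placement, and the 4-byte block size are all assumptions chosen to keep the example short; the point is only that the NameNode holds metadata while the DataNodes hold the bytes.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose (assumption), real clusters use far larger blocks


class DataNode:
    """Slave: stores and retrieves raw blocks by block id."""

    def __init__(self):
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: keeps metadata only, never the block contents."""

    def __init__(self, num_datanodes):
        self.num_datanodes = num_datanodes
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> index of the DataNode holding it

    def allocate_block(self, path):
        ids = self.file_blocks.setdefault(path, [])
        block_id = f"{path}#{len(ids)}"
        # Round-robin placement is a simplification for this sketch.
        self.block_locations[block_id] = len(ids) % self.num_datanodes
        ids.append(block_id)
        return block_id, self.block_locations[block_id]

    def locate(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def client_write(namenode, datanodes, path, data):
    # The client asks the NameNode where to put each block, then sends bytes to a DataNode.
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id, node_index = namenode.allocate_block(path)
        datanodes[node_index].write_block(block_id, data[offset:offset + BLOCK_SIZE])


def client_read(namenode, datanodes, path):
    # The client gets block locations from the NameNode, then reads from the DataNodes.
    return b"".join(datanodes[i].read_block(b) for b, i in namenode.locate(path))


datanodes = [DataNode(), DataNode()]
namenode = NameNode(num_datanodes=len(datanodes))
client_write(namenode, datanodes, "/logs/a.txt", b"hello hdfs")
restored = client_read(namenode, datanodes, "/logs/a.txt")
```

If the NameNode object is lost, the blocks are still on the DataNodes but nothing can say which blocks form which file, which is why the NameNode matters so much for availability.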
what is SPOF and how to fix it
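SPOF = single point of failure. In HDFS the NameNode is the SPOF, because all metadata lives there.
1) High availability (active and standby NameNode)
2) Secondary NameNode: regularly merges the metadata image with the edit log into a checkpoint, which keeps the edit log small and speeds up NameNode restart (it is a checkpointing helper, not a hot standby)
3) Regular backups of the metadata to an external storage system, for recovery
4) QJM (Quorum Journal Manager) ensures consistent replication of edit logs across multiple JournalNodes.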
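The quorum idea behind QJM can be sketched as follows: an edit counts as committed only when a majority of the journal nodes acknowledge it. This is a toy model, not the Hadoop implementation; the function names and the list-based journals are assumptions for illustration.

```python
def quorum_size(num_journal_nodes):
    """Smallest majority of the journal nodes."""
    return num_journal_nodes // 2 + 1


def tolerated_failures(num_journal_nodes):
    """How many journal nodes can be down while a majority is still reachable."""
    return (num_journal_nodes - 1) // 2


def write_edit(edit, journals, alive):
    """Append the edit to every reachable journal; report whether a majority acknowledged."""
    acks = 0
    for journal, up in zip(journals, alive):
        if up:
            journal.append(edit)
            acks += 1
    # A real system must also clean up edits that reached only a minority;
    # that recovery step is left out of this sketch.
    return acks >= quorum_size(len(journals))


journals = [[], [], []]
committed = write_edit("mkdir /data", journals, alive=[True, True, False])
rejected = write_edit("rm /data", journals, alive=[True, False, False])
```

With 3 journal nodes the first write succeeds with one node down, while the second fails because only one node is reachable.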
3 features of HDFS
HA; Federation; Replication (block replicas and erasure coding)
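A rough way to compare the two redundancy options is raw storage used per byte of data. The sketch below assumes a replication factor of 3 and a Reed-Solomon layout with 6 data units and 3 parity units; both numbers are common choices but are assumptions here, and the function names are made up for the example.

```python
def replication_raw_bytes(data_bytes, replication_factor=3):
    """Every block is stored replication_factor times."""
    return data_bytes * replication_factor


def erasure_coded_raw_bytes(data_bytes, data_units=6, parity_units=3):
    """Each stripe of data_units data cells gets parity_units extra parity cells."""
    return data_bytes * (data_units + parity_units) / data_units


GB = 1024 ** 3
replicated = replication_raw_bytes(6 * GB)    # three full copies
encoded = erasure_coded_raw_bytes(6 * GB)     # data plus 50% parity
```

Under these assumptions, 6 GB of data takes 18 GB with replication and 9 GB with erasure coding, at the cost of extra computation when data has to be reconstructed.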
HDFS limitations
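High throughput but high latency (not suited to low-latency access); not suitable for lots of small files, because every file and block takes up metadata space in NameNode memory.
what are the two types of file formats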
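The small-files problem can be made concrete by counting the metadata objects the NameNode has to track. This sketch assumes a 128 MiB block size and counts one object per file plus one per block; the function name is hypothetical and real memory cost per object is not modelled.

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2  # assumed block size: 128 MiB


def namespace_objects(file_sizes):
    """Count file entries plus block entries for a list of (non-empty) file sizes in bytes."""
    blocks = sum(math.ceil(size / BLOCK_SIZE) for size in file_sizes)
    return len(file_sizes) + blocks


one_big_file = namespace_objects([1024 ** 3])              # 1 GiB in a single file
many_small_files = namespace_objects([1024 ** 2] * 1024)   # the same 1 GiB as 1024 files of 1 MiB
```

The same 1 GiB of data needs 9 objects as one file (1 file + 8 blocks) but 2048 objects as small files (1024 files + 1024 blocks).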
Row-oriented and column-oriented. Row-oriented is more suitable for operational workloads and is ideal for OLTP, like uploading customer profiles and adding new orders.
Column-oriented is suitable for OLAP (analytical workloads), like calculating average salaries by department.
Column-oriented advantages: better compression, which saves space at large scale; less I/O for analytical queries, because only the required columns are read; unnecessary deserialization is skipped and encoded data can be processed directly.
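A small sketch of the two layouts using plain Python structures (the table contents and the function name are made up for illustration). The row layout makes adding one record cheap; the column layout lets the analytical query touch only the two columns it needs.

```python
from collections import defaultdict

# Row-oriented: one complete record per entry, convenient for inserts (OLTP).
rows = [
    {"name": "Ann", "department": "eng", "salary": 120},
    {"name": "Bob", "department": "eng", "salary": 100},
    {"name": "Cid", "department": "ops", "salary": 90},
]

# Column-oriented: one list per column, built here from the same data.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def average_salary_by_department(cols):
    """OLAP-style query that reads only the 'department' and 'salary' columns."""
    totals = defaultdict(lambda: [0, 0])  # department -> [salary sum, row count]
    for department, salary in zip(cols["department"], cols["salary"]):
        totals[department][0] += salary
        totals[department][1] += 1
    return {dept: total / count for dept, (total, count) in totals.items()}


averages = average_salary_by_department(columns)
```

Here `averages` is `{"eng": 110.0, "ops": 90.0}`, and the `name` column was never read.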
what's Parquet and how does it work
Parquet is an open, column-oriented file format that stores nested structures in a flat columnar representation so they can be queried efficiently.
The data model is a schema: the root is a group of fields, and each field has a name, a type, and a repetition (required, optional, or repeated).
how does Parquet store nested data in a columnar format
It stores two integers alongside each value: a definition level and a repetition level. The definition level records how many of the optional or repeated fields on the value's path are actually present, which is how missing values are handled.
The repetition level indicates when a value starts a new list for a repeated field (0 means a new record).
Together (nesting depth plus repetition structure) they let Parquet represent repeated fields efficiently while preserving their hierarchical relationships.
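A minimal sketch of the two levels for the simplest case: a single top-level repeated field, so both levels are either 0 or 1. This is not Parquet's actual encoder; the function names are hypothetical, and `None` stands in for "no value stored".

```python
def encode_repeated(records):
    """Flatten lists into (value, repetition_level, definition_level) triples."""
    triples = []
    for values in records:
        if not values:
            # Empty list: the field is not defined, but the record still needs a marker.
            triples.append((None, 0, 0))
            continue
        for i, value in enumerate(values):
            # Repetition level 0 starts a new record; 1 continues the current list.
            triples.append((value, 0 if i == 0 else 1, 1))
    return triples


def decode_repeated(triples):
    """Rebuild the original lists from the flat column."""
    records = []
    for value, repetition_level, definition_level in triples:
        if repetition_level == 0:
            records.append([])
        if definition_level == 1:
            records[-1].append(value)
    return records


original = [["a", "b"], [], ["c"]]
encoded = encode_repeated(original)
decoded = decode_repeated(encoded)
```

The flat column here is `[("a", 0, 1), ("b", 1, 1), (None, 0, 0), ("c", 0, 1)]`, and decoding it gives back the original three records, including the empty one.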