HDFS Flashcards

Question 1

Q

What is HDFS?

Answer

A

HDFS (Hadoop Distributed File System) is a distributed file system that stores and provides access to large data sets on a cluster of commodity hardware. It is designed for high-throughput batch-processing applications, with a write-once-read-many data access model that prioritizes streaming data access over low latency.

Question 2

Q

What is the architecture of HDFS?

Answer

A

Master-slave architecture. The NameNode is the master: it stores the metadata (the file system tree and the locations of the blocks). The DataNodes are the slaves: they actually store and retrieve the blocks.
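
A toy sketch of this split, not the real HDFS API: every class and method name below is hypothetical. For brevity the NameNode object copies the data itself, whereas in real HDFS the client streams blocks to the DataNodes directly. The block size is shrunk to 4 bytes for illustration (the HDFS default is 128 MB).

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose, real HDFS defaults to 128 MB


class DataNode:
    """Slave: only holds block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Master: only holds metadata (which blocks make up a file, and where)."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # path -> [(block_id, [datanode names])]
        self.next_id = 0

    def write(self, path, data):
        entries = []
        for start in range(0, len(data), BLOCK_SIZE):
            block_id = self.next_id
            self.next_id += 1
            # Simple round-robin placement (an assumption, not HDFS's policy).
            targets = [
                self.datanodes[(block_id + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for dn in targets:
                dn.blocks[block_id] = data[start:start + BLOCK_SIZE]
            entries.append((block_id, [dn.name for dn in targets]))
        self.block_map[path] = entries

    def read(self, path):
        by_name = {dn.name: dn for dn in self.datanodes}
        # Read each block from the first replica listed in the metadata.
        return b"".join(
            by_name[names[0]].blocks[block_id]
            for block_id, names in self.block_map[path]
        )


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes)
namenode.write("/demo.txt", b"hello world")
```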

Question 3

Q

What is the SPOF in HDFS and how can it be fixed?

Answer

A

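SPOF means single point of failure. In HDFS it is the NameNode: if it goes down, the whole file system becomes unavailable. Ways to address it:
1) High availability (active and standby NameNodes).
2) Secondary NameNode: periodically merges the metadata image (fsimage) with the edit log (checkpointing), which keeps the edit log short and speeds up NameNode restarts. It is not a hot standby.
3) Regular backups of the metadata to an external storage system, for recovery.
4) QJM (Quorum Journal Manager): ensures consistent replication of the edit log across multiple nodes.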
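
The checkpoint idea (replaying the edit log on top of the last metadata snapshot) can be sketched as follows. This is a simplified model, not Hadoop code: the function name, the operation names, and the dict-based namespace are all assumptions made for illustration.

```python
def apply_edits(fsimage, edit_log):
    """Return a new namespace snapshot: the old snapshot plus logged edits."""
    namespace = dict(fsimage)  # do not mutate the old snapshot
    for op, path, *args in edit_log:
        if op == "create":
            namespace[path] = args[0]          # args[0]: file metadata
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)  # args[0]: new path
    return namespace


old_image = {"/a": {"blocks": [1]}}
edits = [
    ("create", "/b", {"blocks": [2]}),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image = apply_edits(old_image, edits)
```

After a checkpoint, the merged snapshot replaces the old one and the replayed edits no longer need to be kept, so a restart has less log to replay.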

Question 4

Q

3 features of HDFS

Answer

A

HA (high availability); Federation; Replication (block replicas and erasure coding).
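
A rough arithmetic sketch of why erasure coding saves space compared with plain replication. It assumes the common settings of 3x replication and a Reed-Solomon (6 data, 3 parity) layout; the function name and the 100 TB figure are made up for the example.

```python
def raw_storage(logical_tb, data_units, redundancy_units):
    """Raw capacity needed to store logical_tb of user data."""
    return logical_tb * (data_units + redundancy_units) / data_units


# 3x replication: every block is stored once plus two extra copies.
replicated = raw_storage(100, data_units=1, redundancy_units=2)

# RS(6,3): every 6 data cells get 3 parity cells.
erasure_coded = raw_storage(100, data_units=6, redundancy_units=3)
```

Under these assumptions the replicated data needs 300 TB of raw storage and the erasure-coded data needs 150 TB.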

Question 5

Q

HDFS limitations

Answer

A

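High throughput but high latency, so it is not suited to low-latency access; not suitable for lots of small files, because the NameNode keeps metadata for every file and block in memory.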
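
A back-of-the-envelope sketch of the small-files problem. The 150 bytes per namespace object is an often-quoted rule of thumb, used here as an assumption rather than an exact figure, and the file counts are invented for the example.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact value
BLOCK_MB = 128          # default HDFS block size


def namenode_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


# 10 TB stored as ten million 1 MB files...
small_files = namenode_bytes(10_000_000)

# ...versus the same 10 TB packed into 128 MB files (one block each).
large_files = namenode_bytes(10_000_000 // BLOCK_MB)
```

With these numbers the small-file layout costs about 3 GB of NameNode memory, against roughly 23 MB for the packed layout.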

Question 6

Q

What are the two types of file formats?

Answer

A

Row-oriented and column-oriented. Row-oriented formats are better suited to operational workloads and are ideal for OLTP, such as uploading customer profiles and adding new orders.

Column-oriented formats are suited to OLAP (analytical workloads), such as calculating average salaries by department.

Column-oriented formats also give better compression, which saves space at large scale; they reduce I/O for analytical queries by reading only the required columns; and they can skip unnecessary deserialization and process encoded data directly.
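
A minimal sketch of the two layouts using plain Python structures (the sample data and names are made up). The analytical query touches only the `dept` and `salary` columns and never reads `name`.

```python
# Row-oriented: each record is stored together (good for inserting one order).
rows = [
    {"name": "Ana", "dept": "eng", "salary": 120},
    {"name": "Bo", "dept": "eng", "salary": 100},
    {"name": "Cy", "dept": "ops", "salary": 90},
]

# Column-oriented: each field is stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def avg_salary_by_dept(cols):
    """Average salary per department, reading only two columns."""
    totals = {}
    for dept, salary in zip(cols["dept"], cols["salary"]):
        total, count = totals.get(dept, (0, 0))
        totals[dept] = (total + salary, count + 1)
    return {dept: total / count for dept, (total, count) in totals.items()}


averages = avg_salary_by_dept(columns)
```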

Question 7

Q

What is Parquet and how does it work?

Answer

A

Parquet is an open, column-oriented file format designed for querying nested structures in a flat representation.

The data model is a schema whose root is a group of fields; each field has a name, a type, and a repetition (frequency): required, optional, or repeated.

Question 8

Q

How does Parquet store nested data in a columnar format?

Answer

A

It stores two integers with each value to represent two different levels: the definition level and the repetition level. The definition level records how many optional or repeated fields along the value's path are actually present, which is how missing values are handled.

The repetition level indicates at which repeated field in the path a value starts a new list (0 means a new record).
Together (the depth plus the repeated structure), the two levels allow Parquet to represent repeated fields efficiently while preserving their hierarchical relationships.
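
A small sketch of the two levels for the simplest nested case: a single top-level repeated string column, assumed here to be called `tags`, so both levels can only be 0 or 1. The function names are hypothetical and this is not the Parquet library's API, only the encoding idea.

```python
def shred(records):
    """Flatten lists of tags into (value, repetition, definition) triples."""
    triples = []
    for tags in records:
        if not tags:
            # Empty list: nothing defined, but the record still needs a slot.
            triples.append((None, 0, 0))
            continue
        for i, tag in enumerate(tags):
            # Repetition 0 starts a new record; 1 continues the same list.
            triples.append((tag, 0 if i == 0 else 1, 1))
    return triples


def assemble(triples):
    """Rebuild the original lists from the flat column."""
    records = []
    for value, repetition, definition in triples:
        if repetition == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


original = [["a", "b"], [], ["c"]]
column = shred(original)
restored = assemble(column)
```

The flat column carries enough information to rebuild the nesting, including the empty list in the second record.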

(8 cards)