HDFS Flashcards

1
Q

What is HDFS?

A

HDFS (Hadoop Distributed File System) is a distributed file system that stores and provides access to large data sets on a cluster of commodity hardware. It is designed for high-throughput batch processing, with a write-once-read-many data access model that prioritizes streaming data access over low latency.

2
Q

What is the architecture of HDFS?

A

Master-slave architecture. The NameNode is the master: it stores the metadata, that is, the file system tree and the locations of the blocks. The DataNodes are the slaves: they actually store and retrieve the blocks.
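
The split between metadata and block storage can be illustrated with a toy in-memory model. This is only a sketch, not the real HDFS API: the class names, the round-robin placement, the 4-byte block size and the replication factor of 2 are all assumptions chosen to keep the example small.

```python
# Toy model of the NameNode/DataNode split (hypothetical classes, not the HDFS API).
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> raw bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_to_blocks = {}   # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def add_block(self, path, block_id, datanode_names):
        self.file_to_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanode_names)


def write_file(namenode, datanodes, path, data, block_size=4, replication=2):
    names = sorted(datanodes)
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#blk{index}"
        # Assumed placement policy: simple round-robin over the DataNodes.
        targets = [names[(index + r) % len(names)] for r in range(replication)]
        for target in targets:
            datanodes[target].store(block_id, data[offset:offset + block_size])
        namenode.add_block(path, block_id, targets)


def read_file(namenode, datanodes, path):
    out = b""
    for block_id in namenode.file_to_blocks[path]:
        # Ask the NameNode where the block is, then fetch it from a DataNode.
        first_replica = namenode.block_locations[block_id][0]
        out += datanodes[first_replica].read(block_id)
    return out


datanodes = {n: DataNode(n) for n in ("dn1", "dn2", "dn3")}
namenode = NameNode()
write_file(namenode, datanodes, "/logs/a.txt", b"hello hdfs!")
restored = read_file(namenode, datanodes, "/logs/a.txt")
```

Note that the NameNode object never holds file bytes; it only answers "which blocks, on which nodes".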

3
Q

What is a SPOF and how can it be fixed?

A

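Single point of failure. In HDFS this is the NameNode, because all the metadata lives there.
1) High availability (active and standby NameNodes).
2) A Secondary NameNode regularly merges the metadata image with the edit logs (checkpointing), which keeps the edit log short and speeds up NameNode restart. It is a helper, not a hot standby.
3) Regular backups of the metadata to an external storage system, for recovery.
4) The Quorum Journal Manager (QJM) ensures consistent replication of the edit logs across multiple nodes.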
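
Point 2 can be sketched as a checkpoint: replay the edit log on top of the last metadata snapshot, then start a fresh, empty log. The operation names and the dictionary layout below are hypothetical and only illustrate the idea; real HDFS uses its own on-disk formats.

```python
# Toy checkpoint: merge a metadata snapshot with an edit log (hypothetical format).
def apply_edit(namespace, edit):
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)


def checkpoint(fsimage, edit_log):
    # Copy the snapshot so the old image stays intact until the merge is done.
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    # The new image already contains every edit, so the log can start empty.
    return merged, []


fsimage = {"/a": ["blk1"]}
edit_log = [("create", "/b"), ("add_block", "/b", "blk2"), ("delete", "/a")]
new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
```

A shorter edit log means less replay work when a NameNode starts up.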

4
Q

What are three features of HDFS?

A

High availability (HA); federation; data redundancy (replication and erasure coding).
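
The trade-off between the two redundancy schemes is mostly about storage footprint. A rough calculation, assuming the common defaults of 3 replicas and a Reed-Solomon layout with 6 data units and 3 parity units (the function names are made up for this sketch):

```python
# Raw storage needed for the same logical data under each redundancy scheme.
def replication_footprint(data_bytes, replicas=3):
    return data_bytes * replicas


def erasure_coding_footprint(data_bytes, data_units=6, parity_units=3):
    return data_bytes * (data_units + parity_units) / data_units


one_gib = 1024 ** 3
replicated = replication_footprint(one_gib)   # 3x the logical size
erasure = erasure_coding_footprint(one_gib)   # 1.5x the logical size

replication_overhead = replicated / one_gib - 1  # extra space beyond the data itself
erasure_overhead = erasure / one_gib - 1
```

Under these assumptions replication costs 200% extra space while erasure coding costs 50%, at the price of more CPU and network work when reconstructing data.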

5
Q

What are the limitations of HDFS?

A

High throughput but high latency, so it is a poor fit for low-latency data access; not suitable for lots of small files.
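
The small-files limitation comes from the NameNode tracking every file and every block in memory. A back-of-the-envelope sketch, assuming a 128 MiB block size and equally sized files; the figure of roughly 150 bytes per object is a commonly quoted rule of thumb, treated here purely as an assumption:

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def namenode_objects(total_bytes, file_size, block_size=128 * MIB):
    # Simplification: every file has the same size.
    files = math.ceil(total_bytes / file_size)
    blocks_per_file = math.ceil(file_size / block_size)
    return files + files * blocks_per_file  # one entry per file plus one per block


def estimated_heap_bytes(objects, bytes_per_object=150):
    # bytes_per_object is an assumed rule-of-thumb value, not a measured one.
    return objects * bytes_per_object


one_big_file = namenode_objects(GIB, file_size=GIB)
many_small_files = namenode_objects(GIB, file_size=MIB)
```

The same 1 GiB of data needs 9 metadata objects as one large file but 2048 as 1 MiB files, which is why many small files exhaust NameNode memory long before the DataNodes run out of disk.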

6
Q

What are the two types of file formats?

A

Row-oriented and column-oriented. Row-oriented formats are better suited to operational workloads and are ideal for OLTP, such as uploading customer profiles and adding new orders.

Column-oriented formats are suited to OLAP (analytical workloads), such as calculating average salaries by department.

Column-oriented formats offer better compression, which saves space at large scale; reduced I/O for analytical queries, since only the required columns are read; and the ability to skip unnecessary deserialization and process encoded data directly.
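
A small sketch of the two layouts, using made-up employee records. It shows why the "average salary by department" query only needs two columns, and why a column of repeated values compresses well (run-length encoding is used here as one simple example of an encoding).

```python
# Row-oriented: each record is stored together (good for inserting/updating one record).
rows = [
    {"name": "Ana", "dept": "eng", "salary": 120},
    {"name": "Bo", "dept": "eng", "salary": 100},
    {"name": "Cy", "dept": "ops", "salary": 90},
]

# Column-oriented: each field is stored together (good for scanning a few fields).
columns = {key: [row[key] for row in rows] for key in rows[0]}


def avg_salary_by_dept(cols):
    # Touches only the "dept" and "salary" columns; "name" is never read.
    totals = {}
    for dept, salary in zip(cols["dept"], cols["salary"]):
        total, count = totals.get(dept, (0, 0))
        totals[dept] = (total + salary, count + 1)
    return {dept: total / count for dept, (total, count) in totals.items()}


def run_length_encode(values):
    # Similar values sit next to each other in a column, so runs are common.
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


averages = avg_salary_by_dept(columns)
encoded_dept = run_length_encode(columns["dept"])
```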

7
Q

What is Parquet and how does it work?

A

Parquet is an open, column-oriented file format that supports querying nested structures in a flat representation.

Its data model is a schema made of a root (a group of fields) and fields, where each field has a name, a type, and a repetition (required, optional, or repeated).

8
Q

How does Parquet store nested data in a columnar format?

A

It stores two integers alongside each value, one for each of two levels: the definition level and the repetition level. The definition level records how much of the field's path is actually present, which is how missing values are handled.

The repetition level indicates when to start a new list for repeated fields.
Together (the depth and the repeated data structure), they allow Parquet to represent repeated fields efficiently while preserving their hierarchical relationships.
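
A minimal sketch of the two levels for the simplest possible case: a record with a single repeated string field (called `phones` here, a hypothetical name). With one repeated field, both levels are either 0 or 1. Deeper nesting needs larger level values and is not covered by this sketch.

```python
# Each column entry is (repetition_level, definition_level, value).
def shred(records):
    entries = []
    for phones in records:
        if not phones:
            # Empty list: definition level 0 marks "nothing defined here".
            # None is a placeholder for this sketch; no value exists for this entry.
            entries.append((0, 0, None))
            continue
        for i, value in enumerate(phones):
            # Repetition level 0 starts a new record, 1 continues the current list.
            entries.append((0 if i == 0 else 1, 1, value))
    return entries


def assemble(entries):
    records = []
    for repetition, definition, value in entries:
        if repetition == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


records = [["a", "b"], [], ["c"]]
column = shred(records)
rebuilt = assemble(column)
```

The flat column is enough to rebuild the original lists, including the empty one, which is the point of storing the two levels.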
