BigData - (H)DFS & File formats Flashcards
DFS (Distributed file system) - Chunk size:
Is it more efficient to have a large or small chunk size if you want to store big files in DFS?
Larger chunk sizes are more efficient, as you reduce the number of read/write operations you need to make.
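A quick back-of-the-envelope sketch in Python (the file and chunk sizes are made-up example figures) showing how the number of chunks, and therefore read/write operations, shrinks as the chunk size grows:

```python
import math

file_size = 1 * 1024**3  # a 1 GB file (example figure)

# Fewer, larger chunks means fewer read/write operations for the same file.
for chunk_size_mb in (4, 64, 128):
    chunks = math.ceil(file_size / (chunk_size_mb * 1024**2))
    print(f"chunk size {chunk_size_mb:>3} MB -> {chunks} chunks")
# chunk size   4 MB -> 256 chunks
# chunk size  64 MB -> 16 chunks
# chunk size 128 MB -> 8 chunks
```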
DFS - Chunk size:
How many MB are taken up in the DFS if your chunk size is 5 MB and you want to save 4 files of 50 KB each?
If you save them one after the other, you take up 20 MB of space instead of 200 KB, because each file ends up occupying its own 5 MB chunk.
The rest is wasted space!
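The same arithmetic as a small Python sketch, assuming (as the card does) that each small file occupies a full chunk:

```python
chunk_size_kb = 5 * 1024   # 5 MB chunk size, in KB
file_size_kb = 50          # each file is 50 KB
num_files = 4

actual_data_kb = num_files * file_size_kb   # 200 KB of real data
allocated_kb = num_files * chunk_size_kb    # each file takes a whole chunk
wasted_kb = allocated_kb - actual_data_kb

print(f"data: {actual_data_kb} KB, allocated: {allocated_kb // 1024} MB, wasted: {wasted_kb} KB")
# data: 200 KB, allocated: 20 MB, wasted: 20280 KB
```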
DFS:
How would you ensure that data is stored redundantly in DFS?
Typically 3 replicas of each chunk, kept on different machines in different racks; when a machine or rack dies, the data is still available elsewhere.
Data Lake Zones:
What are the Transient, Raw, Trusted & Refined zones?
Transient zone:
A temporary landing area where data sits while it is being moved from the source into the lake's zones
Raw zone:
The raw data
Trusted zone:
Data that has been validated and cleaned, so it can be trusted for use
Refined zone:
At this point we have analyzed the dataset and generated new insight from the data, i.e. brought new value
Why have many data lakes turned into data swamps?
Because too much data is dumped in without being linked together or described by metadata. No one knows what the data means, so no one can use it for anything.
Data warehouse/lake/lakehouse:
What is ETL, and what are the use cases for it?
Extract Transform Load
Used in traditional data warehouse scenarios
Extract data from various sources or databases
Transform the data: clean, filter, aggregate, or convert it into a suitable format (e.g. Avro/Parquet)
Load the data into the targeted warehouse or storage system
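A minimal ETL sketch in Python/pandas (file names and columns are hypothetical; pandas and a Parquet engine such as pyarrow are assumed to be installed):

```python
import pandas as pd

# Extract: pull data out of a source system (here a hypothetical CSV export)
orders = pd.read_csv("orders_export.csv")

# Transform: clean, filter and aggregate before anything reaches the warehouse
orders = orders.dropna(subset=["order_id"])        # clean
orders = orders[orders["amount"] > 0]              # filter
daily_totals = orders.groupby("order_date", as_index=False)["amount"].sum()  # aggregate

# Load: write the already-transformed result in a warehouse-friendly format (Parquet)
daily_totals.to_parquet("warehouse/daily_totals.parquet", index=False)
```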
Data warehouse/lake/lakehouse:
What is ELT, and what are the use cases for it?
Extract Load Transform
Used in big data environments where the raw data is stored in a data lake
Extract data from various sources or databases
Load RAW data into the target data storage, without transformation
Transform within the target system, e.g. using Spark or Hive to process/analyze the data
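A minimal ELT sketch with PySpark (paths and columns are hypothetical; assumes a Spark environment with the data lake reachable at the given paths):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: copy the raw source data into the lake untouched
raw = spark.read.json("source/orders/*.json")         # extract from the source
raw.write.mode("overwrite").json("lake/raw/orders")   # load as-is, no transformation yet

# Transform: later, inside the target system, Spark does the heavy lifting
orders = spark.read.json("lake/raw/orders")
daily_totals = (orders
                .filter(F.col("amount") > 0)
                .groupBy("order_date")
                .agg(F.sum("amount").alias("total_amount")))
daily_totals.write.mode("overwrite").parquet("lake/refined/daily_totals")
```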
Data warehouse/lake/lakehouse:
What is EtLT, and what are the use cases for it?
Extract Transform Load Transform
Used when data is complex, and you need to transform it multiple times.
Extract data from various sources or databases
Transform the data: clean, filter, aggregate, or convert it into a suitable format (e.g. Avro/Parquet)
Load the data into the target data storage, without further transformation
Transform within the target system, e.g. using Spark or Hive to process/analyze the data
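A compact EtLT sketch (names are hypothetical): only the light "t" happens before the load, e.g. masking sensitive fields, while the heavy "T" runs in the target system afterwards, like the Spark aggregation in the ELT sketch above:

```python
import pandas as pd

# Extract from the source
orders = pd.read_csv("orders_export.csv")

# light transform (small 't'): only what must happen before loading,
# e.g. dropping a sensitive column and converting to an efficient format
orders = orders.drop(columns=["customer_email"])   # hypothetical PII column

# Load into the target storage (Parquet in the lake)
orders.to_parquet("lake/staged/orders.parquet", index=False)

# heavy Transform (big 'T'): done afterwards inside the target system,
# e.g. with Spark or Hive as in the ELT example above.
```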
The Hadoop Setup:
What is the Name node? (master node)
It is responsible for coordination, job scheduling and metadata management.
It controls the Data nodes, makes sure that data is replicated correctly, and checks that the Data nodes are alive.
The Hadoop Setup:
How can you prevent the Name node from being a single point of failure?
To prevent it, you have a standby “shadow” master, which mirrors the Name node’s state and takes over instantly if the Name node goes down.
The Hadoop Setup:
What do the Data nodes do in HDFS?
The data nodes are where you store the actual data in chunks. Each block of data is replicated across multiple data nodes.
The Hadoop Setup:
What happens when you want to read data from HDFS?
The Name node tells you the closest Data node to read the data from. You then read directly from the Data nodes.
If the closest Data node is down, the Name node directs you to the next closest Data node.
The Hadoop Setup:
What happens when you want to write data to HDFS?
The Name node tells you where to write, and you write directly to that Data node. Once you have written to the first chunk server, that server coordinates the writes to the other 2 replicas. When all 3 replicas have been written, you get an acknowledgement that the write succeeded.
File formats:
What are some defining differences between Avro and Parquet?
Avro:
Row-based; the schema is defined in human-readable JSON, while the data itself is stored in a compact binary encoding; able to evolve the schema
Parquet:
Column-based, stored in a non-human-readable binary format, not able to evolve the schema (easily).
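A small sketch writing the same records in both formats (assumes the fastavro and pyarrow packages; the schema and records are made up):

```python
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

records = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 3.0}]

# Avro: row-based; the schema is declared as JSON and stored with the file
schema = fastavro.parse_schema({
    "type": "record", "name": "Order",
    "fields": [{"name": "order_id", "type": "long"},
               {"name": "amount", "type": "double"}],
})
with open("orders.avro", "wb") as out:
    fastavro.writer(out, schema, records)

# Parquet: column-based binary format; you can read back a single column
table = pa.Table.from_pylist(records)
pq.write_table(table, "orders.parquet")
amounts = pq.read_table("orders.parquet", columns=["amount"])
```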
File formats:
Let’s say you want to be able to store order data, and later you want to be able to read all orders between 2 dates. Which data format would suit this best?
Storing it in Avro (row-based) means that, if the orders are written in date order, you could read e.g. every row from index 5000 to 6001 and get exactly the orders between the two dates. Doing this means you don’t need to read the entire file, but only the orders you look for.
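A sketch of that row-range read with fastavro (a hypothetical orders.avro written in date order, so a contiguous index range corresponds to a date range):

```python
from itertools import islice
import fastavro

with open("orders.avro", "rb") as fo:
    # records before index 5000 are read and discarded;
    # nothing after the range is ever read from the file
    wanted = list(islice(fastavro.reader(fo), 5000, 6001))
```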