BigData - (H)DFS & File formats Flashcards
DFS (Distributed file system) - Chunk size:
Is it more efficient to have a large or small chunk size if you want to store big files in DFS?
Larger chunk sizes are more efficient, as you reduce the number of read/write operations you need to make.
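A quick back-of-the-envelope sketch in Python (the file and chunk sizes are made-up example figures) showing how the number of chunks, and therefore read/write operations, shrinks as the chunk size grows:

```python
import math

file_size = 1 * 1024**3  # a 1 GB file (example figure)

# Fewer, larger chunks means fewer read/write operations for the same file.
for chunk_size_mb in (4, 64, 128):
    chunks = math.ceil(file_size / (chunk_size_mb * 1024**2))
    print(f"chunk size {chunk_size_mb:>3} MB -> {chunks} chunks")
# chunk size   4 MB -> 256 chunks
# chunk size  64 MB -> 16 chunks
# chunk size 128 MB -> 8 chunks
```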
DFS - Chunk size:
How many MB are taken up in the DFS if your chunk size is 5 MB and you want to save 4 files of 50 KB each?
If you save them one after the other, you take up 20 MB of space instead of 200 KB, because each file ends up occupying its own 5 MB chunk.
The rest is wasted space!
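The same arithmetic as a small Python sketch, assuming (as the card does) that each small file occupies a full chunk:

```python
chunk_size_kb = 5 * 1024   # 5 MB chunk size, in KB
file_size_kb = 50          # each file is 50 KB
num_files = 4

actual_data_kb = num_files * file_size_kb   # 200 KB of real data
allocated_kb = num_files * chunk_size_kb    # each file takes a whole chunk
wasted_kb = allocated_kb - actual_data_kb

print(f"data: {actual_data_kb} KB, allocated: {allocated_kb // 1024} MB, wasted: {wasted_kb} KB")
# data: 200 KB, allocated: 20 MB, wasted: 20280 KB
```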
DFS:
How would you ensure that data is stored redundantly in DFS?
Typically 3 replicas of each chunk, kept on different machines in different racks; when a machine or rack dies, the data is still available elsewhere.
Data Lake Zones:
What are the Transient, Raw, Trusted & Refined zones?
Transient zone:
A temporary landing area where data sits while it is being moved from the source into the lake's zones
Raw zone:
The raw data
Trusted zone:
Data that has been validated and cleaned, so it can be trusted for use
Refined zone:
At this point we have analyzed the dataset and generated new insight from the data, i.e. brought new value
Why have many data lakes turned into data swamps?
Because too much data is dumped in without being linked together or described by metadata. No one knows what the data means, so no one can use it for anything.
Data warehouse/lake/lakehouse:
What is ETL, and what are the use cases for it?
Extract Transform Load
Used in traditional data warehouse scenarios
Extract data from various sources or databases
Transform the data: clean, filter, aggregate, or convert it into a suitable format (e.g. Avro/Parquet)
Load the data into the targeted warehouse or storage system
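A minimal ETL sketch in Python/pandas (file names and columns are hypothetical; pandas and a Parquet engine such as pyarrow are assumed to be installed):

```python
import pandas as pd

# Extract: pull data out of a source system (here a hypothetical CSV export)
orders = pd.read_csv("orders_export.csv")

# Transform: clean, filter and aggregate before anything reaches the warehouse
orders = orders.dropna(subset=["order_id"])        # clean
orders = orders[orders["amount"] > 0]              # filter
daily_totals = orders.groupby("order_date", as_index=False)["amount"].sum()  # aggregate

# Load: write the already-transformed result in a warehouse-friendly format (Parquet)
daily_totals.to_parquet("warehouse/daily_totals.parquet", index=False)
```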
Data warehouse/lake/lakehouse:
What is ELT, and what are the use cases for it?
Extract Load Transform
Used in big data environments where the raw data is stored in a data lake
Extract data from various sources or databases
Load RAW data into the target data storage, without transformation
Transform within the target system, e.g. using Spark or Hive to process/analyze the data
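A minimal ELT sketch with PySpark (paths and columns are hypothetical; assumes a Spark environment with the data lake reachable at the given paths):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: copy the raw source data into the lake untouched
raw = spark.read.json("source/orders/*.json")         # extract from the source
raw.write.mode("overwrite").json("lake/raw/orders")   # load as-is, no transformation yet

# Transform: later, inside the target system, Spark does the heavy lifting
orders = spark.read.json("lake/raw/orders")
daily_totals = (orders
                .filter(F.col("amount") > 0)
                .groupBy("order_date")
                .agg(F.sum("amount").alias("total_amount")))
daily_totals.write.mode("overwrite").parquet("lake/refined/daily_totals")
```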
Data warehouse/lake/lakehouse:
What is EtLT, and what are the use cases for it?
Extract Transform Load Transform
Used when data is complex, and you need to transform it multiple times.
Extract data from various sources or databases
Transform the data: clean, filter, aggregate, or convert it into a suitable format (e.g. Avro/Parquet)
Load the data into the target data storage, without further transformation
Transform within the target system, e.g. using Spark or Hive to process/analyze the data
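A compact EtLT sketch (names are hypothetical): only the light "t" happens before the load, e.g. masking sensitive fields, while the heavy "T" runs in the target system afterwards, like the Spark aggregation in the ELT sketch above:

```python
import pandas as pd

# Extract from the source
orders = pd.read_csv("orders_export.csv")

# light transform (small 't'): only what must happen before loading,
# e.g. dropping a sensitive column and converting to an efficient format
orders = orders.drop(columns=["customer_email"])   # hypothetical PII column

# Load into the target storage (Parquet in the lake)
orders.to_parquet("lake/staged/orders.parquet", index=False)

# heavy Transform (big 'T'): done afterwards inside the target system,
# e.g. with Spark or Hive as in the ELT example above.
```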
The Hadoop Setup:
What is the Name node? (master node)
It is responsible for coordination, job scheduling and metadata management.
It controls the Data nodes, makes sure that data is replicated correctly, and checks that the Data nodes are alive.
The Hadoop Setup:
How can you prevent the Name node from being a single point of failure?
To prevent it, you have a standby “shadow” master, which mirrors the Name node’s state and takes over instantly if the Name node goes down.
The Hadoop Setup:
What do the Data nodes do in HDFS?
The data nodes are where you store the actual data in chunks. Each block of data is replicated across multiple data nodes.
The Hadoop Setup:
What happens when you want to read data from HDFS?
The Name node tells you the closest Data node to read the data from. You then read directly from the Data nodes.
If the closest Data node is down, the Name node directs you to the next closest Data node.
The Hadoop Setup:
What happens when you want to write data to HDFS?
The Name node tells you where to write, and you write directly to that Data node. Once you have written to the first chunk server, that server coordinates the writes to the other 2 replicas. When all 3 replicas have been written, you get an acknowledgement that the write succeeded.
File formats:
What are some defining differences between Avro and Parquet?
Avro:
Row-based; the schema is defined in human-readable JSON, while the data itself is stored in a compact binary encoding; able to evolve the schema
Parquet:
Column-based, stored in a non-human-readable binary format, not able to evolve the schema (easily).
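A small sketch writing the same records in both formats (assumes the fastavro and pyarrow packages; the schema and records are made up):

```python
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

records = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 3.0}]

# Avro: row-based; the schema is declared as JSON and stored with the file
schema = fastavro.parse_schema({
    "type": "record", "name": "Order",
    "fields": [{"name": "order_id", "type": "long"},
               {"name": "amount", "type": "double"}],
})
with open("orders.avro", "wb") as out:
    fastavro.writer(out, schema, records)

# Parquet: column-based binary format; you can read back a single column
table = pa.Table.from_pylist(records)
pq.write_table(table, "orders.parquet")
amounts = pq.read_table("orders.parquet", columns=["amount"])
```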
File formats:
Let’s say you want to be able to store order data, and later you want to be able to read all orders between 2 dates. Which data format would suit this best?
Storing it in Avro (row-based) means that, if the orders are written in date order, you could read e.g. every row from index 5000 to 6001 and get exactly the orders between the two dates. Doing this means you don’t need to read the entire file, but only the orders you look for.
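A sketch of that row-range read with fastavro (a hypothetical orders.avro written in date order, so a contiguous index range corresponds to a date range):

```python
from itertools import islice
import fastavro

with open("orders.avro", "rb") as fo:
    # records before index 5000 are read and discarded;
    # nothing after the range is ever read from the file
    wanted = list(islice(fastavro.reader(fo), 5000, 6001))
```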