7. Data Formats and File Storage Flashcards
What are the differences between CSV, JSON, Avro, Parquet, and ORC formats?
CSV is a plain-text, row-oriented format with no type information or enforced schema; JSON is a self-describing text format for semi-structured data; Avro is a row-based binary format that embeds its schema, making it well suited to record-at-a-time pipelines and streaming; Parquet is a columnar binary format optimized for analytical scans and compression; and ORC is another columnar format, designed for big data processing in the Hadoop/Hive ecosystem.
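A minimal sketch of the first two formats using only the Python standard library: the same records written as CSV (schema lives only in the header row, all values become strings) and as JSON (types survive, but field names repeat in every record).

```python
import csv
import io
import json

records = [
    {"id": 1, "name": "Ada", "score": 91.5},
    {"id": 2, "name": "Grace", "score": 88.0},
]

# CSV: compact, header written once, but every value is serialized as text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: self-describing, numbers stay numbers, but keys repeat per record.
json_text = json.dumps(records)

round_tripped = json.loads(json_text)
print(round_tripped[0]["score"])  # 91.5 — still a float after the round trip
```

Avro, Parquet, and ORC are binary formats that need third-party libraries (e.g. fastavro, pyarrow), so they are omitted here.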
Why would you choose Parquet over JSON for big data storage?
Parquet is more efficient for analytical queries due to its columnar storage format, which allows for better compression and faster read times.
How does schema-on-read work in file-based systems?
Schema-on-read allows data to be stored without a predefined schema, applying the schema only when the data is read, enabling flexibility in data processing.
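A small sketch of schema-on-read: raw events land as JSON lines with no enforced structure, and a reader function applies the schema (field selection and type coercion) only at query time. The field names here are illustrative, not from any particular system.

```python
import json

# Raw landing zone: JSON lines with inconsistent types and optional fields.
raw_lines = [
    '{"user": "ada", "amount": "12.50"}',
    '{"user": "grace", "amount": 3, "note": "refund"}',
]

def read_with_schema(lines):
    """Apply a schema at read time: select known fields and coerce their types."""
    for line in lines:
        rec = json.loads(line)
        yield {"user": str(rec["user"]), "amount": float(rec["amount"])}

rows = list(read_with_schema(raw_lines))
print(rows[0]["amount"] + rows[1]["amount"])  # 15.5 — both amounts are now floats
```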
Explain data serialization and deserialization.
Data serialization is the process of converting an object into a format that can be easily stored or transmitted, while deserialization is the reverse process, converting the serialized data back into an object.
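The round trip can be shown in a few lines with JSON as the wire format (any serialization format follows the same object-to-bytes-to-object shape):

```python
import json

# Serialize: in-memory object -> bytes suitable for storage or transmission.
event = {"id": 42, "tags": ["etl", "batch"], "ok": True}
payload = json.dumps(event).encode("utf-8")

# Deserialize: bytes -> an equivalent in-memory object.
restored = json.loads(payload.decode("utf-8"))
assert restored == event  # the round trip preserves structure and types
```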
How do compression techniques like Snappy and Gzip affect performance?
Snappy trades compression ratio for speed: it compresses and decompresses quickly but produces larger files, making it a common default for frequently read analytical data. Gzip achieves higher compression ratios but is slower to compress and decompress, so it suits cold or archival data where storage cost matters more than read latency.
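Snappy is not in the Python standard library (it requires a third-party package such as python-snappy), so this sketch illustrates the same speed-versus-ratio trade-off using gzip's compression levels: level 1 favors speed, level 9 favors ratio.

```python
import gzip

# Repetitive, CSV-like payload, similar to what big data files contain.
data = b"timestamp,user,amount\n" + b"2024-01-01,ada,12.50\n" * 5000

fast = gzip.compress(data, compresslevel=1)   # favors speed (Snappy's design goal)
small = gzip.compress(data, compresslevel=9)  # favors ratio (classic Gzip usage)

print(len(fast) >= len(small))  # True — the higher level yields smaller output
assert gzip.decompress(fast) == gzip.decompress(small) == data
```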
What is the difference between row-based and column-based storage formats?
Row-based formats store each record's fields together, making them efficient for transactional workloads that read or write whole records, while column-based formats store the values of each column together, which suits analytical queries that scan a few columns across many rows and enables better compression within each column.
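The two layouts can be sketched with plain Python data structures: the same table held row-wise and column-wise, where an analytical aggregate over one column only needs to touch that column in the columnar layout.

```python
# Row layout: each record stored together — good for fetching whole records (OLTP).
rows = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lima", "amount": 20.0},
    {"id": 3, "city": "Oslo", "amount": 5.0},
]

# Column layout: each field stored contiguously — a query like SUM(amount)
# reads one column and skips the others entirely.
columns = {
    "id": [1, 2, 3],
    "city": ["Oslo", "Lima", "Oslo"],
    "amount": [10.0, 20.0, 5.0],
}

print(sum(columns["amount"]))  # 35.0 — computed without touching id or city
```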
How do you efficiently partition data for storage in S3?
Efficient partitioning in S3 involves organizing objects under key prefixes based on frequently filtered attributes (commonly Hive-style prefixes such as year=2024/month=01), so that query engines can prune partitions, scanning less data and reducing both query time and cost.
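A sketch of building such a key, with a hypothetical bucket, table, and helper name; the year=/month=/day= prefix convention is the Hive-style layout that engines like Athena and Spark can prune on.

```python
from datetime import date

def partition_key(bucket: str, table: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key from a date (hypothetical helper)."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"
    )

key = partition_key("analytics-lake", "events", date(2024, 3, 7), "part-0000.parquet")
print(key)
# s3://analytics-lake/events/year=2024/month=03/day=07/part-0000.parquet
```

A query filtering on `year` and `month` then only lists and reads objects under the matching prefixes.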
Explain the concept of data lake and its advantages.
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale, providing flexibility, scalability, and the ability to analyze data in various formats.
How do you manage metadata in a data lake?
Metadata management in a data lake involves cataloging data assets, tracking data lineage, and ensuring data quality to facilitate data discovery and governance.
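A minimal in-memory sketch of what a catalog entry might record — the fields and function are hypothetical, loosely modeled on what tools like AWS Glue Data Catalog or Hive Metastore track:

```python
catalog = {}

def register_dataset(name, location, schema, upstream=None):
    """Record a dataset's storage location, schema, and lineage (hypothetical API)."""
    catalog[name] = {
        "location": location,        # where the files live in the lake
        "schema": schema,            # column names -> types, for data discovery
        "lineage": upstream or [],   # which datasets this one derives from
    }

register_dataset("raw_events", "s3://lake/raw/events/",
                 {"ts": "string", "payload": "string"})
register_dataset("daily_revenue", "s3://lake/marts/revenue/",
                 {"day": "date", "revenue": "double"}, upstream=["raw_events"])

print(catalog["daily_revenue"]["lineage"])  # ['raw_events']
```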
What challenges arise when storing large datasets as JSON?
Challenges include increased storage size, slower query performance, and difficulties in schema enforcement and data validation.
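The storage-size problem is easy to demonstrate: because field names repeat in every JSON record, raw JSON grows faster than an equivalent CSV, where the header is written once.

```python
import csv
import io
import json

records = [{"user_id": i, "country": "NO", "amount": 1.0} for i in range(1000)]

# JSON: the keys "user_id", "country", "amount" appear in all 1000 records.
json_bytes = len(json.dumps(records))

# CSV: the field names appear once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "country", "amount"])
writer.writeheader()
writer.writerows(records)
csv_bytes = len(buf.getvalue())

print(json_bytes > csv_bytes)  # True — repeated keys inflate the JSON payload
```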