7. Data Formats and File Storage Flashcards
What are the differences between CSV, JSON, Avro, Parquet, and ORC formats?
CSV is a plain-text, row-oriented format with no type information or enforced schema; JSON is a self-describing text format for semi-structured data; Avro is a row-based binary format that embeds its schema, making it well suited to record-at-a-time pipelines and streaming; Parquet is a columnar binary format optimized for analytical scans and compression; and ORC is another columnar format, designed for big data processing in the Hadoop/Hive ecosystem.
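A minimal sketch of the first two formats using only the Python standard library: the same records written as CSV (schema lives only in the header row, all values become strings) and as JSON (types survive, but field names repeat in every record).

```python
import csv
import io
import json

records = [
    {"id": 1, "name": "Ada", "score": 91.5},
    {"id": 2, "name": "Grace", "score": 88.0},
]

# CSV: compact, header written once, but every value is serialized as text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: self-describing, numbers stay numbers, but keys repeat per record.
json_text = json.dumps(records)

round_tripped = json.loads(json_text)
print(round_tripped[0]["score"])  # 91.5 — still a float after the round trip
```

Avro, Parquet, and ORC are binary formats that need third-party libraries (e.g. fastavro, pyarrow), so they are omitted here.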
Why would you choose Parquet over JSON for big data storage?
Parquet is more efficient for analytical queries due to its columnar storage format, which allows for better compression and faster read times.
How does schema-on-read work in file-based systems?
Schema-on-read allows data to be stored without a predefined schema, applying the schema only when the data is read, enabling flexibility in data processing.
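A small sketch of schema-on-read: raw events land as JSON lines with no enforced structure, and a reader function applies the schema (field selection and type coercion) only at query time. The field names here are illustrative, not from any particular system.

```python
import json

# Raw landing zone: JSON lines with inconsistent types and optional fields.
raw_lines = [
    '{"user": "ada", "amount": "12.50"}',
    '{"user": "grace", "amount": 3, "note": "refund"}',
]

def read_with_schema(lines):
    """Apply a schema at read time: select known fields and coerce their types."""
    for line in lines:
        rec = json.loads(line)
        yield {"user": str(rec["user"]), "amount": float(rec["amount"])}

rows = list(read_with_schema(raw_lines))
print(rows[0]["amount"] + rows[1]["amount"])  # 15.5 — both amounts are now floats
```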
Explain data serialization and deserialization.
Data serialization is the process of converting an object into a format that can be easily stored or transmitted, while deserialization is the reverse process, converting the serialized data back into an object.
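The round trip can be shown in a few lines with JSON as the wire format (any serialization format follows the same object-to-bytes-to-object shape):

```python
import json

# Serialize: in-memory object -> bytes suitable for storage or transmission.
event = {"id": 42, "tags": ["etl", "batch"], "ok": True}
payload = json.dumps(event).encode("utf-8")

# Deserialize: bytes -> an equivalent in-memory object.
restored = json.loads(payload.decode("utf-8"))
assert restored == event  # the round trip preserves structure and types
```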
How do compression techniques like Snappy and Gzip affect performance?
Snappy trades compression ratio for speed: it compresses and decompresses quickly but produces larger files, making it a common default for frequently read analytical data. Gzip achieves higher compression ratios but is slower to compress and decompress, so it suits cold or archival data where storage cost matters more than read latency.
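Snappy is not in the Python standard library (it requires a third-party package such as python-snappy), so this sketch illustrates the same speed-versus-ratio trade-off using gzip's compression levels: level 1 favors speed, level 9 favors ratio.

```python
import gzip

# Repetitive, CSV-like payload, similar to what big data files contain.
data = b"timestamp,user,amount\n" + b"2024-01-01,ada,12.50\n" * 5000

fast = gzip.compress(data, compresslevel=1)   # favors speed (Snappy's design goal)
small = gzip.compress(data, compresslevel=9)  # favors ratio (classic Gzip usage)

print(len(fast) >= len(small))  # True — the higher level yields smaller output
assert gzip.decompress(fast) == gzip.decompress(small) == data
```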
What is the difference between row-based and column-based storage formats?
Row-based formats store each record's fields together, making them efficient for transactional workloads that read or write whole records, while column-based formats store the values of each column together, which suits analytical queries that scan a few columns across many rows and enables better compression within each column.
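The two layouts can be sketched with plain Python data structures: the same table held row-wise and column-wise, where an analytical aggregate over one column only needs to touch that column in the columnar layout.

```python
# Row layout: each record stored together — good for fetching whole records (OLTP).
rows = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lima", "amount": 20.0},
    {"id": 3, "city": "Oslo", "amount": 5.0},
]

# Column layout: each field stored contiguously — a query like SUM(amount)
# reads one column and skips the others entirely.
columns = {
    "id": [1, 2, 3],
    "city": ["Oslo", "Lima", "Oslo"],
    "amount": [10.0, 20.0, 5.0],
}

print(sum(columns["amount"]))  # 35.0 — computed without touching id or city
```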
How do you efficiently partition data for storage in S3?
Efficient partitioning in S3 involves organizing objects under key prefixes based on frequently filtered attributes (commonly Hive-style prefixes such as year=2024/month=01), so that query engines can prune partitions, scanning less data and reducing both query time and cost.
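A sketch of building such a key, with a hypothetical bucket, table, and helper name; the year=/month=/day= prefix convention is the Hive-style layout that engines like Athena and Spark can prune on.

```python
from datetime import date

def partition_key(bucket: str, table: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key from a date (hypothetical helper)."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"
    )

key = partition_key("analytics-lake", "events", date(2024, 3, 7), "part-0000.parquet")
print(key)
# s3://analytics-lake/events/year=2024/month=03/day=07/part-0000.parquet
```

A query filtering on `year` and `month` then only lists and reads objects under the matching prefixes.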
Explain the concept of data lake and its advantages.
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale, providing flexibility, scalability, and the ability to analyze data in various formats.
How do you manage metadata in a data lake?
Metadata management in a data lake involves cataloging data assets, tracking data lineage, and ensuring data quality to facilitate data discovery and governance.
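A minimal in-memory sketch of what a catalog entry might record — the fields and function are hypothetical, loosely modeled on what tools like AWS Glue Data Catalog or Hive Metastore track:

```python
catalog = {}

def register_dataset(name, location, schema, upstream=None):
    """Record a dataset's storage location, schema, and lineage (hypothetical API)."""
    catalog[name] = {
        "location": location,        # where the files live in the lake
        "schema": schema,            # column names -> types, for data discovery
        "lineage": upstream or [],   # which datasets this one derives from
    }

register_dataset("raw_events", "s3://lake/raw/events/",
                 {"ts": "string", "payload": "string"})
register_dataset("daily_revenue", "s3://lake/marts/revenue/",
                 {"day": "date", "revenue": "double"}, upstream=["raw_events"])

print(catalog["daily_revenue"]["lineage"])  # ['raw_events']
```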
What challenges arise when storing large datasets as JSON?
Challenges include increased storage size, slower query performance, and difficulties in schema enforcement and data validation.
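The storage-size problem is easy to demonstrate: because field names repeat in every JSON record, raw JSON grows faster than an equivalent CSV, where the header is written once.

```python
import csv
import io
import json

records = [{"user_id": i, "country": "NO", "amount": 1.0} for i in range(1000)]

# JSON: the keys "user_id", "country", "amount" appear in all 1000 records.
json_bytes = len(json.dumps(records))

# CSV: the field names appear once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "country", "amount"])
writer.writeheader()
writer.writerows(records)
csv_bytes = len(buf.getvalue())

print(json_bytes > csv_bytes)  # True — repeated keys inflate the JSON payload
```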