Describe common formats for data files Flashcards
CSV (Comma-Separated Values)
CSV is a plain text format where data is separated by commas. Each line represents a record, and fields are separated by commas.
Use Cases: Widely used for exchanging data between applications, databases, and spreadsheet software.
Example:
FirstName,LastName,Email
Joe,Jones,joe@litware.com
JSON (JavaScript Object Notation):
JSON is a lightweight data-interchange format. It uses a human-readable text format to represent data objects consisting of attribute–value pairs.
Use Cases: Commonly used for web APIs, configuration files, and exchanging data between applications.
Example:
{
“name”: “John Doe”,
“age”: 30,
“city”: “New York”,
“isStudent”: false,
“grades”: [85, 90, 78],
“address”: {
“street”: “123 Main St”,
“zipCode”: “10001”
}
}
XML (eXtensible Markup Language)
XML is a markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable.
Use Cases: Frequently used for data interchange between heterogeneous systems, configuration files, and representing structured data.
Example:
<Customers>
<Customer>
<ContactDetails>
<Contact></Contact>
<Contact></Contact>
</ContactDetails>
</Customer>
</Customers>
Optimized File Formats
While human-readable formats for structured and semi-structured data can be useful, they’re typically not optimized for storage space or processing. Over time, some specialized file formats that enable compression, indexing, and efficient storage and processing have been developed.
Avro
Avro is a binary serialization format developed within the Apache Hadoop project. It provides a compact, fast, and efficient way to serialize data.
Use Cases: Commonly used in Apache Kafka for message serialization, and in Hadoop for efficient data storage and processing.
ORC (Optimized Row Columnar format)
Organizes data into columns rather than rows. It was developed by HortonWorks for optimizing read and write operations in Apache Hive (Hive is a data warehouse system that supports fast data summarization and querying over large datasets).
An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
Parquet
Parquet is a columnar storage file format optimized for use with big data processing frameworks. It is often used in Apache Hadoop ecosystems.
Use Cases: Suitable for efficient storage and processing of large datasets in distributed computing environments.