Azure Data Lake Flashcards
What is the Parquet file type, and where is it used?
- Structured
- A Parquet file is a columnar storage file format that is designed for efficient data processing, particularly in large-scale data systems. It was created as part of the Apache Hadoop ecosystem and is commonly used in big data platforms such as Apache Spark, Apache Hive, and Apache Impala. Parquet is optimized for both storage efficiency and performance, especially for analytical and read-heavy workloads.
- Columnar Format: Unlike row-based formats (like CSV or Avro), Parquet stores data column by column. This means that when querying data, only the necessary columns need to be read, significantly reducing I/O operations and improving query performance, especially for large datasets (a short sketch follows this card).
- Efficient Compression: Parquet supports highly efficient compression and encoding schemes (such as Snappy, Gzip, LZO, etc.) at the column level. This reduces storage requirements and speeds up query processing by only decompressing the necessary data.
- Splittable: Parquet files can be split into chunks, making them ideal for distributed data processing frameworks like Hadoop and Spark, where multiple nodes can read parts of a file in parallel.
- Schema: Like Avro, Parquet files are schema-based, meaning the structure of the data is stored with the file itself, allowing different applications to interpret the data consistently.
- Supports Complex Data Types: Parquet can handle complex data types such as nested structures, arrays, and maps, which makes it versatile for different types of data and use cases.
- Compatibility: Parquet is compatible with various big data tools and languages, including Apache Hive, Apache Drill, Apache Impala, Apache Spark, and even cloud platforms like AWS (Amazon S3) and Azure Data Lake.
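A minimal sketch of the column-pruning read path using pyarrow; the file name, column names, and values below are illustrative, not from any particular system:

```python
# Minimal sketch, assuming pyarrow is installed; file and column
# names are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to Parquet with column-level Snappy compression.
table = pa.table({"user_id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})
pq.write_table(table, "scores.parquet", compression="snappy")

# Columnar read: only the requested column is read and decompressed,
# which is why read-heavy analytical queries stay cheap.
scores = pq.read_table("scores.parquet", columns=["score"])
print(scores.to_pydict())  # {'score': [0.9, 0.7, 0.8]}
```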
What is the Avro file type, and where is it used?
- Structured
- An Avro file is a binary file format used for serializing data, typically in big data processing and storage systems.
- Avro is part of the Apache Hadoop ecosystem and is often used with Apache Kafka, Apache Spark, and Apache Hive.
- Schema-Based: Avro uses a JSON-based schema to define the structure of the data. This schema is stored with the data, ensuring that the data can be read and written consistently even across different systems (a short sketch follows this card).
- Compact & Efficient: Avro files are compressed binary formats, making them more efficient in terms of storage space and data transfer compared to text formats like JSON or CSV.
- Splittable: Avro files are designed to be splittable, meaning large Avro files can be divided into smaller chunks, making them ideal for distributed systems like Hadoop, where parallel processing is common.
- Interoperability: Since Avro stores schema information with the data, it enables cross-language interoperability. Applications written in different programming languages (such as Java, Python, etc.) can easily read and write Avro data.
- Schema Evolution: Avro supports schema evolution, meaning you can modify the schema of your data (e.g., adding new fields or removing existing ones) while maintaining backward and forward compatibility. This feature is useful when managing large datasets over time.
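A minimal sketch of schema-based Avro serialization using fastavro; the schema, file name, and record values are illustrative:

```python
# Minimal sketch, assuming fastavro is installed; the schema and
# records are illustrative.
from fastavro import parse_schema, reader, writer

# The JSON schema travels with the file, so any Avro reader in any
# language can interpret the records consistently.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": "string"},
    ],
})

with open("users.avro", "wb") as out:
    writer(out, schema, [{"id": 1, "email": "a@example.com"}])

with open("users.avro", "rb") as f:
    for record in reader(f):  # schema is read back from the file header
        print(record)
```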
What can I store in Azure Data Lake Storage Gen2?
Structured: Avro, Parquet.
Semi-structured: CSV, JSON, XML.
Unstructured: Images, Video.
What is Azure Data Lake Storage Gen2 running on top of?
Azure Blob Storage Service.
How do I create a Gen2 Azure Data Lake?
By creating an Azure Storage (Blob) account and enabling the hierarchical namespace option.
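A hedged sketch of the same step with the azure-mgmt-storage Python SDK; the subscription ID, resource group, account name, and region are placeholders:

```python
# Sketch only: subscription, resource group, account name, and region
# are placeholders you must replace.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# is_hns_enabled=True is the "hierarchical namespace" checkbox: it is
# what turns a plain Blob account into a Gen2 Data Lake.
poller = client.storage_accounts.begin_create(
    "<resource-group>",
    "<account-name>",
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,
    },
)
account = poller.result()
print(account.name, account.is_hns_enabled)
```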
What are examples of structured file types for Azure Data Lake?
Avro
Parquet
What are examples of semi-structured file types for Azure Data Lake?
XML
JSON
YAML
What are examples of unstructured file types for Azure Data Lake?
images
movies
What performance levels can I have with Azure Data Lake?
Standard (Standard Blob)
Premium (Premium Blob)
For Azure Data Lake, do all the Azure Storage functions still work? Is it built on top of Azure Storage?
Most functions work, but not all; it is built on top of Azure Blob Storage.
For Azure Data Lake, do I still get lifecycle and tiered storage?
Yes; lifecycle management policies and access tiers (hot, cool, archive) still apply.
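A hedged sketch of a lifecycle rule via azure-mgmt-storage, tiering blobs to cool 30 days after modification; the names are placeholders and the exact dict shape may vary across SDK versions:

```python
# Hedged sketch: subscription, resource group, and account names are
# placeholders; the nested dict mirrors the SDK's model attributes,
# but the exact shape may vary across azure-mgmt-storage versions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Tier block blobs under raw/ to the cool tier 30 days after last write.
client.management_policies.create_or_update(
    "<resource-group>",
    "<account-name>",
    "default",  # the management policy name is always "default"
    {
        "policy": {
            "rules": [
                {
                    "enabled": True,
                    "name": "cool-after-30-days",
                    "type": "Lifecycle",
                    "definition": {
                        "filters": {
                            "blob_types": ["blockBlob"],
                            "prefix_match": ["raw/"],
                        },
                        "actions": {
                            "base_blob": {
                                "tier_to_cool": {
                                    "days_after_modification_greater_than": 30
                                }
                            }
                        },
                    },
                }
            ]
        }
    },
)
```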
In Azure Data Lake, you get a hierarchical namespace; what do you also get in permissions?
POSIX-style permissions (access control lists) on directories and files.
Can you access Azure Data Lake data through the standard Azure Storage (Blob) API?
You can, but you do not get the hierarchical namespace or POSIX permissions that way; you should use the DFS endpoint of the Azure Storage API and, if required, the ABFS driver.
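A minimal sketch with the azure-storage-file-datalake SDK showing both points, the DFS endpoint and a POSIX-style ACL; the account, file system, and directory names are placeholders:

```python
# Sketch only: account, file system, and directory names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Gen2 data goes through the DFS endpoint, not <account>.blob.core.windows.net.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.get_file_system_client("raw")
directory = fs.create_directory("sales/2024")

# POSIX-style ACL, available only because the hierarchical namespace is on.
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```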
What is the Azure Data Lake DFS?
It’s the distributed file system endpoint that Azure Data Lake Storage Gen2 exposes; data is accessed through its REST API (the dfs.core.windows.net endpoint).
What is the Azure Data Lake ABFS Driver?
It’s the Azure Blob File System (ABFS) driver, which lets Hadoop-ecosystem systems such as Hadoop and Spark interact with Azure Data Lake Storage Gen2.
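A minimal PySpark sketch; the container, account, and path are placeholders, and it assumes a cluster already configured with ADLS credentials and the hadoop-azure (ABFS) jars:

```python
# Sketch only: assumes the cluster has the hadoop-azure/ABFS jars and
# credentials configured; container, account, and path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-abfs-demo").getOrCreate()

# The abfss:// scheme routes reads and writes through the ABFS driver.
df = spark.read.parquet(
    "abfss://<container>@<account>.dfs.core.windows.net/data/scores.parquet"
)
df.show()
```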