Azure Data Lake Flashcards

1
Q

What is the Parquet file type, and where is it used?

A
  1. Structured
  2. A Parquet file is a columnar storage file format that is designed for efficient data processing, particularly in large-scale data systems. It was created as part of the Apache Hadoop ecosystem and is commonly used in big data platforms such as Apache Spark, Apache Hive, and Apache Impala. Parquet is optimized for both storage efficiency and performance, especially for analytical and read-heavy workloads.
  3. Columnar Format: Unlike row-based formats (like CSV or Avro), Parquet stores data column by column. This means that when querying data, only the necessary columns need to be read, significantly reducing I/O operations and improving query performance, especially for large datasets.
  4. Efficient Compression: Parquet supports highly efficient compression and encoding schemes (such as Snappy, Gzip, LZO, etc.) at the column level. This reduces storage requirements and speeds up query processing by only decompressing the necessary data.
  5. Splittable: Parquet files can be split into chunks, making them ideal for distributed data processing frameworks like Hadoop and Spark, where multiple nodes can read parts of a file in parallel.
  6. Schema: Like Avro, Parquet files are schema-based, meaning the structure of the data is stored with the file itself, allowing different applications to interpret the data consistently.
  7. Supports Complex Data Types: Parquet can handle complex data types such as nested structures, arrays, and maps, which makes it versatile for different types of data and use cases.
  8. Compatibility: Parquet is compatible with various big data tools and languages, including Apache Hive, Apache Drill, Apache Impala, Apache Spark, and even cloud platforms like AWS (Amazon S3) and Azure Data Lake.
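The columnar idea in point 3 can be sketched in plain Python. This is an illustrative, stdlib-only sketch of why column-wise layout cuts I/O for analytical queries; it is not the actual Parquet encoding, which adds row groups, encodings, compression, and footer metadata.

```python
# Row-based layout (like CSV): every query touches whole rows.
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob",   "score": 75},
    {"id": 3, "name": "carol", "score": 88},
]

# Columnar layout (like Parquet): each column is stored contiguously.
columns = {
    "id":    [1, 2, 3],
    "name":  ["alice", "bob", "carol"],
    "score": [90, 75, 88],
}

# A query that needs only one column reads only that column's data,
# which is what reduces I/O for read-heavy analytical workloads.
avg_score = sum(columns["score"]) / len(columns["score"])
print(avg_score)
```

In the row layout, computing the same average would have to scan every field of every row; in the column layout only the `score` list is touched.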
2
Q

What is the Avro file type, and where is it used?

A
  1. Structured
  2. An Avro file is a binary file format used for serializing data, typically in big data processing and storage systems.
  3. Avro is part of the Apache Hadoop ecosystem and is often used with Apache Kafka, Apache Spark, and Apache Hive.
  4. Schema-Based: Avro uses a JSON-based schema to define the structure of the data. This schema is stored with the data, ensuring that the data can be read and written consistently even across different systems.
  5. Compact & Efficient: Avro files are compressed binary formats, making them more efficient in terms of storage space and data transfer compared to text formats like JSON or CSV.
  6. Splittable: Avro files are designed to be splittable, meaning large Avro files can be divided into smaller chunks, making them ideal for distributed systems like Hadoop, where parallel processing is common.
  7. Interoperability: Since Avro stores schema information with the data, it enables cross-language interoperability. Applications written in different programming languages (such as Java, Python, etc.) can easily read and write Avro data.
  8. Schema Evolution: Avro supports schema evolution, meaning you can modify the schema of your data (e.g., adding new fields or removing existing ones) while maintaining backward and forward compatibility. This feature is useful when managing large datasets over time.
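The schema-evolution point can be sketched with the standard library. This is illustrative only: real Avro serializes records to a compact binary format, but the resolution rule shown here, filling a missing field from the reader schema's default, is the same idea that gives backward compatibility. The `User` schema and field names are made-up examples.

```python
import json

# Reader's schema: the writer originally had only "id" and "name";
# "email" was added later with a default, so old records stay readable.
reader_schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")

old_record = {"id": 1, "name": "alice"}  # written before "email" existed

# Resolve the old record against the new schema by filling in defaults.
resolved = {
    f["name"]: old_record.get(f["name"], f.get("default"))
    for f in reader_schema["fields"]
}
print(resolved)  # {'id': 1, 'name': 'alice', 'email': None}
```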
3
Q

What can I store in Azure Data Lake Storage Gen2?

A

Structured: Avro, Parquet.
Semi-structured: CSV, JSON, XML.
Unstructured: Images, video.

4
Q

What is Azure Data Lake Storage Gen2 running on top of?

A

Azure Blob Storage Service.

5
Q

How do I create a Gen2 Azure Data Lake?

A

By creating an Azure Blob Storage account and enabling the hierarchical namespace option.
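Assuming the Azure CLI (`az`) is installed and you are logged in, provisioning looks roughly like this; the account and resource-group names are placeholders, and the hierarchical-namespace flag is what turns a Blob account into ADLS Gen2.

```shell
# Create a StorageV2 account with the hierarchical namespace enabled.
# "mydatalake" and "my-rg" are placeholder names.
az storage account create \
  --name mydatalake \
  --resource-group my-rg \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true
```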

6
Q

What are examples of structured file types for Azure Data Lake?

A

Avro
Parquet

7
Q

What are examples of semi-structured file types for Azure Data Lake?

A

XML
JSON
YAML

8
Q

What are examples of unstructured file types for Azure Data Lake?

A

Images
Videos

9
Q

What performance levels can I have with Azure Data Lake?

A

Standard (Standard Blob)
Premium (Premium Blob)

10
Q

Since Azure Data Lake Storage Gen2 is built on top of Azure Storage, do all the Azure Storage features still work?

A

Most features still work, but not all.

11
Q

For Azure Data Lake, do I still get lifecycle and tiered storage?

A

Yes

12
Q

In Azure Data Lake, you get a hierarchical namespace; what do you also get in permissions?

A

You also get POSIX-style permissions (access control lists) on directories and files.
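A POSIX permission entry is usually shown as a 9-character string covering owner, group, and other. A minimal sketch of how that notation decodes (illustrative only; the real ACLs are set and evaluated by the service, and the `parse_posix` helper is a made-up name):

```python
def parse_posix(perm: str) -> dict:
    """Parse a 9-character string like 'rwxr-x---' into per-scope bits."""
    assert len(perm) == 9
    scopes = {}
    for i, scope in enumerate(("owner", "group", "other")):
        r, w, x = perm[3 * i : 3 * i + 3]  # each scope gets 3 characters
        scopes[scope] = {"read": r == "r", "write": w == "w", "execute": x == "x"}
    return scopes

acl = parse_posix("rwxr-x---")
print(acl["group"])  # {'read': True, 'write': False, 'execute': True}
```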

13
Q

Can you access Azure Data Lake data through the standard Azure Storage (Blob) API?

A

You can, but that way you do not get the hierarchical namespace or POSIX permissions; you should use the storage account's DFS endpoint instead, and, if required, the ABFS driver.

14
Q

What is the Azure Data Lake DFS?

A

It’s the distributed file system that Azure Data Lake Storage Gen2 uses; it is accessed through the account’s DFS endpoint of the API.

15
Q

What is the Azure Data Lake ABFS Driver?

A

It’s the Azure Blob File System (ABFS) driver, which allows big data systems such as Hadoop and Spark to interact with Azure Data Lake Storage Gen2.
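The ABFS driver addresses data with its own URI scheme: `abfss://<container>@<account>.dfs.core.windows.net/<path>` (`abfss` over TLS, `abfs` without). A small sketch that builds such a URI; the `abfs_uri` helper and the container/account names are made-up placeholders:

```python
def abfs_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build an ABFS URI for a path in an ADLS Gen2 account."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

uri = abfs_uri("raw", "mydatalake", "/sales/2024/data.parquet")
print(uri)  # abfss://raw@mydatalake.dfs.core.windows.net/sales/2024/data.parquet
```

Note the `dfs.core.windows.net` host: the ABFS driver talks to the DFS endpoint from the previous card, not the `blob.core.windows.net` endpoint.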
