Section 3 - Azure Data Lake (ADLS) Flashcards
What is a Data Lake?
A Data Lake is simply a central data storage or repository that holds
massive amounts of data in its original format until it is needed for operational use or data mining.
This data can be structured or unstructured and can include things like audio, images, video, log files, tables, et cetera.
Data Lakes differ from Data Warehouses in that the data is not governed or structured on the way into the Data Lake.
This makes it extremely efficient for storing and processing massive amounts of data.
DP-200 focuses on Generation 2.
Generation 2 combines the features of Generation 1, which are
- file system semantics
- directories
- file-level security
with capabilities from Azure Blob storage:
- low-cost tiered storage
- disaster recovery
Blob storage is storage that is optimized for storing massive amounts of
unstructured data and does not adhere to a particular data model or
definition.
What are the Key Characteristics of Azure Data Lake (ADLS)?
- Designed for enterprise big data analytics.
- The Data Lake is able to store and serve many exabytes of data (1 exabyte = 250 million DVDs. 5 exabytes = all words ever spoken by human beings).
- Azure Data Lake exposes a Hadoop-compatible (HDFS) file system,
- so it works with major Hadoop-ecosystem tools (e.g. Spark, Kafka)
- Designed to support throughput for parallel processing scenarios
- Optimized for high-speed data ingestion at large scale, making it ideal for micro-transactions or Internet of Things (IoT) data storage.
- Hierarchical namespace - hierarchy of directories and nested sub-directories (easy to access - like File Explorer),
- This is a major difference between the Data Lake storage and Blob storage
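To make the difference concrete, here is a toy simulation (not the Azure API; all names invented) of why the hierarchical namespace matters: in flat Blob storage, "folders" are only name prefixes, so renaming one means touching every blob under it, while a hierarchical namespace renames a directory in a single metadata operation.

```python
# Illustrative simulation of flat-prefix vs hierarchical "directories".

def rename_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: every blob under the prefix must be renamed
    individually -- cost grows with the number of blobs."""
    ops = 0
    renamed = {}
    for name, data in blobs.items():
        if name.startswith(old_prefix):
            renamed[new_prefix + name[len(old_prefix):]] = data
            ops += 1
        else:
            renamed[name] = data
    return renamed, ops

def rename_hierarchical(tree, old_dir, new_dir):
    """Hierarchical namespace: renaming a directory is one metadata
    operation, no matter how many files it holds."""
    tree[new_dir] = tree.pop(old_dir)
    return tree, 1

blobs = {f"raw/2024/file{i}.csv": b"..." for i in range(1000)}
_, flat_ops = rename_flat(blobs, "raw/", "archive/")

tree = {"raw": {f"file{i}.csv": b"..." for i in range(1000)}}
tree, hns_ops = rename_hierarchical(tree, "raw", "archive")

print(flat_ops)  # 1000 per-blob operations in a flat namespace
print(hns_ops)   # 1 metadata operation with a hierarchical namespace
```

The same asymmetry applies to deletes and permission changes on a directory, which is why analytics workloads that reorganize data benefit from ADLS over plain Blob storage.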
How are Files Accessed in ADLS?
Storage Explorer in preview (web) mode does not allow uploading files
directly - the downloaded (desktop) version is needed.
With the desktop version you can:
- interact with accounts
- create new folders
- upload files
- manage metadata
- manage access
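Files can also be managed programmatically. Below is a minimal sketch using the Python `azure-storage-file-datalake` package; the account URL, filesystem, and folder names are placeholders, and the client is passed in so the function can be exercised without real credentials.

```python
# Hedged sketch: upload a file into an ADLS Gen2 filesystem.
# Assumes the azure-storage-file-datalake SDK's client hierarchy:
# service client -> file system client -> directory -> file.

def upload_to_adls(service_client, filesystem, folder, filename, data):
    """Create the folder if needed and upload `data` as a file."""
    fs_client = service_client.get_file_system_client(filesystem)
    dir_client = fs_client.create_directory(folder)
    file_client = dir_client.create_file(filename)
    file_client.upload_data(data, overwrite=True)
    return f"{filesystem}/{folder}/{filename}"

# Typical construction (requires real credentials; shown for context):
# from azure.storage.filedatalake import DataLakeServiceClient
# service = DataLakeServiceClient(
#     account_url="https://<account>.dfs.core.windows.net",
#     credential="<account-key-or-token>",
# )
# upload_to_adls(service, "myfilesystem", "raw/2024", "events.json", b"{}")
```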
How to Reduce ADLS Costs?
The less compute power required, the less cost associated.
Utilize HDFS-style parallel processing and the hierarchical namespace (HNS) file structure to reduce the amount of processing, which reduces the overall cost.
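One common way to cut processing is partitioning: laying files out so a query reads only the partitions it needs. The sketch below simulates this (the paths and partition scheme are invented for illustration); less data scanned means less compute, and therefore less cost.

```python
# Illustrative partition-pruning simulation: date-partitioned files.
lake = {
    "raw/events/date=2024-01-01.json": [{"user": 1}, {"user": 2}],
    "raw/events/date=2024-01-02.json": [{"user": 3}],
    "raw/events/date=2024-01-03.json": [{"user": 4}, {"user": 5}],
}

def scan(lake, date=None):
    """Return matching records and how many files were read."""
    files_read, records = 0, []
    for path, rows in lake.items():
        if date and f"date={date}" not in path:
            continue  # partition pruning: skip irrelevant files entirely
        files_read += 1
        records.extend(rows)
    return records, files_read

_, full = scan(lake)                   # full scan reads all 3 files
_, pruned = scan(lake, "2024-01-02")   # pruned scan reads only 1 file
print(full, pruned)  # 3 1
```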
Cons of ADLS?
- A Data Lake is unstructured - it does not have a schema - so large amounts of data can be hard to consume or query
- Data Lakes also have an inherent challenge of managing data quality
- Unless good protocols are put into place prior to the movement of data, challenges can arise with who has access to the data, the quality of the data, and who manages the data pipelines
Describe the Process of ETL
- Data Factory moves data from the source application into the Data Lake.
- Once in the Data Lake, it is typical to see zones that the data lives in.
- The first zone is a raw zone and should hold raw data from source applications.
- That data then gets moved through Data Factory and out into a transformational tool, such as Databricks.
- In Databricks, the data gets refined and then moved back into the Data Lake again, via a Data Factory process.
- When it moves back into the Data Lake, the data will go into a different zone containing more refined data that should also be of higher data quality.
- Data Factory can also take multiple zones within a Data Lake and combine files to get views of data from multiple business groups.
- This data may or may not flow through a transformational step before being combined.
- Finally, as the data becomes more refined, it becomes easier to see the end requirements for the business and the data is then pulled out of the Data Lake.
- A variety of different toolsets can be used for this final step.
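The zone flow above can be sketched as a small simulation. The zone names ("raw", "refined", "curated") and the transform rules are illustrative, not an Azure convention; in practice Data Factory moves the data and a tool such as Databricks performs the refinement.

```python
# Conceptual sketch of the ETL zone flow: ingest raw -> refine -> combine.
lake = {"raw": [], "refined": [], "curated": []}

def ingest(lake, source_rows):
    """Land source data untouched in the raw zone (Data Factory's job)."""
    lake["raw"].extend(source_rows)

def refine(lake):
    """Transform raw rows (clean, standardize, drop bad records) and
    write the higher-quality output into the refined zone."""
    for row in lake["raw"]:
        if row.get("amount") is not None:
            lake["refined"].append(
                {"customer": row["customer"].strip().title(),
                 "amount": float(row["amount"])})

def combine(lake):
    """Combine refined data into a curated, business-ready view."""
    totals = {}
    for row in lake["refined"]:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    lake["curated"] = [{"customer": c, "total": t} for c, t in totals.items()]

ingest(lake, [{"customer": " alice ", "amount": "10"},
              {"customer": "alice", "amount": "5"},
              {"customer": "bob", "amount": None}])
refine(lake)
combine(lake)
print(lake["curated"])  # [{'customer': 'Alice', 'total': 15.0}]
```

Note how data quality improves zone by zone: the raw zone keeps everything as delivered, while the curated view is deduplicated, typed, and aggregated for business consumption.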
How to Secure ADLS?
- Authentication is handled through Azure Active Directory OAuth tokens.
- File and folder security and access are handled through Azure role-based access control (RBAC), together with POSIX-style access control lists (ACLs) at the file and folder level.
- Data can be encrypted at rest and in transit,
- Firewalls are provided to restrict where data may be accessed from
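ADLS Gen2 layers coarse role assignments (RBAC) with per-path POSIX-style ACLs. The toy check below illustrates that layering only; it is not the Azure authorization engine, and the role names and ACL entries are invented for the example.

```python
# Conceptual sketch: RBAC grants broad data-plane access; ACLs grant
# per-path permissions when no role applies.
RBAC = {"alice": "Storage Blob Data Contributor", "bob": "Reader"}
ACLS = {"raw/finance": {"alice": "rwx", "bob": "r--"}}

def can_write(user, path):
    """Allow a write if the user holds a data-plane write role, or has
    an ACL entry containing 'w' on the path."""
    if RBAC.get(user) == "Storage Blob Data Contributor":
        return True
    return "w" in ACLS.get(path, {}).get(user, "")

print(can_write("alice", "raw/finance"))  # True  (via role)
print(can_write("bob", "raw/finance"))    # False (read-only ACL)
```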
ADLS Summary Sheet