Ch. 1 - Get started with data engineering on Azure Flashcards
What is Data Integration?
Establishing links between data sources to enable access to data across multiple systems.
What is data transformation?
Converting operational data into a structure and format suitable for analysis, often as part of an extract, transform, and load (ETL) process.
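The "T" step can be sketched in plain Python, using hypothetical order records (field names and values invented for illustration):

```python
# Minimal sketch of an ETL transform step (hypothetical operational records).
from datetime import date

# Transactional rows as they might arrive from a source application.
raw_orders = [
    {"order_id": "1001", "order_date": "2024-03-01", "amount": "19.99"},
    {"order_id": "1002", "order_date": "2024-03-01", "amount": "5.00"},
]

def transform(row):
    """Cast string fields into analysis-friendly types."""
    return {
        "order_id": int(row["order_id"]),
        "order_date": date.fromisoformat(row["order_date"]),
        "amount": float(row["amount"]),
    }

analytical_orders = [transform(r) for r in raw_orders]
```

In practice this step also handles cleansing, deduplication, and conforming data to the warehouse schema.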
What is data consolidation?
Combining data that has been extracted from multiple data sources into a consistent structure - supports analytics and reporting.
What is operational data?
Operational data is typically transactional data that is generated and stored by apps, often in a non-relational or relational database.
What is analytical data?
Analytical data has been optimized for analysis and reporting, often in a data warehouse.
What is streaming data?
Refers to perpetual sources that generate data values in real time, often relating to specific events (for example, IoT devices).
What is a data lake?
A storage repository that holds large amounts of data in its native, raw format. Optimized for scaling to massive volumes of data that comes from multiple heterogeneous sources and may be structured, semi-structured, or unstructured.
Everything is stored in its original, untransformed state.
What is a data warehouse?
Centralized repo of integrated data from one or more disparate sources. Stores current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.
Data engineers are responsible for designing and implementing relational data warehouses, and managing regular data loads into tables.
What is Apache Spark?
A parallel processing framework that takes advantage of in-memory processing and distributed file storage.
Data engineers need to be proficient with Spark, using notebooks and other code artifacts to process data in a data lake and prepare it for modeling and analysis.
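Spark's model (split the data into partitions, process them in parallel in memory, then combine the partial results) can be sketched in plain Python. This is a conceptual illustration only, not the PySpark API; real Spark distributes partitions across cluster nodes:

```python
# Conceptual sketch of Spark's partition-then-aggregate model.
# (Illustration only; Spark runs partitions on executors across a cluster.)
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))                   # a "dataset"
partitions = [data[i::4] for i in range(4)]  # split into 4 partitions

def partial_sum(part):
    # Work performed per partition, entirely in memory.
    return sum(part)

with ThreadPoolExecutor() as pool:
    total = sum(pool.map(partial_sum, partitions))  # combine partial results
```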
How is Azure Data Lake Gen2 Hadoop compatible?
- Data can be treated as if it's stored in HDFS: it stays in one location and is accessed by compute technologies (Azure Databricks, Azure HDInsight, and Azure Synapse Analytics) without moving it. Data can also be stored in Parquet (columnar) format.
Explain ADLG2 Security Features
- Supports access control lists (ACLs) and Portable Operating System Interface (POSIX) permissions that don’t inherit the permissions of the parent directory. Security is configurable through Hive, Spark, or Azure Storage Explorer.
Explain ADLG2 File Storage
- Stores data in a hierarchy of directories, like a file system, for ease of navigation.
Explain ADLG2 Data Redundancy
- Data Lake Storage takes advantage of the Azure Blob replication models, including locally redundant storage (LRS) and geo-redundant storage (GRS) options.
How are Blob files stored?
Blob storage holds large amounts of unstructured (“object”) data in a flat namespace within a blob container. Blob names can include “/” characters to organize blobs into virtual “folders”, but the blobs are actually stored as a single-level hierarchy in a flat namespace. Blobs are accessed via HTTP/HTTPS.
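The flat-namespace idea can be sketched in plain Python with hypothetical blob names: the “folders” are not real objects, just prefixes parsed out of the names:

```python
# Sketch: blob names containing "/" live in a single flat namespace;
# "folders" are purely virtual, derived by parsing the names (hypothetical blobs).
blob_names = [
    "sales/2024/jan.csv",
    "sales/2024/feb.csv",
    "logs/app.log",
]

# A "virtual folder" is just the prefix before the last "/".
virtual_folders = {name.rsplit("/", 1)[0] for name in blob_names if "/" in name}
```

Tools like Azure Storage Explorer render these prefixes as folders, but the storage service itself sees only flat blob names.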
How does Azure Data Lake Storage Gen2 compare to Azure Blobs?
Azure Data Lake Storage Gen2 builds on blob storage and optimizes I/O of high-volume data by using a hierarchical namespace that organizes blob data into directories, and stores metadata about each directory and the files within it.
This structure allows operations, such as directory renames and deletes, to be performed in a single atomic operation.
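Why the hierarchical namespace matters can be sketched with a plain-Python contrast (hypothetical blob names): in a flat namespace, “renaming” a folder means touching every blob individually, whereas a hierarchical namespace makes the directory a real object:

```python
# Sketch: directory rename in a flat namespace requires touching every blob
# (hypothetical blob names; contents elided).
flat = {"raw/a.csv": b"...", "raw/b.csv": b"..."}

# "Rename" the raw/ folder to staged/ by rewriting every blob name:
flat = {name.replace("raw/", "staged/", 1): data for name, data in flat.items()}
# N operations for N blobs; not atomic, and can fail part-way through.

# With a hierarchical namespace, the directory itself is an object, so
# ADLS Gen2 can rename or delete it in a single atomic metadata operation.
```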
What are the four stages of processing big data?
Ingest, Store, Prep and Train, Model and Serve
Data lakes have a fundamental role in a wide range of big data architectures. These architectures can involve the creation of:
An enterprise data warehouse.
Advanced analytics against big data.
A real-time analytical solution.
What is Data Ingestion in processing big data?
Ingest - acquire the source data. For example, batch movement of data with pipelines in Azure Synapse Analytics or Azure Data Factory.
Real-time ingestion could use Apache Kafka for HDInsight or Azure Stream Analytics.
What is Data Store in processing big data?
The store phase identifies where the ingested data should be placed (Azure Data Lake Storage Gen2 for big data).
What is Prep and train phase in processing big data?
Identifies the technologies used to perform data preparation and model training: Azure Synapse Analytics, Azure Databricks, Azure HDInsight, and Azure Machine Learning.
What is Model and Serve in processing big data?
Technologies that present the data to users: Microsoft Power BI and Azure Synapse Analytics.
What is Azure Synapse Link?
Azure Synapse Link is a data integration feature that synchronizes operational data from services like Azure Cosmos DB, Azure SQL Database, SQL Server, and Microsoft Dataverse in near real-time for analytics in Azure Synapse.
What does Microsoft Purview do?
Microsoft Purview is a unified data governance solution that catalogs data assets—including those in Azure Synapse—so data engineers can easily discover data, understand lineage, and track it across pipelines.
What are common use cases for Azure Synapse Analytics?
Large-scale data warehousing
Advanced analytics
Data exploration
Real-time analytics
Data integration
What is a serverless SQL pool?
On-demand SQL query processing, primarily used to work with data in a data lake.
It is not suitable for transactional workloads that require millisecond response times.
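A typical serverless SQL pool query reads files directly from the lake with `OPENROWSET`. Sketched here as a T-SQL string in Python (the storage account, container, and path are hypothetical):

```python
# T-SQL a serverless SQL pool might run against Parquet files in a data lake
# (hypothetical storage account and file path).
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/files/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""
```

You pay per query for the data processed, with no dedicated infrastructure to provision, which is why it suits exploratory work over lake data rather than transactional workloads.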