Ch. 1 - Get started with data engineering on Azure Flashcards
What is data integration?
Establishing links between data sources to enable access to data across multiple systems.
What is data transformation?
Transforming operational data into a suitable structure and format for analysis. Often part of an extract, transform, and load (ETL) process.
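A minimal sketch of a transformation step, assuming hypothetical order records whose fields arrive as strings (the field names and values are made up for illustration):

```python
from datetime import datetime

# Hypothetical operational records, as an app might store them.
orders = [
    {"id": "1001", "placed": "2024-03-01T09:15:00", "amount": "19.99"},
    {"id": "1002", "placed": "2024-03-01T17:40:00", "amount": "5.00"},
]

def transform(order):
    """Convert string fields into typed, analysis-friendly values."""
    placed = datetime.fromisoformat(order["placed"])
    return {
        "order_id": int(order["id"]),
        "order_date": placed.date().isoformat(),  # keep only the date part
        "amount": float(order["amount"]),
    }

analytical_rows = [transform(o) for o in orders]
```

In a real pipeline this is the "T" of ETL and would normally run in a tool such as Spark rather than plain Python.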
What is data consolidation?
Combining data that has been extracted from multiple data sources into a consistent structure, usually to support analytics and reporting.
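As a rough illustration of consolidation, the sketch below maps two hypothetical sources with different field names onto one shared structure (the sources, fields, and values are assumptions, not taken from any real system):

```python
# Two hypothetical sources describing customers with different field names.
crm_customers = [{"CustomerID": 1, "FullName": "Ana Ruiz"}]
web_signups = [{"user_id": 2, "name": "Li Wei"}]

def from_crm(row):
    """Map a CRM record onto the shared structure."""
    return {"customer_id": row["CustomerID"], "name": row["FullName"], "source": "crm"}

def from_web(row):
    """Map a web sign-up record onto the shared structure."""
    return {"customer_id": row["user_id"], "name": row["name"], "source": "web"}

# One list with a consistent set of keys, ready for reporting.
consolidated = [from_crm(r) for r in crm_customers] + [from_web(r) for r in web_signups]
```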
What is operational data?
Operational data is typically transactional data that is generated and stored by apps, often in a non-relational or relational database.
What is analytical data?
Analytical data has been optimized for analysis and reporting, often in a data warehouse.
What is streaming data?
Refers to perpetual sources of data that generate values in real time, often relating to specific events (for example, readings from IoT devices).
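One way to picture a perpetual source is a generator that never ends, with a consumer that updates its result as each event arrives. This is only a sketch: the sensor, its name, and its value range are hypothetical.

```python
import itertools
import random

def sensor_readings(seed=0):
    """Endless stream of hypothetical IoT temperature events."""
    rng = random.Random(seed)
    for n in itertools.count():
        yield {"event": n, "device": "sensor-1", "temp_c": round(rng.uniform(18, 25), 1)}

def running_average(stream):
    """Emit an updated average after every event, without waiting for the stream to end."""
    total = 0.0
    for count, reading in enumerate(stream, start=1):
        total += reading["temp_c"]
        yield total / count

# The stream is unbounded, so take only the first five results.
first_five = list(itertools.islice(running_average(sensor_readings()), 5))
```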
What is a data lake?
A storage repository that holds large amounts of data in native, raw formats. It is optimized for scaling to massive volumes of data, which typically comes from multiple heterogeneous sources and may be structured, semi-structured, or unstructured.
The idea is to store everything in its original, untransformed state.
What is a data warehouse?
Centralized repository of integrated data from one or more disparate sources. Stores current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.
Data engineers are responsible for designing and implementing relational data warehouses, and managing regular data loads into tables.
What is Apache Spark?
Parallel processing framework that takes advantage of in-memory processing and distributed file storage.
Data engineers need to be proficient with Spark, using notebooks and other code artifacts to process data in a data lake and prepare it for modeling and analysis.
How is Azure Data Lake Gen2 Hadoop compatible?
- You can treat the data as if it's stored in a Hadoop Distributed File System (HDFS): keep it in one location and access it through compute technologies (Azure Databricks, Azure HDInsight, and Azure Synapse Analytics) without moving it between environments. You can also store it in the Parquet (columnar) format.
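Hadoop-compatible engines usually address Gen2 data through the ABFS driver's URI scheme. The helper below is a small sketch that builds such a URI; the container name, account name, and path are hypothetical placeholders.

```python
def abfss_uri(container, account, path):
    """Build an ABFS (TLS) URI for a file or directory in Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# Hypothetical container, storage account, and file path.
uri = abfss_uri("sales", "contosolake", "/raw/2024/orders.parquet")
```

The same URI can be handed to any of the compute services listed above, which is what lets them share one copy of the data.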
Explain ADLG2 Security Features
- Supports access control lists (ACLs) and Portable Operating System Interface (POSIX) permissions that don’t inherit the permissions of the parent directory. Security is configurable via Hive, Spark, or Azure Storage Explorer.
Explain ADLG2 File Storage
- Stores data in a hierarchy of directories and subdirectories, like a file system, for ease of navigation.
Explain ADLG2 Data Redundancy
- Data Lake Storage takes advantage of the Azure Blob replication models, with locally redundant storage (LRS) and geo-redundant storage (GRS) options.
How are Blob files stored?
Blobs let you store large amounts of unstructured (“object”) data in a flat namespace within a blob container. Blob names can include “/” characters to organize blobs into virtual “folders”, but the blobs are actually stored as a single-level hierarchy in a flat namespace. They are accessed via HTTP/HTTPS.
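The "virtual folder" idea can be sketched in plain Python: the store is just a flat list of names, and folders appear only when a listing groups names by prefix and delimiter. The blob names and the `list_level` helper are hypothetical, and this is a toy model rather than the storage service's API.

```python
# A flat namespace: every blob is just a name; "/" has no special meaning to the store.
blobs = ["raw/2024/a.csv", "raw/2024/b.csv", "raw/2023/c.csv", "readme.txt"]

def list_level(names, prefix="", delimiter="/"):
    """Emulate a delimiter-based listing over a flat list of blob names."""
    files, folders = [], set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Anything deeper than this level is reported as a virtual folder.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(name)
    return sorted(folders), sorted(files)

top_level = list_level(blobs)
raw_level = list_level(blobs, prefix="raw/")
```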
How does Azure Data Lake Storage Gen2 compare to Azure Blobs?
Azure Data Lake Storage Gen2 builds on blob storage and optimizes I/O of high-volume data by using a hierarchical namespace that organizes blob data into directories, and stores metadata about each directory and the files within it.
This structure allows operations, such as directory renames and deletes, to be performed in a single atomic operation.
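A toy comparison, under the assumption that a flat namespace can be modeled as a dictionary of full names and a hierarchical one as nested dictionaries. It is not how Azure implements either service; it only shows why a directory rename touches every blob in one model and a single entry in the other.

```python
def rename_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: every blob under the prefix is renamed individually."""
    operations = 0
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
        operations += 1
    return operations

def rename_hierarchical(parent, old_name, new_name):
    """Hierarchical namespace: only the directory entry itself changes."""
    parent[new_name] = parent.pop(old_name)
    return 1

# Hypothetical contents: the same three files in both models.
flat = {"raw/a.csv": b"1", "raw/b.csv": b"2", "raw/c.csv": b"3"}
tree = {"raw": {"a.csv": b"1", "b.csv": b"2", "c.csv": b"3"}}

flat_ops = rename_flat(flat, "raw/", "staged/")       # one operation per blob
tree_ops = rename_hierarchical(tree, "raw", "staged")  # one operation in total
```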