Ch. 1 - Get started with data engineering on Azure Flashcards

1
Q

What is Data Integration?

A

Establishing links between data sources to enable access to data across multiple systems.

2
Q

What is data transformation?

A

Transforming operational data into a suitable structure and format for analysis, often as part of an extract, transform, and load (ETL) process.
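
As a rough illustration of the idea above, the sketch below reshapes a few transactional rows into daily totals that suit reporting. The `orders` records, their field names, and the `transform` function are all hypothetical and exist only for this example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical operational records: one row per transaction, values stored as text.
orders = [
    {"order_id": 1, "ordered_at": "2024-03-01T09:15:00", "amount": "20.00"},
    {"order_id": 2, "ordered_at": "2024-03-01T17:40:00", "amount": "5.00"},
    {"order_id": 3, "ordered_at": "2024-03-02T08:05:00", "amount": "12.50"},
]


def transform(rows):
    """Reshape transactional rows into one total per calendar day."""
    totals = defaultdict(float)
    for row in rows:
        # Keep only the date part of the timestamp and cast the amount to a number.
        day = datetime.fromisoformat(row["ordered_at"]).date().isoformat()
        totals[day] += float(row["amount"])
    return [
        {"order_date": day, "total_amount": round(total, 2)}
        for day, total in sorted(totals.items())
    ]


daily_sales = transform(orders)
```

Here `daily_sales` holds one row per day, which is closer to what a report needs than the raw transaction log.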

3
Q

What is data consolidation?

A

Combining data that has been extracted from multiple data sources into a consistent structure, usually to support analytics and reporting.
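
A minimal sketch of consolidation, assuming two made-up source systems (`crm_rows` and `web_rows`) that describe customers with different field names and types. The mapping functions are illustrative, not part of any Azure service.

```python
# Hypothetical extracts from two source systems with inconsistent shapes.
crm_rows = [{"CustomerID": 101, "FullName": "Ana Ruiz", "Country": "ES"}]
web_rows = [{"id": "202", "name": "Bo Chen", "country_code": "cn"}]


def from_crm(row):
    """Map a CRM record onto the shared target structure."""
    return {
        "customer_id": int(row["CustomerID"]),
        "name": row["FullName"],
        "country": row["Country"].upper(),
        "source": "crm",
    }


def from_web(row):
    """Map a web-shop record onto the same target structure."""
    return {
        "customer_id": int(row["id"]),  # the web system stores ids as text
        "name": row["name"],
        "country": row["country_code"].upper(),
        "source": "web",
    }


# Every consolidated row now has the same keys and value types.
consolidated = [from_crm(r) for r in crm_rows] + [from_web(r) for r in web_rows]
```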

4
Q

What is operational data?

A

Operational data is typically transactional data that is generated and stored by apps, often in a non-relational or relational database.

5
Q

What is analytical data?

A

Analytical data has been optimized for analysis and reporting, often in a data warehouse.

6
Q

What is streaming data?

A

Perpetual sources of data that generate data values in real time, often relating to specific events (for example, readings from IoT devices).

7
Q

What is a data lake?

A

Storage repository that holds large amounts of data in native, raw formats. It is optimized for scaling to massive volumes of data, which comes from multiple heterogeneous sources and may be structured, semi-structured, or unstructured.

Store everything in its original, untransformed state.

8
Q

What is a data warehouse?

A

Centralized repository of integrated data from one or more disparate sources. It stores current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.

Data engineers are responsible for designing and implementing relational data warehouses, and managing regular data loads into tables.

9
Q

What is Apache Spark?

A

Parallel processing framework that takes advantage of in-memory processing and distributed file storage.

Data engineers need to be proficient with Spark, using notebooks and other code artifacts to process data in a data lake and prepare it for modeling and analysis.

10
Q

How is Azure Data Lake Storage Gen2 Hadoop compatible?

A

You can treat the data as if it were stored in a Hadoop Distributed File System (HDFS). The data stays in one location and is accessed through compute technologies such as Azure Databricks, Azure HDInsight, and Azure Synapse Analytics without being moved. You can also store it in the Parquet (columnar) format.

11
Q

Explain ADLS Gen2 Security Features

A

Supports access control lists (ACLs) and Portable Operating System Interface (POSIX) permissions that don’t inherit the permissions of the parent directory, so permissions can be set separately on a directory and on the files within it. Security is configurable via Hive, Spark, or Azure Storage Explorer.
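
POSIX-style permissions are usually written as three `rwx` triplets (owner, group, other). The sketch below converts that notation to octal and shows a directory and a file carrying independently chosen modes. The helper names and the two example modes are hypothetical; this is plain Python, not a call into any Azure SDK.

```python
def triplet_to_octal(triplet):
    """Convert one permission triplet such as 'r-x' to its octal digit."""
    if len(triplet) != 3:
        raise ValueError(f"expected 3 characters, got {triplet!r}")
    digit = 0
    for char, letter, bit in zip(triplet, "rwx", (4, 2, 1)):
        if char == letter:
            digit += bit
        elif char != "-":
            raise ValueError(f"unexpected character {char!r} in {triplet!r}")
    return digit


def mode_to_octal(mode):
    """Convert nine characters (owner, group, other) to octal text."""
    if len(mode) != 9:
        raise ValueError(f"expected 9 characters, got {mode!r}")
    return "".join(str(triplet_to_octal(mode[i:i + 3])) for i in (0, 3, 6))


# Hypothetical modes: the file's mode is set on its own, not copied from the directory.
directory_mode = mode_to_octal("rwxr-x---")
file_mode = mode_to_octal("rw-r-----")
```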
12
Q

Explain ADLS Gen2 File Storage

A

Stores data in a hierarchy of directories and subdirectories, much like a file system, for ease of navigation.

13
Q

Explain ADLS Gen2 Data Redundancy

A

Data Lake Storage takes advantage of the Azure Blob replication models, with locally redundant storage (LRS) and geo-redundant storage (GRS) options.

14
Q

How are Blob files stored?

A

Blob storage holds large amounts of unstructured (“object”) data in a flat namespace within a blob container. Blob names can include “/” characters to organize blobs into virtual “folders”, but the blobs are actually stored as a single-level hierarchy in a flat namespace. They are accessed via HTTP or HTTPS.
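
One way to picture a flat namespace is a single dictionary keyed by the full blob name. In the sketch below, the `container` contents and the `list_virtual_folder` helper are made up for illustration. A "folder" is nothing more than a shared name prefix.

```python
# A flat namespace modelled as one dictionary keyed by the full blob name.
container = {
    "sales/2024/jan.csv": b"...",
    "sales/2024/feb.csv": b"...",
    "logs/app.log": b"...",
}


def list_virtual_folder(blobs, prefix):
    """Return the blob names sharing a prefix; no directory object exists."""
    return sorted(name for name in blobs if name.startswith(prefix))


sales_2024 = list_virtual_folder(container, "sales/2024/")
```

Listing a "folder" means scanning names for a prefix, which is why directory-level operations touch every blob individually.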

15
Q

How does Azure Data Lake Storage Gen2 compare to Azure Blobs?

A

Azure Data Lake Storage Gen2 builds on blob storage and optimizes I/O of high-volume data by using a hierarchical namespace that organizes blob data into directories, and stores metadata about each directory and the files within it.

This structure allows operations, such as directory renames and deletes, to be performed in a single atomic operation.

16
Q

What are the four stages of processing big data?

A

Ingest, Store, Prep and Train, Model and Serve

Data lakes have a fundamental role in a wide range of big data architectures. These architectures can involve the creation of:

An enterprise data warehouse.
Advanced analytics against big data.
A real-time analytical solution.

17
Q

What is Data Ingestion in processing big data?

A

The ingest phase acquires the source data. For example, batch movement of data can be handled by pipelines in Azure Synapse Analytics or Azure Data Factory.

Real-time ingestion could use Apache Kafka for HDInsight or Azure Stream Analytics.

18
Q

What is Data Store in processing big data?

A

The store phase identifies where the ingested data should be placed (Azure Data Lake Storage Gen2 for big data).

19
Q

What is the Prep and Train phase in processing big data?

A

Identifies the technologies that are used to perform data preparation and model training, such as Azure Synapse Analytics, Azure Databricks, Azure HDInsight, and Azure Machine Learning.

20
Q

What is Model and Serve in processing big data?

A

Presents data to users, using visualization tools such as Microsoft Power BI and analytical data stores such as Azure Synapse Analytics.

21
Q

What is Azure Synapse Link?

A

Azure Synapse Link is a data integration feature that synchronizes operational data from services like Azure Cosmos DB, Azure SQL Database, SQL Server, and Microsoft Dataverse in near real-time for analytics in Azure Synapse.

22
Q

What does Microsoft Purview do?

A

Microsoft Purview is a unified data governance solution that catalogs data assets—including those in Azure Synapse—so data engineers can easily discover data, understand lineage, and track it across pipelines.

23
Q

What are common use cases for Azure Synapse Analytics?

A

Large-scale Data Warehousing
Advanced Analytics
Data Exploration
Real-time analytics
Data integration

24
Q

What is a serverless SQL pool?

A

On-demand SQL query processing, primarily used for work with data in a data lake.

It is not good for transactional data requiring millisecond response times.

25
Q

What is a dedicated SQL pool?

A

Enterprise-scale relational database instances used to host data warehouses in which data is stored in relational tables.

26
Q

What does the OPENROWSET syntax do?

A

A SQL function that lets a serverless SQL pool query data in files of various formats (CSV, JSON, and Parquet).

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.blob.core.windows.net/data/files/*.csv',
    FORMAT = 'csv',
    PARSER_VERSION = '2.0',
    FIRSTROW = 2) AS rows

The PARSER_VERSION is used to determine how the query interprets the text encoding used in the files. Version 1.0 is the default and supports a wide range of file encodings, while version 2.0 supports fewer encodings but offers better performance. The FIRSTROW parameter is used to skip rows in the text file, to eliminate any unstructured preamble text or to ignore a row containing column headings.
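
To make the effect of skipping rows concrete, here is a loose Python analogy using the standard `csv` module. It only mimics the row-skipping idea described above; it does not reflect how the serverless SQL pool reads files. The sample text and the `read_rows` helper (with its `first_row` parameter) are hypothetical.

```python
import csv
import io

# Hypothetical file content: a header row followed by two data rows.
text = "product,quantity\nwidget,3\ngadget,7\n"


def read_rows(stream, first_row=1):
    """Return parsed rows starting at the 1-based row number first_row."""
    reader = csv.reader(stream)
    return [row for number, row in enumerate(reader, start=1) if number >= first_row]


# Starting at row 2 drops the header, leaving only the data rows.
rows = read_rows(io.StringIO(text), first_row=2)
```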

https://learn.microsoft.com/en-us/training/modules/query-data-lake-using-azure-synapse-serverless-sql-pools/3-query-files

27
Q

What is Azure Blob storage?

A

Used to store unstructured data such as videos, audio, metadata, log files, text, and binary.

It can be accessed via Representational State Transfer (REST), one way of implementing web service endpoints.

28
Q

What is a Hierarchical namespace?

A

A hierarchical namespace means that all directories within a storage account contain metadata describing their structure and content. This improves operations such as moving an entire directory structure to a different parent directory, because only the metadata needs to be updated.

29
Q

Difference between Blob Storage and ADLS Gen2?

A

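ADLS Gen2 adds a hierarchical namespace on top of Blob Storage, which makes operations such as renaming and deleting directories atomic and quick. For example, if you have 100 files under a directory in Blob Storage, renaming that directory would require 100 metadata operations, whereas in ADLS Gen2 the rename is a single atomic operation.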
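
The difference in rename cost can be sketched with two toy in-memory models. Both functions and the `flat` and `tree` structures are hypothetical stand-ins, not the storage service's implementation. The point is only the operation count.

```python
def rename_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: every blob under the prefix must be re-keyed one by one."""
    operations = 0
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
        operations += 1
    return operations


def rename_hierarchical(directories, old_name, new_name):
    """Hierarchical namespace: re-key the directory entry once; files move with it."""
    directories[new_name] = directories.pop(old_name)
    return 1


# 100 files under "raw" in each model.
flat = {f"raw/file{i}.csv": b"" for i in range(100)}
tree = {"raw": {f"file{i}.csv": b"" for i in range(100)}}

flat_ops = rename_flat(flat, "raw/", "archive/")
tree_ops = rename_hierarchical(tree, "raw", "archive")
```

In this toy model the flat rename performs 100 updates, while the hierarchical rename performs one.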

30
Q

What are Azure Files?

A

Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol, Network File System (NFS) protocol, and Azure Files REST API. Azure file shares can be mounted concurrently by cloud or on-premises deployments. SMB Azure file shares are accessible from Windows, Linux, and macOS clients.

File shares are provisioned through a general-purpose v2 storage account or through the premium file share tier.

31
Q

What are Azure Queues?

A

Used to store a large number of messages that can be accessed asynchronously between the source and the destination.

Asynchronously means a sender can add a message to the queue and the receiver can retrieve it later when ready.

32
Q

What is Azure Queue Storage?

A

Storage queues can be used for simple asynchronous message processing. They can store up to 500 TB of data (per storage account), and each message can be up to 64 KB in size.

You access messages from anywhere in the world via authenticated calls using HTTP or HTTPS. A queue may contain millions of messages, up to the total capacity limit of a storage account. Queues are commonly used to create a backlog of work to process asynchronously.
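
The backlog pattern can be sketched with an in-memory stand-in. `WorkQueue` below is a hypothetical class written for this example, not the Azure SDK. It enforces the 64 KB per-message limit mentioned above and, as a simplification, hands messages back in the order they were sent.

```python
from collections import deque

MAX_MESSAGE_BYTES = 64 * 1024  # per-message size limit stated on the card


class WorkQueue:
    """In-memory stand-in for a storage queue (illustrative only)."""

    def __init__(self):
        self._messages = deque()

    def send(self, text):
        """Add a message; the sender does not wait for a receiver."""
        if len(text.encode("utf-8")) > MAX_MESSAGE_BYTES:
            raise ValueError("message exceeds 64 KB")
        self._messages.append(text)

    def receive(self):
        """Return the oldest waiting message, or None when the queue is empty."""
        return self._messages.popleft() if self._messages else None


backlog = WorkQueue()
for order_id in (1, 2, 3):  # the sender enqueues work and moves on
    backlog.send(f"process order {order_id}")

processed = []
while (message := backlog.receive()) is not None:  # the receiver drains later
    processed.append(message)
```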

33
Q

How is Azure Service Bus used?

A

A fully managed enterprise messaging service used to reliably transfer business data (sales, purchases, inventory movements, journals) among decoupled systems through message queues and publish-subscribe topics.

34
Q

What are Azure Tables?

A

Azure Tables are key-value stores for structured, non-relational data.

Azure Table Storage is the cost-effective basic option.

Azure Cosmos DB is the premium service, offering advanced features like global distribution, flexible consistency models, and serverless compute.

35
Q

What are Azure Managed Disks?

A

Virtual hard disks that are mounted to an Azure VM, available as standard HDD, standard SSD, premium SSD, and ultra disks.

36
Q

What are Azure VNets?

A

A virtual network (VNet) ties resources such as VMs, storage accounts, and databases together securely in a private network.

It provides four main services:

Security: provides secure connectivity within Azure, using basic VNets, VNet peering, and service endpoints.

Networking: provides networking beyond the Azure cloud, into the internet and hybrid clouds, using ExpressRoute, private endpoints, and point-to-site and site-to-site VPNs.

Filtering: provides network filtering and firewalls that can be implemented via either network security groups or application security groups.

Routing: provides network routing abilities that allow configuration of network routes using route tables and Border Gateway Protocol (BGP).

37
Q

What is Azure Synapse?

A

Azure Synapse is Microsoft’s unified analytics platform that integrates data ingestion, data warehousing, and big data analytics in one environment, enabling teams to ingest, transform, and analyze data at scale.

38
Q
A