Azure Data Factory Flashcards

1
Q

What does Azure Data Factory Do?

A

Enables data to be moved from internal or external services and applications, transformed, and stored in a final location.

2
Q

What does a typical orchestration look like for Azure Data Factory?

A

Dataset -> Pipeline -> Output Data -> Linked Service -> (Azure Data Lake, Blob Storage, SQL)

3
Q

What is an Azure Data Factory Linked Service?

A

Contains the information needed to connect to external data sources (like a SQL database connection string).
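As a sketch of how a linked service is defined in code, here is a minimal example using the azure-mgmt-datafactory Python SDK; the resource group, factory name, linked service name, and connection string are placeholders, not values from these cards:

```python
# Sketch: creating a linked service with the azure-mgmt-datafactory SDK.
# "my-rg", "my-factory", and the connection string are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlDatabaseLinkedService,
    LinkedServiceResource,
    SecureString,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The linked service stores only connection information -- here, a SQL
# connection string wrapped as a SecureString.
ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(
            value="Server=tcp:myserver.database.windows.net;Database=mydb;User ID=etl;Password=<secret>"
        )
    )
)
client.linked_services.create_or_update("my-rg", "my-factory", "AzureSqlLinkedService", ls)
```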

4
Q

What is an Azure Data Factory Gateway?

A
  1. Connects your on-premises environment to the Azure cloud.
  2. It consists of a client agent that is installed on-premises and then connects to Azure Data Factory.

5
Q

What does Azure Data Factory help us perform?

A

Orchestrating the moving, transforming, and loading of data.

6
Q

What methods can we use to build Azure Data Factory pipelines?

A

CLI
API
PowerShell
CI/CD
Portal

7
Q

What is an Azure Data Factory Pipeline?

A

It is a series of activities that perform tasks like copying, transforming, and storing data.

8
Q

I want a code-based method to create Azure Data Factory pipelines. What are my options?

A

Use a CI/CD platform such as GitHub or Azure DevOps to hold the code that creates your Azure Data Factory pipeline; other code routes include the REST API, PowerShell, the Azure CLI, and the SDKs (for example, Python or .NET).
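For instance, a minimal sketch of the SDK route using Python (azure-mgmt-datafactory); the resource group, factory, pipeline, and dataset names are illustrative, and the two datasets are assumed to exist already:

```python
# Sketch: defining a pipeline with one copy activity via the Python SDK.
# "InputDataset" and "OutputDataset" are hypothetical, pre-existing datasets.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# One copy activity moving data from the input dataset to the output dataset.
copy = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(reference_name="InputDataset")],
    outputs=[DatasetReference(reference_name="OutputDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)
client.pipelines.create_or_update(
    "my-rg", "my-factory", "CopyPipeline", PipelineResource(activities=[copy])
)
```

Checking a definition like this into Git and running it from a CI/CD workflow gives you the versioned, code-first setup described above.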

9
Q

How would you connect to an on-premises SQL database in Azure Data Factory?

A

Use a linked service; SQL Server is a supported type. Because the database is on-premises, connect through a self-hosted integration runtime.
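A hedged sketch with the Python SDK: a SqlServerLinkedService whose connect_via references a self-hosted integration runtime. The runtime name "MySelfHostedIR", the connection string, and all resource names are placeholders:

```python
# Sketch: an on-premises SQL Server linked service. connect_via points at a
# self-hosted integration runtime ("MySelfHostedIR", assumed to already exist)
# so the linked service can reach the private network.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeReference,
    LinkedServiceResource,
    SecureString,
    SqlServerLinkedService,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ls = LinkedServiceResource(
    properties=SqlServerLinkedService(
        connection_string=SecureString(
            value="Server=onprem-sql01;Database=Sales;User ID=etl;Password=<secret>"
        ),
        connect_via=IntegrationRuntimeReference(reference_name="MySelfHostedIR"),
    )
)
client.linked_services.create_or_update("my-rg", "my-factory", "OnPremSqlLinkedService", ls)
```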

10
Q

How would you connect to an on-premises SFTP server using Azure Data Factory?

A

Use a linked service in Azure Data Factory; it supports SFTP. As with other on-premises sources, route it through a self-hosted integration runtime.

11
Q

How would you connect to a Cosmos DB database using Azure Data Factory?

A

Use a linked service in Azure Data Factory; it supports Cosmos DB.

12
Q

How would you connect to a REST API in Azure Data Factory?

A

Use a linked service in Azure Data Factory; it supports REST.

13
Q

List some of the supported linked service types for Azure Data Factory.

A

Azure Blob Storage
Azure Data Lake Storage Gen1 and Gen2
Azure SQL Database
Azure Synapse Analytics (formerly SQL Data Warehouse)
Azure Cosmos DB
Amazon S3
Amazon Redshift
Google BigQuery
Oracle Database
SQL Server
MySQL
PostgreSQL
SAP HANA
Salesforce
REST
SFTP
File System

14
Q

Describe a linked service in Azure Data Factory.

A

It connects to external data sources such as file systems, SQL Server, and SAP, and is used to pull datasets into Azure Data Factory.

15
Q

What can we use to trigger a pipeline in Azure Data Factory?

A

Schedule Trigger: This allows you to run pipelines on a recurring schedule.

Tumbling Window Trigger: Useful for time-based workflows, executing pipelines at periodic time intervals.

Event-based Trigger: Responds to events, such as file creation or deletion in Azure Blob Storage.

Manual Trigger: Allows you to start a pipeline run on-demand (a code sketch follows this list).

Custom Events Trigger: Reacts to custom events published to an Azure Event Grid topic.

Storage Event Trigger: Responds to specific Azure Blob Storage or Azure Data Lake Storage Gen2 events.

REST API: Programmatically trigger pipelines using the ADF REST API.

PowerShell: Use Azure PowerShell cmdlets to trigger pipeline runs.

Azure CLI: Trigger pipelines using Azure Command-Line Interface commands.
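As one concrete example, here is a manual (on-demand) run through the Python SDK, the equivalent of the REST createRun call; the resource group, factory, and pipeline names are placeholders:

```python
# Sketch: starting and polling an on-demand pipeline run via the Python SDK.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start the run (the SDK counterpart of the REST createRun endpoint).
run = client.pipelines.create_run("my-rg", "my-factory", "CopyPipeline", parameters={})

# Poll the run by its ID to see how it finished.
status = client.pipeline_runs.get("my-rg", "my-factory", run.run_id)
print(status.status)  # e.g. "InProgress", "Succeeded", "Failed"
```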

16
Q

I require the ability to self-host my integration runtime. What should I use?

A

Use a self-hosted integration runtime (SHIR). Install the runtime agent on a machine inside your own network; it then connects out to Azure Data Factory, letting pipelines reach data stores that are not publicly accessible.

17
Q

What is an Integration Runtime for Azure Data Factory?

A

It is the compute infrastructure Azure Data Factory uses to run pipeline activities such as data flows, transformations, and data movement.

18
Q

List the types of Integration Runtime for Azure Data Factory.

A

Azure (managed)
Self-hosted
Azure-SSIS

19
Q

From an authentication perspective, what must we have when using Azure Data Factory to access other Azure services, assuming we want the option that requires the least management?

A

Managed Identity

The Azure Data Factory managed identity provides the authorization needed to access the required services (grant it the appropriate RBAC roles).
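For illustration, a linked service that relies on the factory's managed identity instead of stored secrets, sketched with the Python SDK; this assumes the identity has already been granted an RBAC role (for example, Storage Blob Data Reader) on the storage account, and the account and resource names are placeholders:

```python
# Sketch: a Blob Storage linked service with no stored secret. Supplying only
# the service endpoint makes the linked service authenticate with the data
# factory's managed identity (which must hold an RBAC role on the account).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    LinkedServiceResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        service_endpoint="https://mystorageacct.blob.core.windows.net"
    )
)
client.linked_services.create_or_update("my-rg", "my-factory", "BlobViaManagedIdentity", ls)
```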

20
Q

What types of destinations can I send Azure Data Factory datasets to?

A

Azure Data Storage:

Azure Blob Storage
Azure Data Lake Storage Gen1 and Gen2
Azure Files

Azure Databases:

Azure SQL Database
Azure Synapse Analytics (formerly SQL Data Warehouse)
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Database for MariaDB
Azure Cosmos DB

Other Microsoft Services:

Microsoft Dynamics 365
Power BI

Relational Databases:

SQL Server (on-premises or on Azure VMs)
Oracle Database
IBM DB2

NoSQL Databases:

MongoDB (on-premises or Azure Cosmos DB’s API for MongoDB)

File Systems:

HDFS (Hadoop Distributed File System)

Generic Protocols:

OData
ODBC

Analytics Platforms:

Azure Databricks
Azure HDInsight (Hadoop, Spark, etc.)

SaaS Applications:

Salesforce
SAP HANA

21
Q

Can Data Factory provide SQL Server Integration Services (SSIS)?

A

Yes: Azure Data Factory can lift and shift existing SSIS packages and run them in the cloud using the Azure-SSIS integration runtime. For background, SQL Server Integration Services (SSIS) is a data integration and workflow automation tool that is part of Microsoft's SQL Server. It is specifically designed for building high-performance data integration, transformation, and migration solutions, making it popular in data warehousing and data migration projects. Here's an overview of what SSIS does and its key components:

  1. Data Extraction, Transformation, and Loading (ETL)
    Extraction: SSIS connects to multiple data sources, including SQL databases, flat files, Excel sheets, Oracle, and other OLE DB/ODBC-compliant sources. It pulls data from these sources to begin ETL workflows.
    Transformation: Once data is extracted, SSIS applies various transformations to cleanse, reshape, and standardize the data. Transformations include data filtering, sorting, aggregation, and applying custom logic (e.g., derived columns).
    Loading: The transformed data is then loaded into a target system, which could be another database, a data warehouse, or other destination types like files or APIs.
  2. Workflow Automation
    SSIS allows users to define complex workflows to automate tasks like data transfers, database maintenance, and sending notifications. Workflows are visually created using the SSIS designer in SQL Server Data Tools (SSDT), which uses a drag-and-drop interface for assembling data flows and task-based workflows.

22
Q

When should you use the default Azure Integration Runtime rather than a Self-Hosted Integration Runtime?

A

Use the default Azure Integration Runtime when:

Accessing Publicly Accessible Data Stores: If your data sources and destinations are accessible over the public internet without firewall restrictions, the Azure IR is suitable. It provides a fully managed, serverless compute environment that handles data movement and transformation tasks efficiently.

Cloud-to-Cloud Data Movement: For data transfers between cloud-based services, such as Azure Blob Storage and Azure SQL Database, the Azure IR offers optimal performance and scalability.

Minimal Maintenance Needs: Azure IR is managed by Microsoft, eliminating the need for manual updates or infrastructure maintenance, which is beneficial if you prefer a hands-off approach.

23
Q

When should you use a Self-Hosted Integration Runtime rather than the default Azure Integration Runtime?

A

Use a Self-Hosted Integration Runtime when:

Accessing On-Premises or Private Network Data Stores: SHIR is necessary if your data resides behind firewalls, within on-premises environments, or in private networks. It enables secure data integration by installing the runtime on a machine within your network, facilitating communication with ADF or Synapse.

Custom Component or Driver Requirements: When your data integration tasks require specific components, such as custom ODBC drivers or Java Runtime Environment (JRE), SHIR allows you to install these on the host machine, providing the necessary flexibility.

Static IP Address Needs: If your data sources require access from a known static IP address, deploying SHIR on a machine with a fixed IP ensures consistent connectivity, which is crucial for certain security configurations.
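To make this concrete, a sketch of registering a self-hosted integration runtime and retrieving the key used to pair the on-premises agent with it, using the Python SDK; all names and the subscription ID are placeholders:

```python
# Sketch: registering a self-hosted integration runtime, then fetching the
# authentication key that the on-premises agent uses to pair with it.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Register the SHIR in the factory...
client.integration_runtimes.create_or_update(
    "my-rg",
    "my-factory",
    "MySelfHostedIR",
    IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime()),
)

# ...then fetch an authentication key for the agent installed on-premises.
keys = client.integration_runtimes.list_auth_keys("my-rg", "my-factory", "MySelfHostedIR")
print(keys.auth_key1)  # enter this key in the on-premises agent to register it
```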