DP-203 Flashcards

Exam Prep

1
Q

Case Study Series of questions that represent the same scenario

Your company supports a software as a service (SaaS) application with a large number of customers. You are designing a support database to use elastic pools and elastic queries to retrieve customer information. Customer information is stored in a table that includes the values CustomerID and RegionalID.
You need to partition data to optimize queries by customer sorted by geographic location. The solution should minimize support costs.

Solution: You configure horizontal partitioning based on CustomerID.

Does this solution meet the goal?

A. Yes
B. No

A

B. No

This solution does not meet the goal. If you configure sharding by CustomerID, read performance will be worse because queries need to retrieve customers sorted by geographic location.

2
Q

Case Study Series of questions that represent the same scenario

Your company supports a software as a service (SaaS) application with a large number of customers. You are designing a support database to use elastic pools and elastic queries to retrieve customer information. Customer information is stored in a table that includes the values CustomerID and RegionalID.

You need to partition data to optimize queries by customer sorted by geographic location. The solution should minimize support costs.

Solution: You configure vertical partitioning based on CustomerID.

Does this solution meet the goal?

A. Yes
B. No

A

B. No

This solution does not meet the goal. Vertical partitioning is a way to organize data by columns, and it can make retrieving data through JOIN operations more efficient, but it does not address partitioning customers by geographic location.

3
Q

Case Study Series of questions that represent the same scenario

Your company supports a software as a service (SaaS) application with a large number of customers. You are designing a support database to use elastic pools and elastic queries to retrieve customer information. Customer information is stored in a table that includes the values CustomerID and RegionalID.

You need to partition data to optimize queries by customer sorted by geographic location. The solution should minimize support costs.

Solution: You configure sharding by RegionalID.

Does this solution meet the goal?

A. Yes
B. No

A

A. Yes

This solution meets the goal. Sharding by RegionalID, which would segregate items by region, would help optimize sorting by geographic area. Another reason you might want to use RegionalID for sharding is that it might be necessary to have different parts of the database reside in different geographies for compliance or geopolitical reasons.
Sharding partitions data horizontally to distribute it across multiple databases in a scaled-out design. This requires that the schema is the same on all of the databases involved. Sharding helps to minimize the size of individual databases, which in turn helps to improve transaction processing performance. Hardware support requirements are minimized, which helps to reduce related costs. Elastic queries let you run queries across multiple shards. You can configure and manage sharding through the elastic database tools libraries or through self-sharding.
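For illustration only, a minimal elastic query setup that shards by RegionalID might look like the sketch below. The server address, database, credential, shard map, and table names are hypothetical placeholders:

~~~
-- External data source pointing at a hypothetical shard map manager database.
CREATE EXTERNAL DATA SOURCE CustomerShards
WITH (
    TYPE = SHARD_MAP_MANAGER,
    LOCATION = 'myshardmapserver.database.windows.net',
    DATABASE_NAME = 'ShardMapManagerDb',
    CREDENTIAL = ElasticDbQueryCred,
    SHARD_MAP_NAME = 'CustomerShardMap'
);

-- External table distributed (sharded) by RegionalID across the shards.
CREATE EXTERNAL TABLE dbo.CustomerInformation (
    CustomerID INT NOT NULL,
    RegionalID INT NOT NULL
)
WITH (
    DATA_SOURCE = CustomerShards,
    DISTRIBUTION = SHARDED(RegionalID)
);

-- Elastic queries can then run across all shards from the head database.
SELECT CustomerID, RegionalID
FROM dbo.CustomerInformation
ORDER BY RegionalID, CustomerID;
~~~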

4
Q

Case Study Series of questions that represent the same scenario

You have a data pipeline in Azure Synapse Analytics that ingests data from multiple sources.
You need to process personal information such as first name, last name, and social security number (SSN). Because the data comes from different sources, it contains duplicate records and it also has other issues, such as:

  • In some of the records, the names are in mixed case (Customer), while in others, they are in uppercase (CUSTOMER).
  • The SSNs are sometimes hyphenated (123-45-6789), while others are missing the hyphens (123456789).

You need to remove the duplicates from the input data and normalize the records.

Solution: You transform the data using Spark pool with the Scala code below:

val dedupedDF = inputDataFrame.dropDuplicates()

Does this solution meet the goal?

A. Yes
B. No

A

B. No

This solution does not meet the goal. This would remove only exact duplicate rows from the DataFrame. The dataset may still contain records with names in mixed case or uppercase, or SSNs with or without hyphens.

5
Q

Case Study Series of questions that represent the same scenario

You have a data pipeline in Azure Synapse Analytics that ingests data from multiple sources.
You need to process personal information such as first name, last name, and social security number (SSN). Because the data comes from different sources, it contains duplicate records and it also has other issues, such as:

  • In some of the records, the names are in mixed case (Customer), while in others, they are in uppercase (CUSTOMER).
  • The SSNs are sometimes hyphenated (123-45-6789), while others are missing the hyphens (123456789).

You need to remove the duplicates from the input data and normalize the records.

Solution: You transform the data using Spark pool with the Python code below:

from pyspark.sql.functions import *

dedupedDF = (dataFrame
    .select(col("*"),
        upper(col("firstName")).alias("upperFirstName"),
        upper(col("lastName")).alias("upperLastName"),
        regexp_replace(col("ssn"), "-", "").alias("ssnOnlyNumber")
    )
    .dropDuplicates(["upperFirstName", "upperLastName", "ssnOnlyNumber"])
)

Does this solution meet the goal?
A. Yes
B. No

A

A. Yes

This solution meets the goal. This code will first normalize the first and last names by putting everything in uppercase:
~~~
upper(col("firstName")).alias("upperFirstName")
upper(col("lastName")).alias("upperLastName")
~~~

Then it will remove the hyphens from SSN numbers:

regexp_replace(col("ssn"), "-", "").alias("ssnOnlyNumber")

And finally, it will remove the duplicate records:

.dropDuplicates(["upperFirstName", "upperLastName", "ssnOnlyNumber"])

Without these transformation steps, the resulting dataset may still contain duplicates.

6
Q

Case Study Series of questions that represent the same scenario

You have a data pipeline in Azure Synapse Analytics that ingests data from multiple sources.
You need to process personal information such as first name, last name, and social security number (SSN). Because the data comes from different sources, it contains duplicate records and it also has other issues, such as:
* In some of the records, the names are in mixed case (Customer), while in others, they are in uppercase (CUSTOMER).
* The SSNs are sometimes hyphenated (123-45-6789), while others are missing the hyphens (123456789).

You need to remove the duplicates from the input data and normalize the records.

Solution: You transform the data using SQL pool with the SQL query below:

WITH deduplicateData AS (
    SELECT
        UPPER([FirstName]) AS UpperFirstName,
        UPPER([LastName]) AS UpperLastName,
        CAST([SSN] AS INT) AS NumberSSN
    FROM [CustomerDataExternalTable]
)
SELECT * FROM deduplicateData
GROUP BY UpperFirstName, UpperLastName, NumberSSN

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. This query will fail to convert SSN records containing hyphens. You should use the REPLACE() function to perform this step. Shown below is the corrected SQL query that would meet all the requirements:

WITH deduplicateData AS (
    SELECT
        UPPER([FirstName]) AS UpperFirstName,
        UPPER([LastName]) AS UpperLastName,
        REPLACE([SSN], '-', '') AS NumberSSN
    FROM [CustomerDataExternalTable]
)
SELECT * FROM deduplicateData
GROUP BY UpperFirstName, UpperLastName, NumberSSN
7
Q

Case Study Series of questions that represent the same scenario

You work as an Azure data engineer.

You need to transform an incoming JSON file into a relational structure using Azure Data
Factory.

Solution: You use the flatten transformation in a mapping data flow.

Does this solution meet the goal?
A. Yes
B. No

A

A. Yes

This solution meets the goal. In a process called denormalization, the flatten transformation can take hierarchically structured files as an input and unroll them into individual rows.
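The flatten transformation itself is configured in the mapping data flow designer rather than in code, but the same unrolling idea can be sketched in T-SQL with OPENJSON; the JSON shape and column names below are purely illustrative:

~~~
-- Illustrative nested JSON: one order containing an array of line items.
DECLARE @json NVARCHAR(MAX) = N'{
    "orderId": 1,
    "items": [ { "sku": "A1", "qty": 2 }, { "sku": "B2", "qty": 5 } ]
}';

-- Unroll the items array so that each array element becomes its own row.
SELECT
    JSON_VALUE(@json, '$.orderId') AS OrderId,
    i.sku,
    i.qty
FROM OPENJSON(@json, '$.items')
    WITH (sku NVARCHAR(10) '$.sku',
          qty INT '$.qty') AS i;
~~~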

8
Q

Case Study Series of questions that represent the same scenario

You work as an Azure data engineer.

You need to transform an incoming JSON file into a relational structure using Azure Data
Factory.

Solution: You use the unpivot transformation in a mapping data flow.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. The unpivot transformation is used to normalize datasets by converting columns into rows. You can transform the selected set of columns into rows, making the data useful for relational data processing.
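Although the unpivot transformation is configured in the data flow designer, the column-to-row idea is the same as T-SQL's UNPIVOT operator; the table and column names in the sketch below are hypothetical:

~~~
-- Turn the quarterly amount columns into (Quarter, Amount) rows.
SELECT ProductID, Quarter, Amount
FROM dbo.QuarterlySales
UNPIVOT (
    Amount FOR Quarter IN ([Q1], [Q2], [Q3], [Q4])
) AS u;
~~~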

9
Q

Case Study Series of questions that represent the same scenario

You work as an Azure data engineer.

You need to transform an incoming JSON file into a relational structure using Azure Data
Factory.

Solution: You use the pivot transformation in a mapping data flow.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. The pivot transformation is used to denormalize datasets by converting rows into columns. It turns distinct row values from a chosen column into new columns, making the data useful for analytical data processing.
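For contrast with the previous card, the row-to-column idea matches T-SQL's PIVOT operator; again, the table and column names are hypothetical:

~~~
-- Turn distinct Quarter values into columns, aggregating Amount per product.
SELECT ProductID, [Q1], [Q2], [Q3], [Q4]
FROM (
    SELECT ProductID, Quarter, Amount
    FROM dbo.SalesRows
) AS src
PIVOT (
    SUM(Amount) FOR Quarter IN ([Q1], [Q2], [Q3], [Q4])
) AS p;
~~~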

10
Q

Case Study Series of questions that represent the same scenario

You work with an Azure Synapse Analytics solution.
You need to transform an incoming stream based on the following criteria:
* Rows prior to the year 2000 need to be assigned to an output stream, named Processed.
* Rows after the year 2000 need to be assigned to another stream, named Unprocessed, for further processing.

Solution: You use the filter transformation in a mapping data flow.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. The filter transformation is used to filter out rows based on specific criteria. It returns a single output stream containing only the rows that match the filter condition, so it cannot route rows to two separate output streams.

11
Q

Case Study Series of questions that represent the same scenario

You work with an Azure Synapse Analytics solution.
You need to transform an incoming stream based on the following criteria:
* Rows prior to the year 2000 need to be assigned to an output stream, named Processed.
* Rows after the year 2000 need to be assigned to another stream, named Unprocessed, for further processing.

Solution: You use the conditional split transformation in a mapping data flow.

Does this solution meet the goal?
A. Yes
B. No

A

A. Yes

This solution meets the goal. The conditional split transformation in mapping data flow allows you to configure conditions for each output stream. You can define multiple output streams using different conditions.

12
Q

Case Study Series of questions that represent the same scenario

You work with an Azure Synapse Analytics solution.
You need to transform an incoming stream based on the following criteria:
* Rows prior to the year 2000 need to be assigned to an output stream, named Processed.
* Rows after the year 2000 need to be assigned to another stream, named Unprocessed, for further processing.

Solution: You use the lookup transformation in a mapping data flow.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. The lookup transformation is used to reference data from another source. It appends columns from matched data to your source data, similarly to the LEFT JOIN keyword in SQL.
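Because the answer compares the lookup transformation to a SQL LEFT JOIN, a minimal T-SQL equivalent (with made-up table names) would be:

~~~
-- Append reference columns to every source row; unmatched rows get NULLs.
SELECT s.*, r.RegionName
FROM dbo.SourceData AS s
LEFT JOIN dbo.RegionReference AS r
    ON s.RegionalID = r.RegionalID;
~~~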

13
Q

Case Study Series of questions that represent the same scenario

You are a data engineer. You are developing a data ingestion solution that ingests data from large pipe-delimited text files in an Azure Data Lake Storage account to a dedicated pool in Azure Synapse Analytics.
You need to load the data.
Solution:
* Create an external file format and an external data source.
* Create an external table that uses the external data source.
* Load the data from the external table.

Does this solution meet the goal?
A. Yes
B. No

A

A. Yes

The solution meets the goal. To import data into Azure Synapse Analytics, you should first create an external file format by using the CREATE EXTERNAL FILE FORMAT statement. This defines the type of file that represents the source data.

Next, you should create an external data source by using the CREATE EXTERNAL DATA SOURCE statement. This specifies the location and credentials to the Azure Data Lake Storage account.

Then, you should create an external table by using the CREATE EXTERNAL TABLE statement. This defines the table columns and specifies the table's location in the storage account and the file format that you created previously.

Finally, you should load data into the table by using CREATE TABLE AS SELECT (CTAS), which allows you to write a query that selects data from the external table and places it in a new table.
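A minimal sketch of these four steps in a dedicated SQL pool is shown below. The storage account, credential, and table names are placeholders, and the column list is abbreviated:

~~~
-- 1. File format for pipe-delimited text files.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- 2. Data source pointing at the Azure Data Lake Storage account.
CREATE EXTERNAL DATA SOURCE InventoryLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://data@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = StorageCredential
);

-- 3. External table over the source files.
CREATE EXTERNAL TABLE dbo.Inventory_ext (
    UPC VARCHAR(20),
    Quantity INT
    -- remaining columns omitted for brevity
)
WITH (
    LOCATION = '/inventory/',
    DATA_SOURCE = InventoryLake,
    FILE_FORMAT = PipeDelimitedText
);

-- 4. Load the data into a distributed table with CREATE TABLE AS SELECT.
CREATE TABLE dbo.Inventory
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.Inventory_ext;
~~~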

14
Q

Case Study Series of questions that represent the same scenario

You are a data engineer. You are developing a data ingestion solution that ingests data from large pipe-delimited text files in an Azure Data Lake Storage account to a dedicated pool in Azure Synapse Analytics.
You need to load the data.
Solution:
* Create an Azure Databricks account and a linked server.
* Create an external table that points to the Azure Databricks account.
* Load the data by running the dbutils.fs.cp command.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. Azure Databricks uses Spark clusters to execute code in notebooks. You should not create a linked server to connect to a Databricks account. The sp_addlinkedserver stored procedure allows you to connect to other SQL Server instances. The dbutils.fs.cp command allows you to copy files in Databricks. Because you need to load data into a dedicated pool, not Databricks, you should not run this command.

15
Q

Case Study Series of questions that represent the same scenario

You are a data engineer. You are developing a data ingestion solution that ingests data from large pipe-delimited text files in an Azure Data Lake Storage account to Azure Synapse Analytics.
You need to load the data.
Solution:
* Create an Azure Cosmos DB account and a linked server.
* Create an external table that points to the Azure Cosmos DB account.
* Load the data by running the BULK IMPORT statement.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. Azure Cosmos DB is a multi-model, non-relational database that uses one of five APIs: SQL, Table, Cassandra, MongoDB, and Gremlin. You should not create a linked server to connect to a Cosmos DB account. The sp_addlinkedserver stored procedure allows you to connect to other SQL Server instances. BULK IMPORT allows you to bulk import data, but this command cannot import data from a Cosmos DB account.

16
Q

Case Study Series of questions that represent the same scenario

Background
Company A develops inventory management software. Its flagship product allows employees to scan product barcodes in stores and have that data delivered to a central repository. A website allows supervisors to view the data sent to the repository.

Inventory Data
Inventory data consists of the following fields:
* UPC
* Name
* Description
* Quantity
* Store ID
* Aisle Number
* Price
* Expiration Date

Technical Solution
Inventory data is currently stored in two Microsoft SQL Server databases. One database resides in California, and the other database resides in New York. In total, over 200 TB of data is stored across the two databases. The scanners submit inventory data to an application server over HTTPS. A service on the application server then analyzes the data and sends it to the databases.
The new solution must allow processing of the inventory data in batches every hour. After the data is processed, it must be kept for at least two years. It must be stored in such a way that parallel queries can be run against the data.
Business stakeholders must be able to graphically visualize the data without writing any code. Data engineers must be able to graphically visualize the data by using Python.

Data Engineering Requirements

The data engineers at your company are familiar with C#, Python, and SQL.
Any recommended solution must take advantage of their existing skills.

Question
You need to design a solution for storing the initial inventory data.
Which resource should you use?

A. Event Hub
B. Azure SQL Data Warehouse
C. Azure Data Lake
D. Power BI

A

C. Azure Data Lake

You should use Azure Data Lake. This is a big data store that supports data of any type and size. It supports receiving data in batches.
You should not use Event Hub. Event Hub supports the streaming of millions of events per second. You can submit events to Event Hub by using HTTPS or Advanced Message Queuing Protocol (AMQP).
You should not use Azure SQL Data Warehouse. This is a big data Enterprise Data Warehouse (EDW) that processes data in parallel across 60 compute nodes.
You should not use Power BI. Power BI allows you to turn data into visual insights.

17
Q

Case Study Series of questions that represent the same scenario

Background
Company A develops inventory management software. Its flagship product allows employees to scan product barcodes in stores and have that data delivered to a central repository. A website allows supervisors to view the data sent to the repository.

Inventory Data
Inventory data consists of the following fields:
* UPC
* Name
* Description
* Quantity
* Store ID
* Aisle Number
* Price
* Expiration Date

Technical Solution
Inventory data is currently stored in two Microsoft SQL Server databases. One database resides in California, and the other database resides in New York. In total, over 200 TB of data is stored across the two databases. The scanners submit inventory data to an application server over HTTPS. A service on the application server then analyzes the data and sends it to the databases.
The new solution must allow processing of the inventory data in batches every hour. After the data is processed, it must be kept for at least two years. It must be stored in such a way that parallel queries can be run against the data.
Business stakeholders must be able to graphically visualize the data without writing any code. Data engineers must be able to graphically visualize the data by using Python.

Data Engineering Requirements

The data engineers at your company are familiar with C#, Python, and SQL.
Any recommended solution must take advantage of their existing skills.

Question
You need to design a solution for storing the data in the long term.
Which resource should you use?

A. Azure Data Factory
B. Azure SQL Data Warehouse
C. Azure Databricks
D. Azure SQL Database

A

B. Azure SQL Data Warehouse

You should use Azure SQL Data Warehouse. This is a big data Enterprise Data Warehouse (EDW) that processes data in parallel across 60 compute nodes.

You should not use Azure Data Factory. This is a cloud service that allows you to extract, transform, and load big data. It only stores pipeline data for 45 days.

You should not use Azure Databricks. This is a platform based on Apache Spark clusters and it allows you to create big data workflows.

You should not use Azure SQL Database. This is a platform as a service (PaaS) offering of SQL Server. Azure SQL Database does not allow parallel processing of big data.

18
Q

Case Study Series of questions that represent the same scenario

Background
Company A develops inventory management software. Its flagship product allows employees to scan product barcodes in stores and have that data delivered to a central repository. A website allows supervisors to view the data sent to the repository.

Inventory Data
Inventory data consists of the following fields:
* UPC
* Name
* Description
* Quantity
* Store ID
* Aisle Number
* Price
* Expiration Date

Technical Solution
Inventory data is currently stored in two Microsoft SQL Server databases. One database resides in California, and the other database resides in New York. In total, over 200 TB of data is stored across the two databases. The scanners submit inventory data to an application server over HTTPS. A service on the application server then analyzes the data and sends it to the databases.
The new solution must allow processing of the inventory data in batches every hour. After the data is processed, it must be kept for at least two years. It must be stored in such a way that parallel queries can be run against the data.
Business stakeholders must be able to graphically visualize the data without writing any code. Data engineers must be able to graphically visualize the data by using Python.

Data Engineering Requirements

The data engineers at your company are familiar with C#, Python, and SQL.
Any recommended solution must take advantage of their existing skills.

Question
You need to design a solution to allow data engineers to visualize the data.

Which resource should you use?

A. Azure Databricks
B. Stream Analytics
C. Azure Data Lake
D. Power BI

A

A. Azure Databricks

You should use Azure Databricks. This is a platform based on Apache Spark clusters and it allows you to create big data workflows. Data engineers can visualize data by using Python. Databricks also supports Scala, R, and SQL.

You should not use Power BI. Power BI allows you to turn data into visual insights. It allows business stakeholders to perform this task without writing any code.

You should not use Stream Analytics. This is a real-time event processing service that processes events simultaneously from different sources.

You should not use Azure Data Lake. This is a repository for big data workloads that supports data of any type and size.

19
Q

Case Study Series of questions that represent the same scenario

Background
Company A develops inventory management software. Its flagship product allows employees to scan product barcodes in stores and have that data delivered to a central repository. A website allows supervisors to view the data sent to the repository.

Inventory Data
Inventory data consists of the following fields:
* UPC
* Name
* Description
* Quantity
* Store ID
* Aisle Number
* Price
* Expiration Date

Technical Solution
Inventory data is currently stored in two Microsoft SQL Server databases. One database resides in California, and the other database resides in New York. In total, over 200 TB of data is stored across the two databases. The scanners submit inventory data to an application server over HTTPS. A service on the application server then analyzes the data and sends it to the databases.
The new solution must allow processing of the inventory data in batches every hour. After the data is processed, it must be kept for at least two years. It must be stored in such a way that parallel queries can be run against the data.
Business stakeholders must be able to graphically visualize the data without writing any code. Data engineers must be able to graphically visualize the data by using Python.

Data Engineering Requirements

The data engineers at your company are familiar with C#, Python, and SQL.
Any recommended solution must take advantage of their existing skills.

Question
You need to design a solution to allow business stakeholders to visualize the data.

Which resource should you use?

A. Azure Databricks
B. Stream Analytics
C. Azure Data Lake
D. Power BI

A

D. Power BI

You should use Power BI. Power BI allows you to turn data into visual insights. It will allow the business stakeholders to perform this task without writing any code.

You should not use Stream Analytics. This is a real-time event processing service that processes events simultaneously from different sources.

You should not use Azure Data Lake. This is a repository for big data workloads that supports data of any type and size.

You should not use Azure Databricks. This is a platform based on Apache Spark clusters and it allows you to create big data workflows.

20
Q

Case Study Series of questions that represent the same scenario

You are using Azure Stream Analytics (ASA) to process real-time defect detection events from your factory’s production lines.

You need to select the right built-in windowing function in your ASA job’s SELECT query, so that you can:

  • Group events per line if they occur within specific time intervals between each other, but not exceeding the maximum duration time set for the window.
  • Filter out periods of time when no defects are reported.
  • Count each event only once.

Solution: In the ASA job query, group events by using the session window.

Does this solution meet the goal?
A. Yes
B. No

A

A. Yes

This solution meets the goal. Session windows begin when the first defect detection event occurs and keep extending as new events arrive within the configured time interval (timeout). If no further events are detected within the timeout, the window closes. The window also closes once it reaches the maximum duration set for the session window, after which a new session window may begin.

The session window option will effectively filter out periods of time where no events are streamed. Each event is only counted once.
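A hedged sketch of such a query in the Stream Analytics query language is shown below; the input name, column names, 5-minute timeout, and 30-minute maximum duration are all assumptions:

~~~
-- Count defects per production line within session windows:
-- a 5-minute timeout between events and a 30-minute maximum window duration.
SELECT
    LineId,
    COUNT(*) AS DefectCount,
    System.Timestamp() AS WindowEnd
FROM DefectEvents TIMESTAMP BY EventTime
GROUP BY LineId, SessionWindow(minute, 5, 30)
~~~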

21
Q

Case Study Series of questions that represent the same scenario

You are using Azure Stream Analytics (ASA) to process real-time defect detection events from your factory’s production lines.

You need to select the right built-in windowing function in your ASA job’s SELECT query, so that you can:

  • Group events per line if they occur within specific time intervals between each other, but not exceeding the maximum duration time set for the window.
  • Filter out periods of time when no defects are reported.
  • Count each event only once.

Solution: In the ASA job query, group events by using the tumbling window.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. Tumbling windows are a series of fixed-sized, non-overlapping, contiguous time intervals. Each event is counted only once. However, they do not check the time duration between events, and they do not filter out periods of time when no events are streamed.
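For contrast, a tumbling-window version of the same hypothetical query (10-minute fixed intervals) would look like this:

~~~
-- Fixed, non-overlapping, contiguous 10-minute buckets; each event falls into
-- exactly one window, and the gap between individual events is never considered.
SELECT LineId, COUNT(*) AS DefectCount
FROM DefectEvents TIMESTAMP BY EventTime
GROUP BY LineId, TumblingWindow(minute, 10)
~~~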

22
Q

Case Study Series of questions that represent the same scenario

You are using Azure Stream Analytics (ASA) to process real-time defect detection events from your factory’s production lines.

You need to select the right built-in windowing function in your ASA job’s SELECT query, so that you can:

  • Group events per line if they occur within specific time intervals between each other, but not exceeding the maximum duration time set for the window.
  • Filter out periods of time when no defects are reported.
  • Count each event only once.

Solution: In the ASA job query, group events by using the sliding window.

Does this solution meet the goal?
A. Yes
B. No

A

B. No

This solution does not meet the goal. Sliding windows are a series of fixed-sized, contiguous time intervals. They produce output only when an event occurs, so you can filter out periods of time when no events are streamed. However, they may overlap, which is why an event may be included in more than one window. Sliding windows also do not check the time duration between events.

23
Q

Case Study Series of questions that represent the same scenario

You are a data engineer for an autonomous vehicle manufacturer. Each vehicle contains a transmitter that submits sensor data over Advanced Message Queuing Protocol (AMQP). You want to retrieve the sensor data in real time so that you can extract relevant information, transform it, and then send it to Power BI.
You need to implement the solution.
Solution:
* Create an Event Hub instance.
* Create a Stream Analytics job that uses a query to extract data.

Does this solution meet the goal?
A. Yes
B. No

A

A. Yes

This solution meets the goal. Event Hubs is an Azure resource that allows you to stream big data to the cloud. It accepts streaming data over HTTPS and AMQP. A Stream Analytics job can read the data from Event Hubs, transform it, and store the output in a variety of destinations, including Power BI.
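A sketch of the kind of query such a Stream Analytics job might run is shown below; the input and output aliases, column names, and the 30-second window are assumptions:

~~~
-- Read sensor events from the Event Hub input, aggregate them,
-- and write the results to the Power BI output.
SELECT
    VehicleId,
    AVG(Speed) AS AvgSpeed,
    System.Timestamp() AS WindowEnd
INTO [powerbi-output]
FROM [eventhub-input] TIMESTAMP BY EventTime
GROUP BY VehicleId, TumblingWindow(second, 30)
~~~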

24
Q
A