DP-203 Flashcards
Exam Prep
Case Study Series of questions that represent the same scenario
Your company supports a software as a service (SaaS) application with a large number of customers. You are designing a support database to use elastic pools and elastic queries to retrieve customer information. Customer information is stored in a table that includes values CustomerID and RegionalID.
You need to partition data to optimize queries by customer sorted by geographic location. The solution should minimize support costs.
Solution: You configure horizontal partitioning based on CustomerID.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. If you configure sharding by CustomerID, reads will perform worse because the queries need to retrieve customers sorted by geographic location, not by customer ID.
Case Study Series of questions that represent the same scenario
Your company supports a software as a service (SaaS) application with a large number of customers. You are designing a support database to use elastic pools and elastic queries to retrieve customer information. Customer information is stored in a table that includes values CustomerID and RegionalID.
You need to partition data to optimize queries by customer sorted by geographic location. The solution should minimize support costs.
Solution: You configure vertical partitioning based on CustomerID.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. Vertical partitioning is a way to organize data by columns so that data can be retrieved more efficiently through JOIN operations; it does not partition rows by geographic location, so it does not optimize the required queries.
Case Study Series of questions that represent the same scenario
Your company supports a software as a service (SaaS) application with a large number of customers. You are designing a support database to use elastic pools and elastic queries to retrieve customer information. Customer information is stored in a table that includes values CustomerID and RegionalID.
You need to partition data to optimize queries by customer sorted by geographic location. The solution should minimize support costs.
Solution: You configure sharding by RegionalID.
Does this solution meet the goal?
A. Yes
B. No
A. Yes
This solution meets the goal. Sharding by RegionalID, which would segregate items by region, would help optimize sorting by geographic area. Another reason you might want to use RegionalID for sharding is that it might be necessary to have different parts of the database configured to reside in different geographies for compliance or geopolitical reasons.
Sharding partitions data horizontally to distribute data across multiple databases in a scaled-out design. This requires that the schema is the same on all of the databases involved. Sharding helps to minimize the size of individual databases, which in turn helps to improve transactional process performance. Hardware support requirements are minimized, which helps to reduce related costs. Elastic queries let you run queries across multiple shards. You can configure and manage sharding through the elastic database tools libraries or through self-sharding.
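For reference, a minimal sketch of how an elastic query could be layered over databases sharded by RegionalID. The object names (ElasticQueryCredential, ShardMapManagerDb, CustomerShardMap) and the server placeholder are illustrative only and not part of the scenario, and creation of the database-scoped credential is omitted:
~~~
-- Assumes a database-scoped credential named ElasticQueryCredential already exists.
CREATE EXTERNAL DATA SOURCE CustomerShards
WITH (
    TYPE = SHARD_MAP_MANAGER,
    LOCATION = '<server>.database.windows.net',
    DATABASE_NAME = 'ShardMapManagerDb',   -- placeholder shard map manager database
    CREDENTIAL = ElasticQueryCredential,
    SHARD_MAP_NAME = 'CustomerShardMap'    -- placeholder shard map name
);

-- External table that elastic queries can fan out across the shards.
CREATE EXTERNAL TABLE dbo.CustomerInformation (
    CustomerID INT NOT NULL,
    RegionalID INT NOT NULL
)
WITH (
    DATA_SOURCE = CustomerShards,
    DISTRIBUTION = SHARDED(RegionalID)     -- shards are keyed on RegionalID
);
~~~
With this in place, a query against dbo.CustomerInformation in the support database is transparently routed to the regional shards, which is what keeps the support costs low.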
Case Study Series of questions that represent the same scenario
You have a data pipeline in Azure Synapse Analytics that ingests data from multiple sources.
You need to process personal information such as first name, last name, and social security number (SSN). Because the data comes from different sources, it contains duplicate records and it also has other issues, such as:
- In some of the records, the names are in mixed case (Customer), while in others, they are in uppercase (CUSTOMER).
- The SSNs are sometimes hyphenated (123-45-6789), while others are missing the hyphens (123456789).
You need to remove the duplicates from the input data and normalize the records.
Solution: You transform the data using Spark pool with the Scala code below:
~~~
val dedupedDF = inputDataFrame.dropDuplicates()
~~~
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. This would remove only the exact duplicates from the data frame. The dataset may still contain duplicate records that differ only in name casing (mixed case vs. uppercase) or in SSN formatting (with or without hyphens).
Case Study Series of questions that represent the same scenario
You have a data pipeline in Azure Synapse Analytics that ingests data from multiple sources.
You need to process personal information such as first name, last name, and social security number (SSN). Because the data comes from different sources, it contains duplicate records and it also has other issues, such as:
- In some of the records, the names are in mixed case (Customer), while in others, they are in uppercase (CUSTOMER).
- The SSNs are sometimes hyphenated (123-45-6789), while others are missing the hyphens (123456789).
You need to remove the duplicates from the input data and normalize the records.
Solution: You transform the data using Spark pool with the Python code below:
~~~
from pyspark.sql.functions import *

dedupedDF = (dataFrame
    .select(
        col("*"),
        upper(col("firstName")).alias("upperFirstName"),
        upper(col("lastName")).alias("upperLastName"),
        regexp_replace(col("ssn"), "-", "").alias("ssnOnlyNumber")
    )
    .dropDuplicates(["upperFirstName", "upperLastName", "ssnOnlyNumber"])
)
~~~
Does this solution meet the goal?
A. Yes
B. No
A. Yes
This solution meets the goal. This code will first normalize the first and last names by putting everything in uppercase:
~~~
upper(col("firstName")).alias("upperFirstName")
upper(col("lastName")).alias("upperLastName")
~~~
Then it will remove the hyphens from SSN numbers:
~~~
regexp_replace(col("ssn"), "-", "").alias("ssnOnlyNumber")
~~~
And finally, it will remove the duplicate records:
~~~
.dropDuplicates(["upperFirstName", "upperLastName", "ssnOnlyNumber"])
~~~
Without these transformation steps, the resulting dataset may still contain duplicates.
Case Study Series of questions that represent the same scenario
You have a data pipeline in Azure Synapse Analytics that ingests data from multiple sources.
You need to process personal information such as first name, last name, and social security number (SSN). Because the data comes from different sources, it contains duplicate records and it also has other issues, such as:
* In some of the records, the names are in mixed case (Customer), while in others, they are in uppercase (CUSTOMER).
* The SSNs are sometimes hyphenated (123-45-6789), while others are missing the hyphens (123456789).
You need to remove the duplicates from the input data and normalize the records.
Solution: You transform the data using SQL pool with the SQL query below:
~~~
WITH deduplicateData AS (
    SELECT UPPER([FirstName]) as UpperFirstName,
           UPPER([LastName]) as UpperLastName,
           CAST([SSN] as INT) as NumberSSN
    FROM [CustomerDataExternalTable]
)
SELECT * FROM deduplicateData
GROUP BY UpperFirstName, UpperLastName, NumberSSN
~~~
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. This query will fail to convert SSN records that contain hyphens, because CAST([SSN] as INT) raises a conversion error for values such as 123-45-6789. You should use the REPLACE() function to perform this step instead. Shown below is the corrected SQL query that would meet all the requirements:
~~~
WITH deduplicateData AS (
    SELECT UPPER([FirstName]) as UpperFirstName,
           UPPER([LastName]) as UpperLastName,
           REPLACE([SSN], '-', '') as NumberSSN
    FROM [CustomerDataExternalTable]
)
SELECT * FROM deduplicateData
GROUP BY UpperFirstName, UpperLastName, NumberSSN
~~~
Case Study Series of questions that represent the same scenario
You work as an Azure data engineer.
You need to transform an incoming JSON file into a relational structure using Azure Data Factory.
Solution: You use the flatten transformation in a mapping data flow.
Does this solution meet the goal?
A. Yes
B. No
A. Yes
This solution meets the goal. In a process called denormalization, the flatten transformation can take hierarchically structured files as an input and unroll them into individual rows.
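There is no T-SQL equivalent of the flatten transformation itself, but the same unrolling idea can be sketched with OPENJSON, which also turns a nested JSON array into one row per element. The JSON document and column names below are made up purely for illustration:
~~~
DECLARE @json NVARCHAR(MAX) = N'{
  "customer": "Contoso",
  "orders": [ { "orderId": 1, "total": 10.5 },
              { "orderId": 2, "total": 20.0 } ]
}';

-- Each element of the "orders" array becomes its own relational row.
SELECT
    JSON_VALUE(@json, '$.customer') AS Customer,
    o.orderId,
    o.total
FROM OPENJSON(@json, '$.orders')
WITH (orderId INT '$.orderId', total DECIMAL(10,2) '$.total') AS o;
~~~
The flatten transformation does the equivalent unrolling inside the mapping data flow, without writing any SQL.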
Case Study Series of questions that represent the same scenario
You work as an Azure data engineer.
You need to transform an incoming JSON file into a relational structure using Azure Data Factory.
Solution: You use the unpivot transformation in a mapping data flow.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. The unpivot transformation is used to normalize datasets by converting columns into rows. You can transform the selected set of columns into rows, making the data useful for relational data processing.
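As a point of comparison outside the data flow, T-SQL has an UNPIVOT operator that performs the same column-to-row normalization; the table and column names here are hypothetical:
~~~
-- Quarterly amount columns are rotated into (Quarter, Amount) rows.
SELECT Product, Quarter, Amount
FROM (SELECT Product, Q1, Q2, Q3, Q4 FROM dbo.QuarterlySales) AS src
UNPIVOT (Amount FOR Quarter IN (Q1, Q2, Q3, Q4)) AS unpvt;
~~~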
Case Study Series of questions that represent the same scenario
You work as an Azure data engineer.
You need to transform an incoming JSON file into a relational structure using Azure Data Factory.
Solution: You use the pivot transformation in a mapping data flow.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. The pivot transformation is used to denormalize datasets by converting rows into columns. You can transform the unique set of rows into columns, making the data useful for analytical data processing.
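The corresponding row-to-column rotation can likewise be sketched with the T-SQL PIVOT operator (again with hypothetical table and column names):
~~~
-- (Quarter, Amount) rows are rotated into one amount column per quarter.
SELECT Product, [Q1], [Q2], [Q3], [Q4]
FROM (SELECT Product, Quarter, Amount FROM dbo.SalesByQuarter) AS src
PIVOT (SUM(Amount) FOR Quarter IN ([Q1], [Q2], [Q3], [Q4])) AS pvt;
~~~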
Case Study Series of questions that represent the same scenario
You work with an Azure Synapse Analytics solution.
You need to transform an incoming stream based on the following criteria:
* Rows prior to the year 2000 need to be assigned to an output stream, named Processed.
* Rows after the year 2000 need to be assigned to another stream, named Unprocessed, for further processing.
Solution: You use the filter transformation in a mapping data flow.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. The filter transformation is used to filter out rows based on specific criteria. It returns a single output stream which matches the filter condition.
Case Study Series of questions that represent the same scenario
You work with an Azure Synapse Analytics solution.
You need to transform an incoming stream based on the following criteria:
* Rows prior to the year 2000 need to be assigned to an output stream, named Processed.
* Rows after the year 2000 need to be assigned to another stream, named Unprocessed, for further processing.
Solution: You use the conditional split transformation in a mapping data flow.
Does this solution meet the goal?
A. Yes
B. No
A. Yes
This solution meets the goal. The conditional split transformation in mapping data flow allows you to configure conditions for each output stream. You can define multiple output streams using different conditions.
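The split logic itself is simple. Expressed as T-SQL against a hypothetical staging table (dbo.IncomingRows with an EventDate column), the two output streams would correspond to two complementary conditions:
~~~
-- Rows before the year 2000 go to the Processed stream.
SELECT * INTO #Processed
FROM dbo.IncomingRows
WHERE YEAR(EventDate) < 2000;

-- The remaining rows go to the Unprocessed stream for further processing.
SELECT * INTO #Unprocessed
FROM dbo.IncomingRows
WHERE YEAR(EventDate) >= 2000;
~~~
In the conditional split transformation, the first condition defines the Processed stream and the default (or second) stream captures everything else as Unprocessed.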
Case Study Series of questions that represent the same scenario
You work with an Azure Synapse Analytics solution.
You need to transform an incoming stream based on the following criteria:
* Rows prior to the year 2000 need to be assigned to an output stream, named Processed.
* Rows after the year 2000 need to be assigned to another stream, named Unprocessed, for further processing.
Solution: You use the lookup transformation in a mapping data flow.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. The lookup transformation is used to reference data from another source. It appends columns from matched data to your source data, similarly to the LEFT JOIN keyword in SQL.
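In SQL terms, the lookup transformation behaves like the LEFT JOIN below, appending reference columns to every source row; the table and column names are illustrative only:
~~~
-- Reference columns from dbo.Regions are appended to each order row;
-- unmatched rows keep NULLs, just as a lookup with no match would.
SELECT o.OrderID, o.RegionalID, r.RegionName
FROM dbo.Orders AS o
LEFT JOIN dbo.Regions AS r
    ON o.RegionalID = r.RegionalID;
~~~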
Case Study Series of questions that represent the same scenario
You are a data engineer. You are developing a data ingestion solution that ingests data from large pipe-delimited text files in an Azure Data Lake Storage account to a dedicated pool in Azure Synapse Analytics.
You need to load the data.
Solution:
* Create an external file format and an external data source.
* Create an external table that uses the external data source.
* Load the data from the external table.
Does this solution meet the goal?
A. Yes
B. No
A. Yes
The solution meets the goal. To import data into Azure Synapse Analytics, you should first create an external file format by using the CREATE EXTERNAL FILE FORMAT statement. This defines the type of file that represents the source data.
Next, you should create an external data source by using the CREATE EXTERNAL DATA SOURCE statement. This specifies the location and credentials to the Azure Data Lake Storage account.
Then, you should create an external table by using the CREATE EXTERNAL TABLE statement.
This defines the table fields, specifies its location in the storage account, and the file format that you created previously.
Finally, you should load data into the table by using CREATE TABLE AS SELECT (CTAS), which allows you to write a query that selects data from the source files (through the external table) and places it in a new table.
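Put together, the four steps could look like the sketch below. All object names, the container path, and the distribution choice are placeholders, and the database-scoped credential needed to authenticate to the storage account is omitted:
~~~
-- 1. External file format for pipe-delimited text files.
CREATE EXTERNAL FILE FORMAT PipeDelimitedFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- 2. External data source pointing at the Data Lake Storage account
--    (database-scoped credential omitted for brevity).
CREATE EXTERNAL DATA SOURCE RawDataLake
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://<container>@<account>.dfs.core.windows.net');

-- 3. External table over the source files.
CREATE EXTERNAL TABLE dbo.CustomerStaging (
    CustomerID INT,
    FirstName  NVARCHAR(100),
    LastName   NVARCHAR(100)
)
WITH (LOCATION = '/customers/',
      DATA_SOURCE = RawDataLake,
      FILE_FORMAT = PipeDelimitedFormat);

-- 4. Load into the dedicated SQL pool with CREATE TABLE AS SELECT.
CREATE TABLE dbo.Customer
WITH (DISTRIBUTION = HASH(CustomerID), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.CustomerStaging;
~~~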
Case Study Series of questions that represent the same scenario
You are a data engineer. You are developing a data ingestion solution that ingests data from large pipe-delimited text files in an Azure Data Lake Storage account to a dedicated pool in Azure Synapse Analytics.
You need to load the data.
Solution:
* Create an Azure Databricks account and a linked server.
* Create an external table that points to the Azure Databricks account.
* Load the data by running the dbutils.fs.cp command.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. Azure Databricks uses Spark clusters to execute code in notebooks. You should not create a linked server to connect to a Databricks account. The sp_addlinkedserver stored procedure allows you to connect to other SQL Server instances. The dbutils.fs.cp command allows you to copy files in Databricks. Because you need to load data into a dedicated pool, not Databricks, you should not run this command.
Case Study Series of questions that represent the same scenario
You are a data engineer. You are developing a data ingestion solution that ingests data from large pipe-delimited text files in an Azure Data Lake Storage account to Azure Synapse Analytics.
You need to load the data.
Solution:
* Create an Azure Cosmos DB account and a linked server.
* Create an external table that points to the Azure Cosmos DB account.
* Load the data by running the BULK IMPORT statement.
Does this solution meet the goal?
A. Yes
B. No
B. No
This solution does not meet the goal. Azure Cosmos DB is a multi-model, non-relational database that uses one of five APIs: SQL, Table, Cassandra, MongoDB, and Gremlin. You should not create a linked server to connect to a Cosmos DB account. The sp_addlinkedserver stored procedure allows you to connect to other SQL Server instances. BULK IMPORT allows you to bulk import data, but this command cannot import data from a Cosmos DB account.