DP-203 Dumps Flashcards

1
Q

1) You execute the following query in an Azure Synapse Analytics Spark pool in a workspace:

SELECT StudentID
FROM abc.dbo.myTable
WHERE name = ‘Amit’

TABLE:

StudentName: Amit
StudentID: 69
StudentStartDate: 26/05/22

What will be the output of the query?

a) Amit
b) Error
c) 69
d) Null

A

Answer: b
Explanation: The query filters on a column named 'name', but the table's column is StudentName, so the query returns an error.
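
A corrected query would filter on the column that actually exists; a minimal sketch, assuming the table is exactly as shown above:

SELECT StudentID
FROM abc.dbo.myTable
WHERE StudentName = 'Amit'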

2
Q

2) As a Data Engineer, you need to design an Azure Synapse Analytics dedicated SQL Pool which can meet the following goal:

  • Return student records from a given point in time,
  • Maintain current student information

How should you model the student data?

a) View
b) Temporal table
c) Slowly Changing Dimension (SCD) Type 2
d) SCD Type 7

A

Answer: c
Explanation: A Type 2 SCD adds a new row for each change, so it preserves history for point-in-time queries while the current row holds the latest student information.

3
Q

3) An Azure Data Factory pipeline has the following activities:

  • Copy,
  • Wrangling data flow,
  • Jar,
  • Notebooks

Which TWO Azure services should you use to debug the activities?

a) Computer Vision
b) Data Factory
c) Azure Sentinel
d) Azure Databricks

A

Answer: b,d
Explanation: Copy and wrangling data flow activities are debugged in Data Factory, while Jar and Notebook activities run on (and are debugged in) Azure Databricks. Computer Vision is an AI service and Azure Sentinel is a security service, so neither is relevant.

4
Q

4) A company needs to design an Azure Data Lake Storage solution which will include geo-zone-redundant storage (GZRS) for high availability.

What should you include in the monitoring solution for replication delays which can affect the recovery point objective (RPO)?

a) 4xx: Server error
b) Last sync time
c) Principle of least privilege
d) ARM template

A

Answer: b
Explanation: The Last Sync Time metric shows how far geo-replication lags behind, which is what determines the RPO. Options a, c, and d have nothing to do with replication delays or the RPO.

5
Q

5) An automobile company uses an Azure IoT Hub for communication with the IoT devices. What solution should you recommend if you want to monitor the devices in real-time?

a) Azure Data Factory using Visual Studio
b) Azure Stream Analytics job
c) Storage Account using Azure Powershell
d) Azure virtual machine using Azure Portal

A

Answer: b
Explanation: None of the other options have to do with IoT devices and/or monitoring in real-time.

6
Q

6) A table will track the values of dimension attributes over the course of time and retain the history of the data by adding new rows as the data changes. Which Slowly Changing Dimension (SCD) type should you use?

a) Type -1
b) Type 1
c) Type 2
d) Type 3

A

Answer: c

7
Q

7) A company needs to perform batch processing in Azure Databricks once per day. Which type of databricks cluster should you use?

a) Standard
b) Interactive
c) Automated
d) Manual

A

Answer: c
Explanation: An automated (job) cluster is created to run a scheduled job and terminates when it finishes, which suits a once-daily batch workload. Standard and interactive clusters are intended for interactive analysis, and a 'Manual' Databricks cluster type doesn't exist.

8
Q

8) A company is building streaming solutions in Azure Databricks. The solution needs to count events in 5 minute intervals and only report on events which arrive during the interval which will be sent to a Delta Lake table as an output. Which output mode should you use?

a) Complete
b) Partial
c) Append
d) Update

A

Answer: c
Explanation: Partial is not an output mode. Complete rewrites the entire result table on every trigger, and Update emits only rows changed since the last trigger; Append outputs only the new rows that arrive during each interval, which matches the requirement.

9
Q

9) A company has an Azure Data Lake Storage Gen2 account called CGAmit which is protected by virtual networks. You need to design an SQL pool in Azure Synapse which will use CGAmit as the source. What should you use to authenticate to CGAmit?

a) Azure Lock
b) Shared Access Signature (SAS)
c) Active Directory Federation Services (ADFS)
d) Managed Identity

A

Answer: d

Explanation:
Azure Lock deals with accidental deletion of resources.
SAS deals with providing secure delegated access to resources in the storage account. ADFS deals with SSO between internet-facing applications.

10
Q

10) You need to recommend a solution when designing a database for an Azure Synapse Analytics dedicated SQL pool for transaction fraud which can meet the following requirements:

  • Users should not be able to access the actual food card numbers
  • Users should be able to use food cards as a feature in the models

What should you suggest?

a) Row-level-security (RLS)
b) Azure Active-Directory Pass-Through authentication
c) Transparent Data Encryption (TDE)
d) Column-level security

A

Answer: d

Explanation:

RLS restricts which rows a user can see, not columns.
Azure AD pass-through is an authentication mechanism and is not relevant here.
TDE encrypts data at rest but decrypts it transparently for anyone who can query it, so it does not hide the card numbers.

11
Q

11) You need to suggest which format to store the data in Azure Data Lake Storage Gen2 to support the reports. The solution should minimize read times.

  • Read two columns from a file which contains 69 columns:

a) Parquet
b) TSV
c) AVRO

  • Query one record based on timestamp:

a) Parquet
b) TSV
c) AVRO

A

Answer: a, c

12
Q

12) As a data engineer, you need to aggregate data which originates in Kafka and is output to Azure Data Lake Storage Gen2. The testing team needs to implement the stream processing solution using Java.

Which service should you suggest to process the streaming data?

a) Azure Databricks
b) Azure Stream Analytics
c) Azure Sentinel
d) Azure Event Hub

A

Answer: a

Explanation:
Azure Sentinel is a security service and Azure Event Hubs only ingests streaming data; neither processes it. Azure Stream Analytics doesn't support Java (it uses SQL and JavaScript UDFs), so Azure Databricks is the correct choice.

13
Q

13) A production team needs a solution which can stream data to Azure Stream Analytics. The solution will have reference data as well as streaming data. Which TWO input types should you use for reference data?

a) Azure DocumentDB
b) Azure Blob Storage
c) Azure Event Hub
d) Azure SQL Database

A

Answer: b, d

Explanation:
Stream Analytics supports Azure Blob Storage and Azure SQL Database as reference data inputs.
DocumentDB is not a supported reference input, and Event Hubs is a streaming input rather than a reference input.

14
Q

14) You need to ensure that data in the Azure Synapse Analytics dedicated SQL pool is encrypted at rest. The solution should NOT modify applications which query the data. What should you implement?

a) Enable Transparent Data Encryption (TDE)
b) Upgrade to Premium P2 license
c) Create Azure functions
d) Use customer managed keys

A

Answer: a

Explanation:

Nothing in the question mentions licensing.
Azure Functions has nothing to do with encryption.
Customer managed keys are configured at the workspace level (deals with double-encryption).

15
Q

15) As a data engineer, you need to suggest an Azure Databricks cluster configuration which can meet the following requirements:

  • Minimize cost,
  • Reduce query latency,
  • Maximize the number of users that can execute queries on cluster simultaneously

Which cluster type should you suggest?

a) High concurrency cluster with auto termination
b) High concurrency cluster with autoscaling
c) Standard cluster with auto termination
d) Standard cluster with autoscaling

A

Answer: b

Explanation:
A Standard cluster is recommended for a single user and does not share resources efficiently across many concurrent users; a High Concurrency cluster provides fine-grained sharing for low query latency.
Autoscaling adjusts the number of workers to the load, which minimizes cost while keeping latency low, whereas auto termination would shut the cluster down rather than serve more users.

16
Q

16) A company needs to trigger an Azure Data Factory pipeline as soon as a file arrives in an Azure Data Lake Storage Gen2 container. Which resource should you use?

a) Microsoft.EventGrid
b) Microsoft.EventHub
c) Microsoft.IoT
d) Microsoft.CosmosDB

A

Answer: a

Explanation:
The scenario is not about telemetry or IoT data, so IoT and Cosmos DB are incorrect, and Event Hubs is meant for high-volume telemetry ingestion.
Event Grid raises blob-created events and is natively integrated with Data Factory/Synapse storage event triggers.

17
Q

17) As a data engineer, you need to make sure that you can audit access to Personally Identifiable Information (PII) while designing an Azure Synapse Analytics dedicated SQL pool. What should you include?

a) RLS
b) Column-level security
c) Security baseline
d) Sensitivity classifications

A

Answer: d

Explanation:
RLS is meant for restricting rows.
Column-level security restricts access to specific columns but does not provide auditing.
A security baseline is general hardening guidance. Sensitivity classifications label PII columns so that access to them can be audited.

18
Q

18) You need to design a date dimension table in an Azure Synapse Analytics dedicated SQL pool. As per the business requirement, the date dimension table will be used by all fact tables. Which distribution type should you recommend to minimize data movement?

a) Hash
b) Asterisk
c) Replicate
d) Round robin

A

Answer: c

Explanation:

For FACT tables, Hash distribution is used.
For DIMENSION tables REPLICATE is used.
For STAGING tables, ROUND ROBIN is used.
There is no ASTERISK distribution type in Azure.
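
A minimal sketch of such a replicated date dimension in a dedicated SQL pool (table and column names are hypothetical):

CREATE TABLE dbo.DimDate
(
    DateKey INT NOT NULL,
    CalendarDate DATE NOT NULL,
    CalendarYear SMALLINT NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);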

19
Q

19) As a data engineer, you need to create a new notebook in Azure Databricks which will support Python as the primary language and should also support R and Scala. Which switch should you use to switch between the different languages?

a) %
b) #
c) @{}
d) @[]

A

Answer: a
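
For example (a minimal sketch): in a notebook whose default language is Python, prefixing a cell with a magic command runs that cell in another language, such as a SQL cell:

%sql
SELECT current_date() AS today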

20
Q

20) A company has an Azure Synapse Analytics dedicated SQL pool which contains a huge fact table. The table contains 47 columns and 4.7 BN rows and is a heap. On average, queries against the table aggregate values from approximately 69 million rows and return only two columns. You notice that queries against the fact table are extremely slow. Which type of index should you add to provide the fastest query times?

a) Non-clustered column store
b) Clustered index
c) Semi-clustered index
d) Clustered column store

A

Answer: d

Explanation:
Nonclustered columnstore indexes are not supported in Synapse dedicated SQL pools, and a 'semi-clustered' index does not exist.
A clustered (rowstore) index performs best on tables with fewer than about 60 million rows.

A clustered columnstore index is usually the best choice for large fact tables, especially for aggregations that read only a few columns.

21
Q

21) An e-commerce company needs to make sure that an Azure Data Lake Storage Gen2 container is available for read workloads in a secondary region if an outage happens in the primary region. Which type of redundancy should you recommend so that your solution minimizes costs?

a) Geo-Zone-Redundant-Storage (G-ZRS)
b) Geo-Redundant-Storage (GRS)
c) Locally-Redundant-Storage (LRS)
d) Read-Access-Geo-Redundant-Storage (RA-GRS)

A

Answer: d

Explanation:

GRS and GZRS replicate to a secondary region, but the data there is not readable unless a failover occurs; RA-GRS adds read access to the secondary region.
LRS provides redundancy in a single region only.

22
Q

22) As a data engineer, you need to configure an Azure Databricks workspace which is currently in the Standard pricing tier to support autoscaling all-purpose clusters. The solution should meet the following requirements:

  • Reduce time taken to scale the number of workers while minimizing costs
  • Automatically scale down workers when the cluster is underutilized for five minutes

What should be your first step?

a) Upgrade Azure Databricks workspace to Premium pricing tier
b) Create logic apps for the workspace
c) Enable a log analytics workspace
d) Create a storage account

A

Answer: a

23
Q

23) A company uses Azure Stream Analytics to accept data from Azure Event Hubs and to output the data to an Azure Blob Storage account. As a data engineer, you need to output the count of records received from the last 7 minutes, every minute. Which window function should you use?

a) Sliding
b) Tumbling
c) Hopping
d) Snapshot

A

Answer: c

Explanation: A hopping window is a fixed-size window that overlaps and repeats on a schedule: here a 7-minute window evaluated every minute. Tumbling windows don't overlap, sliding windows emit output only when events enter or leave the window, and a snapshot window groups events with the same timestamp.
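
A minimal sketch of such a job query in the Stream Analytics query language (the input, output, and timestamp column names are assumptions):

SELECT COUNT(*) AS RecordCount, System.Timestamp() AS WindowEnd
INTO [output-blob]
FROM [input-eventhub] TIMESTAMP BY EventTime
GROUP BY HoppingWindow(minute, 7, 1)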

24
Q

24) An Azure Data Factory pipeline needs to meet the following requirements:

  • Support backfilling existing data in the source table
  • Automatically retry execution if the pipeline fails due to throttling limits or concurrency

Which type of trigger should you recommend?

a) Schedule
b) Tumbling window
c) Hopping
d) Snapshot

A

Answer: b

Explanation:

Hopping and Snapshot are window types, not trigger types.
A schedule trigger could run the pipeline but cannot backfill past periods; a tumbling window trigger supports backfill and a retry policy.

25
Q

25) As a data engineer, you need to design an analytical solution which will use Python functions for near real-time data from Azure Event Hubs. Which solution should you recommend to perform statistical analysis to minimize latency?

a) Azure Databricks
b) Azure Stream Analytics
c) Azure Sentinel
d) Azure Event Hub

A

Answer: a

Explanation:

Sentinel is a security service, and Event Hubs is only the ingestion source here, not an analytics engine.
Azure Stream Analytics doesn't support Python, so Azure Databricks (which does) is the right choice for low-latency statistical analysis.

26
Q

26) You need to analyze Azure Data Factory pipeline failures from the last 69 days. What should you use?

a) Activity log blade
b) Resource health blade
c) Azure Storage Account
d) Azure Monitor

A

Answer: d

27
Q

27) You need to make sure that the data in an Azure Data Lake Storage Gen2 storage account will remain available if a data center fails in the primary Azure region. Which replication type should you use for the storage account to minimize costs?

a) Locally-Redundant-Storage (LRS)
b) Zone-Redundant-Storage (ZRS)
c) Geo-Redundant-Storage (GRS)
d) Geo-Zone-Redundant-Storage (GZRS)

A

Answer: b

Explanation:

GRS/GZRS would also work but will not minimize the costs. Therefore, ZRS is the better option.

28
Q

28) A company needs to design an Azure Data Factory Pipeline which will include mapping data flow. As per the business requirement, you need to transform JSON-formatted data into a tabular dataset.

Which transformation method should you use in the mapping flow so that the dataset only has one row for each item in the array?

a) Flatten
b) Broaden
c) Modify row
d) Pivot

A

Answer: a

Explanation:

Broaden/Modify row are not transformation types in ADF.
Pivot doesn’t handle arrays.

29
Q

29) You need to use a streaming data solution which uses Azure Databricks. The solution should meet the following requirements with respect to output data which contains e-book sales details:

  • E-book sales transactions won’t be updated. Only new rows will be added to adjust a sale.
  • You are required to suggest an output mode for the dataset which will be processed by using Structured Streaming which reduces duplicate data.

What should you suggest?

a) Append
b) Complete
c) Change
d) Update

A

Answer: d

Explanation:

Append won’t work as we need to reduce duplicate data.
Complete replaces the entire table with one complete batch.
Change is not an output mode.

30
Q

30) While monitoring an Azure Stream Analytics job, you notice that the backlogged input events count has been 17 for the last hour. What should you do to reduce the backlogged input events count?

a) Decrease streaming units for the job
b) Delete the job
c) Associate a storage account for the job
d) Increase streaming units for the job

A

Answer: d

Explanation:

Decreasing streaming units for the job will increase the backlog
Don’t delete the job LOL
No need of storage accounts as the question has nothing to do with storing data.

31
Q

31) As a data engineer, you need to design the folder structure for Azure Data Lake Storage Gen2. The data should be secured by ‘FocusArea’. Frequent queries will include data from the current year or current month.

Which folder structure should you suggest for minimal delay in queries and simplified folder security?

a) /FocusArea/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}{YYYY}{MM}{DD}.xls
b) {DD}/{MM}/{YYYY}/FocusArea/{DataSource}/{FileData}/
{YYYY}{MM}{DD}.xls
c) {YYYY}/{MM}/{DD}/FocusArea/{DataSource}/{FileData}/{YYYY}{MM}{DD}.xls
d) /FocusArea/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}
{YYYY}{MM}{DD}.xls

A

Answer: d

Explanation:

Data must be secured by FocusArea, so options b and c are incorrect because their structures don't start with FocusArea.
Option a nests the date as day/month/year, but frequent queries target the current year or month, so the year should come first, as in option d.

32
Q

32) A company has a data lake which is accessible only via an Azure virtual network. You are building an SQL pool in Azure Synapse which will use data from the data lake and is planned to load data to the SQL pool every hour. You need to make sure that the SQL pool can load the data from the data lake. Which TWO actions should you perform?

a) Create a service principal
b) Create a managed identity
c) Add an Azure Active Directory Federation Services (ADFS) account
d) Configure managed identity as credentials for the data loading process

A

Answer: b, d

Explanation:

Whenever virtual networks are mentioned, managed identity is the best option.

33
Q

33) You need to suggest a Stream Analytics data output format to make sure that queries from Databricks and PolyBase against the files encounter fewer errors. The solution should make sure that the files can be queried fast and that the data type information is kept intact. What should you suggest?

a) Parquet
b) TSV
c) JSON
d) AVRO

A

Answer: a

Explanation: The solution should maintain the metadata itself (data type information needs to be kept intact)

34
Q

34) You need to configure an Azure Databricks cluster to automatically connect to Azure Data Lake Storage Gen2 with the help of Azure AD Integration. How should you configure the cluster?

Advanced option to be enabled:

a) Premium
b) Standard
c) Azure Data Lake Storage Credential Pass-through

Tier:

a) Premium
b) Standard
c) Azure Data Lake Storage Credential Pass-through

A

Answer: c, a

Explanation: Azure Data Lake Storage credential passthrough is enabled under the cluster's advanced options. For the tier, credential passthrough requires Premium.

35
Q

35) You are required to copy blob data from an Azure Storage account to the data warehouse with the help of Azure Data Factory. The solution should meet the following requirements:

  • Make sure that the data remains in the US Central region at all times,

Which type of integration runtime should you use?

a) Data sovereignty runtime
b) Azure-SSIS
c) Self-hosted
d) Azure Integration runtime

A

Answer: d

Explanation:

Option a is not an IR type
Options b and c can’t guarantee that the requirement is met.

36
Q

36) You need to design an Azure Synapse solution which can provide a query interface for the data stored in an Azure Storage account which is only accessible from a virtual network. Which authentication mechanism should you recommend to ensure that the solution can access the source data?

a) Managed Identity
b) Bastion Host
c) Shared Access Signatures (SAS)
d) Azure Active Directory Authentication

A

Answer: a

Explanation: Managed Identity is required when your storage is attached to a virtual network.

37
Q

37) A company has 7 Azure Data Factory pipelines. You need to label each pipeline with the primary purpose of either extract, transform, or load. The labels should be available for grouping and filtering when using monitoring experience in Data Factory. What should be added to each pipeline?

a) Caption
b) Subtitles
c) Annotation
d) Tags

A

Answer: c

Explanation: Captions and subtitles are not Data Factory concepts, and tags are key-value pairs applied to the resource rather than to individual pipelines. Annotations can be used to group and filter pipelines in the monitoring experience.

38
Q

38) An e-commerce company has an Azure Data Factory component named CGA which contains a linked service. There is an Azure Key Vault which contains an encryption key named ‘TestKey’. What should be your first step to encrypt CGA using the encryption key ‘TestKey’?

a) Build a self-hosted integration runtime
b) Create a new key vault
c) Create a managed identity
d) Remove linked service from CGA

A

Answer: d

Explanation: Customer-managed key encryption can only be enabled on an empty data factory, so the linked service must be removed from CGA first.

39
Q

39) You need to copy files and folders from storage accounts Storage7 to Storage8 using Data Factory copy activity. The solution should meet the following requirements:

  • The original folder structure should be maintained
  • No transformations should be performed

How should you setup the copy activity?

Dataset source type:

a) Binary
b) Avro
c) Preserve hierarchy

Copy activity:

a) Binary
b) Avro
c) Preserve hierarchy

A

Answer: a,c

40
Q

40) You want to prevent the development team's users from seeing the full email addresses in the email column of an SQL pool in Azure Synapse. The users should be able to see the values in the format ZZ@ZZZZ.com instead. Which TWO options can meet this requirement?

a) Set a mask on the email column from Azure Portal
b) Select mask row from Azure Portal
c) Set an email mask on the email column from MS SQL Server Management Studio
d) Create a key vault for the email column

A

Answer: a,c

Explanation: Option B is not a function we can use in Azure Portal. Option D is related to encryption and cannot be applied to column level masking.
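
The SSMS/T-SQL route uses dynamic data masking with the built-in email() function; a minimal sketch, assuming a table named dbo.Users (hypothetical):

ALTER TABLE dbo.Users
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');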

41
Q

41) An Azure Data Lake Storage Gen2 container contains TSV files. The file size ranges from 7 KB to 3 GB. What should you do to ensure that the files stored in the container are optimized for batch processing?

a) Delete the files
b) Merge the files
c) Compress the files
d) Convert files to Parquet

A

Answer: b

Explanation: For better performance in batch processing it is recommended to merge files into larger files (256 MB - 100 GB range)

42
Q

42) A company has an Azure Synapse Analytics Apache Spark Pool called TestPool. You need to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in TestPool. You need to load the files into the tables where structure and data types vary by file. What should you do so that the solution maintains the source data types?

a) Load data using PySpark
b) Load data using an OPENROWSET T-SQL command in the Synapse Analytics serverless SQL pool
c) Load data using Python
d) Load data using Sentinel

A

Answer: a

Explanation: PySpark can infer each file's schema and preserve the source data types when loading into the Spark pool tables. Sentinel is a security service, Python on its own is not a compute engine, and OPENROWSET in the serverless SQL pool queries the files rather than loading them into TestPool.

43
Q

43) An e-commerce company has an Azure Databricks resource which needs to log actions that relate to changes in compute for the Databricks resource. Which Databricks service should you log?

a) RDP
b) CosmosDB
c) Clusters
d) Workspace

A

Answer: c

Explanation: Compute-related actions (creating, resizing, starting, and terminating clusters) are recorded under the clusters log category in the Databricks diagnostic logs.

44
Q

44) You need to configure a batch dataset in Parquet format where data files will be generated using Azure Data Factory and stored in Azure Data Lake Storage Gen2. You are required to reduce storage costs for the files which will be consumed by an Azure Synapse analytics serverless SQL pool. What should be your first step?

a) Configure snappy compression for files
b) Store data as AVRO files
c) Create an external table
d) Use archive tier

A

Answer: a

Explanation: Snappy-compressed Parquet reduces file size, and therefore storage cost, while remaining directly queryable by the serverless SQL pool. Avro is row-based and larger, an external table doesn't reduce storage, and the archive tier can't be queried.

45
Q

45) A company has a partitioned table in an Azure Synapse Analytics dedicated SQL pool. You need to create queries to maximize the advantages of partition elimination. What should you include in your T-SQL queries?

a) WHERE
b) ORDER BY
c) SUM
d) AVG

A

Answer: a

Explanation: A WHERE filter on the partitioning column lets the optimizer skip partitions (partition elimination). ORDER BY, SUM, and AVG do not restrict which partitions are scanned.

46
Q

46) A company is planning on migrating data from the database to a star schema in a Synapse Analytics dedicated SQL pool. The current SQL Server database uses a third normal form (3NF) schema. You need to design dimension tables while optimizing read operations. What should be included in the solution?

Data transformation for dimension tables by:

a) Denormalize to 2NF
b) New Identity columns
c) Normalizing to fifth normal form

Primary key column in the dimension tables:

a) Denormalize to 2NF
b) New Identity columns
c) Normalizing to fifth normal form

A

Answer: a, b

47
Q

47) A company uses Azure Event Hub to ingest data and Azure Stream Analytics cloud job to analyze the data for a real-time data analysis solution. Currently, the cloud job is configured to use 127 Streaming Units. Which TWO actions should you perform to optimize performance for Azure Stream Analytics jobs?

a) Decrease stream units
b) Partition data input using query parallelization
c) Implement computer vision
d) Partition data output using query parallelization

A

Answer: b, d

Explanation: Best in this scenario is to partition both input and output streams to the same number of partitions.

48
Q

48) An automobile company uses Azure IoT Hub to communicate with various IoT devices. What solution should you design so that the company is able to monitor the devices in real-time?

a) Data Factory virtual machine using Azure Portal
b) Data Factory virtual machine using CLI
c) Stream Analytics job using Azure Portal
d) Data Factory virtual machine using Powershell

A

Answer: c

Explanation: The 'Data Factory virtual machine' options are not real offerings, and Data Factory is not designed for real-time monitoring. A Stream Analytics job created in the Azure portal handles the real-time data.

49
Q

49) You have created an external table named ExtTable in Azure Data Explorer. Now, a database user needs to run a KQL (Kusto Query Language) query on this external table. Which of the following functions should be used to refer to this table?

a) external_table()
b) access_table()
c) ext_table()
d) None of the above

A

Answer: a

Explanation: In KQL, an external table is referenced with the external_table('ExtTable') function.

50
Q

50) Your company wants you to ingest data onto cloud data platforms in Azure. Which data processing framework will you use?

a) OLTP
b) ETL
c) ELT

A

Answer: c

Explanation: ELT is a typical process for ingesting data from an on-premises database into Azure cloud.

51
Q

51) You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb. You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace:

CREATE TABLE mytestdb.myParquetTable (
    EmployeeId int,
    EmployeeName string,
    EmployeeStartDate date
)
USING Parquet

You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data:

  • EmployeeName: Peter
  • EmployeeId: 1001
  • EmployeeStartDate: 28-July-2022

One minute later, you execute the following query from a serverless SQL pool in MyWorkspace:

SELECT EmployeeId FROM mytestdb.dbo.myParquetTable WHERE name = “Peter”;

What will be returned by the query?

a) 24
b) An error
c) Null

A

Answer: b

Explanation: We reference ‘name’ instead of ‘EmployeeName’ and hence an error will be produced.

52
Q

52) In structured data, you define the data type at query time.

a) True
b) False

A

Answer: b

Explanation: For unstructured data the data type is defined at query time; for structured data the schema is defined up front.

53
Q

53) When you create a temporal table in Azure SQL Database, it automatically creates a history table in the same database to capture historical records. Which of the following statements is true about temporal tables and history tables (select all options that apply):

a) A temporal table must have 1 primary key
b) To create a temporal table, system versioning must be set to On
c) To create a temporal table, system versioning must be set to Off
d) It is mandatory to mention the name of the history table when you create the temporal table
e) If you don’t specify the name for the history table, the default naming convention is used for the history table
f) You can specify the table constraints for the history table

A

Answer: a, b, e

Explanation: A temporal table requires a primary key and SYSTEM_VERSIONING set to ON. Naming the history table is optional; if you omit it, a default name is generated. You cannot define constraints on the history table.
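
A minimal sketch of creating a temporal table in Azure SQL Database without naming the history table, so the default history-table name is generated (table and column names are hypothetical):

CREATE TABLE dbo.Student
(
    StudentId INT NOT NULL PRIMARY KEY CLUSTERED,
    StudentName NVARCHAR(100) NOT NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON);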

54
Q

54) To create Data Factory instances, the user account that you use to sign into Azure must be a member of (select all options that apply):

a) Contributor
b) Owner Role
c) Administrator of the Azure subscription
d) Write

A

Answer: a,b,c

55
Q

55) You need to design an application that can accept market information as an input. Using the machine-learning classification model, the application will classify the input data into two categories:

  • Car models that sell more with buyers between 18 - 40 years
  • Car models that sell more with buyers above 40

What would you recommend to train the model?

a) Power BI Models
b) Text Analytics API
c) Computer Vision API
d) Apache Spark MLlib

A

Answer: d

Explanation: Machine Learning Library

56
Q

56) You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.

Solution: You use a session window that uses a timeout size of 10 seconds.

Does this meet the goal?

a) Yes
b) No

A

Answer: b

57
Q

57) You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.

Solution: You use a sliding window, and you set the window size to 10 seconds.

Does this meet the goal?

a) Yes
b) No

A

Answer: b

58
Q

58) You are designing an Azure Stream Analytics solution that will analyze Twitter data. You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.

Solution: You use a tumbling window, and you set the window size to 10 seconds.

Does this meet the goal?

a) Yes
b) No

A

Answer: a
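
A minimal sketch of the corresponding Stream Analytics query (the input name and timestamp column are assumptions):

SELECT COUNT(*) AS TweetCount, System.Timestamp() AS WindowEnd
FROM [tweets-input] TIMESTAMP BY CreatedAt
GROUP BY TumblingWindow(second, 10)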

59
Q

59) What are the key components of Azure Data Factory? Select all that apply:

a) Database
b) Connection String
c) Pipelines
d) Activities
e) Datasets
f) Linked Services
g) Data Flows
h) Integration Runtimes

A

Answer: c, d, e, f, g, h

60
Q

60) Which of the following are valid trigger types of Azure Data Factory? Select all that apply:

a) Monthly Trigger,
b) Scheduled Trigger,
c) Overlap Trigger,
d) Tumbling window trigger,
e) Event-based trigger

A

Answer: b, d, e

61
Q

61) Duplicating customer content for redundancy and meeting service-level-agreements (SLAs) is Azure Maintainability.

a) Yes
b) No

A

Answer: b

Explanation: This is Azure High Availability

62
Q

62) You have an Azure Synapse Analytics dedicated SQL pool that contains a table named contacts. Contacts contains a column named Phone. You need to ensure that users in a specific role only see the last four digits of a phone number when querying the Phone column. What should you include in the solution?

a) Column encryption
b) Dynamic data masking
c) A default value
d) Table partitions
e) Row-level-security (RLS)

A

Answer: b

Explanation: Frequently used for masking credit card numbers, emails etc…
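
A minimal sketch using the partial() masking function to expose only the last four digits (table and column names are hypothetical):

ALTER TABLE dbo.Contacts
ALTER COLUMN Phone ADD MASKED WITH (FUNCTION = 'partial(0, "XXX-XXX-", 4)');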

63
Q

63) A company has a data lake which is accessible only via an Azure virtual network. You are building an SQL pool in Azure Synapse which will use data from the data lake and is planned to load data into the SQL pool every hour. You need to make sure that the SQL can load the data from the data lake. Which TWO actions should you perform?

a) Create a service principal
b) Create a managed identity
c) Add an Azure Active Directory Federation Services (ADFS) account
d) Configure managed identity as credentials for the data loading process

A

Answer: b, d

64
Q

64) Which role works with Azure Cognitive Services, Cognitive Search, and the Bot Framework?

a) A data engineer,
b) A data scientist
c) An AI engineer

A

Answer: c

65
Q

65) Which role is correct for a person who works being responsible for the provisioning and configuration of both on-premises and cloud data platform technologies?

a) Data engineer
b) Data scientist
c) AI engineer

A

Answer: a

66
Q

66) Who performs advanced analytics to help drive value from data?

a) Data engineer
b) Data scientist
c) AI engineer

A

Answer: b

67
Q

67) Choose the valid examples of structured data:

a) MS SQL Server
b) Binary files
c) Azure SQL Database
d) Audio files
e) Azure SQL Data Warehouse
f) Image files

A

Answer: a, c, e

68
Q

68) Choose the valid examples of unstructured data:

a) MS SQL Server
b) Binary files
c) Azure SQL Database
d) Audio files
e) Azure SQL Data Warehouse
f) Image files

A

Answer: b, d, f

69
Q

69) Azure Databricks is a:

a) Data analytics platform
b) AI platform
c) Data ingestion platform

A

Answer: a

70
Q

70) Azure Databricks encapsulates which Apache Storage technology?

a) Apache HDInsight
b) Apache Hadoop
c) Apache Spark

A

Answer: c

71
Q

71) Which of the following security features does Azure Databricks not support?

a) Azure Active Directory
b) Shared Access Keys (SAS)
c) Role-based access

A

Answer: b

Explanation: SAS is used with Azure Storage Accounts

72
Q

72) Which of the following Azure Databricks components provides support for R, SQL, Python, Scala, and Java?

a) MLlib,
b) GraphX
c) Spark Core API

A

Answer: c

73
Q

73) Which notebook format is used in Azure Databricks?

a) DBC
b) .notebook
c) .spark

A

Answer: a

Explanation: There are no .notebook or .spark file formats in Databricks

74
Q

74) You are designing a data engineering solution for data stream processing. You need to recommend a solution for data ingestion, in order to meet the following requirements:

  • Ingest millions of events per second,
  • Easily scale from streaming megabytes of data to terabytes while keeping control over when and how much to scale
  • Integrate with Azure functions
  • Natively connected with Stream Analytics to build an end-to-end serverless streaming solution.

What would you recommend?

a) Azure Cosmos DB
b) Apache Spark
c) Azure Synapse Analytics
d) Azure Event Hubs

A

Answer: d

75
Q

75) You are a data engineer implementing a lambda architecture on MS Azure. You use an open-source big data solution to collect, process, and maintain the data. The analytical data store performs poorly. You must implement a solution that meets the following requirements:

  • Provide data warehousing
  • Reduce ongoing management activities
  • Deliver SQL query responses in less than one second

You need to create an HDInsight cluster to meet the requirements. Which type of cluster should you create?

a) Apache HBase
b) Apache Hadoop
c) Interactive Query
d) Apache Spark

A

Answer: d

Explanation: Apache Spark supports interactive queries through spark-sql and has data warehousing capabilities.

76
Q

76) Which data platform technology is a globally distributed, multi-model database that can perform queries in less than a second?

a) SQL Database
b) Azure SQL Database
c) Apache Hadoop
d) Cosmos DB
e) Azure SQL Synapse

A

Answer: d

77
Q

77) The open-source world offers four types of No-SQL databases. Select all options that are applicable.

a) SQL Database
b) Apache Hadoop
c) Key-value store
d) Document database
e) Graph database
f) Column database
g) Cosmos DB
h) Azure SQL Synapse

A

Answer: c, d, e, f

78
Q

78) Azure Databricks is the least expensive choice when you want to store data but don’t need to query it.

a) Yes
b) No

A

Answer: b

Explanation: Azure Storage is the least expensive option.

79
Q

79) Unstructured data is stored in nonrelational systems, commonly called unstructured or NoSQL.

a) Yes
b) No

A

Answer: a

80
Q

80) You are designing an Azure Stream Analytics job to process incoming events from sensors in retail environments. You need to process the events to produce a running average of shopper counts during the previous 15 minutes, calculated at five minute intervals. Which type of window should you use?

a) Snapshot
b) Tumbling
c) Hopping
d) Sliding

A

Answer: c

Explanation: A hopping window is a fixed-size, overlapping window: here a 15-minute window evaluated every 5 minutes.

81
Q

81) You are implementing an Azure Data Lake Storage Gen2 account. You need to ensure that data will be accessible for both read and write operations, even if an entire data center (zonal or non-zonal) becomes unavailable. Which kind of replication would you use for the storage account?

a) Locally-redundant storage (LRS)
b) Zone-redundant storage (ZRS)
c) Geo-redundant storage (GRS)
d) Geo-zone-redundant storage (GZRS)

A

Answer: b

82
Q

82) You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data. You need to ensure that the data in the container is available for read workloads in a secondary region if the primary region has an outage. The solution must minimize costs. Which type of data redundancy should you use?

a) Geo-redundant storage (GRS)
b) Read-access-geo-redundant storage (RA-GRS)
c) Zone-redundant storage (ZRS)
d) Locally-redundant storage (LRS)

A

Answer: b

83
Q

83) You plan to implement an Azure Data Lake Storage Gen2 account. You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs. Which type of replication should you use for the storage account?

a) Locally-redundant storage (LRS)
b) Zone-redundant storage (ZRS)
c) Geo-redundant storage (GRS)
d) Geo-zone-redundant storage (GZRS)

A

Answer: b

84
Q

84) You need to design an Azure Synapse Analytics SQL pool that meets the following requirements:

  • Can return an employee record from a given point in time
  • Maintains the latest employee information
  • Minimizes query complexity

How should you model the employee data?

a) As a temporal table
b) As a SQL graph table
c) As a degenerate dimension table
d) As a Type 2 Slowly Changing Dimension (SCD) table

A

Answer: d

85
Q

85) You have an SQL pool in Azure Synapse that contains a table named dbo.Customers. The table contains a column named Email. You need to prevent non administrative users from seeing the full email addresses in the email column. The users must see the email addresses in the format of abc@xxxx.com instead. What should you do?

a) From MS SQL Server Management Studio, set an email mask on the email column.
b) From the Azure portal, set a mask on the email column.
c) From MS SQL Server Management Studio, grant the SELECT permission to the users for all of the columns in dbo.Customer table except for the Email column.
d) From the Azure Portal, set a sensitivity classification of Confidential for the Email column.

A

Answer: b

86
Q

86) You have an SQL pool in Azure Synapse. A user reports that queries against the pool take longer than expected to complete. You need to add monitoring to the underlying storage to help diagnose the issue. Which two metrics should you monitor?

a) Cache hit percentage
b) Active queries
c) Snapshot storage size
d) DWU limit
e) Cache used percentage

A

Answer: a, e

87
Q

87) You have an SQL pool in Azure Synapse. You discover that some queries fail or take a long time to complete. You need to monitor for transactions that have rolled back. Which dynamic management view should you query?

a) sys.dm_pdw_nodes_tran_database_transactions
b) sys.dm_pdw_waits
c) sys.dm_pdw_request_steps
d) sys.dm_pdw_exec_sessions

A

Answer: a
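
A minimal sketch of inspecting the DMV (filtering and column selection left out):

-- inspect transaction state and log usage per node to find rolled-back transactions
SELECT * FROM sys.dm_pdw_nodes_tran_database_transactions;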

88
Q

88) You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day. You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times. What should you include in the solution?

a) Partition by DateTime fields
b) Sink to Azure Queue storage
c) Include a watermark column
d) Use a JSON format for physical data storage

A

Answer: b

**NOTE: Databricks ABS-AQS connector is deprecated. Databricks recommends using Auto Loader instead.

89
Q

89) You have a partitioned table in an Azure Synapse Analytics dedicated SQL pool. You need to design queries to maximize the benefits of partition elimination. What should you include in the Transact-SQL statements?

a) JOIN
b) WHERE
c) DISTINCT
d) GROUP BY

A

Answer: b

Explanation: Adding a WHERE clause allows the query optimizer to access only the relevant partitions to satisfy the filter criteria of the query - which is what partition elimination is all about.
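
A minimal sketch, assuming the fact table is partitioned on a date-key column named OrderDateKey (hypothetical names):

SELECT SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
WHERE OrderDateKey >= 20220101 AND OrderDateKey < 20230101;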

90
Q

90) You have an Azure Synapse Analytics dedicated SQL pool that contains a large fact table. The table contains 50 columns and 5 billion rows and is a heap. Most queries against the table aggregate values from approximately 100 million rows and return only two columns. You discover that the queries against the fact table are very slow. Which type of index should you add to provide the fastest query times?

a) nonclustered columnstore
b) clustered columnstore
c) nonclustered
d) clustered

A

Answer: b

91
Q

91) You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. You need to configure workspace1 to support autoscaling all-purpose clusters. The solution must meet the following requirements:

  • Automatically scale down workers when the cluster is underutilized for three minutes
  • Minimize the time it takes to scale to the maximum number of workers
  • Minimize costs

What should you do first?

a) Enable container services for workspace1
b) Upgrade workspace1 to Premium pricing tier
c) Set cluster mode to high concurrency
d) Create a cluster policy in workspace1

A

Answer: b

92
Q

92) You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named VNET1. You are building an SQL pool in Azure Synapse that will use data from the data lake. Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used to assign the Sales group access to the files in the data lake. You plan to load data to the SQL pool every hour. You need to ensure that the SQL can load sales data from the data lake. Which THREE actions should you perform? Each correct answer presents a part of the solution.

a) Add the managed identity to the sales group
b) Use the managed identity as the credentials for the data load process
c) Create a shared access signature (SAS)
d) Add your Azure Active Directory account to the Sales group
e) Use the shared access signature (SAS) as the credentials for the data load process
f) Create a managed identity

A

Answer: f, a, b

93
Q

93) You are moving data from an Azure Data Lake Storage Gen2 to Azure Synapse Analytics. Which Azure Data Factory integration runtime would be used in a data copy activity?

a) Azure pipeline
b) Azure SSIS
c) Azure
d) Self hosted

A

Answer: c

Explanation: The Azure IR is used when copying data between two Azure data platforms.

94
Q

94) You are developing a solution which will use Azure Stream Analytics. The solution will accept an Azure Blob storage file named Customers. The file will contain both in-store and online customer details. The online customers will provide an email address. You have a file in Blob storage named ‘LocationIncomes’ that contains median income based on location. You must output the data to an Azure SQL database for immediate use and to Azure Data Lake Storage Gen2 for long term retention.

Solution: You implement a Stream Analytics job that has two streaming inputs, one query, and two outputs. Does this meet the goal?

a) Yes
b) No

A

Answer: b

Explanation: You implement a Stream Analytics job that has one streaming input, one reference input, one query, and two outputs

95
Q

95) You are developing a solution which will use Azure Stream Analytics. The solution will accept an Azure Blob storage file named Customers. The file will contain both in-store and online customer details. The online customers will provide an email address. You have a file in Blob storage named ‘LocationIncomes’ that contains median income based on location. You must output the data to an Azure SQL database for immediate use and to Azure Data Lake Storage Gen2 for long term retention.

Solution: You implement a Stream Analytics job that has one streaming input, one reference input, two queries, and four outputs. Does this meet the goal?

a) Yes
b) No

A

Answer: a

96
Q

96) You have an Azure Data Lake storage account that contains a staging zone. You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.

Solution: You use an Azure Data Factory scheduled trigger to execute a pipeline that copies the data to a staging table in the data warehouse, and then uses a stored procedure to execute the R script. Does this meet the goal?

a) Yes
b) No

A

Answer: a

97
Q

97) Which Azure Data Factory component contains the transformation logic or the analysis commands of the Azure Data Factory’s work?

a) Linked services
b) Datasets
c) Activities
d) Pipelines

A

Answer: c

98
Q

98) You have an Azure Data Factory that contains 10 pipelines. You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering when using the monitoring experience in Data Factory. What should you add to each pipeline?

a) a resource tag,
b) a user property
c) an annotation
d) a run group ID
e) a correlation ID

A

Answer: c

Explanation: By adding annotations, you can easily filter and search for specific factory resources

99
Q

99) You have an Azure Storage account and an Azure SQL data warehouse in the UK South region. You need to copy blob data from the storage account to the data warehouse by using Azure Data Factory. The solution must meet the following requirements:

  • Ensure that the data remains in the UK South region at all times
  • Minimize administrative effort

Which type of integration runtime should you use?

a) Azure Integration runtime
b) Self-hosted IR
c) Azure-SSIS IR

A

Answer: a

100
Q

100) You are planning to use an Azure Databricks cluster for a single user. Which type of databricks cluster should you use?

a) Standard
b) Single node
c) High concurrency

A

Answer: a

101
Q

101) You are planning to use an Azure Databricks cluster that provides fine-grained sharing for maximum resource utilization and minimum query latencies. It should also be a managed cloud resource. Which type of databricks cluster should you use?

a) Standard
b) Single node
c) High concurrency

A

Answer: c

102
Q

102) You are planning to use an Azure Databricks cluster with no workers and runs Spark jobs on the driver node. Which type of databricks cluster should you use?

a) Standard
b) Single node
c) High concurrency

A

Answer: b

103
Q

103) Which Azure Data Factory component orchestrates a transformation job or runs a data movement command?

a) Linked services
b) Datasets
c) Activities

A

Answer: c

Explanation: Activities contain the transformation logic or run the data movement command (see also questions 97 and 110); linked services only define connection information to external resources.

104
Q

104) You have an Azure virtual machine that has a Microsoft SQL Server installed. The server contains a table named Table1. You need to copy the data from Table1 into an Azure Data Lake Storage Gen2 account by using an Azure Data Factory V2 copy activity. Which type of integration runtime should you use?

a) Azure IR
b) Self-hosted IR
c) Azure SSIS IR

A

Answer: b

105
Q

105) Which browsers are recommended for the best use with Azure Databricks? Select all that apply.

a) Chrome
b) Firefox
c) Safari
d) Edge
e) Explorer
f) Mobile browsers

A

Answer: a, b, c, d

106
Q

106) How do you connect your spark cluster to the Azure blob?

a) By calling the .connect() function on the Spark cluster
b) By mounting it
c) By calling the .connect() function on the Azure blob

A

Answer: b

107
Q

107) How does Spark connect to databases like MySQL, Hive, and other data stores?

a) JDBC,
b) ODBC,
c) Using the REST API Layer

A

Answer: a

108
Q

108) You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container. Which resource provider should you enable?

a) Microsoft.sql
b) Microsoft.Automation
c) Microsoft.EventGrid
d) Microsoft.EventHub

A

Answer: c

109
Q

109) You plan to perform batch processing in Azure Databricks once daily. Which Azure Databricks cluster should you choose?

a) High concurrency
b) Interactive
c) Automated

A

Answer: c

110
Q

110) Which Azure Data Factory component contains the transformation logic or the analysis commands of the Azure Data Factory's work?

a) Linked services
b) Datasets
c) Activities
d) Pipelines

A

Answer: c

111
Q

111) You have an Azure Databricks resource. You need to log actions that relate to compute changes triggered by the Databricks resources. Which Databricks services should you log?

a) workspace
b) SSH
c) DBFS
d) clusters
e) jobs

A

Answer: d

Explanation: Cluster lifecycle events such as create, edit, resize, start, and terminate are logged under the clusters service.

112
Q

112) Which Azure data platform is commonly used to process data in an ELT framework?

a) Azure Data Factory
b) Azure Databricks
c) Azure Data Lake Storage

A

Answer: a

113
Q

113) Which Azure service is the best choice to manage and govern your data?

a) Azure Data Factory
b) Azure Purview
c) Azure Data Lake Storage

A

Answer: b

Explanation: Purview is a data governance solution

114
Q

114) Applications that publish messages to Azure Event Hub very frequently will get the best performance using Advanced Message Queuing Protocol (AMQP) because it establishes a persistent socket.

a) True
b) False

A

Answer: a

115
Q

115) You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions. You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times. What should you do?

a) Insert the data from stg.Sales to dbo.Sales
b) Switch the first partition from dbo.Sales to stg.Sales
c) Switch the first partition from stg.Sales to dbo.Sales
d) Update dbo.Sales from stg.Sales

A

Answer: b

Explanation: Switching partitions is a metadata-only operation, so it replaces the partition's contents far faster than INSERT or UPDATE statements would.

116
Q

116) You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

  • A workload for data engineers who will use Python and SQL
  • A workload for jobs that will run notebooks that use Python, Spark, Scala, and SQL
  • A workload that data scientists will use to perform ad-hoc analysis in Scala and R

The enterprise architecture team identifies the following standards for Azure Databricks environments:

  • The data engineers must share a cluster
  • The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster
  • All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the databricks clusters for the workloads.

Solution: You create a high-concurrency cluster for each data scientist, a high-concurrency cluster for the data engineers, and a Standard cluster for the jobs. Does this meet the goal?

a) Yes
b) No

A

Answer: b

117
Q

117) You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

  • A workload for data engineers who will use Python and SQL
  • A workload for jobs that will run notebooks that use Python, Spark, Scala, and SQL
  • A workload that data scientists will use to perform ad-hoc analysis in Scala and R

The enterprise architecture team identifies the following standards for Azure Databricks environments:

  • The data engineers must share a cluster
  • The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster
  • All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the databricks clusters for the workloads.

Solution: You create a Standard cluster for EACH data scientist, a High concurrency cluster for the data engineers, and a standard cluster for the jobs. Does this meet the goal?

a) Yes
b) No

A

Answer: a

Explanation: Standard clusters are recommended for a single user. Standard clusters can run workloads developed in any language: Python, R, Scala, and SQL

118
Q

118) If an event hub goes offline before a consumer group can process the events it holds, those events will be lost.

a) True
b) False

A

Answer: b

Explanation: Events are persistent.

119
Q

119) You are a data engineer for Contoso. You want to view key health metrics of your Stream Analytics jobs. Which tool in Stream Analytics should you use?

a) Dashboards
b) Alerts
c) Diagnostics

A

Answer: a

120
Q

120) Publishers can use either HTTPS or AMQP. AMQP opens a socket and can send multiple messages over that socket. How many default partitions are available?

a) 1
b) 2
c) 4
d) 8
e) 12

A

Answer: c

Explanation: Event hubs default to 4 partitions

121
Q

121) You are designing an enterprise data warehouse in Azure Synapse Analytics that will contain a table named Customers. Customers will contain credit card information. You need to recommend a solution to provide salespeople with the ability to view all the entries in Customers. The solution must prevent the salespeople from viewing or inferring the credit card information. What should you use in the recommendation?

a) data masking
b) Always encrypted
c) column-level security
d) row-level security

A

Answer: c

122
Q

122) You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool. Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily. You need to implement a solution to make the dataset available for the reports. The solution must minimize query times. What should you implement?

a) an ordered clustered columnstore index
b) a materialized view
c) result set caching
d) a replicated table

A

Answer: b

Explanation: A materialized view pre-computes and stores the results of the complex query, so the daily reports get fast performance without changing the queries themselves.
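
A minimal sketch of a materialized view in a dedicated SQL pool (object and column names are hypothetical):

CREATE MATERIALIZED VIEW dbo.mvInventorySummary
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT ProductKey,
       COUNT_BIG(*) AS RowCnt,
       SUM(ISNULL(Quantity, 0)) AS TotalQuantity
FROM dbo.FactInventory
GROUP BY ProductKey;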

123
Q

123) You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following specifications:

  • Contain sales data for 20,000 products
  • Use hash distribution on a column named ProductID
  • Contain 2.4 billion records for the years 2021 and 2022

Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?

a) 40
b) 240
c) 400
d) 2400

A

Answer: a

Explanation: Rule of thumb for sizing partitions on a clustered columnstore index: number of partitions = total rows / (1,000,000 rows per rowgroup × 60 distributions), so 2,400,000,000 / 60,000,000 = 40.
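
To sanity-check the arithmetic, a plain Python version of the same rule of thumb (values taken from the question):

# ~1,000,000 rows per rowgroup per distribution, and a dedicated SQL pool
# always spreads a table across 60 distributions.
total_rows = 2_400_000_000
rows_per_rowgroup = 1_000_000
distributions = 60

partitions = total_rows // (rows_per_rowgroup * distributions)
print(partitions)  # 40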

124
Q

124) You are designing a security model for an Azure Synapse Analytics dedicated SQL pool that will support multiple companies. You need to ensure that users from each company can view only the data of their respective company. Which TWO objects should you include in your solution? Each correct answer presents part of the solution.

a) a security policy
b) a custom role-based-access control (RBAC) role
c) a function
d) a column encryption key
e) asymmetric keys

A

Answer: a, c

Explanation: Row-level security is implemented with an inline table-valued predicate function plus a security policy that binds the function to the table; a custom RBAC role governs Azure resource management, not which rows a user can read.
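
A minimal sketch of the two objects, assuming hypothetical table/column names and that each company's users connect as a database user whose name matches the CompanyName value; shown as Python strings to be submitted to the dedicated SQL pool (each statement is a separate batch):

# 1) Hypothetical inline table-valued predicate function encoding the filter logic.
create_predicate_function = """
CREATE FUNCTION dbo.fn_CompanyPredicate (@CompanyName AS sysname)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE @CompanyName = USER_NAME();
"""

# 2) Security policy that attaches the predicate to the table as a filter.
create_security_policy = """
CREATE SECURITY POLICY CompanyFilterPolicy
ADD FILTER PREDICATE dbo.fn_CompanyPredicate(CompanyName) ON dbo.Sales
WITH (STATE = ON);
"""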

125
Q

125) You have an Azure Synapse Analytics job that uses Scala. You need to view the status of the job. What should you do?

a) From Synapse Studio, select the workspace. From monitor, select the SQL requests.
b) From Azure Monitor, run a Kusto query against the AzureDiagnostics table
c) From Synapse Studio, select the workspace. From monitor, select Apache Spark applications.
d) From Azure Monitor, run a Kusto query against the SparkLoggingEvent_CL table

A

Answer: c

126
Q

126) You have an Azure Synapse Analytics database that contains a dimension table named Stores with store information. There are 263 stores nationwide. Store information is retrieved in more than half of the queries that are issued against the database. These queries include staff information by store, sales information per store, and finance information. You want to improve the performance of these queries by configuring the table geometry of the Stores table. Which table geometry should you select for the Stores table?

a) Round robin
b) Non-clustered
c) Replicated table

A

Answer: c

Explanation: Non-clustered is not a valid table geometry, and round robin is best suited to fast data loading. Replicating a small dimension table such as Stores (263 rows) to every compute node avoids data movement for the frequent joins against it.

127
Q

127) What is the default port for connecting to an enterprise data warehouse in Azure Synapse Analytics?

a) TCP port 1344
b) UDP port 1433
c) TCP port 1433

A

Answer: c

128
Q

128) How long is the Recovery Point Objective for Azure Synapse Analytics?

a) 4 hours
b) 8 hours
c) 12 hours
d) 16 hours

A

Answer: b

129
Q

129) You have an enterprise data warehouse in Azure Synapse Analytics named DW1 on a server named Server1. You need to verify whether the size of the transaction log file for each distribution of DW1 is smaller than 160 GB. What should you do?

a) On the master database, execute a query against the sys.dm_pdw_nodes_os_performance_counters dynamic management view
b) From Azure Monitor in the Azure Portal, execute a query against the logs of DW1
c) On DW1, execute a query against the sys.database_files dynamic management view
d) Execute a query against the logs of DW1 by using the Get-AzOperationalInsightSearchResult PowerShell cmdlet

A

Answer: a

130
Q

130) You have an enterprise data warehouse in Azure Synapse Analytics. You need to monitor the data warehouse to identify whether you must scale up to a higher service level to accommodate the current workloads. Which is the best metric to monitor? More than one answer choice may achieve the goal. Select the BEST answer.

a) CPU percentage
b) DWU used
c) DWU percentage
d) Data IO percentage

A

Answer: b

Explanation: DWU is the unit in which a dedicated SQL pool's service level is provisioned, so tracking DWU used shows how close the current workload is to the provisioned capacity and whether you need to scale up.

131
Q

131) You are a data architect. The data engineering team needs to synchronize data between an on-premises Microsoft SQL Server database and an Azure SQL Database. Ad-hoc and reporting queries are overloading the on-premises production instance. The synchronization process must:

  • Perform an initial data synchronization to Azure SQL Database with minimal downtime
  • Perform bi-directional data synchronization after initial synchronization.

You need to implement this synchronization solution. Which synchronization method should you use?

a) Transactional replication
b) Data Migration Assistant (DMA)
c) Backup and Restore
d) SQL Server Agent Job
e) Azure SQL Data Sync

A

Answer: e

Explanation: Lets you synchronize the data you select bi-directionally across multiple databases, both on-premises and in the cloud

132
Q

132) You have an Azure subscription that contains an Azure Storage account. You plan to implement changes to a data storage solution to meet regulatory and compliance standards. Every day, Azure needs to identify and delete blobs that were NOT modified during the last 100 days.

Solution: You schedule an Azure Data Factory pipeline with a delete activity. Does this meet the goal?

a) Yes
b) No

A

Answer: b

Explanation: The solution is to apply an Azure Blob storage lifecycle policy.

133
Q

133) You have an Azure Storage account and a data warehouse in Azure Synapse Analytics in the UK South region. You need to copy blob data from the storage account to the data warehouse by using Azure Data Factory. The solution must meet the following requirements:

  • Ensure that the data remains in the UK South region at all times
  • Minimize administrative effort

Which type of integration runtime should you use?

a) Azure integration runtime
b) Azure SSIS integration runtime
c) Self-hosted integration runtime

A

Answer: a

134
Q

134) You want to ingest data from a SQL Server database hosted on an on-premises Windows Server. What integration runtime is required for Azure Data Factory to ingest data from the on-premises server?

a) Azure integration runtime
b) Azure SSIS integration runtime
c) Self-hosted integration runtime

A

Answer: c

135
Q

135) You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:

  • TransactionType: 40 million rows per transaction type
  • CustomerSegment: 4 million rows per customer segment
  • TransactionMonth: 65 million rows per month
  • AccountType: 500 million rows per account type

You have the following query requirements:

  • Analysts will most commonly analyze transactions for a given month
  • Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type

You need to recommend a partition strategy for the table to minimize query times. On which column should you recommend partitioning the table?

a) CustomerSegment
b) AccountType
c) TransactionType
d) TransactionMonth

A

Answer: d
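
Explanation: Analysts filter by month, so partitioning on TransactionMonth lets queries eliminate the partitions they do not need while keeping each partition large enough for healthy columnstore rowgroups. A minimal DDL sketch, assuming hypothetical column types, distribution, and boundary dates, shown as a Python string for the dedicated SQL pool:

create_fact_table = """
CREATE TABLE dbo.FactTransactions
(
    TransactionMonth  date          NOT NULL,
    TransactionType   int           NOT NULL,
    CustomerSegment   int           NOT NULL,
    AccountType       int           NOT NULL,
    Amount            decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(AccountType),  -- distribution column is illustrative
    CLUSTERED COLUMNSTORE INDEX,
    -- One partition per month; extend the boundary list as new months arrive.
    PARTITION (TransactionMonth RANGE RIGHT FOR VALUES ('2021-01-01', '2021-02-01', '2021-03-01'))
);
"""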

136
Q

136) Your company wants to route data rows to different streams based on matching conditions. Which transformation in the Mapping Data Flow should you use?

a) Conditional Split
b) Select
c) Lookup

A

Answer: a

137
Q

137) A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an Azure Stream Analytics cloud job to analyze the data. The cloud job is configured to use 120 streaming units (SUs). You need to optimize performance for the Azure Stream Analytics job. Which TWO actions should you perform? Each correct answer presents part of the solution.

a) Implement Event Ordering
b) Implement Azure Stream Analytics user-defined functions (UDF)
c) Implement query parallelization by partitioning the data output
d) Scale the SU count for the job up
e) Scale the SU count for the job down
f) Implement query parallelization by partitioning the data input

A

Answer: d, f

138
Q

138) By default, how are corrupt records dealt with using spark.read.json()?

a) They appear in a column called “_corrupt_record”
b) They get deleted automatically
c) They throw an exception and exit the read operation

A

Answer: a
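
Explanation: Under the default PERMISSIVE mode, malformed lines are kept: their parsed columns are null and the raw text is placed in the _corrupt_record column. A small PySpark sketch (the source path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default mode is PERMISSIVE: malformed JSON lines are not dropped,
# their raw text lands in _corrupt_record.
df = spark.read.json("/mnt/raw/events/")   # hypothetical source path

df.cache()   # caching avoids the restriction on querying only the corrupt record column
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)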

139
Q

139) How do you specify parameters when reading data?

a) Using .option()
b) Using .parameter() during your read allows you to pass key/value pairs specifying aspects of your read
c) Using .keys() during your read allows you to pass key/value pairs specifying aspects of your read

A

Answer: a
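
Explanation: Each .option() call passes one key/value pair that configures the read. A short PySpark sketch (the format, option values, and path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
        .format("csv")
        .option("header", "true")        # first row holds column names
        .option("inferSchema", "true")   # sample the data to infer column types
        .option("delimiter", ";")        # non-default field separator
        .load("/mnt/raw/products.csv"))  # hypothetical path

df.printSchema()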

140
Q

140) You create an Azure Databricks cluster and specify an additional library to install. When you attempt to load the library to a notebook, the library is not found. You need to identify the cause of the issue. What should you review?

a) Notebook logs
b) cluster event logs
c) global init scripts logs
d) workspace logs

A

Answer: b

141
Q

141) You are designing an Azure Databricks interactive cluster. You need to ensure that the cluster meets the following requirements:

  • Enable auto-termination
  • Retain cluster configuration indefinitely after cluster termination

What should you recommend?

a) Start the cluster after it is terminated
b) Pin the cluster
c) Clone the cluster after it is terminated
d) Terminate the cluster manually at process completion

A

Answer: b

142
Q

142) You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day. You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times. What should you include in the solution?

a) Partition by DateTime fields
b) Sink to Azure Queue storage
c) Include a watermark column
d) Use a JSON format for physical data storage

A

Answer: a

Explanation: Partitioning the table by DateTime fields lets the incremental load jobs read only the partitions for the newly arrived period; Azure Queue storage is a messaging service, not a way to persist a Databricks table, and JSON storage would increase costs.

143
Q

143) You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java. Which service should you recommend using to process the streaming data?

a) Azure Event Hubs
b) Azure Data Factory
c) Azure Stream Analytics
d) Azure Databricks

A

Answer: d

144
Q

144) Which Azure Data Factory process involves using compute services to produce data to feed production environments with cleansed data?

a) Connect and collect
b) Transform and enrich
c) Publish
d) Monitor

A

Answer: b

145
Q

145) You need to schedule an Azure Data Factory pipeline to execute when a new file arrives in an Azure Data Lake Storage Gen2 container. Which type of trigger should you use?

a) On-demand
b) Tumbling window
c) Schedule
d) Event

A

Answer: d

Explanation: A new file arriving in the container raises a storage event, so a storage event trigger can start the pipeline.

146
Q

146) You have two Azure Data Factory instances named ADFdev and ADFprod. ADFdev connects to an Azure DevOps Git repository. You publish changes from the main branch of the Git repository to ADFdev. You need to deploy the artifacts from ADFdev to ADFprod. What should you do first?

a) From ADFdev, modify the Git configuration
b) From ADFdev, create a linked service
c) From Azure DevOps, create a release pipeline
d) From Azure DevOps, update the main branch

A

Answer: c

147
Q

147) You have an Azure Data Factory. You need to examine the pipeline failures from the last 60 days. What should you use?

a) the Activity log blade for the Data Factory resource
b) The Monitor and Manage app in Data Factory
c) The resource health blade for the Data Factory resource
d) Azure Monitor

A

Answer: d

148
Q

148) You have an Azure Synapse Analytics dedicated SQL pool. You need to ensure that data in the pool is encrypted at rest. The solution must NOT require modifying applications that query the data. What should you do?

a) Enable encryption at rest for the Azure Data Lake Storage Gen2 account
b) Enable Transparent Data Encryption (TDE) for the pool
c) Use a customer-managed key to enable double encryption for the Azure Synapse workspace
d) Create an Azure key vault in the Azure subscription and grant access to the pool

A

Answer: b

Explanation: TDE helps protect against the threat of malicious activity by encrypting and decrypting your data at rest. When you encrypt your database, associated backups and transaction log files are encrypted without requiring any changes to your applications.
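
A minimal sketch, assuming a dedicated SQL pool named SQLPool1; the statement is shown as a Python string and would be run against the pool's logical server by an administrator:

# Hypothetical pool name; TDE encrypts data, backups, and logs transparently,
# so applications that query the pool need no changes.
enable_tde = "ALTER DATABASE [SQLPool1] SET ENCRYPTION ON;"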

149
Q

149) You plan to create an Azure Synapse Analytics dedicated SQL pool. You need to minimize the time it takes to identify queries that return confidential information as defined by the company’s data privacy regulations and the users who executed the queries. Which TWO components should you include in the solution?

a) sensitivity classification labels applied to columns that contain confidential information
b) resource tags for databases that contain confidential information
c) audit logs sent to a Log Analytics workspace
d) dynamic data masking for columns that contain confidential information

A

Answer: a, c

Explanation: Sensitivity classification labels identify the columns that hold confidential data, and audit logs sent to a Log Analytics workspace record which queries returned those columns and which users ran them. Dynamic data masking hides values but does not help identify queries or users.

150
Q

150) While using Azure Data Factory, you want to parameterize a linked service and pass dynamic values at run time. Which supported connector should you use?

a) Azure Data Lake Storage Gen2
b) Azure Data Factory variables
c) Azure Synapse Analytics
d) Azure Key Vault

A

Answer: c

Explanation: Azure Synapse Analytics is among the connectors for which linked service parameterization is natively supported in the authoring UI; Data Factory variables are not a connector.

151
Q

151) Which property indicates the parallelism you want the copy activity to use?

a) parallelCopies
b) stagedCopies
c) multiCopies

A

Answer: a

152
Q

152) Using the Azure Data Factory user interface (UX) you want to create a pipeline that copies and transforms data from an Azure Data Lake Storage Gen2 source to an ADLS Gen2 sink using mapping data flow. Choose the correct steps in the right order:

a) Create a data factory account
b) Create a data factory
c) Create a copy activity
d) Create a pipeline with a Data Flow activity
e) Validate copy activity
f) Build a mapping data flow with four transformations
g) Test run the pipeline
h) Monitor a data flow activity

A

Answer: b-d-f-g-h

153
Q

153) In Azure Data Factory: What is an example of a branching activity used in control flows?

a) The If-condition
b) Until-condition
c) Lookup-condition

A

Answer: a

154
Q

154) Which activity can retrieve a dataset from any of the data sources supported by Data Factory and Synapse pipelines?

a) Find activity
b) Lookup activity
c) Validate activity

A

Answer: b

155
Q

155) You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool. Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily. You need to implement a solution to make the dataset available for the reports. The solution must minimize query times. What should you implement?

a) an ordered clustered columnstore index
b) a materialized view
c) result set caching
d) a replicated table

A

Answer: b

156
Q

156) You have an Azure Subscription that contains an Azure Storage account. You plan to implement changes to a data storage solution to meet regulatory and compliance standards. Every day, Azure needs to identify and delete blobs that were NOT modified during the last 100 days. Solution: You apply an expired tag to the blobs in the storage account. Does this meet the goal?

a) Yes
b) No

A

Answer: b

Explanation: Tagging blobs as expired does not delete them; an Azure Blob storage lifecycle management policy is the intended solution.

157
Q

157) You have an Azure storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB. You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics. You need to prepare the files to ensure that the data copies quickly.

Solution: You copy the files to a table that has a columnstore index. Does this meet the goal?

a) Yes
b) No

A

Answer: b

Explanation: The correct approach is to convert the files to compressed delimited text files.

158
Q

158) You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1. You plan to create a database named DB1 in Pool1. You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool. Which format should you use for the tables in DB1?

a) CSV
b) ORC
c) JSON
d) Parquet

A

Answer: d
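
Explanation: Spark tables stored in Parquet format are synchronized from the Spark pool database to the built-in serverless SQL pool as external tables. A minimal PySpark sketch (database and table names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS DB1")

# Parquet-backed tables in the Spark database become available automatically
# to the serverless SQL pool through shared metadata.
df = spark.range(1, 11).withColumnRenamed("id", "OrderID")
df.write.format("parquet").mode("overwrite").saveAsTable("DB1.Orders")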

159
Q

159) You are designing a dimension table in Azure Synapse Analytics dedicated SQL pool. You need to create a surrogate key for the table. The solution must provide the fastest query performance. What should you use for the surrogate key?

a) a GUID column
b) a sequence object
c) an IDENTITY column

A

Answer: c
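
Explanation: The IDENTITY property generates surrogate key values locally on each of the 60 distributions without coordination, giving the fastest performance; values are unique but not guaranteed to be sequential. A minimal DDL sketch with hypothetical column names, shown as a Python string for the dedicated SQL pool:

create_dim_table = """
CREATE TABLE dbo.DimCustomer
(
    CustomerKey     int IDENTITY(1, 1) NOT NULL,  -- surrogate key
    CustomerAltKey  nvarchar(25)       NOT NULL,  -- business (natural) key
    CustomerName    nvarchar(100)      NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
"""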

160
Q

160) You are implementing a batch dataset in Parquet format. Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool. You need to minimize storage costs for the solution. What should you do?

a) Use a snappy compression for the files
b) Use OPENROWSET to query the Parquet files
c) Create an external table that contains a subset of columns from the Parquet files
d) Store all data as string in the Parquet files

A

Answer: c