DP-203 Dumps Flashcards
1) You execute the following query in an Azure Synapse Analytics Spark pool:
SELECT StudentID
FROM abc.dbo.myTable
WHERE name = 'Amit'
TABLE:
StudentName: Amit
StudentID: 69
StudentStartDate: 26/05/22
What will be the output of the query?
a) Amit
b) Error
c) 69
d) Null
Answer: b
Explanation: The table has no 'name' column (the column is StudentName), so the query fails with an error.
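For reference, a corrected query against the table shown above (using the StudentName column that actually exists) would return 69:
SELECT StudentID
FROM abc.dbo.myTable
WHERE StudentName = 'Amit'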
2) As a Data Engineer, you need to design an Azure Synapse Analytics dedicated SQL pool which can meet the following goals:
- Return student records from a given point in time,
- Maintain current student information
How should you model the student data?
a) View
b) Temporal table
c) Slowly Changing Dimension (SCD) Type 2
d) SCD Type 7
Answer: c
Explanation: An SCD Type 2 dimension keeps the current row alongside historical rows, so it can return records as of any point in time. Temporal tables are not supported in dedicated SQL pools, and SCD Type 7 doesn't exist.
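A minimal SCD Type 2 dimension sketch in dedicated SQL pool T-SQL (table and column names are illustrative, not from the question):
CREATE TABLE dbo.DimStudent
(
    StudentKey INT IDENTITY(1,1) NOT NULL, -- surrogate key
    StudentID INT NOT NULL,                -- business key
    StudentName NVARCHAR(100) NOT NULL,
    EffectiveDate DATETIME2 NOT NULL,      -- when this version became current
    EndDate DATETIME2 NULL,                -- NULL for the current row
    IsCurrent BIT NOT NULL                 -- flags the latest version
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (StudentKey));
A point-in-time query filters on EffectiveDate and EndDate; the current record is the row with IsCurrent = 1.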
3) An Azure Data Factory pipeline has the following activities:
- Copy,
- Wrangling data flow,
- Jar,
- Notebooks
Which TWO Azure services should you use to debug the activities?
a) Computer Vision
b) Data Factory
c) Azure Sentinel
d) Azure Databricks
Answer: b,d
Explanation: Computer Vision is an AI service and Azure Sentinel is a security (SIEM) service, so neither applies. Copy and wrangling data flow activities are debugged in Data Factory, while Jar and Notebook activities run on Azure Databricks.
4) A company needs to design an Azure Data Lake Storage solution which will include geo-zone-redundant storage (GZRS) for high availability.
What should you include in the monitoring solution for replication delays which can affect the recovery point objective (RPO)?
a) 4xx: Server error
b) Last sync time
c) Principle of least privilege
d) ARM template
Answer: b
Explanation: The Last Sync Time property tells you the most recent time at which writes to the primary were guaranteed to have been replicated to the secondary region, which is what determines the RPO. Options a, c, and d have nothing to do with storage redundancy.
5) An automobile company uses an Azure IoT Hub for communication with the IoT devices. What solution should you recommend if you want to monitor the devices in real-time?
a) Azure Data Factory using Visual Studio
b) Azure Stream Analytics job
c) Storage Account using Azure Powershell
d) Azure virtual machine using Azure Portal
Answer: b
Explanation: Azure Stream Analytics accepts IoT Hub as a native streaming input and can process device telemetry in real time; none of the other options provide real-time monitoring.
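A minimal Stream Analytics query sketch for this scenario (the input and output aliases are hypothetical; DeviceId and Temperature are assumed payload fields):
SELECT DeviceId, Temperature, EventEnqueuedUtcTime
INTO monitoringOutput
FROM iotHubInput
EventEnqueuedUtcTime is a system field that Stream Analytics adds to IoT Hub and Event Hubs inputs.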
6) A table will track the values of dimension attributes over the course of time and retain the history of the data by adding new rows as the data changes. Which Slowly Changing Dimension (SCD) type should you use?
a) Type -1
b) Type 1
c) Type 2
d) Type 3
Answer: c
Explanation: Type 2 retains history by adding a new row for each change. Type 1 overwrites the existing row, Type 3 tracks limited history in extra columns, and 'Type -1' is not a standard SCD type.
7) A company needs to perform batch processing in Azure Databricks once per day. Which type of databricks cluster should you use?
a) Standard
b) Interactive
c) Automated
d) Manual
Answer: c
Explanation: Automated (job) clusters are created by the job scheduler and terminate when the job finishes, which suits a once-daily batch run. Standard and interactive clusters are intended for interactive analysis, and a 'Manual' cluster type doesn't exist.
8) A company is building a streaming solution in Azure Databricks. The solution needs to count events in 5-minute intervals and report only on events that arrive during the interval; the results will be written to a Delta Lake table as an output. Which output mode should you use?
a) Complete
b) Partial
c) Append
d) Update
Answer: c
Explanation: Partial is not an output mode. Complete rewrites the whole result table on every trigger, and Update writes only rows that changed since the last trigger; Append emits each windowed count once, when its interval is finalized, which matches the requirement.
9) A company has an Azure Data Lake Storage Gen2 account called CGAmit which is protected by virtual networks. You need to design an SQL pool in Azure Synapse which will use CGAmit as the source. What should you use to authenticate to CGAmit?
a) Azure Lock
b) Shared Access Signature (SAS)
c) Active Directory Federation Services (ADFS)
d) Managed Identity
Answer: d
Explanation:
A managed identity allows the SQL pool to authenticate to a storage account that is protected by virtual networks (via the trusted Azure services exception).
An Azure Lock only protects resources against accidental deletion or modification, SAS provides delegated access to storage but is not the mechanism Synapse uses to reach VNet-protected storage, and ADFS provides single sign-on for internet-facing applications.
10) You need to recommend a solution when designing a database for transaction-fraud detection in an Azure Synapse Analytics dedicated SQL pool. It must meet the following requirements:
- Users should not be able to access the actual credit card numbers
- Users should be able to use the card data as a feature in their models
What should you suggest?
a) Row-level-security (RLS)
b) Azure Active-Directory Pass-Through authentication
c) Transparent Data Encryption (TDE)
d) Column-level security
Answer: d
Explanation:
Column-level security restricts access to the card-number column while leaving the rest of the table usable.
RLS is meant for restricting rows, not columns.
Azure AD pass-through is an authentication mechanism and is not relevant here.
TDE encrypts data at rest, but it is transparently decrypted for anyone allowed to query it.
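A minimal column-level security sketch (table, column, and role names are illustrative): granting SELECT on an explicit column list blocks access to any column left out, such as the card number:
GRANT SELECT ON dbo.Transactions
    (TransactionId, Amount, MerchantId) -- card-number column deliberately omitted
TO DataAnalysts;
A member of DataAnalysts who references the card-number column in a query gets a permission error, while the remaining columns stay usable as model features.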
11) You need to suggest which format to use for storing data in Azure Data Lake Storage Gen2 to support the following reports. The solution should minimize read times.
- Read two columns from a file which contains 69 columns:
a) Parquet
b) TSV
c) AVRO
- Query one record based on its timestamp:
a) Parquet
b) TSV
c) AVRO
Answer: a, c
Explanation: Parquet is columnar, so reading two of 69 columns is cheap; Avro is row-based, which makes retrieving a single complete record efficient.
12) As a data engineer, you need to aggregate data which originates in Kafka and is output to Azure Data Lake Storage Gen2. The testing team needs to implement the stream processing solution using Java.
Which service should you suggest to process the streaming data?
a) Azure Databricks
b) Azure Stream Analytics
c) Azure Sentinel
d) Azure Event Hub
Answer: a
Explanation:
Azure Sentinel and Azure Event Hubs don't process streaming data (Event Hubs only ingests it). Azure Stream Analytics doesn't support Java (it uses SQL and JavaScript). Azure Databricks supports Java through Spark's Structured Streaming APIs.
13) A production team needs a solution which can stream data to Azure Stream Analytics. The solution will have reference data as well as streaming data. Which TWO input types should you use for the reference data?
a) Azure DocumentDB
b) Azure Blob Storage
c) Azure Event Hub
d) Azure SQL Database
Answer: b, d
Explanation:
Stream Analytics accepts reference data only from Azure Blob Storage (or Data Lake Storage) and Azure SQL Database.
DocumentDB is not a supported input, and Event Hubs is a streaming input, not a reference input.
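A sketch of how reference data is consumed in the Stream Analytics query language (input aliases are hypothetical): the stream input is joined to a reference input backed by Blob Storage or SQL Database:
SELECT s.DeviceId, s.Reading, r.DeviceName
FROM streamInput s
JOIN deviceRef r
    ON s.DeviceId = r.DeviceId
Unlike stream-to-stream joins, reference-data joins need no DATEDIFF time bound.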
14) You need to ensure that data in the Azure Synapse Analytics dedicated SQL pool is encrypted at rest. The solution should NOT modify applications which query the data. What should you implement?
a) Enable Transparent Data Encryption (TDE)
b) Upgrade to Premium P2 license
c) Create Azure functions
d) Use customer managed keys
Answer: a
Explanation:
Nothing in the question relates to licensing, and Azure Functions has nothing to do with encryption.
Customer-managed keys are configured at the workspace level (for double encryption) and require key-management changes; enabling TDE with the default service-managed key encrypts data at rest transparently, so applications need no modification.
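Enabling TDE is a single statement and is transparent to querying applications (the pool name is illustrative):
-- run while connected to the master database
ALTER DATABASE [MySqlPool] SET ENCRYPTION ON;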
15) As a data engineer, you need to suggest an Azure Databricks cluster configuration which can meet the following requirements:
- Minimize cost,
- Reduce query latency,
- Maximize the number of users that can execute queries on the cluster simultaneously
Which cluster type should you suggest?
a) High concurrency cluster with auto termination
b) High concurrency cluster with autoscaling
c) Standard cluster with auto termination
d) Standard cluster with autoscaling
Answer: b
Explanation:
A high concurrency cluster shares resources efficiently among many simultaneous users, which a standard cluster does not.
Autoscaling matches the number of workers to the load, minimizing cost while keeping latency low; auto termination shuts the cluster down when idle, so restart delays would increase query latency.
16) A company needs to trigger an Azure Data Factory pipeline as soon as a file arrives in an Azure Data Lake Storage Gen2 container. Which resource should you use?
a) Microsoft.EventGrid
b) Microsoft.EventHub
c) Microsoft.IoT
d) Microsoft.CosmosDB
Answer: a
Explanation:
Data Factory storage event triggers are built on Azure Event Grid, so the Microsoft.EventGrid resource provider must be registered in the subscription.
Event Hubs is for telemetry ingestion, and IoT Hub/Cosmos DB have nothing to do with file-arrival events.
17) As a data engineer, you need to make sure that you can audit access to Personally Identifiable Information (PII) while designing an Azure Synapse Analytics dedicated SQL pool. What should you include?
a) RLS
b) Column-level security
c) Security baseline
d) Sensitivity classifications
Answer: d
Explanation:
Sensitivity classifications label PII columns so that access to them is recorded in the audit logs.
RLS is meant for restricting rows and column-level security for restricting access to columns; neither provides auditing.
A security baseline only provides general security guidance.
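A sketch of classifying a PII column (table, column, and label values are illustrative); once classified, access to the column is recorded in the audit log's data_sensitivity_information field:
ADD SENSITIVITY CLASSIFICATION TO dbo.Students.Email
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');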
18) You need to design a date dimension table in an Azure Synapse Analytics dedicated SQL pool. As per the business requirement, the date dimension table will be used by all fact tables. Which distribution type should you recommend to minimize data movement?
a) Hash
b) Asterisk
c) Replicate
d) Round robin
Answer: c
Explanation:
For FACT tables, HASH distribution is used.
For small DIMENSION tables, REPLICATE is used.
For STAGING tables, ROUND ROBIN is used.
There is no ASTERISK distribution type in Azure.
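A minimal replicated date dimension in dedicated SQL pool T-SQL (names are illustrative):
CREATE TABLE dbo.DimDate
(
    DateKey INT NOT NULL,
    CalendarDate DATE NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (DateKey)
);
Because a full copy of a replicated table is cached on every compute node, joins from any fact table require no data movement.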
19) As a data engineer, you need to create a new notebook in Azure Databricks which will support Python as the primary language and should also support R and Scala. What should you use to switch between languages within the notebook?
a) %
b) #
c) @{}
d) @[]
Answer: a
Explanation: Databricks notebook cells are switched between languages with magic commands prefixed by % (%python, %r, %scala, %sql).
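For example, in a notebook whose default language is Python, a cell can be switched to SQL with the %sql magic (the table name is illustrative); %r and %scala work the same way:
%sql
SELECT StudentID FROM students WHERE StudentName = 'Amit'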
20) A company has an Azure Synapse Analytics dedicated SQL pool which contains a huge fact table. The table contains 47 columns and 4.7 billion rows and is stored as a heap. On average, queries against the table aggregate values from approximately 69 million rows and return only two columns. You notice that queries against the fact table are extremely slow. Which type of index should you add to provide the fastest query times?
a) Non-clustered column store
b) Clustered index
c) Semi-clustered index
d) Clustered column store
Answer: d
Explanation:
Non-clustered columnstore indexes are not supported in dedicated SQL pools.
A clustered (rowstore) index performs best on tables with fewer than about 60 million rows.
A semi-clustered index does not exist.
A clustered columnstore index is usually the best choice for large tables, particularly for aggregations that touch only a few columns.
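Converting the heap is a single statement (index and table names are illustrative):
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;
Because columnstore data is stored per column, the two requested columns can be scanned without reading the other 45.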
21) An e-commerce company needs to make sure that an Azure Data Lake Storage Gen2 container is available for read workloads in a secondary region if an outage happens in the primary region. Which type of redundancy should you recommend so that your solution minimizes costs?
a) Geo-Zone-Redundant-Storage (GZRS)
b) Geo-Redundant-Storage (GRS)
c) Locally-Redundant-Storage (LRS)
d) Read-Access-Geo-Redundant-Storage (RA-GRS)
Answer: d
Explanation:
RA-GRS provides read access to the secondary region without requiring a failover, and it costs less than GZRS.
With plain GRS the secondary copy is not readable until a failover occurs, and LRS replicates data within a single datacenter only.
22) As a data engineer, you need to configure an Azure Databricks workspace which is currently in the Standard pricing tier to support autoscaling all-purpose clusters. The solution should meet the following requirements:
- Reduce time taken to scale the number of workers while minimizing costs
- Automatically scale down workers when the cluster is underutilized for five minutes
What should be your first step?
a) Upgrade Azure Databricks workspace to Premium pricing tier
b) Create logic apps for the workspace
c) Enable a log analytics workspace
d) Create a storage account
Answer: a
Explanation: Optimized autoscaling, which scales workers up and down quickly and scales down after a short period of underutilization, requires the Premium pricing tier; Standard-tier workspaces get only standard autoscaling.
23) A company uses Azure Stream Analytics to accept data from Azure Event Hubs and to output the data to an Azure Blob Storage account. As a data engineer, you need to output the count of records received from the last 7 minutes, every minute. Which window function should you use?
a) Sliding
b) Tumbling
c) Hopping
d) Snapshot
Answer: c
Explanation: A hopping window has a fixed size (7 minutes here) and a hop (1 minute), so windows overlap and a result is emitted every minute. Tumbling windows don't overlap, sliding windows emit results only when the window content changes, and snapshot windows group events that share the same timestamp.
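A sketch of the query (input and output aliases are hypothetical): a 7-minute window that hops forward every minute, emitting a count each minute:
SELECT COUNT(*) AS EventCount
INTO blobOutput
FROM eventHubInput
GROUP BY HoppingWindow(Duration(minute, 7), Hop(minute, 1))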
24) An Azure Data Factory pipeline needs to meet the following requirements:
- Support backfilling existing data in the source table
- Automatically retry execution if the pipeline fails due to throttling limits or concurrency issues
Which type of trigger should you recommend?
a) Schedule
b) Tumbling window
c) Hopping
d) Snapshot
Answer: b
Explanation:
Hopping and Snapshot are window functions, not trigger types.
A schedule trigger can run a pipeline periodically, but only a tumbling window trigger supports backfill scenarios and a retry policy for failed runs.