Certification Flashcards
You are developing an Azure app named App1 that will store job candidate data. App1 will be deployed to three Azure regions and store a resume and five photos for each candidate.
You need to design a partition solution for App1. The solution must meet the following requirements:
The time it takes to retrieve the resume files must be minimized.
Candidate data must be stored in the same region as the candidate.
What should you include in the solution?
Multiple storage accounts with two containers per account. One storage account per region keeps each candidate's data in the candidate's region, and separate containers for resumes and photos minimize the time it takes to retrieve the resume files.
You have 100 retail stores distributed across Asia, Europe, and North America.
You are developing an analytical workload using Azure Synapse Analytics that contains sales data for stores in different regions. The workload contains a fact table with the following columns:
Date: Contains the order date
Customer: Contains the customer ID
Store: Contains the store ID
Region: Contains the region ID
Product: Contains the product ID
Price: Contains the unit price per product
Quantity: Contains the quantity sold
Amount: Contains the price multiplied by quantity
You need to design a partition solution for the fact table. The solution must meet the following requirements:
Optimize read performance when querying sales data for a single region in a given month.
Optimize read performance when querying sales data for all regions in a given month.
Minimize the number of partitions.
Which column should you use for partitioning?
- Region
- Time
- Product
- Customer
- Price
Time - Partitioning on the time (date) column means that a query for a given month, whether for one region or for all regions, only reads the partitions for that month, and it produces far fewer partitions than a more granular column such as Product or Customer. Partitioning by Region would not help the query that spans all regions.
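As a rough illustration, date-based partitioning of such a fact table in a dedicated SQL pool could look like the sketch below; the table name, distribution choice, and boundary values are illustrative, not from the question.

-- Fact table partitioned by month on the Date column.
CREATE TABLE dbo.FactSales
(
    [Date]   DATE           NOT NULL,
    Customer INT            NOT NULL,
    Store    INT            NOT NULL,
    Region   INT            NOT NULL,
    Product  INT            NOT NULL,
    Price    DECIMAL(18, 2) NOT NULL,
    Quantity INT            NOT NULL,
    Amount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(Store),        -- distribution column is illustrative
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([Date] RANGE RIGHT FOR VALUES
        ('2024-01-01', '2024-02-01', '2024-03-01'))  -- one boundary per month
);

A month filter then touches only the relevant partition, regardless of how many regions the query covers.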
You are importing data into an Azure Synapse Analytics database. The data is being inserted by using PolyBase.
You need to maximize network throughput for the import process.
What should you use?
-Sharding
-Vertical Partitioning
-Horizontal Partitioning
-Functional Partitioning
Shard the source data across multiple files. Sharding the source data into multiple files will increase the amount of bandwidth available to the import process.
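A hedged sketch of the PolyBase side of this, assuming the sharded files sit in one folder of a Data Lake Storage account; all object names and the storage URL are placeholders.

-- External data source and file format for the staged files.
-- A DATABASE SCOPED CREDENTIAL would normally be added for non-public storage.
CREATE EXTERNAL DATA SOURCE SourceLake
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://data@sourcestg.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- LOCATION points at the folder, so every sharded file is read in parallel.
CREATE EXTERNAL TABLE dbo.StagingSales
(
    SaleId INT,
    Amount DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = SourceLake,
      FILE_FORMAT = CsvFormat);

-- Load into the pool with CTAS so the import runs in parallel.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.StagingSales;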
You have an app named App1 that contains two datasets named dataset1 and dataset2. App1 frequently queries dataset1. App1 infrequently queries dataset2.
You need to prevent queries to dataset2 from affecting the buffer pool and aging out the data in dataset1.
Which type of partitioning should you use?
-Vertical Partitioning
-Horizontal Partitioning
-Functional Partitioning
vertical - By using vertical partitioning, different parts of the database can be isolated from each other to improve cache use.
You are designing a database solution that will host data for multiple business units.
You need to ensure that queries from one business unit do not affect the other business units.
Which type of partitioning should you use?
-Vertical Partitioning
-Horizontal Partitioning
-Functional Partitioning
functional - By using functional partitioning, different users of the database can be isolated from each other to ensure that one business unit does not affect another business unit.
You want to query the first 100 rows of a Parquet file in Azure Synapse using SQL serverless pool. How would this look?
SELECT TOP 100 * FROM OPENROWSET(BULK 'https://app1synstg.dfs.core.windows.net/data/Data.parquet', FORMAT = 'PARQUET') AS result;
You need to create a one-to-many relationship between tables in a Retail and Business Metrics database. What would you do first?
- Join Tables
- Create a Database
- Create a Schema
- Map the columns
Create a database.
You create a Microsoft Purview account and add an Azure SQL Database data source that has data lineage scan enabled.
You assign the Microsoft Purview account a managed identity and grant that identity the db_owner role for the database.
After scanning the data source, you are unable to obtain any lineage data for the tables in the database.
You need to create lineage data for the tables.
What should you do?
-Use SQL authentication
-Create a master key in the database
-Use a user-managed service principal
Create a master key in the database.
You need a master key in the Azure SQL database for lineage to work.
Using SQL authentication would only change how the data lineage scan authenticates Microsoft Purview to the data source.
Using a user-managed service principal likewise only changes how Microsoft Purview authenticates to the data source. Neither supplies what is missing: a master key in the database.
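For reference, creating the master key is a one-liner run in the scanned Azure SQL database (the password is a placeholder):

-- Run inside the database that Microsoft Purview scans for lineage.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';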
You need a fast way to design a healthcare provider’s database solution.
-Use an ARM template
-Use an Azure Synapse Analytics database template
Azure Synapse Analytics database templates. The industry-specific database templates (healthcare included) provide a predefined schema to start from.
You need to run an HDInsight Hive script in Azure Data Factory and output to a specific storage folder.
What url would you use to reference the account?
-wasb://data@datastg.blob.core.windows.net/devices/
-http://datastg.blob.core.windows.net/devices/
wasb://data@datastg.blob.core.windows.net/devices/
You need to denormalize data in Azure Synapse Analytics by substituting product IDs with product names
-derived column.
-lookup
Lookup
You need to write output from a select task to multiple sinks in Azure Data Factory.
-conditional split
-new branch
New Branch
You plan to build an event processing solution.
You need to ensure that the solution will support real-time processing and batch processing of events.
Which two services should you include in the solution? Each correct answer presents part of the solution.
- Event Hubs
- Stream Analytics
- Synapse
- Data Factory
Event Hubs and Azure Stream Analytics
You have an Azure Synapse Analytics workspace named workspace1.
You plan to write new data and update existing rows in workspace1.
You create an Azure Synapse Analytics sink to write the processed data to workspace1.
You need to configure the writeBehavior parameter for the sink. The solution must minimize the number of pipelines required.
What should you use?
- Merge
- Insert
- Update
- Upsert
Upsert
You have an Azure Data Lake Storage account named store.dfs.core.windows.net and an Apache Spark notebook named Notebook1.
You plan to use Notebook1 to load and transform data in store.dfs.core.windows.net.
You need to configure the connection string for Notebook1.
Which URI should you use?
-abfss://container@store.dfs.core.windows.net/products.csv
-wasb://container@store.dfs.core.windows.net/products.csv
abfss://container@store.dfs.core.windows.net/products.csv
To access Data Lake Storage from a Spark notebook, use the Azure Blob Filesystem driver (ABFS) and the abfss:// scheme.
The key here is that the account is Azure Data Lake Storage (a dfs endpoint), not Blob Storage, so wasb:// does not apply.
You have an Azure subscription that contains an Azure Synapse Analytics workspace.
You use the workspace to perform ELT activities that can take up to 30 minutes to complete.
You develop an Azure function to stop the compute resources used by Azure Synapse Analytics during periods of zero activity.
You notice that it can take more than 20 minutes for the compute resources to stop.
You need to minimize the time it takes to stop the compute resources. The solution must minimize the impact on running transactions.
How should you change the function?
- Check the sys.dm_operation_status dynamic management view until no transactions are active in the database before stopping the compute resources.
- Add a 20-minute timer
- Set the database to read-only before stopping the compute resources
Check the sys.dm_operation_status dynamic management view until no transactions are active in the database before stopping the compute resources.
Checking sys.dm_operation_status until no transactions are active ensures that any running transaction finishes before the compute nodes are stopped. If you stop a node while a transaction is running, the transaction is rolled back, which can take time.
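A minimal sketch of the check the function could poll before pausing the pool; the columns come from the documented sys.dm_operation_status view, and how you filter on the state is left as an assumption.

-- Poll until no returned operation is still pending or in progress,
-- then it is safe to stop the compute resources.
SELECT operation, state_desc, percent_complete, start_time, last_modify_time
FROM sys.dm_operation_status;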
You have an Azure subscription that contains a Delta Lake solution. The solution contains a table named employees.
You need to view the contents of the employees table from 24 hours ago. You must minimize the time it takes to retrieve the data.
What should you do?
- Time Stamp as Of
- Version as Of
Query the table by using TIMESTAMP AS OF.
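In Spark SQL against Delta Lake, the time-travel query could look like the line below; the literal timestamp is a placeholder standing in for "roughly 24 hours ago".

-- Read the employees table as it looked at the given point in time.
SELECT * FROM employees TIMESTAMP AS OF '2024-05-01 09:00:00';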
You are developing an Azure Databricks solution.
You need to ensure that workloads support PyTorch code. The solution must minimize costs.
Which workload persona should you use?
-Data Engineering
-Machine Learning
-Databricks SQL
Machine Learning
You are writing a data import task in Azure Data Factory.
You need to increase the number of rows per call to the REST sink
What should you change?
- writeBatchSize to 1,000
- writeBatchSize to 10,000
- writeBatchSize to 100,000
writeBatchSize to 100,000
To increase the number of records per batch, we need to increase the writeBatchSize. The default value for this parameter is 10,000, so to increase this we need to use a value that is higher than the default.
You have an Azure Stream Analytics solution that receives data from multiple thermostats in a building.
You need to write a query that returns the average temperature per device every five minutes for readings within that same five minute period.
Which two windowing functions could you use?
Tumbling
Hopping
Sliding
Snapshot
Tumbling Window & Hopping Window
Tumbling windows have a defined period and aggregate all events that fall within that period. A tumbling window is a specific case of a hopping window in which the hop size equals the window size.
Hopping windows have a defined window size and hop forward by a separate, potentially different period, so consecutive windows can overlap.
Sliding windows produce output only when an event enters or exits the window, rather than at fixed intervals.
Snapshot windows aggregate all events with the same timestamp.
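A sketch of the tumbling-window version in Stream Analytics query language; the input, output, and column names (Input, Output, DeviceId, Temperature, EventTime) are assumptions, not from the question.

SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp() AS WindowEnd
INTO Output
FROM Input TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 5);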
You are building a real-time streaming process in Azure Data Factory.
You need to aggregate the data being processed by the stream.
Which stage of the integration pattern should you configure?
Extract
Transform
Load
Transform
You have an Azure Data Factory pipeline named Pipeline1.
You need to ensure that Pipeline1 runs when an email is received.
What should you use to create the trigger?
Azure Data Factory
Azure Logic App
An Azure Logic App
You have an Azure Data Factory pipeline named Pipeline1. Pipeline1 executes many API write operations every time it runs. Pipeline1 is scheduled to run every five minutes.
After executing Pipeline1 10 times, you notice the following entry in the logs.
Type=Microsoft.DataTransfer.Execution.Core.ExecutionException,Message=There are substantial concurrent MappingDataflow executions which is causing failures due to throttling under Integration Runtime 'AutoResolveIntegrationRuntime'.
You need to ensure that you can run Pipeline1 every five minutes.
What should you do?
-Create a new integration runtime and a new Pipeline as a copy of Pipeline1. Configure both pipelines to run every 10 minutes, five minutes apart.
-Change the compute size
-Add a second trigger setting each to run every 10 minute, 5 minutes apart
Create a new integration runtime and a new Pipeline as a copy of Pipeline1. Configure both pipelines to run every 10 minutes, five minutes apart.
The throttling error shows that the AutoResolveIntegrationRuntime is the bottleneck; adding a second integration runtime and splitting the load across two pipelines removes the contention.
You have an Azure Data Factory pipeline named Pipeline1. Pipeline1 includes a data flow activity named Dataflow1. Dataflow1 uses a source named source1. Source1 contains 1.5 million rows.
Dataflow1 takes 20 minutes to complete.
You need to debug Pipeline1. The solution must reduce the number of rows that flow through the activities in Dataflow1.
What should you do?
-Set the filter by last modified setting in source1
-Enable Sampling in source 1
-Enable staging in pipeline 1
-Add a new integration runtime for pipeline1
Enable sampling in source1.
Enabling sampling in source1 allows you to specify how many rows to retrieve.
You are testing a change to an Azure Data Factory pipeline.
You need to check the change into source control without affecting other users’ work in the data factory.
What should you do?
Save the change to a forked branch in the source control project.
You have an Azure Synapse Analytics data pipeline.
You need to run the pipeline at scheduled intervals.
What should you configure?
A Trigger
A Schedule
A Debug Run
A trigger
You are developing an Apache Spark pipeline to transform data from a source to a target.
You need to filter the data in a column named Category where the category is cars.
Which command should you run?
df.select("ProductName", "ListPrice").where(df["Category"] == "Cars")
The correct format chains .where after the select statement and compares the Category column to the value.
You have a database named DB1 and a data warehouse named DW1.
You need to ensure that all changes to DB1 are stored in DW1.The solution must capture the new value and the existing value and store each value as a new record.
What should you include in the solution?
-Change Data Capture
-Change Data Feed
-Change Tracking
change data capture
The key here is that the solution must CAPTURE the new value, not just identify the change.
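For reference, enabling change data capture on the source looks roughly like this; the schema and table names are placeholders.

-- Enable CDC at the database level, then per table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;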
You have a database named DB1 and a data warehouse named DW1.
You need to ensure that all changes to DB1 are stored in DW1. The solution must meet the following requirements:
Identify that a row has changed, but not the final value of the row.
Minimize the performance impact on the source system.
What should you include in the solution?
Change Tracking
Change tracking captures the fact that a row was changed without tracking the data that was changed. Change tracking requires less server resources than change data capture.
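A minimal sketch of enabling change tracking; DB1 comes from the question, while the table name is a placeholder.

-- Enable change tracking on the database, then on each table to track.
ALTER DATABASE DB1
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Orders
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);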
You plan to configure an Azure Stream Analytics job named Job1.
You need to identify which components Job1 requires to perform event processing and analyze streaming data.
Which three components should you identify? Each correct answer presents part of the solution.
A Query
An Input
An Output
You have an Azure subscription that contains an Azure Stream Analytics solution.
You need to write a query that calculates the average rainfall per hour. The solution must segment the data stream into a contiguous series of fixed-size, non-overlapping time segments.
Which windowing function should you use?
-hopping
-tumbling
-sliding
-snapshot
tumbling - tumbling window functions segment a data stream into a contiguous series of fixed-size, non-overlapping time segments.
You use an Azure Databricks pipeline to process a stateful streaming operation.
You need to reduce the amount of state data to improve latency during a long-running streaming operation.
What should you use in the streaming DataFrame?
-Watermark
-Partition
Watermarks - Watermarks interact with output modes to control when data is written to the sink. Because watermarks reduce the total amount of state information to be processed, effective use of watermarks is essential for efficient stateful streaming throughput.
You have an Azure Data Factory pipeline that includes two activities named Act1 and Act2. Act1 and Act2 run in parallel.
You need to ensure that Act2 will only run once Act1 completes.
Which dependency condition should you configure?
-Failure on Act 1
-Success on Act 1
-Completed on Act 1
Completed
You need to implement encryption at rest by using transparent data encryption (TDE).
You implement a master key.
What should you do next?
- Create a certificate that is protected by the master key
- Backup the Master Database
- Create Data Encryption
Create a certificate that is protected by the master key.
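Roughly how that sequence looks in T-SQL; the certificate, key, and database names are placeholders, and Azure SQL Database normally uses service-managed TDE, so this certificate-based flow applies where you manage the keys yourself.

-- In master: the master key, then a certificate protected by it.
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';

-- In the user database: an encryption key protected by the certificate, then enable TDE.
USE DB1;
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TdeCert;

ALTER DATABASE DB1 SET ENCRYPTION ON;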
You are implementing an application that queries a table named Purchase in an Azure Synapse Analytics Dedicated SQL pool.
The application must show data only for the currently signed-in user.
You use row-level security (RLS), implement a security policy, and implement a function that uses a filter predicate.
Users in the marketing department report that they cannot see their data.
What should you do to ensure that the marketing department users can see their data?
- Add a blocking predicate
- Grant SELECT permissions on the purchase table to all marketing users
- Grant SELECT permission to the function
- Rebuild the Schema
Grant the SELECT permission on the Purchase table to the Marketing users
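The fix is just a permission grant, sketched below; the marketing role name is a placeholder, and the existing RLS filter predicate then trims the rows each user can see.

-- RLS filters rows per user, but users still need SELECT on the table itself.
GRANT SELECT ON dbo.Purchase TO MarketingRole;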
You use Azure Data Factory to connect to a notebook that runs in an Azure Databricks cluster. The connection is set to use access tokens.
You need to revoke a user’s token.
What should you use?
-Token Management API 2.0
-Admin Console
-Conditional Access Restriction Policies
-IAM permission Adjustments
Token Management API 2.0
Because the connection is set to use access tokens (not IAM), the Token Management API 2.0 is what revokes a user's token.
You have an Azure subscription that contains the following resources:
An Azure Synapse Analytics workspace named workspace1
A virtual network named VNet1 that has two subnets named sn1 and sn2
Five virtual machines that are connected to sn1
You need to ensure that the virtual machines can connect to workspace1.
The solution must prevent traffic from the virtual machines to workspace1 from traversing the public internet.
What should you create?
- Network Peering
- Application Gateway
- Private Endpoint
- Service endpoint
Private Endpoint
You have an Azure Synapse Analytics workspace.
You need to measure the performance of SQL queries running on the dedicated SQL pool.
Which two actions achieve the goal? Each correct answer presents a complete solution
- From the Monitor page of Azure Synapse Studio, review the SQL requests tab.
- Query the sys.dm_pdw_exec_requests view.
- Add a Service endpoints
- Configure Network peering,
- From the Monitor page of Azure Synapse Studio, review the SQL requests tab.
- Query the sys.dm_pdw_exec_requests view.
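A minimal example of the DMV approach; the chosen columns are common ones from sys.dm_pdw_exec_requests.

-- Longest-running requests in the dedicated SQL pool.
SELECT TOP 10 request_id, status, submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;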
You have an Azure Synapse Analytics workspace.
You need to configure the diagnostics settings for pipeline runs. You must retain the data for auditing purposes indefinitely and minimize costs associated with retaining the data.
Which destination should you use?
- Azure Monitor Log
- Archive to a storage account
- Send the Data to a Data Partner
Archive to a storage account.