Certification Flashcards
You are developing an Azure app named App1 that will store job candidate data. App1 will be deployed to three Azure regions and store a resume and five photos for each candidate.
You need to design a partition solution for App1. The solution must meet the following requirements:
The time it takes to retrieve the resume files must be minimized.
Candidate data must be stored in the same region as the candidate.
What should you include in the solution?
Multiple storage accounts with two containers per account. One storage account per region keeps each candidate's data in the candidate's region, and separate containers for resumes and photos minimize the time it takes to retrieve the resume files.
You have 100 retail stores distributed across Asia, Europe, and North America.
You are developing an analytical workload using Azure Synapse Analytics that contains sales data for stores in different regions. The workload contains a fact table with the following columns:
Date: Contains the order date
Customer: Contains the customer ID
Store: Contains the store ID
Region: Contains the region ID
Product: Contains the product ID
Price: Contains the unit price per product
Quantity: Contains the quantity sold
Amount: Contains the price multiplied by quantity
You need to design a partition solution for the fact table. The solution must meet the following requirements:
Optimize read performance when querying sales data for a single region in a given month.
Optimize read performance when querying sales data for all regions in a given month.
Minimize the number of partitions.
Which column should you use for partitioning?
- Region
- Time
- Product
- Customer
- Price
Time - Partitioning on the time (date) column means that a query for a given month, whether for one region or for all regions, only reads the partitions for that month, and it produces far fewer partitions than a more granular column such as Product or Customer. Partitioning by Region would not help the query that spans all regions.
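As a rough illustration, date-based partitioning of such a fact table in a dedicated SQL pool could look like the sketch below; the table name, distribution choice, and boundary values are illustrative, not from the question.

-- Fact table partitioned by month on the Date column.
CREATE TABLE dbo.FactSales
(
    [Date]   DATE           NOT NULL,
    Customer INT            NOT NULL,
    Store    INT            NOT NULL,
    Region   INT            NOT NULL,
    Product  INT            NOT NULL,
    Price    DECIMAL(18, 2) NOT NULL,
    Quantity INT            NOT NULL,
    Amount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(Store),        -- distribution column is illustrative
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([Date] RANGE RIGHT FOR VALUES
        ('2024-01-01', '2024-02-01', '2024-03-01'))  -- one boundary per month
);

A month filter then touches only the relevant partition, regardless of how many regions the query covers.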
You are importing data into an Azure Synapse Analytics database. The data is being inserted by using PolyBase.
You need to maximize network throughput for the import process.
What should you use?
-Sharding
-Vertical Partitioning
-Horizontal Partitioning
-Functional Partitioning
Shard the source data across multiple files. Sharding the source data into multiple files will increase the amount of bandwidth available to the import process.
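A hedged sketch of the PolyBase side of this, assuming the sharded files sit in one folder of a Data Lake Storage account; all object names and the storage URL are placeholders.

-- External data source and file format for the staged files.
-- A DATABASE SCOPED CREDENTIAL would normally be added for non-public storage.
CREATE EXTERNAL DATA SOURCE SourceLake
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://data@sourcestg.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- LOCATION points at the folder, so every sharded file is read in parallel.
CREATE EXTERNAL TABLE dbo.StagingSales
(
    SaleId INT,
    Amount DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = SourceLake,
      FILE_FORMAT = CsvFormat);

-- Load into the pool with CTAS so the import runs in parallel.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.StagingSales;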
You have an app named App1 that contains two datasets named dataset1 and dataset2. App1 frequently queries dataset1. App1 infrequently queries dataset2.
You need to prevent queries to dataset2 from affecting the buffer pool and aging out the data in dataset1.
Which type of partitioning should you use?
-Vertical Partitioning
-Horizontal Partitioning
-Functional Partitioning
vertical - By using vertical partitioning, different parts of the database can be isolated from each other to improve cache use.
You are designing a database solution that will host data for multiple business units.
You need to ensure that queries from one business unit do not affect the other business units.
Which type of partitioning should you use?
-Vertical Partitioning
-Horizontal Partitioning
-Functional Partitioning
functional - By using functional partitioning, different users of the database can be isolated from each other to ensure that one business unit does not affect another business unit.
You want to query the first 100 rows of a Parquet file in Azure Synapse using SQL serverless pool. How would this look?
SELECT TOP 100 * FROM OPENROWSET(BULK 'https://app1synstg.dfs.core.windows.net/data/Data.parquet', FORMAT = 'PARQUET') AS result;
You need to create a one-to-many relationship between tables in a Retail and Business Metrics database. What would you do first?
- Join Tables
- Create a Database
- Create a Schema
- Map the columns
Create a database.
You create a Microsoft Purview account and add an Azure SQL Database data source that has data lineage scan enabled.
You assign the Microsoft Purview account a managed identity and grant that identity the db_owner role for the database.
After scanning the data source, you are unable to obtain any lineage data for the tables in the database.
You need to create lineage data for the tables.
What should you do?
-Use SQL authentication
-Create a master key in the database
-Use a user-managed service principal
Create a master key in the database.
You need a master key in the Azure SQL database for lineage to work.
Using SQL authentication would only change how the data lineage scan authenticates Microsoft Purview to the data source.
Using a user-managed service principal likewise only changes how Microsoft Purview authenticates to the data source. Neither supplies what is missing: a master key in the database.
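For reference, creating the master key is a one-liner run in the scanned Azure SQL database (the password is a placeholder):

-- Run inside the database that Microsoft Purview scans for lineage.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';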
You need a fast way to design a healthcare provider’s database solution.
-Use an ARM template
-Use an Azure Synapse Analytics database template
Azure Synapse Analytics database templates. The industry-specific database templates (healthcare included) provide a predefined schema to start from.
You need to run an HDInsight Hive script in Azure Data Factory and output to a specific storage folder.
What url would you use to reference the account?
-wasb://data@datastg.blob.core.windows.net/devices/
-http://datastg.blob.core.windows.net/devices/
wasb://data@datastg.blob.core.windows.net/devices/
You need to denormalize data in Azure Synapse Analytics by substituting product IDs with product names
-derived column.
-lookup
Lookup
You need to write output from a select task to multiple sinks in Azure Data Factory.
-conditional split
-new branch
New Branch
You plan to build an event processing solution.
You need to ensure that the solution will support real-time processing and batch processing of events.
Which two services should you include in the solution? Each correct answer presents part of the solution.
- Event Hubs
- Stream Analytics
- Synapse
- Data Factory
Event Hubs and Azure Stream Analytics
You have an Azure Synapse Analytics workspace named workspace1.
You plan to write new data and update existing rows in workspace1.
You create an Azure Synapse Analytics sink to write the processed data to workspace1.
You need to configure the writeBehavior parameter for the sink. The solution must minimize the number of pipelines required.
What should you use?
- Merge
- Insert
- Update
- Upsert
Upsert
You have an Azure Data Lake Storage account named store.dfs.core.windows.net and an Apache Spark notebook named Notebook1.
You plan to use Notebook1 to load and transform data in store.dfs.core.windows.net.
You need to configure the connection string for Notebook1.
Which URI should you use?
-abfss://container@store.dfs.core.windows.net/products.csv
-wasb://container@store.dfs.core.windows.net/products.csv
abfss://container@store.dfs.core.windows.net/products.csv
To access Data Lake Storage from a Spark notebook, use the Azure Blob Filesystem driver (ABFS) and the abfss:// scheme.
The key here is that the account is Azure Data Lake Storage (a dfs endpoint), not Blob Storage, so wasb:// does not apply.
You have an Azure subscription that contains an Azure Synapse Analytics workspace.
You use the workspace to perform ELT activities that can take up to 30 minutes to complete.
You develop an Azure function to stop the compute resources used by Azure Synapse Analytics during periods of zero activity.
You notice that it can take more than 20 minutes for the compute resources to stop.
You need to minimize the time it takes to stop the compute resources. The solution must minimize the impact on running transactions.
How should you change the function?
- Check the sys.dm_operation_status dynamic management view until no transactions are active in the database before stopping the compute resources.
- Add a 20-minute timer
- Set the database to read-only before stopping the compute resources
Check the sys.dm_operation_status dynamic management view until no transactions are active in the database before stopping the compute resources.
Checking sys.dm_operation_status until no transactions are active ensures that any running transaction finishes before the compute nodes are stopped. If you stop a node while a transaction is running, the transaction is rolled back, which can take time.
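A minimal sketch of the check the function could poll before pausing the pool; the columns come from the documented sys.dm_operation_status view, and how you filter on the state is left as an assumption.

-- Poll until no returned operation is still pending or in progress,
-- then it is safe to stop the compute resources.
SELECT operation, state_desc, percent_complete, start_time, last_modify_time
FROM sys.dm_operation_status;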
You have an Azure subscription that contains a Delta Lake solution. The solution contains a table named employees.
You need to view the contents of the employees table from 24 hours ago. You must minimize the time it takes to retrieve the data.
What should you do?
- Time Stamp as Of
- Version as Of
Query the table by using TIMESTAMP AS OF.
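In Spark SQL against Delta Lake, the time-travel query could look like the line below; the literal timestamp is a placeholder standing in for "roughly 24 hours ago".

-- Read the employees table as it looked at the given point in time.
SELECT * FROM employees TIMESTAMP AS OF '2024-05-01 09:00:00';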
You are developing an Azure Databricks solution.
You need to ensure that workloads support PyTorch code. The solution must minimize costs.
Which workload persona should you use?
-Data Engineering
-Machine Learning
-Databricks SQL
Machine Learning
You are writing a data import task in Azure Data Factory.
You need to increase the number of rows per call to the REST sink
What should you change?
- writeBatchSize to 1,000
- writeBatchSize to 10,000
- writeBatchSize to 100,000
writeBatchSize to 100,000
To increase the number of records per batch, we need to increase the writeBatchSize. The default value for this parameter is 10,000, so to increase this we need to use a value that is higher than the default.
You have an Azure Stream Analytics solution that receives data from multiple thermostats in a building.
You need to write a query that returns the average temperature per device every five minutes for readings within that same five minute period.
Which two windowing functions could you use?
Tumbling
Hopping
Sliding
Snapshot
Tumbling Window & Hopping Window
Tumbling windows have a defined period and aggregate all events that fall within that period. A tumbling window is a specific case of a hopping window in which the hop size equals the window size.
Hopping windows have a defined window size and hop forward by a separate, potentially different period, so consecutive windows can overlap.
Sliding windows produce output only when an event enters or exits the window, rather than at fixed intervals.
Snapshot windows aggregate all events with the same timestamp.
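A sketch of the tumbling-window version in Stream Analytics query language; the input, output, and column names (Input, Output, DeviceId, Temperature, EventTime) are assumptions, not from the question.

SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp() AS WindowEnd
INTO Output
FROM Input TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 5);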
You are building a real-time streaming process in Azure Data Factory.
You need to aggregate the data being processed by the stream.
Which stage of the integration pattern should you configure?
Extract
Transform
Load
Transform
You have an Azure Data Factory pipeline named Pipeline1.
You need to ensure that Pipeline1 runs when an email is received.
What should you use to create the trigger?
Azure Data Factory
Azure Logic App
An Azure Logic App
You have an Azure Data Factory pipeline named Pipeline1. Pipeline1 executes many API write operations every time it runs. Pipeline1 is scheduled to run every five minutes.
After executing Pipeline1 10 times, you notice the following entry in the logs.
Type=Microsoft.DataTransfer.Execution.Core.ExecutionException,Message=There are substantial concurrent MappingDataflow executions which is causing failures due to throttling under Integration Runtime 'AutoResolveIntegrationRuntime'.
You need to ensure that you can run Pipeline1 every five minutes.
What should you do?
-Create a new integration runtime and a new Pipeline as a copy of Pipeline1. Configure both pipelines to run every 10 minutes, five minutes apart.
-Change the compute size
-Add a second trigger setting each to run every 10 minute, 5 minutes apart
Create a new integration runtime and a new Pipeline as a copy of Pipeline1. Configure both pipelines to run every 10 minutes, five minutes apart.
The throttling error shows that the AutoResolveIntegrationRuntime is the bottleneck; adding a second integration runtime and splitting the load across two pipelines removes the contention.
You have an Azure Data Factory pipeline named Pipeline1. Pipeline1 includes a data flow activity named Dataflow1. Dataflow1 uses a source named source1. Source1 contains 1.5 million rows.
Dataflow1 takes 20 minutes to complete.
You need to debug Pipeline1. The solution must reduce the number of rows that flow through the activities in Dataflow1.
What should you do?
-Set the filter by last modified setting in source1
-Enable Sampling in source 1
-Enable staging in pipeline 1
-Add a new integration runtime for pipeline1
Enable sampling in source1.
Enabling sampling in source1 allows you to specify how many rows to retrieve.
You are testing a change to an Azure Data Factory pipeline.
You need to check the change into source control without affecting other users’ work in the data factory.
What should you do?
Save the change to a forked branch in the source control project.
You have an Azure Synapse Analytics data pipeline.
You need to run the pipeline at scheduled intervals.
What should you configure?
A Trigger
A Schedule
A Debug Run
A trigger
You are developing an Apache Spark pipeline to transform data from a source to a target.
You need to filter the data in a column named Category where the category is cars.
Which command should you run?
df.select("ProductName", "ListPrice").where(df["Category"] == "Cars")
The correct format chains .where after the select statement and compares the Category column to the value.
You have a database named DB1 and a data warehouse named DW1.
You need to ensure that all changes to DB1 are stored in DW1.The solution must capture the new value and the existing value and store each value as a new record.
What should you include in the solution?
-Change Data Capture
-Change Data Feed
-Change Tracking
change data capture
The key here is that the solution must CAPTURE the new value, not just identify the change.
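For reference, enabling change data capture on the source looks roughly like this; the schema and table names are placeholders.

-- Enable CDC at the database level, then per table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;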
You have a database named DB1 and a data warehouse named DW1.
You need to ensure that all changes to DB1 are stored in DW1. The solution must meet the following requirements:
Identify that a row has changed, but not the final value of the row.
Minimize the performance impact on the source system.
What should you include in the solution?
Change Tracking
Change tracking captures the fact that a row was changed without tracking the data that was changed. Change tracking requires less server resources than change data capture.
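A minimal sketch of enabling change tracking; DB1 comes from the question, while the table name is a placeholder.

-- Enable change tracking on the database, then on each table to track.
ALTER DATABASE DB1
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Orders
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);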
You plan to configure an Azure Stream Analytics job named Job1.
You need to identify which components Job1 requires to perform event processing and analyze streaming data.
Which three components should you identify? Each correct answer presents part of the solution.
A Query
An Input
An Output
You have an Azure subscription that contains an Azure Stream Analytics solution.
You need to write a query that calculates the average rainfall per hour. The solution must segment the data stream into a contiguous series of fixed-size, non-overlapping time segments.
Which windowing function should you use?
-hopping
-tumbling
-sliding
-snapshot
tumbling - tumbling window functions segment a data stream into a contiguous series of fixed-size, non-overlapping time segments.
You use an Azure Databricks pipeline to process a stateful streaming operation.
You need to reduce the amount of state data to improve latency during a long-running streaming operation.
What should you use in the streaming DataFrame?
-Watermark
-Partition
Watermarks - Watermarks interact with output modes to control when data is written to the sink. Because watermarks reduce the total amount of state information to be processed, effective use of watermarks is essential for efficient stateful streaming throughput.
You have an Azure Data Factory pipeline that includes two activities named Act1 and Act2. Act1 and Act2 run in parallel.
You need to ensure that Act2 will only run once Act1 completes.
Which dependency condition should you configure?
-Failure on Act 1
-Success on Act 1
-Completed on Act 1
Completed
You need to implement encryption at rest by using transparent data encryption (TDE).
You implement a master key.
What should you do next?
- Create a certificate that is protected by the master key
- Backup the Master Database
- Create Data Encryption
Create a certificate that is protected by the master key.
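Roughly how that sequence looks in T-SQL; the certificate, key, and database names are placeholders, and Azure SQL Database normally uses service-managed TDE, so this certificate-based flow applies where you manage the keys yourself.

-- In master: the master key, then a certificate protected by it.
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';

-- In the user database: an encryption key protected by the certificate, then enable TDE.
USE DB1;
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TdeCert;

ALTER DATABASE DB1 SET ENCRYPTION ON;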
You are implementing an application that queries a table named Purchase in an Azure Synapse Analytics Dedicated SQL pool.
The application must show data only for the currently signed-in user.
You use row-level security (RLS), implement a security policy, and implement a function that uses a filter predicate.
Users in the marketing department report that they cannot see their data.
What should you do to ensure that the marketing department users can see their data?
- Add a blocking predicate
- Grant SELECT permissions on the purchase table to all marketing users
- Grant SELECT permission to the function
- Rebuild the Schema
Grant the SELECT permission on the Purchase table to the Marketing users
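The fix is just a permission grant, sketched below; the marketing role name is a placeholder, and the existing RLS filter predicate then trims the rows each user can see.

-- RLS filters rows per user, but users still need SELECT on the table itself.
GRANT SELECT ON dbo.Purchase TO MarketingRole;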
You use Azure Data Factory to connect to a notebook that runs in an Azure Databricks cluster. The connection is set to use access tokens.
You need to revoke a user’s token.
What should you use?
-Token Management API 2.0
-Admin Console
-Conditional Access Restriction Policies
-IAM permission Adjustments
Token Management API 2.0
Because the connection is set to use access tokens (not IAM), the Token Management API 2.0 is what revokes a user's token.
You have an Azure subscription that contains the following resources:
An Azure Synapse Analytics workspace named workspace1
A virtual network named VNet1 that has two subnets named sn1 and sn2
Five virtual machines that are connected to sn1
You need to ensure that the virtual machines can connect to workspace1.
The solution must prevent traffic from the virtual machines to workspace1 from traversing the public internet.
What should you create?
- Network Peering
- Application Gateway
- Private Endpoint
- Service endpoint
Private Endpoint
You have an Azure Synapse Analytics workspace.
You need to measure the performance of SQL queries running on the dedicated SQL pool.
Which two actions achieve the goal? Each correct answer presents a complete solution
- From the Monitor page of Azure Synapse Studio, review the SQL requests tab.
- Query the sys.dm_pdw_exec_requests view.
- Add a Service endpoints
- Configure Network peering,
- From the Monitor page of Azure Synapse Studio, review the SQL requests tab.
- Query the sys.dm_pdw_exec_requests view.
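A minimal example of the DMV approach; the chosen columns are common ones from sys.dm_pdw_exec_requests.

-- Longest-running requests in the dedicated SQL pool.
SELECT TOP 10 request_id, status, submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;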
You have an Azure Synapse Analytics workspace.
You need to configure the diagnostics settings for pipeline runs. You must retain the data for auditing purposes indefinitely and minimize costs associated with retaining the data.
Which destination should you use?
- Azure Monitor Log
- Archive to a storage account
- Send the Data to a Data Partner
Archive to a storage account.