Exam questions Flashcards

1
Q

You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.

CREATE TABLE [dbo].[DimEmployee] (
    [EmployeeKey] [int] IDENTITY(1,1) NOT NULL,
    [EmployeeID] [int] NOT NULL,
    [FirstName] [varchar](100) NOT NULL,
    [LastName] [varchar](100) NOT NULL,
    [JobTitle] [varchar](100) NULL,
    [LastHireDate] [date] NULL,
    [StreetAddress] [varchar](500) NOT NULL,
    [City] [varchar](200) NOT NULL,
    [StateProvince] [varchar](50) NOT NULL,
    [PostalCode] [varchar](10) NOT NULL
)

You need to alter the table to meet the following requirements:
✑ Ensure that users can identify the current manager of employees.
✑ Support creating an employee reporting hierarchy for your entire company.
✑ Provide fast lookup of the managers’ attributes such as name and job title.
Which column should you add to the table?

A. [ManagerEmployeeID] [smallint] NULL
B. [ManagerEmployeeKey] [smallint] NULL
C. [ManagerEmployeeKey] [int] NULL
D. [ManagerName] varchar NULL

A

Correct Answer: C
We need an extra column to identify the manager. It should use the same data type as the EmployeeKey column, an int. C is correct because the manager reference must match the data type of the table's surrogate key (EmployeeKey), which also supports a fast lookup of the manager's attributes such as name and job title.
Reference:
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular

2
Q

You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.

CREATE TABLE mytestdb.myParquetTable(
    EmployeeID int,
    EmployeeName string,
    EmployeeStartDate date)
USING Parquet

You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.

EmployeeName | EmployeeStartDate | EmployeeID
Alice        | 2020-01-25        | 24

One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.

SELECT EmployeeID
FROM mytestdb.dbo.myParquetTable
WHERE EmployeeName = 'Alice';

What will be returned by the query?
A. 24
B. an error
C. a null value

A

I did a test: I waited one minute and ran the query in a serverless SQL pool, and it returned 24. I don't understand why B has been voted so heavily, because the answer is A (24) without a doubt.

The debate on B is as follows:
Answer is B, but not because of the lowercase. The case has nothing to do with the error.
If you look attentively, you will notice that we create table mytestdb.myParquetTable, but the select statement contains the reference to table mytestdb.dbo.myParquetTable (!!! - dbo).
Here is the error message I got:
Error: spark_catalog requires a single-part namespace, but got [mytestdb, dbo].

I just tried to run the commands, and the error you got is because you ran the query through a Spark pool. I did that as a test and got the exact same error. When you query the data through a Spark pool, you do not use the ".dbo" reference; that reference only works when you query from a Synapse serverless SQL pool.
So the correct answer is A!

3
Q

HARD
DRAG DROP -
You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36 months and has the following characteristics:
✑ Is partitioned by month
✑ Contains one billion rows
✑ Has clustered columnstore index

At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.

Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:

Actions

  • Switch the partition containing the stale data from SalesFact to SalesFact_Work.
  • Truncate the partition containing the stale data.
  • Drop the SalesFact_Work table.
  • Create an empty table named SalesFact_Work that has the same schema as SalesFact.
  • Execute a DELETE statement where the value in the Date column is more than 36 months ago.
  • Copy the data to a new table by using CREATE TABLE AS SELECT (CTAS).
A

Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.
SQL Data Warehouse (dedicated SQL pool) supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the partitions align on their respective boundaries and that the table definitions match.
Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch the new data in.
Step 3: Drop the SalesFact_Work table.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
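A minimal T-SQL sketch of the three steps (table, column, and boundary names are illustrative; the work table's distribution, index, and partition boundaries must match SalesFact exactly):
~~~
-- Step 1: empty work table with a matching schema, distribution, index, and partition scheme (CTAS with WHERE 1 = 2)
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH(ProductKey),   -- illustrative; must match SalesFact
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (DateKey RANGE RIGHT FOR VALUES (20190101, 20190201 /* ...same monthly boundaries as SalesFact... */))
)
AS SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: metadata-only switch of the stale partition out of SalesFact
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;

-- Step 3: drop the work table, and the stale data with it
DROP TABLE dbo.SalesFact_Work;
~~~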

4
Q

You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.

/topfolder/
    /File1.csv
    /folder1/File2.csv
    /folder2/File3.csv
    /File4.csv

You create an external table named ExtTable that has LOCATION=’/topfolder/’.
When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?
A. File2.csv and File3.csv only
B. File1.csv and File4.csv only
C. File1.csv, File2.csv, File3.csv, and File4.csv
D. File1.csv only

A

I believe the answer should be B.
In case of a serverless pool a wildcard should be added to the location.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table

“Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path.”
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files

5
Q

HOTSPOT -

You are planning the deployment of Azure Data Lake Storage Gen2.
You have the following two reports that will access the data lake:
✑ Report1: Reads three columns from a file that contains 50 columns.
✑ Report2: Queries a single record based on a timestamp.
You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.
What should you recommend for each report? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
Report1:
* Avro
* CSV
* Parquet
* TSV
Report2:
* Avro
* CSV
* Parquet
* TSV
~~~

A

Report1: Parquet, a column-oriented binary file format, so reading three columns out of 50 is cheap.
Report2: Avro, a row-based format that has a logical timestamp type, so fetching a single record by timestamp is efficient.
https://youtu.be/UrWthx8T3UY

6
Q

You are designing the folder structure for an Azure Data Lake Storage Gen2 container.
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current year or current month.

Which folder structure should you recommend to support fast queries and simplified folder security?

A. /{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}{YYYY}{MM}{DD}.csv
B. /{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
C. /{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}{DD}.csv

A

Correct Answer: D
There’s an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went on.
Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices, organizations, and customers. It’s important to pre-plan the directory layout for organization, security, and efficient processing of the data for down-stream consumers. A general template to consider might be the following layout:
{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

Serverless SQL pools offer a straightforward method of querying data, including CSV, JSON, and Parquet formats, stored in Azure Storage.

So, setting up the CSV files within Azure Storage in a hive-formatted folder hierarchy, i.e. /{yyyy}/{mm}/{dd}/, helps SQL query the data much faster, since only the partitioned segment of the data is scanned.

7
Q

HOTSPOT -
You need to output files from Azure Data Factory.
Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:
~~~
Columnar format:
* Avro
* GZip
* Parquet
* TXT

JSON with a timestamp:
* Avro
* GZip
* Parquet
* TXT
~~~

A

Box 1: Parquet -
Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.

Box 2: Avro -
An Avro schema is created using JSON format.
AVRO supports timestamps.
Note: Azure Data Factory supports the following file formats (not GZip or TXT):
✑ Avro format
✑ Binary format
✑ Delimited text format
✑ Excel format
✑ JSON format
✑ ORC format
✑ Parquet format
✑ XML format
Reference:
https://www.datanami.com/2018/05/16/big-data-file-formats-demystified

8
Q

HOTSPOT -
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.

Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.

You need to move the files to a different folder and transform the data to meet the following requirements:
✑ Provide the fastest possible query times.
✑ Automatically infer the schema from the underlying files.

How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Copy behavior:
* Flatten hierarchy
* Merge files
* Preserve hierarchy

Sink file type:
* CSV
* JSON
* Parquet
* TXT
A

1. Merge Files
2. Parquet

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance

Larger files lead to better performance and reduced costs.

Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size).

9
Q

Hard
HOTSPOT -
You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit.
see site for img of this
Dim_Employee
* iEmployeeID
* vcEmployeeLastName
* vcEmployeeMName
* vcEmployeeFirstName
* dtEmployeeHireDate
* dtEmployeeLevel
* dtEmployeeLastPromotion
Fact_DailyBookings
* iDailyBookingsID
* iCustomerID
* iTimeID
* iEmployeeID
* iItemID
* iQuantityOrdered
* dExchangeRate
* iCountryofOrigin
* mUnitPrice
Dim_Customer
* iCustomerID
* vcCustomerName
* vcCustomerAddress1
* vcCustomerCity
Dim_Time
* iTimeID
* iCalendarDay
* iCalendarWeek
* iCalendarMonth
* vcDayofWeek
* vcDayofMonth
* vcDayofYear
* iHolidayIndicator

All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be relatively static with very few data inserts and updates.

Which type of table should you use for each table? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Dim_Customer:
* Hash distributed
* Round-robin
* Replicated

Dim_Employee:
* Hash distributed
* Round-robin
* Replicated

Dim_Time:
* Hash distributed
* Round-robin
* Replicated

Fact_DailyBookings:
* Hash distributed
* Round-robin
* Replicated
A

Box 1: Replicated -
Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently implemented as round-robin to replicated.

Box 2: Replicated -

Box 3: Replicated -

Box 4: Hash-distributed -
For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Reference:
https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/

The answer is correct.
The dimensions are under 2 GB, so there is no point in using hash distribution for them.

Common distribution methods for tables:

The table category often determines which option to choose for distributing the table.
Table category | Recommended distribution option
Fact           | Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Dimension      | Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Staging        | Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-methods-for-tables
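A minimal T-SQL sketch of these recommendations applied to this model (column lists are abbreviated from the exhibit; the choice of hash column is illustrative, and only the WITH clauses matter here):
~~~
-- Small, fairly static dimension (< 2 GB): replicate a full copy to every Compute node
CREATE TABLE dbo.Dim_Customer
(
    iCustomerID    int NOT NULL,
    vcCustomerName varchar(200) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Large fact table (~6 TB): hash-distribute on a high-cardinality join key
CREATE TABLE dbo.Fact_DailyBookings
(
    iDailyBookingsID bigint NOT NULL,
    iCustomerID      int NOT NULL,
    iTimeID          int NOT NULL,
    mUnitPrice       money NOT NULL
)
WITH (DISTRIBUTION = HASH(iCustomerID), CLUSTERED COLUMNSTORE INDEX);
~~~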

10
Q

SIMILAR TO ANOTHER QUESTION BUT SAME ANSWERS

HOTSPOT -
You have an Azure Data Lake Storage Gen2 container.
Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read files in the container but cannot modify the files.
You need to design a data archiving solution that meets the following requirements:
✑ New data is accessed frequently and must be available as quickly as possible.
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:
~~~
Five-year-old data:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.

Seven-year-old data:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.
~~~

A

Box 1: Move to cool storage -

Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.

11
Q

link
DRAG DROP -
You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* CLUSTERED INDEX
* COLLATE
* DISTRIBUTION
* PARTITION
* PARTITION FUNCTION
* PARTITION SCHEME

Answer Area
~~~
CREATE TABLE table1
(
ID INTEGER,
col1 VARCHAR(10),
col2 VARCHAR (10)
) WITH

<XXXXXXXXXXXX> = HASH (ID) ,
<YYYYYYYYYYYYY> (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000))
~~~

A

Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH(distribution_column_name), which assigns each row to one distribution by hashing the value stored in distribution_column_name.

Box 2: PARTITION -
Table partition options. Syntax:
PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,…n] ] ))
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
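For reference, the completed statement would look roughly like this (a sketch; note that the WITH options are wrapped in parentheses):
~~~
CREATE TABLE table1
(
    ID INTEGER,
    col1 VARCHAR(10),
    col2 VARCHAR(10)
)
WITH
(
    DISTRIBUTION = HASH (ID),
    PARTITION (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000))
);
~~~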

12
Q

You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:
✑ Can return an employee record from a given point in time.
✑ Maintains the latest employee information.
✑ Minimizes query complexity.
How should you model the employee data?
A. as a temporal table
B. as a SQL graph table
C. as a degenerate dimension table
D. as a Type 2 slowly changing dimension (SCD) table

A

Correct Answer: D
A Type 2 SCD supports versioning of dimension members. Often the source system doesn’t store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example,
IsCurrent) to easily filter by current dimension members.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types

13
Q

Hard
You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named VNET1.
You are building a SQL pool in Azure Synapse that will use data from the data lake.
Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used to assign the
Sales group access to the files in the data lake.
You plan to load data to the SQL pool every hour.
You need to ensure that the SQL pool can load the sales data from the data lake.
Which three actions should you perform? Each correct answer presents part of the solution.
NOTE: Each area selection is worth one point.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
C. Create a shared access signature (SAS).
D. Add your Azure Active Directory (Azure AD) account to the Sales group.
E. Use the shared access signature (SAS) as the credentials for the data load process.
F. Create a managed identity.

A

F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.

The managed identity grants permissions to the dedicated SQL pools in the workspace.
Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically managed identity in Azure AD.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity

14
Q

VIEW WEBSITE FOR IMGs
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.
[MISSING STUFF]
User1 executes a query on the database, and the query returns the results shown in the following exhibit.
[MISSING STUFF]
User1 is the only user who has access to the unmasked data.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:

When User2 queries the YearlyIncome column,
the values returned will be [answer choice].
* a random number
* the values stored in the database
* XXXX
* 0

When User1 queries the BirthDate column, the
values returned will be [answer choice].
* a random date
* the values stored in the database
* xxxX
* 1900-01-01
A

Box 1: 0 -
The YearlyIncome column is of the money data type.
The Default masking function: Full masking according to the data types of the designated fields
✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
Box 2: the values stored in the database
Users with administrator privileges are always excluded from masking, and see the original data without any mask.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
* Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
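For context, a minimal sketch of how such a mask is applied and how a user is excluded from masking (table, column, and user names are illustrative; default() returns 0 for numeric columns and 1900-01-01 for date columns):
~~~
-- Non-privileged users will see 0 instead of the real YearlyIncome values
ALTER TABLE dbo.DimCustomer
ALTER COLUMN YearlyIncome ADD MASKED WITH (FUNCTION = 'default()');

-- Optionally allow a specific user to see unmasked values (admins are excluded from masking by default)
GRANT UNMASK TO User1;
~~~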

15
Q

You have an enterprise data warehouse in Azure Synapse Analytics.

Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing the data to the data warehouse.

The external table has three columns.

You discover that the Parquet files have a fourth column named ItemID.

Which command should you run to add the ItemID column to the external table?

A.

ALTER EXTERNAL TABLE [Ext].[Items]
ADD [ItemID] int;

B.

DROP EXTERNAL FILE FORMAT parquetfile1;
CREATE EXTERNAL FILE FORMAT parquetfile1
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

C.

DROP EXTERNAL TABLE [Ext].[Items]
CREATE EXTERNAL TABLE [Ext].[Items] (
    [ItemID] [int] NULL,
    [ItemName] nvarchar(50) NULL,
    [ItemType] nvarchar(20) NULL,
    [ItemDescription] nvarchar(250)
)
WITH (
    LOCATION = '/Items/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = PARQUET,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

D.

ALTER TABLE [Ext].[Items]
ADD [ItemID] int;
A

C is correct, since “altering the schema or format of an external SQL table is not supported”.
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/external-sql-tables

16
Q

HOTSPOT -
You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace enabled. The system has files that contain data stored in the Apache Parquet format.
You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following requirements:
✑ No transformations must be performed.
✑ The original folder structure must be retained.
✑ Minimize time required to perform the copy activity.
How should you configure the copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Source dataset type:
* Binary
* Parquet
* Delimited text 

Copy activity copy behavior:
* FlattenHierarchy
* MergeFiles
* PreserveHierarchy
A

Box 1: Parquet -
For Parquet datasets, the type property of the copy activity source must be set to ParquetSource.

Box 2: PreserveHierarchy -
PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
Incorrect Answers:
✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.
✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it’s an autogenerated file name.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage

The answer seems correct: the data is already stored as Parquet and the requirement is to perform no transformation, so Parquet with PreserveHierarchy is right.

17
Q

You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.
You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region. The solution must minimize costs.
Which type of data redundancy should you use?
A. geo-redundant storage (GRS)
B. read-access geo-redundant storage (RA-GRS)
C. zone-redundant storage (ZRS)
D. locally-redundant storage (LRS)

A

B is right
Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region becomes unavailable.

18
Q

You plan to implement an Azure Data Lake Gen 2 storage account.
You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs.
Which type of replication should you use for the storage account?
A. geo-redundant storage (GRS)
B. geo-zone-redundant storage (GZRS)
C. locally-redundant storage (LRS)
D. zone-redundant storage (ZRS)

A

First, about the Question:
What fails? -> The (complete) DataCenter, not the region and not components inside a DataCenter.

So, what helps us in this situation?
LRS: “..copies your data synchronously three times within a single physical location in the primary region.” Important is here the SINGLE PHYSICAL LOCATION (meaning inside the same Data Center. So in our scenario all copies wouldn’t work anymore.)
-> C is wrong.
ZRS: “…copies your data synchronously across three Azure availability zones in the primary region” (meaning, in different Data Centers. In our scenario this would meet the requirements)
-> D is right
GRS/GZRS: are like LRS/ZRS but with the Data Centers in different azure regions. This works too but is more expensive than ZRS. So ZRS is the right answer.

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy

19
Q

Hard
HOTSPOT -
You have a SQL pool in Azure Synapse.
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.
How should you configure the table? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.

Hot Area:

Distribution:
* Hash
* Replicated
* Round-robin

Indexing:
* Clustered
* Clustered columnstore
* Heap

Partitioning:
* Date
* None
A

Distribution: Round-Robin
Indexing: Heap
Partitioning: None

Round-robin - this is the simplest distribution model, not great for querying but fast to process
Heap - no brainer when creating staging tables
No partitions - this is a staging table, why add effort to partition, when truncated daily?
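A minimal sketch of such a staging table (columns are illustrative; the point is the ROUND_ROBIN distribution and HEAP, with no partitions):
~~~
CREATE TABLE stg.DailyLoad
(
    ID       int NOT NULL,
    LoadDate date NOT NULL,
    Amount   decimal(18, 2) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
~~~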

20
Q

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.

| Name             | Data type    | Nullable |
|------------------|--------------|----------|
| PurchaseKey      | Bigint       | No       |
| DateKey          | Int          | No       |
| SupplierKey      | Int          | No       |
| StockItemKey     | Int          | No       |
| PurchaseOrderID  | Int          | No       |
| OrderedQuantity  | Int          | Yes      |
| OrderedOuters    | Int          | No       |
| ReceivedOuters   | Int          | No       |
| Package          | Nvarchar(50) | No       |
| IsOrderFinalized | Bit          | No       |
| LineageKey       | Int          | No       |

FactPurchase will have 1 million rows of data added daily and will contain three years of data.

Transact-SQL queries similar to the following query will be executed daily.

SELECT SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
  AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey, IsOrderFinalized

Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on IsOrderFinalized

A

Correct Answer: B
Hash-distributed tables improve query performance on large fact tables.
To balance the parallel processing, select a distribution column that:
✑ Has many unique values. The column can have duplicate values; all rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can get more than one unique value while others may end up with zero values.
✑ Does not have NULLs, or has only a few NULLs.
✑ Is not a date column.
Incorrect Answers:
C: Round-robin tables are useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Is it hash-distributed on PurchaseKey and not on IsOrderFinalized because IsOrderFinalized yields far fewer distinct values (rows contain only yes/no values) than PurchaseKey, which would skew the distributions?
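That is the idea: a bit column has only two values, so it would place all rows into at most two distributions. Expressed as DDL, the recommended choice looks roughly like this (a sketch with an abbreviated column list from the question):
~~~
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      bigint NOT NULL,
    DateKey          int NOT NULL,
    SupplierKey      int NOT NULL,
    StockItemKey     int NOT NULL,
    IsOrderFinalized bit NOT NULL
)
WITH (DISTRIBUTION = HASH(PurchaseKey), CLUSTERED COLUMNSTORE INDEX);
~~~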

21
Q

Hard

HOTSPOT -
From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video plays.

The data contains the following columns and sample values.

Name              | Sample value
------------------|--------------------
EventCategory     | Videos
EventAction       | Play
EventLabel        | Contoso Promotional
ChannelGrouping   | Social
TotalEvents       | 150
UniqueEvents      | 120
SessionWithEvents | 99
Date              | 15 Jan 2021

You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension.

To which table should you add each column? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

EventCategory:
* DimChannel
* DimDate
* DimEvent
* FactEvents

ChannelGrouping:
* DimChannel
* DimDate
* DimEvent
* FactEvents

TotalEvents:
* DimChannel
* DimDate
* DimEvent
* FactEvents
A

Box 1: DimEvent -

Box 2: DimChannel -

Box 3: FactEvents -
Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc
Reference:
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema

22
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.
Solution: You convert the files to compressed delimited text files.
Does this meet the goal?

A. Yes
B. No

A

The answer is A
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
Compression not only helps to reduce the size or space occupied by a file in storage but also increases the speed of file movement during transfer.

23
Q

Hard
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.

You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.

Solution: You copy the files to a table that has a columnstore index.
Does this meet the goal?

A. Yes
B. No

A

Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

From the documentation, loads to a heap table are faster than loads to indexed tables. So it is better to use a heap table than a columnstore index table in this case.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables

24
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is more than 1 MB.
Does this meet the goal?

A. Yes
B. No

A

Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

No, rows need to be less than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.

25
Q

You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.

Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily.

You need to implement a solution to make the dataset available for the reports. The solution must minimize query times.
What should you implement?

A. an ordered clustered columnstore index
B. a materialized view
C. result set caching
D. a replicated table

A

This is the only time materialized view appears in the question set.
B is correct.
Materialized view and result set caching

These two features in dedicated SQL pool are used for query performance tuning. Result set caching is used for getting high concurrency and fast response from repetitive queries against static data.

To use the cached result, the form of the cache requesting query must match with the query that produced the cache. In addition, the cached result must apply to the entire query.

Materialized views allow data changes in the base tables. Data in materialized views can be applied to a piece of a query. This support allows the same materialized views to be used by different queries that share some computation for faster performance.
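A hedged sketch of what such a materialized view could look like in a dedicated SQL pool (table, column, and view names are illustrative; dedicated SQL pool materialized views require an aggregate in the SELECT list and a distribution option):
~~~
CREATE MATERIALIZED VIEW dbo.mvInventorySummary
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT p.ProductKey,
       s.WarehouseKey,
       SUM(s.Quantity) AS TotalQuantity,
       COUNT_BIG(*)    AS RowCountBig
FROM dbo.FactInventory AS s
JOIN dbo.DimProduct    AS p ON p.ProductKey = s.ProductKey
GROUP BY p.ProductKey, s.WarehouseKey;
~~~
The daily report queries, with their additional WHERE parameters, can then be answered from the pre-computed view, which the optimizer can pick up automatically.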

26
Q

You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.

You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.

Which format should you use for the tables in DB1?
A. CSV
B. ORC
C. JSON
D. Parquet

A

Correct Answer: D
Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each database existing in serverless Apache Spark pools.
For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool database.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables

So both A (CSV) and D (Parquet) are synchronized automatically, but Parquet is faster, so D.

27
Q

HARD HARD HARD
You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java.
Which service should you recommend using to process the streaming data?
A. Azure Event Hubs
B. Azure Data Factory
C. Azure Stream Analytics
D. Azure Databricks

A

Correct Answer: D
Azure Databricks, because the question is clearly asking for Java support in the stream processing solution.
Reference:
[SEE SITE/REFERENCE FOR CONTEXT]
https://www.examtopics.com/exams/microsoft/dp-203/view/3/
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing

28
Q

hard - not a clear answer

You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour.

File sizes range from 4 KB to 5 GB.

You need to ensure that the files stored in the container are optimized for batch processing.

What should you do?

A. Convert the files to JSON
B. Convert the files to Avro
C. Compress the files
D. Merge the files

A

Debated quite heavily, see site. Go with the following:

Selected Answer: D
D. Merge the files
To optimize the files stored in the Azure Data Lake Storage Gen2 container for batch processing, you should merge the files. Merging smaller files into larger files is a common optimization technique in data processing scenarios.

Having a large number of small files can introduce overhead in terms of file management, metadata processing, and data scanning. By merging the smaller files into larger files, you can reduce this overhead and improve the efficiency of batch processing operations.

Merging the files is especially beneficial when dealing with varying file sizes, as it helps to create a more balanced distribution of data across the files and reduces the impact of small files on processing performance.

Therefore, in this scenario, merging the files would be the recommended approach to optimize the files for batch processing.

Correct Answer: B

Avro supports batch and is very relevant for streaming.
Note: Avro is a framework developed within Apache's Hadoop project. It is a row-based storage format which is widely used for serialization. Avro stores its schema in JSON format, making it easy to read and interpret by any program. The data itself is stored in binary format, making it compact and efficient.
Reference:
https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/

You cannot merge the files if you don't know how many files exist in ADLS Gen2. In this case, you could easily create a file larger than 100 GB and decrease performance, so B is the correct answer: convert to Avro.

29
Q

HOTSPOT -
You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the following exhibit. SEE SITE FOR THIS
~~~
{
  "rules": [
    {
      "enabled": true,
      "name": "contosorule",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "version": {
            "delete": {
              "daysAfterCreationGreaterThan": 60
            }
          },
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            }
          }
        },
        "filters": {
          "blobTypes": [
            "blockBlob"
          ],
          "prefixMatch": [
            "container1/contoso"
          ]
        }
      }
    }
  ]
}
~~~
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.

Hot Area:
~~~
The files are [answer choice] after 30 days:
* deleted from the container
* moved to archive storage
* moved to cool storage
* moved to hot storage
The storage policy applies to [answer choice]:
* container1/contoso.csv
* container1/docs/contoso.json
* container1/mycontoso/contoso.csv
~~~

A

Box 1: moved to cool storage -
The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage. Support blobs currently at Hot tier.

Box 2: container1/contoso.csv -
As defined by prefixMatch.
prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-sensitive prefixes. A prefix string must start with a container name.
Reference:
https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managementpolicybaseblob.tiertocool

30
Q

You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:
✑ TransactionType: 40 million rows per transaction type
✑ CustomerSegment: 4 million rows per customer segment
✑ TransactionMonth: 65 million rows per month
✑ AccountType: 500 million rows per account type

You have the following query requirements:
✑ Analysts will most commonly analyze transactions for a given month.
✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type
You need to recommend a partition strategy for the table to minimize query times.
On which column should you recommend partitioning the table?
A. CustomerSegment
B. AccountType
C. TransactionType
D. TransactionMonth

A

Correct Answer: D
For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases.
Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.

Select D because analysts will most commonly analyze transactions for a given month.
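As a quick sanity check on the numbers given above: partitioning by TransactionMonth yields roughly 65,000,000 rows per month spread over 60 distributions, or about 1.08 million rows per distribution per partition, which just meets the 1-million-row guideline for clustered columnstore indexes.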

31
Q

HOTSPOT -
You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following table.

link

You do not expect that the logs will be accessed during the retention periods.
You need to recommend a solution for account1 that meets the following requirements:
✑ Automatically deletes the logs at the end of each retention period
✑ Minimizes storage costs

What should you include in the recommendation? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

To minimize storage costs:
* Store the infrastructure logs and the application logs in the Archive access tier
* Store the infrastructure logs and the application logs in the Cool access tier
* Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier

To delete logs automatically:
* Azure Data Factory pipelines
* Azure Blob storage lifecycle management rules
* Immutable Azure Blob storage time-based retention policies

A

Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier

“Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive tier and then deleted or moved to the Hot tier after 45 days, you’ll be charged an early deletion fee equivalent to 135 (180 minus 45) days of storing that blob in the Archive tier.”

For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should be stored for a minimum of 30 days. The cool tier has lower storage costs and higher access costs compared to the hot tier.
For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of hours.
Data in the archive tier should be stored for a minimum of 180 days.
Box 2: Azure Blob storage lifecycle management rules
Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the desired access tier when your specified conditions are met. You can also use lifecycle management to expire data at the end of its life.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview

DISCUSSION
The early deletion charge quoted above explains why we have to use two different access tiers rather than putting both sets of logs in the Archive tier.

32
Q

You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.

You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.
What should you recommend?
A. JSON
B. Parquet
C. CSV
D. Avro

A

Correct Answer: B
Need Parquet to support both Databricks and PolyBase.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql

Avro schema definitions are JSON records. PolyBase does not support JSON, so it would not support Avro either. A CSV does not retain the schema, because everything is treated as a string. So only Parquet is left to choose.

33
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions.

You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times.

What should you do?
A. Insert the data from stg.Sales into dbo.Sales.
B. Switch the first partition from dbo.Sales to stg.Sales.
C. Switch the first partition from stg.Sales to dbo.Sales.
D. Update dbo.Sales from stg.Sales.

A

DEBATED
* this must be C. since the need is to overwrite dbo.Sales with the content of stg.Sales.
SWITCH source TO target
* This is quite a weird situation because according to Microsoft documentation: “When reassigning a table’s data as a partition to an already-existing partitioned table, or switching a partition from one partitioned table to another, the target partition must exist and it MUST BE EMPTY.” (https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-table-transact-sql?view=azure-sqldw-latest&preserve-view=true#switch–partition-source_partition_number_expression–to–schema_name–target_table–partition-target_partition_number_expression-) Therefore none of the options would be possible if considering that both tables are not empty on that partition. Then I have no idea what would be the correct answer, although I answered C.

Exam Topics pick
Correct Answer: B
A way to eliminate rollbacks is to use metadata-only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly. Then you can switch out the partition with data for an empty partition from another table.
Note: Syntax:
SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ]
Switches a block of data in one of the following ways:
✑ Reassigns all data of a table as a partition to an already-existing partitioned table.
✑ Switches a partition from one partitioned table to another.
✑ Reassigns all data in one partition of a partitioned table to an existing non-partitioned table.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
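A hedged sketch of the switch from stg.Sales into dbo.Sales (if I recall the dedicated SQL pool syntax correctly, SWITCH also accepts a TRUNCATE_TARGET option that empties the target partition during the switch, which addresses the "target must be empty" concern raised above):
~~~
-- Overwrite partition 1 of dbo.Sales with the staged data from stg.Sales
ALTER TABLE stg.Sales SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON);
~~~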

34
Q

Hard
You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.

You plan to keep a record of changes to the available fields.

The supplier data contains the following columns.

Name                   - Description
-----------------------|------------------------------------------------------
SupplierSystemID       - Unique supplier ID in an enterprise resource planning 
                        (ERP) system
SupplierName           - Name of the supplier company
SupplierAddress1       - Address of the supplier company
SupplierAddress2       - Second address of the supplier company
                        (if applicable)
SupplierCity           - City of the supplier company
SupplierStateProvince  - State or province of the supplier company
SupplierCountry        - Country of the supplier company
SupplierPostalCode     - Postal code of the supplier company
SupplierDescription    - Free-text description of the supplier company
SupplierCategory       - Category of goods provided by the supplier company

Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. surrogate primary key
B. effective start date
C. business key
D. last modified date
E. effective end date
F. foreign key

A

DEBATED, but pretty confident in the following:

The answer is ABE. A type 2 SCD requires a surrogate key to uniquely identify each record when versioning.

See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types under SCD Type 2 “ the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member.”

A business key is already part of this table - SupplierSystemID. The column is derived from the source data.

Exam Topics Answer
Correct Answer: BCE
C: The Slowly Changing Dimension transformation requires at least one business key column.
BE: Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output.
Reference:
https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation
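A minimal sketch of the dimension with the three added columns (column names are illustrative; distribution and index options are omitted for brevity):
~~~
CREATE TABLE dbo.DimSupplier
(
    SupplierKey        int IDENTITY(1,1) NOT NULL,  -- added: surrogate primary key (A)
    SupplierSystemID   int NOT NULL,                -- existing business key from the ERP system
    SupplierName       nvarchar(100) NOT NULL,
    SupplierCategory   nvarchar(50) NULL,
    EffectiveStartDate date NOT NULL,               -- added: effective start date (B)
    EffectiveEndDate   date NULL                    -- added: effective end date (E); NULL for the current version
);
~~~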

35
Q

hard
HOTSPOT -
You have a Microsoft SQL Server database that uses a third normal form schema.

You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool

You need to design the dimension tables. The solution must optimize read operations.

What should you include in the solution? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Transform data for the dimension tables by:

  • Maintaining to a third normal form
  • Normalizing to a fourth normal form
  • Denormalizing to a second normal form

For the primary key columns in the dimension tables, use:

  • New IDENTITY columns
  • A new computed column
  • The business key column from the source systems
A

Box 1: Denormalize to a second normal form
Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join of higher normal form relations as a base relation.
Denormalization increases the performance of data retrieval at the cost of introducing update anomalies to a database.

Box 2: New identity columns -
The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique over time.

Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.
Reference:
https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/ https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
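A short sketch of an IDENTITY-based surrogate key on a denormalized dimension (columns are illustrative; distribution and index options are omitted):
~~~
CREATE TABLE dbo.DimProduct
(
    ProductKey    int IDENTITY(1,1) NOT NULL,  -- new surrogate key generated by the pool
    ProductNumber nvarchar(25) NOT NULL,       -- business key carried over from the source system
    ProductName   nvarchar(100) NOT NULL
);
~~~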

36
Q

HOTSPOT -
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day

You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.
How should you complete the code?

To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

df.write
* . bucketBy
* . partitionBy
* . range
* . sortBy

  • (“*”)
  • (“StoreID”, “Hour”)
  • (“StoreID”,”Year”, “Month”, “Day”, “Hour”)

.mode (“append”)
* .csv (“/Purchases”)
* . json (“/Purchases”)
* . parquet (“/Purchases”)
* . saveAsTable (“/Purchases”)

A

Box 1: partitionBy -
We should overwrite at the partition level.
Example:
df.write.partitionBy("y", "m", "d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("StoreID", "Year", "Month", "Day", "Hour")

If partitioned by StoreID and Hour only, the same hour from different days would go to the same partition, which would be inefficient.

Box 3: parquet(“/Purchases”)
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data
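
For reference, a minimal PySpark sketch of the completed write; the sample row, session setup, and output path are assumptions made for the example.

~~~python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative DataFrame with the Purchases columns (values are made up).
df = spark.createDataFrame(
    [(1, 2.5, 5.0, 2, "S01", 15, 1, 9, 2024, 25)],
    ["ProductID", "ItemPrice", "LineTotal", "Quantity",
     "StoreID", "Minute", "Month", "Hour", "Year", "Day"],
)

# One folder level per partition column, so each store's hourly load lands in its
# own partition; Parquet keeps the stored files compact.
(df.write
   .partitionBy("StoreID", "Year", "Month", "Day", "Hour")
   .mode("append")
   .parquet("/tmp/Purchases"))
~~~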

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

hard hard

You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following specifications:
✑ Contain sales data for 20,000 products.
Use hash distribution on a column named ProductID.

✑ Contain 2.4 billion records for the years 2019 and 2020.
Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?
A. 40
B. 240
C. 400
D. 2,400

A

Correct Answer: A
Each partition should have around 1 million records per distribution. Dedicated SQL pools already split each table into 60 distributions.
We have the formula: Records / (Partitions * 60) = 1 million
Partitions = Records / (1 million * 60)
Partitions = 2,400,000,000 / (1,000,000 * 60) = 40
Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows. Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000 partitions.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

hard link
HOTSPOT -
You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.

You create a table by using the Transact-SQL statement shown in the following exhibit.

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.

Hot Area:

DimProduct is a [answer choice] slowly changing dimension (SCD).
* Type 0
* Type 1
* Type 2

The ProductKey column is [answer choice].
* a surrogate key
* a business key
* an audit column
A

Type 2, because there are start and end date columns. ProductKey is a surrogate key (it is an IDENTITY column), and ProductNumber appears to be the business key.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
~~~
| Name | Data type | Nullable |
|------------------|--------------|----------|
| PurchaseKey | Bigint | No |
| DateKey | Int | No |
| SupplierKey | Int | No |
| StockItemKey | Int | No |
| PurchaseOrderID | Int | Yes |
| OrderedQuantity | Int | No |
| OrderedOuters | Int | No |
| ReceivedOuters | Int | No |
| Package | Nvarchar(50)| No |
| IsOrderFinalized | Bit | No |
| LineageKey | Int | No |
~~~

FactPurchase will have 1 million rows of data added daily and will contain three years of data.

Transact-SQL queries similar to the following query will be executed daily.

SELECT SupplierKey, StockItemKey, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
  AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey

Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on DateKey

A

Correct Answer: B
Hash-distributed tables improve query performance on large fact tables, and are the focus of this article. Round-robin tables are useful for improving loading speed.
Incorrect:
Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Lots of discussion on this one.
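
For context, a hash-distributed fact table is declared like the sketch below; the column list is abbreviated and the distribution column simply follows answer B above, it is not shown in the question's DDL.

~~~sql
-- Sketch only: PurchaseKey (answer B) is high-cardinality, unlike DateKey,
-- so rows spread evenly across the 60 distributions.
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey  BIGINT NOT NULL,
    DateKey      INT    NOT NULL,
    SupplierKey  INT    NOT NULL,
    StockItemKey INT    NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),
    CLUSTERED COLUMNSTORE INDEX
);
~~~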

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

You are implementing a batch dataset in the Parquet format.

Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool.

You need to minimize storage costs for the solution.

What should you do?
A. Use Snappy compression for the files.
B. Use OPENROWSET to query the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files.
D. Store all data as string in the Parquet files.

A

Answer should be A, because this talks about minimizing storage costs, not querying costs.
I'd go with A.
Lots of debate on this one.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

HARD hard link
DRAG DROP -
You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2 account from an Azure Synapse Analytics serverless SQL pool.

Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.
Select and Place:

* Create an external file format object
* Create an external data source 
* Create a query that uses Create Table as Select 
* Create a table 
* Create an external table
A

Step 1: Create an external data source
You can create external tables in Synapse SQL pools via the following steps:
1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
Step 2: Create an external file format object
Creating an external file format is a prerequisite for creating an external table.
3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
Step 3: Create an external table
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
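
A minimal T-SQL sketch of the three objects in order; the storage URL, object names, and Parquet format are assumptions, and authentication is omitted.

~~~sql
-- 1. External data source pointing at the storage account (placeholder URL).
CREATE EXTERNAL DATA SOURCE MyDataLake
WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer');

-- 2. External file format describing the files (Parquet assumed here).
CREATE EXTERNAL FILE FORMAT MyParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- 3. External table on top of the files, referencing both objects.
CREATE EXTERNAL TABLE dbo.MyExternalTable
(
    Id   INT,
    Name NVARCHAR(100)
)
WITH
(
    LOCATION = '/files/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = MyParquetFormat
);
~~~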

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and employee transactions.

From a source system, you have a flat extract that has the following fields:
✑ EmployeeID
✑ FirstName
✑ LastName
✑ Recipient
✑ GrossAmount
✑ TransactionID
✑ GovernmentID
✑ NetAmountPaid
✑ TransactionDate

You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.

Which two tables should you create? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.

A. a dimension table for Transaction
B. a dimension table for EmployeeTransaction
C. a dimension table for Employee
D. a fact table for Employee
E. a fact table for Transaction
A

C. a dimension table for Employee
E. a fact table for Transaction

C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer’s name and address are stored in a dimension table and updated only when the customer’s profile changes. To minimize the size of a large fact table, the customer’s name and address don’t need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query can join the two tables to associate a customer’s profile and transactions.
E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool. For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
43
Q

You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the history of the data by adding new rows as the data changes.
Which type of slowly changing dimension (SCD) should you use?

A. Type 0
B. Type 1
C. Type 2
D. Type 3
A

Correct Answer: C
A Type 2 SCD supports versioning of dimension members. Often the source system doesn’t store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Incorrect answers:
B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.
D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
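
To make the column pattern concrete, here is a hedged sketch of a Type 2 dimension; the names are illustrative, not from the question.

~~~sql
-- Illustrative Type 2 dimension: every attribute change inserts a new row version.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey      INT IDENTITY(1, 1) NOT NULL, -- surrogate key, one per row version
    CustomerSourceID INT           NOT NULL,      -- business key from the source system
    City             NVARCHAR(50)  NOT NULL,      -- tracked attribute
    StartDate        DATE          NOT NULL,      -- date the version became valid
    EndDate          DATE          NULL,          -- NULL (or a sentinel date) while current
    IsCurrent        BIT           NOT NULL       -- flag for easy filtering of current rows
);
~~~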

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
44
Q

DRAG DROP -
You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row followed by a properly formatted carriage return (\r) and line feed (\n).

You are implementing a pattern that batch loads the files daily into a dedicated SQL pool in Azure Synapse Analytics by using PolyBase.

You need to skip the header row when you import the files into the data warehouse. Before building the loading pattern, you need to prepare the required database objects in Azure Synapse Analytics.

Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

NOTE: Each correct selection is worth one point
Select and Place:

* Create a database scoped credential that uses Azure Active Directory Application and a Service Principal Key
* Create an external data source that uses the abfs location
* Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
* Create an external file format and set the First Row option
A

Go with the chat answer:
STEP 1: Create a database scoped credential that uses Azure Active Directory Application and a Service Principal Key

Step 2: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 3: Create an external file format and set the First_Row option.

DEBATED
Examtopics answer
Step 1: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 2: Create an external file format and set the First_Row option.
Create External File Format.
Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
To use PolyBase, you must create external tables to reference your external data.
Use reject options.
Note: REJECT options don’t apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they’re specified here so that the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to import before it stops the import.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-t-sql-objects https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-as-select-transact-sql
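
The header-skipping piece of the pattern looks roughly like this; the format name, delimiter, and other options are assumptions for the sketch.

~~~sql
-- Sketch: a delimited-text external file format that starts reading at row 2,
-- skipping the header row in each CSV file.
CREATE EXTERNAL FILE FORMAT CsvSkipHeader
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = TRUE
    )
);
~~~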

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
45
Q

hard GO TO LINK FOR THIS
HOTSPOT -
You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020.

You need to ensure that the table meets the following requirements:
✑ Minimizes the processing time to delete data that is older than 10 years
✑ Minimizes the I/O for queries that use year-to-date values

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

CREATE TABLE [dbo].[FactTransaction]
(
    [TransactionTypeID] int NOT NULL,
    [TransactionDateID] int NOT NULL,
    [CustomerID] int NOT NULL,
    [RecipientID] int NOT NULL,
    [Amount] money NOT NULL
)
WITH
( XXXXXXXXXXXXXXXXXXXX
  ( YYYYYYYYYYYYYYYYYYYY RANGE RIGHT FOR VALUES
    (20200101, 20200201, 20200301, 20200401, 20200501, 20200601)
A

Box 1: PARTITION
RANGE RIGHT FOR VALUES is used with PARTITION.
Box 2: [TransactionDateID]
Partition on the date column.
Example: Creating a RANGE RIGHT partition function on a datetime column
The following partition function partitions a table or index into 12 partitions, one for each month of a year's worth of values in a datetime column.
CREATE PARTITION FUNCTION [myDateRangePF1] (datetime)
AS RANGE RIGHT FOR VALUES ('20030201', '20030301', '20030401',
'20030501', '20030601', '20030701', '20030801',
'20030901', '20031001', '20031101', '20031201');
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql
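
Putting the two boxes together, the completed clause would look roughly like this; the column list is abbreviated and the distribution and index options are left at their defaults for brevity.

~~~sql
-- Sketch of the completed partition clause only.
CREATE TABLE dbo.FactTransaction
(
    TransactionTypeID INT   NOT NULL,
    TransactionDateID INT   NOT NULL,
    Amount            MONEY NOT NULL
)
WITH
(
    PARTITION
    (
        TransactionDateID RANGE RIGHT FOR VALUES
        (20200101, 20200201, 20200301, 20200401, 20200501, 20200601)
    )
);
~~~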

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
46
Q

You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics serverless SQL pool.

You execute the Transact-SQL query shown in the following exhibit.

SELECT
    payment_type,
    SUM(fare_amount) AS fare_total
FROM OPENROWSET(
    BULK 'csv/busfare/tripdata_2020*.csv',
    DATA_SOURCE = 'BusData',
    FORMAT = 'CSV', PARSER_VERSION = '2.0',
    FIRSTROW = 2
)
WITH (
    payment_type INT 10,
    fare_amount FLOAT 11
) AS nyc
GROUP BY payment_type
ORDER BY payment_type;

What do the query results include?
A. Only CSV files in the tripdata_2020 subfolder.
B. All files that have file names that begin with "tripdata_2020".
C. All CSV files that have file names that contain "tripdata_2020".
D. Only CSV files that have file names that begin with "tripdata_2020".

A

Correct Answer: D

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
47
Q

hard link
DRAG DROP -
You use PySpark in Azure Databricks to parse the following JSON input.

{
  "persons": [
    {
      "name": "Keith",
      "age": 30,
      "dogs": ["Fido", "Fluffy"]
    },
    {
      "name": "Donna",
      "age": 46,
      "dogs": ["Spot"]
    }
  ]
}

You need to output the data in the following tabular format.

| owner | age | dog    |
|-------|-----|--------|
| Keith | 30  | Fido   |
| Keith | 30  | Fluffy |
| Donna | 46  | Spot   |

How should you complete the PySpark code? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.
Select and Place:

Values
* alias
* array_union
* createDataFrame
* explode
* select
* translate

dbutils.fs.put("/tmp/source.json", source_json, True)
source_df = spark.read.option("multiline", "true").json("/tmp/source.json")
persons = source_df.XXXXXXXX(YYYYYYYYYY("persons").alias("persons"))
persons_dogs = persons.select(col("persons.name").alias("owner"), col("persons.age").alias("age"),
    explode("persons.dogs").ZZZZZZZZZZ("dog"))
display(persons_dogs)


A

Box 1: select

Box 2: explode

Box 3: alias
pyspark.sql.Column.alias returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
Reference:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html https://docs.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/explode
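
A self-contained PySpark sketch of the completed answer; the JSON is embedded inline instead of using dbutils so the snippet runs outside Databricks, and everything else follows the boxes above.

~~~python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

source_json = (
    '{"persons": ['
    '{"name": "Keith", "age": 30, "dogs": ["Fido", "Fluffy"]},'
    '{"name": "Donna", "age": 46, "dogs": ["Spot"]}'
    ']}'
)
source_df = spark.read.json(spark.sparkContext.parallelize([source_json]))

# Boxes 1 and 2: select + explode turn the persons array into one row per person.
persons = source_df.select(explode("persons").alias("persons"))

# Box 3: alias names the exploded dogs column, giving one row per (owner, dog) pair.
persons_dogs = persons.select(
    col("persons.name").alias("owner"),
    col("persons.age").alias("age"),
    explode("persons.dogs").alias("dog"),
)

persons_dogs.show()
~~~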

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
48
Q

HOTSPOT -
You are designing an application that will store petabytes of medical imaging data.

When the data is first created, the data will be accessed frequently during the first week. After one month, the data must be accessible within 30 seconds, but files will be accessed infrequently. After one year, the data will be accessed infrequently but must be accessible within five minutes.

You need to select a storage strategy for the data. The solution must minimize costs.

Which storage tier should you use for each time frame? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:
~~~
First week:
* Archive
* Cool
* Hot
After one month:
* Archive
* Cool
* Hot
After one year:
* Archive
* Cool
* Hot
~~~

A

First week: Hot
After one month: Cool
After one year: Cool

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
49
Q

hard
You have an Azure Synapse Analytics Apache Spark pool named Pool1.
You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure and data types vary by file.

You need to load the files into the tables. The solution must maintain the source data types.

What should you do?
A. Use a Conditional Split transformation in an Azure Synapse data flow.
B. Use a Get Metadata activity in Azure Data Factory.
C. Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless SQL pool.
D. Load the data by using PySpark.

A

Should be D, it’s about Apache Spark pool, not serverless SQL pool.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
50
Q

You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an all-purpose cluster named cluster1.

You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs.

What should you do first?
A. Configure a global init script for workspace1.
B. Create a cluster policy in workspace1.
C. Upgrade workspace1 to the Premium pricing tier.
D. Create a pool in workspace1.

A

Answer D is correct. Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances.

You can use Databricks pools to speed up your data pipelines and scale clusters quickly.
Databricks pools are a managed cache of virtual machine instances that enables clusters to start and scale 4 times faster.
Reference:
https://databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
51
Q

Hard
HOTSPOT -
You are building an Azure Stream Analytics job that queries reference data from a product catalog file. The file is updated daily.

The reference data input details for the file are shown in the Input exhibit. (Click the Input tab.)

YOU'LL HAVE TO GO TO THE WEBSITE FOR THIS IMAGE

You need to configure the Stream Analytics job to pick up the new reference data.
What should you configure?
To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Path pattern:
* {date}/product.csv
* {date}/{time}/product.csv
* product.csv
* */product.csv

Date format:
* MM/DD/YYYY
* YYYY/MM/DD
* YYYY-DD-MM
* YYYY-MM-DD

A

First box: {date}/product.csv, because the requirement is reference data loaded on a daily basis, so it is refreshed once a day, not hourly.
Second box is the straightforward answer: YYYY-MM-DD.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
52
Q

THE OFFICIAL ANSWER TO THIS IS NOT CLEAR
HOTSPOT -
You have the following Azure Stream Analytics query.

WITH
step1 AS (SELECT *
        FROM input1
        PARTITION BY StateID
        INTO 10),
step2 AS (SELECT *
        FROM input2
        PARTITION BY StateID
        INTO 10)
SELECT *        
INTO output
FROM step1
PARTITION BY StateID
UNION
SELECT * INTO output
        FROM step2
        PARTITION BY StateID

For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
Hot Area:

The query combines two streams of partitioned data.         YES/NO

The stream scheme key and count must match the output scheme.          YES/NO

Providing 60 streaming units will optimize the performance of the query.         YES/NO
A

DEBATED HEAVILY

False (60/40), True (reasonably confident), False (reasonably confident).
https://learn.microsoft.com/en-us/azure/stream-analytics/repartition
The first is False, because this:
“The following example query joins two streams of repartitioned data.”
It’s extracted from the link above, and it’s pointing to our query! Repartitioned and not partitioned.
Second is True, it’s explicitly written
The output scheme should match the stream scheme key and count so that each substream can be flushed independently.
Third is False,
“In general, six SUs are needed for each partition.”
In the example we have 10 partitions for step 1 and 10 for step 2, so it should be 120 SUs and not 60.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
53
Q

hard
HOTSPOT -
You are building a database in an Azure Synapse Analytics serverless SQL pool.
You have data stored in Parquet files in an Azure Data Lake Storage Gen2 container.
Records are structured as shown in the following sample.

{
"id": 123,
"address_housenumber": "19c",
"address_line": "Memory Lane",
"applicant1_name": "Jane",
"applicant2_name": "Dev"
}

The records contain two applicants at most.

You need to build a table that includes only the address fields.

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:
SEE IMG ON SITE

XXXXXXXX applications
* CREATE EXTERNAL TABLE
* CREATE TABLE
* CREATE VIEW
WITH (
         LOCATION = 'applications/',
         DATA_SOURCE = applications_ds,
         FILE_FORMAT = applications_file_format
)
AS
SELECT id, [address_housenumber] as addresshousenumber, [address_line1] as addressline1
FROM
XXXXXXXX (BULK 'https://contoso1.dfs.core.windows.net/applications/year=*/*.parquet', ...
* CROSS APPLY
* OPENJSON
* OPENROWSET
FORMAT = 'PARQUET') AS [r]
GO
A

Box 1: CREATE EXTERNAL TABLE -
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or serverless SQL pool.
Syntax:
CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }
( <column_definition> [ ,...n ] )
WITH (
LOCATION = 'folder_or_filepath',
DATA_SOURCE = external_data_source_name,
FILE_FORMAT = external_file_format_name )

Box 2: OPENROWSET
When using serverless SQL pool, CETAS is used to create an external table and export query results to Azure Storage Blob or Azure Data Lake Storage Gen2.
Example:

AS
SELECT decennialTime, stateName, SUM(population) AS population
FROM OPENROWSET(BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=*/*.parquet',
    FORMAT = 'PARQUET') AS [r]
GROUP BY decennialTime, stateName
GO
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
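
Assembled into one statement, the answer would read roughly as follows; the data source, file format, storage URL, and aliases are taken from the answer area above, so treat this as a sketch of the completed statement rather than the official exhibit.

~~~sql
-- Sketch of the completed CETAS statement for the serverless SQL pool.
CREATE EXTERNAL TABLE applications
WITH
(
    LOCATION = 'applications/',
    DATA_SOURCE = applications_ds,
    FILE_FORMAT = applications_file_format
)
AS
SELECT id,
       [address_housenumber] AS addresshousenumber,
       [address_line1] AS addressline1
FROM OPENROWSET
(
    BULK 'https://contoso1.dfs.core.windows.net/applications/year=*/*.parquet',
    FORMAT = 'PARQUET'
) AS [r];
~~~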

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
54
Q

Hard
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage Gen2 account named Account1.

You plan to access the files in Account1 by using an external table.

You need to create a data source in Pool1 that you can reference when you create the external table.

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

CREATE EXTERNAL DATA SOURCE source1
WITH
( LOCATION = 'https://account1.[ blob, dfs, table ].core.windows.net',
[PUSHDOWN = ON, TYPE = BLOB_STORAGE, TYPE = HADOOP]
A

Box 1: DEBATED; go for dfs, I think.
Box 2: TYPE = HADOOP
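
Using the choices above, the completed statement would look roughly like this; the credential clause, which the exhibit truncates, is omitted, so treat it as a sketch only.

~~~sql
-- Sketch of the completed statement based on the answer choices above.
CREATE EXTERNAL DATA SOURCE source1
WITH
(
    LOCATION = 'https://account1.dfs.core.windows.net',
    TYPE = HADOOP
);
~~~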

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
55
Q

You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool named Pool1.

You need to store data in storage1. The data will be read by Pool1. The solution must meet the following requirements:
* Enable Pool1 to skip columns and rows that are unnecessary in a query.
* Automatically create column statistics.
* Minimize the size of files.

Which type of file should you use?
A. JSON
B. Parquet
C. Avro
D. CSV

A

Correct Answer: B 🗳️
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV files statistics is supported.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
56
Q

DRAG DROP -
You plan to create a table in an Azure Synapse Analytics dedicated SQL pool.
Data in the table will be retained for five years. Once a year, data that is older than five years will be deleted.

You need to ensure that the data is distributed evenly across partitions. The solution must minimize the amount of time required to delete old data.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.

VALUES
* CustomerKey
* HASH
* ROUND ROBIN
* REPLICATE
* OrderDateKey
* SalesOrderNumber

SEE SITE FOR IMG
DISTRIBUTION = XXXXXXXXXX ([ProductKey])
PARTITION [ XXXXXXXXX] RANGE RIGHT FOR VALUES

A

Box 1: HASH -

Box 2: OrderDateKey -
In most cases, table partitions are created on a date column.
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data early. Then you can switch out the partition with data for an empty partition from another table.
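
As a hedged illustration of the partition-switching pattern mentioned above: the table names and partition number are made up, and the target table is assumed to be empty with the same schema, distribution, and partition boundaries.

~~~sql
-- Switch the oldest partition out of the fact table as a metadata-only operation,
-- then clear the staging table. Far faster than DELETE on billions of rows.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old PARTITION 1;

TRUNCATE TABLE dbo.FactSales_Old;
~~~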

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
57
Q

VERY SIMILAR TO ANOTHER QUESTION

HOTSPOT -
You have an Azure Data Lake Storage Gen2 service.

You need to design a data archiving solution that meets the following requirements:
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed.
✑ Costs must be minimized while maintaining the required availability.

How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.

Hot Area:

Data over five years old:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.

Data over seven years old:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.

A

Box 1: Move to cool storage -

Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
58
Q

HOTSPOT -
You plan to create an Azure Data Lake Storage Gen2 account.

You need to recommend a storage solution that meets the following requirements:
✑ Provides the highest degree of data resiliency
✑ Ensures that content remains available for writes if a primary data center fails

What should you include in the recommendation? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Replication mechanism:
* Change feed
* Zone-redundant storage (ZRS)
* Read-access geo-redundant storage (RA-GRS)
* Read-access geo-zone-redundant storage (RA-GZRS)

Failover process:
* Failover initiated by Microsoft
* Failover manually initiated by the customer
* Failover automatically initiated by an Azure Automation job

A

DEBATED HEAVILY
Zone-redundant storage (ZRS)
"Ensures that content remains available for writes if a primary data center fails." RA-GRS and RA-GZRS provide read access only after failover. The correct answer is ZRS, as stated in the link below: "Microsoft recommends using ZRS in the primary region for Azure Data Lake Storage Gen2 workloads." https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json
Failover initiated by Microsoft.
Customer-managed account failover is not yet supported in accounts that have a hierarchical namespace (Azure Data Lake Storage Gen2). To learn more, see Blob storage features available in Azure Data Lake Storage Gen2.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
59
Q

You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.

You have a table that was created by using the following Transact-SQL statement.

CREATE TABLE [dbo].[DimProduct] (
[ProductKey] [int] IDENTITY (1, 1) NOT NULL,
[ProductSourceID] [int] NOT NULL,
[ProductName] [nvarchar] (100) NOT NULL,
[Color] [nvarchar] (15) NULL,
[SellStartDate] [date] NOT NULL,
[SellEndDate] [date] NULL,
[RowInsertedDateTime] [datetime] NOT NULL,
[RowUpdatedDateTime] [datetime] NOT NULL,
[ETLAuditID] [int] NOT NULL
)

Which two columns should you add to the table? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. [EffectiveEndDate] [datetime] NULL,
B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
C. [ProductCategory] [nvarchar] (100) NOT NULL,
D. [EffectiveStartDate] [datetime] NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,

A

Correct Answer: BE
A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members.
SEE SITE FOR IMG EXPLANTION
https://www.examtopics.com/exams/microsoft/dp-203/view/6/
Reference:
https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/
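
A sketch of how the two chosen columns (answers B and E) sit alongside the rest of a Type 3 design; this is an illustrative fragment, not the full DDL from the question.

~~~sql
-- Type 3 tracking: one column for the current value, one for the original value.
CREATE TABLE [dbo].[DimProductType3]
(
    [ProductKey]              [int] IDENTITY (1, 1) NOT NULL,
    [ProductSourceID]         [int] NOT NULL,
    [CurrentProductCategory]  [nvarchar](100) NOT NULL,  -- answer B
    [OriginalProductCategory] [nvarchar](100) NOT NULL   -- answer E
);
~~~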

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
60
Q

Hard
DRAG DROP -
You have an Azure subscription.
You plan to build a data warehouse in an Azure Synapse Analytics dedicated SQL pool named pool1 that will contain staging tables and a dimensional model.
Pool1 will contain the following tables.

| Name                  | Number of rows | Update frequency              | Description |
|-----------------------|----------------|-------------------------------|-------------|
| Common.Date           | 7,300          | New rows inserted yearly      | Contains one row per date for the last 20 years. Contains columns named Year, Month, Quarter, and IsWeekend. |
| Marketing.WebSessions | 1,500,500,000  | Hourly inserts and updates    | Fact table that contains counts of sessions and page views, including foreign key values for date, channel, device, and medium. |
| Staging.WebSessions   | 300,000        | Hourly truncation and inserts | Staging table for web session data, including descriptive fields for channel, device, and medium. |

You need to design the table storage for pool1. The solution must meet the following requirements:
✑ Maximize the performance of data loading operations to Staging.WebSessions.
✑ Minimize query times for reporting queries against the dimensional model.

Which type of table distribution should you use for each table? To answer, drag the appropriate table distribution types to the correct tables. Each table distribution type may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.
Select and Place:

Values
* Hash
* replicated
* round-robin

Common.Date: xxxxxxxx

Marketing.WebSessions: xxxxxxxxxx

Staging.WebSessions: xxxxxxxxxxxx


A

Box 1: Replicated -
The best table storage option for a small table is to replicate it across all the Compute nodes.

Box 2: Hash -
Hash-distribution improves query performance on large fact tables.

Box 3: Round-robin -
Round-robin distribution is useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Replicated (Because its a Dimension table)
Hash (Fact table with High volume of data)
Round-Robin (Staging table)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
61
Q

Hard
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool.

You need to create a table named FactInternetSales that will be a large fact table in a dimensional model. FactInternetSales will contain 100 million rows and two columns named SalesAmount and OrderQuantity. Queries executed on FactInternetSales will aggregate the values in SalesAmount and OrderQuantity from the last year for a specific product. The solution must minimize the data size and query execution time.

How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.

Hot Area:

CREATE TABLE [dbo]. [FactInternetSales]
( [ProductKey] int NOT NULL
, [OrderDatekey] int NOT NULL
, [CustomerKey] int NOT NULL
, [Promotionkey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [OrderQuantity] smallint NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( [ CLUSTERED COLUMNSTORE INDEX / CLUSTERED INDEX ([OrderDateKey]) / HEAP / INDEX on [ProductKey] ]
, DISTRIBUTION =
[Hash([OrderDateKey]) / Hash([ProductKey]) / REPLICATE / ROUND_ROBIN]
);
A

Box 1: CLUSTERED COLUMNSTORE INDEX

CLUSTERED COLUMNSTORE INDEX -
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage. You can also achieve gains up to 10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x) SP1, columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a transactional workload.
Note: Clustered columnstore index
A clustered columnstore index is the physical storage for the entire table.

To reduce fragmentation of the column segments and improve performance, the columnstore index might store some data temporarily into a clustered index called a deltastore and a B-tree list of IDs for deleted rows. The deltastore operations are handled behind the scenes. To return the correct query results, the clustered columnstore index combines query results from both the columnstore and the deltastore.

Box 2: HASH([ProductKey])
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table is designed to achieve high performance for queries on large tables.
Choose a distribution column with data that distributes evenly
Incorrect:
* Not HASH([OrderDateKey]): OrderDateKey is a date column, and you should not distribute on a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
* A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables don’t require data movement. Replication requires extra storage, though, and isn’t practical for large tables.
* A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than the other distribution methods.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
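
Putting the two boxes into the original statement gives something like the following; only the WITH clause reflects the answer, the column list is from the exhibit.

~~~sql
CREATE TABLE [dbo].[FactInternetSales]
(
    [ProductKey]       int          NOT NULL,
    [OrderDateKey]     int          NOT NULL,
    [CustomerKey]      int          NOT NULL,
    [PromotionKey]     int          NOT NULL,
    [SalesOrderNumber] nvarchar(20) NOT NULL,
    [OrderQuantity]    smallint     NOT NULL,
    [UnitPrice]        money        NOT NULL,
    [SalesAmount]      money        NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,       -- Box 1
    DISTRIBUTION = HASH([ProductKey])  -- Box 2
);
~~~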

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
62
Q

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1. Table1 contains the following:
✑ One billion rows
✑ A clustered columnstore index
✑ A hash-distributed column named Product Key
✑ A column named Sales Date that is of the date data type and cannot be null

Thirty million rows will be added to Table1 each month.

You need to partition Table1 based on the Sales Date column. The solution must optimize query performance and data loading.

How often should you create a partition?
A. once per month
B. once per year
C. once per day
D. once per week

A

Debated
Correct Answer: B 🗳️

Remembering that the data is split across 60 distributions, and considering that we need a MINIMUM of 1 million rows per distribution and partition, we have:

A. once per month = 30 million / 60 = 500K records per partition
B. once per year = 360 million / 60 = 6 million records per partition
C. once per day = about 1 million / 60 = about 16K records per partition
D. once per week = about 7.5 million / 60 = 125K records per partition

correct should be B

Need a minimum 1 million rows per distribution. Each table is 60 distributions. 30 millions rows is added each month. Need 2 months to get a minimum of 1 million rows per distribution in a new partition.
Note: When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributions.
Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
63
Q

You have an Azure Databricks workspace that contains a Delta Lake dimension table named Table1.

Table1 is a Type 2 slowly changing dimension (SCD) table.
You need to apply updates from a source table to Table1.

Which Apache Spark SQL operation should you use?
A. CREATE
B. UPDATE
C. ALTER
D. MERGE

A

Correct Answer: D 🗳️

When applying updates to a Type 2 slowly changing dimension (SCD) table in Azure Databricks, the best option is to use the MERGE operation in Apache Spark SQL. This operation allows you to combine the data from the source table with the data in the destination table, and then update or insert the appropriate records. The MERGE operation provides a powerful and flexible way to handle updates for SCD tables, as it can handle both updates and inserts in a single operation. Additionally, this operation can be performed on Delta Lake tables, which can easily handle the ACID transactions needed for handling SCD updates.

Delta provides the ability to infer the schema for data input, which further reduces the effort required to manage schema changes. Slowly Changing Data (SCD) Type 2 records all the changes made to each key in the dimensional table. These operations require updating the existing rows to mark the previous values of the keys as old and then inserting new rows as the latest values. Given a source table with the updates and a target table with the dimensional data, SCD Type 2 can be expressed with a merge.
Example:
// Implementing the SCD Type 2 operation using the merge function
customersTable
  .as("customers")
  .merge(
    stagedUpdates.as("staged_updates"),
    "customers.customerId = mergeKey")
  .whenMatched("customers.current = true AND customers.address <> staged_updates.address")
  .updateExpr(Map(
    "current" -> "false",
    "endDate" -> "staged_updates.effectiveDate"))
  .whenNotMatched()
  .insertExpr(Map(
    "customerid" -> "staged_updates.customerId",
    "address" -> "staged_updates.address",
    "current" -> "true",
    "effectiveDate" -> "staged_updates.effectiveDate",
    "endDate" -> "null"))
  .execute()
Reference:
https://www.projectpro.io/recipes/what-is-slowly-changing-data-scd-type-2-operation-delta-table-databricks
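
For the Spark SQL flavour of the same idea, a simplified MERGE INTO sketch is shown below; the table and column names are made up, and unlike the staged-updates pattern above, this simple form only expires the old row and inserts brand-new keys.

~~~sql
MERGE INTO dim_customer AS target
USING staged_updates AS source
  ON target.customer_id = source.customer_id AND target.is_current = true
WHEN MATCHED AND target.address <> source.address THEN
  UPDATE SET target.is_current = false,
             target.end_date   = source.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, is_current, effective_date, end_date)
  VALUES (source.customer_id, source.address, true, source.effective_date, null);
~~~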


How well did you know this?
1
Not at all
2
3
4
5
Perfectly
64
Q

You are designing an Azure Data Lake Storage solution that will transform raw JSON files for use in an analytical workload.

You need to recommend a format for the transformed files. The solution must meet the following requirements:
✑ Contain information about the data types of each column in the files.
✑ Support querying a subset of columns in the files.
✑ Support read-heavy analytical workloads.
✑ Minimize the file size.

What should you recommend?
A. JSON
B. CSV
C. Apache Avro
D. Apache Parquet

A

Correct Answer: D 🗳️
Parquet, an open-source file format for Hadoop, stores nested data structures in a flat columnar format.
Compared to a traditional approach where data is stored in a row-oriented approach, Parquet file format is more efficient in terms of storage and performance.
It is especially good for queries that read particular columns from a "wide" (many columns) table, since only the needed columns are read and IO is minimized.
Incorrect:
Not C:
The Avro format is the ideal candidate for storing data in a data lake landing zone because:
1. Data from the landing zone is usually read as a whole for further processing by downstream systems (the row-based format is more efficient in this case).
2. Downstream systems can easily retrieve table schemas from Avro files (there is no need to store the schemas separately in an external meta store).
3. Any source schema change is easily handled (schema evolution).
Reference:
https://www.clairvoyant.ai/blog/big-data-file-formats

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
65
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is less than 1 MB.
Does this meet the goal?

A. Yes
B. No

A

Correct Answer: A 🗳️

Variations of this question
Solution: You convert the files to compressed delimited text files.
Does this meet the goal? YES
Solution: You copy the files to a table that has a columnstore index.
Does this meet the goal? NO
Solution: You modify the files to ensure that each row is more than 1 MB.
Does this meet the goal? NO
Solution: You modify the files to ensure that each row is less than 1 MB.
Does this meet the goal? YES

Polybase loads rows that are smaller than 1 MB.
Note on Polybase Load: PolyBase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language.
Extract, Load, and Transform (ELT)
Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a data warehouse, and then transformed.
The basic steps for implementing a PolyBase ELT for dedicated SQL pool are:
Extract the source data into text files.
Land the data into Azure Blob storage or Azure Data Lake Store.
Prepare the data for loading.
Load the data into dedicated SQL pool staging tables using PolyBase.
Transform the data.
Insert the data into production tables.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-service-capacity-limits https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
66
Q

You plan to create a dimension table in Azure Synapse Analytics that will be less than 1 GB.
You need to create the table to meet the following requirements:
✑ Provide the fastest query time.
✑ Minimize data movement during queries.

Which type of table should you use?
A. replicated
B. hash distributed
C. heap
D. round-robin

A

Correct Answer: A 🗳️
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed. 2 GB is not a hard limit.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
67
Q

You are designing a dimension table in an Azure Synapse Analytics dedicated SQL pool.
You need to create a surrogate key for the table. The solution must provide the fastest query performance.

What should you use for the surrogate key?
A. a GUID column
B. a sequence object
C. an IDENTITY column

A

Correct Answer: C 🗳️
Use IDENTITY to create surrogate keys using dedicated SQL pool in AzureSynapse Analytics.
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
68
Q

You have an Azure Data Lake Storage Gen2 account that contains a container named container1. You have an Azure Synapse Analytics serverless SQL pool that contains a native external table named dbo.Table1. The source data for dbo.Table1 is stored in container1. The folder structure of container1 is shown in the following exhibit.

The external data source is defined by using the following statement.
[SEE WEBSITE FOR IMAGES]

For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.

When selecting all the rows in dbo.Table1, data from the mydata2.csv file will be returned.          YES/NO
When selecting all the rows in dbo.Table1, data from the mydata3.csv file will be returned..          YES/NO
When selecting all the rows in dbo.Table1, data from the _mydata4.csv file will be returned..          YES/NO
A

Yes, No, No.
Both Hadoop and native external tables will skip the files with names that begin with an underscore (_) or a period (.).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
69
Q

You have an Azure Synapse Analytics dedicated SQL pool.

You need to create a fact table named Table1 that will store sales data from the last three years. The solution must be optimized for the following query operations:

  • Show order counts by week.
  • Calculate sales totals by region.
  • Calculate sales totals by product.
  • Find all the orders from a given month.

Which data should you use to partition Table1?

A. product
B. month
C. week
D. region

A

Selected Answer: B
When designing a fact table in a data warehouse, it is important to consider the types of queries that will be run against it. In this case, the queries that need to be optimized include: show order counts by week, calculate sales totals by region, calculate sales totals by product, and find all the orders from a given month.

Partitioning the table by month would be the best option in this scenario as it would allow for efficient querying of data by month, which is necessary for the query operations described above. For example, it would be easy to find all the orders from a given month by only searching the partition for that specific month.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
70
Q

You are designing the folder structure for an Azure Data Lake Storage Gen2 account.

You identify the following usage patterns:

  • Users will query data by using Azure Synapse Analytics serverless SQL pools and Azure Synapse Analytics serverless Apache Spark pools.
  • Most queries will include a filter on the current year or week.
  • Data will be secured by data source.

You need to recommend a folder structure that meets the following requirements:

  • Supports the usage patterns
  • Simplifies folder security
  • Minimizes query times

Which folder structure should you recommend?

A. \DataSource\SubjectArea\YYYY\WW\FileData_YYYY_MM_DD.parquet

B. \DataSource\SubjectArea\YYYY-WW\FileData_YYYY_MM_DD.parquet

C. DataSource\SubjectArea\WW\YYYY\FileData_YYYY_MM_DD.parquet

D. \YYYY\WW\DataSource\SubjectArea\FileData_YYYY_MM_DD.parquet

E. WW\YYYY\SubjectArea\DataSource\FileData_YYYY_MM_DD.parquet

A

chat GPT: Based on the given usage patterns and requirements, the recommended folder structure would be option B:

\DataSource\SubjectArea\YYYY-WW\FileData_YYYY_MM_DD.parquet

This structure allows for easy filtering of data by year and week, which aligns with the identified usage pattern of most queries filtering by the current year or week. It also organizes the data by data source and subject area, which simplifies folder security. By using a flat structure, with the data files directly under the year-week folder, query times can be minimized as the data is organized for efficient partition pruning.

Option A is similar but includes an additional level of hierarchy for the year, which is unnecessary given the requirement to filter by year-week. Options C, D, and E do not follow a consistent hierarchy, making it difficult to navigate and locate specific data files.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
71
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a table named table1.

You load 5 TB of data into table1.

You need to ensure that columnstore compression is maximized for table1.

Which statement should you execute?

A. DBCC INDEXDEFRAG (pool1, table1)
B. DBCC DBREINDEX (table1)
C. ALTER INDEX ALL on table1 REORGANIZE
D. ALTER INDEX ALL on table1 REBUILD

A

D. ALTER INDEX ALL on table1 REBUILD

This statement will rebuild all indexes on table1, which can help to maximize columnstore compression. The other options are not appropriate for this task.
DBCC INDEXDEFRAG (pool1, table1) is for defragmenting the indexes and DBCC DBREINDEX (table1) is for recreating the indexes. ALTER INDEX ALL on table1 REORGANIZE is for reorganizing the indexes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
72
Q

You have an Azure Synapse Analytics dedicated SQL pool named pool1.

You plan to implement a star schema in pool and create a new table named DimCustomer by using the following code.

CREATE TABLE dbo. [DimCustomer](
[CustomerKey] int NOT NULL,
[CustomerSourceID] [int] NOT NULL,
[Title] [nvarchar](8) NULL,
[FirstName] [nvarchar](50) NOT NULL,
[MiddleName] [nvarchar](50) NULL,
[LastName] [nvarchar](50) NOT NULL,
[Suffix] [nvarchar](10) NULL,
[CompanyName] [nvarchar](128) NULL,
[SalesPerson] [nvarchar](256) NULL,
[EmailAddress] [nvarchar](50) NULL,
[Phone] [nvarchar](25) NULL,
[InsertedDate] [datetime] NOT NULL,
[ModifiedDate] [datetime] NOT NULL,
[HashKey] [varchar](100) NOT NULL,
[IsCurrentRow] [bit] NOT NULL
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);
GO

You need to ensure that DimCustomer has the necessary columns to support a Type 2 slowly changing dimension (SCD).

Which two columns should you add? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. [HistoricalSalesPerson] [nvarchar] (256) NOT NULL
B. [EffectiveEndDate] [datetime] NOT NULL
C. [PreviousModifiedDate] [datetime] NOT NULL
D. [RowID] [bigint] NOT NULL
E. [EffectiveStartDate] [datetime] NOT NULL

A

DEBATED
Selected Answer: BE
The insertion date and the effective date range (valid from/to) are different things. You can insert data now that has either future validity or past validity (when correcting errors, for example).
So the options are B and E.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
73
Q

Hard
HOTSPOT
You have an Azure subscription that contains an Azure Synapse Analytics dedicated SQL pool.
You plan to deploy a solution that will analyze sales data and include the following:

  • A table named Country that will contain 195 rows
  • A table named Sales that will contain 100 million rows
  • A query to identify total sales by country and customer from the past 30 days

You need to create the tables. The solution must maximize query performance.

How should you complete the script? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

see site for context
DISTRIBUTION =
* HASH([Customerld])
* HASH([OrderDate])
* REPLICATE
* ROUND_ROBIN

DISTRIBUTION =
* HASH([CountryCode])
* HASH([Countryld])
* REPLICATE
* ROUND_ROBIN

A

1. Hash(CustomerID)
2. Replicate
It is hash because it is a fact table (you can tell because there is a "total" column being created, which is numerical). Rule of thumb: never hash on a date field, so in this case you would hash on CustomerID. You want the hash column to have as many unique values as possible.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
74
Q

You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named account1 and an Azure Synapse Analytics workspace named workspace1.

You need to create an external table in a serverless SQL pool in workspace1. The external table will reference CSV files stored in account1. The solution must maximize performance.

How should you configure the external table?

A. Use a native external table and authenticate by using a shared access signature (SAS).
B. Use a native external table and authenticate by using a storage account key.
C. Use an Apache Hadoop external table and authenticate by using a shared access signature (SAS).
D. Use an Apache Hadoop external table and authenticate by using a service principal in Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra.

A

Correct Answer: A 🗳️
Correct. Serverless SQL pools support only native external tables, not Hadoop external tables, and storage account key authentication is never a best practice, which leaves A as the only viable answer.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop

The other options provided (B, C, and D) are not the recommended configurations for maximizing performance in this scenario. Using a storage account key for authentication (option B) poses a security risk and should be avoided. Apache Hadoop external tables (options C and D) do not provide the same level of performance optimization as native external tables in Azure Synapse Analytics.
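A rough sketch of option A in a serverless SQL pool; container1, folder1, the SAS secret, and the column list are placeholders, and the master key and credential only need to be created once per database:

~~~
-- Run once per database before creating credentials:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';

-- SAS authentication (option A); paste the token without the leading '?'.
CREATE DATABASE SCOPED CREDENTIAL SasCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token>';

-- No TYPE = HADOOP here, so tables over this data source are native.
CREATE EXTERNAL DATA SOURCE Account1Csv
WITH (
    LOCATION = 'https://account1.dfs.core.windows.net/container1',
    CREDENTIAL = SasCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.ExternalTable1
(
    Col1 int,
    Col2 varchar(100)
)
WITH (
    LOCATION = 'folder1/',
    DATA_SOURCE = Account1Csv,
    FILE_FORMAT = CsvFormat
);
~~~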

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
75
Q

HOTSPOT

You have an Azure Synapse Analytics serverless SQL pool that contains a database named db1. The data model for db1 is shown in the following exhibit.

GO TO SITE TO SEE SCHEMA

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the exhibit.

NOTE: Each correct selection is worth one point.

To convert the data model to a star schema, [answer choice].
* join DimGeography and DimCustomer
* join DimGeography and FactOrders
* union DimGeography and DimCustomer
* union DimGeography and FactOrders

Once the data model is converted into a star schema, there will be [answer choice] tables.
* 4
* 5
* 6
* 7

A

Correct answer should be join DimGeography and DimCustomer and 5 tables.

You also need to combine ProductLine and Product in order for the schema to be considered a star schema. This would result in 5 remaining tables: DimCustomer (DimCustomer JOIN DimGeography), DimStore, Date, Product (Product JOIN ProductLine) and FactOrders.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
76
Q

You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1.

New files are uploaded daily to storage1.

You need to recommend a solution that configures storage1 as a structured streaming source. The solution must meet the following requirements:

  • Incrementally process new files as they are uploaded to storage1.
  • Minimize implementation and maintenance effort.
  • Minimize the cost of processing millions of files.
  • Support schema inference and schema drift.

Which should you include in the recommendation?

A. COPY INTO
B. Azure Data Factory
C. Auto Loader
D. Apache Spark FileStreamSource

A

Correct Answer: C 🗳️
Bing explains the following:
The best option is C. Auto Loader.

Auto Loader is a feature in Azure Databricks that uses a cloudFiles data source to incrementally and efficiently process new data files as they arrive in Azure Data Lake Storage Gen2. It supports schema inference and schema evolution (drift). It also minimizes implementation and maintenance effort, as it simplifies the ETL pipeline by reducing the complexity of identifying new files for processing.

Other options do not meet the requirements because:
A. COPY INTO: does not incrementally process new files as they are uploaded, which is one of your requirements.

B. Azure Data Factory: does not natively support schema inference and schema drift. The incremental processing of new files would need to be manually implemented, which could increase implementation and maintenance effort.

D. Apache Spark FileStreamSource: requires manual setup and does not natively support schema inference or schema drift. It also may not minimize the cost of processing millions of files as efficiently as Auto Loader.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
77
Q

HARD
You have an Azure subscription that contains the resources shown in the following table.

| Name     | Type                              | Description                                                           |
|----------|-----------------------------------|-----------------------------------------------------------------------|
| storage1 | Azure Blob storage account        | Contains publicly accessible TSV files that do NOT have a header row  |
| WS1      | Azure Synapse Analytics workspace | Contains a serverless SQL pool                                         |

You need to read the TSV files by using ad-hoc queries and the OPENROWSET function. The solution must assign a name and override the inferred data type of each column.

What should you include in the OPENROWSET function?

A. the WITH clause

B. the ROWSET_OPTIONS bulk option

C. the DATAFILETYPE bulk option

D. the DATA_SOURCE parameter


A

To read TSV files without a header row using the OPENROWSET function and to assign a name and specify the data type for each column, you should use:

A. the WITH clause

The WITH clause is used in the OPENROWSET function to define the format file or to directly define the structure of the file by specifying the column names and data types.
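A hedged sketch of the WITH clause; the storage path, column names, and types are placeholders for illustration:

~~~
-- Sketch: the WITH clause both names the columns (the TSV has no header row)
-- and overrides the types that would otherwise be inferred.
SELECT *
FROM OPENROWSET(
    BULK 'https://storage1.blob.core.windows.net/data/*.tsv',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '\t',
    PARSER_VERSION = '2.0'
)
WITH (
    [CustomerId]   int          1,
    [CustomerName] varchar(100) 2,
    [OrderDate]    date         3
) AS rows;
~~~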

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
78
Q

You have an Azure Synapse Analytics dedicated SQL pool.

You plan to create a fact table named Table1 that will contain a clustered columnstore index.

You need to optimize data compression and query performance for Table1.

What is the minimum number of rows that Table1 should contain before you create partitions?

A. 100,000
B. 600,000
C. 1 million
D. 60 million

A

A dedicated SQL pool spreads every table across 60 distributions, and for the best columnstore compression each partition needs at least 1 million rows per distribution. That gives 60 distributions × 1 million rows = 60 million rows per partition, hence option D.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
79
Q

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named DimSalesPerson. DimSalesPerson contains the following columns:

  • RepSourceID
  • SalesRepID
  • FirstName
  • LastName
  • StartDate
  • EndDate
  • Region

You are developing an Azure Synapse Analytics pipeline that includes a mapping data flow named Dataflow1. Dataflow1 will read sales team data from an external source and use a Type 2 slowly changing dimension (SCD) when loading the data into DimSalesPerson.

You need to update the last name of a salesperson in DimSalesPerson.

Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.

A. Update three columns of an existing row.
B. Update two columns of an existing row.
C. Insert an extra row.
D. Update one column of an existing row.

A

DISPUTED
Correct Answer: CD 🗳️

Selected Answer: BC
1) Insert an extra row that contains the updated last name, with the current date as its StartDate.
2) Update two columns of the existing row: set its EndDate to the current date and flag it as no longer the current row. (The dispute between CD and BC is whether closing out the old row requires updating one column or two.)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
80
Q

HOTSPOT

You plan to use an Azure Data Lake Storage Gen2 account to implement a Data Lake development environment that meets the following requirements:

  • Read and write access to data must be maintained if an availability zone becomes unavailable.
  • Data that was last modified more than two years ago must be deleted automatically.
  • Costs must be minimized.

What should you configure? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

For storage redundancy:
* Geo-zone-redundant storage (GZRS)
* Locally-redundant storage (LRS)
* Zone-redundant storage (ZRS)

For data deletion:
* A lifecycle management policy
* Soft delete
* Versioning

A

Statement 1: For Storage redundancy, you should select ZRS (Zone-redundant storage). This will maintain read and write access to data even if an availability zone becomes unavailable.

Statement 2: For data deletion, you should select A lifecycle management policy. This will allow you to automatically delete data that was last modified more than two years ago

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
81
Q

HOTSPOT
You are developing an Azure Synapse Analytics pipeline that will include a mapping data flow named Dataflow1. Dataflow1 will read customer data from an external source and use a Type 1 slowly changing dimension (SCD) when loading the data into a table named DimCustomer in an Azure Synapse Analytics dedicated SQL pool.

You need to ensure that Dataflow1 can perform the following tasks:

  • Detect whether the data of a given customer has changed in the DimCustomer table.
  • Perform an upsert to the DimCustomer table.

Which type of transformation should you use for each task? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Detect whether the data of a given customer has changed in the DimCustomer table:
* Aggregate
* Derived column
* Surrogate key

Perform an upsert to the DimCustomer table:
* Alter row
* Assert
* Cast

A

The answer is correct. Check “Exercise - Design and implement a Type 1 slowly changing dimension with mapping data flows”, there is described implementation of the dataflow mentioned in this question.

https://learn.microsoft.com/en-us/training/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/4-exercise-design-implement-type-1-dimension

In the exercise, a 'Derived column' transformation is used to add the InsertedDate and ModifiedDate columns; the ModifiedDate column can be used to detect whether the customer data has changed. For the upsert, an 'Alter row' transformation is used. The answer is definitely correct.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
82
Q

DRAG DROP

You have an Azure Synapse Analytics serverless SQL pool.

You have an Azure Data Lake Storage account named adls1 that contains a public container named container1. The container1 container contains a folder named folder1.

You need to query the top 100 rows of all the CSV files in folder1.

How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point

VALUES
* BULK
* DATA_SOURCE
* LOCATION
* OPENROWSET

SELECT TOP 100 *
FROM XXXXXXXXX (
XXXXXXXXXXX 'https://adls1.dfs.core.windows.net/container1/folder1/*.csv',
FORMAT = 'CSV') AS rows
A

OPENROWSET
BULK
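The completed query, with the two values from the answer dropped into the skeleton above:

~~~
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://adls1.dfs.core.windows.net/container1/folder1/*.csv',
    FORMAT = 'CSV') AS rows
~~~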

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
83
Q

You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.

You plan to create a database named DB1 in Pool1.

You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.

Which format should you use for the tables in DB1?

A. Parquet
B. ORC
C. JSON
D. HIVE

A

Selected Answer: A
Parquet is supported by serverless SQL pool
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-parquet-files

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
84
Q

You have an Azure Data Lake Storage Gen2 account named storage1.

You plan to implement query acceleration for storage1.

**Which two file types support query acceleration? ** Each correct answer presents a complete solution.

NOTE: Each correct selection is worth one point.

A. JSON
B. Apache Parquet
C. XML
D. CSV
E. Avro

A

Selected Answer: AD
Query acceleration supports CSV and JSON formatted data as input to each request.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
85
Q

You have an Azure subscription that contains the resources shown in the following table.

| Name     | Type                              | Description                               |
|----------|-----------------------------------|-------------------------------------------|
| storage1 | Azure Blob storage account        | Contains publicly accessible JSON files   |
| WS1      | Azure Synapse Analytics workspace | Contains a serverless SQL pool            |

You need to read the files in storage1 by using ad-hoc queries and the OPENROWSET function. The solution must ensure that each rowset contains a single JSON record.

To what should you set the FORMAT option of the OPENROWSET function?

A. JSON
B. DELTA
C. PARQUET
D. CSV


A

For JSON files we still use FORMAT = 'CSV', with 0x0b as the field terminator and quote character, so that each JSON document is returned as a single value, i.e. one rowset per JSON record.
Chat says D: CSV
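A minimal sketch of why CSV is the right FORMAT value; the storage path, the doc column name, and the $.name property are placeholders:

~~~
-- Sketch: with 0x0b as the field terminator and quote character, each JSON
-- document lands in the single doc column and can be parsed with JSON functions.
SELECT JSON_VALUE(doc, '$.name') AS name
FROM OPENROWSET(
    BULK 'https://storage1.blob.core.windows.net/container1/*.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b'
)
WITH (doc nvarchar(max)) AS rows;
~~~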

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
86
Q


HOTSPOT

You are designing an Azure Data Lake Storage Gen2 container to store data for the human resources (HR) department and the operations department at your company.

You have the following data access requirements:

  • After initial processing, the HR department data will be retained for seven years and rarely accessed.
  • The operations department data will be accessed frequently for the first six months, and then accessed once per month.

You need to design a data retention solution to meet the access requirements. The solution must minimize storage costs.

What should you include in the storage policy for each department? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

HR:
* Archive storage after one day and delete storage after 2,555 days.
* Archive storage after 2,555 days.
* Cool storage after one day.
* Cool storage after 180 days.
* Cool storage after 180 days and delete storage after 2,555 days.
* Delete after one day.
* Delete after 180 days.

Operations:
* Archive storage after one day and delete storage after 2,555 days.
* Archive storage after 2,555 days.
* Cool storage after one day.
* Cool storage after 180 days.
* Cool storage after 180 days and delete storage after 2,555 days.
* Delete after one day.
* Delete after 180 days.

A

Archive storage after one day and delete storage after 2,555 days.

Cool storage after 180 days.

The answer for HR depends on the meaning of "rarely" and the duration of "initial processing". If rarely means roughly once a year and initial processing completes within 24 hours, the answer is correct. If rarely means weekly access, the Archive tier might be the wrong choice.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
87
Q

You have an Azure subscription that contains the Azure Synapse Analytics workspaces shown in the following table.

| Name       | Primary storage account |
|------------|-------------------------|
| workspace1 | datalake1               |
| workspace2 | datalake2               |
| workspace3 | datalake1               |

Each workspace must read and write data to datalake1.

Each workspace contains an unused Apache Spark pool.

You plan to configure each Spark pool to share catalog objects that reference datalake1.

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

* The shared catalog objects can be stored in Azure Database for MySQL.     YES/NO
* For the Apache Hive Metastore of each workspace, you must configure a linked service that uses user-password authentication.    YES/NO
* The users of workspace1 must be assigned the Storage Blob Contributor role for datalake1.    YES/NO
A

The discussion was too confusing, so the GPT answer is used here:

No, No, Yes

  1. The shared catalog objects can be stored in Azure Database for MySQL.
    Azure Synapse Analytics doesn’t support storing shared catalog objects in Azure Database for MySQL. Instead, it uses an Apache Hive Metastore as a common catalog for multiple workspaces.
  2. For the Apache Hive Metastore of each workspace, you must configure a linked service that uses user-password authentication.
    Azure Synapse Analytics doesn’t require user-password authentication for the Apache Hive Metastore. It typically relies on service principals or managed identities for authentication.
  3. The users of workspace1 must be assigned the Storage Blob Contributor role for datalake1.
    For workspace1 to read and write data to datalake1, users in workspace1 need adequate permissions. The Storage Blob Contributor role provides the necessary permissions to read and write data to Azure Data Lake Storage Gen2, which is datalake1 in this case. Therefore, assigning the Storage Blob Contributor role to users in workspace1 would be appropriate.

CHAT SAYS
Provided answers are correct:
1. Yes:
Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore) compatible metastore as their catalog. When customers want to persist the Hive catalog metadata outside of the workspace, and share catalog objects with other computational engines outside of the workspace, such as HDInsight and Azure Databricks, they can connect to an external Hive Metastore. Only Azure SQL Database and Azure Database for MySQL are supported as an external Hive Metastore.

2. Yes:
And currently we only support User-Password authentication.

3. No:
Only user-password authentication is currently supported ==> Storage Blob Contributor is an Azure RBAC (role-based access control) role ==> NOT COMPATIBLE (only user-password authentication is supported).

ref.
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-external-metastore

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
88
Q

You have a data warehouse.

You need to implement a slowly changing dimension (SCD) named Product that will include three columns named ProductName, ProductColor, and ProductSize. The solution must meet the following requirements:

  • Prevent changes to the values stored in ProductName.
  • Retain only the current and the last values in ProductSize.
  • Retain all the current and previous values in ProductColor.

Which type of SCD should you implement for each column? To answer, drag the appropriate types to the correct columns. Each type may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Values
* Type 0
* Type 1
* Type 2
* Type 3

ProductName: XXXXXXXX
Color: XXXXXXXX
Size: XXXXXXXX

A

DISPUTED: go with this

ProductName - type 0, as no changes are done.
Color - type 3, as with type 3 we have one column for the current value and one for the previous so only these two are preserved.
Size - type 2, as it inserts a new row for every change, so we get all historical values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
89
Q

HOTSPOT
You are incrementally loading data into fact tables in an Azure Synapse Analytics dedicated SQL pool.

Each batch of incoming data is staged before being loaded into the fact tables.

You need to ensure that the incoming data is staged as quickly as possible.

How should you configure the staging tables? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Table distribution:
* HASH
* REPLICATE
* ROUND_ROBIN

Table structure:
* Clustered index
* Columnstore index
* Heap

A

The ROUND_ROBIN distribution distributes the data evenly across all distribution nodes in the SQL pool. This distribution type is suitable for loading data quickly into the staging tables because it minimizes the data movement during the loading process.

Use a HEAP table: Instead of creating a clustered index on the staging table, it is recommended to create a HEAP table. A HEAP table does not have a clustered index, which eliminates the need for maintaining the index and improves the data loading performance. It allows for faster insert operations.
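A hedged sketch of such a staging table; the table name and column list are assumptions:

~~~
-- Sketch: ROUND_ROBIN avoids data-movement planning during the load, and HEAP
-- avoids index maintenance, which makes the staging load as fast as possible.
CREATE TABLE dbo.StageSales
(
    [SaleId]     bigint         NOT NULL,
    [CustomerId] int            NOT NULL,
    [Amount]     decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
~~~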

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
90
Q

You have an Azure subscription that contains an Azure Synapse Analytics workspace named ws1 and an Azure Cosmos DB database account named Cosmos1. Cosmos1 contains a container named container1 and ws1 contains a serverless SQL pool.

You need to ensure that you can query the data in container1 by using the serverless SQL pool.

Which three actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. Enable Azure Synapse Link for Cosmos1.
B. Disable the analytical store for container1.
C. In ws1, create a linked service that references Cosmos1.
D. Enable the analytical store for container1.
E. Disable indexing for container1.

A

Correct Answer: ACD 🗳️
A. Enable Azure Synapse Link for Cosmos1.
C. In ws1, create a linked service that references Cosmos1.
D. Enable the analytical store for container1.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
91
Q

HOTSPOT

You have an Azure subscription that contains the resources shown in the following table.

| Name       | Type                                | Description                                |
|------------|-------------------------------------|--------------------------------------------|
| Workspace1 | Azure Synapse workspace              | Contains the Built-in serverless SQL pool   |
| Pool1      | Azure Synapse Analytics dedicated SQL pool | Deployed to Workspace1                |
| storage1   | Storage account                      | Hierarchical namespace enabled              |

The storage1 account contains a container named container1. The container1 container contains the following files.

Webdata <root folder>
Monthly <folder>
_monthly.csv
Monthly.csv
.testdata.csv
testdata.csv

In Pool1, you run the following script.

CREATE EXTERNAL DATA SOURCE Ds1
WITH
( LOCATION = 'abfss://container1@storage1.dfs.core.windows.net' ,
CREDENTIAL = credential1,
TYPE = HADOOP
);

In the Built-in serverless SQL pool, you run the following script.

CREATE EXTERNAL DATA SOURCE Ds2
WITH (
LOCATION = 'https://storage1.blob.core.windows.net/container1/Webdata/',
CREDENTIAL = credential2
);

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

An external table that uses Ds1 can read the_monthly.csv file. Yes/No

An external table that uses Ds1 can read the Monthly.csv file. Yes/No

An external table that uses Ds2 can read the .testdata.csv file. Yes/No

A

No,
Yes,
No
Files whose names begin with an underscore (_) or a period (.) are ignored by external tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
92
Q

You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named account1 and a user named User1.

In account1, you create a container named container1. In container1, you create a folder named folder1.

You need to ensure that User1 can list and read all the files in folder1. The solution must use the principle of least privilege.

How should you configure the permissions for each folder? To answer, drag the appropriate permissions to the correct folders. Each permission may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Values
* Execute
* Read
* Read and Write
* None
* Read and Execute
* Write

container1/: XXXXXXXXXXXXX

container1/folder1: XXXXXXXXXX
….

A

Box 1: Execute
Box 2: Read and Execute
Execute on container1 only allows User1 to traverse into the container, while Read and Execute on folder1 allows its contents to be listed and read, which is the least privilege required.

https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control#levels-of-permission

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
93
Q

You have an Azure Data Factory pipeline named pipeline1.

You need to execute pipeline1 at 2 AM every day. The solution must ensure that if the trigger for pipeline1 stops, the next pipeline execution will occur at 2 AM, following a restart of the trigger.

Which type of trigger should you create?

A. schedule
B. tumbling
C. storage event
D. custom event

A

GPT Says

The scenario requires a trigger that ensures the execution of the pipeline at a specific time daily and has the capability to resume or continue executions even after a trigger stoppage or restart. For this purpose:

A. Schedule

A schedule trigger in Azure Data Factory allows you to specify a recurring execution time, such as running a pipeline at 2 AM every day. Additionally, if the trigger stops due to any reason, it will resume its schedule, and the next execution will occur at the defined time (in this case, 2 AM) following a trigger restart.

Some comments in the discussion claim it might be a tumbling window trigger.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
94
Q

HOTSPOT

You have an Azure data factory named adf1 that contains a pipeline named ExecProduct. ExecProduct contains a data flow named Product.

The Product data flow contains the following transformations:

  1. WeeklyData: A source that points to a CSV file in an Azure Data Lake Storage Gen2 account with 20 columns
  2. ProductColumns: A select transformation that selects from WeeklyData six columns named ProductID, ProductDescr, ProductSubCategory, ProductCategory, ProductStatus, and ProductLastUpdated
  3. ProductRows: An aggregate transformation
  4. ProductList: A sink that outputs data to an Azure Synapse Analytics dedicated SQL pool

The Aggregate settings for ProductRows are configured as shown in the following exhibit.

See site for the image of the Aggregate settings.

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

There will be six columns in the output of ProductRows. YES/NO

There will always be one output row for each unique value of ProductDescr. YES/NO

There will always be one output row for each unique value of ProductID. YES/NO

A

GO WITH THE FOLLOWING (the discussion does not explain why):

YES
NO
YES
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-aggregate

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
95
Q

You manage an enterprise data warehouse in Azure Synapse Analytics.

Users report slow performance when they run commonly used queries. Users do not report performance changes for infrequently used queries.

You need to monitor resource utilization to determine the source of the performance issues.

Which metric should you monitor?

A. DWU limit
B. Cache hit percentage
C. Local tempdb percentage
D. Data IO percentage

A

B. Cache hit percentage should be correct, since the slowdown affects only commonly used queries, which should be served from the cache.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
96
Q

HOTSPOT

You have an Azure Synapse Analytics serverless SQL pool.

You have an Apache Parquet file that contains 10 columns.

You need to query data from the file. The solution must return only two columns.

How should you complete the query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Values
[BULK, DELTA, OPENQUERY, SINGLE_BLOB]

[(Col1 int, Col2 varchar(20)), FILEPATH(2), PARSER_VERSION = ‘2.0’, SINGLE_BLOB]

SELECT * FROM
OPENROWSET ([XXXXXXXXXXXX]N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet', FORMAT = 'PARQUET')
WITH [YYYYYYYYYYYY] as rows
A

BULK
(Col1 int, Col2 varchar(20))
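The completed query, with the two selections applied to the skeleton above:

~~~
SELECT * FROM
OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet',
    FORMAT = 'PARQUET')
WITH (Col1 int, Col2 varchar(20)) AS rows
~~~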

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
97
Q

Hard
You have an Azure Synapse Analytics workspace that contains an Apache Spark pool named SparkPool1. SparkPool1 contains a Delta Lake table named SparkTable1.

You need to recommend a solution that supports Transact-SQL queries against the data referenced by SparkTable1. The solution must ensure that the queries can use partition elimination.

What should you include in the recommendation?

A. a partitioned table in a dedicated SQL pool
B. a partitioned view in a dedicated SQL pool
C. a partitioned index in a dedicated SQL pool
D. a partitioned view in a serverless SQL pool

A

Selected Answer: D
D is correct.
“The OPENROWSET function is not supported in dedicated SQL pools in Azure Synapse.” so it eliminates A,B and C.
Ref: https://learn.microsoft.com/en-us/sql/t-sql/functions/openrowset-transact-sql?view=sql-server-ver16
Only the partitioned view in the serverless sql pool is correct since “External tables in serverless SQL pools do not support partitioning on Delta Lake format. Use Delta partitioned views instead of tables if you have partitioned Delta Lake data sets.”
Ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-external-tables#delta-tables-on-partitioned-folders

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
98
Q

HARD HARD

You are designing a sales transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will contain approximately 60 million rows per month and will be partitioned by month. The table will use a clustered column store index and round-robin distribution.

Approximately how many rows will there be for each combination of distribution and partition?

A. 1 million
B. 5 million
C. 20 million
D. 60 million

A

Correct Answer: A 🗳️
The table is partitioned by month and spread across 60 distributions, so 60 million rows per monthly partition ÷ 60 distributions = 1 million rows per combination.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
99
Q

You have an Azure Synapse Analytics workspace.

You plan to deploy a lake database by using a database template in Azure Synapse.

Which two elements are included in the template? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. relationships
B. data formats
C. linked services
D. table permissions
E. table definitions

A

Correct, AE. Only the table definitions and their relationships are included in the template; the rest of the options must be configured separately.
Ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/create-lake-database-from-lake-database-templates

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
100
Q

You are implementing a star schema in an Azure Synapse Analytics dedicated SQL pool.

You plan to create a table named DimProduct.

DimProduct must be a Type 3 slowly changing dimension (SCD) table that meets the following requirements:

  • The values in two columns named ProductKey and ProductSourceID will remain the same.
  • The values in three columns named ProductName, ProductDescription, and Color can change.

You need to add additional columns to complete the following table definition.

CREATE TABLE [dbo].[dimproduct](
[ProductKey] INT NOT NULL,
[ProductSourceID] INT NOT NULL,
[ProductName]  NVARCHAR(100) NOT NULL,
[ProductDescription] NVARCHAR(2000) NOT NULL,
[Color] NVARCHAR(50) NOT NULL
)
WITH 
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);

Which three columns should you add? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. [EffectiveStartDate] [datetime] NOT NULL
B. [EffectiveEndDate] [datetime] NOT NULL
C. [OriginalProductDescription] NVARCHAR(2000) NOT NULL
D. [IsCurrentRow] [bit] NOT NULL
E. [OriginalColor] NVARCHAR(50) NOT NULL
F. [OriginalProductName] NVARCHAR(100) NULL

A

Selected Answer: CEF
Correct. The other three options would be needed for a Type 2 SCD table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
101
Q

HOTSPOT -
You plan to create a real-time monitoring app that alerts users when a device travels more than 200 meters away from a designated location.

You need to design an Azure Stream Analytics job to process the data for the planned app. The solution must minimize the amount of code developed and the number of technologies used.

What should you include in the Stream Analytics job? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Input type:
* Stream
* Reference

Function:
* Aggregate
* Geospatial
* Windowing

A

The input type for the Stream Analytics job should be Stream, as it will be processing real-time data from devices.
The function to include in the Stream Analytics job should be Geospatial, which allows you to perform calculations on geographic data and make spatial queries, such as determining the distance between two points. This is necessary to determine if a device has traveled more than 200 meters away from a designated location.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
102
Q

HARD
A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an Azure Stream
Analytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU).

You need to optimize performance for the Azure Stream Analytics job.

Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Implement event ordering.
B. Implement Azure Stream Analytics user-defined functions (UDF).
C. Implement query parallelization by partitioning the data output.
D. Scale the SU count for the job up.
E. Scale the SU count for the job down.
F. Implement query parallelization by partitioning the data input.

A

Disputed 50/50
Correct Answer: DF 🗳️
D: Scale the SU count for the job up.
F: Scale out the query by allowing the system to process each input partition separately. (A Stream Analytics job definition includes inputs, a query, and output; inputs are where the job reads the data stream from.)
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization

HOWEVER
The discussion has the most likes (61) for CF; go with this:
C: Implement query parallelization by partitioning the data output.
F: Implement query parallelization by partitioning the data input.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
103
Q

You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container.

Which resource provider should you enable?
A. Microsoft.Sql
B. Microsoft.Automation
C. Microsoft.EventGrid
D. Microsoft.EventHub

A

Correct Answer: C 🗳️
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the arrival or deletion of a file in Azure
Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
104
Q

You plan to perform batch processing in Azure Databricks once daily.
Which type of Databricks cluster should you use?
A. High Concurrency
B. automated
C. interactive

A

Correct Answer: B 🗳️
Automated Databricks clusters are the best for jobs and automated batch processing.
Note: Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.
Example: Scheduled batch workloads (data engineers running ETL jobs)
This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform.
The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so on) due to an existing workload (noisy neighbor) on a shared cluster.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/clusters/create https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html#scenario-3-scheduled-batch-workloads-data-engineers-running-etl-jobs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
105
Q

HOTSPOT -
You are processing streaming data from vehicles that pass through a toll booth.
You need to use Azure Stream Analytics to return the license plate, vehicle make, and hour the last vehicle passed during each 10-minute window.

How should you complete the query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

WITH LastInWindow AS
(
    SELECT
        <COUNT / MAX / MIN / TOPONE>(Time) AS LastEventTime
    FROM
        Input TIMESTAMP BY Time
    GROUP BY
        <HoppingWindow / SessionWindow / SlidingWindow / TumblingWindow>(minute, 10)
)

SELECT
    Input.License_plate,
    Input.Make,
    Input.Time
FROM
    Input TIMESTAMP BY Time
    INNER JOIN LastInWindow
    ON  <DATEADD / DATEDIFF / DATENAME / DATEPART>(minute, Input, LastInWindow) BETWEEN 0 AND 10
AND Input.Time = LastInWindow.LastEventTime
A

100% correct
Box 1: MAX -
The first step on the query finds the maximum time stamp in 10-minute windows, that is the time stamp of the last event for that window. The second step joins the results of the first query with the original stream to find the event that match the last time stamps in each window.

Box 2: TumblingWindow -
Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.

Box 3: DATEDIFF -
DATEDIFF is a date-specific function that compares and returns the time difference between two DateTime fields, for more information, refer to date functions.
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
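The completed query, with the three selections (MAX, TumblingWindow, DATEDIFF) applied to the skeleton in the question:

~~~
WITH LastInWindow AS
(
    SELECT MAX(Time) AS LastEventTime
    FROM Input TIMESTAMP BY Time
    GROUP BY TumblingWindow(minute, 10)
)
SELECT
    Input.License_plate,
    Input.Make,
    Input.Time
FROM
    Input TIMESTAMP BY Time
    INNER JOIN LastInWindow
        ON DATEDIFF(minute, Input, LastInWindow) BETWEEN 0 AND 10
        AND Input.Time = LastInWindow.LastEventTime
~~~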

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
106
Q

hard link
You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2.
Pipeline1 has the activities shown in the following exhibit.
Pipeline1: a Stored procedure activity named Stored procedure1, connected to a Set variable activity named Set variable1.

Pipeline2 has the activities shown in the following exhibit.
Pipeline2: an Execute pipeline activity named Execute pipeline1, connected to a Set variable activity named Set variable1.

You execute Pipeline2, and Stored procedure1 in Pipeline1 fails.
What is the status of the pipeline runs?

A. Pipeline1 and Pipeline2 succeeded.
B. Pipeline1 and Pipeline2 failed.
C. Pipeline1 succeeded and Pipeline2 failed.
D. Pipeline1 failed and Pipeline2 succeeded.


A

Correct Answer: A 🗳️
Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed.
Consider Pipeline1:

If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.

The failure dependency means this pipeline reports success.

Note:
If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.
Reference:
https://datasavvy.me/category/azure-data-factory/

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
107
Q

HOTSPOT -
A company plans to use Platform-as-a-Service (PaaS) to create the new data pipeline process. The process must meet the following requirements:
Ingest:
✑ Access multiple data sources.
✑ Provide the ability to orchestrate workflow.
✑ Provide the capability to run SQL Server Integration Services packages.
Store:
✑ Optimize storage for big data workloads.
✑ Provide encryption of data at rest.
✑ Operate with no size limits.
Prepare and Train:
✑ Provide a fully-managed and interactive workspace for exploration and visualization.
✑ Provide the ability to program in R, SQL, Python, Scala, and Java.
Provide seamless user authentication with Azure Active Directory.

Model & Serve:
✑ Implement native columnar storage.
✑ Support for the SQL language
✑ Provide support for structured streaming.
You need to build the data integration pipeline.

Which technologies should you use? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Ingest
- Logic Apps
- Azure Data Factory
- Azure Automation

Store
- Azure Data Lake Storage
- Azure Blob storage
- Azure files

Prepare and Train
- HDInsight Apache Spark cluster
- Azure Databricks
- HDInsight Apache Storm cluster

Model and Serve
- HDInsight Apache Kafka cluster
- Azure Synapse Analytics
- Azure Data Lake Storage

A

Ingest: Azure Data Factory -
Azure Data Factory pipelines can execute SSIS packages.
In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow, and data movement: Azure Data Factory, Oozie on HDInsight, and SQL Server Integration Services (SSIS).

Store: Data Lake Storage -
Data Lake Storage Gen1 provides unlimited storage.
Note: Data at rest includes information that resides in persistent storage on physical media, in any digital format. Microsoft Azure offers a variety of data storage solutions to meet different needs, including file, disk, blob, and table storage. Microsoft also provides encryption to protect Azure SQL Database, Azure Cosmos
DB, and Azure Data Lake.

Prepare and Train: Azure Databricks
Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration.
With Azure Databricks, you can set up your Apache Spark environment in minutes, autoscale and collaborate on shared projects in an interactive workspace.
Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and scikit-learn.

Model and Serve: Azure Synapse Analytics
Azure Synapse Analytics/ SQL Data Warehouse stores data into relational tables with columnar storage.
Azure SQL Data Warehouse connector now offers efficient and scalable structured streaming write support for SQL Data Warehouse. Access SQL Data
Warehouse from Azure Databricks using the SQL Data Warehouse connector.
Note: As of November 2019, Azure SQL Data Warehouse is now Azure Synapse Analytics.
Reference:
https://docs.microsoft.com/bs-latn-ba/azure/architecture/data-guide/technology-choices/pipeline-orchestration-data-movement https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
108
Q

DRAG DROP -
You have the following table named Employees.

You need to calculate the employee_type value based on the hire_date value.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.
Select and Place:
Values
* CASE
* ELSE
* OVER
* PARTITION BY
* ROW_NUMBER

SELECT
      *,
      XXXXXXXXXX 
            WHEN hire_date >= '2019-01-01' THEN 'New'
            XXXXXXXXXX 'Standard'
      END AS employee_type
FROM
      employees
A

Box 1: CASE -
CASE evaluates a list of conditions and returns one of multiple possible result expressions.
CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT, UPDATE,
DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING.
Syntax (simple CASE expression):
CASE input_expression
    WHEN when_expression THEN result_expression [ ...n ]
    [ ELSE else_result_expression ]
END

Box 2: ELSE -
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/case-transact-sql
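The completed statement, with CASE and ELSE dropped into the skeleton from the question:

~~~
SELECT
    *,
    CASE
        WHEN hire_date >= '2019-01-01' THEN 'New'
        ELSE 'Standard'
    END AS employee_type
FROM
    employees;
~~~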

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
109
Q

HARD HARD
DRAG DROP -
You have an Azure Synapse Analytics workspace named WS1.
You have an Azure Data Lake Storage Gen2 container that contains JSON-formatted files in the following format.

{
    "id": "66532691-ab20-11ea-8b1d-936b3ec64e54",
    "context": {
        "data": {
            "eventTime": "2020-06-10T13:43:34.5532",
            "samplingRate": "100.0",
            "isSynthetic": "false"
        },
        "session": {
            "isFirst": "false",
            "id": "38619c14-7a23-4687-8268-95862c5326b1"
        },
        "custom": {
            "dimensions": [
                {
                    "customerInfo": {
                        "ProfileType": "ExpertUser",
                        "RoomName": "",
                        "CustomerName": "diamond",
                        "UserName": "XXXX@yahoo.com"
                    }
                },
                {
                    "customerInfo": {
                        "ProfileType": "Novice",
                        "RoomName": "",
                        "CustomerName": "topaz",
                        "UserName": "xxxx@outlook.com"
                    }
                }
            ]
        }
    }
}

You need to use the serverless SQL pool in WS1 to read the files.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:

Values
* opendatasource
* openjson
* openquery
* openrowset

SELECT *
FROM  [XXXXXXXXXX](
    BULK 'https://contoso.blob.core.windows.net/contosodw',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
) AS q
WITH (
    id VARCHAR(50),
    contextdateventTime VARCHAR(50) '$.context.data.eventTime',
    contextdatasamplingRate VARCHAR(50) '$.context.data.samplingRate',
    contextdataisSynthetic VARCHAR(50) '$.context.data.isSynthetic',
    contextsessionisFirst VARCHAR(50) '$.context.session.isFirst',
    contextsession VARCHAR(50) '$.context.session.id',
    contextcustomdimensions VARCHAR(MAX) '$.context.custom.dimensions'
) AS q1
CROSS APPLY [XXXXXXXXXX] (contextcustomdimensions)
WITH (
    ProfileType VARCHAR(50) '$.customerInfo.ProfileType',
    RoomName VARCHAR(50) '$.customerInfo.RoomName',
    CustomerName VARCHAR(50) '$.customerInfo.CustomerName',
    UserName VARCHAR(50) '$.customerInfo.UserName'
) AS q2;
A

Box 1: openrowset -
The easiest way to see to the content of your CSV file is to provide file URL to OPENROWSET function, specify csv FORMAT.

Box 2: openjson -
You can access your JSON files from the Azure File Storage share by using the mapped drive, as shown in the following example:
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file https://docs.microsoft.com/en-us/sql/relational-databases/json/import-json-documents-into-sql-server

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
110
Q

DRAG DROP -
You have an Apache Spark DataFrame named temperatures. A sample of the data is shown in the following table.
go to site for img
You need to produce the following table by using a Spark SQL query.

How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all.
You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:

Values
- CAST
- COLLATE
- CONVERT
- FLATTEN
- PIVOT
- UNPIVOT

SELECT * FROM (
SELECT YEAR (Date) Year, MONTH (Date) Month, Temp
FROM temperatures
WHERE date BETWEEN DATE '2019-01-01' AND DATE '2021-08-31'
)
 [XXXXXXXXXX] (
AVG (   [XXXXXXXXXX] (Temp AS DECIMAL(4, 1)))

FOR Month in (
1 JAN, 2 FEB, 3 MAR, 4 APR, 5 MAY, 6 JUN,
7 JUL, 8 AUG, 9 SEP, 10 OCT, 11 NOV, 12 DEC
)
)
ORDER BY Year ASC
A

Box 1: PIVOT -
PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output. And PIVOT runs aggregations where they’re required on any remaining column values that are wanted in the final output.
Incorrect Answers:
UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.

Box 2: CAST -
If you want to convert an integer value to a DECIMAL data type in SQL Server use the CAST() function.
Example:

SELECT CAST(12 AS DECIMAL(7,2)) AS decimal_value;
Here is the result:
decimal_value
12.00
Reference:
https://learnsql.com/cookbook/how-to-convert-an-integer-to-a-decimal-in-sql-server/ https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot
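The completed Spark SQL query, with PIVOT and CAST applied to the skeleton in the question:

~~~
SELECT * FROM (
    SELECT YEAR(Date) Year, MONTH(Date) Month, Temp
    FROM temperatures
    WHERE Date BETWEEN DATE '2019-01-01' AND DATE '2021-08-31'
)
PIVOT (
    AVG(CAST(Temp AS DECIMAL(4, 1)))
    FOR Month IN (
        1 JAN, 2 FEB, 3 MAR, 4 APR, 5 MAY, 6 JUN,
        7 JUL, 8 AUG, 9 SEP, 10 OCT, 11 NOV, 12 DEC
    )
)
ORDER BY Year ASC
~~~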

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
111
Q

You have an Azure Data Factory that contains 10 pipelines.
You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering when using the monitoring experience in Data Factory.

What should you add to each pipeline?
A. a resource tag
B. a correlation ID
C. a run group ID
D. an annotation

A

Correct Answer: D 🗳️
Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By adding annotations, you can easily filter and search for specific factory resources.
Reference:
https://www.cathrinewilhelmsen.net/annotations-user-properties-azure-data-factory/

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
112
Q

HOTSPOT -
The following code segment is used to create an Azure Databricks cluster.

{
    "num_workers": null,
    "autoscale": {
        "min workers": 2,
        "max_workers": 8
    },
    "cluster_name": "MyCluster",
    "spark_version": "latest-stable-scala2.11",
    "spark_conf": {
        "spark.databricks.cluster.profile": "serverless",
        "spark.databricks.repl.allowedLanguages": "sql,python,r"
    },
    "node_type_id": "Standard_DS13_v2",
    "ssh public_keys": [],
    "custom_tags": {
        "ResourceClass": "Serverless"
    },
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "autotermination_minutes": 90,
    "enable_elastic_disk": true,
    "init_scripts": []
}

For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
Hot Area:

The Databricks cluster supports multiple concurrent users.        YES/NO

The Databricks cluster minimizes costs when running scheduled jobs that execute notebooks.        YES/NO

The Databricks cluster supports the creation of a Delta Lake table.        YES/NO
A

1. Yes
The spark.databricks.cluster.profile setting of "serverless" (with sql, python, and r as the allowed REPL languages) corresponds to a High Concurrency cluster, which supports multiple concurrent users.
ref: https://adatis.co.uk/databricks-cluster-sizing/

2. NO
recommended: New Job Cluster.
When you run a job on a new cluster, the job is treated as a data engineering (job) workload subject to the job workload pricing. When you run a job on an existing cluster, the job is treated as a data analytics (all-purpose) workload subject to all-purpose workload pricing.
ref: https://docs.microsoft.com/en-us/azure/databricks/jobs
Scheduled batch workload- Launch new cluster via job
ref: https://docs.databricks.com/administration-guide/capacity-planning/cmbp.html#plan-capacity-and-control-cost

3.YES
Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns.
ref: https://docs.databricks.com/delta/index.html

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
113
Q

You are designing a statistical analysis solution that will use custom proprietary Python functions on near real-time data from Azure Event Hubs.
You need to recommend which Azure service to use to perform the statistical analysis. The solution must minimize latency.
What should you recommend?
A. Azure Synapse Analytics
B. Azure Databricks
C. Azure Stream Analytics
D. Azure SQL Database

A

My answer will be B
Stream Analytics supports “extending SQL language with JavaScript and C# user-defined functions (UDFs)”. There is no mention of Python support; hence Stream Analytics is not correct.
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction

Azure Databricks supports near real-time data from Azure Event Hubs. And includes support for R, SQL, Python, Scala, and Java. So I will go for option B.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
114
Q

HOTSPOT -
You have an enterprise data warehouse in Azure Synapse Analytics that contains a table named FactOnlineSales. The table contains data from the start of 2009 to the end of 2012.

You need to improve the performance of queries against FactOnlineSales by using table partitions. The solution must meet the following requirements:
✑ Create four partitions based on the order date.
✑ Ensure that each partition contains all the orders placed during a given calendar year.

How should you complete the T-SQL command? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

CREATE TABLE [dbo] . FactOnlineSales
([OnlineSalesKey] [int] NOT NULL,
[OrderDateKey] [datetime] NOT NULL,
[StoreKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar] (20) NOT NULL,
[SalesQuantity] [int] NOT NULL,
[SalesAmount] [money] NOT NULL,
[UnitPrice] [money] NULL)
WITH (CLUSTERED COLUMNSTORE INDEX)
PARTITION ([OrderDateKey] RANGE [RIGHT / LEFT] FOR VALUES
( [ 20090101,20121231 / 20100101,20110101,20120101 / 20090101,20100101,20110101,20120101 ])

A

RANGE LEFT and RANGE RIGHT both create four partitions, but they differ in how the boundary values are compared.
For example, with RANGE LEFT and the values 20100101, 20110101, 20120101, the partitions are:
datecol <= 20100101; datecol > 20100101 and <= 20110101; datecol > 20110101 and <= 20120101; datecol > 20120101.
With RANGE RIGHT and the same values, the partitions are:
datecol < 20100101; datecol >= 20100101 and < 20110101; datecol >= 20110101 and < 20120101; datecol >= 20120101.
RANGE RIGHT is therefore the choice that aligns each partition to a calendar year (January 1 through December 31).
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15
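Written out, the selections complete the partition clause of the skeleton above as follows (RANGE RIGHT with the three boundary values yields four calendar-year partitions for 2009 through 2012):

~~~
PARTITION ([OrderDateKey] RANGE RIGHT FOR VALUES
    (20100101, 20110101, 20120101))
~~~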

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
115
Q

You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.

(see site for the table definition)

Which two columns should you add to the table? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveStartDate] [datetime] NOT NULL,
B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
C. [EffectiveEndDate] [datetime] NULL,
D. [ProductCategory] [nvarchar] (100) NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,

A

Correct Answer: BE 🗳️
A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members.

Reference:
https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
116
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You are designing an Azure Stream Analytics solution that will analyze Twitter data.

You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.

Solution: You use a hopping window that uses a hop size of 10 seconds and a window size of 10 seconds.
Does this meet the goal?
A. Yes
B. No

A

The majority believe the answer should be Yes: a hopping window whose hop size equals its window size behaves the same as a tumbling window.
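A hedged Stream Analytics sketch (the input name TwitterStream and the CreatedAt field are assumptions) showing why this works:

~~~
-- HoppingWindow(timeunit, windowsize, hopsize): with hopsize equal to windowsize the
-- windows do not overlap, so this behaves like TumblingWindow and each tweet is counted once.
SELECT COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY HoppingWindow(second, 10, 10)
~~~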

117
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You are designing an Azure Stream Analytics solution that will analyze Twitter data.

You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.

Solution: You use a hopping window that uses a hop size of 5 seconds and a window size 10 seconds.
Does this meet the goal?
A. Yes
B. No

A

Correct Answer: B 🗳️
Instead use a tumbling window. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.

If the hop size were equal to the window size this would be true, but because the hop size is smaller here, each tweet can be counted more than once and the windows overlap with each other.

Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

118
Q

HOTSPOT -
You are building an Azure Stream Analytics job to identify how much time a user spends interacting with a feature on a webpage.

The job receives events based on user actions on the webpage. Each row of data represents an event. Each event has a type of either ‘start’ or ‘end’.

You need to calculate the duration between start and end events.

How should you complete the query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

SELECT
    [user],
    feature,
    [DATEADD (  /  DATEDIFF (  /  DATEPART ( ]
      second,
      [ISFIRST /  LAST /  TOPONE ]   (Time) OVER (PARTITION BY [user], feature LIMIT DURATION (hour, 1) WHEN Event = 'start'),
     Time) as duration
FROM input TIMESTAMP BY Time
WHERE
     Event = 'end'
A

Box 1: DATEDIFF -
DATEDIFF function returns the count (as a signed integer value) of the specified datepart boundaries crossed between the specified startdate and enddate.
Syntax: DATEDIFF ( datepart , startdate, enddate )

Box 2: LAST -
The LAST function can be used to retrieve the last event within a specific condition. In this example, the condition is an event of type Start, partitioning the search by PARTITION BY user and feature. This way, every user and feature is treated independently when searching for the Start event. LIMIT DURATION limits the search back in time to 1 hour between the End and Start events.
Example:

SELECT
    [user],
    feature,
    DATEDIFF(
        second,
        LAST(Time) OVER (PARTITION BY [user], feature LIMIT DURATION(hour, 1) WHEN Event = 'start'),
        Time) AS duration
FROM input TIMESTAMP BY Time
WHERE Event = 'end'
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-stream-analytics-query-patterns

119
Q

You are creating an Azure Data Factory data flow that will ingest data from a CSV file, cast columns to specified types of data, and insert the data into a table in an Azure Synapse Analytics dedicated SQL pool. The CSV file contains three columns named username, comment, and date.

The data flow already contains the following:
✑ A source transformation.
✑ A Derived Column transformation to set the appropriate types of data.
✑ A sink transformation to land the data in the pool.

You need to ensure that the data flow meets the following requirements:
✑ All valid rows must be written to the destination table.
✑ Truncation errors in the comment column must be avoided proactively.
✑ Any rows containing comment values that will cause truncation errors upon insert must be written to a file in blob storage.

Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.

A. To the data flow, add a sink transformation to write the rows to a file in blob storage.

B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.

C. To the data flow, add a filter transformation to filter out rows that will cause truncation errors.

D. Add a select transformation to select only the rows that will cause truncation errors.

A

Correct Answer: AB 🗳️

A. To the data flow, add a sink transformation to write the rows to a file in blob storage.

This action ensures that the rows causing truncation errors, identified by the Conditional Split, are written to a file in blob storage. This meets the requirement of storing rows that would otherwise cause truncation errors upon insertion.

B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.

The Conditional Split helps identify rows that may cause truncation errors based on specified conditions (in this case, the comment column). This separation allows handling these problematic rows separately.

Example:
1. (B) A Conditional Split transformation defines the maximum length of "title" to be five. Any row that is less than or equal to five goes into the GoodRows stream; any row that is larger than five goes into the BadRows stream.
2. (A) Now we need to log the rows that failed. Add a sink transformation to the BadRows stream for logging. Here, we "auto-map" all of the fields so that we have logging of the complete transaction record. This is a text-delimited CSV file output to a single file in Blob Storage, named "badrows.csv".
3. The completed data flow can now split off error rows to avoid the SQL truncation errors and put those entries into a log file, while successful rows continue to write to the target database.

Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-data-flow-error-rows

120
Q

DRAG DROP -
You need to create an Azure Data Factory pipeline to process data for the following three departments at your company: Ecommerce, retail, and wholesale. The solution must ensure that data can also be processed for the entire company.

How should you complete the Data Factory data flow script? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.
Select and Place:

Values:
* all, ecommerce, retail, wholesale
* dept == ‘ecommerce’, dept == ‘retail’, dept == ‘wholesale’
* dept – ‘ecommerce’, dept – ‘wholesale’, dept == ‘retail’
* disjoint: false
* disjoint: true
* ecommerce, retail, wholesale, all

CleanData
split (
 [XXXXXXXXXX]
  [XXXXXXXXXX]
) ~> SplitByDept@ ( [XXXXXXXXXX])
A

The conditional split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language. The transformation evaluates expressions, and based on the results, directs the data row to the specified stream.
Box 1: dept==’ecommerce’, dept==’retail’, dept==’wholesale’
First we put the condition. The order must match the stream labeling we define in Box 3.

Box 2: THIS IS DISPUTED
The majority say disjoint: true; I think disjoint: false, as the arguments and sources for it are more compelling.
disjoint is false because each row goes to the first matching condition, and all remaining rows go to the last output stream, all.

Box 3: ecommerce, retail, wholesale, all

Label the streams -
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split

121
Q

TOUGH QUESTION
DRAG DROP -
You have an Azure Data Lake Storage Gen2 account that contains a JSON file for customers. The file contains two attributes named FirstName and LastName.

You need to copy the data from the JSON file to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values.

You create the following components:
✑ A destination table in Azure Synapse
✑ An Azure Blob storage container
✑ A service principal

Which five actions should you perform in sequence next in the Databricks notebook? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

Select and Place:

Actions
1. Mount the Data Lake Storage onto DBFS.
2. Write the results to a table in Azure Synapse.
3. Perform transformations on the file.
4. Specify a temporary folder to stage the data.
5. Write the results to Data Lake Storage.
6. Read the file into a data frame.
7. Drop the data frame.
8. Perform transformations on the data frame.

A

Disputed; needs more research. Go with this if nothing else:
To accomplish the task in an Azure Databricks notebook, the logical sequence of actions would be:

Step 1. Mount the Data Lake Storage onto DBFS: This allows access to the JSON file stored in Azure Data Lake Storage using the Databricks File System.

Step 2. Read the file into a data frame: Use Spark to read the JSON file into a DataFrame for processing.

Step 3. Perform transformations on the data frame: Apply transformations to concatenate the FirstName and LastName fields to create a new column.

Step 4. Specify a temporary folder to stage the data: Before writing the data to Azure Synapse, it is a common practice to stage it in a temporary folder.

Step 5. Write the results to a table in Azure Synapse: Finally, write the transformed DataFrame to the destination table in Azure Synapse Analytics.

These steps would ensure the JSON file data is properly transformed and loaded into Azure Synapse Analytics for further use.

122
Q

HOTSPOT -
You build an Azure Data Factory pipeline to move data from an Azure Data Lake Storage Gen2 container to a database in an Azure Synapse Analytics dedicated
SQL pool.

Data in the container is stored in the following folder structure.
/in/{YYYY}/{MM}/{DD}/{HH}/{mm}

The earliest folder is /in/2021/01/01/00/00. The latest folder is /in/2021/01/15/01/45.
You need to configure a pipeline trigger to meet the following requirements:
✑ Existing data must be loaded.
✑ Data must be loaded every 30 minutes.
✑ Late-arriving data of up to two minutes must be included in the load for the time at which the data should have arrived.

How should you configure the pipeline trigger? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Type:
* Event
* On-demand
* Schedule
* Tumbling window

Additional properties:
* Prefix: /in/, Event: Blob created
* Recurrence: 30 minutes, Start time: 2021-01-01T00:00
* Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes
* Recurrence: 32 minutes, Start time: 2021-01-15T01:45

A

Box 1: Tumbling window -
To be able to use the Delay parameter we select Tumbling window.
Box 2: Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes (not the 32-minute option).

The amount of time to delay the start of data processing for the window. The pipeline run is started after the expected execution time plus the amount of delay.
The delay defines how long the trigger waits past the due time before triggering a new run. The delay doesn’t alter the window startTime.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger

123
Q

HOTSPOT -
You are designing a near real-time dashboard solution that will visualize streaming data from remote sensors that connect to the internet. The streaming data must be aggregated to show the average value of each 10-second interval. The data will be discarded after being displayed in the dashboard.

The solution will use Azure Stream Analytics and must meet the following requirements:
✑ Minimize latency from an Azure Event hub to the dashboard.
✑ Minimize the required storage.
✑ Minimize development effort.

What should you include in the solution? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point
Hot Area:

Azure Stream Analytics input type:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI

Azure Stream Analytics output type:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI

Aggregation query location:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI

A

Input type: Azure Event Hub
Output type: Microsoft Power BI
Aggregation query location: Azure Stream Analytics

124
Q

hard hard question
DRAG DROP -
You have an Azure Stream Analytics job that is a Stream Analytics project solution in Microsoft Visual Studio. The job accepts data generated by IoT devices in the JSON format.

You need to modify the job to accept data generated by the IoT devices in the Protobuf format.
Which three actions should you perform from Visual Studio in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

Select and Place:

Actions

  • Change the Event Serialization Format to Protobuf in the input.json file of the job and reference the DLL.
  • Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution.
  • Add .NET deserializer code for Protobuf to the custom deserializer project.
  • Add .NET deserializer code for Protobuf to the Stream Analytics project.
  • Add an Azure Stream Analytics Application project to the solution.
A

Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution.

Add .NET deserializer code for Protobuf to the custom deserializer project.

The popular belief in chat is that this comes next:
Change the Event Serialization Format to Protobuf in the input.json file of the job and reference the DLL.

125
Q

You have an Azure Storage account and a data warehouse in Azure Synapse Analytics in the UK South region.

You need to copy blob data from the storage account to the data warehouse by using Azure Data Factory. The solution must meet the following requirements:
✑ Ensure that the data remains in the UK South region at all times.
✑ Minimize administrative effort.

Which type of integration runtime should you use?
A. Azure integration runtime
B. Azure-SSIS integration runtime
C. Self-hosted integration runtime

A

Correct Answer: A 🗳️

Incorrect Answers:
C: A self-hosted integration runtime is used for on-premises data sources.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime

126
Q

HOTSPOT -
You have an Azure SQL database named Database1 and two Azure event hubs named HubA and HubB. The data consumed from each source is shown in the following table.

view table here

You need to implement Azure Stream Analytics to calculate the average fare per mile by driver.

How should you configure the Stream Analytics input for each source? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

HubA:
* Stream
* Reference

HubB:
* Stream
* Reference

Database1:
* Stream
* Reference

A

HubA: Stream -

HubB: Stream -

Database1: Reference -

Reference data (also known as a lookup table) is a finite data set that is static or slowly changing in nature, used to perform a lookup or to augment your data streams. For example, in an IoT scenario, you could store metadata about sensors (which don’t change often) in reference data and join it with real time IoT data streams. Azure Stream Analytics loads reference data in memory to achieve low latency stream processing
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data

127
Q

You have an Azure Stream Analytics job that receives clickstream data from an Azure event hub.
You need to define a query in the Stream Analytics job. The query must meet the following requirements:
✑ Count the number of clicks within each 10-second window based on the country of a visitor.
✑ Ensure that each click is NOT counted more than once.

How should you define the Query?
~~~
A. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SlidingWindow(second, 10)
B. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10)
C. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, HoppingWindow(second, 10, 2)
D. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SessionWindow(second, 5, 10)
~~~

A

Correct Answer: B 🗳️
Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, such as the example below. The key differentiators of a Tumbling window are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window.
Incorrect Answers:
A: Sliding windows, unlike Tumbling or Hopping windows, output events only for points in time when the content of the window actually changes. In other words, when an event enters or exits the window. Every window has at least one event, like in the case of Hopping windows, events can belong to more than one sliding window.
C: Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap, so events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size.
D: Session windows group events that arrive at similar times, filtering out periods of time where there is no data.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

128
Q

Hard
HOTSPOT -
You are building an Azure Analytics query that will receive input data from Azure IoT Hub and write the results to Azure Blob storage.

You need to calculate the difference in the number of readings per sensor per hour.

How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

SELECT sensorId,
         growth = reading -
                      [ LAG / LAST / LEAD ]   (reading) OVER (PARTITION BY sensorId [ LIMIT DURATION / OFFSET / WHEN ] (hour, 1) )
FROM input
A

Box 1: LAG -
The LAG analytic operator allows one to look up a "previous" event in an event stream, within certain constraints. It is very useful for computing the rate of growth of a variable, detecting when a variable crosses a threshold, or when a condition starts or stops being true.

Box 2: LIMIT DURATION -
Example: Compute the rate of growth, per sensor:

SELECT sensorId,
       growth = reading - LAG(reading) OVER (PARTITION BY sensorId LIMIT DURATION(hour, 1))
FROM input
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/lag-azure-stream-analytics

129
Q

You need to schedule an Azure Data Factory pipeline to execute when a new file arrives in an Azure Data Lake Storage Gen2 container.

Which type of trigger should you use?

A. on-demand
B. tumbling window
C. schedule
D. event

A

Correct Answer: D 🗳️
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger

130
Q

You have two Azure Data Factory instances named ADFdev and ADFprod. ADFdev connects to an Azure DevOps Git repository.

You publish changes from the main branch of the Git repository to ADFdev.

You need to deploy the artifacts from ADFdev to ADFprod.

What should you do first?
A. From ADFdev, modify the Git configuration.
B. From ADFdev, create a linked service.
C. From Azure DevOps, create a release pipeline.
D. From Azure DevOps, update the main branch.

A

Correct Answer: C 🗳️
In Azure Data Factory, continuous integration and delivery (CI/CD) means moving Data Factory pipelines from one environment (development, test, production) to another.
Note: The following is a guide for setting up an Azure Pipelines release that automates the deployment of a data factory to multiple environments.
1. In Azure DevOps, open the project that’s configured with your data factory.
2. On the left side of the page, select Pipelines, and then select Releases.
3. Select New pipeline, or, if you have existing pipelines, select New and then New release pipeline.
4. In the Stage name box, enter the name of your environment.
5. Select Add artifact, and then select the git repository configured with your development data factory. Select the publish branch of the repository for the Default branch. By default, this publish branch is adf_publish.
6. Select the Empty job template.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment

131
Q

You are developing a solution that will stream to Azure Stream Analytics. The solution will have both streaming data and reference data.

Which input type should you use for the reference data?

A. Azure Cosmos DB
B. Azure Blob storage
C. Azure IoT Hub
D. Azure Event Hubs

A

Correct Answer: B 🗳️
Stream Analytics supports Azure Blob storage and Azure SQL Database as the storage layer for Reference Data.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data

132
Q

You are designing an Azure Stream Analytics job to process incoming events from sensors in retail environments.

You need to process the events to produce a running average of shopper counts during the previous 15 minutes, calculated at five-minute intervals.

Which type of window should you use?
A. snapshot
B. tumbling
C. hopping
D. sliding

A

Correct Answer: C 🗳️
Unlike tumbling windows, hopping windows model scheduled overlapping windows. A hopping window specification consists of three parameters: the timeunit, the windowsize (how long each window lasts), and the hopsize (by how much each window moves forward relative to the previous one).
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/hopping-window-azure-stream-analytics

133
Q

HOTSPOT -
You are designing a monitoring solution for a fleet of 500 vehicles. Each vehicle has a GPS tracking device that sends data to an Azure event hub once per minute.

You have a CSV file in an Azure Data Lake Storage Gen2 container. The file maintains the expected geographical area in which each vehicle should be.

You need to ensure that when a GPS position is outside the expected area, a message is added to another event hub for processing within 30 seconds. The solution must minimize cost.

What should you include in the solution? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Service:
* An Azure Synapse Analytics Apache Spark pool
* An Azure Synapse Analytics serverless SQL pool
* Azure Data Factory
* Azure Stream Analytics

Window:
* Hopping
* No window
* Session
* Tumbling

Analysis type:
* Event pattern matching
* Lagged record comparison
* Point within polygon
* Polygon overlap

A

Box 1: Azure Stream Analytics -

Box 2 is disputed
Box 2: No Window -
The “No window” option is chosen to ensure that Stream Analytics processes each event individually without aggregating them into windows. This setup allows for immediate processing of each GPS position.

Box 3: Point within polygon -
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

134
Q

You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day.

You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times.

What should you include in the solution?
A. Partition by DateTime fields.
B. Sink to Azure Queue storage.
C. Include a watermark column.
D. Use a JSON format for physical data storage.

A

Correct Answer: B 🗳️
The Databricks ABS-AQS connector uses Azure Queue Storage (AQS) to provide an optimized file source that lets you find new files written to an Azure Blob storage (ABS) container without repeatedly listing all of the files. This provides two major advantages:
✑ Lower latency: no need to list nested directory structures on ABS, which is slow and resource intensive.
✑ Lower costs: no more costly LIST API requests made to ABS.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/aqs

135
Q

Hard
HOTSPOT -
You have a self-hosted integration runtime in Azure Data Factory.

The current status of the integration runtime has the following configurations:
✑ Status: Running
✑ Type: Self-Hosted
✑ Version: 4.4.7292.1
✑ Running / Registered Node(s): 1/1
✑ High Availability Enabled: False
✑ Linked Count: 0
✑ Queue Length: 0
✑ Average Queue Duration: 0.00s

The integration runtime has the following node details:
✑ Name: X-M
✑ Status: Running
✑ Version: 4.4.7292.1
✑ Available Memory: 7697MB
✑ CPU Utilization: 6%
✑ Network (In/Out): 1.21KBps/0.83KBps
✑ Concurrent Jobs (Running/Limit): 2/14
✑ Role: Dispatcher/Worker
✑ Credential Status: In Sync

Use the drop-down menus to select the answer choice that completes each statement based on the information presented.

NOTE: Each correct selection is worth one point.
Hot Area:

If the X-M node becomes unavailable, all executed pipelines will:
* fail until the node comes back online
* switch to another integration runtime
* exceed the CPU limit

The number of concurrent jobs and the CPU usage indicate that the Concurrent Jobs (Running/Limit) value should be:
* raised
* lowered
* left as is

A

Disputed
Box 1: fail until the node comes back online
We see: High Availability Enabled: False
Note: Higher availability of the self-hosted integration runtime so that it’s no longer the single point of failure in your big data solution or cloud data integration with
Data Factory.

Box 2: lowered -
We see:
Concurrent Jobs (Running/Limit): 2/14
CPU Utilization: 6%
Note: When the processor and available RAM aren’t well utilized, but the execution of concurrent jobs reaches a node’s limits, scale up by increasing the number of concurrent jobs that a node can run

Reference:
https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime

136
Q

Hard
You have an Azure Databricks workspace named workspace1 in the Standard pricing tier.

You need to configure workspace1 to support autoscaling all-purpose clusters. The solution must meet the following requirements:
✑ Automatically scale down workers when the cluster is underutilized for three minutes.
✑ Minimize the time it takes to scale to the maximum number of workers.
✑ Minimize costs.

What should you do first?
A. Enable container services for workspace1.
B. Upgrade workspace1 to the Premium pricing tier.
C. Set Cluster Mode to High Concurrency.
D. Create a cluster policy in workspace1.

A

Correct Answer: B 🗳️

Selected Answer: B
We definitely need “Optimized Autoscaling” (not Standard Autoscaling), which is only available in the Premium plan.

B is the correct answer.

Automated (job) clusters always use optimized autoscaling. The type of autoscaling performed on all-purpose clusters depends on the workspace configuration.

Standard autoscaling is used by all-purpose clusters in workspaces in the Standard pricing tier. Optimized autoscaling is used by all-purpose clusters in the Azure Databricks Premium Plan.
https://docs.databricks.com/clusters/cluster-config-best-practices.html

For clusters running Databricks Runtime 6.4 and above, optimized autoscaling is used by all-purpose clusters in the Premium plan
Optimized autoscaling:
Scales up from min to max in 2 steps.
Can scale down even if the cluster is not idle by looking at shuffle file state.
Scales down based on a percentage of current nodes.
On job clusters, scales down if the cluster is underutilized over the last 40 seconds.
On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds.
The spark.databricks.aggressiveWindowDownS Spark configuration property specifies in seconds how often a cluster makes down-scaling decisions. Increasing the value causes a cluster to scale down more slowly. The maximum value is 600.

Note: Standard autoscaling -
Starts with adding 8 nodes. Thereafter, scales up exponentially, but can take many steps to reach the max. You can customize the first step by setting the spark.databricks.autoscaling.standardFirstStepUp Spark configuration property.
Scales down only when the cluster is completely idle and it has been underutilized for the last 10 minutes.
Scales down exponentially, starting with 1 node.
Reference:
https://docs.databricks.com/clusters/configure.html

137
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.

Solution: You use a tumbling window, and you set the window size to 10 seconds.
Does this meet the goal?

A. Yes
B. No

A

Correct Answer: A 🗳️
Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. The following diagram illustrates a stream with a series of events and how they are mapped into 10-second tumbling windows.

Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

138
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.

Solution: You use a session window that uses a timeout size of 10 seconds.
Does this meet the goal?

A. Yes
B. No

A

Correct Answer: B 🗳️
Instead use a tumbling window. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics

Variations of this question
Solution: You use a hopping window that uses a hop size of 10 seconds and a window size of 10 seconds. YES
Solution: You use a hopping window that uses a hop size of 5 seconds and a window size 10 seconds. NO
Solution: You use a session window that uses a timeout size of 10 seconds.
Does this meet the goal? NO
Solution: You use a tumbling window, and you set the window size to 10 seconds.
Does this meet the goal? YES

139
Q

You use Azure Stream Analytics to receive data from Azure Event Hubs and to output the data to an Azure Blob Storage account.

You need to output the count of records received from the last five minutes every minute.
Which windowing function should you use?

A. Session
B. Tumbling
C. Sliding
D. Hopping

A

Correct Answer: D 🗳️
Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap and be emitted more often than the window size. Events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size.

Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
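A hedged sketch (the input name ClickStream and the EventTime field are assumptions) of a five-minute window emitted every minute:

~~~
-- Window size 5 minutes, hop size 1 minute: every minute, output the count for the last five minutes.
SELECT COUNT(*) AS RecordCount
FROM ClickStream TIMESTAMP BY EventTime
GROUP BY HoppingWindow(minute, 5, 1)
~~~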

140
Q

HOTSPOT -
You configure version control for an Azure Data Factory instance as shown in the following exhibit.

link

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.
Hot Area:

Azure Resource Manager (ARM) templates for the pipeline assets are
stored in [answer choice]

* /
* adf_publish
* main
* Parameterization template

A Data Factory Azure Resource Manager (ARM) template named
contososales can be found in [answer choice]

* /
* /contososales
* /dwh_batchetl/adf_publish/contososales
* /main

A

DISPUTED
Box 1: adf_publish -
The Publish branch is the branch in your repository where publishing related ARM templates are stored and updated. By default, it’s adf_publish.
Box 2: /dwh_batchetl/adf_publish/contososales
Note: RepositoryName (here dwh_batchetl): Your Azure Repos code repository name. Azure Repos projects contain Git repositories to manage your source code as your project grows. You can create a new repository or use an existing repository that’s already in your project.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/source-control

141
Q

hard
HOTSPOT -
You are designing an Azure Stream Analytics solution that receives instant messaging data from an Azure Event Hub.

You need to ensure that the output from the Stream Analytics job counts the number of messages per time zone every 15 seconds.

How should you complete the Stream Analytics query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Select TimeZone, count (*) AS MessageCount

FROM MessageStream      [ LAST / OVER / SYSTEM.TIMESTAMP() / TIMESTAMP BY ]      CreatedAt

GROUP BY TimeZone, [ HOPPINGWINDOW / SESSIONWINDOW / SLIDINGWINDOW / TUMBLINGWINDOW ] (second, 15)
A

Box 1: timestamp by -

Box 2: TUMBLINGWINDOW -
Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, such as the example below. The key differentiators of a Tumbling window are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window.

Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
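Putting the two selections back into the query skeleton gives the completed query (nothing here beyond what the answer boxes state):

~~~
SELECT TimeZone, COUNT(*) AS MessageCount
FROM MessageStream TIMESTAMP BY CreatedAt
GROUP BY TimeZone, TUMBLINGWINDOW(second, 15)
~~~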

142
Q

HOTSPOT -
You have an Azure Data Factory instance named ADF1 and two Azure Synapse Analytics workspaces named WS1 and WS2.

ADF1 contains the following pipelines:
✑ P1: Uses a copy activity to copy data from a nonpartitioned table in a dedicated SQL pool of WS1 to an Azure Data Lake Storage Gen2 account
✑ P2: Uses a copy activity to copy data from text-delimited files in an Azure Data Lake Storage Gen2 account to a nonpartitioned table in a dedicated SQL pool of WS2

You need to configure P1 and P2 to maximize parallelism and performance.

Which dataset settings should you configure for the copy activity of each pipeline? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

P1:
* Set the Copy method to Bulk insert
* Set the Copy method to PolyBase
* Set the Isolation level to Repeatable read
* Set the Partition option to Dynamic range

P2:
* Set the Copy method to Bulk insert
* Set the Copy method to PolyBase
* Set the Isolation level to Repeatable read
* Set the Partition option to Dynamic range

A

P1: Set the Partition option to Dynamic range (according to chat).

P2: Set the Copy method to PolyBase (according to chat).

143
Q

HOTSPOT -
You have an Azure Storage account that generates 200,000 new files daily. The file names have a format of {YYYY}/{MM}/{DD}/{HH}/{CustomerID}.csv.

You need to design an Azure Data Factory solution that will load new data from the storage account to an Azure Data Lake once hourly. The solution must minimize load times and costs.

How should you configure the solution? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Load methodology:
* Full Load
* Incremental Load
* Load individual files as they arrive

Trigger:
* Fixed schedule
* New file
* Tumbling window

A

1) Incremental Load
2) Tumbling window (there is some debate that this should be Fixed schedule)

Seems like you could go with either Schedule trigger or Tumbling Window here. I would use the latter option, and pass the windowStart system variable to the pipeline as a parameter, allowing me to more easily navigate to the proper directory in the storage account.

144
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
✑ A workload for data engineers who will use Python and SQL.
✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.
✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:
✑ The data engineers must share a cluster.
✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.
✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a Standard cluster for the data engineers, and a High Concurrency cluster for the jobs.

Does this meet the goal?
A. Yes
B. No

A

Correct Answer: B 🗳️
We need a High Concurrency cluster for the data engineers and the jobs.
Note: Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL.
A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies.
Reference:
https://docs.azuredatabricks.net/clusters/configure.html

145
Q

You have the following Azure Data Factory pipelines:
✑ Ingest Data from System1
✑ Ingest Data from System2
✑ Populate Dimensions
✑ Populate Facts

Ingest Data from System1 and Ingest Data from System2 have no dependencies. Populate Dimensions must execute after Ingest Data from System1 and Ingest Data from System2. Populate Facts must execute after Populate Dimensions pipeline. All the pipelines must execute every eight hours.

What should you do to schedule the pipelines for execution?
A. Add an event trigger to all four pipelines.
B. Add a schedule trigger to all four pipelines.
C. Create a parent pipeline that contains the four pipelines and use a schedule trigger.
D. Create a parent pipeline that contains the four pipelines and use an event trigger.

A

Correct Answer: C 🗳️
Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers

146
Q

Hard - no clear answer
DRAG DROP -
You are responsible for providing access to an Azure Data Lake Storage Gen2 account.

Your user account has contributor access to the storage account, and you have the application ID and access key.

You plan to use PolyBase to load data into an enterprise data warehouse in Azure Synapse Analytics.

You need to configure PolyBase to connect the data warehouse to the storage account.

Which three components should you create in sequence? To answer, move the appropriate components from the list of components to the answer area and arrange them in the correct order.

Select and Place:

Components
1. a database scoped credential
2. an asymmetric key
3. an external data source
4. a database encryption key
5. an external file format

A

Very unsure on this one.
Step 1: a database encryption key
This is disputed in chat, but most seem to be going with this.

Step 2: a database scoped credential
Create a Database Scoped Credential. A Database Scoped Credential is a record that contains the authentication information required to connect an external resource. The master key needs to be created first before creating the database scoped credential.

Step 3: an external data source -
Create an External Data Source. External data sources are used to establish connectivity for data loading using Polybase.
Reference:
https://www.sqlservercentral.com/articles/access-external-data-from-azure-synapse-analytics-using-polybase
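For context only (this is not a resolution of the dispute above), a hedged T-SQL sketch of a typical PolyBase setup against Data Lake Storage Gen2 in a dedicated SQL pool; every object name, secret, and URL below is a placeholder:

~~~
-- A master key must exist before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';

-- Database scoped credential; the exact IDENTITY/SECRET values depend on whether you
-- authenticate with the storage account access key or the service principal.
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'user',
     SECRET   = '<storage-account-access-key>';

-- External data source pointing at the Data Lake Storage Gen2 account.
CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://<container>@<account>.dfs.core.windows.net',
      CREDENTIAL = AdlsCredential);

-- External file format used by the external tables that PolyBase reads.
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
~~~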

147
Q

You are monitoring an Azure Stream Analytics job by using metrics in Azure.

You discover that during the last 12 hours, the average watermark delay is consistently greater than the configured late arrival tolerance.

What is a possible cause of this behavior?
A. Events whose application timestamp is earlier than their arrival time by more than five minutes arrive as inputs.
B. There are errors in the input data.
C. The late arrival policy causes events to be dropped.
D. The job lacks the resources to process the volume of incoming data.

A

Correct Answer: D 🗳️
Watermark Delay indicates the delay of the streaming data processing job.
There are a number of resource constraints that can cause the streaming pipeline to slow down. The watermark delay metric can rise due to:
1. Not enough processing resources in Stream Analytics to handle the volume of input events. To scale up resources, see Understand and adjust Streaming Units.
2. Not enough throughput within the input event brokers, so they are throttled. For possible solutions, see Automatically scale up Azure Event Hubs throughput units.
3. Output sinks are not provisioned with enough capacity, so they are throttled. The possible solutions vary widely based on the flavor of output service being used.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-time-handling

148
Q

HOTSPOT -
You are building an Azure Stream Analytics job to retrieve game data.

You need to ensure that the job returns the highest scoring record for each five-minute time interval of each game.

How should you complete the Stream Analytics query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

SELECT [Collect(Score) / CollectTop(1) OVER(ORDER BY Score Desc) / Game, MAX(Score) / TopOne() OVER(PARTITION BY Game ORDER BY Score Desc) ] as HighestScore 

FROM input TIMESTAMP BY CreatedAt

GROUP BY [ Game / Hopping(minute,5) / Tumbling(minute,5) / Windows(TumblingWindow(minute,5),Hopping(minute,5)) ]
A

Box 1: TopOne OVER(PARTITION BY Game ORDER BY Score Desc)
TopOne returns the top-rank record, where rank defines the ranking position of the event in the window according to the specified ordering. Ordering/ranking is based on event columns and can be specified in ORDER BY clause.
Box 2: TumblingWindow(minute, 5) according to chat
- This window function groups the events into non-overlapping, continuous five-minute intervals, which is what’s required to get the highest score in each five-minute time slice.

This configuration will ensure that you get the highest score for each game every five minutes.

Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/topone-azure-stream-analytics
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

149
Q

Note: there are 4 variations of this question.
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Data Lake Storage account that contains a staging zone.

You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that copies the data to a staging table in the data warehouse, and then uses a stored procedure to execute the R script.
Does this meet the goal?

A. Yes
B. No

A

Correct Answer: B
The solution proposed does not meet the goal because it suggests executing the R script using a stored procedure in the data warehouse. Azure Synapse Analytics does not support executing R scripts directly within stored procedures. Instead, you should use Azure Data Factory to orchestrate the process, using an Azure Machine Learning activity to execute the R script for data transformation before loading the transformed data into Azure Synapse Analytics.

VARIATIONS OF THIS QUESTION
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that copies the data to a staging table in the data warehouse, and then uses a stored procedure to execute the R script. NO

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse. YES

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes mapping data flow, and then inserts the data into the data warehouse. NO

Solution: You schedule an Azure Databricks job that executes an R notebook, and then inserts the data into the data warehouse. YES

150
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
✑ A workload for data engineers who will use Python and SQL.
✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.
✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:
✑ The data engineers must share a cluster.
✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.
✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a High Concurrency cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs.
Does this meet the goal?
A. Yes
B. No

A

Data scientists need Scala, so they require Standard clusters (High Concurrency does not support Scala); the jobs also need Scala, so Standard is right for them. The answer is B, but for different reasons than the usual explanation.

151
Q

You are designing an Azure Databricks cluster that runs user-defined local processes.

You need to recommend a cluster configuration that meets the following requirements:
✑ Minimize query latency.
✑ Maximize the number of users that can run queries on the cluster at the same time.
✑ Reduce overall costs without compromising other requirements.

Which cluster type should you recommend?

A. Standard with Auto Termination
B. High Concurrency with Autoscaling
C. High Concurrency with Auto Termination
D. Standard with Autoscaling

A

Correct Answer: B 🗳️
A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide fine-grained sharing for maximum resource utilization and minimum query latencies.
Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling. Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload.
Incorrect Answers:
C: The cluster configuration includes an auto terminate setting whose default value depends on cluster mode:
Standard and Single Node clusters terminate automatically after 120 minutes by default.
High Concurrency clusters do not terminate automatically by default.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure

B is the correct answer.
A High Concurrency cluster does not auto-terminate by default, so C is wrong.
A Standard cluster cannot be shared by multiple tasks, so A and D are wrong.

152
Q

HARD
HOTSPOT -
You are building an Azure Data Factory solution to process data received from Azure Event Hubs, and then ingested into an Azure Data Lake Storage Gen2 container.

The data will be ingested every five minutes from devices into JSON files. The files have the following naming pattern.

/{deviceType}/in/{YYYY}/{MM}/{DD}/{HH}/{deviceID}_{YYYY}{MM}{DD}{HH}{mm}.json

You need to prepare the data for batch data processing so that there is one dataset per hour per deviceType. The solution must minimize read times.

How should you configure the sink for the copy activity? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Parameter:
* @pipeline().TriggerTime
* @pipeline().TriggerType
* @trigger().outputs.windowStartTime
* @trigger().startTime

Naming pattern:
* /{deviceID}/out/{YYYY}/{MM}/{DD}/{HH}.json
* /{YYYY}/{MM}/{DD}/{deviceType}.json
* /{YYYY}/{MM}/{DD}/{HH}.json
* /{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json

Copy behavior:
* Add dynamic content
* Flatten hierarchy
* Merge files

A

Box 1: @trigger().outputs.windowStartTime -
startTime: A date-time value. For basic schedules, the value of the startTime property applies to the first occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value.

Box 2: /{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json
One dataset per hour per deviceType.

Box 3: Merge files, not Flatten hierarchy.
The question starts with a folder structure as the following:
/{deviceType}/in/{YYYY}/{MM}/{DD}/{HH}/{deviceID}_{YYYY}{MM}{DD}{HH}{mm}.json

It indicates there are multiple device ID JSON files per deviceType. Those need to be merged to get the target naming pattern - “one file per device type per hour.”
The target naming pattern is the following:
/{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json

The correct copy behavior is “Merge” because there are multiple files in the source folder that are merged into a single folder per device type per hour.

153
Q

DRAG DROP -
You are designing an Azure Data Lake Storage Gen2 structure for telemetry data from 25 million devices distributed across seven key geographical regions. Each minute, the devices will send a JSON payload of metrics to Azure Event Hubs.

You need to recommend a folder structure for the data. The solution must meet the following requirements:

✑ Data engineers from each region must be able to build their own pipelines for the data of their respective region only.
✑ The data must be processed at least once every 15 minutes for inclusion in Azure Synapse Analytics serverless SQL pools.

How should you recommend completing the structure? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.
Select and Place:

Values
* {deviceID}
* {mm}/{HH}/{DD}/{MM}/{YYYY}
* {regionID}/{deviceID}
* {regionID}/raw
* {YYYY}/{MM}/{DD}/{HH}
* {YYYY}/{MM}/{DD}/{HH}/{mm}
* raw/{deviceID}
* raw/{regionID}

Answer area
/ [XXXXXXXXXX] / [XXXXXXXXXX] / [XXXXXXXXXX].json

A

The correct answer is:
raw/{regionID}/{YYYY}/{MM}/{DD}/{HH}/{mm}/{deviceID}.json

raw/{regionID} is the first level because raw is the container name for the raw data, and regionID follows it for ease of managing security per region.

{YYYY}/{MM}/{DD}/{HH}/{mm}/{deviceID}.json is preferred over {deviceID}/{YYYY}/{MM}/{DD}/{HH}/{mm}.json. The primary reason is that you want as few folders as possible near the top of the namespace and to narrow the structure down as you get deeper into it.

For example, if you have 1 year's worth of data and 25 million devices, using {YYYY}/{MM}/{DD}/{HH}/{mm}/ results in roughly 525,000 time folders (365 days * 24 hours * 60 minutes). If you start your folder structure with {deviceID} instead, you end up with 25 million folders - one for each device - before you even get to including the date in the hierarchy.

154
Q

HOTSPOT -
You are implementing an Azure Stream Analytics solution to process event data from devices.

The devices output events when there is a fault and emit a repeat of the event every five seconds until the fault is resolved. The devices output a heartbeat event every five seconds after a previous event if there are no faults present.

A sample of the events is shown in the following table.

| DeviceID                             | EventType              | EventTime             |
|--------------------------------------|------------------------|-----------------------|
| 78cc5ht9-w357-684r-w4fr-kr16h6p9874e | HeartBeat              | 2020-12-01T19:00.000Z |
| 78cc5ht9-w357-684r-w4fr-kr16h6p9874e | HeartBeat              | 2020-12-01T19:05.000Z |
| 78cc5ht9-w357-684r-w4fr-kr16h6p9874e | TemperatureSensorFault | 2020-12-01T19:07.000Z |

You need to calculate the uptime between the faults.

How should you complete the Stream Analytics SQL query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

SELECT
DeviceID,
MIN (EventTime) as StartTime,
MAX (EventTime) as EndTime,
DATEDIFF (second, MIN (EventTime), MAX (EventTime) ) AS duration_in_seconds
FROM input TIMESTAMP BY EventTime
[ WHERE EventType='HeartBeat' / WHERE LAG(EventType, 1) OVER (LIMIT DURATION(second,5)) <> EventType / WHERE IsFirst(second,5) = 1 ]
GROUP BY
DeviceID
[ SessionWindow(second, 5, 50000) OVER (PARTITION BY DeviceID) / ,TumblingWindow(second,5) / HAVING DATEDIFF(second, MIN(EventTime), MAX(EventTime)) > 5 ]
A

Box 1: WHERE EventType='HeartBeat'
Box 2: SessionWindow(second, 5, 50000) OVER (PARTITION BY DeviceID) (per the discussion)
To calculate the uptime between faults, use a session window per device. While there is no fault, a heartbeat arrives every five seconds and keeps the session open; when a fault occurs (or the window reaches its maximum size), no heartbeat arrives within the next five seconds and the window closes, yielding the uptime. A tumbling window would not work, because it could never measure an uptime longer than five seconds.
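
Assembled from those two selections, the completed query would look like the following sketch (input name as given in the question):
~~~
-- Heartbeats arrive every 5 seconds while the device is healthy, so each
-- session of consecutive heartbeats spans one fault-free interval.
SELECT
    DeviceID,
    MIN(EventTime) AS StartTime,
    MAX(EventTime) AS EndTime,
    DATEDIFF(second, MIN(EventTime), MAX(EventTime)) AS duration_in_seconds
FROM input TIMESTAMP BY EventTime
WHERE EventType = 'HeartBeat'
GROUP BY
    DeviceID,
    SessionWindow(second, 5, 50000) OVER (PARTITION BY DeviceID)
~~~
Each run of consecutive heartbeats becomes one session per device, and the DATEDIFF between the first and last heartbeat in that session is the uptime between faults.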

155
Q

You are creating a new notebook in Azure Databricks that will support R as the primary language but will also support Scala and SQL.

Which switch should you use to switch between languages?
A. %<language>
B. @<language>
C. \\[<language>]
D. \\(<language>)

A

Correct Answer: A 🗳️
To change the language in Databricks’ cells to either Scala, SQL, Python or R, prefix the cell with ‘%’, followed by the language.
%python //or r, scala, sql
Reference:
https://www.theta.co.nz/news-blogs/tech-blog/enhancing-digital-twins-part-3-predictive-maintenance-with-azure-databricks

156
Q

You have an Azure Data Factory pipeline that performs an incremental load of source data to an Azure Data Lake Storage Gen2 account.

Data to be loaded is identified by a column named LastUpdatedDate in the source table.
You plan to execute the pipeline every four hours.

You need to ensure that the pipeline execution meets the following requirements:
✑ Automatically retries the execution when the pipeline run fails due to concurrency or throttling limits.
✑ Supports backfilling existing data in the table.

Which type of trigger should you use?
A. event
B. on-demand
C. schedule
D. tumbling window

A

Correct Answer: D 🗳️
In case of pipeline failures, tumbling window trigger can retry the execution of the referenced pipeline automatically, using the same input parameters, without the user intervention. This can be specified using the property “retryPolicy” in the trigger definition.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger

157
Q

You are designing a solution that will copy Parquet files stored in an Azure Blob storage account to an Azure Data Lake Storage Gen2 account.

The data will be loaded daily to the data lake and will use a folder structure of {Year}/{Month}/{Day}/.

You need to design a daily Azure Data Factory data load to minimize the data transfer between the two accounts.

Which two configurations should you include in the design? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point
A. Specify a file naming pattern for the destination.
B. Delete the files in the destination before loading the data.
C. Filter by the last modified date of the source files.
D. Delete the source files after they are copied.

A

Correct Answer: AC 🗳️
Copy only the daily files by using filtering.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage

A. Specify a file naming pattern for the destination:
By specifying a file naming pattern for the destination files in the Azure Data Lake Storage Gen2 account, you can ensure that the files are organized and stored in a structured manner. This can help with data management and subsequent processing.

C. Filter by the last modified date of the source files:
By filtering the source files based on the last modified date, you can select only the files that have been modified on the current day. This reduces the amount of data transferred and improves the efficiency of the data load process.

158
Q

Hard - poorly worded, so the answer is not clear

You plan to build a structured streaming solution in Azure Databricks. The solution will count new events in five-minute intervals and report only events that arrive during the interval. The output will be sent to a Delta Lake table.

Which output mode should you use?
A. update
B. complete
C. append

A

Correct Answer: C 🗳️
Append Mode: Only new rows appended in the result table since the last trigger are written to external storage. This is applicable only for the queries where existing rows in the Result Table are not expected to change.
Incorrect Answers:
B: Complete Mode: The entire updated result table is written to external storage. It is up to the storage connector to decide how to handle the writing of the entire table.
A: Update Mode: Only the rows that were updated in the result table since the last trigger are written to external storage. This is different from Complete Mode in that Update Mode outputs only the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it is equivalent to Append mode.
Reference:
https://docs.databricks.com/getting-started/spark/streaming.html

159
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1.

You have files that are ingested and loaded into an Azure Data Lake Storage Gen2 container named container1.

You plan to insert data from the files in container1 into Table1 and transform the data. Each row of data in the files will produce one row in the serving layer of Table1.

You need to ensure that when the source data files are loaded to container1, the DateTime is stored as an additional column in Table1.

Solution: In an Azure Synapse Analytics pipeline, you use a data flow that contains a Derived Column transformation.

Does this meet the goal?
A. Yes
B. No

A

This is the only variation of this question whose answer is YES.
Correct Answer: A 🗳️
Use the derived column transformation to generate new columns in your data flow or to modify existing fields.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-derived-column
“Data flows are available both in Azure Data Factory and Azure Synapse Pipelines”
“Use the derived column transformation to generate new columns in your data flow or to modify existing fields.”

https://docs.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

160
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1.

You have files that are ingested and loaded into an Azure Data Lake Storage Gen2 container named container1.

You plan to insert data from the files in container1 into Table1 and transform the data. Each row of data in the files will produce one row in the serving layer of Table1.

You need to ensure that when the source data files are loaded to container1, the DateTime is stored as an additional column in Table1.

Solution: You use a dedicated SQL pool to create an external table that has an additional DateTime column.

Does this meet the goal?
A. Yes
B. No

A

Correct Answer: B 🗳️
Instead use the derived column transformation to generate new columns in your data flow or to modify existing fields.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

Selected Answer: B
Answer should be B.
An external table is based on a source flat file structure. It seems to make no sense to add additional date time columns to such a table.

161
Q

Hard
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1.
You have files that are ingested and loaded into an Azure Data Lake Storage Gen2 container named container1.

You plan to insert data from the files in container1 into Table1 and transform the data. Each row of data in the files will produce one row in the serving layer of Table1.

You need to ensure that when the source data files are loaded to container1, the DateTime is stored as an additional column in Table1.

Solution: You use an Azure Synapse Analytics serverless SQL pool to create an external table that has an additional DateTime column.

Does this meet the goal?
A. Yes
B. No

A

You can’t use a serverless SQL pool to create a table in a dedicated SQL pool.

Correct Answer: B 🗳️
Instead use the derived column transformation to generate new columns in your data flow or to modify existing fields.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-derived-column

162
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1.
You have files that are ingested and loaded into an Azure Data Lake Storage Gen2 container named container1.

You plan to insert data from the files in container1 into Table1 and transform the data. Each row of data in the files will produce one row in the serving layer of Table1.

You need to ensure that when the source data files are loaded to container1, the DateTime is stored as an additional column in Table1.

Solution: In an Azure Synapse Analytics pipeline, you use a Get Metadata activity that retrieves the DateTime of the files.

Does this meet the goal?
A. Yes
B. No

A

Variations of this question
Solution: In an Azure Synapse Analytics pipeline, you use a Get Metadata activity that retrieves the DateTime of the files. NO
Solution: You use an Azure Synapse Analytics serverless SQL pool to create an external table that has an additional DateTime column. NO
Solution: You use a dedicated SQL pool to create an external table that has an additional DateTime column. NO
Solution: In an Azure Synapse Analytics pipeline, you use a data flow that contains a Derived Column transformation. YES

Correct Answer: B 🗳️
Instead use a serverless SQL pool to create an external table with the extra column.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-external-tables

163
Q

link 4 variations of this question
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Data Lake Storage account that contains a staging zone.

You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse.

Does this meet the goal?
A. Yes
B. No

A

Selected Answer: A
I think it is A. Yes.
You can execute R code in a notebook, and then call it from Data Factory.
You can check it at “Databricks Notebook activity” header:
https://docs.microsoft.com/en-US/azure/data-factory/transform-data
And also:
https://docs.microsoft.com/en-us/azure/databricks/spark/latest/sparkr/overview

VARIATIONS OF THIS QUESTION
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that copies the data to a staging table in the data warehouse, and then uses a stored procedure to execute the R script. NO

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse. YES

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes mapping data flow, and then inserts the data into the data warehouse. NO

Solution: You schedule an Azure Databricks job that executes an R notebook, and then inserts the data into the data warehouse. YES

164
Q

link 4 variations of this question
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Data Lake Storage account that contains a staging zone.

You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes mapping data flow, and then inserts the data into the data warehouse.

Does this meet the goal?
A. Yes
B. No

A

Correct Answer: B 🗳️
If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity (not a mapping data flow) with your own data processing logic and use that activity in the pipeline. You can create a custom activity to run R scripts on your HDInsight cluster with R installed.
Reference:
https://docs.microsoft.com/en-US/azure/data-factory/transform-data

Selected Answer: B
Correct.
Mapping data flows can’t execute R code, which is a requirement, so this does not meet the goal.

VARIATIONS OF THIS QUESTION
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that copies the data to a staging table in the data warehouse, and then uses a stored procedure to execute the R script. NO

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse. YES

Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes mapping data flow, and then inserts the data into the data warehouse. NO

Solution: You schedule an Azure Databricks job that executes an R notebook, and then inserts the data into the data warehouse. YES

165
Q

link 4 variations of this question

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Data Lake Storage account that contains a staging zone.

You need to design a daily process to ingest incremental data from the staging zone, transform the data by executing an R script, and then insert the transformed data into a data warehouse in Azure Synapse Analytics.

Solution: You schedule an Azure Databricks job that executes an R notebook, and then inserts the data into the data warehouse.

Does this meet the goal?
A. Yes
B. No

A

Variations of this question
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that copies the data to a staging table in the data warehouse, and then uses a stored procedure to execute the R script. NO
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes an Azure Databricks notebook, and then inserts the data into the data warehouse. YES
Solution: You use an Azure Data Factory schedule trigger to execute a pipeline that executes mapping data flow, and then inserts the data into the data warehouse. NO
Solution: You schedule an Azure Databricks job that executes an R notebook, and then inserts the data into the data warehouse. YES

Selected Answer: A
The correct answer is “A. Yes”
You can execute R code in a notebook, and then call it from Data Factory.
You can check it at “Databricks Notebook activity” header:
https://docs.microsoft.com/en-US/azure/data-factory/transform-data
And also:
https://docs.microsoft.com/en-us/azure/databricks/spark/latest/sparkr/overview


166
Q

You plan to create an Azure Data Factory pipeline that will include a mapping data flow.

You have JSON data containing objects that have nested arrays.

You need to transform the JSON-formatted data into a tabular dataset. The dataset must have one row for each item in the arrays.

Which transformation method should you use in the mapping data flow?
A. new branch
B. unpivot
C. alter row
D. flatten

A

Correct Answer: D 🗳️
Use the flatten transformation to take array values inside hierarchical structures such as JSON and unroll them into individual rows. This process is known as denormalization.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-flatten

167
Q

You use Azure Stream Analytics to receive Twitter data from Azure Event Hubs and to output the data to an Azure Blob storage account.

You need to output the count of tweets during the last five minutes every five minutes. Each tweet must only be counted once.

Which windowing function should you use?
A. a five-minute Sliding window
B. a five-minute Session window
C. a five-minute Hopping window that has a one-minute hop
D. a five-minute Tumbling window

A

Correct Answer: D 🗳️
Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, such as the example below. The key differentiators of a Tumbling window are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window.

Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
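
A minimal sketch of such a query (the input, output, and timestamp column names are assumptions, not from the question):
~~~
-- Counts each tweet exactly once: windows are 5 minutes long, repeat, and do not overlap.
SELECT
    COUNT(*) AS TweetCount,
    System.Timestamp() AS WindowEnd
INTO [blob-output]
FROM [twitter-input] TIMESTAMP BY CreatedAt
GROUP BY TumblingWindow(minute, 5)
~~~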

168
Q

You are planning a streaming data solution that will use Azure Databricks. The solution will stream sales transaction data from an online store. The solution has the following specifications:
✑ The output data will contain items purchased, quantity, line total sales amount, and line total tax amount.
✑ Line total sales amount and line total tax amount will be aggregated in Databricks.
✑ Sales transactions will never be updated. Instead, new rows will be added to adjust a sale.

You need to recommend an output mode for the dataset that will be processed by using Structured Streaming. The solution must minimize duplicate data.

What should you recommend?
A. Update
B. Complete
C. Append

A

C. Append

For the given scenario, where sales transactions are never updated but new rows are added to adjust a sale, the recommended output mode for the dataset processed by using Structured Streaming in Azure Databricks is “Append”.

The “Append” output mode ensures that only new rows are added to the output data as they arrive in the streaming data source. It appends the new rows to the existing result without modifying or updating previously processed data. This mode is suitable when you want to continuously append new records to the output data without duplicating or modifying existing data.

In this case, as new rows are added to adjust a sale, the “Append” mode will capture these new rows and include them in the output data, allowing you to aggregate the line total sales amount and line total tax amount in Databricks while minimizing duplicate data.

169
Q

Hard
You have an enterprise data warehouse in Azure Synapse Analytics named DW1 on a server named Server1.

You need to determine the size of the transaction log file for each distribution of DW1.

What should you do?

A. On DW1, execute a query against the sys.database_files dynamic management view.

B. From Azure Monitor in the Azure portal, execute a query against the logs of DW1.

C. Execute a query against the logs of DW1 by using the Get-AzOperationalInsightsSearchResult PowerShell cmdlet.

D. On the master database, execute a query against the sys.dm_pdw_nodes_os_performance_counters dynamic management view.

A

Almost always Azure Monitor, but not this time.

The question asks for transaction log size on each distribution. The correct answer is D: Link below: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-monitor
-- Transaction log size
SELECT
    instance_name AS distribution_db,
    cntr_value*1.0/1048576 AS log_file_size_used_GB,
    pdw_node_id
FROM sys.dm_pdw_nodes_os_performance_counters
WHERE
    instance_name LIKE 'Distribution_%'
    AND counter_name = 'Log File(s) Used Size (KB)'

Selected Answer: D
D is totally correct. Link has this very clearly mentioned

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-monitor

170
Q

You are designing an anomaly detection solution for streaming data from an Azure IoT hub. The solution must meet the following requirements:
✑ Send the output to Azure Synapse.
✑ Identify spikes and dips in time series data.
✑ Minimize development and configuration effort.

Which should you include in the solution?
A. Azure Databricks
B. Azure Stream Analytics
C. Azure SQL Database

A

Correct Answer: B 🗳️
You can identify anomalies by routing data via IoT Hub to a built-in ML model in Azure Stream Analytics.
Reference:
https://docs.microsoft.com/en-us/learn/modules/data-anomaly-detection-using-azure-iot-hub/
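
For illustration, a hedged sketch of that built-in model (input, output, and column names are assumptions); AnomalyDetection_SpikeAndDip scores each event against the preceding two minutes of history:
~~~
WITH AnomalyDetectionStep AS
(
    SELECT
        System.Timestamp() AS time,
        CAST(temperature AS float) AS temp,
        -- 95% confidence, history of 120 events, detect both spikes and dips
        AnomalyDetection_SpikeAndDip(CAST(temperature AS float), 95, 120, 'spikesanddips')
            OVER (LIMIT DURATION(second, 120)) AS SpikeAndDipScores
    FROM [iothub-input]
)
SELECT
    time,
    temp,
    CAST(GetRecordPropertyValue(SpikeAndDipScores, 'Score') AS float) AS SpikeAndDipScore,
    CAST(GetRecordPropertyValue(SpikeAndDipScores, 'IsAnomaly') AS bigint) AS IsSpikeAndDipAnomaly
INTO [synapse-output]
FROM AnomalyDetectionStep
~~~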

171
Q

A company uses Azure Stream Analytics to monitor devices.

The company plans to double the number of devices that are monitored.

You need to monitor a Stream Analytics job to ensure that there are enough processing resources to handle the additional load.

Which metric should you monitor?
A. Early Input Events
B. Late Input Events
C. Watermark delay
D. Input Deserialization Errors

A

Correct Answer: C 🗳️

C. Watermark delay measures the amount of delay in the processing of the input events. If the watermark delay increases, it can indicate that the Stream Analytics job is not able to keep up with the incoming data and may not have enough processing resources to handle the additional load.

There are a number of resource constraints that can cause the streaming pipeline to slow down. The watermark delay metric can rise due to:
✑ Not enough processing resources in Stream Analytics to handle the volume of input events.
✑ Not enough throughput within the input event brokers, so they are throttled.
✑ Output sinks are not provisioned with enough capacity, so they are throttled. The possible solutions vary widely based on the flavor of output service being used.
Incorrect Answers:
D: Input deserialization errors are caused when the input stream of your Stream Analytics job contains malformed messages.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-time-handling

172
Q

HOTSPOT -
You are designing an enterprise data warehouse in Azure Synapse Analytics that will store website traffic analytics in a star schema.
You plan to have a fact table for website visits. The table will be approximately 5 GB.

You need to recommend which distribution type and index type to use for the table. The solution must provide the fastest query performance.

What should you recommend? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Distribution:
* Hash
* Round robin
* Replicated

Index:
* Clustered columnstore
* Clustered
* Nonclustered

A

Box 1: Hash -
Consider using a hash-distributed table when:
The table size on disk is more than 2 GB.
The table has frequent insert, update, and delete operations.

Box 2: Clustered columnstore -
Clustered columnstore tables offer both the highest level of data compression and the best overall query performance.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index
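
As a sketch only, the recommendation maps to DDL along these lines (table and column names are illustrative, not from the question):
~~~
CREATE TABLE dbo.FactWebsiteVisits
(
    VisitKey    bigint NOT NULL,
    DateKey     int    NOT NULL,
    PageKey     int    NOT NULL,
    VisitorKey  int    NOT NULL,
    VisitCount  int    NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(VisitKey),   -- spreads the ~5 GB fact table across distributions
    CLUSTERED COLUMNSTORE INDEX      -- best compression and overall query performance
);
~~~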

173
Q

You have an Azure Stream Analytics job.

You need to ensure that the job has enough streaming units provisioned.
You configure monitoring of the SU % Utilization metric.

Which two additional metrics should you monitor? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. Backlogged Input Events
B. Watermark Delay
C. Function Events
D. Out of order Events
E. Late Input Events

A

Correct Answer: AB 🗳️
To react to increased workloads and increase streaming units, consider setting an alert of 80% on the SU Utilization metric. Also, you can use watermark delay and backlogged events metrics to see if there is an impact.
Note: Backlogged Input Events: Number of input events that are backlogged. A non-zero value for this metric implies that your job isn’t able to keep up with the number of incoming events. If this value is slowly increasing or consistently non-zero, you should scale out your job, by increasing the SUs.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-monitoring

174
Q

You have an activity in an Azure Data Factory pipeline. The activity calls a stored procedure in a data warehouse in Azure Synapse Analytics and runs daily.

You need to verify the duration of the activity when it ran last.
What should you use?
A. activity runs in Azure Monitor
B. Activity log in Azure Synapse Analytics
C. the sys.dm_pdw_wait_stats data management view in Azure Synapse Analytics
D. an Azure Resource Manager template

A

Correct Answer: A 🗳️
Monitor activity runs. To get a detailed view of the individual activity runs of a specific pipeline run, click on the pipeline name.
Example:

The list view shows activity runs that correspond to each pipeline run. Hover over the specific activity run to get run-specific information such as the JSON input,
JSON output, and detailed activity-specific monitoring experiences.

You can check the Duration.
Incorrect Answers:
C: sys.dm_pdw_wait_stats holds information related to the SQL Server OS state related to instances running on the different nodes.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/monitor-visually

175
Q

You have an Azure Data Factory pipeline that is triggered hourly.

The pipeline has had 100% success for the past seven days.

The pipeline execution fails, and two retries that occur 15 minutes apart also fail. The third failure returns the following error.

ErrorCode=UserErrorFileNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'NotFound'. Account: 'contosoproduksouth'. Filesystem: wwi. Path: 'BIKES/CARBON/year=2021/month=01/day=10/hour=06'. ErrorCode: 'PathNotFound'. Message: 'The specified path does not exist.'. RequestId: '6d269b78-901f-001b-4924-e7a7bc000000'. TimeStamp: 'Sun, 10 Jan 2021 07:45:05

What is a possible cause of the error?
A. The parameter used to generate year=2021/month=01/day=10/hour=06 was incorrect.
B. From 06:00 to 07:00 on January 10, 2021, there was no data in wwi/BIKES/CARBON.
C. From 06:00 to 07:00 on January 10, 2021, the file format of data in wwi/BIKES/CARBON was incorrect.
D. The pipeline was triggered too early.

A

The error message reports a missing path, which matches answer B: there was no data for the 06:00-07:00 window in wwi/BIKES/CARBON. The initial run plus the two retries 15 minutes apart explains why the third failure was logged at 07:45.

176
Q

HARD
You have an Azure Synapse Analytics job that uses Scala.
You need to view the status of the job.
What should you do?
A. From Synapse Studio, select the workspace. From Monitor, select SQL requests.
B. From Azure Monitor, run a Kusto query against the AzureDiagnostics table.
C. From Synapse Studio, select the workspace. From Monitor, select Apache Spark applications.
D. From Azure Monitor, run a Kusto query against the SparkLoggingEvent_CL table.

A

Almost always Azure Monitor, but not this time.
Correct Answer: C 🗳️
Use Synapse Studio to monitor your Apache Spark applications. To monitor a running Apache Spark application, open Monitor, then select Apache Spark applications. To view the details of a submitted Apache Spark application, select it from the list. If the Apache Spark application is still running, you can monitor its progress.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/monitoring/apache-spark-applications

177
Q

DRAG DROP -
You have an Azure Data Lake Storage Gen2 account that contains a JSON file for customers. The file contains two attributes named FirstName and LastName.

You need to copy the data from the JSON file to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values.
You create the following components:
✑ A destination table in Azure Synapse
✑ An Azure Blob storage container
✑ A service principal

In which order should you perform the actions? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:

Actions
* Mount the Data Lake Storage onto DBFS.
* Write the results to a table in Azure Synapse.
* Specify a temporary folder to stage the data.
* Read the file into a data frame.
* Perform transformations on the data frame.

A

Step 1: Mount the Data Lake Storage onto DBFS
Begin with creating a file system in the Azure Data Lake Storage Gen2 account.
Step 2: Read the file into a data frame.
You can load the json files as a data frame in Azure Databricks.
Step 3: Perform transformations on the data frame.
Step 4: Specify a temporary folder to stage the data
Specify a temporary folder to use while moving data between Azure Databricks and Azure Synapse.
Step 5: Write the results to a table in Azure Synapse.
You upload the transformed data frame into Azure Synapse. You use the Azure Synapse connector for Azure Databricks to directly upload a dataframe as a table in a Azure Synapse.
Reference:
https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse

178
Q

hard (bad question)

You have an Azure data factory named ADF1.

You currently publish all pipeline authoring changes directly to ADF1.

You need to implement version control for the changes made to pipeline artifacts. The solution must ensure that you can apply version control to the resources currently defined in the UX Authoring canvas for ADF1.

Which two actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. From the UX Authoring canvas, select Set up code repository.
B. Create a Git repository.
C. Create a GitHub action.
D. Create an Azure Data Factory trigger.
E. From the UX Authoring canvas, select Publish.
F. From the UX Authoring canvas, run Publish All.

A

Selected Answer: AB
They are asking to “implement version control”.
B Create Git repo
A From the UX Set up code repository

179
Q

DRAG DROP -
You have an Azure subscription that contains an Azure Synapse Analytics workspace named workspace1. Workspace1 connects to an Azure DevOps repository named repo1.

Repo1 contains a collaboration branch named main and a development branch named branch1. Branch1 contains an Azure Synapse pipeline named pipeline1.

In workspace1, you complete testing of pipeline1.

You need to schedule pipeline1 to run daily at 6 AM.

Which four actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.
Select and Place:

Actions
* Create a new branch in Repo1.
* Merge the changes from branch1 into main.
* Associate the schedule trigger with pipeline1.
* Switch to Synapse live mode.
* Create a schedule trigger.
* Publish the contents of main.

A

Create a schedule trigger.

Associate the schedule trigger with pipeline1.

Merge the changes from branch1 into main.

Publish the contents of main.

You should associate the trigger before merging the code into main, because the schedule trigger is also part of the code. All code is stored in main, and you should not change main directly; that is the purpose of version control.

180
Q

HOTSPOT -
You have an Azure subscription that contains an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage account named storage1. Storage1 requires secure transfers.

You need to create an external data source in Pool1 that will be used to read .orc files in storage1.

How should you complete the code? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH
(
    LOCATION = '[ abfs / abfss / wasb / wasbs ]://data@newyorktaxidataset.dfs.core.windows.net',
    CREDENTIAL = ADLS_credential,
    TYPE = [ BLOB_STORAGE / HADOOP / RDBMS / SHARD_MAP_MANAGER ]
);
A

Answer: abfss and Hadoop
Hint: Storage1 requires secure transfers --> enabling secure SSL connections is the default option when provisioning Azure Data Lake Storage Gen2. When a secure TLS/SSL connection is required, you must use abfss.

Reference: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=azure-sqldw-latest&preserve-view=true&tabs=dedicated
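
Assembled with those two selections, the statement becomes (the credential is assumed to exist already):
~~~
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
    LOCATION   = 'abfss://data@newyorktaxidataset.dfs.core.windows.net',  -- abfss = secure transfer
    CREDENTIAL = ADLS_credential,
    TYPE       = HADOOP  -- required for external tables in a dedicated SQL pool
);
~~~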

181
Q

You have an Azure subscription that contains an Azure Synapse Analytics dedicated SQL pool named SQLPool1.

SQLPool1 is currently paused.

You need to restore the current state of SQLPool1 to a new SQL pool.

What should you do first?
A. Create a workspace.
B. Create a user-defined restore point.
C. Resume SQLPool1.
D. Create a new SQL pool.

A

Selected Answer: C
You won't be able to create a restore point while the SQL pool is paused, so the correct answer is C (resume SQLPool1). See below from the Microsoft documentation.

User-defined restore points can also be created through Azure portal.

Sign in to your Azure portal account.

Navigate to the dedicated SQL pool (formerly SQL DW) that you want to create a restore point for.

Select Overview from the left pane, select + New Restore Point. If the New Restore Point button isn’t enabled, make sure that the dedicated SQL pool (formerly SQL DW) isn’t paused.

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-restore-points

182
Q

You are designing an Azure Synapse Analytics workspace.
You need to recommend a solution to provide double encryption of all the data at rest.

Which two components should you include in the recommendation? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. an X.509 certificate
B. an RSA key
C. an Azure virtual network that has a network security group (NSG)
D. an Azure Policy initiative
E. an Azure key vault that has purge protection enabled

A

Correct Answer: BE 🗳️
Synapse workspaces encryption uses existing keys or new keys generated in Azure Key Vault. A single key is used to encrypt all the data in a workspace.
Synapse workspaces support RSA 2048 and 3072 byte-sized keys, and RSA-HSM keys.
The Key Vault itself needs to have purge protection enabled.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/security/workspaces-encryption

183
Q

You have an Azure Synapse Analytics serverless SQL pool named Pool1 and an Azure Data Lake Storage Gen2 account named storage1.

The AllowBlobPublicAccess property is disabled for storage1.

You need to create an external data source that can be used by Azure Active Directory (Azure AD) users to access storage from Pool1.

What should you create first?
A. an external resource pool
B. an external library
C. database scoped credentials
D. a remote service binding

A

Correct Answer: C 🗳️
Security -
User must have SELECT permission on an external table to read the data. External tables access underlying Azure storage using the database scoped credential defined in data source.
Note: A database scoped credential is a record that contains the authentication information that is required to connect to a resource outside SQL Server. Most credentials include a Windows user and password.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql
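
A hedged sketch of the order of operations in the serverless database (the credential name, container placeholder, and the Managed Identity choice are illustrative; a SAS token could be used instead):
~~~
-- A master key must exist before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity
WITH IDENTITY = 'Managed Identity';

-- The external data source references the credential, which is needed because
-- AllowBlobPublicAccess is disabled on storage1.
CREATE EXTERNAL DATA SOURCE Storage1Source
WITH (
    LOCATION   = 'https://storage1.dfs.core.windows.net/<container>',
    CREDENTIAL = WorkspaceIdentity
);
~~~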

184
Q

You have an Azure Data Factory pipeline named Pipeline1. Pipeline1 contains a copy activity that sends data to an Azure Data Lake Storage Gen2 account.
Pipeline1 is executed by a schedule trigger.

You change the copy activity sink to a new storage account and merge the changes into the collaboration branch.

After Pipeline1 executes, you discover that data is NOT copied to the new storage account.

You need to ensure that the data is copied to the new storage account.
What should you do?
A. Publish from the collaboration branch.
B. Create a pull request.
C. Modify the schedule trigger.
D. Configure the change feed of the new storage account.

A

Correct Answer: A 🗳️
CI/CD lifecycle -
1. A development data factory is created and configured with Azure Repos Git. All developers should have permission to author Data Factory resources like pipelines and datasets.
2. A developer creates a feature branch to make a change. They debug their pipeline runs with their most recent changes
3. After a developer is satisfied with their changes, they create a pull request from their feature branch to the main or collaboration branch to get their changes reviewed by peers.
4. After a pull request is approved and changes are merged in the main branch, the changes get published to the development factory.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery

185
Q

You have an Azure Data Factory pipeline named pipeline1 that is invoked by a tumbling window trigger named Trigger1. Trigger1 has a recurrence of 60 minutes.

You need to ensure that pipeline1 will execute only if the previous execution completes successfully.

How should you configure the self-dependency for Trigger1?
A. offset: “-00:01:00” size: “00:01:00”
B. offset: “01:00:00” size: “-01:00:00”
C. offset: “01:00:00” size: “01:00:00”
D. offset: “-01:00:00” size: “01:00:00”

A

Correct Answer: D 🗳️

Correct Answer: D
An offset of "-01:00:00" points the dependency at the window that starts one hour earlier (the previous trigger instance), and a size of "01:00:00" makes the dependency cover that entire one-hour window, so each run starts only after the previous hourly run completes successfully.

Tumbling window self-dependency properties
In scenarios where the trigger shouldn’t proceed to the next window until the preceding window is successfully completed, build a self-dependency. A self- dependency trigger that’s dependent on the success of earlier runs of itself within the preceding hour will have the properties indicated in the following code.
Example code:
{
    "name": "DemoSelfDependency",
    "properties": {
        "runtimeState": "Started",
        "pipeline": {
            "pipelineReference": {
                "referenceName": "Demo",
                "type": "PipelineReference"
            }
        },
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2018-10-04T00:00:00Z",
            "delay": "00:01:00",
            "maxConcurrency": 50,
            "retryPolicy": {
                "intervalInSeconds": 30
            },
            "dependsOn": [
                {
                    "type": "SelfDependencyTumblingWindowTriggerReference",
                    "size": "01:00:00",
                    "offset": "-01:00:00"
                }
            ]
        }
    }
}
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/tumbling-window-trigger-dependency

186
Q

HOTSPOT -
You have an Azure Synapse Analytics pipeline named Pipeline1 that contains a data flow activity named Dataflow1.

Pipeline1 retrieves files from an Azure Data Lake Storage Gen 2 account named storage1.

Dataflow1 uses the AutoResolveIntegrationRuntime integration runtime configured with a core count of 128.

You need to optimize the number of cores used by Dataflow1 to accommodate the size of the files in storage1.

What should you configure? To answer, select the appropriate options in the answer area.

Hot Area:
To Pipeline1, add:
* A custom activity
* A Get Metadata activity
* An If Condition activity

For Dataflow1, set the core count by using:
* Dynamic content
* Parameters
* User properties

A

Box 1: A Get Metadata activity -
Dynamically size data flow compute at runtime
The Core Count and Compute Type properties can be set dynamically to adjust to the size of your incoming source data at runtime. Use pipeline activities like
Lookup or Get Metadata in order to find the size of the source dataset data. Then, use Add Dynamic Content in the Data Flow activity properties.

Box 2: Dynamic content -

Correct :
Use pipeline activities like Lookup or Get Metadata in order to find the size of the source dataset data. Then, use Add Dynamic Content in the Data Flow activity properties. You can choose small, medium, or large compute sizes. Optionally, pick “Custom” and configure the compute types and number of cores manually.

To optimize the number of cores used by Dataflow1 based on the size of the files in storage1:

For Pipeline1, add:
- A Get Metadata activity
This activity helps retrieve metadata about the files in storage1, allowing you to assess their size and properties.

For Dataflow1, set the core count by using:
- Dynamic content
By using dynamic content, you can dynamically adjust the core count of Dataflow1 based on the metadata retrieved from the Get Metadata activity in Pipeline1. This dynamic adjustment will enable you to optimize the core count according to the size or properties of the files present in storage1.

Reference:
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-execute-data-flow-activity

187
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
✑ A workload for data engineers who will use Python and SQL.
✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.
✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:
✑ The data engineers must share a cluster.
✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.
✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs.

Does this meet the goal?
A. Yes
B. No

A

A. Yes
- data engineers: High Concurrency cluster
- jobs: Standard cluster
- data scientists: Standard cluster

188
Q

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:
✑ A workload for data engineers who will use Python and SQL.
✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.
✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:
✑ The data engineers must share a cluster.
✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.
✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster for the jobs.

Does this meet the goal?
A. Yes
B. No

A

High-concurrency clusters do not support Scala. So the answer is still ‘No’ but the reasoning is wrong.
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure

Variations of this question
All instances of this question should be Standard (data scientists), High Concurrency (data engineers), Standard (jobs).

189
Q

You are designing a folder structure for the files in an Azure Data Lake Storage Gen2 account. The account has one container that contains three years of data.

You need to recommend a folder structure that meets the following requirements:
✑ Supports partition elimination for queries by Azure Synapse Analytics serverless SQL pools
✑ Supports fast data retrieval for data from the current month
✑ Simplifies data security management by department

Which folder structure should you recommend?
A. \Department\DataSource\YYYY\MM\DataFile_YYYYMMDD.parquet
B. \DataSource\Department\YYYYMM\DataFile_YYYYMMDD.parquet
C. \DD\MM\YYYY\Department\DataSource\DataFile_DDMMYY.parquet
D. \YYYY\MM\DD\Department\DataSource\DataFile_YYYYMMDD.parquet

A

Correct Answer: A 🗳️
Department top level in the hierarchy to simplify security management.
Month (MM) at the leaf/bottom level to support fast data retrieval for data from the current month.
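
To illustrate the partition-elimination requirement, a serverless SQL pool query over this layout can filter on the folder wildcards with filepath(), so only the matching YYYY/MM folders are scanned (the department, data source, and external data source names below are assumptions):
~~~
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'Sales/WebLogs/*/*/*.parquet',   -- \Department\DataSource\YYYY\MM\DataFile.parquet
    DATA_SOURCE = 'DataLakeSource',       -- hypothetical external data source on the container
    FORMAT = 'PARQUET'
) AS r
WHERE r.filepath(1) = '2024'              -- first wildcard  = YYYY
  AND r.filepath(2) = '05';               -- second wildcard = MM
~~~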

190
Q

link
You have an Azure subscription that contains an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 receives new data once every 24 hours.

You have the following function.
~~~
create function dbo.udfFtoC(@F decimal)
returns decimal
as
begin
    return (@F - 32) * 5.0 / 9
end
~~~

You have the following query.

select avg_date, sensorid, avg_f, dbo.udfFtoC(avg_temperature) as avg_c from SensorTemps
where avg_date = @parameter

The query is executed once every 15 minutes and the @parameter value is set to the current date.

You need to minimize the time it takes for the query to return results.

Which two actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. Create an index on the avg_f column.

B. Convert the avg_c column into a calculated column.

C. Create an index on the sensorid column.

D. Enable result set caching.

E. Change the table distribution to replicate.

A

Correct Answer: BD 🗳️

Selected Answer: BD
B. Convert the avg_c column into a calculated column.
D. Enable result set caching.

Explanation:
A calculated column uses an expression to compute its value from other columns in the same table. In this case, the udfFtoC function can be used to calculate avg_c from the avg_temperature column, eliminating the need to call the UDF in the SELECT statement.

Enabling result set caching can improve query performance by caching the result set of the query, so subsequent queries that use the same parameters can be served from the cache instead of executing the query again.

Creating an index on the avg_f column or the sensorid column is not useful because there are no join or filter conditions on these columns. Changing the table distribution to replicate is also unnecessary because it does not affect query performance in this scenario.

D: When result set caching is enabled, dedicated SQL pool automatically caches query results in the user database for repetitive use. This allows subsequent query executions to get results directly from the persisted cache so recomputation is not needed. Result set caching improves query performance and reduces compute resource usage. In addition, queries using cached results set do not use any concurrency slots and thus do not count against existing concurrency limits.
Incorrect:
Not A, not C: No joins so index not helpful.
Not E: What is a replicated table?
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed. 2 GB is not a hard limit. If the data is static and does not change, you can replicate larger tables.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-result-set-caching
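
For the result set caching part, the setting is enabled per database while connected to master; a minimal sketch (pool name assumed):
~~~
-- Run from the master database of the logical server hosting the dedicated pool.
ALTER DATABASE [Pool1] SET RESULT_SET_CACHING ON;

-- Verify the setting.
SELECT name, is_result_set_caching_on FROM sys.databases;
~~~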

191
Q

You need to design a solution that will process streaming data from an Azure Event Hub and output the data to Azure Data Lake Storage.

The solution must ensure that analysts can interactively query the streaming data.

What should you use?

A. Azure Stream Analytics and Azure Synapse notebooks
B. Structured Streaming in Azure Databricks
C. event triggers in Azure Data Factory
D. Azure Queue storage and read-access geo-redundant storage (RA-GRS)

A

To process streaming data from Azure Event Hub and enable interactive querying for analysts while storing the data in Azure Data Lake Storage, the most appropriate option is:

B. Structured Streaming in Azure Databricks

Azure Databricks, with its capability for Structured Streaming, provides a powerful platform for processing real-time streaming data. It offers a unified analytics platform that integrates with Azure services, allowing the ingestion of data from Azure Event Hubs.

Structured Streaming provides a high-level API for building continuous and interactive data processing pipelines over streaming data. By utilizing Azure Databricks notebooks, analysts can interactively query and analyze the data in real-time as it flows through the pipeline.

This option enables continuous data processing, real-time querying, and seamless integration with Azure Data Lake Storage, making it an effective choice for the specified requirements.

192
Q

You are creating an Apache Spark job in Azure Databricks that will ingest JSON-formatted data.

You need to convert a nested JSON string into a DataFrame that will contain multiple rows.

Which Spark SQL function should you use?

A. explode
B. filter
C. coalesce
D. extract

A

The Spark SQL function used to explode a nested JSON array into multiple rows is:

A. explode

The explode function in Spark SQL is specifically designed to flatten arrays in DataFrames. When applied to a column containing nested JSON arrays, it expands each element of the array into its own row, enabling processing and querying of individual elements within the nested structure.
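
Because explode is also exposed in Spark SQL, here is a minimal sketch (hypothetical table and columns) of turning one nested array into multiple rows:
~~~
-- orders_raw is assumed to have columns: order_id string, items array<struct<sku:string, qty:int>>
SELECT
    order_id,
    item.sku,
    item.qty
FROM orders_raw
LATERAL VIEW explode(items) AS item    -- one output row per element of the items array
~~~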

193
Q

link with imgs
DRAG DROP
You have an Azure subscription that contains an Azure Databricks workspace. The workspace contains a notebook named Notebook1.

In Notebook1, you create an Apache Spark DataFrame named df_sales that contains the following columns:

  • Customer
  • SalesPerson
  • Region
  • Amount

You need to identify the three top performing salespersons by amount for a region named HQ.

How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Values
* agg(col('SalesPerson'))
* filter(col('SalesPerson'))
* groupBy(col('SalesPerson'))
* groupBy(col('TotalAmount'))
* orderBy(col('TotalAmount'))
* orderBy(desc('TotalAmount'))

df_sales.filter(col('Region') == 'HQ').    [XXXXXXXXXX]
.agg(sum('Amount').alias('TotalAmount')).  [XXXXXXXXXX] .limit(3)
A

df_sales.filter(col('Region') == 'HQ')
    .groupBy(col('SalesPerson'))
    .agg(sum('Amount').alias('TotalAmount'))
    .orderBy(desc('TotalAmount'))
    .limit(3)

194
Q

You need to schedule an Azure Data Factory pipeline to execute when a new file arrives in an Azure Data Lake Storage Gen2 container.

Which type of trigger should you use?

A. on-demand
B. tumbling window
C. schedule
D. storage event

A

Correct Answer: D 🗳️

195
Q

DRAG DROP
You have a project in Azure DevOps that contains a repository named Repo1. Repo1 contains a branch named main.

You create a new Azure Synapse workspace named Workspace1.

You need to create data processing pipelines in Workspace1. The solution must meet the following requirements:

  • Pipeline artifacts must be stored in Repo1
  • Source control must be provided for pipeline artifacts.
  • All development must be performed in a feature branch.

Which four actions should you perform in sequence in Synapse Studio? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

Actions
* Create pipeline artifacts and save them in the main branch.
* Set the main branch as the collaboration branch.
* Create a pull request to merge the contents of the main branch into the new branch.
* Create pipeline artifacts and save them in the new branch.
* Create a new branch.
* Configure a code repository and select Repo1.

A

Configure a code repository and select Repo1.
Set the main branch as the collaboration branch.
Create a new branch.
Create pipeline artifacts and save them in the new branch.

196
Q

You have an Azure subscription that contains an Azure SQL database named DB1 and a storage account named storage1. The storage1 account contains a file named File1.txt. File1.txt contains the names of selected tables in DB1.

You need to use an Azure Synapse pipeline to copy data from the selected tables in DB1 to the files in storage1. The solution must meet the following requirements:

  • The Copy activity in the pipeline must be parameterized to use the data in File1.txt to identify the source and destination of the copy.
  • Copy activities must occur in parallel as often as possible.

Which two pipeline activities should you include in the pipeline? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. Get Metadata
B. Lookup
C. ForEach
D. If Condition

A

Selected Answer: BC
B. Lookup: The Lookup activity can be used to read the contents of File1.txt from the storage account. It will retrieve the names of selected tables in DB1 as parameter values for the Copy activity.

C. ForEach: The ForEach activity can be used to iterate over the retrieved table names from File1.txt. Inside the loop, you can configure the Copy activity with the source and destination information based on the current table name.

197
Q

Hard
You have an Azure data factory that connects to a Microsoft Purview account. The data factory is registered in Microsoft Purview.

You update a Data Factory pipeline.

You need to ensure that the updated lineage is available in Microsoft Purview.

What should you do first?

A. Disconnect the Microsoft Purview account from the data factory.
B. Execute the pipeline.
C. Execute an Azure DevOps build pipeline.
D. Locate the related asset in the Microsoft Purview portal.

A

To ensure the updated lineage information from the Data Factory pipeline is available in Microsoft Purview, you should execute the pipeline first (option B). Executing the pipeline triggers the data movement or processing defined within it. This action generates or updates the lineage information associated with the pipeline activities, allowing these changes to reflect in the connected Microsoft Purview account.

198
Q

You have a Microsoft Purview account.

The Lineage view of a CSV file is shown in the following exhibit.

SEE SITE FOR IMG

How is the data for the lineage populated?

A. manually
B. by scanning data stores
C. by executing a Data Factory pipeline

A

Answer is C
Find reason here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-push-lineage-to-purview#run-pipeline-and-push-lineage-data-to-microsoft-purview

The answer is also displayed on the top right corner of the image displayed.

199
Q

You have an Azure subscription that contains a Microsoft Purview account named MP1, an Azure data factory named DF1, and a storage account named storage1. MP1 is configured to scan storage1. DF1 is connected to MP1 and contains a dataset named DS1. DS1 references a file in storage1.

In DF1, you plan to create a pipeline that will process data from DS1.

You need to review the schema and lineage information in MP1 for the data referenced by DS1.

Which two features can you use to locate the information? Each correct answer presents a complete solution.

NOTE: Each correct answer is worth one point.
A. the search bar in the Microsoft Purview governance portal
B. the Storage browser of storage1 in the Azure portal
C. the search bar in the Azure portal
D. the search bar in Azure Data Factory Studio

A

Going against the discussion with this answer:
To review the schema and lineage information in Microsoft Purview for the data referenced by DS1 in DF1, you can use the following:

A. The search bar in the Microsoft Purview governance portal is a suitable feature to locate and review the schema and lineage information associated with the data referenced by DS1. This portal provides extensive search capabilities for discovering and exploring metadata, including lineage.

B. The Storage browser of storage1 in the Azure portal primarily deals with storage account-related management and browsing data within the storage account. While it allows access to the files and containers within storage1, it might not provide detailed lineage and schema information available in Microsoft Purview.

Therefore, options A (the search bar in the Microsoft Purview governance portal) and B (the Storage browser of storage1 in the Azure portal) are the features you can use to locate the schema and lineage information for the data referenced by DS1 in MP1.

200
Q

HOTSPOT

You have an Azure Blob storage account that contains a folder. The folder contains 120,000 files. Each file contains 62 columns.

Each day, 1,500 new files are added to the folder.

You plan to incrementally load five data columns from each new file into an Azure Synapse Analytics workspace.

You need to minimize how long it takes to perform the incremental loads.

What should you use to store the files and in which format? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Storage:
* Multiple blob storage accounts
* Multiple containers in the blob storage account
* Timeslice partitioning in the folders

Format:
* Apache Parquet
* CSV
* JSON

A

Storage: Timeslice partitioning in the folders
Format: Apache Parquet

Time partitioning is correct as the fastest way to load only new files, but requires that the timeslice information be part of the file or folder name (https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview)

However, Parquet is the correct file format since it’s a columnar format

For minimizing the load time and optimizing the incremental loads for an Azure Synapse Analytics workspace, you should consider using the following configurations:

Storage:

Multiple containers in the blob storage account: This allows for better organization and management of files. It doesn’t necessarily speed up the load process but contributes to better organization, which can indirectly impact performance.
Timeslice partitioning in the folders: This partitioning strategy helps in organizing data by time, which can significantly speed up incremental loads by enabling the system to efficiently locate and ingest only the new or changed data.
Format:

Apache Parquet: Parquet is a columnar storage format that offers great performance benefits, especially for analytics workloads. It provides efficient compression and speeds up query performance by minimizing the data read during query execution. This format is ideal for incremental loads and analytics operations.
Therefore, the optimal choices for storage and format to minimize load time and optimize incremental loads would be timeslice partitioning in the folders and Apache Parquet.
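As a rough illustration of why this combination keeps the incremental load small, a Synapse Spark notebook (using the notebook's built-in spark session) could read only the newest timeslice folder and project just the five needed columns; the storage path, folder layout, and column names below are placeholders:

# Assumed folder layout: /landing/year=YYYY/month=MM/day=DD/*.parquet
daily = (spark.read
         .parquet("abfss://data@storageacct.dfs.core.windows.net/landing/year=2024/month=05/day=01/")
         .select("Col1", "Col2", "Col3", "Col4", "Col5"))   # only 5 of the 62 columns are read from disk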

201
Q

link

You are batch loading a table in an Azure Synapse Analytics dedicated SQL pool.

You need to load data from a staging table to the target table. The solution must ensure that if an error occurs while loading the data to the target table, all the inserts in that batch are undone.

How should you complete the Transact-SQL code? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Values
* BEGIN DISTRIBUTED TRANSACTION
* BEGIN TRAN
* COMMIT TRAN
* ROLLBACK TRAN
* SET RESULT_SET_CACHING ON

[XXXXXXXXXXX]
BEGIN TRY
     INSERT INTO dbo.Table1 (col1, col2, col3)
     SELECT col1, col2, col3 FROM stage.Table1;
END TRY
BEGIN CATCH
     IF @@TRANCOUNT > 0
     BEGIN
          [XXXXXXXXXXX]
     END
END CATCH;
IF @@TRANCOUNT > 0
BEGIN
     COMMIT TRAN;
END
A

BEGIN TRAN
ROLLBACK TRAN

The given answer is wrong. It should be BEGIN TRAN, because SQL pool in Azure Synapse Analytics does not support distributed transactions.

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-transactions

“Limitations

SQL pool does have a few other restrictions that relate to transactions.

They are as follows:

No distributed transactions
No nested transactions permitted
No save points allowed
No named transactions
No marked transactions
No support for DDL such as CREATE TABLE inside a user-defined transaction”

Distributed Transactions are only allowed in SQL Server and Azure SQL Managed Instance:

https://learn.microsoft.com/de-de/sql/t-sql/language-elements/begin-distributed-transaction-transact-sql?view=sql-server-ver16
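Putting the two selected values into the question's skeleton gives the complete pattern:

BEGIN TRAN;
BEGIN TRY
     INSERT INTO dbo.Table1 (col1, col2, col3)
     SELECT col1, col2, col3 FROM stage.Table1;
END TRY
BEGIN CATCH
     IF @@TRANCOUNT > 0
     BEGIN
          ROLLBACK TRAN;   -- undoes every insert in the failed batch
     END
END CATCH;
IF @@TRANCOUNT > 0
BEGIN
     COMMIT TRAN;
END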

202
Q

Hard
HOTSPOT

You have two Azure SQL databases named DB1 and DB2.

DB1 contains a table named Table1. Table1 contains a timestamp column named LastModifiedOn. LastModifiedOn contains the timestamp of the most recent update for each individual row.

DB2 contains a table named Watermark. Watermark contains a single timestamp column named WatermarkValue.

You plan to create an Azure Data Factory pipeline that will incrementally upload into Azure Blob Storage all the rows in Table1 for which the LastModifiedOn column contains a timestamp newer than the most recent value of the WatermarkValue column in Watermark.

You need to identify which activities to include in the pipeline. The solution must meet the following requirements:

  • Minimize the effort to author the pipeline.
  • Ensure that the number of data integration units allocated to the upload operation can be controlled.

What should you identify? To answer, select the appropriate options in the answer area.

NOTE: Each correct answer is worth one point.

To retrieve the watermark value, use:
* Filter
* Get Metadata
* Lookup

To perform the upload, use:
* Copy data
* Custom
* Data flow

A

For retrieving the watermark value, you should use the Lookup activity to fetch the most recent WatermarkValue from the Watermark table in DB2.

For performing the upload based on the watermark value, you should use the Copy data activity. This activity will handle the incremental loading of data from Table1 in DB1 to Azure Blob Storage, utilizing the watermark value retrieved from the Lookup activity.

So, the identified activities are:

To retrieve the watermark value: Lookup
To perform the upload: Copy data

203
Q

HARD HARD Question
HOTSPOT

You have an Azure Synapse serverless SQL pool.

You need to read JSON documents from a file by using the OPENROWSET function.

How should you complete the query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

SELECT *
FROM OPENROWSET
(
      BULK 'https://sourcedatalake.blob.core.windows.net/public/docs.json',
      FORMAT = [ 'CSV' / 'DELTA' / 'JSON' / 'PARQUET' ],
      FIELDTERMINATOR = '0x0b',
      FIELDQUOTE = [ '0x09' / '0x0a' / '0x0b' / '0x0c' ],
      ROWTERMINATOR = '0x0b'
)
WITH (jsondoc nvarchar(max)) AS JsonDocuments
A

Correct. It looks odd, but the documented approach is to read the JSON file as CSV, with 0x0b used for both FIELDTERMINATOR and FIELDQUOTE.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files
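A sketch of the completed query, following the pattern in the linked documentation; each JSON document lands in a single nvarchar(max) column:

SELECT *
FROM OPENROWSET(
    BULK 'https://sourcedatalake.blob.core.windows.net/public/docs.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
) WITH (jsondoc nvarchar(max)) AS JsonDocuments;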

204
Q

hard
You use Azure Data Factory to create data pipelines.

You are evaluating whether to integrate Data Factory and GitHub for source and version control.

What are two advantages of the integration? Each correct answer presents a complete solution.

NOTE: Each correct selection is worth one point.

A. additional triggers
B. lower pipeline execution times
C. the ability to save without publishing
D. the ability to save pipelines that have validation issues

A

C. the ability to save without publishing
D. the ability to save pipelines that have validation issues
When you integrate Data Factory and GitHub, you can save your pipelines to a GitHub repository without publishing them to Azure. This allows you to work on your pipelines in a development environment and then publish them to Azure when you are ready.

You can also save pipelines that have validation issues. This is because GitHub does not validate your pipelines when you save them. This allows you to work on your pipelines and fix the validation issues before you publish them to Azure.

205
Q

You have an Azure Synapse Analytics workspace named Workspace1.

You perform the following changes:
* Implement source control for Workspace1.
* Create a branch named Feature based on the collaboration branch.
* Switch to the Feature branch.
* Modify Workspace1.

You need to publish the changes to Azure Synapse.

From which branch should you perform each change? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point

Branches
* Collaboration
* Publish
* Feature

Create a pull request: [xxxxxxxxx]

Publish the changes: [xxxxxxxxx]

A

Create a pull request: Feature

Publish the changes: Collaboration

206
Q

HARD
You have two Azure Blob Storage accounts named account1 and account2.

You plan to create an Azure Data Factory pipeline that will use scheduled intervals to replicate newly created or modified blobs from account1 to account2.

You need to recommend a solution to implement the pipeline. The solution must meet the following requirements:
* Ensure that the pipeline only copies blobs that were created or modified since the most recent replication event.
* Minimize the effort to create the pipeline.

What should you recommend?

A. Run the Copy Data tool and select Metadata-driven copy task.
B. Create a pipeline that contains a Data Flow activity.
C. Create a pipeline that contains a flowlet.
D. Run the Copy Data tool and select Built-in copy task.

A

Selected Answer: D
Just use Built-in copy task, according to: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-lastmodified-copy-data-tool

207
Q

You have an Azure Data Factory pipeline named pipeline1 that contains a data flow activity named activity1.

You need to run pipeline1.

Which runtime will be used to run activity1?

A. Azure Integration runtime
B. Self-hosted integration runtime
C. SSIS integration runtime

A

The runtime used to execute the Data Flow activity within an Azure Data Factory pipeline is the Azure Integration Runtime (IR). This runtime environment is provided by Azure Data Factory and is responsible for executing the data transformation logic defined in the Data Flow activity. So, the correct answer is:

A. Azure Integration runtime

208
Q

HOTSPOT
You have an Azure subscription that contains an Azure Synapse Analytics workspace named workspace1. Workspace1 contains a dedicated SQL pool named SQLPool1 and an Apache Spark pool named sparkpool1. Sparkpool1 contains a DataFrame named pyspark_df.

You need to write the contents of pyspark_df to a table in SQLPool1 by using a PySpark notebook.

How should you complete the code? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

pyspark_df.createOrReplaceTempView("pysparkdftemptable")
[ %%local / %%spark / %%sql ]
val scala_df = spark.sqlContext.sql("select * from pysparkdftemptable")
scala_df.write.  [ jdbc / saveAsTable / synapsesql ]  ("sqlpool1.dbo.PySparkTable", Constants.INTERNAL)
A

Box 1: %%spark (the cell contains Scala code)
Box 2: synapsesql, giving scala_df.write.synapsesql("sqlpool1.dbo.PySparkTable", Constants.INTERNAL)

209
Q

hard
You have an Azure data factory named ADF1 and an Azure Synapse Analytics workspace that contains a pipeline named SynPipeLine1. SynPipeLine1 includes a Notebook activity.

You create a pipeline in ADF1 named ADFPipeline1.

You need to invoke SynPipeLine1 from ADFPipeline1.

Which type of activity should you use?

A. Web
B. Spark
C. Custom
D. Notebook

A

Correct Answer: A 🗳️

Selected Answer: A
Web Activity
https://learn.microsoft.com/en-us/azure/data-factory/solution-template-synapse-notebook

210
Q

Hard Hard - no clear answer
HOTSPOT
You have an Azure data factory that contains the linked service shown in the following exhibit.

image

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct answer is worth one point.

When working in a feature branch, changes to the linked service will be published to the live service
* upon publishing the changes
* upon saving the changes
* when the changes are merged into the collaboration branch

A Copy activity that uses the linked service as the source will perform the Copy activity
* in the region of the data factory
* in the region of the selected external compute
* in the region of the source database

A

Debated - mostly confusing discussion - going against chat.
Answer not clear after investigation.

When working in a feature branch, changes to the linked service will be published to the live service

upon saving the changes

A Copy activity that uses the linked service as the source will perform the Copy activity

in the region of the source database

SOME CHAT SUGGEST
1. upon publishing changes to the service
2. in the region of data factory

According to Microsoft, AutoResolveIntegrationRuntime will attempt to use the sink location to get an IR in the same region (or the closest available) to execute the Copy activity, not the source location. I would go with the region of data factory, since that is the default option when the sink’s location is not detectable. Source: https://learn.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime#azure-ir-location

211
Q

In Azure Data Factory, you have a schedule trigger that is scheduled in Pacific Time.

Pacific Time observes daylight saving time.

The trigger has the following JSON file.

{
    "name": "Trigger 1",
    "properties": {
        "annotations": [],
        "runtimeState": "Started",
        "pipelines": [],
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Week",
                "interval": 1,
                "startTime": "2022-08-05T04:00:00",
                "timeZone": "Pacific Standard Time",
                "schedule": {
                    "minutes": [0],
                    "hours": [3, 21],
                    "weekDays": [
                        "Sunday",
                        "Saturday"
                    ]
                }
            }
        }
    }
}

Use the drop-down menus to select the answer choice that completes each statement based on the information presented.

NOTE: Each correct selection is worth one point.

The trigger will execute [answer choice] on Sunday, March 3, 2024.
* one time
* two times
* zero times

The trigger [answer choice] daylight saving time.
* is unaffected by
* will automatically adjust for
* will require an adjustment for

A

1. two times
2. will automatically adjust for
“For time zones that observe daylight saving, trigger time will auto-adjust for the twice a year change, if the recurrence is set to Days or above. To opt out of the daylight saving change, please select a time zone that does not observe daylight saving, for instance UTC.”

https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-schedule-trigger?tabs=data-factory#azure-data-factory-and-synapse-portal-experience

212
Q

link
You have an Azure Synapse Analytics dedicated SQL pool.

You need to create a pipeline that will execute a stored procedure in the dedicated SQL pool and use the returned result set as the input for a downstream activity. The solution must minimize development effort.

Which type of activity should you use in the pipeline?

A. U-SQL
B. Stored Procedure
C. Script
D. Notebook

A

Going against chat - could be B or C.

To execute a stored procedure in an Azure Synapse Analytics dedicated SQL pool and use the returned result set as input for a downstream activity, you should use the Stored Procedure activity in the pipeline. This activity is specifically designed to execute stored procedures within the dedicated SQL pool, making it the appropriate choice for this scenario.

CHAT CLAIMS
Selected Answer: C
For me the correct answer is C.
The store procedure activity doesn’t return any data.
In the description of the script activity is written that it can be used for : “Run stored procedures. If the SQL statement invokes a stored procedure that returns results from a temporary table, use the WITH RESULT SETS option to define metadata for the result set. “
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-script

213
Q

You have an Azure SQL database named DB1 and an Azure Data Factory data pipeline named pipeline1.

From Data Factory, you configure a linked service to DB1.

In DB1, you create a stored procedure named SP1. SP1 returns a single row of data that has four columns.

You need to add an activity to pipeline1 to execute SP1. The solution must ensure that the values in the columns are stored as pipeline variables.

Which two types of activities can you use to execute SP1? Each correct answer presents a complete solution.

NOTE: Each correct selection is worth one point.

A. Script
B. Copy
C. Lookup
D. Stored Procedure

A

Disputed quite a lot; going with ChatGPT and 5/11 of the chat.

To execute a stored procedure (SP1) in Azure SQL Database and store the result set into pipeline variables, you can use the Lookup activity or the Stored Procedure activity in Azure Data Factory.

Both activities can be employed to run stored procedures. However, the Lookup activity is ideal when you want to retrieve a single row or a single value, which aligns with your scenario of obtaining a single row with four columns. Additionally, the Lookup activity can efficiently fetch the values from the stored procedure and store them as pipeline variables for further use in subsequent activities.

The Stored Procedure activity, on the other hand, can also execute the stored procedure, but it’s primarily used for performing actions in the SQL Database without returning results that are stored as pipeline variables. Hence, for your specific requirement of storing values as pipeline variables, the Lookup activity is the more suitable choice.

214
Q

You have an Azure data factory named ADF1.

You currently publish all pipeline authoring changes directly to ADF1.

You need to implement version control for the changes made to pipeline artifacts. The solution must ensure that you can apply version control to the resources currently defined in the Azure Data Factory Studio for ADF1.

Which two actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. From the Azure Data Factory Studio, run Publish All.
B. Create an Azure Data Factory trigger.
C. Create a Git repository.
D. Create a GitHub action.
E. From the Azure Data Factory Studio, select Set up code repository.
F. From the Azure Data Factory Studio, select Publish.

A

To implement version control for the changes made to pipeline artifacts in Azure Data Factory (ADF1), you can perform the following actions:

C. Create a Git repository: Set up a Git repository where you’ll manage version control for your Azure Data Factory resources. This will serve as the central repository to store and track changes.

E. From the Azure Data Factory Studio, select Set up code repository: Within Azure Data Factory Studio, you can set up the link between your Data Factory and the Git repository you’ve created. This allows you to connect your Data Factory to the Git repository for version control.

These actions enable you to establish version control for your Azure Data Factory resources by connecting it to a Git repository, thereby allowing you to track changes, manage branches, and collaborate on changes with your team while maintaining version history for your artifacts.

215
Q

You have an Azure data factory named ADF1 that contains a pipeline named Pipeline1.

Pipeline1 must execute every 30 minutes with a 15-minute offset.

You need to create a trigger for Pipeline1. The trigger must meet the following requirements:

  • Backfill data from the beginning of the day to the current time.
  • If Pipeline1 fails, ensure that the pipeline can re-execute within the same 30-minute period.
  • Ensure that only one concurrent pipeline execution can occur.
  • Minimize development and configuration effort.

Which type of trigger should you create?

A. schedule
B. event-based
C. manual
D. tumbling window

A

For the requirements you’ve outlined, a Tumbling Window Trigger in Azure Data Factory would be the most appropriate choice:

D. Tumbling Window

Here’s why:

Backfill data from the beginning of the day to the current time: Tumbling window triggers can specify a start and end date/time. You can set it up to start at the beginning of the day and execute until the current time.

Re-execution on failure within the same 30-minute period: Tumbling window triggers allow you to define a recurrence interval. If the pipeline fails, it can re-execute within the same window, adhering to the specified schedule.

Ensure only one concurrent execution: You can configure a tumbling window trigger to allow only one concurrent run. This ensures that at any given time, only one instance of the pipeline runs.

Minimize development and configuration effort: Tumbling window triggers offer flexibility in defining recurring schedules and handling re-execution upon failure. This minimizes the need for custom logic or extensive configurations outside of the trigger definition.

216
Q

You have an Azure Data Lake Storage Gen2 account named account1 and an Azure event hub named Hub1. Data is written to account1 by using Event Hubs Capture.

You plan to query account1 by using an Apache Spark pool in Azure Synapse Analytics.

You need to create a notebook and ingest the data from account1. The solution must meet the following requirements:

  • Retrieve multiple rows of records in their entirety.
  • Minimize query execution time.
  • Minimize data processing.

Which data format should you use?

A. Parquet
B. Avro
C. ORC
D. JSON

A

Correct Answer: A 🗳️

For querying data efficiently in an Apache Spark pool within Azure Synapse Analytics while focusing on minimizing query execution time and data processing, Parquet is the most suitable format among the options provided.

Here’s why:

Efficient Query Performance: Parquet is known for its columnar storage format, making it highly optimized for analytical queries. This format stores data in columnar fashion, which allows for skipping irrelevant data quickly and performing more targeted reads during queries.

Compression and Encoding: Parquet uses compression and encoding techniques, reducing the data size on disk. Smaller data size means less I/O overhead, which directly contributes to faster query execution.

Retrieval of Entire Records: Parquet’s columnar storage doesn’t prevent the retrieval of entire records. It allows efficient access to multiple rows and retrieves the required columns’ data in its entirety.

Considering these factors, Parquet aligns well with the requirements for minimizing query execution time and data processing while allowing the retrieval of multiple rows of records efficiently.

217
Q

You have an Azure Blob Storage account named blob1 and an Azure Data Factory pipeline named pipeline1.

You need to ensure that pipeline1 runs when a file is deleted from a container in blob1. The solution must minimize development effort.

Which type of trigger should you use?

A. schedule
B. storage event
C. tumbling window
D. custom event

A

Correct Answer: B 🗳️

The trigger type you need to use to initiate pipeline execution upon a file deletion event from a container in Azure Blob Storage is storage event.

Using a storage event trigger in Azure Data Factory allows you to monitor changes within your Azure Blob Storage containers, such as file deletions, and initiate pipeline executions in response to these specific events. This choice minimizes development effort as it directly responds to changes within the storage, triggering the pipeline without the need for custom coding or complex configurations.

218
Q

You have Azure Data Factory configured with Azure Repos Git integration. The collaboration branch and the publish branch are set to the default values.

You have a pipeline named pipeline1.

You build a new version of pipeline1 in a branch named feature1.

From the Data Factory Studio, you select Publish.

The source code of which branch will be built, and which branch will contain the output of the Azure Resource Manager (ARM) template? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Source code:
* adf_publish
* feature1
* main

ARM template output:
* adf_publish
* feature1
* main

A

Source code: main
ARM template output: adf_publish

219
Q

DRAG DROP -
You have an Azure Active Directory (Azure AD) tenant that contains a security group named Group1. You have an Azure Synapse Analytics dedicated SQL pool named dw1 that contains a schema named schema1.

You need to grant Group1 read-only permissions to all the tables and views in schema1. The solution must use the principle of least privilege.

Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.

Select and Place:

Actions
* Create a database role named Role1 and grant Role1 SELECT permissions to schema1.
* Create a database role named Role1 and grant Role1 SELECT permissions to dw1.
* Assign the Azure role-based access control (Azure RBAC) Reader role for dw1 to Group1.
* Create a database user in dw1 that represents Group1 and uses the FROM EXTERNAL PROVIDER clause.
* Assign Role1 to the Group1 database user.

A

Here’s one possible sequence:

  1. Create a database user in dw1 that represents Group1 and uses the FROM EXTERNAL PROVIDER clause.
  2. Create a database role named Role1 and grant Role1 SELECT permissions to schema1.
  3. Assign Role1 to the Group1 database user.

This sequence involves creating a role granting read-only access to the specific schema, creating a user representing the Azure AD group, and finally assigning the role to that user to enable read access to the specified schema for the group.
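A T-SQL sketch of the three steps, run against dw1 (Group1, Role1, and schema1 come from the question):

-- 1. Create a database user for the Azure AD group
CREATE USER [Group1] FROM EXTERNAL PROVIDER;

-- 2. Create a role and grant it SELECT on schema1 only (least privilege)
CREATE ROLE Role1;
GRANT SELECT ON SCHEMA::schema1 TO Role1;

-- 3. Add the group's database user to the role
ALTER ROLE Role1 ADD MEMBER [Group1];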

220
Q

HOTSPOT -
You have an Azure subscription that contains a logical Microsoft SQL server named Server1. Server1 hosts an Azure Synapse Analytics SQL dedicated pool named Pool1.

You need to recommend a Transparent Data Encryption (TDE) solution for Server1.

The solution must meet the following requirements:
✑ Track the usage of encryption keys.
✑ Maintain the access of client apps to Pool1 in the event of an Azure datacenter outage that affects the availability of the encryption keys.

What should you include in the recommendation? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

To track encryption key usage:
* Always Encrypted
* TDE with customer-managed keys
* TDE with platform-managed keys

To maintain client app access in the event of a datacenter outage:
* Create and configure Azure key vaults in two Azure regions.
* Enable Advanced Data Security on Server1.
* Implement the client apps by using a Microsoft .NET Framework data provider.

A

Here are the correct choices for your requirements:

To track encryption key usage:
* TDE with customer-managed keys

To maintain client app access in the event of a datacenter outage:
* Create and configure Azure key vaults in two Azure regions.

These selections ensure encryption key tracking and provide control over encryption keys while maintaining client app access across regions in the event of a datacenter outage.

221
Q

#3

tough
You plan to create an Azure Synapse Analytics dedicated SQL pool.

You need to minimize the time it takes to identify queries that return confidential information as defined by the company’s data privacy regulations and the users who executed the queries.

Which two components should you include in the solution? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. sensitivity-classification labels applied to columns that contain confidential information
B. resource tags for databases that contain confidential information
C. audit logs sent to a Log Analytics workspace
D. dynamic data masking for columns that contain confidential information

A

Correct Answer: AC 🗳️

To efficiently identify queries returning confidential information and users executing them, these components should be part of your solution:

A. Sensitivity-classification labels applied to columns containing confidential information - Helps to mark and identify sensitive data.

C. Audit logs sent to a Log Analytics workspace - Allows monitoring and tracking query execution and the users who ran them, aiding in compliance and auditing.

These components ensure the sensitive data is identified and audited, assisting in tracking user access and complying with data privacy regulations.

A: You can classify columns manually, as an alternative or in addition to the recommendation-based classification:

  1. Select Add classification in the top menu of the pane.
  2. In the context window that opens, select the schema, table, and column that you want to classify, and the information type and sensitivity label.
  3. Select Add classification at the bottom of the context window.

C: An important aspect of the information-protection paradigm is the ability to monitor access to sensitive data. Azure SQL Auditing has been enhanced to include a new field in the audit log called data_sensitivity_information. This field logs the sensitivity classifications (labels) of the data that was returned by a query.

Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/data-discovery-and-classification-overview
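If you prefer T-SQL over the portal recommendation flow, a label can also be added directly; a sketch with a hypothetical table and column, assuming the pool supports the ADD SENSITIVITY CLASSIFICATION statement:

-- dbo.Customers and Email are hypothetical names for illustration
ADD SENSITIVITY CLASSIFICATION TO dbo.Customers.Email
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');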

222
Q

You are designing an enterprise data warehouse in Azure Synapse Analytics that will contain a table named Customers. Customers will contain credit card information.

You need to recommend a solution to provide salespeople with the ability to view all the entries in Customers. The solution must prevent all the salespeople from viewing or inferring the credit card information.

What should you include in the recommendation?

A. data masking
B. Always Encrypted
C. column-level security
D. row-level security

A

Correct Answer: C 🗳️
Column-level security simplifies the design and coding of security in your application, allowing you to restrict column access to protect sensitive data.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/column-level-security
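A minimal sketch of column-level security with GRANT; every column and role name here is hypothetical except the Customers table from the question:

GRANT SELECT ON dbo.Customers (CustomerID, FirstName, LastName, City) TO SalesRole;
-- The credit card column is omitted from the column list, so members of SalesRole cannot read or infer it.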

223
Q

You develop data engineering solutions for a company.

A project requires the deployment of data to Azure Data Lake Storage.
You need to implement role-based access control (RBAC) so that project members can manage the Azure Data Lake Storage resources.

Which three actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. Create security groups in Azure Active Directory (Azure AD) and add project members.
B. Configure end-user authentication for the Azure Data Lake Storage account.
C. Assign Azure AD security groups to Azure Data Lake Storage.
D. Configure Service-to-service authentication for the Azure Data Lake Storage account.
E. Configure access control lists (ACL) for the Azure Data Lake Storage account.

A

Correct Answer: ACE 🗳️

To implement role-based access control (RBAC) for Azure Data Lake Storage resources so that project members can manage them, you should perform the following actions:

A. Create security groups in Azure Active Directory (Azure AD) and add project members to these groups. This allows easy management of permissions by assigning roles to these groups rather than individual users.

C. Assign Azure AD security groups to Azure Data Lake Storage. Grant appropriate permissions (such as Read, Write, or Manage) by assigning roles to these security groups at the Azure Data Lake Storage level.

E. Configure access control lists (ACL) for the Azure Data Lake Storage account. ACLs are used to set granular permissions on individual files and folders within the data lake storage, allowing specific access control.

So, the correct actions are A, C, and E to properly set up RBAC for managing Azure Data Lake Storage resources.

AC: Create security groups in Azure Active Directory. Assign users or security groups to Data Lake Storage Gen1 accounts.
E: Assign users or security groups as ACLs to the Data Lake Storage Gen1 file system
Reference:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-secure-data

224
Q

tough
You have an Azure Data Factory version 2 (V2) resource named Df1. Df1 contains a linked service.

You have an Azure Key vault named vault1 that contains an encryption key named key1.

You need to encrypt Df1 by using key1.

What should you do first?
A. Add a private endpoint connection to vault1.
B. Enable Azure role-based access control on vault1.
C. Remove the linked service from Df1.
D. Create a self-hosted integration runtime.

A

Correct Answer: C 🗳️

Yes, the answer is C, which involves removing the linked service from Df1.

The reasoning is that customer-managed key encryption can be configured only on an empty data factory: the factory must contain no entities such as linked services, pipelines, or datasets before a Key Vault key can be set as its encryption key. Because Df1 already contains a linked service, removing that linked service is the first step before key1 can be used to encrypt Df1.

I believe this is correct, based on the question: What should you do FIRST?
A DF needs to be empty to be encrypted: https://docs.microsoft.com/en-us/azure/data-factory/enable-customer-managed-key#post-factory-creation-in-data-factory-ui
So FIRST we need to empty the DF - then we can move on.

Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources.
Incorrect Answers:
D: A self-hosted integration runtime copies data between an on-premises store and cloud storage.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/enable-customer-managed-key https://docs.microsoft.com/en-us/azure/data-factory/concepts-linked-services https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime

225
Q

You are designing an Azure Synapse Analytics dedicated SQL pool.
You need to ensure that you can audit access to Personally Identifiable Information (PII).

What should you include in the solution?
A. column-level security
B. dynamic data masking
C. row-level security (RLS)
D. sensitivity classifications

A

Correct Answer: D 🗳️

In the context of auditing access to Personally Identifiable Information (PII), the suitable feature to include in the solution is D. sensitivity classifications.

Sensitivity classifications help in identifying and categorizing data based on its sensitivity level. They allow you to label columns or tables in your Azure Synapse Analytics dedicated SQL pool as containing PII or other sensitive data types. This enables you to track access and take measures to protect the data, aligning with compliance and data privacy requirements. While the other options (column-level security, dynamic data masking, row-level security) offer data protection mechanisms, sensitivity classifications are specifically designed to identify and manage sensitive data for auditing purposes.

Data Discovery & Classification is built into Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics. It provides basic capabilities for discovering, classifying, labeling, and reporting the sensitive data in your databases.
Your most sensitive data might include business, financial, healthcare, or personal information. Discovering and classifying this data can play a pivotal role in your organization’s information-protection approach. It can serve as infrastructure for:
✑ Helping to meet standards for data privacy and requirements for regulatory compliance.
✑ Various security scenarios, such as monitoring (auditing) access to sensitive data.
✑ Controlling access to and hardening the security of databases that contain highly sensitive data.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/data-discovery-and-classification-overview

226
Q

link
HOTSPOT -

You have an Azure subscription that contains an Azure Data Lake Storage account. The storage account contains a data lake named DataLake1.

You plan to use an Azure data factory to ingest data from a folder in DataLake1, transform the data, and land the data in another folder.

You need to ensure that the data factory can read and write data from any folder in the DataLake1 container. The solution must meet the following requirements:

  • Minimize the risk of unauthorized user access.
  • Use the principle of least privilege.
  • Minimize maintenance effort.

How should you configure access to the storage account for the data factory? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Use
[ Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra / a shared access signature (SAS) / a shared key]
to authenticate by using
[ a managed identity. / a stored access policy. / an Authorization header.]

A

To configure access to the storage account for the Azure Data Factory in compliance with the specified requirements:

Use Microsoft Azure Active Directory (Azure AD) to authenticate by using a managed identity.

This setup aligns with the principle of least privilege by employing Azure AD authentication, ensuring secure and controlled access to the data lake resources. Additionally, utilizing a managed identity helps in reducing maintenance efforts by handling the credentials automatically and securely within the Azure environment.

227
Q

#9
HOTSPOT -
You are designing an Azure Synapse Analytics dedicated SQL pool.
Groups will have access to sensitive data in the pool as shown in the following table.

Name             | Enhanced access               
-----------------|-------------------------------
Executives       | No access to sensitive data    
Analysts         | Access to in-region sensitive data 
Engineers        | Access to all numeric sensitive data 

You have policies for the sensitive data. The policies vary by region as shown in the following table.

Region    | Data considered sensitive    
----------|-----------------------------
RegionA   | Financial, Personally Identifiable Information (PII)   
RegionB   | Financial, Personally Identifiable Information (PII), medical   
RegionC   | Financial, medical  

You have a table of patients for each region. The tables contain the following potentially sensitive columns.

Sensitive data   | Description                                  | Name
-----------------|----------------------------------------------|-------------
Financial        | Debit/credit card number for charges         | CardOnFile
Medical          | Patient's height in cm                       | Height
PII              | Email address for secure communications      | ContactEmail

You are designing dynamic data masking to maintain compliance.
For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.
Hot Area:

Analysts in RegionA require dynamic data masking rules for [Patients_RegionA]. YES/NO

Engineers in RegionC require a dynamic data masking rule for [Patients_RegionA], [Height] YES/NO

Engineers in RegionB require a dynamic data masking rule for [Patients_RegionB], [Height] YES/NO

A

For the Analysts in RegionA:

Dynamic data masking rules need to be applied for [Patients_RegionA]. This would be a YES as Analysts in RegionA need access to in-region sensitive data, which includes financial and PII data.

For the Engineers in RegionC:

Engineers in RegionC require a dynamic data masking rule for [Patients_RegionA], [Height]. This would be a NO as RegionC does not consider height as sensitive data.

For the Engineers in RegionB:

Engineers in RegionB require a dynamic data masking rule for [Patients_RegionB], [Height]. This would be a YES as RegionB considers height as sensitive data alongside financial and PII information.

228
Q

DRAG DROP -
You have an Azure Synapse Analytics SQL pool named Pool1 on a logical Microsoft SQL server named Server1.

You need to implement Transparent Data Encryption (TDE) on Pool1 by using a custom key named key1.

Which five actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

Select and Place:

  • Enable TDE on Pool1.
  • Assign a managed identity to Server1.
  • Configure key1 as the TDE protector for Server1.
  • Add key1 to the Azure key vault.
  • Create an Azure key vault and grant the managed identity permissions to the key vault.
A

Step 1: Assign a managed identity to Server1
You will need an existing Managed Instance as a prerequisite.
Step 2: Create an Azure key vault and grant the managed identity permissions to the vault
Create Resource and setup Azure Key Vault.
Step 3: Add key1 to the Azure key vault
The recommended way is to import an existing key from a .pfx file or get an existing key from the vault. Alternatively, generate a new key directly in Azure Key
Vault.
Step 4: Configure key1 as the TDE protector for Server1. Provide the TDE protector key.

Step 5: Enable TDE on Pool1.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/managed-instance/scripts/transparent-data-encryption-byok-powershell

229
Q

You have a data warehouse in Azure Synapse Analytics.
You need to ensure that the data in the data warehouse is encrypted at rest.
What should you enable?
A. Advanced Data Security for this database
B. Transparent Data Encryption (TDE)
C. Secure transfer required
D. Dynamic Data Masking

A

Correct Answer: B 🗳️
Azure SQL Database currently supports encryption at rest for Microsoft-managed service side and client-side encryption scenarios.
✑ Support for server encryption is currently provided through the SQL feature called Transparent Data Encryption.
✑ Client-side encryption of Azure SQL Database data is supported through the Always Encrypted feature.
Reference:
https://docs.microsoft.com/en-us/azure/security/fundamentals/encryption-atrest
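TDE can also be enabled with T-SQL; a sketch assuming the data warehouse is named Pool1 and the session is connected to the master database of the logical server:

ALTER DATABASE [Pool1] SET ENCRYPTION ON;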

230
Q

Hard
You are designing a streaming data solution that will ingest variable volumes of data.

You need to ensure that you can change the partition count after creation.

Which service should you use to ingest the data?
A. Azure Event Hubs Dedicated
B. Azure Stream Analytics
C. Azure Data Factory
D. Azure Synapse Analytics

A

Correct Answer: A 🗳️
You can’t change the partition count for an event hub after its creation except for the event hub in a dedicated cluster.
Reference:
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features

231
Q

#13

You are designing a date dimension table in an Azure Synapse Analytics dedicated SQL pool. The date dimension table will be used by all the fact tables.

Which distribution type should you recommend to minimize data movement during queries?
A. HASH
B. REPLICATE
C. ROUND_ROBIN

A

Correct Answer: B 🗳️
A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables don’t require data movement. Replication requires extra storage, though, and isn’t practical for large tables.
Incorrect Answers:
A: A hash distributed table is designed to achieve high performance for queries on large tables.
C: A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than the other distribution methods.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview
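A sketch of what a replicated date dimension could look like; the table name and columns are illustrative:

CREATE TABLE dbo.DimDate
(
    DateKey      int      NOT NULL,
    CalendarDate date     NOT NULL,
    CalendarYear smallint NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,        -- full copy on every compute node, no data movement on joins
    CLUSTERED INDEX (DateKey)
);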

232
Q

HOTSPOT -
You develop a dataset named DBTBL1 by using Azure Databricks.

DBTBL1 contains the following columns:
✑ SensorTypeID
✑ GeographyRegionID
✑ Year
✑ Month
✑ Day
✑ Hour
✑ Minute
✑ Temperature
✑ WindSpeed
✑ Other

You need to store the data to support daily incremental load pipelines that vary for each GeographyRegionID. The solution must minimize storage costs.

How should you complete the code? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

df.write
[ bucketBy / format / partitionBy / sortBy ]   [ ("*") / ("GeographyRegionID") / ("GeographyRegionID", "Year", "Month", "Day") / ("Year", "Month", "Day", "GeographyRegionID") ]
.mode ("append")
[ csv("/DBTBL1") / json("/DBTBL1") / parquet("/DBTBL1") / saveAsTable("/DBTBL1") ]
A

df.write
.partitionBy("GeographyRegionID", "Year", "Month", "Day")
.mode("append")
.parquet("/DBTBL1")

233
Q

You are designing a security model for an Azure Synapse Analytics dedicated SQL pool that will support multiple companies.

You need to ensure that users from each company can view only the data of their respective company.

Which two objects should you include in the solution? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. a security policy
B. a custom role-based access control (RBAC) role
C. a predicate function
D. a column encryption key
E. asymmetric keys

A

Debated in chat; going with the chat answer:

A. a security policy
C. a predicate function

Suggested Answer: AB 🗳️
A: Row-Level Security (RLS) enables you to use group membership or execution context to control access to rows in a database table. Implement RLS by using the CREATE SECURITY POLICY Transact-SQL statement.
B: Azure Synapse provides a comprehensive and fine-grained access control system that integrates:
✑ Azure roles for resource management and access to data in storage,
✑ Synapse roles for managing live access to code and execution,
✑ SQL roles for data plane access to data in SQL pools.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/security/row-level-security https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-access-control-overview

ALTERNATE VIEW

To enforce row-level security based on companies, you’d need to employ a predicate function and a security policy.

So, the correct options are:

C. a predicate function
A. a security policy

Some debate in chat to use BA
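A minimal RLS sketch under assumed names; the Security schema, the UserCompany mapping table, the Sales table, and the CompanyName column are all hypothetical:

-- Assumes a Security schema and a Security.UserCompany(UserName, CompanyName) mapping table
CREATE FUNCTION Security.fn_companyPredicate(@CompanyName AS nvarchar(100))
    RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS fn_result
    FROM Security.UserCompany AS uc
    WHERE uc.UserName = USER_NAME()
      AND uc.CompanyName = @CompanyName;
GO

-- The security policy binds the predicate function to the table as a filter
CREATE SECURITY POLICY CompanyFilter
    ADD FILTER PREDICATE Security.fn_companyPredicate(CompanyName) ON dbo.Sales
    WITH (STATE = ON);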

234
Q

You have a SQL pool in Azure Synapse that contains a table named dbo.Customers. The table contains a column name Email.

You need to prevent nonadministrative users from seeing the full email addresses in the Email column. The users must see values in a format of aXXX@XXXX.com instead.

What should you do?
A. From Microsoft SQL Server Management Studio, set an email mask on the Email column.
B. From the Azure portal, set a mask on the Email column.
C. From Microsoft SQL Server Management Studio, grant the SELECT permission to the users for all the columns in the dbo.Customers table except Email.
D. From the Azure portal, set a sensitivity classification of Confidential for the Email column.

A

Selected Answer: A
Go with A. The reason for not choosing B: if the Email column is a string type, the default mask applied from the portal renders it as xxxxxxxx, so apply the email mask to the Email column instead.
https://learn.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview?view=azuresql
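For reference, the T-SQL behind an email mask looks roughly like this (dbo.Customers and Email come from the question); users without the UNMASK permission then see the masked value:

ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');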

235
Q

You have an Azure Data Lake Storage Gen2 account named adls2 that is protected by a virtual network.
You are designing a SQL pool in Azure Synapse that will use adls2 as a source.

What should you use to authenticate to adls2?
A. an Azure Active Directory (Azure AD) user
B. a shared key
C. a shared access signature (SAS)
D. a managed identity

A

Correct Answer: D 🗳️
Managed Identity authentication is required when your storage account is attached to a VNet.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/quickstart-bulk-load-copy-tsql-examples

236
Q

HOTSPOT -
You have an Azure Synapse Analytics SQL pool named Pool1. In Azure Active Directory (Azure AD), you have a security group named Group1.

You need to control the access of Group1 to specific columns and rows in a table in Pool1.

Which Transact-SQL commands should you use? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

To control access to the columns:
* CREATE CRYPTOGRAPHIC PROVIDER
* CREATE PARTITION FUNCTION
* CREATE SECURITY POLICY
* GRANT

To control access to the rows:
* CREATE CRYPTOGRAPHIC PROVIDER
* CREATE PARTITION FUNCTION
* CREATE SECURITY POLICY
* GRANT

A

Box 1: GRANT -
You can implement column-level security with the GRANT T-SQL statement. With this mechanism, both SQL and Azure Active Directory (Azure AD) authentication are supported.

Box 2: CREATE SECURITY POLICY -
Implement RLS by using the CREATE SECURITY POLICY Transact-SQL statement, and predicates created as inline table-valued functions.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/column-level-security https://docs.microsoft.com/en-us/sql/relational-databases/security/row-level-security

237
Q

HOTSPOT -
You need to implement an Azure Databricks cluster that automatically connects to Azure Data Lake Storage Gen2 by using Azure Active Directory (Azure AD) integration.

How should you configure the new cluster? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Tier:
*Premium
*Standard

Advanced option to enable:
*Azure Data Lake Storage Credential Passthrough
*Table Access Control

A

Box 1: Premium -
Credential passthrough requires an Azure Databricks Premium Plan
Box 2: Azure Data Lake Storage credential passthrough
You can access Azure Data Lake Storage using Azure Active Directory credential passthrough.
When you enable your cluster for Azure Data Lake Storage credential passthrough, commands that you run on that cluster can read and write data in Azure Data
Lake Storage without requiring you to configure service principal credentials for access to storage.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/security/credential-passthrough/adls-passthrough

238
Q

You are designing an Azure Synapse solution that will provide a query interface for the data stored in an Azure Storage account. The storage account is only accessible from a virtual network.

You need to recommend an authentication mechanism to ensure that the solution can access the source data.

What should you recommend?
A. a managed identity
B. anonymous public read access
C. a shared key

A

Correct Answer: A 🗳️
Managed Identity authentication is required when your storage account is attached to a VNet.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/quickstart-bulk-load-copy-tsql-examples

239
Q

link
You are developing an application that uses Azure Data Lake Storage Gen2.
You need to recommend a solution to grant permissions to a specific application for a limited time period.

What should you include in the recommendation?
A. role assignments
B. shared access signatures (SAS)
C. Azure Active Directory (Azure AD) identities
D. account keys

A

Correct Answer: B 🗳️
A shared access signature (SAS) provides secure delegated access to resources in your storage account. With a SAS, you have granular control over how a client can access your data. For example:
What resources the client may access.
What permissions they have to those resources.
How long the SAS is valid.
Reference:
https://docs.microsoft.com/en-us/azure/storage/common/storage-sas-overview
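A sketch of creating a time-limited SAS with the Python azure-storage-blob package against the account's Blob endpoint; every name, path, and the one-hour window below are placeholders, and a user delegation key could be used instead of the account key:

from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

sas_token = generate_blob_sas(
    account_name="storageacct",                 # placeholder account
    container_name="container1",                # placeholder container
    blob_name="app-data/report.parquet",        # placeholder path
    account_key="<storage-account-key>",        # placeholder key
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),  # limited time period
)
url = f"https://storageacct.blob.core.windows.net/container1/app-data/report.parquet?{sas_token}"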

240
Q

no clear answer
HOTSPOT -
You use Azure Data Lake Storage Gen2 to store data that data scientists and data engineers will query by using Azure Databricks interactive notebooks. Users will have access only to the Data Lake Storage folders that relate to the projects on which they work.

You need to recommend which authentication methods to use for Databricks and Data Lake Storage to provide the users with the appropriate access. The solution must minimize administrative effort and development effort.

Which authentication method should you recommend for each Azure service? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Databricks:

  • Azure Active Directory credential passthrough
  • Azure Key Vault secrets
  • Personal access tokens

Data Lake Storage:

  • Azure Active Directory credential passthrough
  • Shared access keys
  • Shared access signatures
A

Going with chat:
Accessing ADLS via Databricks should use Azure Active Directory credential passthrough.
Accessing the files in ADLS should use shared access signatures, based on the options provided.

241
Q

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Contacts. Contacts contains a column named Phone.

You need to ensure that users in a specific role only see the last four digits of a phone number when querying the Phone column.

What should you include in the solution?
A. table partitions
B. a default value
C. row-level security (RLS)
D. column encryption
E. dynamic data masking

A

Correct Answer: E 🗳️
Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate how much of the sensitive data to reveal with minimal impact on the application layer. It’s a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
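
For reference, a minimal sketch of how such a mask could be applied in T-SQL; the mask pattern and role name are illustrative, the question only requires that the last four digits stay visible:

ALTER TABLE dbo.Contacts
    ALTER COLUMN Phone ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');

-- Users who must still see full numbers can be granted the UNMASK permission, e.g.:
-- GRANT UNMASK TO SupportManagers;   -- role name is hypothetical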

242
Q

You are designing a database for an Azure Synapse Analytics dedicated SQL pool to support workloads for detecting ecommerce transaction fraud.

Data will be combined from multiple ecommerce sites and can include sensitive financial information such as credit card numbers.

You need to recommend a solution that meets the following requirements:
✑ Users must be able to identify potentially fraudulent transactions.
✑ Users must be able to use credit cards as a potential feature in models.
✑ Users must NOT be able to access the actual credit card numbers.

What should you include in the recommendation?
A. Transparent Data Encryption (TDE)
B. row-level security (RLS)
C. column-level encryption
D. Azure Active Directory (Azure AD) pass-through authentication

A

Correct Answer: C 🗳️
Use Always Encrypted to secure the required columns. You can configure Always Encrypted for individual database columns containing your sensitive data.
Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/security/encryption/always-encrypted-database-engine

243
Q

just use link
You have an Azure subscription linked to an Azure Active Directory (Azure AD) tenant that contains a service principal named ServicePrincipal1. The subscription contains an Azure Data Lake Storage account named adls1. Adls1 contains a folder named Folder2 that has a URI of https://adls1.dfs.core.windows.net/container1/Folder1/Folder2/.

ServicePrincipal1 has the access control list (ACL) permissions shown in the following table.

| Resource   | Permission       |
|------------|------------------|
| container1 | Access - Execute |
| Folder1    | Access - Execute |
| Folder2    | Access - Read    |

You need to ensure that ServicePrincipal1 can perform the following actions:
✑ Traverse child items that are created in Folder2.
✑ Read files that are created in Folder2.

The solution must use the principle of least privilege.

Which two permissions should you grant to ServicePrincipal1 for Folder2? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. Access - Read
B. Access - Write
C. Access - Execute
D. Default - Read
E. Default - Write
F. Default - Execute

A

Like 4 different combinations were suggested for this; going with the most likes.
Selected Answer: CD
C. Access - Execute
D. Default - Read
Phrased differently, the question for me says: if you create "Folder3" inside Folder2, you should be able to read files created in Folder3.

This means that you definitely need Execute and Read permissions on Folder2 (Execute to traverse child folders, Read to read the files).

Now, starting from least privilege, suppose you give "Access" permissions for both Read and Execute. In this case, you can't read files created in Folder3. This is a requirement ("child items that are created in Folder2"), so you need Default Read access.

You don't need Default Execute; otherwise you would have access to a folder created in Folder3 (say Folder4), and that is not required, so for least privilege you must give Access Execute and not Default Execute.

244
Q

Hard
HOTSPOT -
You have an Azure subscription that is linked to a hybrid Azure Active Directory (Azure AD) tenant. The subscription contains an Azure Synapse Analytics SQL pool named Pool1.

You need to recommend an authentication solution for Pool1. The solution must support multi-factor authentication (MFA) and database-level authentication.

Which authentication solution or solutions should you include in the recommendation? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

MFA:
* Azure AD authentication
* Microsoft SQL Server authentication
* Passwordless authentication
* Windows authentication

Database-level authentication:
* Application roles
* Contained database users
* Database roles
* Microsoft SQL Server logins

A

Box 1: Azure AD authentication -
Azure AD authentication has the option to include MFA.

Box 2: Contained database users -
Azure AD authentication uses contained database users to authenticate identities at the database level.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/authentication-mfa-ssms-overview https://docs.microsoft.com/en-us/azure/azure-sql/database/authentication-aad-overview

245
Q

Hard
DRAG DROP -
You have an Azure data factory.

You need to ensure that pipeline-run data is retained for 120 days. The solution must ensure that you can query the data by using the Kusto query language.

Which four actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.

NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.

Select and Place:

Actions

  • Select the PipelineRuns category.
  • Create a Log Analytics workspace that has Data Retention set to 120 days.
  • Stream to an Azure event hub.
  • Create an Azure Storage account that has a lifecycle policy.
  • From the Azure portal, add a diagnostic setting.
  • Send the data to a Log Analytics workspace.
  • Select the TriggerRuns category.
A

Step 1: Create a Log Analytics workspace that has Data Retention set to 120 days.
Step 2: From Azure Portal, add a diagnostic setting.
Step 3: Select the PipelineRuns Category
Step 4: Send the data to a Log Analytics workspace.

246
Q

You have an Azure Synapse Analytics dedicated SQL pool.

You need to ensure that data in the pool is encrypted at rest. The solution must NOT require modifying applications that query the data.

What should you do?
A. Enable encryption at rest for the Azure Data Lake Storage Gen2 account.
B. Enable Transparent Data Encryption (TDE) for the pool.
C. Use a customer-managed key to enable double encryption for the Azure Synapse workspace.
D. Create an Azure key vault in the Azure subscription and grant access to the pool.

A

Correct Answer: B 🗳️
Transparent Data Encryption (TDE) helps protect against the threat of malicious activity by encrypting and decrypting your data at rest. When you encrypt your database, associated backups and transaction log files are encrypted without requiring any changes to your applications. TDE encrypts the storage of an entire database by using a symmetric key called the database encryption key.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-overview-manage-security
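
As a reference, a minimal sketch of enabling TDE on a dedicated SQL pool with T-SQL; the pool name is a placeholder:

-- Run while connected to the master database of the logical server that hosts the pool.
ALTER DATABASE [YourDedicatedPool] SET ENCRYPTION ON;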

247
Q

DRAG DROP -
You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named storage1. Storage1 contains a container named container1.

Container1 contains a directory named directory1. Directory1 contains a file named file1.

You have an Azure Active Directory (Azure AD) user named User1 that is assigned the Storage Blob Data Reader role for storage1.

You need to ensure that User1 can append data to file1. The solution must use the principle of least privilege.

Which permissions should you grant? To answer, drag the appropriate permissions to the correct resources. Each permission may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

Select and Place:

Permissions

Read

Write

Execute

Answer Area

container1: Permission

directory1: Permission

file1: Permission

A

Box 1: Execute -
If you are granting permissions by using only ACLs (no Azure RBAC), then to grant a security principal read or write access to a file, you’ll need to give the security principal Execute permissions to the root folder of the container, and to each folder in the hierarchy of folders that lead to the file.

Box 2: Execute -
On Directory: Execute (X): Required to traverse the child items of a directory

Box 3: Write -
On file: Write (W): Can write or append to a file.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control

248
Q

link HARD HARD

HOTSPOT -
You have an Azure subscription that contains an Azure Databricks workspace named databricks1 and an Azure Synapse Analytics workspace named synapse1.

The synapse1 workspace contains an Apache Spark pool named pool1.
You need to share an Apache Hive catalog of pool1 with databricks1.

What should you do? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

From synapse1, create a linked service to:

Azure Cosmos DB
Azure Data Lake Storage Gen2
Azure SQL Database

Configure pool1 to use the linked service as:

An Azure Purview account
A Hive metastore
A managed Hive metastore service

A

Box 1: Azure SQL Database -
Use external Hive Metastore for Synapse Spark Pool
Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore) compatible metastore as their catalog.
Set up linked service to Hive Metastore
Follow below steps to set up a linked service to the external Hive Metastore in Synapse workspace.
1. Open Synapse Studio, go to Manage > Linked services at left, click New to create a new linked service.
2. Set up Hive Metastore linked service
3. Choose Azure SQL Database or Azure Database for MySQL based on your database type, click Continue.
4. Provide Name of the linked service. Record the name of the linked service, this info will be used to configure Spark shortly.
5. You can either select Azure SQL Database/Azure Database for MySQL for the external Hive Metastore from Azure subscription list, or enter the info manually.
6. Provide User name and Password to set up the connection.
7. Test connection to verify the username and password.
8. Click Create to create the linked service.

Box 2: A Hive Metastore -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-external-metastore

249
Q

HOTSPOT -
You have an Azure subscription.

You need to deploy an Azure Data Lake Storage Gen2 Premium account. The solution must meet the following requirements:
* Blobs that are older than 365 days must be deleted.
* Administrative effort must be minimized.
* Costs must be minimized.

What should you use? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

To minimize costs:

Locally-redundant storage (LRS)
The Archive access tier
The Cool access tier
Zone-redundant storage (ZRS)

To delete blobs:

Azure Automation runbooks
Azure Storage lifecycle management
Soft delete

A

Box1: Locally-redundant storage (LRS)
In the question, it specifically states that “You need to deploy an Azure Data Lake Storage Gen2 Premium account”, and Azure Data Lake Storage Gen2 premium tier is neither an Archive access tier nor a Cool Access tier, and so those two options are out. Locally-redundant storage (LRS) is less expensive than Zone-redundant storage (ZRS), so we choose LRS.
https://learn.microsoft.com/en-us/azure/storage/blobs/premium-tier-for-data-lake-storage

Box 2: Azure Storage lifecycle management
With the lifecycle management policy, you can:
* Delete current versions of a blob, previous versions of a blob, or blob snapshots at the end of their lifecycles.
* Transition blobs from cool to hot immediately when they’re accessed, to optimize for performance.
* Transition current versions of a blob, previous versions of a blob, or blob snapshots to a cooler storage tier if these objects haven’t been accessed or modified for a period of time, to optimize for cost. In this scenario, the lifecycle management policy can move objects from hot to cool, from hot to archive, or from cool to archive.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview https://docs.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview

250
Q

HOTSPOT -
You are designing an application that will use an Azure Data Lake Storage Gen 2 account to store petabytes of license plate photos from toll booths. The account will use zone-redundant storage (ZRS).

You identify the following usage patterns:
* The data will be accessed several times a day during the first 30 days after the data is created. The data must meet an availability SLA of 99.9%.
* After 90 days, the data will be accessed infrequently but must be available within 30 seconds.
* After 365 days, the data will be accessed infrequently but must be available within five minutes.

You need to recommend a data retention solution. The solution must minimize costs.

Which access tier should you recommend for each time frame? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

First 30 days:
Archive
Cool
Hot

After 90 days:
Archive
Cool
Hot

After 365 days:
Archive
Cool
Hot

A

Box 1: Hot -
The data will be accessed several times a day during the first 30 days after the data is created. The data must meet an availability SLA of 99.9%.

Box 2: Cool -
After 90 days, the data will be accessed infrequently but must be available within 30 seconds.
Data in the Cool tier should be stored for a minimum of 30 days.
When your data is stored in an online access tier (either Hot or Cool), users can access it immediately. The Hot tier is the best choice for data that is in active use, while the Cool tier is ideal for data that is accessed less frequently, but that still must be available for reading and writing.

Box 3: Cool -
After 365 days, the data will be accessed infrequently but must be available within five minutes.
Incorrect:
Not Archive:
While a blob is in the Archive access tier, it’s considered to be offline and can’t be read or modified. In order to read or modify data in an archived blob, you must first rehydrate the blob to an online tier, either the Hot or Cool tier.

Rehydration priority -
When you rehydrate a blob, you can set the priority for the rehydration operation via the optional x-ms-rehydrate-priority header on a Set Blob Tier or Copy Blob operation. Rehydration priority options include:
Standard priority: The rehydration request will be processed in the order it was received and may take up to 15 hours.
High priority: The rehydration request will be prioritized over standard priority requests and may complete in less than one hour for objects under 10 GB in size.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview https://docs.microsoft.com/en-us/azure/storage/blobs/archive-rehydrate-overview

251
Q

#33 HARD
DRAG DROP
-

You have an Azure Data Lake Storage Gen 2 account named storage1.

You need to recommend a solution for accessing the content in storage1. The solution must meet the following requirements:

  • List and read permissions must be granted at the storage account level.
  • Additional permissions can be applied to individual objects in storage1.
  • Security principals from Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra, must be used for authentication.

What should you use? To answer, drag the appropriate components to the correct requirements. Each component may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Components

Access control lists (ACLs)

Role-based access control (RBAC) roles

Shared access signatures (SAS)

Shared account keys

Answer Area

To grant permissions at the storage account level: [xxxxxxxxxxx]

To grant permissions at the object level: [xxxxxxxxxxx]

A

1. Role-based access control (RBAC) roles
2. Access control lists (ACLs)

Access Control Lists (ACLs) are a way of defining and managing permissions on individual objects within a storage system like Azure Data Lake Storage Gen 2. ACLs provide a granular level of control by specifying who can do what with specific files or directories.

In the context of the scenario you provided, using ACLs would allow you to apply additional permissions to individual objects within the ‘storage1’ account. This means you can specify different permissions (like read, write, execute) for different users, groups, or security principals on specific files or folders within the storage account.

Meanwhile, Role-Based Access Control (RBAC) in Azure allows you to assign permissions to users, groups, or applications at different scopes (like subscription, resource group, or individual resources). RBAC roles, when assigned at the storage account level, can grant list and read permissions across the entire storage account.

Therefore, RBAC roles would be used to grant permissions at the storage account level (for list and read permissions), and ACLs would be used to grant additional permissions at the object level (specific files or folders) within the storage account ‘storage1’.

252
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1 that contains a table named Sales.

Sales has row-level security (RLS) applied. RLS uses the following predicate filter.

CREATE FUNCTION Security.fn_securitypredicate(@SalesRep AS sysname)
    RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT 1 AS fn_securitypredicate_result
    WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'Manager';

A user named SalesUser1 is assigned the db_datareader role for Pool1.

Which rows in the Sales table are returned when SalesUser1 queries the table?

A. only the rows for which the value in the User_Name column is SalesUser1
B. all the rows
C. only the rows for which the value in the SalesRep column is Manager
D. only the rows for which the value in the SalesRep column is SalesUser1

A

Correct Answer: D 🗳️
I'm pretty sure.
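
For context, the filter function only takes effect once a security policy binds it to the table. A minimal sketch of what that policy could look like; the policy name and the assumption that the predicate is applied to a SalesRep column are illustrative, since the question does not show the policy definition:

CREATE SECURITY POLICY SalesFilter
    ADD FILTER PREDICATE Security.fn_securitypredicate(SalesRep)
    ON dbo.Sales
    WITH (STATE = ON);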

253
Q

#35 topic 3

HOTSPOT
-
You have an Azure Data Lake Storage Gen2 account named account1 that contains the resources shown in the following table.

|   Name     |   Type      |          Description         |
|------------|-------------|------------------------------|
| container1 | Container   |         A container           |
| Directory1 | Directory   | A directory in container1    |
| File1      | File        | A file in Directory1         |

You need to configure access control lists (ACLs) to allow a user named User1 to delete File1. User1 is NOT assigned any role-based access control (RBAC) roles for account1. The solution must use the principle of least privilege.

Which type of ACL should you configure for each resource? To answer select the appropriate options in the answer area.

Answer Area

container1:
--- permissions
-WX permissions
--X permissions

Directory1:
--- permissions
-WX permissions
--X permissions

File1:
--- permissions
-WX permissions
--X permissions

A

Answer is
container1: --X permissions
directory1: -WX permissions
file1: --- permissions

ref

254
Q

You have an Azure subscription that is linked to a tenant in Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra. The tenant contains a security group named Group1. The subscription contains an Azure Data Lake Storage account named myaccount1. The myaccount1 account contains two containers named container1 and container2.

You need to grant Group1 read access to container1. The solution must use the principle of least privilege.

Which role should you assign to Group1?

A. Storage Table Data Reader for myaccount1
B. Storage Blob Data Reader for container1
C. Storage Blob Data Reader for myaccount1
D. Storage Table Data Reader for container1

A

The appropriate role to assign to Group1 to grant read access to container1 with the principle of least privilege is option B, Storage Blob Data Reader for container1.

Option A, Storage Table Data Reader for myaccount1, is incorrect because it grants read access to all tables in the storage account, not just container1.

Option C, Storage Blob Data Reader for myaccount1, is incorrect because it grants read access to all containers in the storage account, not just container1.

Option D, Storage Table Data Reader for container1, is incorrect because that role applies to Azure Table storage, not to the blobs stored in container1.

Therefore, option B, Storage Blob Data Reader for container1, is the most appropriate role to assign Group1 to grant read access to container1 with the principle of least privilege.

255
Q

You have an Azure Synapse Analytics dedicated SQL pool that contains a table named dbo.Users.

You need to prevent a group of users from reading user email addresses from dbo.Users.

What should you use?

A. column-level security
B. row-level security (RLS)
C. Transparent Data Encryption (TOE)
D. dynamic data masking

A

The appropriate feature to use to prevent a group of users from reading user email addresses from dbo.Users in an Azure Synapse Analytics dedicated SQL pool is option A, column-level security.

Option B, row-level security (RLS), is used to filter rows in a table based on the user executing a query, but it cannot prevent certain columns from being read by a group of users.

Option C, Transparent Data Encryption (TDE), encrypts data at rest and does not prevent a group of users from reading specific columns in a table.

Option D, dynamic data masking, is used to mask sensitive data in query results, but it does not prevent a group of users from reading the actual values in a column.

Therefore, option A, column-level security, is the most appropriate feature to use to prevent a group of users from reading user email addresses from dbo.Users in an Azure Synapse Analytics dedicated SQL pool. Column-level security can be used to deny read access to specific columns in a table based on a user or group’s permissions.
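
A minimal sketch of column-level security using a column-scoped GRANT; the column and role names are hypothetical, since the question does not name them:

-- Grant SELECT only on the non-sensitive columns; the email column is omitted,
-- so members of the role cannot read it.
GRANT SELECT ON dbo.Users (UserID, DisplayName, Department) TO UserReaders;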

256
Q

You have an Azure Synapse Analytics dedicated SQL pool that hosts a database named DB1.

You need to ensure that DB1 meets the following security requirements:
* When credit card numbers show in applications, only the last four digits must be visible.
* Tax numbers must be visible only to specific users.

What should you use for each requirement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Answer Area

Credit card numbers:
Column-level security
Dynamic Data Masking
Row-level security (RLS)

Tax numbers:
Column-level security
Row-level security (RLS)
Transparent Database Encryption (TDE)

A

Credit card numbers: Dynamic Data Masking

Tax numbers: Column-level security (this is debated; some in the chat and GPT say RLS)

257
Q

You have an Azure subscription that contains a storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool. The storage1 account contains a CSV file that requires an account key for access.

You plan to read the contents of the CSV file by using an external table.

You need to create an external data source for the external table.

What should you create first?

A. a database role
B. a database scoped credential
C. a database view
D. an external file format

A

Correct Answer: B 🗳️

To access data stored in Azure Storage within Azure Synapse Analytics, you should create a database scoped credential first. This credential is used to securely store the account key or access key needed to authenticate and access the Azure Storage account. Once you’ve set up the database scoped credential, you can then proceed to create the external data source, which references this credential for authentication when accessing the CSV file.

Refer to this link.
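
A minimal sketch of the order of operations in T-SQL; the credential, data source, and container names are placeholders, and the secret is the storage1 account key referenced in the question:

-- 1. A master key must exist before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- 2. Create the database scoped credential that stores the account key.
CREATE DATABASE SCOPED CREDENTIAL Storage1Credential
WITH IDENTITY = 'storage1',                -- any string when authenticating with an account key
     SECRET   = '<storage1 account key>';

-- 3. Create the external data source that references the credential.
CREATE EXTERNAL DATA SOURCE Storage1Source
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://data@storage1.dfs.core.windows.net',
    CREDENTIAL = Storage1Credential
);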

258
Q

You have a tenant in Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra. The tenant contains a group named Group1.

You have an Azure subscription that contains the resources shown in the following table.

Name         Type                                Description
------------ ----------------------------------- -----------------------------------
ws1          Azure Synapse Analytics workspace    None
storage1     Azure Storage account                Contains CSV files
credential1  Database-scoped credential           Stored in the Azure Synapse Analytics serverless SQL pool in ws1 and used to authenticate to storage1

You need to ensure that members of Group1 can read CSV files from storage1 by using the OPENROWSET function. The solution must meet the following requirements:

  • The members of Group1 must use credential1 to access storage1.
  • The principle of least privilege must be followed.

Which permission should you grant to Group1?

A. EXECUTE
B. CONTROL
C. REFERENCES
D. SELECT

A

Go with C (REFERENCES), according to the chat and documentation.

Selected Answer: D
When you’re using the OPENROWSET function to read data from the storage account, you’re actually performing a read operation, not an execute operation. The credential is used implicitly by Azure Synapse to authenticate the session with the storage account and does not require the EXECUTE permission for the user or group accessing it. Instead, you grant permissions that are appropriate for data access. In this case, the SELECT permission is the correct one to use because it allows the members of Group1 to read or select the data.

For granting access to read CSV files from storage1 using the OPENROWSET function with the principle of least privilege, the appropriate permission to grant to Group1 would be D. SELECT. This permission allows reading data from the specified table or view.

This aligns with providing read access to the group without giving broader control or execution rights, ensuring members can perform necessary read operations without additional privileges.
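
For context, a sketch of how Group1 would read the files through OPENROWSET once permissions are in place; the external data source name and file path are hypothetical, and the data source is assumed to be bound to credential1:

-- Permission on the credential (the option argued for above); which grant is required is debated.
-- GRANT REFERENCES ON DATABASE SCOPED CREDENTIAL::credential1 TO [Group1];

SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'csv/*.csv',
        DATA_SOURCE = 'Storage1Source',   -- hypothetical data source that references credential1
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
     ) AS rows;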

259
Q

topic 4 - #1

You implement an enterprise data warehouse in Azure Synapse Analytics.
You have a large fact table that is 10 terabytes (TB) in size.
Incoming queries use the primary key SaleKey column to retrieve data as displayed in the following table:

see here

| SaleKey | CityKey | CustomerKey | StockItemKey | InvoiceDateKey | Quantity | UnitPrice | TotalExcludingTax |
|---------|---------|-------------|--------------|----------------|----------|-----------|-------------------|
| 49309   | 90858   | 70          | 69           | 10/22/13       | 8        | 16        | 128               |
| 49313   | 55710   | 126         | 69           | 10/22/13       | 2        | 16        | 32                |
| 49343   | 44710   | 234         | 68           | 10/22/13       | 10       | 16        | 160               |
| 49352   | 66109   | 163         | 70           | 10/22/13       | 4        | 16        | 64                |
| 49448   | 65312   | 230         | 70           | 10/22/13       | 8        | 16        | 128               |
| 49646   | 85877   | 271         | 69           | 10/24/13       | 1        | 16        | 16                |
| 49798   | 41238   | 288         | 69           | 10/24/13       | 1        | 16        | 16                |

You need to distribute the large fact table across multiple nodes to optimize performance of the table.
Which technology should you use?

A. hash distributed table with clustered index
B. hash distributed table with clustered Columnstore index
C. round robin distributed table with clustered index
D. round robin distributed table with clustered Columnstore index
E. heap table with distribution replicate

A

Given the scenario of a large fact table and the requirement to optimize performance by distributing the data across multiple nodes in Azure Synapse Analytics, the appropriate technology to use is:

B. hash distributed table with clustered Columnstore index

This approach provides efficient data distribution across nodes (using hashing) and benefits from the performance advantages of a clustered Columnstore index for large analytical workloads.
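
A minimal sketch of option B for this table; the data types are assumptions based on the sample rows:

CREATE TABLE dbo.FactSale
(
    SaleKey           int           NOT NULL,
    CityKey           int           NOT NULL,
    CustomerKey       int           NOT NULL,
    StockItemKey      int           NOT NULL,
    InvoiceDateKey    date          NOT NULL,
    Quantity          int           NOT NULL,
    UnitPrice         decimal(18,2) NOT NULL,
    TotalExcludingTax decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (SaleKey),      -- incoming queries filter on SaleKey, so hash on it
    CLUSTERED COLUMNSTORE INDEX
);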

260
Q

You have an Azure Synapse Analytics dedicated SQL pool that contains a large fact table. The table contains 50 columns and 5 billion rows and is a heap.

Most queries against the table aggregate values from approximately 100 million rows and return only two columns.

You discover that the queries against the fact table are very slow.

Which type of index should you add to provide the fastest query times?
A. nonclustered columnstore
B. clustered columnstore
C. nonclustered
D. clustered

A

Given the scenario with a large fact table in Azure Synapse Analytics containing 5 billion rows and 50 columns, where most queries aggregate values from around 100 million rows and return only two columns, the most suitable index to improve query performance would be:

B. clustered columnstore index

Clustered columnstore indexes are ideal for analytical workloads, especially when performing aggregations on a large volume of data. They provide excellent compression and can significantly enhance query performance for such scenarios by minimizing I/O and speeding up aggregations due to their columnar storage format and batch-based processing capabilities.
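
A minimal sketch, assuming the fact table is named dbo.FactTable; building the clustered columnstore index converts the heap to columnar storage:

CREATE CLUSTERED COLUMNSTORE INDEX cci_FactTable ON dbo.FactTable;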

261
Q

#3 topic 4

You create an Azure Databricks cluster and specify an additional library to install.

When you attempt to load the library to a notebook, the library in not found.

You need to identify the cause of the issue.

What should you review?
A. notebook logs
B. cluster event logs
C. global init scripts logs
D. workspace logs

A

To identify the cause of the missing library issue when attempting to load it into an Azure Databricks notebook, reviewing the cluster event logs (option B) would be the most appropriate step. These logs often contain valuable information about the cluster’s activity, including events related to library installation or initialization failures. It could highlight any errors or issues encountered during the installation of the additional library onto the cluster.

262
Q

You have an Azure data factory.

You need to examine the pipeline failures from the last 60 days.
What should you use?

A. the Activity log blade for the Data Factory resource
B. the Monitor & Manage app in Data Factory
C. the Resource health blade for the Data Factory resource
D. Azure Monitor

A

Correct Answer: D 🗳️
Data Factory stores pipeline-run data for only 45 days. Use Azure Monitor if you want to keep that data for a longer time.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor

263
Q

You are monitoring an Azure Stream Analytics job.

The Backlogged Input Events count has been 20 for the last hour.

You need to reduce the Backlogged Input Events count.

What should you do?
A. Drop late arriving events from the job.
B. Add an Azure Storage account to the job.
C. Increase the streaming units for the job.
D. Stop the job.

A

Correct Answer: C 🗳️

Increase the streaming units for the job (C): If the workload is consistently high and causing backlog due to processing limitations, scaling up by increasing the streaming units might help distribute the load and process events more efficiently.

General symptoms of the job hitting system resource limits include:
✑ If the backlog event metric keeps increasing, it’s an indicator that the system resource is constrained (either because of output sink throttling, or high CPU).
Note: Backlogged Input Events: Number of input events that are backlogged. A non-zero value for this metric implies that your job isn’t able to keep up with the number of incoming events. If this value is slowly increasing or consistently non-zero, you should scale out your job: adjust Streaming Units.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-scale-jobs https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-monitoring

264
Q

hard
You are designing an Azure Databricks interactive cluster. The cluster will be used infrequently and will be configured for auto-termination.

You need to ensure that the cluster configuration is retained indefinitely after the cluster is terminated. The solution must minimize costs.

What should you do?

A. Pin the cluster.
B. Create an Azure runbook that starts the cluster every 90 days.
C. Terminate the cluster manually when processing completes.
D. Clone the cluster after it is terminated.

A

Correct Answer: A 🗳️

If you want to retain the cluster configuration indefinitely after the cluster is terminated to minimize costs, you should choose:

A. Pin the cluster.

Pinning the cluster will retain its configuration details, ensuring that the configuration is preserved even after the cluster is terminated. This approach allows you to retain the cluster’s configuration without incurring costs for an actively running cluster.

Azure Databricks retains cluster configuration information for up to 70 all-purpose clusters terminated in the last 30 days and up to 30 job clusters recently terminated by the job scheduler. To keep an all-purpose cluster configuration even after it has been terminated for more than 30 days, an administrator can pin a cluster to the cluster list.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/clusters/

265
Q

You have an Azure data solution that contains an enterprise data warehouse in Azure Synapse Analytics named DW1.

Several users execute ad hoc queries to DW1 concurrently.
You regularly perform automated data loads to DW1.

You need to ensure that the automated data loads have enough memory available to complete quickly and successfully when the adhoc queries run.

What should you do?
A. Hash distribute the large fact tables in DW1 before performing the automated data loads.
B. Assign a smaller resource class to the automated data load queries.
C. Assign a larger resource class to the automated data load queries.
D. Create sampled statistics for every column in each table of DW1.

A

Correct Answer: C 🗳️

To ensure that the automated data loads have enough memory available to complete quickly and successfully when concurrent ad hoc queries run, you should:

C. Assign a larger resource class to the automated data load queries.

Assigning a larger resource class to the automated data load queries will allocate more resources (like CPU, memory, etc.) to these queries, ensuring they have the necessary resources to execute efficiently, especially when running concurrently with ad hoc queries. This allocation will help prevent contention and resource competition between the ad hoc queries and the data load processes, improving the overall performance and completion time of the data loads.

The performance capacity of a query is determined by the user’s resource class. Resource classes are pre-determined resource limits in Synapse SQL pool that govern compute resources and concurrency for query execution.
Resource classes can help you configure resources for your queries by setting limits on the number of queries that run concurrently and on the compute- resources assigned to each query. There’s a trade-off between memory and concurrency.
Smaller resource classes reduce the maximum memory per query, but increase concurrency.
Larger resource classes increase the maximum memory per query, but reduce concurrency.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/resource-classes-for-workload-management
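
A minimal sketch of assigning a larger static resource class to the account used by the automated loads; the user name is hypothetical:

-- Resource classes are assigned by adding the user to the corresponding database role.
EXEC sp_addrolemember 'largerc', 'LoadUser';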

266
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and a database named DB1. DB1 contains a fact table named Table1.

You need to identify the extent of the data skew in Table1.

What should you do in Synapse Studio?
A. Connect to the built-in pool and run DBCC PDW_SHOWSPACEUSED.

B. Connect to the built-in pool and run DBCC CHECKALLOC.

C. Connect to Pool1 and query sys.dm_pdw_node_status.

D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.

A

To identify the extent of data skew in the fact table Table1 within the Azure Synapse Analytics dedicated SQL pool, you should:

D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.

Using the sys.dm_pdw_nodes_db_partition_stats DMV (Dynamic Management View) within Synapse Studio connected to Pool1 enables you to retrieve statistics about the distribution of data across distributions and compute nodes. This will help you analyze the distribution of data across nodes and detect any potential data skew in the table partitions.
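
A sketch of a per-distribution row count for Table1 using that DMV, following the join path through sys.pdw_table_mappings that the table-size documentation uses; treat it as illustrative rather than exact:

SELECT
    nps.distribution_id,
    SUM(nps.row_count) AS row_count
FROM sys.tables AS t
JOIN sys.pdw_table_mappings AS tm
    ON t.object_id = tm.object_id
JOIN sys.pdw_nodes_tables AS nt
    ON tm.physical_name = nt.name
JOIN sys.dm_pdw_nodes_db_partition_stats AS nps
    ON nt.object_id = nps.object_id
   AND nt.pdw_node_id = nps.pdw_node_id
   AND nt.distribution_id = nps.distribution_id
WHERE t.name = 'Table1'
  AND nps.index_id <= 1          -- count heap/clustered index rows only
GROUP BY nps.distribution_id
ORDER BY row_count DESC;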

267
Q

HOTSPOT -
You need to collect application metrics, streaming query events, and application log messages for an Azure Databrick cluster.

Which type of library and workspace should you implement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Library
Azure Databricks Monitoring Library
Microsoft Azure Management Monitoring Library
PyTorch
TensorFlow

Workspace:
Azure Databricks
Azure Log Analytics
Azure Machine Learning

A

For collecting application metrics, streaming query events, and application log messages for an Azure Databricks cluster:

Library: Azure Databricks Monitoring Library
Workspace: Azure Log Analytics

268
Q

Hard
You have a SQL pool in Azure Synapse.

You discover that some queries fail or take a long time to complete.
You need to monitor for transactions that have rolled back.

Which dynamic management view should you query?
A. sys.dm_pdw_request_steps
B. sys.dm_pdw_nodes_tran_database_transactions
C. sys.dm_pdw_waits
D. sys.dm_pdw_exec_sessions

A

The correct dynamic management view in Azure Synapse to monitor transactions that have rolled back is:
B. sys.dm_pdw_nodes_tran_database_transactions

You can use Dynamic Management Views (DMVs) to monitor your workload including investigating query execution in SQL pool.
If your queries are failing or taking a long time to proceed, you can check and monitor if you have any transactions rolling back.
Example:
-- Monitor rollback
SELECT
    SUM(CASE WHEN t.database_transaction_next_undo_lsn IS NOT NULL THEN 1 ELSE 0 END),
    t.pdw_node_id,
    nod.[type]
FROM sys.dm_pdw_nodes_tran_database_transactions t
JOIN sys.dm_pdw_nodes nod ON t.pdw_node_id = nod.pdw_node_id
GROUP BY t.pdw_node_id, nod.[type]
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-monitor#monitor-transaction-log-rollback

269
Q

You are monitoring an Azure Stream Analytics job.

You discover that the Backlogged Input Events metric is increasing slowly and is consistently non-zero.

You need to ensure that the job can handle all the events.
What should you do?

A. Change the compatibility level of the Stream Analytics job.
B. Increase the number of streaming units (SUs).
C. Remove any named consumer groups from the connection and use $default.
D. Create an additional output stream for the existing input stream.

A

To address the increasing Backlogged Input Events metric, you should consider:
B. Increase the number of streaming units (SUs).
Increasing the number of streaming units enhances the job’s capacity to handle the incoming event load and potentially reduces backlogged events.

Backlogged Input Events: Number of input events that are backlogged. A non-zero value for this metric implies that your job isn’t able to keep up with the number of incoming events. If this value is slowly increasing or consistently non-zero, you should scale out your job. You should increase the Streaming Units.
Note: Streaming Units (SUs) represents the computing resources that are allocated to execute a Stream Analytics job. The higher the number of SUs, the more
CPU and memory resources are allocated for your job.
Reference:
https://docs.microsoft.com/bs-cyrl-ba/azure/stream-analytics/stream-analytics-monitoring

270
Q

link
You are designing an inventory updates table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:

| Column                | Comment                                                               |
|-----------------------|-----------------------------------------------------------------------|
| EventDate             | One million records are added to the table each day                  |
| EventTypeID           | The table contains 10 million records for each event type            |
| WarehouseID           | The table contains 100 million records for each warehouse            |
| ProductCategoryTypeID | The table contains 25 million records for each product category type |

You identify the following usage patterns:
✑ Analysts will most commonly analyze transactions for a warehouse.
✑ Queries will summarize by product category type, date, and/or inventory event type.

You need to recommend a partition strategy for the table to minimize query times.

On which column should you partition the table?
A. EventTypeID
B. ProductCategoryTypeID
C. EventDate
D. WarehouseID

A

Correct Answer: D 🗳️ (some in the chat and GPT say C, but go with D)
The number of records for each warehouse is big enough for a good partitioning.
Note: Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column.
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases.
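
For reference, a sketch of what partitioning on WarehouseID could look like in the table definition; the boundary values and the distribution choice are illustrative only and are not part of the question:

CREATE TABLE dbo.InventoryUpdates
(
    EventDate             date NOT NULL,
    EventTypeID           int  NOT NULL,
    WarehouseID           int  NOT NULL,
    ProductCategoryTypeID int  NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,        -- distribution choice is not part of the question
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( WarehouseID RANGE RIGHT FOR VALUES (100, 200, 300) )
);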

271
Q

#13 TOPIC 4

You are designing a star schema for a dataset that contains records of online orders. Each record includes an order date, an order due date, and an order ship date.

You need to ensure that the design provides the fastest query times of the records when querying for arbitrary date ranges and aggregating by fiscal calendar attributes.

Which two actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. Create a date dimension table that has a DateTime key.
B. Use built-in SQL functions to extract date attributes.
C. Create a date dimension table that has an integer key in the format of YYYYMMDD.
D. In the fact table, use integer columns for the date fields.
E. Use DateTime columns for the date fields.

A

Should be C and D
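
A minimal sketch of the C + D combination; the table and column names are illustrative:

CREATE TABLE dbo.DimDate
(
    DateKey       int      NOT NULL,   -- integer key in YYYYMMDD format, e.g. 20240131
    CalendarDate  date     NOT NULL,
    FiscalYear    smallint NOT NULL,
    FiscalQuarter tinyint  NOT NULL
);

CREATE TABLE dbo.FactOrder
(
    OrderDateKey     int NOT NULL,     -- joins to DimDate.DateKey
    OrderDueDateKey  int NOT NULL,
    OrderShipDateKey int NOT NULL,
    OrderAmount      decimal(18,2) NOT NULL
);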

272
Q

A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices.

The company must be able to monitor the devices in real-time.

You need to design the solution.

What should you recommend?
A. Azure Analysis Services using Azure Portal
B. Azure Analysis Services using Azure PowerShell
C. Azure Stream Analytics cloud job using Azure Portal
D. Azure Data Factory instance using Microsoft Visual Studio

A

Correct Answer: C 🗳️
In a real-world scenario, you could have hundreds of these sensors generating events as a stream. Ideally, a gateway device would run code to push these events to Azure Event Hubs or Azure IoT Hubs. Your Stream Analytics job would ingest these events from Event Hubs and run real-time analytics queries against the streams.
Create a Stream Analytics job:
In the Azure portal, select + Create a resource from the left navigation menu. Then, select Stream Analytics job from Analytics.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-get-started-with-azure-stream-analytics-to-process-data-from-iot-devices

273
Q

You have a SQL pool in Azure Synapse.

A user reports that queries against the pool take longer than expected to complete. You determine that the issue relates to queried columnstore segments.

You need to add monitoring to the underlying storage to help diagnose the issue.

Which two metrics should you monitor? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.
A. Snapshot Storage Size
B. Cache used percentage
C. DWU Limit
D. Cache hit percentage

A

For monitoring columnstore segments in Azure Synapse SQL pool, the metrics that would help diagnose issues related to queried columnstore segments include:

B. Cache used percentage: This metric indicates how much of the columnstore cache is being used. A high cache used percentage could indicate that the columnstore cache is being heavily utilized, potentially causing performance issues due to cache pressure or inadequate cache size for the workload.

D. Cache hit percentage: This metric reveals the efficiency of the columnstore cache by indicating the percentage of queries that are satisfied by data in the cache rather than needing to fetch data from disk. A lower cache hit percentage might signify that queries are frequently fetching data from disk, impacting query performance.

Monitoring these metrics can provide insights into the columnstore cache utilization and effectiveness, aiding in diagnosing performance issues related to queried columnstore segments in Azure Synapse SQL pool.

274
Q

You manage an enterprise data warehouse in Azure Synapse Analytics.

Users report slow performance when they run commonly used queries.

Users do not report performance changes for infrequently used queries.

You need to monitor resource utilization to determine the source of the performance issues.

Which metric should you monitor?
A. DWU percentage
B. Cache hit percentage
C. DWU limit
D. Data IO percentage

A

Correct Answer: B 🗳️
Monitor and troubleshoot slow query performance by determining whether your workload is optimally leveraging the adaptive cache for dedicated SQL pools.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-how-to-monitor-cache

275
Q

hard

You have an Azure Databricks resource.

You need to log actions that relate to changes in compute for the Databricks resource.

Which Databricks services should you log?
A. clusters
B. workspace
C. DBFS
D. SSH
E. jobs

A

Selected Answer: A

Answer: A. Clusters

Why not workspace?

Workspace is not the service you should log to track changes in compute for the Databricks resource because it does not record events related to creating, editing, deleting, starting, or stopping clusters or jobs. Workspace events relate to actions performed on the workspace itself, such as creating, renaming, deleting, or importing notebooks, folders, libraries, or repos. These events do not affect the compute resources used by the Databricks resource, but rather the workspace content and configuration.
Therefore, workspace is not a relevant service for logging compute changes.

276
Q

You are designing a highly available Azure Data Lake Storage solution that will include geo-zone-redundant storage (GZRS).

You need to monitor for replication delays that can affect the recovery point objective (RPO).

What should you include in the monitoring solution?
A. 5xx: Server Error errors
B. Average Success E2E Latency
C. availability
D. Last Sync Time

A

In monitoring for replication delays that impact the Recovery Point Objective (RPO) in a geo-zone-redundant storage (GZRS) setup, you should include Last Sync Time as part of the monitoring solution. This metric provides information about the time of the last successful synchronization between paired regions, allowing you to gauge any replication delays and assess whether they impact the RPO.

277
Q

Hard

You configure monitoring for an Azure Synapse Analytics implementation. The implementation uses PolyBase to load data from comma-separated value (CSV) files stored in Azure Data Lake Storage Gen2 using an external table.

Files with an invalid schema cause errors to occur.

You need to monitor for an invalid schema error.

For which error should you monitor?
A. EXTERNAL TABLE access failed due to internal error: ‘Java exception raised on call to HdfsBridge_Connect: Error [com.microsoft.polybase.client.KerberosSecureLogin] occurred while accessing external file.’
B. Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query aborted- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
C. EXTERNAL TABLE access failed due to internal error: ‘Java exception raised on call to HdfsBridge_Connect: Error [Unable to instantiate LoginClass] occurred while accessing external file.’
D. EXTERNAL TABLE access failed due to internal error: ‘Java exception raised on call to HdfsBridge_Connect: Error [No FileSystem for scheme: wasbs] occurred while accessing external file.’

A

To monitor for an invalid schema error related to PolyBase loading data from CSV files stored in Azure Data Lake Storage Gen2 using an external table, you should monitor error B:

Cannot execute the query "Remote Query" against OLE DB provider "SQLNCLI11" for linked server "(null)". 
Query aborted - the maximum reject threshold (0 rows) was reached while reading from an external source: 
1 rows rejected out of total 1 rows processed.

This error specifically mentions the rejection of rows due to schema issues while reading from an external source, indicating that there are problems with the schema or structure of the data being read from the CSV files.

278
Q

link

You have an Azure Synapse Analytics dedicated SQL pool.

You run DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales'); and get the results shown in the following table.

| ROWS       | RESERVED_SPACE | DATA_SPACE | INDEX_SPACE | UNUSED_SPACE | PDW_NODE_ID | DISTRIBUTION_ID |
|------------|----------------|------------|--------------|--------------|-------------|-----------------|
| 694        | 2776           | 616        | 48           | 2112         | 1           | 1               |
| 407        | 2704           | 576        | 48           | 2080         | 1           | 2               |
| 53         | 2376           | 512        | 16           | 1848         | 1           | 3               |
| 58         | 2376           | 512        | 16           | 1848         | 1           | 4               |
| 168        | 2632           | 528        | 32           | 2072         | 1           | 5               |
| 195        | 2696           | 536        | 32           | 2128         | 1           | 6               |
| 5995       | 3464           | 1424       | 32           | 2008         | 1           | 7               |
| 0          | 2232           | 496        | 0            | 1736         | 1           | 8               |
| 264        | 2576           | 544        | 40           | 1992         | 1           | 9               |
| 3008       | 3016           | 960        | 32           | 2024         | 1           | 10              |
| ...        | ...            | ...        | ...          | ...          | ...         | ...             |
| 1550       | 2832           | 752        | 48           | 2032         | 1           | 50              |
| 1238       | 2832           | 696        | 40           | 2096         | 1           | 51              |
| 192        | 2632           | 528        | 32           | 2032         | 1           | 52              |

Which statement accurately describes the dbo.FactInternetSales table?
A. All distributions contain data.
B. The table contains less than 10,000 rows.
C. The table uses round-robin distribution.
D. The table is skewed.

A

Correct Answer: D 🗳️
Data skew means the data is not distributed evenly across the distributions.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

279
Q

You have two fact tables named Flight and Weather. Queries targeting the tables will be based on the join between the following columns.

| Table   | Column               |
|---------|----------------------|
| Flight  | ArrivalAirportID     |
| Flight  | ArrivalDateTime      |
| Weather | AirportID            |
| Weather | ReportDateTime       |

You need to recommend a solution that maximizes query performance.
What should you include in the recommendation?
A. In the tables use a hash distribution of ArrivalDateTime and ReportDateTime.
B. In the tables use a hash distribution of ArrivalAirportID and AirportID.
C. In each table, create an IDENTITY column.
D. In each table, create a column as a composite of the other two columns in the table.

A

Correct Answer: B 🗳️
Hash-distribution improves query performance on large fact tables.
Incorrect Answers:
A: Do not use a date column for hash distribution. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
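
A minimal sketch of option B; unrelated columns are omitted and the data types are assumptions:

CREATE TABLE dbo.Flight
(
    ArrivalAirportID int       NOT NULL,
    ArrivalDateTime  datetime2 NOT NULL
)
WITH (DISTRIBUTION = HASH (ArrivalAirportID), CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.Weather
(
    AirportID      int       NOT NULL,
    ReportDateTime datetime2 NOT NULL
)
WITH (DISTRIBUTION = HASH (AirportID), CLUSTERED COLUMNSTORE INDEX);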

280
Q

topic 4 #22

see question here
HOTSPOT -
You have an Azure Data Factory pipeline that has the activities shown in the following exhibit.

img here

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.
Hot Area:

Stored procedure1 will execute if Web1 and Set variable1 [answer choice]
complete
fail
succeed

If Web1 fails and Set variable2 succeeds, the pipeline status will be [answer choice]
Canceled
Failed
Succeeded

A

GPT and most of the chat think this:

For the first statement, the answer is most likely "succeed": Stored procedure1 will execute if Web1 and Set variable1 succeed.

For the second statement, if Web1 fails and Set variable2 succeeds, the pipeline status would probably be Failed, considering the failure of one of the activities within the pipeline.

281
Q

You have several Azure Data Factory pipelines that contain a mix of the following types of activities:
✑ Wrangling data flow
✑ Notebook
✑ Copy
✑ Jar

Which two Azure services should you use to debug the activities? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point
A. Azure Synapse Analytics
B. Azure HDInsight
C. Azure Machine Learning
D. Azure Data Factory
E. Azure Databricks

A

To debug activities within Azure Data Factory pipelines that contain various types of activities like Wrangling data flow, Notebook, Copy, and Jar, you should consider using the following Azure services:

D. Azure Data Factory: It offers built-in debugging capabilities and monitoring features for Data Factory pipelines.

E. Azure Databricks: It provides an environment for debugging, development, and execution of notebooks, making it suitable for debugging notebook activities.

So, the correct options for debugging the activities within these pipelines would be D. Azure Data Factory and E. Azure Databricks.

282
Q

#24

You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and a database named DB1. DB1 contains a fact table named Table1.

You need to identify the extent of the data skew in Table1.

What should you do in Synapse Studio?

A. Connect to the built-in pool and run sys.dm_pdw_nodes_db_partition_stats.

B. Connect to Pool1 and run DBCC CHECKALLOC.

C. Connect to the built-in pool and run DBCC CHECKALLOC.

D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.

A

To identify the extent of data skew in Table1 within Azure Synapse Analytics:

D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.

This dynamic management view (sys.dm_pdw_nodes_db_partition_stats) provides partition-level information like row count and distribution skewness, allowing you to assess data skew across the nodes in your dedicated SQL pool.

283
Q

#25

You manage an enterprise data warehouse in Azure Synapse Analytics.

Users report slow performance when they run commonly used queries.

Users do not report performance changes for infrequently used queries.

You need to monitor resource utilization to determine the source of the performance issues.

Which metric should you monitor?
A. Local tempdb percentage
B. Cache used percentage
C. Data IO percentage
D. CPU percentage

A

Correct Answer: B 🗳️
Monitor and troubleshoot slow query performance by determining whether your workload is optimally leveraging the adaptive cache for dedicated SQL pools.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-how-to-monitor-cache

284
Q

You have an Azure data factory.

You need to examine the pipeline failures from the last 180 days.

What should you use?
A. the Activity log blade for the Data Factory resource
B. Pipeline runs in the Azure Data Factory user experience
C. the Resource health blade for the Data Factory resource
D. Azure Data Factory activity runs in Azure Monitor

A

Correct Answer: D 🗳️
Data Factory stores pipeline-run data for only 45 days. Use Azure Monitor if you want to keep that data for a longer time.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor

285
Q

A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices.

The company must be able to monitor the devices in real-time.

You need to design the solution.

What should you recommend?
A. Azure Analysis Services using Azure PowerShell
B. Azure Stream Analytics Edge application using Microsoft Visual Studio
C. Azure Analysis Services using Microsoft Visual Studio
D. Azure Data Factory instance using Azure Portal

A

For real-time monitoring of IoT devices, especially when utilizing Azure IoT Hub, the suitable recommendation would be:

B. Azure Stream Analytics Edge application using Microsoft Visual Studio

Azure Stream Analytics is designed to process and analyze streaming data in real-time, making it an apt choice for monitoring IoT devices efficiently. The Edge application enables processing close to the data source, which is ideal for IoT scenarios, while Microsoft Visual Studio provides a familiar development environment for creating and deploying Stream Analytics applications.

286
Q

You have an Azure Synapse Analytics dedicated SQL pool named SA1 that contains a table named Table1.

You need to identify tables that have a high percentage of deleted rows.

What should you run?
A. sys.pdw_nodes_column_store_segments
B. sys.dm_db_column_store_row_group_operational_stats
C. sys.pdw_nodes_column_store_row_groups
D. sys.dm_db_column_store_row_group_physical_stats

A

Correct Answer: C 🗳️
Use sys.pdw_nodes_column_store_row_groups to determine which row groups have a high percentage of deleted rows and should be rebuilt.
Note: sys.pdw_nodes_column_store_row_groups provides clustered columnstore index information on a per-segment basis to help the administrator make system management decisions in Azure Synapse Analytics. sys.pdw_nodes_column_store_row_groups has a column for the total number of rows physically stored
(including those marked as deleted) and a column for the number of rows marked as deleted.
Incorrect:
Not A: You can join sys.pdw_nodes_column_store_segments with other system tables to determine the number of columnstore segments per logical table.
Not B: Use sys.dm_db_column_store_row_group_operational_stats to track the length of time a user query must wait to read or write to a compressed rowgroup or partition of a columnstore index, and identify rowgroups that are encountering significant I/O activity or hot spots.
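
A minimal sketch of such a check against sys.pdw_nodes_column_store_row_groups (mapping the results back to the logical table name would need sys.pdw_index_mappings and sys.pdw_nodes_indexes, which is omitted here for brevity):

SELECT pdw_node_id,
       distribution_id,
       row_group_id,
       total_rows,
       deleted_rows,
       100.0 * ISNULL(deleted_rows, 0) / NULLIF(total_rows, 0) AS deleted_pct
FROM sys.pdw_nodes_column_store_row_groups
WHERE total_rows > 0
ORDER BY deleted_pct DESC;   -- row groups with a high deleted percentage are rebuild candidates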

287
Q

Hard
You have an enterprise data warehouse in Azure Synapse Analytics.

You need to monitor the data warehouse to identify whether you must scale up to a higher service level to accommodate the current workloads.

Which is the best metric to monitor?

More than one answer choice may achieve the goal. Select the BEST answer.
A. DWU used
B. CPU percentage
C. DWU percentage
D. Data IO percentage

A

The “DWU percentage” (C) would likely be the best metric to monitor in this scenario. It provides a clear indicator of how much of the current Data Warehouse Unit (DWU) capacity is being utilized at any given time. Monitoring this metric helps in understanding if the current service level is adequate or if there’s a need to scale up to handle the workloads efficiently.

288
Q

A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices.

The company must be able to monitor the devices in real-time.

You need to design the solution.

What should you recommend?
A. Azure Analysis Services using Azure PowerShell
B. Azure Data Factory instance using Azure PowerShell
C. Azure Stream Analytics cloud job using Azure Portal
D. Azure Data Factory instance using Microsoft Visual Studio

A

To monitor IoT devices in real-time, the suitable solution would be:

C. Azure Stream Analytics cloud job using Azure Portal

Azure Stream Analytics provides a real-time data stream processing service that can efficiently handle high volumes of data generated by IoT devices. It’s specifically designed for real-time analytics, making it a fitting choice for monitoring IoT devices as the data streams in, allowing quick and efficient analysis and actions based on that data.

289
Q

#31

HOTSPOT -
You have an Azure event hub named retailhub that has 16 partitions. Transactions are posted to retailhub. Each transaction includes the transaction ID, the individual line items, and the payment details. The transaction ID is used as the partition key.

You are designing an Azure Stream Analytics job to identify potentially fraudulent transactions at a retail store. The job will use retailhub as the input. The job will output the transaction ID, the individual line items, the payment details, a fraud score, and a fraud indicator.

You plan to send the output to an Azure event hub named fraudhub.

You need to ensure that the fraud detection solution is highly scalable and processes transactions as quickly as possible.

How should you structure the output of the Stream Analytics job? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.
Hot Area:

Number of partitions:
1
8
16
32

Partition key:
Fraud indicator
Fraud score
Individual line items
Payment details
Transaction ID

A

Box 1: 16 -
For Event Hubs you need to set the partition key explicitly.
An embarrassingly parallel job is the most scalable scenario in Azure Stream Analytics. It connects one partition of the input to one instance of the query to one partition of the output.

Box 2: Transaction ID -
Reference:
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-features#partitions

For the Stream Analytics job output to be highly scalable and process transactions quickly, you’d want to distribute the data across multiple partitions to leverage parallel processing capabilities. Given the scenario, here’s the optimal configuration:

Number of partitions: 16

Partition key: Transaction ID

Explanation:

The number of partitions should match the number of partitions in the input event hub (retailhub) to ensure efficient processing.
Using Transaction ID as the partition key helps maintain the order of transactions and ensures that all events with the same Transaction ID go to the same partition, allowing easy retrieval and processing of related data.
This setup aligns with the scalability requirement while utilizing the transactional nature of the data for efficient processing.

290
Q

#32 topic 4

HOTSPOT -
You have an on-premises data warehouse that includes the following fact tables. Both tables have the following columns: DateKey, ProductKey, RegionKey.

There are 120 unique product keys and 65 unique region keys.

| Table | Comments |
|---|---|
| Sales | The table is 600 GB in size. DateKey is used extensively in the WHERE clause in queries. ProductKey is used extensively in join operations. RegionKey is used for grouping. Seventy-five percent of records relate to one of 40 regions. |
| Invoice | The table is 6 GB in size. DateKey and ProductKey are used extensively in the WHERE clause in queries. RegionKey is used for grouping. |

Queries that use the data warehouse take a long time to complete.

You plan to migrate the solution to use Azure Synapse Analytics. You need to ensure that the Azure-based solution optimizes query performance and minimizes processing skew.

What should you recommend? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point
Hot Area:

Sales:  [ Hash-distributed / Round-robin ]   [ DateKey / ProductKey / RegionKey ]

Invoices:     [ Hash-distributed / Round-robin ]     [ DateKey / ProductKey / RegionKey ]


A

Chat says:
1. Sales: Hash-distributed on ProductKey, because the table is larger than 2 GB and ProductKey is used extensively in joins.
2. Invoices: Hash-distributed on RegionKey, because the table size on disk is more than 2 GB and you should choose a distribution column that "Is not used in WHERE clauses. This could narrow the query to not run on all the distributions."

source: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choosing-a-distribution-column
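
A minimal DDL sketch of that recommendation (measure columns, constraints, and any partitioning are omitted; only the keys from the scenario are shown):

CREATE TABLE dbo.Sales
(
    DateKey    int NOT NULL,
    ProductKey int NOT NULL,
    RegionKey  int NOT NULL
    -- measure columns omitted
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),   -- ProductKey is used extensively in joins
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.Invoice
(
    DateKey    int NOT NULL,
    ProductKey int NOT NULL,
    RegionKey  int NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(RegionKey),    -- RegionKey is not used in WHERE clauses
    CLUSTERED COLUMNSTORE INDEX
);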

291
Q

#33

You have a partitioned table in an Azure Synapse Analytics dedicated SQL pool.

You need to design queries to maximize the benefits of partition elimination.

What should you include in the Transact-SQL queries?

A. JOIN
B. WHERE
C. DISTINCT
D. GROUP BY

A

Correct Answer: B 🗳️
To maximize the benefits of partition elimination in Azure Synapse Analytics when querying a partitioned table, you should include the WHERE clause in your Transact-SQL queries. This clause allows you to filter data based on partitioning columns, enabling the system to eliminate irrelevant partitions and focus the query on the necessary data subset.
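
A minimal sketch, assuming the table is partitioned on a date key column (the table and column names below are hypothetical):

-- The filter on the partitioning column lets the engine skip all other partitions.
SELECT SUM(SalesAmount) AS MonthlySales
FROM dbo.FactSales
WHERE OrderDateKey >= 20240101
  AND OrderDateKey <  20240201;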

292
Q

You have an Azure Stream Analytics query. The query returns a result set that contains 10,000 distinct values for a column named clusterID.

You monitor the Stream Analytics job and discover high latency.

You need to reduce the latency.

Which two actions should you perform? Each correct answer presents a complete solution.

NOTE: Each correct selection is worth one point.
A. Add a pass-through query.
B. Increase the number of streaming units.
C. Add a temporal analytic function.
D. Scale out the query by using PARTITION BY.
E. Convert the query to a reference query.

A

CONSENSUS ON THIS
To reduce latency in an Azure Stream Analytics query that returns 10,000 distinct values for a column named clusterID:

B. Increase the number of streaming units: This action increases the processing power available to handle the query workload, potentially reducing latency by allowing more resources to handle the data.

D. Scale out the query by using PARTITION BY: This enables the query to run in parallel on different partitions of the data, enhancing processing speed and potentially decreasing latency.

Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-streaming-unit-consumption
https://docs.microsoft.com/en-us/azure/stream-analytics/repartition
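
A minimal Stream Analytics query sketch showing option D (names such as [input], [output], and the window size are illustrative; on older compatibility levels the PARTITION BY clause must be stated explicitly):

SELECT PartitionId,
       clusterID,
       COUNT(*) AS eventCount
INTO [output]
FROM [input] PARTITION BY PartitionId
GROUP BY PartitionId, clusterID, TumblingWindow(second, 30)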

293
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and a database named DB1. DB1 contains a fact table named Table1.

You need to identify the extent of the data skew in Table1.

What should you do in Synapse Studio?

A. Connect to the built-in pool and query sys.dm_pdw_nodes_db_partition_stats.

B. Connect to the built-in pool and run DBCC CHECKALLOC.

C. Connect to Pool1 and query sys.dm_pdw_node_status.

D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats.

A

To identify the extent of data skew in a table within an Azure Synapse Analytics dedicated SQL pool:

D. Connect to Pool1 and query sys.dm_pdw_nodes_db_partition_stats. This dynamic management view provides details on the distribution of data across distributions, allowing you to assess data distribution and potential skew across partitions within the table.

294
Q

link
note: similar to, but different from, the linked question

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a fact table named Table1.

You need to identify the extent of the data skew in Table1.

What should you do in Synapse Studio?
A. Connect to Pool1 and DBCC PDW_SHOWSPACEUSED.

B. Connect to the built-in pool and run DBCC PDW_SHOWSPACEUSED.

C. Connect to the built-in pool and run DBCC CHECKALLOC.

D. Connect to the built-in pool and query sys.dm_pdw_sys_info.

A

A. Connect to Pool1 and DBCC PDW_SHOWSPACEUSED.
Selected Answer: A
https://github.com/rgl/azure-content/blob/master/articles/sql-data-warehouse/sql-data-warehouse-manage-distributed-data-skew.md

295
Q

hard hard
You use Azure Data Lake Storage Gen2.

You need to ensure that workloads can use filter predicates and column projections to filter data at the time the data is read from disk.

Which two actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. Reregister the Azure Storage resource provider.

B. Create a storage policy that is scoped to a container.

C. Reregister the Microsoft Data Lake Store resource provider.

D. Create a storage policy that is scoped to a container prefix filter.

E. Register the query acceleration feature.

A

Go with chat
E. Register the query acceleration feature.
D. Create a storage policy that is scoped to a container prefix filter.

To filter data at the time it is read from disk, you need to use the query acceleration feature of Azure Data Lake Storage Gen2. To enable this feature, you need to register the query acceleration feature in your Azure subscription.

In addition, you can use storage policies scoped to a container prefix filter to specify which files and directories in a container should be eligible for query acceleration. This can be used to optimize the performance of the queries by only considering a subset of the data in the container.

296
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a fact table named Table1.

You need to identify the extent of the data skew in Table1.

What should you do in Synapse Studio?

A. Connect to Pool1 and run DBCC PDW_SHOWSPACEUSED.
B. Connect to the built-in pool and run DBCC PDW_SHOWSPACEUSED.
C. Connect to Pool1 and run DBCC CHECKALLOC.
D. Connect to the built-in pool and query sys.dm_pdw_sys_info.

A

Selected Answer: A
Connect to Pool1 and run DBCC PDW_SHOWSPACEUSED

Azure Synapse Analytics dedicated SQL pool (formerly Azure SQL Data Warehouse) uses a massively parallel processing (MPP) architecture, and DBCC PDW_SHOWSPACEUSED reports the number of rows and the space used for each distribution. By running this command on Pool1 against the fact table Table1, you can identify the extent of data skew in Table1 and determine whether the data is evenly distributed across the distributions or skewed towards a few of them.
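
A minimal sketch of that check, run while connected to Pool1 (the table name comes from the question; the syntax follows the documented DBCC form):

-- Returns rows and reserved/used space per distribution for the table.
DBCC PDW_SHOWSPACEUSED("dbo.Table1");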

297
Q

You have an Azure Data Lake Storage Gen2 account that contains two folders named Folder1 and Folder2.

You use Azure Data Factory to copy multiple files from Folder1 to Folder2.

You receive the following error.

Operation on target Copy_sks failed: Failure happened on 'Sink' side.
ErrorCode=DelimitedTextMoreColumnsThanDefined,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error found when processing 'Csv/Tsv Format Text' source
'0_2020_11_09_11_43_32.avro' with row number 53: found more columns than expected column count 27., Source=Microsoft.DataTransfer.Common,'

What should you do to resolve the error?

A. Change the Copy activity setting to Binary Copy.
B. Lower the degree of copy parallelism.
C. Add an explicit mapping.
D. Enable fault tolerance to skip incompatible rows.

A

going with majority of chat
Selected Answer: A
Correct answer is A. We are just copying files between folders; with Binary Copy, ADF will not check the schema.
With D we would discard data
With C we would change file contents

298
Q

A company plans to use Apache Spark analytics to analyze intrusion detection data.

You need to recommend a solution to analyze network and system activity data for malicious activities and policy violations. The solution must minimize administrative efforts.

What should you recommend?

A. Azure HDInsight
B. Azure Data Factory
C. Azure Data Lake Storage
D. Azure Databricks

A

D. Azure Databricks

Azure Databricks provides a collaborative Apache Spark-based analytics platform that simplifies and streamlines the process of analyzing data at scale. It offers a powerful environment for processing large volumes of data efficiently, making it an ideal choice for analyzing network and system activity data for malicious activities and policy violations. The collaborative features and optimized Spark-based processing capabilities help minimize administrative efforts while performing complex analytics tasks.

299
Q

You have an Azure Synapse Analytics dedicated SQL pool.

You need to monitor the database for long-running queries and identify which queries are waiting on resources.

Which dynamic management view should you use for each requirement? To answer, select the appropriate options in the answer area.

NOTE: Each correct answer is worth one point.

Monitor the database for long-running queries:
* sys.dm_pdw_exec_requests
* sys.dm_pdw_sql_requests
* sys.dm_pdw_exec_sessions

Identify which queries are waiting on resources:
* sys.dm_pdw_waits
* sys.dm_pdw_lock_waits
* sys.resource_governor_workload_groups

A

Monitor the database for long-running queries:

sys.dm_pdw_exec_requests

Identify which queries are waiting on resources:

sys.dm_pdw_waits

The sys.dm_pdw_lock_waits view is specific to SQL Server and is used to monitor lock waits and lock resources in regular SQL Server environments, not in Azure Synapse Analytics dedicated SQL pools.

My answers are:
1. sys.dm_pdw_exec_requests
2. sys.dm_pdw_waits
There is a similar question in the Microsoft official practice assessment, and the explanation is the following:
The sys.dm_pdw_waits view holds information about all wait states encountered during the execution of a request or query, including locks and waits on a transmission queue.
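
A minimal sketch of both checks (the TOP value, status filter, and column lists are illustrative only):

-- Long-running requests: active queries ordered by elapsed time.
SELECT TOP 10 request_id, [status], submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE [status] NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY total_elapsed_time DESC;

-- Requests waiting on resources (locks, queued resources, and so on).
SELECT request_id, [type], object_type, object_name, [state]
FROM sys.dm_pdw_waits
ORDER BY request_time DESC;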

300
Q

tough, don't know
You have an Azure Data Factory pipeline named pipeline1 that includes a Copy activity named Copy1. Copy1 has the following configurations:

  • The source of Copy1 is a table in an on-premises Microsoft SQL Server instance that is accessed by using a linked service connected via a self-hosted integration runtime.
  • The sink of Copy1 uses a table in an Azure SQL database that is accessed by using a linked service connected via an Azure integration runtime.

You need to maximize the amount of compute resources available to Copy1. The solution must minimize administrative effort.

What should you do?

A. Scale out the self-hosted integration runtime.
B. Scale up the data flow runtime of the Azure integration runtime and scale out the self-hosted integration runtime.
C. Scale up the data flow runtime of the Azure integration runtime.

A

NOT SURE ON THIS, CHAT SAYS A

A. Scale out the self-hosted integration runtime.

GPT has said different things:
Scaling out the self-hosted integration runtime or scaling up the data flow runtime of the Azure integration runtime wouldn’t directly maximize the compute resources available to Copy1. The most effective way to maximize compute resources for Copy1 in this scenario is:

C. Scale up the data flow runtime of the Azure integration runtime.

By increasing the capacity of the Azure integration runtime’s data flow, you enhance its capability to handle and process data more efficiently, which optimizes the compute resources for the Copy activity.

301
Q

Tough question
You are designing a solution that will use tables in Delta Lake on Azure Databricks.

You need to minimize how long it takes to perform the following:

  • Queries against non-partitioned tables
  • Joins on non-partitioned columns

Which two options should you include in the solution? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. the clone command
B. Z-Ordering
C. Apache Spark caching
D. dynamic file pruning (DFP)

A

For optimizing queries and joins on non-partitioned tables and columns in Delta Lake on Azure Databricks:

B. Z-Ordering: Z-Ordering organizes data within files to colocate related information physically, aiding in efficient query processing and reducing the data shuffle during joins or queries on specific columns.

D. Dynamic File Pruning (DFP): DFP uses filter values derived at run time (for example, from join keys) together with file-level metadata to skip reading irrelevant data files, enhancing query performance significantly by reducing the amount of data scanned.

Both Z-Ordering and DFP play crucial roles in improving query performance and join operations on non-partitioned tables and columns in Delta Lake.
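
A minimal Databricks SQL sketch (the table name events and column deviceId are hypothetical; the configuration key shown is the one documented for dynamic file pruning, which is enabled by default):

-- Compact small files and Z-Order by a commonly joined/filtered column.
OPTIMIZE events ZORDER BY (deviceId);

-- Dynamic file pruning is on by default; the setting is shown only for illustration.
SET spark.databricks.optimizer.dynamicFilePruning=true;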

302
Q

You have an Azure Data Lake Storage Gen2 account named account1 that contains a container named container1.

You plan to create lifecycle management policy rules for container1.

You need to ensure that you can create rules that will move blobs between access tiers based on when each blob was accessed last.

What should you do first?

A. Configure object replication
B. Create an Azure application
C. Enable access time tracking
D. Enable the hierarchical namespace

A

Selected Answer: C
Answer is correct.
Customers store huge amounts of data in Azure Blob Storage. Sometimes this data is accessed frequently, and other times infrequently. Last access time tracking integrates with the lifecycle management of Azure Blob Storage to allow automatic tiering and deletion of data based on when each blob was last accessed.

303
Q

You manage an enterprise data warehouse in Azure Synapse Analytics.

Users report slow performance when they run commonly used queries. Users do not report performance changes for infrequently used queries.

You need to monitor resource utilization to determine the source of the performance issues.

Which metric should you monitor?

A. DWU limit
B. Data IO percentage
C. Cache hit percentage
D. CPU percentage

A

Correct Answer: C 🗳️
Commonly used queries are served from the adaptive cache, so a falling cache hit percentage points to the cache as the likely source of the slowdown.

304
Q

You have an Azure data factory named DF1 that contains 10 pipelines.

The pipelines are executed hourly by using a schedule trigger. All activities are executed on an Azure integration runtime.

You need to ensure that you can identify trends in queue times across the pipeline executions and activities. The solution must minimize administrative effort.

How should you configure the Diagnostic settings for DF1? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Collect:
* Pipeline activity runs log
* Pipeline runs log
* Trigger runs log

Send to:
* Event hub
* Log Analytics workspace
* Storage account

A

Collect:

Pipeline activity runs log

Send to:

Log Analytics workspace

305
Q

You manage an enterprise data warehouse in Azure Synapse Analytics.

Users report slow performance when they run commonly used queries. Users do not report performance changes for infrequently used queries.

You need to monitor resource utilization to determine the source of the performance issues.

Which metric should you monitor?

A. DWU percentage
B. Cache hit percentage
C. Data Warehouse Units (DWU) used
D. Data IO percentage

A

Selected Answer: B
Monitoring DWU used (option C) can certainly be part of a comprehensive approach to diagnosing the performance issues, but focusing on the cache hit percentage (option B) offers a more targeted way to address the specific problem described in the scenario.

306
Q

link
HOTSPOT
-

You have an Azure subscription that contains the resources shown in the following table.

|------|----------------------------------------|-------------------------|
| ws1  | Azure Synapse Analytics workspace      | None                    |
| kv1  | Azure Key Vault                        | None                    |
| UAMI1| User-assigned managed identity         | Associated with ws1     |
| sp1  | Apache Spark pool in Azure Synapse Analytics | Associated with ws1 |

You need to ensure that you can run Spark notebooks in ws1. The solution must ensure that you can retrieve secrets from kv1 by using UAMI1.

What should you do? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

In the Azure portal:
* Add a role-based access control (RBAC) role to kv1.
* Add a role-based access control (RBAC) role to ws1.
* Create a linked service to kv1.

In Synapse Studio:
* Add a role-based access control (RBAC) role to kv1.
* Add a role-based access control (RBAC) role to ws1.
* Create a linked service to kv1.


A

In the Azure portal:
Add a role-based access control (RBAC) role to kv1.

In Synapse Studio:
Create a linked service to kv1.

307
Q

#49 topic 4

You have an Azure Data Factory pipeline shown in the following exhibit.

IMG

The execution log for the first pipeline run is shown in the following exhibit.

IMG

The execution log for the second pipeline run is shown in the following exhibit.

IMG

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

Statements

The Retry property of the Web_GetIP activity is set to 1. YES/NO

The waitOnCompletion property of the Exec_COPY_BLOB activity is set to true. YES/NO

The Exec_COPY_BLOB activity was skipped during the second run due to pipeline dependencies. YES/NO

A

No, No, No

The Retry Property is not set to one for Web_GetIP: Otherwise, we would see a retry of that activity in the first run.

waitOnCompletion property is not set to true: In the second run, Exec_COPY_BLOB takes as long as in the first one, despite being skipped. So, it could not have been waiting for the pipeline that it had triggered to complete.

Exec_COPY_BLOB cannot be skipped due to a pipeline dependency since it is the first activity in the pipeline. Most likely, its activity state was manually set to 'skipped'.

308
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a fact table named Table1.

You need to identify the extent of the data skew in Table1.

What should you do in Synapse Studio?

A. Connect to the built-in pool and query sys.dm_pdw_nodes_db_partition_stats.
B. Connect to Pool1 and run DBCC PDW_SHOWSPACEUSED.
C. Connect to Pool1 and query sys.dm_pdw_node_status.
D. Connect to the built-in pool and query sys.dm_pdw_sys_info.

A

Selected Answer: B
It is indeed B:
LINK

309
Q

hard
You have several Azure Data Factory pipelines that contain a mix of the following types of activities:

  • Power Query
  • Notebook
  • Copy
  • Jar

Which two Azure services should you use to debug the activities? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A. Azure Machine Learning
B. Azure Data Factory
C. Azure Synapse Analytics
D. Azure HDInsight
E. Azure Databricks

A

For debugging activities within your Azure Data Factory pipelines, you should consider using:

B. Azure Data Factory: This service provides features like monitoring, logs, and the ability to view pipeline run details, making it an essential tool for debugging Data Factory workflows.

E. Azure Databricks: It offers comprehensive capabilities for debugging activities, especially for Notebook and Jar activities, providing interactive debugging, log tracking, and monitoring tools.

These services can offer specific functionalities for debugging different activity types within your Data Factory pipelines.

310
Q

A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices.

The company must be able to monitor the devices in real-time.

You need to design the solution.

What should you recommend?

A. Azure Analysis Services using Microsoft Visual Studio
B. Azure Data Factory instance using Azure PowerShell
C. Azure Stream Analytics cloud job using Azure Portal
D. Azure Analysis Services using Azure PowerShell

A

To monitor IoT devices in real-time, the suitable solution would be:

C. Azure Stream Analytics cloud job using Azure Portal

Azure Stream Analytics provides a real-time data stream processing service that can efficiently handle high volumes of data generated by IoT devices. It’s specifically designed for real-time analytics, making it a fitting choice for monitoring IoT devices as the data streams in, allowing quick and efficient analysis and actions based on that data.

311
Q

You have an Azure Synapse Analytics dedicated SQL pool named pool1.

You need to perform a monthly audit of SQL statements that affect sensitive data. The solution must minimize administrative effort.

What should you include in the solution?

A. workload management
B. sensitivity labels
C. dynamic data masking
D. Microsoft Defender for SQL

A

To perform a monthly audit of SQL statements affecting sensitive data with minimal administrative effort in Azure Synapse Analytics dedicated SQL pool, the most suitable approach would be:

B. Sensitivity Labels: Implementing sensitivity labels allows you to classify and label sensitive data. You can then track and audit access to data based on these sensitivity labels. This method enables you to easily identify and audit SQL statements that interact with sensitive information, ensuring compliance and security measures are met.
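
If the sensitivity-label approach from the explanation above is used, labels can be applied with T-SQL sensitivity classification statements. A minimal sketch, assuming the classification syntax is available in the dedicated SQL pool (the table, column, and label names are hypothetical):

-- Label a column that holds sensitive data; audits can then report on access to labeled columns.
ADD SENSITIVITY CLASSIFICATION TO dbo.DimCustomer.EmailAddress
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');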

312
Q

A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices.

The company must be able to monitor the devices in real-time.

You need to design the solution.

What should you recommend?

A. Azure Analysis Services using Azure PowerShell
B. Azure Stream Analytics Edge application using Microsoft Visual Studio
C. Azure Analysis Services using Microsoft Visual Studio
D. Azure Data Factory instance using Azure Portal

A

For real-time monitoring of IoT devices through Azure IoT Hub, the appropriate solution is:

B. Azure Stream Analytics Edge application using Microsoft Visual Studio
Azure Stream Analytics Edge allows real-time data processing close to the data source, enabling immediate analysis and monitoring of incoming data from IoT devices. With Stream Analytics Edge, you can process and analyze data as it arrives, ensuring swift monitoring of manufacturing machinery.

313
Q

You have an Azure data factory.

You execute a pipeline that contains an activity named Activity1. Activity1 produces the following output.

{
  "dataRead": 1208,
  "dataWritten": 1208,
  "filesRead": 1,
  "filesWritten": 1,
  "sourcePeakConnections": 3,
  "sinkPeakConnections": 2,
  "copyDuration": 13,
  "throughput": 0.147,
  "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (West Central US)",
  "usedDataIntegrationUnits": 4,
  "reportLineageToPurview": {
    "status": "Succeeded",
    "durationInSecond": "4"
  }
}

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

Activity1 is a Copy activity. YES/NO

Activity1 is executed by using a self-hosted integration runtime. YES/NO

The data factory that executed the pipeline is connected to Microsoft Purview. YES/NO

A
  1. Activity1 appears to be a Copy activity given the “dataRead,” “dataWritten,” and “copyDuration” fields.
  2. Activity1 does not use a self-hosted integration runtime; it uses an Azure integration runtime as indicated by “AutoResolveIntegrationRuntime.”
  3. The data factory is connected to Microsoft Purview, as evidenced by the “reportLineageToPurview” section indicating a successful status.
    Hence, the correct answers are:
    - Yes, Activity1 is a Copy activity.
    - No, Activity1 is not executed using a self-hosted integration runtime.
    - Yes, the data factory that executed the pipeline is connected to Microsoft Purview.

314
Q

You manage an enterprise data warehouse in Azure Synapse Analytics.

Users report slow performance when they run commonly used queries.
Users do not report performance changes for infrequently used queries.

You need to monitor resource utilization to determine the source of the performance issues.

Which metric should you monitor?

A. DWU percentage
B. Cache hit percentage
C. DWU limit
D. Data Warehouse Units (DWU) used

A

Correct Answer: B 🗳️

316
Q

FROM DISCUSSION

You are creating a new notebook in Azure Databricks that will support R as the primary language but will also support Scala and SQL.

Which switch should you use to switch between languages?

A. %<language>
B. \\[<language>]
C. \\(<language>)
D. @<Language>

A

Answer: A
Explanation:
You can override the primary language by specifying the language magic
command %<language> at the beginning of a cell. The supported magic commands
are: %python, %r, %scala, and %sql.

In Azure Databricks notebooks, you can switch between languages using “magic commands” or “magic switches.” For R, Scala, and SQL, the magic commands typically used are:

A. %<language> (e.g., %r, %scala, %sql)

So, in this case, the correct switch to switch between languages would be A. %<language>.
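
A minimal sketch of such a cell in an R-primary notebook (the query itself is arbitrary; the point is the magic command on the first line):

%sql
-- This cell runs as SQL even though the notebook's primary language is R.
SELECT current_date() AS today;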

317
Q

You have an Azure subscription that contains an Azure Synapse Analytics workspace and a user named User1.

You need to ensure that User1 can review the Azure Synapse Analytics database templates from the gallery. The solution must follow the principle of least privilege.

Which role should you assign to User1?

A. Storage Blob Data Contributor.
B. Synapse Administrator
C. Synapse Contributor
D. Synapse User

A

D. Synapse User. It is the least-privileged role that still lets User1 browse the database templates in the gallery.

318
Q

You have a Log Analytics workspace named la1 and an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 sends logs to la1.

You need to identify whether a recently executed query on Pool1 used the result set cache.

What are two ways to achieve the goal? Each correct answer presents a complete solution.

NOTE: Each correct selection is worth one point.

A. Review the sys.dm_pdw_sql_requests dynamic management view in Pool1.
B. Review the sys.dm_pdw_exec_requests dynamic management view in Pool1.
C. Use the Monitor hub in Synapse Studio.
D. Review the AzureDiagnostics table in la1.
E. Review the sys.dm_pdw_request_steps dynamic management view in Pool1.

A

To identify whether a recently executed query on Azure Synapse Analytics dedicated SQL pool (Pool1) used the result set cache, you can leverage the following methods:

A. Review the sys.dm_pdw_sql_requests dynamic management view in Pool1.
- This view provides information about the queries executed in the dedicated SQL pool, including details about whether the result set cache was used.

B. Review the sys.dm_pdw_exec_requests dynamic management view in Pool1.
- Similar to the sys.dm_pdw_sql_requests view, this view provides insights into the queries executed in the dedicated SQL pool, allowing you to identify if the result set cache was utilized.

These dynamic management views within the SQL pool itself (option A and B) provide specific details about query execution and cache utilization.

The other options:

C. Use the Monitor hub in Synapse Studio.
- While the Monitor hub offers monitoring and insights, it might not provide detailed information about the result set cache usage for individual queries in the SQL pool.

D. Review the AzureDiagnostics table in la1.
- The AzureDiagnostics table in Log Analytics might capture overall system-level diagnostic information, but it might not specifically detail query-level result set cache usage.

E. Review the sys.dm_pdw_request_steps dynamic management view in Pool1.
- This view provides details about the steps involved in query execution but might not specifically highlight the result set cache usage for a query.

So, the correct ways to achieve the goal of identifying whether a recently executed query on Pool1 used the result set cache are options A and B by reviewing the sys.dm_pdw_sql_requests and sys.dm_pdw_exec_requests dynamic management views in Pool1.
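
A minimal sketch of the DMV check (the request ID is a hypothetical placeholder; sys.dm_pdw_exec_requests exposes a result_cache_hit column in dedicated SQL pools):

SELECT request_id, command, result_cache_hit   -- 1 = served from the result set cache
FROM sys.dm_pdw_exec_requests
WHERE request_id = 'QID12345';                 -- hypothetical request ID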

319
Q
A

Box 1: SCHEMABINDING

Box 2: Filter

320
Q

You have an Azure data factory named DF1. DF1 contains a single pipeline that is executed by using a schedule trigger.

From Diagnostics settings, you configure pipeline runs to be sent to a resource-specific destination table in a Log Analytics workspace.

You need to run KQL queries against the table.

Which table should you query?

A. ADFPipelineRun
B. ADFTriggerRun
C. ADFActivityRun
D. AzureDiagnostics

A

A. ADFPipelineRun. When pipeline runs are sent to a resource-specific destination table, they land in the ADFPipelineRun table. (ADFTriggerRun holds trigger runs, ADFActivityRun holds activity runs, and AzureDiagnostics is used only in Azure diagnostics mode.)

321
Q
A
322
Q

You have an Azure subscription that contains an Azure Synapse workspace named WS1 and an Azure Monitor action group named Group1. WS1 has a dedicated SQL pool.

You plan to archive monitoring data for integration activity runs.

You need to ensure that you can configure custom alerts based on the archived data that will execute Group1. The solution must minimize administrative effort.

Which diagnostic setting should you select?

A. Send to Log Analytics workspace
B. Archive to a storage account
C. Stream to an event hub
D. Send to a partner solution

A

A. Send to Log Analytics workspace. Sending the archived monitoring data to a Log Analytics workspace lets you create custom log alert rules that invoke Group1, with minimal administrative effort.

323
Q
A
324
Q

You have an Azure subscription that contains an Azure Synapse Analytics workspace named workspace1. workspace1 contains an Azure Synapse Analytics dedicated SQL pool named Pool1.

You create a mapping data flow in an Azure Synapse pipeline that writes data to Pool1.

You execute the data flow and capture the execution information.

You need to identify how long it takes to write the data to Pool1.

Which metric should you use?

A. the rows written
B. the sink processing time
C. the transformation processing time
D. the post processing time

A

To identify how long it takes to write data to Pool1 using a mapping data flow in Azure Synapse Analytics, you should use the metric:

B. The sink processing time

The sink processing time represents the duration taken by the final sink operation, which is responsible for writing the data to the destination (in this case, writing data to Pool1). It measures the time taken from the start of processing within the sink component until it completes writing the data to the target destination.

This metric specifically tracks the time taken during the data writing process, providing insight into the duration it takes for the data flow to complete writing data into the specified destination (Pool1 in this scenario).

325
Q
A