Exam questions Flashcards
You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.
CREATE TABLE [dbo].[DimEmployee] ( [EmployeeKey] [int] IDENTITY(1,1) NOT NULL, [EmployeeID] [int] NOT NULL, [FirstName] [varchar](100) NOT NULL, [LastName] [varchar](100) NOT NULL, [JobTitle] [varchar](100) NULL, [LastHireDate] [date] NULL, [StreetAddress] [varchar](500) NOT NULL, [City] [varchar](200) NOT NULL, [StateProvince] [varchar](50) NOT NULL, [PostalCode] [varchar](10) NOT NULL )
You need to alter the table to meet the following requirements:
✑ Ensure that users can identify the current manager of employees.
✑ Support creating an employee reporting hierarchy for your entire company.
✑ Provide fast lookup of the managers’ attributes such as name and job title.
Which column should you add to the table?
A. [ManagerEmployeeID] [smallint] NULL
B. [ManagerEmployeeKey] [smallint] NULL
C. [ManagerEmployeeKey] [int] NULL
D. [ManagerName] varchar NULL
Correct Answer: C
We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is int.
C, because the manager reference should use the same data type as the key it points to (EmployeeKey, an int).
Reference:
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular
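To make the fix concrete, here is a minimal T-SQL sketch; the self-join is illustrative only, not part of the question. ManagerEmployeeKey joins back to EmployeeKey on the same dimension, which supports the hierarchy and fast manager lookups.
~~~
-- Sketch only: ManagerEmployeeKey references EmployeeKey (int) on the same dimension.
ALTER TABLE [dbo].[DimEmployee]
ADD [ManagerEmployeeKey] [int] NULL;

-- Illustrative self-join to look up a manager's attributes.
SELECT e.EmployeeKey,
       e.FirstName,
       m.FirstName AS ManagerFirstName,
       m.JobTitle  AS ManagerJobTitle
FROM dbo.DimEmployee AS e
LEFT JOIN dbo.DimEmployee AS m
    ON e.ManagerEmployeeKey = m.EmployeeKey;
~~~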
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable( EmployeeID int, EmployeeName string, EmployeeStartDate date)
USING Parquet
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
EmployeeName|EmployeeStartDate|EmployeeID
Alice | 2020-01-25 | 24
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
SELECT EmployeeID FROM mytestdb.dbo.myParquetTable WHERE EmployeeName = 'Alice';
What will be returned by the query?
A. 24
B. an error
C. a null value
I did a test: I waited one minute and tried the query in a serverless SQL pool, and I received 24 as the result. So I don't understand why B has been voted so much, because the answer is A) 24 without a doubt.
Debate on B as follows
Answer is B, but not because of the lowercase. The case has nothing to do with the error.
If you look attentively, you will notice that we create table mytestdb.myParquetTable, but the select statement contains the reference to table mytestdb.dbo.myParquetTable (!!! - dbo).
Here is the error message I got:
Error: spark_catalog requires a single-part namespace, but got [mytestdb, dbo].
I just tried to run the commands, and the error you got is because you queried through a Spark pool (!!). I did that as a test and got the exact same error. To query the data using a Spark pool, you don't use the '.dbo' reference; that only works if you're using a Synapse serverless SQL pool.
So the correct answer is A!
HARD
DRAG DROP -
You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36 months and has the following characteristics:
✑ Is partitioned by month
✑ Contains one billion rows
✑ Has clustered columnstore index
At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.
Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Actions
- Switch the partition containing the stale data from SalesFact to SalesFact_Work.
- Truncate the partition containing the stale data.
- Drop the SalesFact_Work table.
- Create an empty table named SalesFact_Work that has the same schema as SalesFact.
- Execute a DELETE statement where the value in the Date column is more than 36 months ago.
- Copy the data to a new table by using CREATE TABLE AS SELECT (CTAS).
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.
SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the partitions align on their respective boundaries and that the table definitions match.
Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch in the new data.
Step 3: Drop the SalesFact_Work table.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
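A minimal T-SQL sketch of the three steps, with hypothetical column names, distribution, and boundary values; the work table must match SalesFact's schema, distribution, and partition boundaries.
~~~
-- Step 1: empty work table with matching schema, distribution, and partition boundaries (placeholders shown).
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20190101, 20190201 /* ... one boundary per month ... */))
)
AS SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: metadata-only switch of the stale month (partition 1 here) out of SalesFact.
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;

-- Step 3: drop the work table, and the stale data with it.
DROP TABLE dbo.SalesFact_Work;
~~~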
You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.
~~~
/topfolder/
    File1.csv
    /folder1/
        File2.csv
    /folder2/
        File3.csv
    File4.csv
~~~
You create an external table named ExtTable that has LOCATION='/topfolder/'.
When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?
A. File2.csv and File3.csv only
B. File1.csv and File4.csv only
C. File1.csv, File2.csv, File3.csv, and File4.csv
D. File1.csv only
I believe the answer should be B.
In case of a serverless pool a wildcard should be added to the location.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
“Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path.”
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
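For illustration, a hedged sketch of the external table with a recursive location; the data source, file format, and column names are assumptions, not from the question.
~~~
-- '/topfolder/' returns only files directly in that folder (File1.csv and File4.csv);
-- '/topfolder/**' would also traverse folder1 and folder2.
CREATE EXTERNAL TABLE ExtTableRecursive
(
    [Col1] VARCHAR(100),
    [Col2] VARCHAR(100)
)
WITH
(
    LOCATION = '/topfolder/**',
    DATA_SOURCE = MyDataLake,     -- assumed external data source
    FILE_FORMAT = MyCsvFormat     -- assumed CSV external file format
);
~~~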
HOTSPOT -
You are planning the deployment of Azure Data Lake Storage Gen2.
You have the following two reports that will access the data lake:
✑ Report1: Reads three columns from a file that contains 50 columns.
✑ Report2: Queries a single record based on a timestamp.
You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.
What should you recommend for each report? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
Report1:
* Avro
* CSV
* Parquet
* TSV
Report2:
* Avro
* CSV
* Parquet
* TSV
~~~
1: Parquet - a column-oriented binary file format
2: Avro - a row-based format, and it has a logical timestamp type
https://youtu.be/UrWthx8T3UY
You are designing the folder structure for an Azure Data Lake Storage Gen2 container.
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current year or current month.
Which folder structure should you recommend to support fast queries and simplified folder security?
A. /{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}{YYYY}{MM}{DD}.csv
B. /{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
C. /{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}{DD}.csv
Correct Answer: D
There’s an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went on.
Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices, organizations, and customers. It’s important to pre-plan the directory layout for organization, security, and efficient processing of the data for down-stream consumers. A general template to consider might be the following layout:
{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/
Serverless SQL pools offer a straightforward method of querying data, including CSV, JSON, and Parquet formats, stored in Azure Storage.
So, setting up the CSV files within Azure Storage in a Hive-formatted folder hierarchy, i.e. /{yyyy}/{mm}/{dd}/, actually helps SQL query the data much faster, since only the partitioned segment of the data is queried.
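As a hedged example of why the date folders at the end help (the storage path and container name are hypothetical), a serverless query can prune folders with the filepath() function:
~~~
-- Only the {YYYY}/{MM} folders that match the filters are read.
SELECT r.*
FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/data/Sales/SourceA/*/*/*/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
     ) AS r
WHERE r.filepath(1) = '2021'   -- first wildcard = {YYYY}
  AND r.filepath(2) = '06';    -- second wildcard = {MM}
~~~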
HOTSPOT -
You need to output files from Azure Data Factory.
Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
Columnar format:
* Avro
* GZip
* Parquet
* TXT
JSON with a timestamp:
* Avro
* GZip
* Parquet
* TXT
~~~
Box 1: Parquet -
Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.
Box 2: Avro -
An Avro schema is created using JSON format.
AVRO supports timestamps.
Note: Azure Data Factory supports the following file formats (not GZip or TXT):
✑ Avro format
✑ Binary format
✑ Delimited text format
✑ Excel format
✑ JSON format
✑ ORC format
✑ Parquet format
✑ XML format
Reference:
https://www.datanami.com/2018/05/16/big-data-file-formats-demystified
HOTSPOT -
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.
You need to move the files to a different folder and transform the data to meet the following requirements:
✑ Provide the fastest possible query times.
✑ Automatically infer the schema from the underlying files.
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Copy behavior:
* Flatten hierarchy
* Merge files
* Preserve hierarchy
Sink file type:
* CSV
* JSON
* Parquet
* TXT
1. Merge Files
2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
Larger files lead to better performance and reduced costs.
Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size).
Hard
HOTSPOT -
You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit.
see site for img of this
Dim_Employee
* iEmployeeID
* vcEmployeeLastName
* vcEmployeeMName
* vcEmployeeFirstName
* dtEmployeeHireDate
* dtEmployeeLevel
* dtEmployeeLastPromotion
Fact_DailyBookings
* iDailyBookingsID
* iCustomerID
* iTimeID
* iEmployeeID
* iItemID
* iQuantityOrdered
* dExchangeRate
* iCountryofOrigin
* mUnitPrice
Dim_Customer
* iCustomerID
* vcCustomerName
* vcCustomerAddress1
* vcCustomerCity
Dim_Time
* iTimeID
* iCalendarDay
* iCalendarWeek
* iCalendarMonth
* vcDayofWeek
* vcDayofMonth
* vcDayofYear
* iHolidayIndicator
All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be relatively static with very few data inserts and updates.
Which type of table should you use for each table? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Dim_Customer:
* Hash distributed
* Round-robin
* Replicated
Dim_Employee:
* Hash distributed
* Round-robin
* Replicated
Dim_Time:
* Hash distributed
* Round-robin
* Replicated
Fact_DailyBookings:
* Hash distributed
* Round-robin
* Replicated
Box 1: Replicated -
Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently implemented as round-robin to replicated.
Box 2: Replicated -
Box 3: Replicated -
Box 4: Hash-distributed -
For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Reference:
https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/
The answer is correct.
The dimensions are under 2 GB, so there is no point in using hash distribution for them.
Common distribution methods for tables:
The table category often determines which option to choose for distributing the table.
Table category - Recommended distribution option
Fact - Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Dimension - Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Staging - Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-methods-for-tables
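A minimal sketch of the resulting DDL; the column lists are trimmed and illustrative, not the full model.
~~~
-- Small dimension: replicate to every Compute node.
CREATE TABLE dbo.Dim_Customer
(
    iCustomerID    INT NOT NULL,
    vcCustomerName VARCHAR(100) NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Large fact table: hash-distribute on a join key.
CREATE TABLE dbo.Fact_DailyBookings
(
    iDailyBookingsID INT NOT NULL,
    iCustomerID      INT NOT NULL,
    iQuantityOrdered INT NULL
)
WITH (DISTRIBUTION = HASH(iCustomerID), CLUSTERED COLUMNSTORE INDEX);
~~~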
SIMILAR TO ANOTHER QUESTION BUT SAME ANSWERS
HOTSPOT -
You have an Azure Data Lake Storage Gen2 container.
Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read files in the container but cannot modify the files.
You need to design a data archiving solution that meets the following requirements:
✑ New data is accessed frequently and must be available as quickly as possible.
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:
~~~
Five-year-old data:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.
Seven-year-old data:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.
~~~
Box 1: Move to cool storage -
Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.
DRAG DROP -
You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* CLUSTERED INDEX
* COLLATE
* DISTRIBUTION
* PARTITION
* PARTITION FUNCTION
* PARTITION SCHEME
Answer Area
~~~
CREATE TABLE table1
(
ID INTEGER,
col1 VARCHAR(10),
col2 VARCHAR (10)
) WITH
<XXXXXXXXXXXX> = HASH (ID) ,
<YYYYYYYYYYYYY> (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000))
~~~
Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH(distribution_column_name), which assigns each row to one distribution by hashing the value stored in distribution_column_name.
Box 2: PARTITION -
Table partition options. Syntax:
PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,…n] ] ))
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
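For reference, the completed statement (Box 1 = DISTRIBUTION, Box 2 = PARTITION) reads:
~~~
CREATE TABLE table1
(
    ID   INTEGER,
    col1 VARCHAR(10),
    col2 VARCHAR(10)
)
WITH
(
    DISTRIBUTION = HASH(ID),
    PARTITION (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000))
);
~~~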
You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:
✑ Can return an employee record from a given point in time.
✑ Maintains the latest employee information.
✑ Minimizes query complexity.
How should you model the employee data?
A. as a temporal table
B. as a SQL graph table
C. as a degenerate dimension table
D. as a Type 2 slowly changing dimension (SCD) table
Correct Answer: D 🗳️
A Type 2 SCD supports versioning of dimension members. Often the source system doesn’t store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
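A minimal sketch of what a Type 2 SCD employee dimension could look like; the column names are illustrative assumptions.
~~~
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey INT IDENTITY(1,1) NOT NULL,  -- surrogate key, one per version
    EmployeeID  INT NOT NULL,                -- business key from the source system
    FirstName   VARCHAR(100) NOT NULL,
    JobTitle    VARCHAR(100) NULL,
    StartDate   DATE NOT NULL,               -- version validity start
    EndDate     DATE NULL,                   -- version validity end (NULL = open)
    IsCurrent   BIT NOT NULL                 -- flag for the latest version
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
~~~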
Hard
You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named VNET1.
You are building a SQL pool in Azure Synapse that will use data from the data lake.
Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used to assign the
Sales group access to the files in the data lake.
You plan to load data to the SQL pool every hour.
You need to ensure that the SQL pool can load the sales data from the data lake.
Which three actions should you perform? Each correct answer presents part of the solution.
NOTE: Each area selection is worth one point.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
C. Create a shared access signature (SAS).
D. Add your Azure Active Directory (Azure AD) account to the Sales group.
E. Use the shared access signature (SAS) as the credentials for the data load process.
F. Create a managed identity.
F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
The managed identity grants permissions to the dedicated SQL pools in the workspace.
Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically managed identity in Azure AD.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity
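A hedged sketch of the load step, with hypothetical storage account, container, and table names; the workspace managed identity (the one added to the Sales group) is supplied as the credential.
~~~
COPY INTO dbo.StageSales
FROM 'https://datalake1.dfs.core.windows.net/sales/2021/*.parquet'
WITH
(
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
~~~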
VIEW WEBSITE FOR IMGs
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.
[MISSING STUFF]
User1 executes a query on the database, and the query returns the results shown in the following exhibit.
[MISSING STUFF]
User1 is the only user who has access to the unmasked data.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:
When User2 queries the YearlyIncome column, the values returned will be [answer choice].
* a random number
* the values stored in the database
* XXXX
* 0
When User1 queries the BirthDate column, the values returned will be [answer choice].
* a random date
* the values stored in the database
* xxxX
* 1900-01-01
Box 1: 0 -
The YearlyIncome column is of the money data type.
The Default masking function: Full masking according to the data types of the designated fields
✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
Box 2: the values stored in the database
Users with administrator privileges are always excluded from masking, and see the original data without any mask.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
* Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
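A minimal sketch of how such masks could be defined; the table name is hypothetical, while the column names follow the exhibit.
~~~
ALTER TABLE dbo.DimCustomer
ALTER COLUMN YearlyIncome ADD MASKED WITH (FUNCTION = 'default()');  -- money -> 0 for masked users

ALTER TABLE dbo.DimCustomer
ALTER COLUMN BirthDate ADD MASKED WITH (FUNCTION = 'default()');     -- date -> 1900-01-01 for masked users

GRANT UNMASK TO User1;  -- User1 keeps seeing the values stored in the database
~~~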
You have an enterprise data warehouse in Azure Synapse Analytics.
Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing the data to the data warehouse.
The external table has three columns.
You discover that the Parquet files have a fourth column named ItemID.
Which command should you run to add the ItemID column to the external table?
A.
ALTER EXTERNAL TABLE [Ext].[Items] ADD [ItemID] int;
B.
DROP EXTERNAL FILE FORMAT parquetfile1;
CREATE EXTERNAL FILE FORMAT parquetfile1 WITH (FORMAT_TYPE = PARQUET, DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');
C.
DROP EXTERNAL TABLE [Ext].[Items]
CREATE EXTERNAL TABLE [Ext].[Items] ([ItemID] [int] NULL, [ItemName] nvarchar(50) NULL, [ItemType] nvarchar(20) NULL, [ItemDescription] nvarchar(250)) WITH (LOCATION = '/Items/', DATA_SOURCE = AzureDataLakeStore, FILE_FORMAT = PARQUET, REJECT_TYPE = VALUE, REJECT_VALUE = 0)
D.
ALTER TABLE [Ext].[Items] ADD [ItemID] int;
C is correct, since “altering the schema or format of an external SQL table is not supported”.
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/external-sql-tables
HOTSPOT -
You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace enabled. The system has files that contain data stored in the Apache Parquet format.
You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following requirements:
✑ No transformations must be performed.
✑ The original folder structure must be retained.
✑ Minimize time required to perform the copy activity.
How should you configure the copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Source dataset type:
* Binary
* Parquet
* Delimited text
Copy activity copy behavior:
* FlattenHierarchy
* MergeFiles
* PreserveHierarchy
Box 1: Parquet -
For Parquet datasets, the type property of the copy activity source must be set to ParquetSource.
Box 2: PreserveHierarchy -
PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
Incorrect Answers:
✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.
✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it’s an autogenerated file name.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage
The answer seems correct: the data is already stored as Parquet, and the requirement is to perform no transformations, so the answer is right.
You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.
You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region. The solution must minimize costs.
Which type of data redundancy should you use?
A. geo-redundant storage (GRS)
B. read-access geo-redundant storage (RA-GRS)
C. zone-redundant storage (ZRS)
D. locally-redundant storage (LRS)
B is right
Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region becomes unavailable.
You plan to implement an Azure Data Lake Gen 2 storage account.
You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs.
Which type of replication should you use for the storage account?
A. geo-redundant storage (GRS)
B. geo-zone-redundant storage (GZRS)
C. locally-redundant storage (LRS)
D. zone-redundant storage (ZRS)
First, about the Question:
What fails? -> The (complete) DataCenter, not the region and not components inside a DataCenter.
So, what helps us in this situation?
LRS: “...copies your data synchronously three times within a single physical location in the primary region.” The important part here is the SINGLE PHYSICAL LOCATION (meaning inside the same data center, so in our scenario none of the copies would survive).
-> C is wrong.
ZRS: “…copies your data synchronously across three Azure availability zones in the primary region” (meaning, in different Data Centers. In our scenario this would meet the requirements)
-> D is right
GRS/GZRS: like LRS/ZRS but with the data centers in different Azure regions. This works too but is more expensive than ZRS. So ZRS is the right answer.
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
Hard
HOTSPOT -
You have a SQL pool in Azure Synapse.
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.
How should you configure the table? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Distribution:
* Hash
* Replicated
* Round-robin
Indexing:
* Clustered
* Clustered columnstore
* Heap
Partitioning:
* Date
* None
Distribution: Round-Robin
Indexing: Heap
Partitioning: None
Round-robin - this is the simplest distribution model, not great for querying but fast to process
Heap - no brainer when creating staging tables
No partitions - this is a staging table, why add effort to partition, when truncated daily?
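A minimal sketch of such a staging table; the column list is illustrative.
~~~
CREATE TABLE stg.DailyLoad
(
    OrderID   INT NOT NULL,
    OrderDate DATE NOT NULL,
    Amount    DECIMAL(18, 2) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Truncate before each daily load, then bulk load into the heap.
TRUNCATE TABLE stg.DailyLoad;
~~~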
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
~~~
| Name             | Data type    | Nullable |
|------------------|--------------|----------|
| PurchaseKey      | Bigint       | No       |
| DateKey          | Int          | No       |
| SupplierKey      | Int          | No       |
| StockItemKey     | Int          | No       |
| PurchaseOrderID  | Int          | No       |
| OrderedQuantity  | Int          | Yes      |
| OrderedOuters    | Int          | No       |
| ReceivedOuters   | Int          | No       |
| Package          | Nvarchar(50) | No       |
| IsOrderFinalized | Bit          | No       |
| LineageKey       | Int          | No       |
~~~
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101 AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey, IsOrderFinalized
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on IsOrderFinalized
Correct Answer: B
Hash-distributed tables improve query performance on large fact tables.
To balance the parallel processing, select a distribution column that:
✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values.
✑ Does not have NULLs, or has only a few NULLs.
✑ Is not a date column.
Incorrect Answers:
C: Round-robin tables are useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
Is it hash-distributed on PurchaseKey and not on IsOrderFinalized because IsOrderFinalized yields far fewer distinct values (rows contain only yes/no values) compared to PurchaseKey?
HOTSPOT -
From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video plays.
The data contains the following columns.
~~~
| Name               | Sample value        |
|--------------------|---------------------|
| EventCategory      | Videos              |
| EventAction        | Play                |
| EventLabel         | Contoso Promotional |
| ChannelGrouping    | Social              |
| TotalEvents        | 150                 |
| UniqueEvents       | 120                 |
| SessionsWithEvents | 99                  |
| Date               | 15 Jan 2021         |
~~~
You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension.
To which table should you add each column? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
EventCategory:
* DimChannel
* DimDate
* DimEvent
* FactEvents
ChannelGrouping:
* DimChannel
* DimDate
* DimEvent
* FactEvents
TotalEvents:
* DimChannel
* DimDate
* DimEvent
* FactEvents
Box 1: DimEvent -
Box 2: DimChannel -
Box 3: FactEvents -
Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc
Reference:
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You convert the files to compressed delimited text files.
Does this meet the goal?
A. Yes
B. No
The answer is A
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
Compression not only helps reduce the size or space occupied by a file in storage, but also increases the speed of file movement during transfer.
Hard
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You copy the files to a table that has a columnstore index.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
From the documentation, loads to heap tables are faster than loads to indexed tables. So it is better to use a heap table than a columnstore index table in this case.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is more than 1 MB.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
No, rows need to be less than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.
You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily.
You need to implement a solution to make the dataset available for the reports. The solution must minimize query times.
What should you implement?
A. an ordered clustered columnstore index
B. a materialized view
C. result set caching
D. a replicated table
only time a materialized view appears in this question set
B is correct.
Materialized view and result set caching
These two features in dedicated SQL pool are used for query performance tuning. Result set caching is used for getting high concurrency and fast response from repetitive queries against static data.
To use the cached result, the form of the cache requesting query must match with the query that produced the cache. In addition, the cached result must apply to the entire query.
Materialized views allow data changes in the base tables. Data in materialized views can be applied to a piece of a query. This support allows the same materialized views to be used by different queries that share some computation for faster performance.
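A hedged sketch of such a materialized view; the table and column names are illustrative, not from the question. The pre-computed aggregation is stored and maintained automatically, so the daily reports only apply their WHERE filters on top of it.
~~~
CREATE MATERIALIZED VIEW dbo.mvInventorySummary
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT ProductKey,
       WarehouseKey,
       SUM(QuantityOnHand) AS TotalOnHand,
       COUNT_BIG(*)        AS RowCnt
FROM dbo.FactInventory
GROUP BY ProductKey, WarehouseKey;
~~~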
You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.
Which format should you use for the tables in DB1?
A. CSV
B. ORC
C. JSON
D. Parquet
Correct Answer: D
Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each database existing in serverless Apache Spark pools.
For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool database.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables
So both A and D would be synchronized, but Parquet is faster, so D.
HARD HARD HARD
You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java.
Which service should you recommend using to process the streaming data?
A. Azure Event Hubs
B. Azure Data Factory
C. Azure Stream Analytics
D. Azure Databricks
Correct Answer: D
Azure Databricks, as the question is clearly asking for Java programming support.
Reference:
[SEE SITE/REFERENCE FOR CONTEXT]
https://www.examtopics.com/exams/microsoft/dp-203/view/3/
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour.
File sizes range from 4 KB to 5 GB.
You need to ensure that the files stored in the container are optimized for batch processing.
What should you do?
A. Convert the files to JSON
B. Convert the files to Avro
C. Compress the files
D. Merge the files
Debated quite heavily (see site). Go with the following:
Selected Answer: D
D. Merge the files
To optimize the files stored in the Azure Data Lake Storage Gen2 container for batch processing, you should merge the files. Merging smaller files into larger files is a common optimization technique in data processing scenarios.
Having a large number of small files can introduce overhead in terms of file management, metadata processing, and data scanning. By merging the smaller files into larger files, you can reduce this overhead and improve the efficiency of batch processing operations.
Merging the files is especially beneficial when dealing with varying file sizes, as it helps to create a more balanced distribution of data across the files and reduces the impact of small files on processing performance.
Therefore, in this scenario, merging the files would be the recommended approach to optimize the files for batch processing.
Correct Answer: B
Avro supports batch and is very relevant for streaming.
Note: Avro is a framework developed within Apache’s Hadoop project. It is a row-based storage format that is widely used for serialization. Avro stores its schema in JSON format, making it easy to read and interpret by any program. The data itself is stored in binary format, making it compact and efficient.
Reference:
https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/
You cannot merge the files if you don’t know how many files exist in ADLS Gen2. In this case, you could easily create a file larger than 100 GB in size and decrease performance, so B is the correct answer: convert to Avro.
29
HOTSPOT -
You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the following exhibit.
~~~
{
  "rules": [
    {
      "enabled": true,
      "name": "contosorule",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "version": {
            "delete": {
              "daysAfterCreationGreaterThan": 60
            }
          },
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            }
          }
        },
        "filters": {
          "blobTypes": [
            "blockBlob"
          ],
          "prefixMatch": [
            "container1/contoso"
          ]
        }
      }
    }
  ]
}
~~~
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
The files are [answer choice] after 30 days:
* deleted from the container
* moved to archive storage
* moved to cool storage
* moved to hot storage
The storage policy applies to [answer choice]:
* container1/contoso.csv
* container1/docs/contoso.json
* container1/mycontoso/contoso.csv
~~~
Box 1: moved to cool storage -
The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage. It supports blobs currently at the hot tier.
Box 2: container1/contoso.csv -
As defined by prefixMatch.
prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-sensitive prefixes. A prefix string must start with a container name.
Reference:
https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managementpolicybaseblob.tiertocool
You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:
✑ TransactionType: 40 million rows per transaction type
✑ CustomerSegment: 4 million per customer segment
✑ TransactionMonth: 65 million rows per month
✑ AccountType: 500 million per account type
You have the following query requirements:
✑ Analysts will most commonly analyze transactions for a given month.
✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type
You need to recommend a partition strategy for the table to minimize query times.
On which column should you recommend partitioning the table?
A. CustomerSegment
B. AccountType
C. TransactionType
D. TransactionMonth
Correct Answer: D
For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases.
Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
Select D, because analysts will most commonly analyze transactions for a given month.
HOTSPOT -
You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following table.
You do not expect that the logs will be accessed during the retention periods.
You need to recommend a solution for account1 that meets the following requirements:
✑ Automatically deletes the logs at the end of each retention period
✑ Minimizes storage costs
What should you include in the recommendation? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
To minimize storage costs:
* Store the infrastructure logs and the application logs in the Archive access tier
* Store the infrastructure logs and the application logs in the Cool access tier
* Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier
To delete logs automatically:
* Azure Data Factory pipelines
* Azure Blob storage lifecycle management rules
* Immutable Azure Blob storage time-based retention policies
Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier
“Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive tier and then deleted or moved to the Hot tier after 45 days, you’ll be charged an early deletion fee equivalent to 135 (180 minus 45) days of storing that blob in the Archive tier.”
For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should be stored for a minimum of 30 days. The cool tier has lower storage costs and higher access costs compared to the hot tier.
For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of hours.
Data in the archive tier should be stored for a minimum of 180 days.
Box 2: Azure Blob storage lifecycle management rules
Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the desired access tier when your specified conditions are met. You can also use lifecycle management to expire data at the end of its life.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
DISCUSSION
“Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive tier and then deleted or moved to the Hot tier after 45 days, you’ll be charged an early deletion fee equivalent to 135 (180 minus 45) days of storing that blob in the Archive tier.” <- from the sourced link.
This explains why we have to use two different access tiers rather than both as archive.
You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.
What should you recommend?
A. JSON
B. Parquet
C. CSV
D. Avro
Correct Answer: B
Need Parquet to support both Databricks and PolyBase.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql
Avro schema definitions are JSON records. PolyBase does not support JSON, so why would it support Avro? A CSV does not contain the schema, since everything is treated as a string. So only Parquet is left to choose.
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions.
You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times.
What should you do?
A. Insert the data from stg.Sales into dbo.Sales.
B. Switch the first partition from dbo.Sales to stg.Sales.
C. Switch the first partition from stg.Sales to dbo.Sales.
D. Update dbo.Sales from stg.Sales.
DEBATED
* This must be C, since the need is to overwrite dbo.Sales with the content of stg.Sales.
SWITCH source TO target
* This is quite a weird situation because according to Microsoft documentation: “When reassigning a table’s data as a partition to an already-existing partitioned table, or switching a partition from one partitioned table to another, the target partition must exist and it MUST BE EMPTY.” (https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-table-transact-sql?view=azure-sqldw-latest&preserve-view=true#switch–partition-source_partition_number_expression–to–schema_name–target_table–partition-target_partition_number_expression-) Therefore none of the options would be possible if considering that both tables are not empty on that partition. Then I have no idea what would be the correct answer, although I answered C.
Exam Topics pick
Correct Answer: B 🗳️
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly. Then you can switch out the partition with data for an empty partition from another table
Note: Syntax:
SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ]
Switches a block of data in one of the following ways:
✑ Reassigns all data of a table as a partition to an already-existing partitioned table.
✑ Switches a partition from one partitioned table to another.
✑ Reassigns all data in one partition of a partitioned table to an existing non-partitioned table.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
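For context on the syntax being debated, a hedged sketch of a source-to-target switch; the partition number is assumed, and TRUNCATE_TARGET, where supported in dedicated SQL pools, empties the target partition as part of the switch so the operation stays metadata-only.
~~~
ALTER TABLE stg.Sales SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON);
~~~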
Hard
You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.
You plan to keep a record of changes to the available fields.
The supplier data contains the following columns.
~~~
| Name | Description |
|------|-------------|
| SupplierSystemID | Unique supplier ID in an enterprise resource planning (ERP) system |
| SupplierName | Name of the supplier company |
| SupplierAddress1 | Address of the supplier company |
| SupplierAddress2 | Second address of the supplier company (if applicable) |
| SupplierCity | City of the supplier company |
| SupplierStateProvince | State or province of the supplier company |
| SupplierCountry | Country of the supplier company |
| SupplierPostalCode | Postal code of the supplier company |
| SupplierDescription | Free-text description of the supplier company |
| SupplierCategory | Category of goods provided by the supplier company |
~~~
Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. surrogate primary key
B. effective start date
C. business key
D. last modified date
E. effective end date
F. foreign key
DEBATED but pretty confident in following
The answer is ABE. A type 2 SCD requires a surrogate key to uniquely identify each record when versioning.
See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types under SCD Type 2 “ the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member.”
A business key is already part of this table - SupplierSystemID. The column is derived from the source data.
Exam Topics Answer
Correct Answer: BCE
C: The Slowly Changing Dimension transformation requires at least one business key column.
BE: Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output.
Reference:
https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation
hard
HOTSPOT -
You have a Microsoft SQL Server database that uses a third normal form schema.
You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool
You need to design the dimension tables. The solution must optimize read operations.
What should you include in the solution? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Transform data for the dimension tables by:
- Maintaining to a third normal form
- Normalizing to a fourth normal form
- Denormalizing to a second normal form
For the primary key columns in the dimension tables, use:
- New IDENTITY columns
- A new computed column
- The business key column from the source system
Box 1: Denormalize to a second normal form
Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join of higher normal form relations as a base relation.
Denormalization increases the performance in data retrieval at cost of bringing update anomalies to a database.
Box 2: New identity columns -
The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique over time.
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.
Reference:
https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/ https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
HOTSPOT -
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.
How should you complete the code?
To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
df.write
* .bucketBy
* .partitionBy
* .range
* .sortBy
- ("*")
- ("StoreID", "Hour")
- ("StoreID", "Year", "Month", "Day", "Hour")
.mode("append")
* .csv("/Purchases")
* .json("/Purchases")
* .parquet("/Purchases")
* .saveAsTable("/Purchases")
Box 1: partitionBy -
We should overwrite at the partition level.
Example:
df.write.partitionBy("y", "m", "d")
  .mode(SaveMode.Append)
  .parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("StoreID", "Year", "Month", "Day", "Hour")
If partitioned by StoreID and Hour only, the same hours from different days would go to the same partition, which would be inefficient.
Box 3: parquet(“/Purchases”)
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data
hard hard
You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following specifications:
✑ Contain sales data for 20,000 products.
✑ Use hash distribution on a column named ProductID.
✑ Contain 2.4 billion records for the years 2019 and 2020.
Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?
A. 40
B. 240
C. 400
D. 2,400
Correct Answer: A
Each partition should have around 1 million records per distribution, and dedicated SQL pools already split every table across 60 distributions.
We have the formula: Records/(Partitions*60)= 1 million
Partitions= Records/(1 million * 60)
Partitions= 2.4 x 1,000,000,000/(1,000,000 * 60) = 40
Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows. Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000 partitions.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
hard
HOTSPOT -
You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
You create a table by using the Transact-SQL statement shown in the following exhibit.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:
DimProduct is a [answer choice] slowly changing dimension (SCD).
* Type 0
* Type 1
* Type 2
The ProductKey column is [answer choice].
* a surrogate key
* a business key
* an audit column
Type 2, because there are start and end date columns, and ProductKey is a surrogate key; ProductNumber seems to be the business key.
ProductKey is a surrogate key, as it is an identity column.
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
~~~
| Name | Data type | Nullable |
|------------------|--------------|----------|
| PurchaseKey | Bigint | No |
| DateKey | Int | No |
| SupplierKey | Int | No |
| StockItemKey | Int | No |
| PurchaseOrderID | Int | Yes |
| OrderedQuantity | Int | No |
| OrderedOuters | Int | No |
| ReceivedOuters | Int | No |
| Package | Nvarchar(50)| No |
| IsOrderFinalized | Bit | No |
| LineageKey | Int | No |
~~~
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT SupplierKey, StockItemKey, COUNT(*) FROM FactPurchase WHERE DateKey >= 20210101 AND DateKey <= 20210131 GROUP BY SupplierKey, StockItemKey
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on DateKey
Correct Answer: B
Hash-distributed tables improve query performance on large fact tables, and are the focus of this article. Round-robin tables are useful for improving loading speed.
Incorrect:
Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
Lots of discussion on this one
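For orientation, a minimal T-SQL sketch of the table using the marked answer (hash-distributed on PurchaseKey). The clustered columnstore index is an assumption and not part of the question:
~~~
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      BIGINT       NOT NULL,
    DateKey          INT          NOT NULL,
    SupplierKey      INT          NOT NULL,
    StockItemKey     INT          NOT NULL,
    PurchaseOrderID  INT          NULL,
    OrderedQuantity  INT          NOT NULL,
    OrderedOuters    INT          NOT NULL,
    ReceivedOuters   INT          NOT NULL,
    Package          NVARCHAR(50) NOT NULL,
    IsOrderFinalized BIT          NOT NULL,
    LineageKey       INT          NOT NULL
)
WITH (
    DISTRIBUTION = HASH([PurchaseKey]),  -- marked answer: avoid DateKey, which would put each date in one distribution
    CLUSTERED COLUMNSTORE INDEX          -- assumption: the usual choice for large fact tables
);
~~~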
You are implementing a batch dataset in the Parquet format.
Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool.
You need to minimize storage costs for the solution.
What should you do?
A. Use Snappy compression for the files.
B. Use OPENROWSET to query the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files.
D. Store all data as string in the Parquet files.
Answer should be A, because this talks about minimizing storage costs, not querying costs
I'd go with A.
Lots of debate on this one.
HARD hard link
DRAG DROP -
You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2 account from an Azure Synapse Analytics serverless SQL pool.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.
Select and Place:
* Create an external file format object * Create an external data source * Create a query that uses Create Table as Select * Create a table * Create an external table
Step 1: Create an external data source
You can create external tables in Synapse SQL pools via the following steps:
1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
Step 2: Create an external file format object
Creating an external file format is a prerequisite for creating an external table.
3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
Step 3: Create an external table
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
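A minimal T-SQL sketch of the three steps for a serverless SQL pool; the storage URL, object names, and column list are hypothetical, and a CREDENTIAL clause would be added if the container is not public:
~~~
-- Step 1: external data source that references the Azure Data Lake Storage Gen2 account
CREATE EXTERNAL DATA SOURCE files_source
WITH (LOCATION = 'https://<account>.dfs.core.windows.net/<container>');

-- Step 2: external file format that describes the files
CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);

-- Step 3: external table on top of the files placed on the data source
CREATE EXTERNAL TABLE dbo.ExternalFiles
(
    Id   INT,
    Name NVARCHAR(100)
)
WITH (
    LOCATION = 'folder1/',
    DATA_SOURCE = files_source,
    FILE_FORMAT = parquet_format
);
~~~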
You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and employee transactions.
From a source system, you have a flat extract that has the following fields:
✑ EmployeeID
✑ FirstName
✑ LastName
✑ Recipient
✑ GrossAmount
✑ TransactionID
✑ GovernmentID
✑ NetAmountPaid
✑ TransactionDate
You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.
Which two tables should you create? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. a dimension table for Transaction
B. a dimension table for EmployeeTransaction
C. a dimension table for Employee
D. a fact table for Employee
E. a fact table for Transaction
C. a dimension table for Employee
E. a fact table for Transaction
C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer’s name and address are stored in a dimension table and updated only when the customer’s profile changes. To minimize the size of a large fact table, the customer’s name and address don’t need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query can join the two tables to associate a customer’s profile and transactions.
E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool. For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview
You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the history of the data by adding new rows as the data changes.
Which type of slowly changing dimension (SCD) should you use?
A. Type 0
B. Type 1
C. Type 2
D. Type 3
Correct Answer: C
A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Incorrect Answers:
B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.
D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
DRAG DROP -
You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row followed by a properly formatted carriage return (\r) and line feed (\n).
You are implementing a pattern that batch loads the files daily into a dedicated SQL pool in Azure Synapse Analytics by using PolyBase.
You need to skip the header row when you import the files into the data warehouse. Before building the loading pattern, you need to prepare the required database objects in Azure Synapse Analytics.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
NOTE: Each correct selection is worth one point
Select and Place:
* Create a database scoped credential that uses Azure Active Directory Application and a Service Principal Key * Create an external data source that uses the abfs location * Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages * Create an external file format and set the First Row option
go with chat
STEP 1: Create a database scoped credential that uses Azure Active Directory Application and a Service Principal Key
Step 2: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 3: Create an external file format and set the First_Row option.
DEBATED
ExamTopics answer
Step 1: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 2: Create an external file format and set the First_Row option.
Create External File Format.
Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
To use PolyBase, you must create external tables to reference your external data.
Use reject options.
Note: REJECT options don’t apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they’re specified here so that the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to import before it stops the import.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-t-sql-objects https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-as-select-transact-sql
hard GO TO LINK FOR THIS
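As a rough sketch of the objects from the first ("go with chat") sequence; every name, the tenant and client IDs, and the key are placeholders, and a database master key is assumed to exist:
~~~
-- 1. Database scoped credential that uses an Azure Active Directory application and a service principal key
CREATE DATABASE SCOPED CREDENTIAL adls_credential
WITH IDENTITY = '<client-id>@https://login.microsoftonline.com/<tenant-id>/oauth2/token',
     SECRET = '<service-principal-key>';

-- 2. External data source that uses the abfss location
CREATE EXTERNAL DATA SOURCE adls_source
WITH (
    LOCATION = 'abfss://<container>@<account>.dfs.core.windows.net',
    CREDENTIAL = adls_credential,
    TYPE = HADOOP
);

-- 3. External file format that sets the First_Row option to skip the header
CREATE EXTERNAL FILE FORMAT csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);
~~~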
HOTSPOT -
You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020.
You need to ensure that the table meets the following requirements:
✑ Minimizes the processing time to delete data that is older than 10 years
✑ Minimizes the I/O for queries that use year-to-date values
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE TABLE [dbo].[FactTransaction] ( [TransactionTypeID] int NOT NULL, [TransactionDateID] int NOT NULL, [CustomerID] int NOT NULL, [RecipientID] int NOT NULL, [Amount] money NOT NULL ) WITH ( XXXXXXXXXXXXXXXXXXXX ( YYYYYYYYYYYYYYYYYYYY RANGE RIGHT FOR VALUES (20200101, 20200201, 20200301, 20200401, 20200501, 20200601)))
Box 1: PARTITION -
RANGE RIGHT FOR VALUES is used with PARTITION.
Box 2: [TransactionDateID]
Partition on the date column.
Example: Creating a RANGE RIGHT partition function on a datetime column
The following partition function partitions a table or index into 12 partitions, one for each month of a year’s worth of values in a datetime column.
CREATE PARTITION FUNCTION [myDateRangePF1] (datetime)
AS RANGE RIGHT FOR VALUES ('20030201', '20030301', '20030401',
'20030501', '20030601', '20030701', '20030801',
'20030901', '20031001', '20031101', '20031201');
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql
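A sketch of the completed statement with the two boxes filled in; the clustered columnstore index is shown explicitly even though it is the default:
~~~
CREATE TABLE [dbo].[FactTransaction]
(
    [TransactionTypeID] int   NOT NULL,
    [TransactionDateID] int   NOT NULL,
    [CustomerID]        int   NOT NULL,
    [RecipientID]       int   NOT NULL,
    [Amount]            money NOT NULL
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,   -- the default, stated explicitly; minimizes I/O for year-to-date scans
    PARTITION ( [TransactionDateID] RANGE RIGHT FOR VALUES   -- Box 1: PARTITION, Box 2: the date key
        (20200101, 20200201, 20200301, 20200401, 20200501, 20200601) )
);
~~~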
You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics serverless SQL pool.
You execute the Transact-SQL query shown in the following exhibit.
SELECT payment_type, SUM(fare_amount) AS fare_total FROM OPENROWSET ( BULK 'csv/busfare/tripdata_2020*.csv', DATA_SOURCE = 'BusData', FORMAT = 'CSV', PARSER_VERSION = '2.0', FIRSTROW = 2 ) WITH ( payment_type INT 10, fare_amount FLOAT 11 ) AS nyc GROUP BY payment_type ORDER BY payment_type;
What do the query results include?
A. Only CSV files in the tripdata_2020 subfolder.
B. All files that have file names that begin with "tripdata_2020".
C. All CSV files that have file names that contain "tripdata_2020".
D. Only CSV files that have file names that begin with "tripdata_2020".
Correct Answer: D
hard link
DRAG DROP -
You use PySpark in Azure Databricks to parse the following JSON input.
{ "persons": [ { "name":"Keith", "age":30, "dogs":["Fido", "Fluffy"] { "name":"Donna", "age":46, "dogs": ["Spot"] }
You need to output the data in the following tabular format.
| owner | age | dog    |
|-------|-----|--------|
| Keith | 30  | Fido   |
| Keith | 30  | Fluffy |
| Donna | 46  | Spot   |
How should you complete the PySpark code? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* alias
* array_union
* createDataFrame
* explode
* select
* translate
dbutils.fs.put("/tmp/source.json", source_json, True)
source_df = spark.read.option("multiline", "true").json("/tmp/source.json")
persons = source_df.XXXXXXXX(YYYYYYYYYY("persons").alias("persons"))
persons_dogs = persons.select(col("persons.name").alias("owner"), col("persons.age").alias("age"), explode("persons.dogs").ZZZZZZZZZZ("dog"))
display(persons_dogs)
Box 1: select -
Box 2: explode -
Box 3: alias -
pyspark.sql.Column.alias returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
Reference:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html https://docs.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/explode
HOTSPOT -
You are designing an application that will store petabytes of medical imaging data.
When the data is first created, the data will be accessed frequently during the first week. After one month, the data must be accessible within 30 seconds, but files will be accessed infrequently. After one year, the data will be accessed infrequently but must be accessible within five minutes.
You need to select a storage strategy for the data. The solution must minimize costs.
Which storage tier should you use for each time frame? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
First week:
* Archive
* Cool
* Hot
After one month:
* Archive
* Cool
* Hot
After one year:
* Archive
* Cool
* Hot
~~~
First Week: Hot
After One Month: Cool
After one year: Cool
hard
You have an Azure Synapse Analytics Apache Spark pool named Pool1.
You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure and data types vary by file.
You need to load the files into the tables. The solution must maintain the source data types.
What should you do?
A. Use a Conditional Split transformation in an Azure Synapse data flow.
B. Use a Get Metadata activity in Azure Data Factory.
C. Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless SQL pool.
D. Load the data by using PySpark.
Should be D, it’s about Apache Spark pool, not serverless SQL pool.
You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an all-purpose cluster named cluster1.
You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs.
What should you do first?
A. Configure a global init script for workspace1.
B. Create a cluster policy in workspace1.
C. Upgrade workspace1 to the Premium pricing tier.
D. Create a pool in workspace1.
Answer D is correct. Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances.
You can use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly.
Databricks Pools, a managed cache of virtual machine instances that enables clusters to start and scale 4 times faster.
Reference:
https://databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html
Hard
HOTSPOT -
You are building an Azure Stream Analytics job that queries reference data from a product catalog file. The file is updated daily.
The reference data input details for the file are shown in the Input exhibit. (Click the Input tab.)
YOU'LL HAVE TO GO TO THE WEBSITE FOR THIS IMAGE
You need to configure the Stream Analytics job to pick up the new reference data.
What should you configure? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Path pattern:
* {date}/product.csv
* {date}/{time}/product.csv
* product.csv
* */product.csv
Date format:
* MM/DD/YYYY
* YYYY/MM/DD
* YYYY-DD-MM
* YYYY-MM-DD
Box 1 = {date}/product.csv, because the reference data is loaded on a daily basis, so the path pattern needs only the date, not the time.
Box 2 is a straightforward answer: YYYY-MM-DD.
THE OFFICIAL ANSWER TO THIS IS NOT CLEAR
HOTSPOT -
You have the following Azure Stream Analytics query.
WITH step1 AS (SELECT * FROM input1 PARTITION BY StateID INTO 10), step2 AS (SELECT * FROM input2 PARTITION BY StateID INTO 10) SELECT * INTO output FROM step1 PARTITION BY StateID UNION SELECT * INTO output FROM step2 PARTITION BY StateID
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
Hot Area:
The query combines two streams of partitioned data. YES/NO The stream scheme key and count must match the output scheme. YES/NO Providing 60 streaming units will optimize the performance of the query. YES/NO
DEBATED HEAVILY
False (60/40), True (reasonably confident), False (reasonably confident).
https://learn.microsoft.com/en-us/azure/stream-analytics/repartition
The first is False, because this:
“The following example query joins two streams of repartitioned data.”
It’s extracted from the link above, and it’s pointing to our query! Repartitioned and not partitioned.
Second is True, it’s explicitly written
The output scheme should match the stream scheme key and count so that each substream can be flushed independently.
Third is False,
“In general, six SUs are needed for each partition.”
In the example we have 10 partitions for step1 and 10 for step2, so it should be 120 SUs and not 60.
hard
HOTSPOT -
You are building a database in an Azure Synapse Analytics serverless SQL pool.
You have data stored in Parquet files in an Azure Data Lake Storage Gen2 container.
Records are structured as shown in the following sample.
{ "id": 123, "address_housenumber": "19c", "address_line": "Memory Lane", "applicant1_name": "Jane", "applicant2_name": "Dev" }
The records contain two applicants at most.
You need to build a table that includes only the address fields.
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
SEE IMG ON SITE
XXXXXXXX applications *** CREATE EXTERNAL TABLE * CREATE TABLE * CREATE VIEW *** WITH ( LOCATION = 'applications/', DATA_SOURCE = applications_ds, FILE_FORMAT = applications_file_format ) AS SELECT id, [address_housenumber] AS addresshousenumber, [address_line1] AS addressline1 FROM XXXXXXXX (BULK 'https://contoso1.dfs.core.windows.net/applications/year=*/*.parquet', .... *** CROSS APPLY * OPENJSON * OPENROWSET *** FORMAT = 'PARQUET') AS [r] GO
Box 1: CREATE EXTERNAL TABLE -
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or serverless SQL pool.
Syntax:
CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }
( <column_definition> [ ,...n ] )
WITH (
LOCATION = 'folder_or_filepath',
DATA_SOURCE = external_data_source_name,
FILE_FORMAT = external_file_format_name )
Box 2. OPENROWSET -
When using serverless SQL pool, CETAS is used to create an external table and export query results to Azure Storage Blob or Azure Data Lake Storage Gen2.
Example:
AS -
SELECT decennialTime, stateName, SUM(population) AS population
FROM -
OPENROWSET(BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=*/*.parquet',
FORMAT = 'PARQUET') AS [r]
GROUP BY decennialTime, stateName
GO -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
Hard
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage Gen2 account named Account1.
You plan to access the files in Account1 by using an external table.
You need to create a data source in Pool1 that you can reference when you create the external table.
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE EXTERNAL DATA SOURCE source1 WITH ( LOCATION = 'https://account1.[ blob, dfs, table ].core.windows.net', [ PUSHDOWN = ON, TYPE = BLOB_STORAGE, TYPE = HADOOP ] )
Box 1: DEBATED; go for dfs, I think.
Box 2: TYPE = HADOOP
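For reference, the statement completed with the options discussed above (dfs endpoint, TYPE = HADOOP); any credential clause is omitted here:
~~~
CREATE EXTERNAL DATA SOURCE source1
WITH (
    LOCATION = 'https://account1.dfs.core.windows.net',
    TYPE = HADOOP
);
~~~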
You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool named Pool1.
You need to store data in storage1. The data will be read by Pool1. The solution must meet the following requirements:
* Enable Pool1 to skip columns and rows that are unnecessary in a query.
* Automatically create column statistics.
* Minimize the size of files.
Which type of file should you use?
A. JSON
B. Parquet
C. Avro
D. CSV
Correct Answer: B 🗳️
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV files statistics is supported.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics
DRAG DROP -
You plan to create a table in an Azure Synapse Analytics dedicated SQL pool.
Data in the table will be retained for five years. Once a year, data that is older than five years will be deleted.
You need to ensure that the data is distributed evenly across partitions. The solution must minimize the amount of time required to delete old data.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
VALUES
* CustomerKey
* HASH
* ROUND ROBIN
* REPLICATE
* OrderDateKey
* SalesOrderNumber
SEE SITE FOR IMG
DISTRIBUTION = XXXXXXXXXX ([ProductKey])
PARTITION [ XXXXXXXXX] RANGE RIGHT FOR VALUES
Box 1: HASH -
Box 2: OrderDateKey -
In most cases, table partitions are created on a date column.
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data early. Then you can switch out the partition with data for an empty partition from another table.
VERY SIMILAR TO ANOTHER QUESTION
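A sketch of the completed statement plus the metadata-only delete that partitioning enables; the non-key columns, the staging table, and the boundary values are hypothetical:
~~~
CREATE TABLE dbo.FactSales
(
    ProductKey   int   NOT NULL,
    OrderDateKey int   NOT NULL,
    SalesAmount  money NOT NULL
)
WITH (
    DISTRIBUTION = HASH([ProductKey]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
        (20210101, 20220101, 20230101, 20240101, 20250101) )
);

-- Once a year: switch the oldest partition out to an identically structured empty table,
-- which removes the old data as a metadata-only operation instead of a slow DELETE.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_old PARTITION 1;
~~~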
HOTSPOT -
You have an Azure Data Lake Storage Gen2 service.
You need to design a data archiving solution that meets the following requirements:
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Data over five years old:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.
Data over seven years old:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
Box 1: Move to cool storage -
Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.
HOTSPOT -
You plan to create an Azure Data Lake Storage Gen2 account.
You need to recommend a storage solution that meets the following requirements:
✑ Provides the highest degree of data resiliency
✑ Ensures that content remains available for writes if a primary data center fails
What should you include in the recommendation? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Replication mechanism:
* Change feed
* Zone-redundant storage (ZRS)
* Read-access geo-redundant storage (RA-GRS)
* Read-access geo-zone-redundant storage (RA-GZRS)
Failover process:
* Failover initiated by Microsoft
* Failover manually initiated by the customer
* Failover automatically initiated by an Azure Automation job
DEBATED HEAVILY
Zone-redundant storage (ZRS)
'Ensures that content remains available for writes if a primary data center fails.' RA-GRS and RA-GZRS provide read access only after failover. The correct answer is ZRS, as stated in the link below: "Microsoft recommends using ZRS in the primary region for Azure Data Lake Storage Gen2 workloads." https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json
Failover initiated by Microsoft.
Customer-managed account failover is not yet supported in accounts that have a hierarchical namespace (Azure Data Lake Storage Gen2). To learn more, see Blob storage features available in Azure Data Lake Storage Gen2.
You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.
CREATE TABLE [dbo].[DimProduct] ( [ProductKey] [int] IDENTITY (1, 1) NOT NULL, [ProductSourceID] [int] NOT NULL, [ProductName] [nvarchar] (100) NOT NULL, [Color] [nvarchar] (15) NULL, [SellStartDate] [date] NOT NULL, [SellEndDate] [date] NULL, [RowInsertedDateTime] [datetime] NOT NULL, [RowUpdatedDateTime] [datetime] NOT NULL, [ETLAuditID] [int] NOT NULL )
Which two columns should you add to the table? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveEndDate] [datetime] NULL,
B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
C. [ProductCategory] [nvarchar] (100) NOT NULL,
D. [EffectiveStartDate] [datetime] NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,
Correct Answer: BE
A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members.
SEE SITE FOR IMAGE EXPLANATION
https://www.examtopics.com/exams/microsoft/dp-203/view/6/
Reference:
https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/
Hard
DRAG DROP -
You have an Azure subscription.
You plan to build a data warehouse in an Azure Synapse Analytics dedicated SQL pool named pool1 that will contain staging tables and a dimensional model.
Pool1 will contain the following tables.
| Name | Number of rows | Update frequency | Description |
|------|----------------|------------------|-------------|
| Common.Date | 7,300 | New rows inserted yearly | Contains one row per date for the last 20 years. Contains columns named Year, Month, Quarter, and IsWeekend. |
| Marketing.WebSessions | 1,500,500,000 | Hourly inserts and updates | Fact table that contains counts of sessions and page views, including foreign key values for date, channel, device, and medium. |
| Staging.WebSessions | 300,000 | Hourly truncation and inserts | Staging table for web session data, including descriptive fields for channel, device, and medium. |
You need to design the table storage for pool1. The solution must meet the following requirements:
✑ Maximize the performance of data loading operations to Staging.WebSessions.
✑ Minimize query times for reporting queries against the dimensional model.
Which type of table distribution should you use for each table? To answer, drag the appropriate table distribution types to the correct tables. Each table distribution type may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* Hash
* replicated
* round-robin
Common.Date: xxxxxxxx Marketing.WebSessions: xxxxxxxxxx Staging.WebSessions: xxxxxxxxxxxx
Box 1: Replicated -
The best table storage option for a small table is to replicate it across all the Compute nodes.
Box 2: Hash -
Hash-distribution improves query performance on large fact tables.
Box 3: Round-robin -
Round-robin distribution is useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
Replicated (Because its a Dimension table)
Hash (Fact table with High volume of data)
Round-Robin (Staging table)
Hard
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a table named FactInternetSales that will be a large fact table in a dimensional model. FactInternetSales will contain 100 million rows and two columns named SalesAmount and OrderQuantity. Queries executed on FactInternetSales will aggregate the values in SalesAmount and OrderQuantity from the last year for a specific product. The solution must minimize the data size and query execution time.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE TABLE [dbo].[FactInternetSales] ( [ProductKey] int NOT NULL, [OrderDateKey] int NOT NULL, [CustomerKey] int NOT NULL, [PromotionKey] int NOT NULL, [SalesOrderNumber] nvarchar(20) NOT NULL, [OrderQuantity] smallint NOT NULL, [UnitPrice] money NOT NULL, [SalesAmount] money NOT NULL ) WITH ( [ CLUSTERED COLUMNSTORE INDEX / CLUSTERED INDEX ([OrderDateKey]) / HEAP / INDEX on [ProductKey] ], DISTRIBUTION = [ HASH([OrderDateKey]) / HASH([ProductKey]) / REPLICATE / ROUND_ROBIN ] );
Box 1: (CLUSTERED COLUMNSTORE INDEX
CLUSTERED COLUMNSTORE INDEX -
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage. You can also achieve gains up to 10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x) SP1, columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a transactional workload.
Note: Clustered columnstore index
A clustered columnstore index is the physical storage for the entire table.
To reduce fragmentation of the column segments and improve performance, the columnstore index might store some data temporarily into a clustered index called a deltastore and a B-tree list of IDs for deleted rows. The deltastore operations are handled behind the scenes. To return the correct query results, the clustered columnstore index combines query results from both the columnstore and the deltastore.
Box 2: HASH([ProductKey])
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table is designed to achieve high performance for queries on large tables.
Choose a distribution column with data that distributes evenly
Incorrect:
* Not HASH([OrderDateKey]): OrderDateKey is a date column, and you should not distribute on a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
* A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables don’t require data movement. Replication requires extra storage, though, and isn’t practical for large tables.
* A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than the other distribution methods.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
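Putting the two boxes together, a sketch of the completed definition:
~~~
CREATE TABLE [dbo].[FactInternetSales]
(
    [ProductKey]       int          NOT NULL,
    [OrderDateKey]     int          NOT NULL,
    [CustomerKey]      int          NOT NULL,
    [PromotionKey]     int          NOT NULL,
    [SalesOrderNumber] nvarchar(20) NOT NULL,
    [OrderQuantity]    smallint     NOT NULL,
    [UnitPrice]        money        NOT NULL,
    [SalesAmount]      money        NOT NULL
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,        -- Box 1: best compression and scan performance for 100 million rows
    DISTRIBUTION = HASH([ProductKey])   -- Box 2: queries aggregate by product, and ProductKey is not a date
);
~~~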
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1. Table1 contains the following:
✑ One billion rows
✑ A clustered columnstore index
✑ A hash-distributed column named Product Key
✑ A column named Sales Date that is of the date data type and cannot be null
Thirty million rows will be added to Table1 each month.
You need to partition Table1 based on the Sales Date column. The solution must optimize query performance and data loading.
How often should you create a partition?
A. once per month
B. once per year
C. once per day
D. once per week
Debated
Correct Answer: B 🗳️
Remembering that the data is split across 60 distributions and that we need a MINIMUM of 1 million rows per distribution per partition, we have:
A. once per month = 30 million / 60 = 500k records per distribution
B. once per year = 360 million / 60 = 6 million records per distribution
C. once per day = about 1 million / 60 = 16k records per distribution
D. once per week = about 7.5 million / 60 = 125k records per distribution
correct should be B
Need a minimum 1 million rows per distribution. Each table is 60 distributions. 30 millions rows is added each month. Need 2 months to get a minimum of 1 million rows per distribution in a new partition.
Note: When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributions.
Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition
You have an Azure Databricks workspace that contains a Delta Lake dimension table named Table1.
Table1 is a Type 2 slowly changing dimension (SCD) table.
You need to apply updates from a source table to Table1.
Which Apache Spark SQL operation should you use?
A. CREATE
B. UPDATE
C. ALTER
D. MERGE
Correct Answer: D 🗳️
When applying updates to a Type 2 slowly changing dimension (SCD) table in Azure Databricks, the best option is to use the MERGE operation in Apache Spark SQL. This operation allows you to combine the data from the source table with the data in the destination table, and then update or insert the appropriate records. The MERGE operation provides a powerful and flexible way to handle updates for SCD tables, as it can handle both updates and inserts in a single operation. Additionally, this operation can be performed on Delta Lake tables, which can easily handle the ACID transactions needed for handling SCD updates.
The Delta provides the ability to infer the schema for data input which further reduces the effort required in managing the schema changes. The Slowly Changing
Data(SCD) Type 2 records all the changes made to each key in the dimensional table. These operations require updating the existing rows to mark the previous values of the keys as old and then inserting new rows as the latest values. Also, Given a source table with the updates and the target table with dimensional data,
SCD Type 2 can be expressed with the merge.
Example:
// Implementing SCD Type 2 operation using the merge function
customersTable
  .as("customers")
  .merge(
    stagedUpdates.as("staged_updates"),
    "customers.customerId = mergeKey")
  .whenMatched("customers.current = true AND customers.address <> staged_updates.address")
  .updateExpr(Map(
    "current" -> "false",
    "endDate" -> "staged_updates.effectiveDate"))
  .whenNotMatched()
  .insertExpr(Map(
    "customerid" -> "staged_updates.customerId",
    "address" -> "staged_updates.address",
    "current" -> "true",
    "effectiveDate" -> "staged_updates.effectiveDate",
    "endDate" -> "null"))
  .execute()
Reference:
https://www.projectpro.io/recipes/what-is-slowly-changing-data-scd-type-2-operation-delta-table-databricks
You are designing an Azure Data Lake Storage solution that will transform raw JSON files for use in an analytical workload.
You need to recommend a format for the transformed files. The solution must meet the following requirements:
✑ Contain information about the data types of each column in the files.
✑ Support querying a subset of columns in the files.
✑ Support read-heavy analytical workloads.
✑ Minimize the file size.
What should you recommend?
A. JSON
B. CSV
C. Apache Avro
D. Apache Parquet
Correct Answer: D 🗳️
Parquet, an open-source file format for Hadoop, stores nested data structures in a flat columnar format.
Compared to a traditional approach where data is stored in a row-oriented approach, Parquet file format is more efficient in terms of storage and performance.
It is especially good for queries that read particular columns from a "wide" (with many columns) table since only needed columns are read, and IO is minimized.
Incorrect:
Not C:
The Avro format is the ideal candidate for storing data in a data lake landing zone because:
1. Data from the landing zone is usually read as a whole for further processing by downstream systems (the row-based format is more efficient in this case).
2. Downstream systems can easily retrieve table schemas from Avro files (there is no need to store the schemas separately in an external meta store).
3. Any source schema change is easily handled (schema evolution).
Reference:
https://www.clairvoyant.ai/blog/big-data-file-formats
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is less than 1 MB.
Does this meet the goal?
A. Yes
B. No
Correct Answer: A 🗳️
Variations of this question
Solution: You convert the files to compressed delimited text files.
Does this meet the goal? YES
Solution: You copy the files to a table that has a columnstore index.
Does this meet the goal? NO
Solution: You modify the files to ensure that each row is more than 1 MB.
Does this meet the goal? NO
Solution: You modify the files to ensure that each row is less than 1 MB.
Does this meet the goal? YES
PolyBase can only load rows that are smaller than 1 MB.
Note on Polybase Load: PolyBase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language.
Extract, Load, and Transform (ELT)
Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a data warehouse, and then transformed.
The basic steps for implementing a PolyBase ELT for dedicated SQL pool are:
Extract the source data into text files.
Land the data into Azure Blob storage or Azure Data Lake Store.
Prepare the data for loading.
Load the data into dedicated SQL pool staging tables using PolyBase.
Transform the data.
Insert the data into production tables.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-service-capacity-limits https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
You plan to create a dimension table in Azure Synapse Analytics that will be less than 1 GB.
You need to create the table to meet the following requirements:
✑ Provide the fastest query time.
✑ Minimize data movement during queries.
Which type of table should you use?
A. replicated
B. hash distributed
C. heap
D. round-robin
Correct Answer: A 🗳️
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed. 2 GB is not a hard limit.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables
You are designing a dimension table in an Azure Synapse Analytics dedicated SQL pool.
You need to create a surrogate key for the table. The solution must provide the fastest query performance.
What should you use for the surrogate key?
A. a GUID column
B. a sequence object
C. an IDENTITY column
Correct Answer: C 🗳️
Use IDENTITY to create surrogate keys by using a dedicated SQL pool in Azure Synapse Analytics.
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
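A minimal sketch of the pattern, with hypothetical table and column names:
~~~
CREATE TABLE dbo.DimAccount
(
    AccountKey      INT IDENTITY(1, 1) NOT NULL,  -- surrogate key generated by the dedicated SQL pool
    AccountSourceID INT NOT NULL,                 -- business key from the source system
    AccountName     NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
~~~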
You have an Azure Data Lake Storage Gen2 account that contains a container named container1. You have an Azure Synapse Analytics serverless SQL pool that contains a native external table named dbo.Table1. The source data for dbo.Table1 is stored in container1. The folder structure of container1 is shown in the following exhibit.
The external data source is defined by using the following statement.
[SEE WEBSITE FOR IMAGES]
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
When selecting all the rows in dbo.Table1, data from the mydata2.csv file will be returned. YES/NO When selecting all the rows in dbo.Table1, data from the mydata3.csv file will be returned. YES/NO When selecting all the rows in dbo.Table1, data from the _mydata4.csv file will be returned. YES/NO
Yes, No, No.
Both Hadoop and native external tables will skip files with names that begin with an underscore (_) or a period (.).
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a fact table named Table1 that will store sales data from the last three years. The solution must be optimized for the following query operations:
- Show order counts by week.
- Calculate sales totals by region.
- Calculate sales totals by product.
- Find all the orders from a given month.
Which data should you use to partition Table1?
A. product
B. month
C. week
D. region
Selected Answer: B
When designing a fact table in a data warehouse, it is important to consider the types of queries that will be run against it. In this case, the queries that need to be optimized include: show order counts by week, calculate sales totals by region, calculate sales totals by product, and find all the orders from a given month.
Partitioning the table by month would be the best option in this scenario as it would allow for efficient querying of data by month, which is necessary for the query operations described above. For example, it would be easy to find all the orders from a given month by only searching the partition for that specific month.
You are designing the folder structure for an Azure Data Lake Storage Gen2 account.
You identify the following usage patterns:
- Users will query data by using Azure Synapse Analytics serverless SQL pools and Azure Synapse Analytics serverless Apache Spark pools.
- Most queries will include a filter on the current year or week.
- Data will be secured by data source.
You need to recommend a folder structure that meets the following requirements:
- Supports the usage patterns
- Simplifies folder security
- Minimizes query times
Which folder structure should you recommend?
A. \DataSource\SubjectArea\YYYY\WW\FileData_YYYY_MM_DD.parquet
B. \DataSource\SubjectArea\YYYY-WW\FileData_YYYY_MM_DD.parquet
C. DataSource\SubjectArea\WW\YYYY\FileData_YYYY_MM_DD.parquet
D. \YYYY\WW\DataSource\SubjectArea\FileData_YYYY_MM_DD.parquet
E. WW\YYYY\SubjectArea\DataSource\FileData_YYYY_MM_DD.parquet
chat GPT: Based on the given usage patterns and requirements, the recommended folder structure would be option B:
\DataSource\SubjectArea\YYYY-WW\FileData_YYYY_MM_DD.parquet
This structure allows for easy filtering of data by year and week, which aligns with the identified usage pattern of most queries filtering by the current year or week. It also organizes the data by data source and subject area, which simplifies folder security. By using a flat structure, with the data files directly under the year-week folder, query times can be minimized as the data is organized for efficient partition pruning.
Option A is similar but includes an additional level of hierarchy for the year, which is unnecessary given the requirement to filter by year-week. Options C, D, and E do not follow a consistent hierarchy, making it difficult to navigate and locate specific data files.
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a table named table1.
You load 5 TB of data into table1.
You need to ensure that columnstore compression is maximized for table1.
Which statement should you execute?
A. DBCC INDEXDEFRAG (pool1, table1)
B. DBCC DBREINDEX (table1)
C. ALTER INDEX ALL on table1 REORGANIZE
D. ALTER INDEX ALL on table1 REBUILD
D. ALTER INDEX ALL on table1 REBUILD
This statement will rebuild all indexes on table1, which can help to maximize columnstore compression. The other options are not appropriate for this task.
DBCC INDEXDEFRAG (pool1, table1) is for defragmenting the indexes and DBCC DBREINDEX (table1) is for recreating the indexes. ALTER INDEX ALL on table1 REORGANIZE is for reorganizing the indexes.
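For reference, the selected answer as a statement; the schema name is an assumption:
~~~
-- Rebuild all indexes on table1 to get optimal columnstore compression after the 5 TB load
ALTER INDEX ALL ON dbo.table1 REBUILD;
~~~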
You have an Azure Synapse Analytics dedicated SQL pool named pool1.
You plan to implement a star schema in pool and create a new table named DimCustomer by using the following code.
CREATE TABLE dbo. [DimCustomer]( [CustomerKey] int NOT NULL, [CustomerSourceID] [int] NOT NULL, [Title] [nvarchar](8) NULL, [FirstName] [nvarchar](50) NOT NULL, [MiddleName] [nvarchar](50) NULL, [LastName] [nvarchar](50) NOT NULL, [Suffix] [nvarchar](10) NULL, [CompanyName] [nvarchar](128) NULL, [SalesPerson] [nvarchar](256) NULL, [EmailAddress] [nvarchar](50) NULL, [Phone] [nvarchar](25) NULL, [InsertedDate] [datetime] NOT NULL, [ModifiedDate] [datetime] NOT NULL, [HashKey] [varchar](100) NOT NULL, [IsCurrentRow] [bit] NOT NULL ) WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX ); GO
You need to ensure that DimCustomer has the necessary columns to support a Type 2 slowly changing dimension (SCD).
Which two columns should you add? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [HistoricalSalesPerson] [nvarchar] (256) NOT NULL
B. [EffectiveEndDate] [datetime] NOT NULL
C. [PreviousModifiedDate] [datetime] NOT NULL
D. [RowID] [bigint] NOT NULL
E. [EffectiveStartDate] [datetime] NOT NULL
DEBATED
Selected Answer: BE
The insert date and the effective date range are two different things. You can insert data now, but with either future validity or past validity (when correcting errors, for example).
So the options are B and E.
Hard
HOTSPOT
You have an Azure subscription that contains an Azure Synapse Analytics dedicated SQL pool.
You plan to deploy a solution that will analyze sales data and include the following:
- A table named Country that will contain 195 rows
- A table named Sales that will contain 100 million rows
- A query to identify total sales by country and customer from the past 30 days
You need to create the tables. The solution must maximize query performance.
How should you complete the script? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
see site for context
DISTRIBUTION =
* HASH([Customerld])
* HASH([OrderDate])
* REPLICATE
* ROUND_ROBIN
DISTRIBUTION =
* HASH([CountryCode])
* HASH([Countryld])
* REPLICATE
* ROUND_ROBIN
1. Hash(CustomerID)
2. Replicate
It is hash because it is a fact table (you can tell because there is a "total" column being created, which is numerical). Rule of thumb: never hash on a date field, so in this case you would hash on CustomerId. You want the hash column to have as many unique values as possible.
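A sketch of both tables with the selected distributions; the column lists are hypothetical:
~~~
CREATE TABLE dbo.Sales
(
    CustomerId INT   NOT NULL,
    CountryId  INT   NOT NULL,
    OrderDate  DATE  NOT NULL,
    Total      MONEY NOT NULL
)
WITH (DISTRIBUTION = HASH([CustomerId]), CLUSTERED COLUMNSTORE INDEX);  -- 100-million-row fact table

CREATE TABLE dbo.Country
(
    CountryId   INT           NOT NULL,
    CountryCode CHAR(2)       NOT NULL,
    CountryName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX ([CountryId]));  -- 195 rows: copy to every Compute node
~~~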
You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named account1 and an Azure Synapse Analytics workspace named workspace1.
You need to create an external table in a serverless SQL pool in workspace1. The external table will reference CSV files stored in account1. The solution must maximize performance.
How should you configure the external table?
A. Use a native external table and authenticate by using a shared access signature (SAS).
B. Use a native external table and authenticate by using a storage account key.
C. Use an Apache Hadoop external table and authenticate by using a shared access signature (SAS).
D. Use an Apache Hadoop external table and authenticate by using a service principal in Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra.
Correct Answer: A 🗳️
Correct. Serverless SQL pools cannot use Hadoop external tables, only native ones. Storage account key authentication is never a best practice, which leaves A as the only viable answer.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
The other options provided (B, C, and D) are not the recommended configurations for maximizing performance in this scenario. Using a storage account key for authentication (option B) poses a security risk and should be avoided. Apache Hadoop external tables (options C and D) do not provide the same level of performance optimization as native external tables in Azure Synapse Analytics.
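A minimal sketch of the native external data source with SAS authentication; the names and token are placeholders, and a database master key is assumed to exist:
~~~
CREATE DATABASE SCOPED CREDENTIAL sas_credential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token-without-the-leading-question-mark>';

-- Omitting TYPE = HADOOP makes this a native external data source
CREATE EXTERNAL DATA SOURCE account1_source
WITH (
    LOCATION = 'https://account1.dfs.core.windows.net/<container>',
    CREDENTIAL = sas_credential
);
~~~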
HOTSPOT
You have an Azure Synapse Analytics serverless SQL pool that contains a database named db1. The data model for db1 is shown in the following exhibit.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the exhibit.
NOTE: Each correct selection is worth one point.
To convert the data model to a star schema, [answer choice].
* join DimGeography and DimCustomer
* join DimGeography and FactOrders
* union DimGeography and DimCustomer
* union DimGeography and FactOrders
Once the data model is converted into a star schema, there will be [answer choice] tables.
* 4
* 5
* 6
* 7
Correct answer should be join DimGeography and DimCustomer and 5 tables.
You also need to combine ProductLine and Product in order for the schema to be considered a star schema. This would result in 5 remaining tables: DimCustomer (DimCustomer JOIN DimGeography), DimStore, Date, Product (Product JOIN ProductLine) and FactOrders.
You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1.
New files are uploaded daily to storage1.
You need to recommend a solution that configures storage1 as a structured streaming source. The solution must meet the following requirements:
- Incrementally process new files as they are uploaded to storage1.
- Minimize implementation and maintenance effort.
- Minimize the cost of processing millions of files.
- Support schema inference and schema drift.
Which should you include in the recommendation?
A. COPY INTO
B. Azure Data Factory
C. Auto Loader
D. Apache Spark FileStreamSource
Correct Answer: C 🗳️
Bing explains the following:
The best option is C. Auto Loader.
Auto Loader is a feature in Azure Databricks that uses a cloudFiles data source to incrementally and efficiently process new data files as they arrive in Azure Data Lake Storage Gen2. It supports schema inference and schema evolution (drift). It also minimizes implementation and maintenance effort, as it simplifies the ETL pipeline by reducing the complexity of identifying new files for processing.
Other options do not meet the requirements because:
A. COPY INTO: does not incrementally process new files as they are uploaded, which is one of your requirements.
B. Azure Data Factory: does not natively support schema inference and schema drift. The incremental processing of new files would need to be manually implemented, which could increase implementation and maintenance effort.
D. Apache Spark FileStreamSource: requires manual setup and does not natively support schema inference or schema drift. It also may not minimize the cost of processing millions of files as efficiently as Auto Loader.
HARD
You have an Azure subscription that contains the resources shown in the following table.
| Name | Type | Description |
|------|------|-------------|
| storage1 | Azure Blob storage account | Contains publicly accessible TSV files that do NOT have a header row |
| WS1 | Azure Synapse Analytics workspace | Contains a serverless SQL pool |
You need to read the TSV files by using ad-hoc queries and the OPENROWSET function. The solution must assign a name and override the inferred data type of each column.
What should you include in the OPENROWSET function?
A. the WITH clause
B. the ROWSET_OPTIONS bulk option
C. the DATAFILETYPE bulk option
D. the DATA_SOURCE parameter
To read TSV files without a header row by using the OPENROWSET function, and to assign a name and specify the data type for each column, you should use:
A. the WITH clause
The WITH clause is used in the OPENROWSET function to explicitly define the structure of the file by specifying the column names and data types.
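A minimal sketch of such an ad-hoc query; the file path and the column names and types in the WITH clause are hypothetical:
~~~
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://storage1.blob.core.windows.net/<container>/*.tsv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIELDTERMINATOR = '\t',
        FIRSTROW = 1                     -- no header row, so data starts on the first row
     )
     WITH (
        CustomerId   INT,                -- the WITH clause names the columns and overrides the inferred types
        CustomerName VARCHAR(100),
        OrderTotal   DECIMAL(10, 2)
     ) AS [rows];
~~~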
You have an Azure Synapse Analytics dedicated SQL pool.
You plan to create a fact table named Table1 that will contain a clustered columnstore index.
You need to optimize data compression and query performance for Table1.
What is the minimum number of rows that Table1 should contain before you create partitions?
A. 100,000
B. 600,000
C. 1 million
D. 60 million
A clustered columnstore table in a dedicated SQL pool is always spread across 60 distributions, and for the best compression we need at least 1 million rows per distribution in each partition, hence option D: 60 million (1 million x 60 distributions).
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named DimSalesPerson. DimSalesPerson contains the following columns:
- RepSourceID
- SalesRepID
- FirstName
- LastName
- StartDate
- EndDate
- Region
You are developing an Azure Synapse Analytics pipeline that includes a mapping data flow named Dataflow1. Dataflow1 will read sales team data from an external source and use a Type 2 slowly changing dimension (SCD) when loading the data into DimSalesPerson.
You need to update the last name of a salesperson in DimSalesPerson.
Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Update three columns of an existing row.
B. Update two columns of an existing row.
C. Insert an extra row.
D. Update one column of an existing row.
DISPUTED
Correct Answer: CD 🗳️
Selected Answer: BC
1) Insert an extra row with the updated last name and the current date as the StartDate.
2) Update two columns of an existing row: set the EndDate of the previous row for that salesperson to the current date and mark that row as no longer current (for example, by updating an active/current flag column); see the sketch below.
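A minimal T-SQL sketch of the two steps, assuming DimSalesPerson also carries an IsCurrent flag and that RepSourceID is the business key (the flag, the key choice, and all values below are hypothetical; in the scenario the work is done by the mapping data flow):
~~~
-- 1) Close off the existing row: two columns are updated.
UPDATE dbo.DimSalesPerson
SET    EndDate   = GETDATE(),
       IsCurrent = 0                        -- hypothetical current-row flag
WHERE  RepSourceID = 123 AND IsCurrent = 1; -- placeholder business key value

-- 2) Insert an extra row that carries the new last name.
INSERT INTO dbo.DimSalesPerson
    (SalesRepID, RepSourceID, FirstName, LastName, StartDate, EndDate, Region, IsCurrent)
VALUES
    (9001, 123, 'Dana', 'NewLastName', GETDATE(), NULL, 'West', 1); -- placeholder values
~~~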
HOTSPOT
You plan to use an Azure Data Lake Storage Gen2 account to implement a Data Lake development environment that meets the following requirements:
- Read and write access to data must be maintained if an availability zone becomes unavailable.
- Data that was last modified more than two years ago must be deleted automatically.
- Costs must be minimized.
What should you configure? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
For storage redundancy:
* Geo-zone-redundant storage (GZRS)
* Locally-redundant storage (LRS)
* Zone-redundant storage (ZRS)
For data deletion:
* A lifecycle management policy
* Soft delete
* Versioning
Statement 1: For Storage redundancy, you should select ZRS (Zone-redundant storage). This will maintain read and write access to data even if an availability zone becomes unavailable.
Statement 2: For data deletion, you should select A lifecycle management policy. This will allow you to automatically delete data that was last modified more than two years ago
HOTSPOT
You are developing an Azure Synapse Analytics pipeline that will include a mapping data flow named Dataflow1. Dataflow1 will read customer data from an external source and use a Type 1 slowly changing dimension (SCD) when loading the data into a table named DimCustomer in an Azure Synapse Analytics dedicated SQL pool.
You need to ensure that Dataflow1 can perform the following tasks:
- Detect whether the data of a given customer has changed in the DimCustomer table.
- Perform an upsert to the DimCustomer table.
Which type of transformation should you use for each task? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Detect whether the data of a given customer has changed in the DimCustomer table:
* Aggregate
* Derived column
* Surrogate key
Perform an upsert to the DimCustomer table:
* Alter row
* Assert
* Cast
The answer is correct. Check “Exercise - Design and implement a Type 1 slowly changing dimension with mapping data flows”, which describes the implementation of the data flow mentioned in this question.
https://learn.microsoft.com/en-us/training/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/4-exercise-design-implement-type-1-dimension
In the exercise, the ‘Derived column’ transformation is used to add the InsertedDate and ModifiedDate columns; the ModifiedDate column can be used to detect whether the customer data has changed. For the upsert, the ‘Alter row’ transformation is used. The answer is definitely correct.
DRAG DROP
You have an Azure Synapse Analytics serverless SQL pool.
You have an Azure Data Lake Storage account named adls1 that contains a public container named container1. The container1 container contains a folder named folder1.
You need to query the top 100 rows of all the CSV files in folder1.
How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point
VALUES
* BULK
* DATA_SOURCE
* LOCATION
* OPENROWSET
SELECT TOP 100 * FROM XXXXXXXXX ( XXXXXXXXXXX 'https://adls1.dfs.core.windows.net/container1/folder1/*.csv', FORMAT = 'CSV') AS rows
OPENROWSET
BULK
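Filled in, the drag-drop answer reads:
~~~
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://adls1.dfs.core.windows.net/container1/folder1/*.csv',
    FORMAT = 'CSV'
) AS rows;
~~~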
You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.
Which format should you use for the tables in DB1?
A. Parquet
B. ORC
C. JSON
D. HIVE
Selected Answer: A
Parquet is supported by serverless SQL pool
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-parquet-files
You have an Azure Data Lake Storage Gen2 account named storage1.
You plan to implement query acceleration for storage1.
Which two file types support query acceleration? Each correct answer presents a complete solution.
NOTE: Each correct selection is worth one point.
A. JSON
B. Apache Parquet
C. XML
D. CSV
E. Avro
Selected Answer: AD
Query acceleration supports CSV and JSON formatted data as input to each request.
You have an Azure subscription that contains the resources shown in the following table.
| Name | Type | Description |
|------|------|-------------|
| storage1 | Azure Blob storage account | Contains publicly accessible JSON files |
| WS1 | Azure Synapse Analytics workspace | Contains a serverless SQL pool |
You need to read the files in storage1 by using ad-hoc queries and the OPENROWSET function. The solution must ensure that each rowset contains a single JSON record.
To what should you set the FORMAT option of the OPENROWSET function?
A. JSON
B. DELTA
C. PARQUET
D. CSV
For JSON files we use the CSV format: OPENROWSET returns each JSON document as a single text row (FORMAT = 'CSV' with 0x0b terminators), and the JSON functions parse it afterwards.
The chat says D: CSV.
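A hedged sketch of the pattern (the path and the JSON property are placeholders): the 0x0b terminators make OPENROWSET return each JSON document as one row in a single column, which JSON_VALUE can then pick apart.
~~~
SELECT JSON_VALUE(doc, '$.name') AS Name          -- '$.name' is a placeholder property
FROM OPENROWSET(
    BULK 'https://storage1.blob.core.windows.net/data/*.json',  -- placeholder path
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) AS rows;
~~~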
HOTSPOT
You are designing an Azure Data Lake Storage Gen2 container to store data for the human resources (HR) department and the operations department at your company.
You have the following data access requirements:
- After initial processing, the HR department data will be retained for seven years and rarely accessed.
- The operations department data will be accessed frequently for the first six months, and then accessed once per month.
You need to design a data retention solution to meet the access requirements. The solution must minimize storage costs.
What should you include in the storage policy for each department? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
HR:
* Archive storage after one day and delete storage after 2,555 days.
* Archive storage after 2,555 days.
* Cool storage after one day.
* Cool storage after 180 days.
* Cool storage after 180 days and delete storage after 2,555 days.
* Delete after one day.
* Delete after 180 days.
Operations:
* Archive storage after one day and delete storage after 2,555 days.
* Archive storage after 2,555 days.
* Cool storage after one day.
* Cool storage after 180 days.
* Cool storage after 180 days and delete storage after 2,555 days.
* Delete after one day.
* Delete after 180 days.
Archive storage after one day and delete storage after 2,555 days.
Cool storage after 180 days.
The answer for HR depends on the meaning of “rarely” and the duration of “initial processing”. If “rarely” means roughly once a year and initial processing completes within 24 hours, the answer is correct. If “rarely” means weekly access, the archive tier might be the wrong choice.
You have an Azure subscription that contains the Azure Synapse Analytics workspaces shown in the following table.
| Name | Primary storage account |
|------------|-------------------------|
| workspace1 | datalake1 |
| workspace2 | datalake2 |
| workspace3 | datalake1 |
Each workspace must read and write data to datalake1.
Each workspace contains an unused Apache Spark pool.
You plan to configure each Spark pool to share catalog objects that reference datalake1.
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
* The shared catalog objects can be stored in Azure Database for MySQL. YES/NO
* For the Apache Hive Metastore of each workspace, you must configure a linked service that uses user-password authentication. YES/NO
* The users of workspace1 must be assigned the Storage Blob Contributor role for datalake1. YES/NO
The chat is too confusing; using the GPT answer:
No, No, Yes
- The shared catalog objects can be stored in Azure Database for MySQL.
  Azure Synapse Analytics doesn’t support storing shared catalog objects in Azure Database for MySQL. Instead, it uses an Apache Hive Metastore as a common catalog for multiple workspaces.
- For the Apache Hive Metastore of each workspace, you must configure a linked service that uses user-password authentication.
  Azure Synapse Analytics doesn’t require user-password authentication for the Apache Hive Metastore. It typically relies on service principals or managed identities for authentication.
- The users of workspace1 must be assigned the Storage Blob Contributor role for datalake1.
  For workspace1 to read and write data to datalake1, users in workspace1 need adequate permissions. The Storage Blob Contributor role provides the necessary permissions to read and write data to Azure Data Lake Storage Gen2 (datalake1 in this case), so assigning it to the users of workspace1 would be appropriate.
CHAT SAYS
Provided answers are correct:
1. Yes:
Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore) compatible metastore as their catalog. When customers want to persist the Hive catalog metadata outside of the workspace, and share catalog objects with other computational engines outside of the workspace, such as HDInsight and Azure Databricks, they can connect to an external Hive Metastore. Only Azure SQL Database and Azure Database for MySQL are supported as an external Hive Metastore.
2. Yes:
And currently we only support User-Password authentication.
3. No:
“And currently we only support User-Password authentication.” Storage Blob Contributor is an Azure RBAC (role-based access control) role, which is not what the external metastore connection uses, since only user-password authentication is supported.
ref.
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-external-metastore
You have a data warehouse.
You need to implement a slowly changing dimension (SCD) named Product that will include three columns named ProductName, ProductColor, and ProductSize. The solution must meet the following requirements:
- Prevent changes to the values stored in ProductName.
- Retain only the current and the last values in ProductSize.
- Retain all the current and previous values in ProductColor.
Which type of SCD should you implement for each column? To answer, drag the appropriate types to the correct columns. Each type may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Values
* Type 0
* Type 1
* Type 2
* Type 3
ProductName: XXXXXXXX
Color: XXXXXXXX
Size: XXXXXXXX
DISPUTED: go with this
ProductName - Type 0, as no changes are allowed.
ProductColor - Type 2, as Type 2 inserts a new row for every change, so all current and previous values are retained.
ProductSize - Type 3, as Type 3 keeps one column for the current value and one for the previous value, so only those two are retained.
HOTSPOT
You are incrementally loading data into fact tables in an Azure Synapse Analytics dedicated SQL pool.
Each batch of incoming data is staged before being loaded into the fact tables.
You need to ensure that the incoming data is staged as quickly as possible.
How should you configure the staging tables? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Table distribution:
* HASH
* REPLICATE
* ROUND_ROBIN
Table structure:
* Clustered index
* Columnstore index
* Heap
The ROUND_ROBIN distribution spreads the data evenly across all 60 distributions in the SQL pool. This distribution type is suitable for loading data quickly into staging tables because it minimizes data movement during the loading process.
Use a HEAP table: Instead of creating a clustered index on the staging table, it is recommended to create a HEAP table. A HEAP table does not have a clustered index, which eliminates the need for maintaining the index and improves the data loading performance. It allows for faster insert operations.
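A minimal sketch of such a staging table (table and column names are placeholders):
~~~
CREATE TABLE dbo.StageSales
(
    SaleID   INT   NOT NULL,
    SaleDate DATE  NOT NULL,
    Amount   MONEY NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
~~~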
You have an Azure subscription that contains an Azure Synapse Analytics workspace named ws1 and an Azure Cosmos DB database account named Cosmos1. Cosmos1 contains a container named container1 and ws1 contains a serverless SQL pool.
You need to ensure that you can query the data in container1 by using the serverless SQL pool.
Which three actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Enable Azure Synapse Link for Cosmos1.
B. Disable the analytical store for container1.
C. In ws1, create a linked service that references Cosmos1.
D. Enable the analytical store for container1.
E. Disable indexing for container1.
Correct Answer: ACD 🗳️
A. Enable Azure Synapse Link for Cosmos1.
C. In ws1, create a linked service that references Cosmos1.
D. Enable the analytical store for container1.
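Once those three steps are done, the serverless SQL pool can query the analytical store roughly like this (the database name and key are placeholders; a server credential can be used instead of an inline key):
~~~
SELECT TOP 10 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=cosmos1;Database=mydatabase;Key=<account-key>',  -- placeholders
    container1
) AS documents;
~~~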
HOTSPOT
You have an Azure subscription that contains the resources shown in the following table.
| Name | Type | Description |
|------|------|-------------|
| Workspace1 | Azure Synapse Analytics workspace | Contains the Built-in serverless SQL pool |
| Pool1 | Azure Synapse Analytics dedicated SQL pool | Deployed to Workspace1 |
| storage1 | Storage account | Hierarchical namespace enabled |
The storage1 account contains a container named container1. The container1 container contains the following files.
Webdata <root folder>
    Monthly <folder>
    _monthly.csv
    Monthly.csv
    .testdata.csv
    testdata.csv
In Pool1, you run the following script.
CREATE EXTERNAL DATA SOURCE Ds1 WITH ( LOCATION = 'abfss://container1@storage1.dfs.core.windows.net' , CREDENTIAL = credential1, TYPE = HADOOP );
In the Built-in serverless SQL pool, you run the following script.
CREATE EXTERNAL DATA SOURCE Ds2 WITH ( LOCATION = 'https://storage1.blob.core.windows.net/container1/Webdata/', CREDENTIAL = credential2 );
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
An external table that uses Ds1 can read the_monthly.csv file. Yes/No
An external table that uses Ds1 can read the Monthly.csv file. Yes/No
An external table that uses Ds2 can read the .testdata.csv file. Yes/No
No,
Yes,
No
External tables ignore files whose names begin with an underscore (_) or a period (.), so _monthly.csv and .testdata.csv cannot be read.
You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named account1 and a user named User1.
In account1, you create a container named container1. In container1, you create a folder named folder1.
You need to ensure that User1 can list and read all the files in folder1. The solution must use the principle of least privilege.
How should you configure the permissions for each folder? To answer, drag the appropriate permissions to the correct folders. Each permission may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Values
* Execute
* Read
* Read and Write
* None
* Read and Execute
* Write
container1/: XXXXXXXXXXXXX
container1/folder1: XXXXXXXXXX
Box 1: Execute
Box 2: Read & Execute
Execute on container1 allows User1 to traverse into the folder, while Read and Execute on folder1 allow listing the folder and reading its contents.
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control#levels-of-permission
You have an Azure Data Factory pipeline named pipeline1.
You need to execute pipeline1 at 2 AM every day. The solution must ensure that if the trigger for pipeline1 stops, the next pipeline execution will occur at 2 AM, following a restart of the trigger.
Which type of trigger should you create?
A. schedule
B. tumbling
C. storage event
D. custom event
GPT Says
The scenario requires a trigger that ensures the execution of the pipeline at a specific time daily and has the capability to resume or continue executions even after a trigger stoppage or restart. For this purpose:
A. Schedule
A schedule trigger in Azure Data Factory allows you to specify a recurring execution time, such as running a pipeline at 2 AM every day. Additionally, if the trigger stops due to any reason, it will resume its schedule, and the next execution will occur at the defined time (in this case, 2 AM) following a trigger restart.
Some comments claim it might be tumbling.
HOTSPOT
You have an Azure data factory named adf1 that contains a pipeline named ExecProduct. ExecProduct contains a data flow named Product.
The Product data flow contains the following transformations:
- WeeklyData: A source that points to a CSV file in an Azure Data Lake Storage Gen2 account with 20 columns
- ProductColumns: A select transformation that selects from WeeklyData six columns named ProductID, ProductDescr, ProductSubCategory, ProductCategory, ProductStatus, and ProductLastUpdated
- ProductRows: An aggregate transformation
- ProductList: A sink that outputs data to an Azure Synapse Analytics dedicated SQL pool
The Aggregate settings for ProductRows are configured as shown in the following exhibit.
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
There will be six columns in the output of ProductRows. YES/NO
There will always be one output row for each unique value of ProductDescr. YES/NO
There will always be one output row for each unique value of ProductID. YES/NO
GO WITH THE FOLLOWING (the discussion does not explain why):
YES
NO
YES
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-aggregate
You manage an enterprise data warehouse in Azure Synapse Analytics.
Users report slow performance when they run commonly used queries. Users do not report performance changes for infrequently used queries.
You need to monitor resource utilization to determine the source of the performance issues.
Which metric should you monitor?
A. DWU limit
B. Cache hit percentage
C. Local tempdb percentage
D. Data IO percentage
B. Cache hit percentage should be correct, since the slowdown only affects commonly used queries, which should be served from the cache.
HOTSPOT
You have an Azure Synapse Analytics serverless SQL pool.
You have an Apache Parquet file that contains 10 columns.
You need to query data from the file. The solution must return only two columns.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Values
[BULK, DELTA, OPENQUERY, SINGLE_BLOB]
[(Col1 int, Col2 varchar(20)), FILEPATH(2), PARSER_VERSION = ‘2.0’, SINGLE_BLOB]
SELECT * FROM OPENROWSET ([XXXXXXXXXXXX]N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet', FORMAT = 'PARQUET') WITH [YYYYYYYYYYYY] as rows
BULK
(Col1 int, Col2 varchar(20))
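Filled in, the query would be as follows (for Parquet, the names in the WITH clause must match the column names in the file, so Col1 and Col2 stand in for the two real columns):
~~~
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet',
    FORMAT = 'PARQUET'
) WITH (Col1 int, Col2 varchar(20)) AS rows;
~~~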
Hard
You have an Azure Synapse Analytics workspace that contains an Apache Spark pool named SparkPool1. SparkPool1 contains a Delta Lake table named SparkTable1.
You need to recommend a solution that supports Transact-SQL queries against the data referenced by SparkTable1. The solution must ensure that the queries can use partition elimination.
What should you include in the recommendation?
A. a partitioned table in a dedicated SQL pool
B. a partitioned view in a dedicated SQL pool
C. a partitioned index in a dedicated SQL pool
D. a partitioned view in a serverless SQL pool
Selected Answer: D
D is correct.
“The OPENROWSET function is not supported in dedicated SQL pools in Azure Synapse.” so it eliminates A,B and C.
Ref: https://learn.microsoft.com/en-us/sql/t-sql/functions/openrowset-transact-sql?view=sql-server-ver16
Only the partitioned view in the serverless sql pool is correct since “External tables in serverless SQL pools do not support partitioning on Delta Lake format. Use Delta partitioned views instead of tables if you have partitioned Delta Lake data sets.”
Ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-external-tables#delta-tables-on-partitioned-folders
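A minimal sketch of such a partitioned view in the serverless SQL pool (the storage path and partition column are placeholders); filtering on a partition column lets the pool skip the non-matching folders:
~~~
CREATE OR ALTER VIEW dbo.SparkTable1View
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/<container>/path/to/SparkTable1/',
    FORMAT = 'DELTA'
) AS rows;

-- Example of a query that can benefit from partition elimination,
-- assuming the Delta table is partitioned by a column named year:
-- SELECT COUNT(*) FROM dbo.SparkTable1View WHERE year = 2023;
~~~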
HARD HARD
You are designing a sales transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will contain approximately 60 million rows per month and will be partitioned by month. The table will use a clustered column store index and round-robin distribution.
Approximately how many rows will there be for each combination of distribution and partition?
A. 1 million
B. 5 million
C. 20 million
D. 60 million
Correct Answer: A 🗳️
The table is partitioned by month, and each partition is spread across the 60 distributions of the dedicated SQL pool, so 60 million rows per month / 60 distributions = 1 million rows per distribution-partition combination.
You have an Azure Synapse Analytics workspace.
You plan to deploy a lake database by using a database template in Azure Synapse.
Which two elements are included in the template? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. relationships
B. data formats
C. linked services
D. table permissions
E. table definitions
Correct, AE. Only the table definitions and their relationships are included in the template; the rest of the options must be configured separately.
Ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/create-lake-database-from-lake-database-templates
You are implementing a star schema in an Azure Synapse Analytics dedicated SQL pool.
You plan to create a table named DimProduct.
DimProduct must be a Type 3 slowly changing dimension (SCD) table that meets the following requirements:
- The values in two columns named ProductKey and ProductSourceID will remain the same.
- The values in three columns named ProductName, ProductDescription, and Color can change.
You need to add additional columns to complete the following table definition.
CREATE TABLE [dbo].[dimproduct]( [ProductKey] INT NOT NULL, [ProductSourceID] INT NOT NULL, [ProductName] NVARCHAR(100) NOT NULL, [ProductDescription] NVARCHAR(2000) NOT NULL, [Color] NVARCHAR(50) NOT NULL ) WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
Which three columns should you add? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveStartDate] [datetime] NOT NULL
B. [EffectiveEndDate] [datetime] NOT NULL
C. [OriginalProductDescription] NVARCHAR(2000) NOT NULL
D. [IsCurrentRow] [bit] NOT NULL
E. [OriginalColor] NVARCHAR(50) NOT NULL
F. [OriginalProductName] NVARCHAR(100) NULL
Selected Answer: CEF
Correct. The other three options would be needed for an SCD Type 2 table.
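A sketch of the corresponding DDL change (in practice the columns might instead be included in a CTAS rebuild of the table; the options in the question declare two of them NOT NULL, which would also require defaults on a populated table):
~~~
ALTER TABLE [dbo].[dimproduct] ADD
    [OriginalProductName]        NVARCHAR(100)  NULL,
    [OriginalProductDescription] NVARCHAR(2000) NULL,
    [OriginalColor]              NVARCHAR(50)   NULL;
~~~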
HOTSPOT -
You plan to create a real-time monitoring app that alerts users when a device travels more than 200 meters away from a designated location.
You need to design an Azure Stream Analytics job to process the data for the planned app. The solution must minimize the amount of code developed and the number of technologies used.
What should you include in the Stream Analytics job? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Input type:
* Stream
* Reference
Function:
* Aggregate
* Geospatial
* Windowing
The input type for the Stream Analytics job should be Stream, as it will be processing real-time data from devices.
The function to include in the Stream Analytics job should be Geospatial, which allows you to perform calculations on geographic data and make spatial queries, such as determining the distance between two points. This is necessary to determine if a device has traveled more than 200 meters away from a designated location.
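A hedged sketch of such a job query using the built-in geospatial functions (input name, field names, and the designated coordinates are placeholders):
~~~
SELECT
    DeviceId,
    ST_DISTANCE(
        CreatePoint(Latitude, Longitude),    -- current device position
        CreatePoint(47.6062, -122.3321)      -- designated location (placeholder)
    ) AS DistanceInMeters
INTO AlertOutput
FROM DeviceStream TIMESTAMP BY EventTime
WHERE ST_DISTANCE(
          CreatePoint(Latitude, Longitude),
          CreatePoint(47.6062, -122.3321)
      ) > 200;
~~~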
HARD
A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an Azure Stream
Analytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU).
You need to optimize performance for the Azure Stream Analytics job.
Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Implement event ordering.
B. Implement Azure Stream Analytics user-defined functions (UDF).
C. Implement query parallelization by partitioning the data output.
D. Scale the SU count for the job up.
E. Scale the SU count for the job down.
F. Implement query parallelization by partitioning the data input.
Disputed 50/50
Correct Answer: DF 🗳️
D: Scale out the query by allowing the system to process each input partition separately.
F: A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
HOWEVER
The discussion's most-liked answer (61 likes) is CF, so go with this:
C: Implement query parallelization by partitioning the data output.
F: Implement query parallelization by partitioning the data input.
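For example, a fully parallel query reads and writes per Event Hub partition (names are placeholders; PartitionId is the built-in partition column for Event Hub inputs):
~~~
SELECT PartitionId, COUNT(*) AS EventCount
INTO PartitionedOutput
FROM EventHubInput PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1);
~~~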
You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container.
Which resource provider should you enable?
A. Microsoft.Sql
B. Microsoft.Automation
C. Microsoft.EventGrid
D. Microsoft.EventHub
Correct Answer: C 🗳️
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the arrival or deletion of a file in Azure
Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers
You plan to perform batch processing in Azure Databricks once daily.
Which type of Databricks cluster should you use?
A. High Concurrency
B. automated
C. interactive
Correct Answer: B 🗳️
Automated Databricks clusters are the best for jobs and automated batch processing.
Note: Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.
Example: Scheduled batch workloads (data engineers running ETL jobs)
This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform.
The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so on) due to an existing workload (noisy neighbor) on a shared cluster.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/clusters/create https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html#scenario-3-scheduled-batch-workloads-data-engineers-running-etl-jobs
HOTSPOT -
You are processing streaming data from vehicles that pass through a toll booth.
You need to use Azure Stream Analytics to return the license plate, vehicle make, and hour the last vehicle passed during each 10-minute window.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
WITH LastInWindow AS ( SELECT <COUNT / MAX / MIN / TOPONE>(Time) AS LastEventTime FROM Input TIMESTAMP BY Time GROUP BY <HoppingWindow / SessionWindow / SlidingWindow / TumblingWindow>(minute, 10) ) SELECT Input.License_plate, Input.Make, Input.Time FROM Input TIMESTAMP BY Time INNER JOIN LastInWindow ON <DATEADD / DATEDIFF / DATENAME / DATEPART>(minute, Input, LastInWindow) BETWEEN 0 AND 10 AND Input.Time = LastInWindow.LastEventTime
100% correct
Box 1: MAX -
The first step on the query finds the maximum time stamp in 10-minute windows, that is the time stamp of the last event for that window. The second step joins the results of the first query with the original stream to find the event that match the last time stamps in each window.
Box 2: TumblingWindow -
Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.
Box 3: DATEDIFF -
DATEDIFF is a date-specific function that compares and returns the time difference between two DateTime fields, for more information, refer to date functions.
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
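With the three answers filled in, the completed query reads:
~~~
WITH LastInWindow AS
(
    SELECT MAX(Time) AS LastEventTime
    FROM Input TIMESTAMP BY Time
    GROUP BY TumblingWindow(minute, 10)
)
SELECT
    Input.License_plate,
    Input.Make,
    Input.Time
FROM Input TIMESTAMP BY Time
INNER JOIN LastInWindow
    ON DATEDIFF(minute, Input, LastInWindow) BETWEEN 0 AND 10
   AND Input.Time = LastInWindow.LastEventTime;
~~~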
HARD
You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2.
Pipeline1 has the activities shown in the following exhibit.
~~~
 _________________________        _________________________
| Stored procedure        |      | Set variable            |
|-------------------------| ---> |-------------------------|
| Stored procedure1       |      | (x) Set variable1       |
|_________________________|      |_________________________|
~~~
Pipeline2 has the activities shown in the following exhibit.
~~~
 _________________________        _________________________
| Execute pipeline        |      | Set variable            |
|-------------------------| ---> |-------------------------|
| Execute pipeline1       |      | (x) Set variable1       |
|_________________________|      |_________________________|
~~~
You execute Pipeline2, and Stored procedure1 in Pipeline1 fails.
What is the status of the pipeline runs?
A. Pipeline1 and Pipeline2 succeeded.
B. Pipeline1 and Pipeline2 failed.
C. Pipeline1 succeeded and Pipeline2 failed.
D. Pipeline1 failed and Pipeline2 succeeded.
Correct Answer: A 🗳️
Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed.
Consider Pipeline1:
If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.
The failure dependency means this pipeline reports success.
Note:
If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.
Reference:
https://datasavvy.me/category/azure-data-factory/
HOTSPOT -
A company plans to use Platform-as-a-Service (PaaS) to create the new data pipeline process. The process must meet the following requirements:
Ingest:
✑ Access multiple data sources.
✑ Provide the ability to orchestrate workflow.
✑ Provide the capability to run SQL Server Integration Services packages.
Store:
✑ Optimize storage for big data workloads.
✑ Provide encryption of data at rest.
✑ Operate with no size limits.
Prepare and Train:
✑ Provide a fully-managed and interactive workspace for exploration and visualization.
✑ Provide the ability to program in R, SQL, Python, Scala, and Java.
Provide seamless user authentication with Azure Active Directory.
Model & Serve:
✑ Implement native columnar storage.
✑ Support for the SQL language
✑ Provide support for structured streaming.
You need to build the data integration pipeline.
Which technologies should you use? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Ingest
- Logic Apps
- Azure Data Factory
- Azure Automation
Store
- Azure Data Lake Storage
- Azure Blob storage
- Azure files
Prepare and Train
- HDInsight Apache Spark cluster
- Azure Databricks
- HDInsight Apache Storm cluster
Model and Serve
- HDInsight Apache Kafka cluster
- Azure Synapse Analytics
- Azure Data Lake Storage
Ingest: Azure Data Factory -
Azure Data Factory pipelines can execute SSIS packages.
In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow, and data movement: Azure Data Factory, Oozie on HDInsight, and SQL Server Integration Services (SSIS).
Store: Data Lake Storage -
Data Lake Storage Gen1 provides unlimited storage.
Note: Data at rest includes information that resides in persistent storage on physical media, in any digital format. Microsoft Azure offers a variety of data storage solutions to meet different needs, including file, disk, blob, and table storage. Microsoft also provides encryption to protect Azure SQL Database, Azure Cosmos
DB, and Azure Data Lake.
Prepare and Train: Azure Databricks
Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration.
With Azure Databricks, you can set up your Apache Spark environment in minutes, autoscale and collaborate on shared projects in an interactive workspace.
Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and scikit-learn.
Model and Serve: Azure Synapse Analytics
Azure Synapse Analytics/ SQL Data Warehouse stores data into relational tables with columnar storage.
Azure SQL Data Warehouse connector now offers efficient and scalable structured streaming write support for SQL Data Warehouse. Access SQL Data
Warehouse from Azure Databricks using the SQL Data Warehouse connector.
Note: As of November 2019, Azure SQL Data Warehouse is now Azure Synapse Analytics.
Reference:
https://docs.microsoft.com/bs-latn-ba/azure/architecture/data-guide/technology-choices/pipeline-orchestration-data-movement https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
DRAG DROP -
You have the following table named Employees.
You need to calculate the employee_type value based on the hire_date value.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* CASE
* ELSE
* OVER
* PARTITION BY
* ROW_NUMBER
SELECT *, XXXXXXXXXX WHEN hire_date >= '2019-01-01' THEN 'New' XXXXXXXXXX 'Standard' END AS employee_type FROM employees
Box 1: CASE -
CASE evaluates a list of conditions and returns one of multiple possible result expressions.
CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT, UPDATE,
DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING.
Syntax: Simple CASE expression:
CASE input_expression -
WHEN when_expression THEN result_expression [ …n ]
[ ELSE else_result_expression ]
END -
Box 2: ELSE -
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/case-transact-sql
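Putting the two answers into the statement gives:
~~~
SELECT *,
       CASE
           WHEN hire_date >= '2019-01-01' THEN 'New'
           ELSE 'Standard'
       END AS employee_type
FROM employees;
~~~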
HARD HARD
DRAG DROP -
You have an Azure Synapse Analytics workspace named WS1.
You have an Azure Data Lake Storage Gen2 container that contains JSON-formatted files in the following format.
{ "id": "66532691-ab20-11ea-8b1d-936b3ec64e54", "context": { "data": { "eventTime": "2020-06-10T13:43:34.5532", "samplingRate": "100.0", "isSynthetic": "false" }, "session": { "isFirst": "false", "id": "38619c14-7a23-4687-8268-95862c5326b1" }, "custom": { "dimensions": [ { "customerInfo": { "ProfileType": "ExpertUser", "RoomName": "", "CustomerName": "diamond", "UserName": "XXXX@yahoo.com" } }, { "customerInfo": { "ProfileType": "Novice", "RoomName": "", "CustomerName": "topaz", "UserName": "xxxx@outlook.com" } } ] } } }
You need to use the serverless SQL pool in WS1 to read the files.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* opendatasource
* openjson
* openquery
* openrowset
SELECT * FROM [XXXXXXXXXX]( BULK 'https://contoso.blob.core.windows.net/contosodw', FORMAT = 'CSV', FIELDTERMINATOR = '0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b' ) AS q WITH ( id VARCHAR(50), contextdateventTime VARCHAR(50) '$.context.data.eventTime', contextdatasamplingRate VARCHAR(50) '$.context.data.samplingRate', contextdataisSynthetic VARCHAR(50) '$.context.data.isSynthetic', contextsessionisFirst VARCHAR(50) '$.context.session.isFirst', contextsession VARCHAR(50) '$.context.session.id', contextcustomdimensions VARCHAR(MAX) '$.context.custom.dimensions' ) AS q1 CROSS APPLY [XXXXXXXXXX] (contextcustomdimensions) WITH ( ProfileType VARCHAR(50) '$.customerInfo.ProfileType', RoomName VARCHAR(50) '$.customerInfo.RoomName', CustomerName VARCHAR(50) '$.customerInfo.CustomerName', UserName VARCHAR(50) '$.customerInfo.UserName' ) AS q2;
Box 1: openrowset -
The easiest way to see to the content of your CSV file is to provide file URL to OPENROWSET function, specify csv FORMAT.
Box 2: openjson -
You can access your JSON files from the Azure File Storage share by using the mapped drive, as shown in the following example:
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file https://docs.microsoft.com/en-us/sql/relational-databases/json/import-json-documents-into-sql-server
DRAG DROP -
You have an Apache Spark DataFrame named temperatures. A sample of the data is shown in the following table.
(see the original exhibit for the sample temperature data)
You need to produce the following table by using a Spark SQL query.
How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all.
You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
- CAST
- COLLATE
- CONVERT
- FLATTEN
- PIVOT
- UNPIVOT
SELECT * FROM ( SELECT YEAR (Date) Year, MONTH (Date) Month, Temp FROM temperatures WHERE date BETWEEN DATE '2019-01-01' AND DATE '2021-08-31' ) [XXXXXXXXXX] ( AVG ( [XXXXXXXXXX] (Temp AS DECIMAL(4, 1))) FOR Month in ( 1 JAN, 2 FEB, 3 MAR, 4 APR, 5 MAY, 6 JUN, 7 JUL, 8 AUG, 9 SEP, 10 OCT, 11 NOV, 12 DEC ) ) ORDER BY Year ASC
Box 1: PIVOT -
PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output. And PIVOT runs aggregations where they’re required on any remaining column values that are wanted in the final output.
Incorrect Answers:
UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.
Box 2: CAST -
If you want to convert an integer value to a DECIMAL data type in SQL Server use the CAST() function.
Example:
SELECT -
CAST(12 AS DECIMAL(7,2) ) AS decimal_value;
Here is the result:
decimal_value
12.00
Reference:
https://learnsql.com/cookbook/how-to-convert-an-integer-to-a-decimal-in-sql-server/ https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot
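With both boxes filled in, the Spark SQL query from the question reads:
~~~
SELECT * FROM (
    SELECT YEAR(Date) Year, MONTH(Date) Month, Temp
    FROM temperatures
    WHERE Date BETWEEN DATE '2019-01-01' AND DATE '2021-08-31'
)
PIVOT (
    AVG(CAST(Temp AS DECIMAL(4, 1)))
    FOR Month IN (1 JAN, 2 FEB, 3 MAR, 4 APR, 5 MAY, 6 JUN,
                  7 JUL, 8 AUG, 9 SEP, 10 OCT, 11 NOV, 12 DEC)
)
ORDER BY Year ASC;
~~~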
You have an Azure Data Factory that contains 10 pipelines.
You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering when using the monitoring experience in Data Factory.
What should you add to each pipeline?
A. a resource tag
B. a correlation ID
C. a run group ID
D. an annotation
Correct Answer: D 🗳️
Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By adding annotations, you can easily filter and search for specific factory resources.
Reference:
https://www.cathrinewilhelmsen.net/annotations-user-properties-azure-data-factory/
HOTSPOT -
The following code segment is used to create an Azure Databricks cluster.
{ "num_workers": null, "autoscale": { "min workers": 2, "max_workers": 8 }, "cluster_name": "MyCluster", "spark_version": "latest-stable-scala2.11", "spark_conf": { "spark.databricks.cluster.profile": "serverless", "spark.databricks.repl.allowedLanguages": "sql,python,r" }, "node_type_id": "Standard_DS13_v2", "ssh public_keys": [], "custom_tags": { "ResourceClass": "Serverless" }, "spark_env_vars": { "PYSPARK_PYTHON": "/databricks/python3/bin/python3" }, "autotermination_minutes": 90, "enable_elastic_disk": true, "init_scripts": [] }
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
Hot Area:
The Databricks cluster supports multiple concurrent users. YES/NO
The Databricks cluster minimizes costs when running scheduled jobs that execute notebooks. YES/NO
The Databricks cluster supports the creation of a Delta Lake table. YES/NO
1. Yes
The spark_conf setting "spark.databricks.cluster.profile": "serverless" corresponds to the High Concurrency cluster mode (node type Standard_DS13_v2), and High Concurrency clusters support multiple concurrent users.
ref: https://adatis.co.uk/databricks-cluster-sizing/
2. NO
recommended: New Job Cluster.
When you run a job on a new cluster, the job is treated as a data engineering (job) workload subject to the job workload pricing. When you run a job on an existing cluster, the job is treated as a data analytics (all-purpose) workload subject to all-purpose workload pricing.
ref: https://docs.microsoft.com/en-us/azure/databricks/jobs
Scheduled batch workload- Launch new cluster via job
ref: https://docs.databricks.com/administration-guide/capacity-planning/cmbp.html#plan-capacity-and-control-cost
3.YES
Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns.
ref: https://docs.databricks.com/delta/index.html
You are designing a statistical analysis solution that will use custom proprietary Python functions on near real-time data from Azure Event Hubs.
You need to recommend which Azure service to use to perform the statistical analysis. The solution must minimize latency.
What should you recommend?
A. Azure Synapse Analytics
B. Azure Databricks
C. Azure Stream Analytics
D. Azure SQL Database
My answer will be B
Stream Analytics supports “extending SQL language with JavaScript and C# user-defined functions (UDFs)”. There is no mention of Python support; hence Stream Analytics is not correct.
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
Azure Databricks supports near real-time data from Azure Event Hubs. And includes support for R, SQL, Python, Scala, and Java. So I will go for option B.
HOTSPOT -
You have an enterprise data warehouse in Azure Synapse Analytics that contains a table named FactOnlineSales. The table contains data from the start of 2009 to the end of 2012.
You need to improve the performance of queries against FactOnlineSales by using table partitions. The solution must meet the following requirements:
✑ Create four partitions based on the order date.
✑ Ensure that each partition contains all the orders placed during a given calendar year.
How should you complete the T-SQL command? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE TABLE [dbo] . FactOnlineSales
([OnlineSalesKey] [int] NOT NULL,
[OrderDateKey] [datetime] NOT NULL,
[StoreKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar] (20) NOT NULL,
[SalesQuantity] [int] NOT NULL,
[SalesAmount] [money] NOT NULL,
[UnitPrice] [money] NULL)
WITH (CLUSTERED COLUMNSTORE INDEX)
PARTITION ([OrderDateKey] RANGE [RIGHT / LEFT] FOR VALUES
( [ 20090101,20121231 / 20100101,20110101,20120101 / 20090101,20100101,20110101,20120101 ])
RANGE LEFT and RANGE RIGHT both create four partitions here, but they differ in which side of each boundary value a partition includes.
For example, with RANGE LEFT and the values 20100101, 20110101, 20120101, the partitions are: datecol <= 20100101; 20100101 < datecol <= 20110101; 20110101 < datecol <= 20120101; datecol > 20120101.
With RANGE RIGHT and the same values, the partitions are: datecol < 20100101; 20100101 <= datecol < 20110101; 20110101 <= datecol < 20120101; datecol >= 20120101.
RANGE RIGHT is therefore the better fit for calendar years running from January 1st to December 31st.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15
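A sketch of the completed DDL (the hotspot shows the PARTITION clause separately, but in dedicated SQL pool syntax it sits inside the WITH options); RANGE RIGHT with the three boundary values 20100101, 20110101, and 20120101 yields four partitions, one per calendar year from 2009 through 2012:
~~~
CREATE TABLE [dbo].[FactOnlineSales]
(
    [OnlineSalesKey]   [int]           NOT NULL,
    [OrderDateKey]     [datetime]      NOT NULL,
    [StoreKey]         [int]           NOT NULL,
    [ProductKey]       [int]           NOT NULL,
    [CustomerKey]      [int]           NOT NULL,
    [SalesOrderNumber] [nvarchar](20)  NOT NULL,
    [SalesQuantity]    [int]           NOT NULL,
    [SalesAmount]      [money]         NOT NULL,
    [UnitPrice]        [money]         NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([OrderDateKey] RANGE RIGHT FOR VALUES ('20100101', '20110101', '20120101'))
);
~~~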
You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.
Which two columns should you add to the table? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveStartDate] [datetime] NOT NULL,
B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
C. [EffectiveEndDate] [datetime] NULL,
D. [ProductCategory] [nvarchar] (100) NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,
Correct Answer: BE 🗳️
A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members.
Reference:
https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a hopping window that uses a hop size of 10 seconds and a window size of 10 seconds.
Does this meet the goal?
A. Yes
B. No
The majority believe the answer should be A (Yes): a hopping window whose hop size equals its window size behaves exactly like a tumbling window.
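For reference, these two GROUP BY clauses produce the same non-overlapping 10-second windows (the input name is a placeholder):
~~~
-- Tumbling window
SELECT COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY TumblingWindow(second, 10);

-- Hopping window with hop size = window size behaves identically
SELECT COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY HoppingWindow(second, 10, 10);
~~~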
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a hopping window that uses a hop size of 5 seconds and a window size 10 seconds.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B 🗳️
Instead use a tumbling window. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.
If the hop size were equal to the window size this would be true, but because the hop size is smaller, the windows overlap each other and each tweet can be counted more than once.
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
HOTSPOT -
You are building an Azure Stream Analytics job to identify how much time a user spends interacting with a feature on a webpage.
The job receives events based on user actions on the webpage. Each row of data represents an event. Each event has a type of either ‘start’ or ‘end’.
You need to calculate the duration between start and end events.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
SELECT [user], feature, [DATEADD ( / DATEDIFF ( / DATEPART ( ] second, [ISFIRST / LAST / TOPONE ] (Time) OVER (PARTITION BY [user], feature LIMIT DURATION (hour, 1) WHEN Event = 'start'), Time) as duration FROM input TIMESTAMP BY Time WHERE Event = 'end'
Box 1: DATEDIFF -
DATEDIFF function returns the count (as a signed integer value) of the specified datepart boundaries crossed between the specified startdate and enddate.
Syntax: DATEDIFF ( datepart , startdate, enddate )
Box 2: LAST -
The LAST function can be used to retrieve the last event within a specific condition. In this example, the condition is an event of type Start, partitioning the search by PARTITION BY user and feature. This way, every user and feature is treated independently when searching for the Start event. LIMIT DURATION limits the search back in time to 1 hour between the End and Start events.
Example:
SELECT -
[user],
feature,
DATEDIFF(
second,
LAST(Time) OVER (PARTITION BY [user], feature LIMIT DURATION(hour,
1) WHEN Event = ‘start’),
Time) as duration -
FROM input TIMESTAMP BY Time -
WHERE -
Event = ‘end’
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-stream-analytics-query-patterns
You are creating an Azure Data Factory data flow that will ingest data from a CSV file, cast columns to specified types of data, and insert the data into a table in an Azure Synapse Analytics dedicated SQL pool. The CSV file contains three columns named username, comment, and date.
The data flow already contains the following:
✑ A source transformation.
✑ A Derived Column transformation to set the appropriate types of data.
✑ A sink transformation to land the data in the pool.
You need to ensure that the data flow meets the following requirements:
✑ All valid rows must be written to the destination table.
✑ Truncation errors in the comment column must be avoided proactively.
✑ Any rows containing comment values that will cause truncation errors upon insert must be written to a file in blob storage.
Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. To the data flow, add a sink transformation to write the rows to a file in blob storage.
B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.
C. To the data flow, add a filter transformation to filter out rows that will cause truncation errors.
D. Add a select transformation to select only the rows that will cause truncation errors.
Correct Answer: AB 🗳️
A. To the data flow, add a sink transformation to write the rows to a file in blob storage.
This action ensures that the rows causing truncation errors, identified by the Conditional Split, are written to a file in blob storage. This meets the requirement of storing rows that would otherwise cause truncation errors upon insertion.
B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.
The Conditional Split helps identify rows that may cause truncation errors based on specified conditions (in this case, the comment column). This separation allows handling these problematic rows separately.
B: Example:
1. This conditional split transformation defines the maximum length of “title” to be five. Any row that is less than or equal to five will go into the GoodRows stream.
Any row that is larger than five will go into the BadRows stream.
A: - Now we need to log the rows that failed. Add a sink transformation to the BadRows stream for logging. Here, we’ll “auto-map” all of the fields so that we have logging of the complete transaction record. This is a text-delimited CSV file output to a single file in Blob Storage. We’ll call the log file “badrows.csv”.
- The completed data flow is shown below. We are now able to split off error rows to avoid the SQL truncation errors and put those entries into a log file.
Meanwhile, successful rows can continue to write to our target database.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-data-flow-error-rows
DRAG DROP -
You need to create an Azure Data Factory pipeline to process data for the following three departments at your company: Ecommerce, retail, and wholesale. The solution must ensure that data can also be processed for the entire company.
How should you complete the Data Factory data flow script? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values:
* all, ecommerce, retail, wholesale
* dept == ‘ecommerce’, dept == ‘retail’, dept == ‘wholesale’
* dept == ‘ecommerce’, dept == ‘wholesale’, dept == ‘retail’
* disjoint: false
* disjoint: true
* ecommerce, retail, wholesale, all
CleanData split ( [XXXXXXXXXX] [XXXXXXXXXX] ) ~> SplitByDept@ ( [XXXXXXXXXX])
The conditional split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language. The transformation evaluates expressions, and based on the results, directs the data row to the specified stream.
Box 1: dept==’ecommerce’, dept==’retail’, dept==’wholesale’
First we put the condition. The order must match the stream labeling we define in Box 3.
Box 2: THIS IS DISPUTED
Majority say disjoint: true
I think disjoint: false as the arguments and sources are more compelling
disjoint is false because each row goes to the first matching condition only; all remaining rows that match none of the conditions go to the last output stream, all.
Box 3: ecommerce, retail, wholesale, all
Label the streams -
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split
TOUGH QUESTION
DRAG DROP -
You have an Azure Data Lake Storage Gen2 account that contains a JSON file for customers. The file contains two attributes named FirstName and LastName.
You need to copy the data from the JSON file to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values.
You create the following components:
✑ A destination table in Azure Synapse
✑ An Azure Blob storage container
✑ A service principal
Which five actions should you perform in sequence next in is Databricks notebook? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Actions
1. Mount the Data Lake Storage onto DBFS.
2. Write the results to a table in Azure Synapse.
3. Perform transformations on the file.
4. Specify a temporary folder to stage the data.
5. Write the results to Data Lake Storage.
6. Read the file into a data frame.
7. Drop the data frame.
8. Perform transformations on the data frame.
Disputed and needs more research; go with this if nothing else:
To accomplish the task in an Azure Databricks notebook, the logical sequence of actions would be:
Step 1. Mount the Data Lake Storage onto DBFS: This allows access to the JSON file stored in Azure Data Lake Storage using the Databricks File System.
Step 2. Read the file into a data frame: Use Spark to read the JSON file into a DataFrame for processing.
Step 3. Perform transformations on the data frame: Apply transformations to concatenate the FirstName and LastName fields to create a new column.
Step 4. Specify a temporary folder to stage the data: Before writing the data to Azure Synapse, it is a common practice to stage it in a temporary folder.
Step 5. Write the results to a table in Azure Synapse: Finally, write the transformed DataFrame to the destination table in Azure Synapse Analytics.
These steps would ensure the JSON file data is properly transformed and loaded into Azure Synapse Analytics for further use.
HOTSPOT -
You build an Azure Data Factory pipeline to move data from an Azure Data Lake Storage Gen2 container to a database in an Azure Synapse Analytics dedicated
SQL pool.
Data in the container is stored in the following folder structure./in/{YYYY}/{MM}/{DD}/{HH}/{mm}
The earliest folder is /in/2021/01/01/00/00. The latest folder is /in/2021/01/15/01/45.
You need to configure a pipeline trigger to meet the following requirements:
✑ Existing data must be loaded.
✑ Data must be loaded every 30 minutes.
✑ Late-arriving data of up to two minutes must be included in the load for the time at which the data should have arrived.
How should you configure the pipeline trigger? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Type:
* Event
* On-demand
* Schedule
* Tumbling window
Additional properties:
* Prefix: /in/, Event: Blob created
* Recurrence: 30 minutes, Start time: 2021-01-01T00:00
* Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes
* Recurrence: 32 minutes, Start time: 2021-01-15T01:45
Box 1: Tumbling window -
To be able to use the Delay parameter we select Tumbling window.
Box 2: Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes. The recurrence is 30 minutes (not 32), and the start time must be 2021-01-01T00:00 so that the existing data is loaded.
The amount of time to delay the start of data processing for the window. The pipeline run is started after the expected execution time plus the amount of delay.
The delay defines how long the trigger waits past the due time before triggering a new run. The delay doesn’t alter the window startTime.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger
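For illustration only, the trigger could be defined roughly as follows, shown here as a Python dict that mirrors the tumbling-window trigger JSON described in the linked doc; the trigger and pipeline names are placeholders.
~~~
# Assumed names; mirrors the tumbling-window trigger JSON shape from the ADF docs.
trigger_definition = {
    "name": "Trigger30Min",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Minute",
            "interval": 30,                       # data must be loaded every 30 minutes
            "startTime": "2021-01-01T00:00:00Z",  # covers the earliest existing folder
            "delay": "00:02:00",                  # include late data of up to two minutes
            "maxConcurrency": 10,
        },
        "pipeline": {
            "pipelineReference": {"referenceName": "LoadToSqlPool", "type": "PipelineReference"},
            "parameters": {"windowStart": "@trigger().outputs.windowStartTime"},
        },
    },
}
~~~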
HOTSPOT -
You are designing a near real-time dashboard solution that will visualize streaming data from remote sensors that connect to the internet. The streaming data must be aggregated to show the average value of each 10-second interval. The data will be discarded after being displayed in the dashboard.
The solution will use Azure Stream Analytics and must meet the following requirements:
✑ Minimize latency from an Azure Event hub to the dashboard.
✑ Minimize the required storage.
✑ Minimize development effort.
What should you include in the solution? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:
Azure Stream Analytics input type:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI
Azure Stream Analytics output type:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI
Aggregation query location:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI
Azure Stream Analytics input type: Azure Event Hub
Azure Stream Analytics output type: Microsoft Power BI
Aggregation query location: Azure Stream Analytics
Hard question
DRAG DROP -
You have an Azure Stream Analytics job that is a Stream Analytics project solution in Microsoft Visual Studio. The job accepts data generated by IoT devices in the JSON format.
You need to modify the job to accept data generated by the IoT devices in the Protobuf format.
Which three actions should you perform from Visual Studio in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Actions
- Change the Event Serialization Format to Protobuf in the input.json file of the job and reference the DLL.
- Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution.
- Add .NET deserializer code for Protobuf to the custom deserializer project.
- Add .NET deserializer code for Protobuf to the Stream Analytics project.
- Add an Azure Stream Analytics Application project to the solution.
1. Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution.
2. Add .NET deserializer code for Protobuf to the custom deserializer project.
Popular belief in the discussion is that the final step is:
3. Change the Event Serialization Format to Protobuf in the input.json file of the job and reference the DLL.
You have an Azure Storage account and a data warehouse in Azure Synapse Analytics in the UK South region.
You need to copy blob data from the storage account to the data warehouse by using Azure Data Factory. The solution must meet the following requirements:
✑ Ensure that the data remains in the UK South region at all times.
✑ Minimize administrative effort.
Which type of integration runtime should you use?
A. Azure integration runtime
B. Azure-SSIS integration runtime
C. Self-hosted integration runtime
Correct Answer: A 🗳️
Incorrect Answers:
A: An Azure integration runtime can be pinned to the UK South region, so data movement stays in that region with no infrastructure to manage.
C: A self-hosted integration runtime is used for on-premises (or private-network) data stores and would add administrative effort.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
HOTSPOT -
You have an Azure SQL database named Database1 and two Azure event hubs named HubA and HubB. The data consumed from each source is shown in the following table.
You need to implement Azure Stream Analytics to calculate the average fare per mile by driver.
How should you configure the Stream Analytics input for each source? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
HubA:
* Stream
* Reference
HubB:
* Stream
* Reference
Database1:
* Stream
* Reference
HubA: Stream -
HubB: Stream -
Database1: Reference -
Reference data (also known as a lookup table) is a finite data set that is static or slowly changing in nature, used to perform a lookup or to augment your data streams. For example, in an IoT scenario, you could store metadata about sensors (which don’t change often) in reference data and join it with real time IoT data streams. Azure Stream Analytics loads reference data in memory to achieve low latency stream processing
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
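As a sketch (not from the question), this is roughly how a stream input and a reference input differ in a Stream Analytics job definition, shown as Python dicts; resource names and credentials are placeholders, and the exact property layout may vary.
~~~
# Sketch only: <placeholder> values are assumptions, not from the question.
huba_input = {
    "name": "HubA",
    "properties": {
        "type": "Stream",                       # continuously arriving events
        "datasource": {
            "type": "Microsoft.ServiceBus/EventHub",
            "properties": {"serviceBusNamespace": "<namespace>", "eventHubName": "huba"},
        },
        "serialization": {"type": "Json", "properties": {"encoding": "UTF8"}},
    },
}

database1_input = {
    "name": "Database1",
    "properties": {
        "type": "Reference",                    # static / slowly changing lookup data
        "datasource": {
            "type": "Microsoft.Sql/Server/Database",
            "properties": {"server": "<server>", "database": "Database1"},
        },
    },
}
~~~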
You have an Azure Stream Analytics job that receives clickstream data from an Azure event hub.
You need to define a query in the Stream Analytics job. The query must meet the following requirements:
✑ Count the number of clicks within each 10-second window based on the country of a visitor.
✑ Ensure that each click is NOT counted more than once.
How should you define the Query?
~~~
A. SELECT Country, Avg() AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SlidingWindow(second, 10)
B. SELECT Country, Count() AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10)
C. SELECT Country, Avg() AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, HoppingWindow(second, 10, 2)
D. SELECT Country, Count() AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SessionWindow(second, 5, 10)
~~~
Correct Answer: B 🗳️
Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, as in query B. The key differentiators of a Tumbling window are that the windows repeat, do not overlap, and an event cannot belong to more than one tumbling window.
Incorrect Answers:
A: Sliding windows, unlike Tumbling or Hopping windows, output events only for points in time when the content of the window actually changes, that is, when an event enters or exits the window. Every window has at least one event and, as with Hopping windows, events can belong to more than one sliding window.
C: Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap, so events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size.
D: Session windows group events that arrive at similar times, filtering out periods of time where there is no data.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
Hard
HOTSPOT -
You are building an Azure Analytics query that will receive input data from Azure IoT Hub and write the results to Azure Blob storage.
You need to calculate the difference in the number of readings per sensor per hour.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
SELECT sensorId, growth = reading - [ LAG / LAST / LEAD ](reading) OVER (PARTITION BY sensorId [ LIMIT DURATION / OFFSET / WHEN ](hour, 1)) FROM input
Box 1: LAG -
The LAG analytic operator allows one to look up a "previous" event in an event stream, within certain constraints. It is very useful for computing the rate of growth of a variable, detecting when a variable crosses a threshold, or when a condition starts or stops being true.
Box 2: LIMIT DURATION -
Example: Compute the rate of growth, per sensor:
SELECT sensorId,
growth = reading -
LAG(reading) OVER (PARTITION BY sensorId LIMIT DURATION(hour, 1))
FROM input -
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/lag-azure-stream-analytics
You need to schedule an Azure Data Factory pipeline to execute when a new file arrives in an Azure Data Lake Storage Gen2 container.
Which type of trigger should you use?
A. on-demand
B. tumbling window
C. schedule
D. event
Correct Answer: D 🗳️
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption of, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger
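For illustration only: a storage-event trigger that fires on blob creation, shown as a Python dict following the JSON shape in the linked doc; the scope, container path, and pipeline names are placeholders.
~~~
# Illustrative only: every <placeholder> is an assumption, not from the question.
event_trigger = {
    "name": "TriggerOnNewFile",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
                     "Microsoft.Storage/storageAccounts/<account>",
            "events": ["Microsoft.Storage.BlobCreated"],
            "blobPathBeginsWith": "/input-container/blobs/",
            "ignoreEmptyBlobs": True,
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "CopyNewFile", "type": "PipelineReference"}}
        ],
    },
}
~~~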
You have two Azure Data Factory instances named ADFdev and ADFprod. ADFdev connects to an Azure DevOps Git repository.
You publish changes from the main branch of the Git repository to ADFdev.
You need to deploy the artifacts from ADFdev to ADFprod.
What should you do first?
A. From ADFdev, modify the Git configuration.
B. From ADFdev, create a linked service.
C. From Azure DevOps, create a release pipeline.
D. From Azure DevOps, update the main branch.
Correct Answer: C 🗳️
In Azure Data Factory, continuous integration and delivery (CI/CD) means moving Data Factory pipelines from one environment (development, test, production) to another.
Note: The following is a guide for setting up an Azure Pipelines release that automates the deployment of a data factory to multiple environments.
1. In Azure DevOps, open the project that’s configured with your data factory.
2. On the left side of the page, select Pipelines, and then select Releases.
3. Select New pipeline, or, if you have existing pipelines, select New and then New release pipeline.
4. In the Stage name box, enter the name of your environment.
5. Select Add artifact, and then select the git repository configured with your development data factory. Select the publish branch of the repository for the Default branch. By default, this publish branch is adf_publish.
6. Select the Empty job template.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment