Exam questions Flashcards
You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.
CREATE TABLE [dbo].[DimEmployee] ( [EmployeeKey] [int] IDENTITY(1,1) NOT NULL, [EmployeeID] [int] NOT NULL, [FirstName] [varchar](100) NOT NULL, [LastName] [varchar](100) NOT NULL, [JobTitle] [varchar](100) NULL, [LastHireDate] [date] NULL, [StreetAddress] [varchar](500) NOT NULL, [City] [varchar](200) NOT NULL, [StateProvince] [varchar](50) NOT NULL, [PostalCode] [varchar](10) NOT NULL )
You need to alter the table to meet the following requirements:
✑ Ensure that users can identify the current manager of employees.
✑ Support creating an employee reporting hierarchy for your entire company.
✑ Provide fast lookup of the managers’ attributes such as name and job title.
Which column should you add to the table?
A. [ManagerEmployeeID] [smallint] NULL
B. [ManagerEmployeeKey] [smallint] NULL
C. [ManagerEmployeeKey] [int] NULL
D. [ManagerName] varchar NULL
Correct Answer: C
We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is int.
C, because the manager reference should use the same data type as the key it points to (EmployeeKey, an int).
Reference:
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular
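To make the fix concrete, here is a minimal T-SQL sketch; the self-join is illustrative only, not part of the question. ManagerEmployeeKey joins back to EmployeeKey on the same dimension, which supports the hierarchy and fast manager lookups.
~~~
-- Sketch only: ManagerEmployeeKey references EmployeeKey (int) on the same dimension.
ALTER TABLE [dbo].[DimEmployee]
ADD [ManagerEmployeeKey] [int] NULL;

-- Illustrative self-join to look up a manager's attributes.
SELECT e.EmployeeKey,
       e.FirstName,
       m.FirstName AS ManagerFirstName,
       m.JobTitle  AS ManagerJobTitle
FROM dbo.DimEmployee AS e
LEFT JOIN dbo.DimEmployee AS m
    ON e.ManagerEmployeeKey = m.EmployeeKey;
~~~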
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable( EmployeeID int, EmployeeName string, EmployeeStartDate date)
USING Parquet
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
EmployeeName|EmployeeStartDate|EmployeeID
Alice | 2020-01-25 | 24
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
SELECT EmployeeID FROM mytestdb.dbo.myParquetTable WHERE EmployeeName = 'Alice';
What will be returned by the query?
A. 24
B. an error
C. a null value
I did a test: I waited one minute and tried the query in a serverless SQL pool, and I received 24 as the result. So I don't understand why B has been voted so much, because the answer is A) 24 without a doubt.
Debate on B as follows
Answer is B, but not because of the lowercase. The case has nothing to do with the error.
If you look attentively, you will notice that we create table mytestdb.myParquetTable, but the select statement contains the reference to table mytestdb.dbo.myParquetTable (!!! - dbo).
Here is the error message I got:
Error: spark_catalog requires a single-part namespace, but got [mytestdb, dbo].
I just tried to run the commands, and the error you got is because you queried through a Spark pool (!!). I did that as a test and got the exact same error. To query the data using a Spark pool, you don't use the '.dbo' reference; that only works if you're using a Synapse serverless SQL pool.
So the correct answer is A!
HARD
DRAG DROP -
You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36 months and has the following characteristics:
✑ Is partitioned by month
✑ Contains one billion rows
✑ Has clustered columnstore index
At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.
Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Actions
- Switch the partition containing the stale data from SalesFact to SalesFact_Work.
- Truncate the partition containing the stale data.
- Drop the SalesFact_Work table.
- Create an empty table named SalesFact_Work that has the same schema as SalesFact.
- Execute a DELETE statement where the value in the Date column is more than 36 months ago.
- Copy the data to a new table by using CREATE TABLE AS SELECT (CTAS).
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.
SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the partitions align on their respective boundaries and that the table definitions match.
Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch in the new data.
Step 3: Drop the SalesFact_Work table.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
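A minimal T-SQL sketch of the three steps, with hypothetical column names, distribution, and boundary values; the work table must match SalesFact's schema, distribution, and partition boundaries.
~~~
-- Step 1: empty work table with matching schema, distribution, and partition boundaries (placeholders shown).
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20190101, 20190201 /* ... one boundary per month ... */))
)
AS SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: metadata-only switch of the stale month (partition 1 here) out of SalesFact.
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;

-- Step 3: drop the work table, and the stale data with it.
DROP TABLE dbo.SalesFact_Work;
~~~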
You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.
~~~
/topfolder/
    File1.csv
    /folder1/
        File2.csv
    /folder2/
        File3.csv
    File4.csv
~~~
You create an external table named ExtTable that has LOCATION='/topfolder/'.
When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?
A. File2.csv and File3.csv only
B. File1.csv and File4.csv only
C. File1.csv, File2.csv, File3.csv, and File4.csv
D. File1.csv only
I believe the answer should be B.
In case of a serverless pool a wildcard should be added to the location.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
“Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path.”
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
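For illustration, a hedged sketch of the external table with a recursive location; the data source, file format, and column names are assumptions, not from the question.
~~~
-- '/topfolder/' returns only files directly in that folder (File1.csv and File4.csv);
-- '/topfolder/**' would also traverse folder1 and folder2.
CREATE EXTERNAL TABLE ExtTableRecursive
(
    [Col1] VARCHAR(100),
    [Col2] VARCHAR(100)
)
WITH
(
    LOCATION = '/topfolder/**',
    DATA_SOURCE = MyDataLake,     -- assumed external data source
    FILE_FORMAT = MyCsvFormat     -- assumed CSV external file format
);
~~~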
HOTSPOT -
You are planning the deployment of Azure Data Lake Storage Gen2.
You have the following two reports that will access the data lake:
✑ Report1: Reads three columns from a file that contains 50 columns.
✑ Report2: Queries a single record based on a timestamp.
You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.
What should you recommend for each report? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
Report1:
* Avro
* CSV
* Parquet
* TSV
Report2:
* Avro
* CSV
* Parquet
* TSV
~~~
1: Parquet - a column-oriented binary file format
2: Avro - a row-based format, and it has a logical timestamp type
https://youtu.be/UrWthx8T3UY
You are designing the folder structure for an Azure Data Lake Storage Gen2 container.
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current year or current month.
Which folder structure should you recommend to support fast queries and simplified folder security?
A. /{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}{YYYY}{MM}{DD}.csv
B. /{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
C. /{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}{DD}.csv
Correct Answer: D
There’s an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went on.
Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices, organizations, and customers. It’s important to pre-plan the directory layout for organization, security, and efficient processing of the data for down-stream consumers. A general template to consider might be the following layout:
{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/
Serverless SQL pools offer a straightforward method of querying data, including CSV, JSON, and Parquet formats, stored in Azure Storage.
So, setting up the CSV files within Azure Storage in a Hive-formatted folder hierarchy, i.e. /{yyyy}/{mm}/{dd}/, actually helps SQL query the data much faster, since only the partitioned segment of the data is queried.
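As a hedged example of why the date folders at the end help (the storage path and container name are hypothetical), a serverless query can prune folders with the filepath() function:
~~~
-- Only the {YYYY}/{MM} folders that match the filters are read.
SELECT r.*
FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/data/Sales/SourceA/*/*/*/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
     ) AS r
WHERE r.filepath(1) = '2021'   -- first wildcard = {YYYY}
  AND r.filepath(2) = '06';    -- second wildcard = {MM}
~~~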
HOTSPOT -
You need to output files from Azure Data Factory.
Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
Columnar format:
* Avro
* GZip
* Parquet
* TXT
JSON with a timestamp:
* Avro
* GZip
* Parquet
* TXT
~~~
Box 1: Parquet -
Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.
Box 2: Avro -
An Avro schema is created using JSON format.
AVRO supports timestamps.
Note: Azure Data Factory supports the following file formats (not GZip or TXT):
✑ Avro format
✑ Binary format
✑ Delimited text format
✑ Excel format
✑ JSON format
✑ ORC format
✑ Parquet format
✑ XML format
Reference:
https://www.datanami.com/2018/05/16/big-data-file-formats-demystified
HOTSPOT -
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.
You need to move the files to a different folder and transform the data to meet the following requirements:
✑ Provide the fastest possible query times.
✑ Automatically infer the schema from the underlying files.
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Copy behavior:
* Flatten hierarchy
* Merge files
* Preserve hierarchy
Sink file type:
* CSV
* JSON
* Parquet
* TXT
1. Merge Files
2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
Larger files lead to better performance and reduced costs.
Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size).
Hard
HOTSPOT -
You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit.
see site for img of this
Dim_Employee
* iEmployeeID
* vcEmployeeLastName
* vcEmployeeMName
* vcEmployeeFirstName
* dtEmployeeHireDate
* dtEmployeeLevel
* dtEmployeeLastPromotion
Fact_DailyBookings
* iDailyBookingsID
* iCustomerID
* iTimeID
* iEmployeeID
* iItemID
* iQuantityOrdered
* dExchangeRate
* iCountryofOrigin
* mUnitPrice
Dim_Customer
* iCustomerID
* vcCustomerName
* vcCustomerAddress1
* vcCustomerCity
Dim_Time
* iTimeID
* iCalendarDay
* iCalendarWeek
* iCalendarMonth
* vcDayofWeek
* vcDayofMonth
* vcDayofYear
* iHolidayIndicator
All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be relatively static with very few data inserts and updates.
Which type of table should you use for each table? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Dim_Customer:
* Hash distributed
* Round-robin
* Replicated
Dim_Employee:
* Hash distributed
* Round-robin
* Replicated
Dim_Time:
* Hash distributed
* Round-robin
* Replicated
Fact_DailyBookings:
* Hash distributed
* Round-robin
* Replicated
Box 1: Replicated -
Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently implemented as round-robin to replicated.
Box 2: Replicated -
Box 3: Replicated -
Box 4: Hash-distributed -
For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Reference:
https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/
The answer is correct.
The dimensions are under 2 GB, so there is no point in using hash distribution for them.
Common distribution methods for tables:
The table category often determines which option to choose for distributing the table.
Table category - Recommended distribution option
Fact - Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Dimension - Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Staging - Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-methods-for-tables
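A minimal sketch of the resulting DDL; the column lists are trimmed and illustrative, not the full model.
~~~
-- Small dimension: replicate to every Compute node.
CREATE TABLE dbo.Dim_Customer
(
    iCustomerID    INT NOT NULL,
    vcCustomerName VARCHAR(100) NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Large fact table: hash-distribute on a join key.
CREATE TABLE dbo.Fact_DailyBookings
(
    iDailyBookingsID INT NOT NULL,
    iCustomerID      INT NOT NULL,
    iQuantityOrdered INT NULL
)
WITH (DISTRIBUTION = HASH(iCustomerID), CLUSTERED COLUMNSTORE INDEX);
~~~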
SIMILAR TO ANOTHER QUESTION BUT SAME ANSWERS
HOTSPOT -
You have an Azure Data Lake Storage Gen2 container.
Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read files in the container but cannot modify the files.
You need to design a data archiving solution that meets the following requirements:
✑ New data is accessed frequently and must be available as quickly as possible.
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:
~~~
Five-year-old data:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.
Seven-year-old data:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.
~~~
Box 1: Move to cool storage -
Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.
DRAG DROP -
You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* CLUSTERED INDEX
* COLLATE
* DISTRIBUTION
* PARTITION
* PARTITION FUNCTION
* PARTITION SCHEME
Answer Area
~~~
CREATE TABLE table1
(
ID INTEGER,
col1 VARCHAR(10),
col2 VARCHAR (10)
) WITH
<XXXXXXXXXXXX> = HASH (ID) ,
<YYYYYYYYYYYYY> (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000))
~~~
Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH(distribution_column_name), which assigns each row to one distribution by hashing the value stored in distribution_column_name.
Box 2: PARTITION -
Table partition options. Syntax:
PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,…n] ] ))
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
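For reference, the completed statement (Box 1 = DISTRIBUTION, Box 2 = PARTITION) reads:
~~~
CREATE TABLE table1
(
    ID   INTEGER,
    col1 VARCHAR(10),
    col2 VARCHAR(10)
)
WITH
(
    DISTRIBUTION = HASH(ID),
    PARTITION (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000))
);
~~~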
You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:
✑ Can return an employee record from a given point in time.
✑ Maintains the latest employee information.
✑ Minimizes query complexity.
How should you model the employee data?
A. as a temporal table
B. as a SQL graph table
C. as a degenerate dimension table
D. as a Type 2 slowly changing dimension (SCD) table
Correct Answer: D 🗳️
A Type 2 SCD supports versioning of dimension members. Often the source system doesn’t store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
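A minimal sketch of what a Type 2 SCD employee dimension could look like; the column names are illustrative assumptions.
~~~
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey INT IDENTITY(1,1) NOT NULL,  -- surrogate key, one per version
    EmployeeID  INT NOT NULL,                -- business key from the source system
    FirstName   VARCHAR(100) NOT NULL,
    JobTitle    VARCHAR(100) NULL,
    StartDate   DATE NOT NULL,               -- version validity start
    EndDate     DATE NULL,                   -- version validity end (NULL = open)
    IsCurrent   BIT NOT NULL                 -- flag for the latest version
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
~~~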
Hard
You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named VNET1.
You are building a SQL pool in Azure Synapse that will use data from the data lake.
Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used to assign the
Sales group access to the files in the data lake.
You plan to load data to the SQL pool every hour.
You need to ensure that the SQL pool can load the sales data from the data lake.
Which three actions should you perform? Each correct answer presents part of the solution.
NOTE: Each area selection is worth one point.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
C. Create a shared access signature (SAS).
D. Add your Azure Active Directory (Azure AD) account to the Sales group.
E. Use the shared access signature (SAS) as the credentials for the data load process.
F. Create a managed identity.
F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
The managed identity grants permissions to the dedicated SQL pools in the workspace.
Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically managed identity in Azure AD.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity
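A hedged sketch of the load step, with hypothetical storage account, container, and table names; the workspace managed identity (the one added to the Sales group) is supplied as the credential.
~~~
COPY INTO dbo.StageSales
FROM 'https://datalake1.dfs.core.windows.net/sales/2021/*.parquet'
WITH
(
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
~~~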
VIEW WEBSITE FOR IMGs
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.
[MISSING STUFF]
User1 executes a query on the database, and the query returns the results shown in the following exhibit.
[MISSING STUFF]
User1 is the only user who has access to the unmasked data.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:
When User2 queries the YearlyIncome column, the values returned will be [answer choice].
* a random number
* the values stored in the database
* XXXX
* 0
When User1 queries the BirthDate column, the values returned will be [answer choice].
* a random date
* the values stored in the database
* xxxX
* 1900-01-01
Box 1: 0 -
The YearlyIncome column is of the money data type.
The Default masking function: Full masking according to the data types of the designated fields
✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
Box 2: the values stored in the database
Users with administrator privileges are always excluded from masking, and see the original data without any mask.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
* Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
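A minimal sketch of how such masks could be defined; the table name is hypothetical, while the column names follow the exhibit.
~~~
ALTER TABLE dbo.DimCustomer
ALTER COLUMN YearlyIncome ADD MASKED WITH (FUNCTION = 'default()');  -- money -> 0 for masked users

ALTER TABLE dbo.DimCustomer
ALTER COLUMN BirthDate ADD MASKED WITH (FUNCTION = 'default()');     -- date -> 1900-01-01 for masked users

GRANT UNMASK TO User1;  -- User1 keeps seeing the values stored in the database
~~~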
You have an enterprise data warehouse in Azure Synapse Analytics.
Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing the data to the data warehouse.
The external table has three columns.
You discover that the Parquet files have a fourth column named ItemID.
Which command should you run to add the ItemID column to the external table?
A.
ALTER EXTERNAL TABLE [Ext].[Items] ADD [ItemID] int;
B.
DROP EXTERNAL FILE FORMAT parquetfile1;
CREATE EXTERNAL FILE FORMAT parquetfile1 WITH (FORMAT_TYPE = PARQUET, DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');
C.
DROP EXTERNAL TABLE [Ext].[Items]
CREATE EXTERNAL TABLE [Ext].[Items] ([ItemID] [int] NULL, [ItemName] nvarchar(50) NULL, [ItemType] nvarchar(20) NULL, [ItemDescription] nvarchar(250)) WITH (LOCATION = '/Items/', DATA_SOURCE = AzureDataLakeStore, FILE_FORMAT = PARQUET, REJECT_TYPE = VALUE, REJECT_VALUE = 0)
D.
ALTER TABLE [Ext].[Items] ADD [ItemID] int;
C is correct, since “altering the schema or format of an external SQL table is not supported”.
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/external-sql-tables
HOTSPOT -
You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace enabled. The system has files that contain data stored in the Apache Parquet format.
You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following requirements:
✑ No transformations must be performed.
✑ The original folder structure must be retained.
✑ Minimize time required to perform the copy activity.
How should you configure the copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Source dataset type:
* Binary
* Parquet
* Delimited text
Copy activity copy behavior:
* FlattenHierarchy
* MergeFiles
* PreserveHierarchy
Box 1: Parquet -
For Parquet datasets, the type property of the copy activity source must be set to ParquetSource.
Box 2: PreserveHierarchy -
PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
Incorrect Answers:
✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.
✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it’s an autogenerated file name.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage
The answer seems correct: the data is already stored as Parquet, and the requirement is to perform no transformations, so the answer is right.
You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.
You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region. The solution must minimize costs.
Which type of data redundancy should you use?
A. geo-redundant storage (GRS)
B. read-access geo-redundant storage (RA-GRS)
C. zone-redundant storage (ZRS)
D. locally-redundant storage (LRS)
B is right
Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region becomes unavailable.
You plan to implement an Azure Data Lake Gen 2 storage account.
You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs.
Which type of replication should you use for the storage account?
A. geo-redundant storage (GRS)
B. geo-zone-redundant storage (GZRS)
C. locally-redundant storage (LRS)
D. zone-redundant storage (ZRS)
First, about the Question:
What fails? -> The (complete) DataCenter, not the region and not components inside a DataCenter.
So, what helps us in this situation?
LRS: “...copies your data synchronously three times within a single physical location in the primary region.” The important part here is the SINGLE PHYSICAL LOCATION (meaning inside the same data center, so in our scenario none of the copies would survive).
-> C is wrong.
ZRS: “…copies your data synchronously across three Azure availability zones in the primary region” (meaning, in different Data Centers. In our scenario this would meet the requirements)
-> D is right
GRS/GZRS: like LRS/ZRS but with the data centers in different Azure regions. This works too but is more expensive than ZRS. So ZRS is the right answer.
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
Hard
HOTSPOT -
You have a SQL pool in Azure Synapse.
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.
How should you configure the table? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Distribution:
* Hash
* Replicated
* Round-robin
Indexing:
* Clustered
* Clustered columnstore
* Heap
Partitioning:
* Date
* None
Distribution: Round-Robin
Indexing: Heap
Partitioning: None
Round-robin - this is the simplest distribution model, not great for querying but fast to process
Heap - no brainer when creating staging tables
No partitions - this is a staging table, why add effort to partition, when truncated daily?
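A minimal sketch of such a staging table; the column list is illustrative.
~~~
CREATE TABLE stg.DailyLoad
(
    OrderID   INT NOT NULL,
    OrderDate DATE NOT NULL,
    Amount    DECIMAL(18, 2) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Truncate before each daily load, then bulk load into the heap.
TRUNCATE TABLE stg.DailyLoad;
~~~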
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
~~~
| Name             | Data type    | Nullable |
|------------------|--------------|----------|
| PurchaseKey      | Bigint       | No       |
| DateKey          | Int          | No       |
| SupplierKey      | Int          | No       |
| StockItemKey     | Int          | No       |
| PurchaseOrderID  | Int          | No       |
| OrderedQuantity  | Int          | Yes      |
| OrderedOuters    | Int          | No       |
| ReceivedOuters   | Int          | No       |
| Package          | Nvarchar(50) | No       |
| IsOrderFinalized | Bit          | No       |
| LineageKey       | Int          | No       |
~~~
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101 AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey, IsOrderFinalized
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on IsOrderFinalized
Correct Answer: B
Hash-distributed tables improve query performance on large fact tables.
To balance the parallel processing, select a distribution column that:
✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values.
✑ Does not have NULLs, or has only a few NULLs.
✑ Is not a date column.
Incorrect Answers:
C: Round-robin tables are useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
Is it hash-distributed on PurchaseKey and not on IsOrderFinalized because IsOrderFinalized yields far fewer distinct values (rows contain only yes/no values) compared to PurchaseKey?
HOTSPOT -
From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video plays.
The data contains the following columns.
~~~
| Name               | Sample value        |
|--------------------|---------------------|
| EventCategory      | Videos              |
| EventAction        | Play                |
| EventLabel         | Contoso Promotional |
| ChannelGrouping    | Social              |
| TotalEvents        | 150                 |
| UniqueEvents       | 120                 |
| SessionsWithEvents | 99                  |
| Date               | 15 Jan 2021         |
~~~
You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension.
To which table should you add each column? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
EventCategory:
* DimChannel
* DimDate
* DimEvent
* FactEvents
ChannelGrouping:
* DimChannel
* DimDate
* DimEvent
* FactEvents
TotalEvents:
* DimChannel
* DimDate
* DimEvent
* FactEvents
Box 1: DimEvent -
Box 2: DimChannel -
Box 3: FactEvents -
Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc
Reference:
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You convert the files to compressed delimited text files.
Does this meet the goal?
A. Yes
B. No
The answer is A
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
Compression not only helps reduce the size or space occupied by a file in storage, but also increases the speed of file movement during transfer.
Hard
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You copy the files to a table that has a columnstore index.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
From the documentation, loads to heap tables are faster than loads to indexed tables. So it is better to use a heap table than a columnstore index table in this case.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is more than 1 MB.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
No, rows need to be less than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.
You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily.
You need to implement a solution to make the dataset available for the reports. The solution must minimize query times.
What should you implement?
A. an ordered clustered columnstore index
B. a materialized view
C. result set caching
D. a replicated table
only time a materialized view appears in this question set
B is correct.
Materialized view and result set caching
These two features in dedicated SQL pool are used for query performance tuning. Result set caching is used for getting high concurrency and fast response from repetitive queries against static data.
To use the cached result, the form of the cache requesting query must match with the query that produced the cache. In addition, the cached result must apply to the entire query.
Materialized views allow data changes in the base tables. Data in materialized views can be applied to a piece of a query. This support allows the same materialized views to be used by different queries that share some computation for faster performance.
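A hedged sketch of such a materialized view; the table and column names are illustrative, not from the question. The pre-computed aggregation is stored and maintained automatically, so the daily reports only apply their WHERE filters on top of it.
~~~
CREATE MATERIALIZED VIEW dbo.mvInventorySummary
WITH (DISTRIBUTION = HASH(ProductKey))
AS
SELECT ProductKey,
       WarehouseKey,
       SUM(QuantityOnHand) AS TotalOnHand,
       COUNT_BIG(*)        AS RowCnt
FROM dbo.FactInventory
GROUP BY ProductKey, WarehouseKey;
~~~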
You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.
Which format should you use for the tables in DB1?
A. CSV
B. ORC
C. JSON
D. Parquet
Correct Answer: D
Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each database existing in serverless Apache Spark pools.
For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool database.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables
So both A and D would be synchronized, but Parquet is faster, so D.
HARD HARD HARD
You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java.
Which service should you recommend using to process the streaming data?
A. Azure Event Hubs
B. Azure Data Factory
C. Azure Stream Analytics
D. Azure Databricks
Correct Answer: D
Azure Databricks, as the question is clearly asking for Java programming support.
Reference:
[SEE SITE/REFERENCE FOR CONTEXT]
https://www.examtopics.com/exams/microsoft/dp-203/view/3/
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour.
File sizes range from 4 KB to 5 GB.
You need to ensure that the files stored in the container are optimized for batch processing.
What should you do?
A. Convert the files to JSON
B. Convert the files to Avro
C. Compress the files
D. Merge the files
Debated quite heavily (see site). Go with the following:
Selected Answer: D
D. Merge the files
To optimize the files stored in the Azure Data Lake Storage Gen2 container for batch processing, you should merge the files. Merging smaller files into larger files is a common optimization technique in data processing scenarios.
Having a large number of small files can introduce overhead in terms of file management, metadata processing, and data scanning. By merging the smaller files into larger files, you can reduce this overhead and improve the efficiency of batch processing operations.
Merging the files is especially beneficial when dealing with varying file sizes, as it helps to create a more balanced distribution of data across the files and reduces the impact of small files on processing performance.
Therefore, in this scenario, merging the files would be the recommended approach to optimize the files for batch processing.
Correct Answer: B
Avro supports batch and is very relevant for streaming.
Note: Avro is a framework developed within Apache’s Hadoop project. It is a row-based storage format that is widely used for serialization. Avro stores its schema in JSON format, making it easy to read and interpret by any program. The data itself is stored in binary format, making it compact and efficient.
Reference:
https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/
You cannot merge the files if you don’t know how many files exist in ADLS Gen2. In this case, you could easily create a file larger than 100 GB in size and decrease performance, so B is the correct answer: convert to Avro.
29
HOTSPOT -
You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the following exhibit.
~~~
{
  "rules": [
    {
      "enabled": true,
      "name": "contosorule",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "version": {
            "delete": {
              "daysAfterCreationGreaterThan": 60
            }
          },
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            }
          }
        },
        "filters": {
          "blobTypes": [
            "blockBlob"
          ],
          "prefixMatch": [
            "container1/contoso"
          ]
        }
      }
    }
  ]
}
~~~
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
The files are [answer choice] after 30 days:
* deleted from the container
* moved to archive storage
* moved to cool storage
* moved to hot storage
The storage policy applies to [answer choice]:
* container1/contoso.csv
* container1/docs/contoso.json
* container1/mycontoso/contoso.csv
~~~
Box 1: moved to cool storage -
The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage. It supports blobs currently at the hot tier.
Box 2: container1/contoso.csv -
As defined by prefixMatch.
prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-sensitive prefixes. A prefix string must start with a container name.
Reference:
https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managementpolicybaseblob.tiertocool
You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:
✑ TransactionType: 40 million rows per transaction type
✑ CustomerSegment: 4 million per customer segment
✑ TransactionMonth: 65 million rows per month
✑ AccountType: 500 million per account type
You have the following query requirements:
✑ Analysts will most commonly analyze transactions for a given month.
✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type
You need to recommend a partition strategy for the table to minimize query times.
On which column should you recommend partitioning the table?
A. CustomerSegment
B. AccountType
C. TransactionType
D. TransactionMonth
Correct Answer: D
For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases.
Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
Select D, because analysts will most commonly analyze transactions for a given month.
HOTSPOT -
You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following table.
You do not expect that the logs will be accessed during the retention periods.
You need to recommend a solution for account1 that meets the following requirements:
✑ Automatically deletes the logs at the end of each retention period
✑ Minimizes storage costs
What should you include in the recommendation? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
To minimize storage costs:
* Store the infrastructure logs and the application logs in the Archive access tier
* Store the infrastructure logs and the application logs in the Cool access tier
* Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier
To delete logs automatically:
* Azure Data Factory pipelines
* Azure Blob storage lifecycle management rules
* Immutable Azure Blob storage time-based retention policies
Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier
“Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive tier and then deleted or moved to the Hot tier after 45 days, you’ll be charged an early deletion fee equivalent to 135 (180 minus 45) days of storing that blob in the Archive tier.”
For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should be stored for a minimum of 30 days. The cool tier has lower storage costs and higher access costs compared to the hot tier.
For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of hours.
Data in the archive tier should be stored for a minimum of 180 days.
Box 2: Azure Blob storage lifecycle management rules
Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the desired access tier when your specified conditions are met. You can also use lifecycle management to expire data at the end of its life.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
DISCUSSION
“Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive tier and then deleted or moved to the Hot tier after 45 days, you’ll be charged an early deletion fee equivalent to 135 (180 minus 45) days of storing that blob in the Archive tier.” <- from the sourced link.
This explains why we have to use two different access tiers rather than both as archive.
You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.
What should you recommend?
A. JSON
B. Parquet
C. CSV
D. Avro
Correct Answer: B
Need Parquet to support both Databricks and PolyBase.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql
Avro schema definitions are JSON records. PolyBase does not support JSON, so why would it support Avro? A CSV does not contain the schema, since everything is treated as a string. So only Parquet is left to choose.
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions.
You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times.
What should you do?
A. Insert the data from stg.Sales into dbo.Sales.
B. Switch the first partition from dbo.Sales to stg.Sales.
C. Switch the first partition from stg.Sales to dbo.Sales.
D. Update dbo.Sales from stg.Sales.
DEBATED
* This must be C, since the need is to overwrite dbo.Sales with the content of stg.Sales.
SWITCH source TO target
* This is quite a weird situation because according to Microsoft documentation: “When reassigning a table’s data as a partition to an already-existing partitioned table, or switching a partition from one partitioned table to another, the target partition must exist and it MUST BE EMPTY.” (https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-table-transact-sql?view=azure-sqldw-latest&preserve-view=true#switch–partition-source_partition_number_expression–to–schema_name–target_table–partition-target_partition_number_expression-) Therefore none of the options would be possible if considering that both tables are not empty on that partition. Then I have no idea what would be the correct answer, although I answered C.
Exam Topics pick
Correct Answer: B 🗳️
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly. Then you can switch out the partition with data for an empty partition from another table
Note: Syntax:
SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ]
Switches a block of data in one of the following ways:
✑ Reassigns all data of a table as a partition to an already-existing partitioned table.
✑ Switches a partition from one partitioned table to another.
✑ Reassigns all data in one partition of a partitioned table to an existing non-partitioned table.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
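For context on the syntax being debated, a hedged sketch of a source-to-target switch; the partition number is assumed, and TRUNCATE_TARGET, where supported in dedicated SQL pools, empties the target partition as part of the switch so the operation stays metadata-only.
~~~
ALTER TABLE stg.Sales SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON);
~~~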
Hard
You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.
You plan to keep a record of changes to the available fields.
The supplier data contains the following columns.
~~~
| Name | Description |
|------|-------------|
| SupplierSystemID | Unique supplier ID in an enterprise resource planning (ERP) system |
| SupplierName | Name of the supplier company |
| SupplierAddress1 | Address of the supplier company |
| SupplierAddress2 | Second address of the supplier company (if applicable) |
| SupplierCity | City of the supplier company |
| SupplierStateProvince | State or province of the supplier company |
| SupplierCountry | Country of the supplier company |
| SupplierPostalCode | Postal code of the supplier company |
| SupplierDescription | Free-text description of the supplier company |
| SupplierCategory | Category of goods provided by the supplier company |
~~~
Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. surrogate primary key
B. effective start date
C. business key
D. last modified date
E. effective end date
F. foreign key
DEBATED but pretty confident in following
The answer is ABE. A type 2 SCD requires a surrogate key to uniquely identify each record when versioning.
See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types under SCD Type 2 “ the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member.”
A business key is already part of this table - SupplierSystemID. The column is derived from the source data.
Exam Topics Answer
Correct Answer: BCE
C: The Slowly Changing Dimension transformation requires at least one business key column.
BE: Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output.
Reference:
https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation
hard
HOTSPOT -
You have a Microsoft SQL Server database that uses a third normal form schema.
You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool
You need to design the dimension tables. The solution must optimize read operations.
What should you include in the solution? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Transform data for the dimension tables by:
- Maintaining to a third normal form
- Normalizing to a fourth normal form
- Denormalizing to a second normal form
For the primary key columns in the dimension tables, use:
- New IDENTITY columns
- A new computed column
- The business key column from the source system
Box 1: Denormalize to a second normal form
Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join of higher normal form relations as a base relation.
Denormalization increases the performance in data retrieval at cost of bringing update anomalies to a database.
Box 2: New identity columns -
The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique over time.
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.
Reference:
https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/ https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
HOTSPOT -
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.
How should you complete the code?
To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
df.write
* .bucketBy
* .partitionBy
* .range
* .sortBy
- ("*")
- ("StoreID", "Hour")
- ("StoreID", "Year", "Month", "Day", "Hour")
.mode("append")
* .csv("/Purchases")
* .json("/Purchases")
* .parquet("/Purchases")
* .saveAsTable("/Purchases")
Box 1: partitionBy -
We should overwrite at the partition level.
Example:
df.write.partitionBy("y", "m", "d")
  .mode(SaveMode.Append)
  .parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("StoreID", "Year", "Month", "Day", "Hour")
If partitioned by StoreID and Hour only, the same hours from different days would go to the same partition, which would be inefficient.
Box 3: parquet(“/Purchases”)
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data
hard hard
You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following specifications:
✑ Contain sales data for 20,000 products.
✑ Use hash distribution on a column named ProductID.
✑ Contain 2.4 billion records for the years 2019 and 2020.
Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?
A. 40
B. 240
C. 400
D. 2,400
Correct Answer: A
Each partition should have around 1 million records per distribution, and dedicated SQL pools already split every table across 60 distributions.
We have the formula: Records/(Partitions*60)= 1 million
Partitions= Records/(1 million * 60)
Partitions= 2.4 x 1,000,000,000/(1,000,000 * 60) = 40
Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows. Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000 partitions.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
hard
HOTSPOT -
You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.
You create a table by using the Transact-SQL statement shown in the following exhibit.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:
DimProduct is a [answer choice] slowly changing dimension (SCD).
* Type 0
* Type 1
* Type 2
The ProductKey column is [answer choice].
* a surrogate key
* a business key
* an audit column
Type 2, because there are start and end date columns, and ProductKey is a surrogate key; ProductNumber seems to be the business key.
ProductKey is a surrogate key, as it is an identity column.
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
~~~
| Name | Data type | Nullable |
|------------------|--------------|----------|
| PurchaseKey | Bigint | No |
| DateKey | Int | No |
| SupplierKey | Int | No |
| StockItemKey | Int | No |
| PurchaseOrderID | Int | Yes |
| OrderedQuantity | Int | No |
| OrderedOuters | Int | No |
| ReceivedOuters | Int | No |
| Package | Nvarchar(50)| No |
| IsOrderFinalized | Bit | No |
| LineageKey | Int | No |
~~~
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT SupplierKey, StockItemKey, COUNT(*) FROM FactPurchase WHERE DateKey >= 20210101 AND DateKey <= 20210131 GROUP BY SupplierKey, StockItemKey
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on DateKey
Correct Answer: B
Hash-distributed tables improve query performance on large fact tables, and are the focus of this article. Round-robin tables are useful for improving loading speed.
Incorrect:
Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
Lots of discussion on this one
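For orientation, a minimal T-SQL sketch of the table using the marked answer (hash-distributed on PurchaseKey). The clustered columnstore index is an assumption and not part of the question:
~~~
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      BIGINT       NOT NULL,
    DateKey          INT          NOT NULL,
    SupplierKey      INT          NOT NULL,
    StockItemKey     INT          NOT NULL,
    PurchaseOrderID  INT          NULL,
    OrderedQuantity  INT          NOT NULL,
    OrderedOuters    INT          NOT NULL,
    ReceivedOuters   INT          NOT NULL,
    Package          NVARCHAR(50) NOT NULL,
    IsOrderFinalized BIT          NOT NULL,
    LineageKey       INT          NOT NULL
)
WITH (
    DISTRIBUTION = HASH([PurchaseKey]),  -- marked answer: avoid DateKey, which would put each date in one distribution
    CLUSTERED COLUMNSTORE INDEX          -- assumption: the usual choice for large fact tables
);
~~~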
You are implementing a batch dataset in the Parquet format.
Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool.
You need to minimize storage costs for the solution.
What should you do?
A. Use Snappy compression for the files.
B. Use OPENROWSET to query the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files.
D. Store all data as string in the Parquet files.
Answer should be A, because this talks about minimizing storage costs, not querying costs
I'd go with A.
Lots of debate on this one.
HARD hard link
DRAG DROP -
You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2 account from an Azure Synapse Analytics serverless SQL pool.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.
Select and Place:
* Create an external file format object * Create an external data source * Create a query that uses Create Table as Select * Create a table * Create an external table
Step 1: Create an external data source
You can create external tables in Synapse SQL pools via the following steps:
1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
Step 2: Create an external file format object
Creating an external file format is a prerequisite for creating an external table.
3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.
Step 3: Create an external table
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
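A minimal T-SQL sketch of the three steps for a serverless SQL pool; the storage URL, object names, and column list are hypothetical, and a CREDENTIAL clause would be added if the container is not public:
~~~
-- Step 1: external data source that references the Azure Data Lake Storage Gen2 account
CREATE EXTERNAL DATA SOURCE files_source
WITH (LOCATION = 'https://<account>.dfs.core.windows.net/<container>');

-- Step 2: external file format that describes the files
CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);

-- Step 3: external table on top of the files placed on the data source
CREATE EXTERNAL TABLE dbo.ExternalFiles
(
    Id   INT,
    Name NVARCHAR(100)
)
WITH (
    LOCATION = 'folder1/',
    DATA_SOURCE = files_source,
    FILE_FORMAT = parquet_format
);
~~~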
You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and employee transactions.
From a source system, you have a flat extract that has the following fields:
✑ EmployeeID
✑ FirstName
✑ LastName
✑ Recipient
✑ GrossAmount
✑ TransactionID
✑ GovernmentID
✑ NetAmountPaid
✑ TransactionDate
You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.
Which two tables should you create? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. a dimension table for Transaction
B. a dimension table for EmployeeTransaction
C. a dimension table for Employee
D. a fact table for Employee
E. a fact table for Transaction
C. a dimension table for Employee
E. a fact table for Transaction
C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer’s name and address are stored in a dimension table and updated only when the customer’s profile changes. To minimize the size of a large fact table, the customer’s name and address don’t need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query can join the two tables to associate a customer’s profile and transactions.
E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool. For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview
You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the history of the data by adding new rows as the data changes.
Which type of slowly changing dimension (SCD) should you use?
A. Type 0
B. Type 1
C. Type 2
D. Type 3
Correct Answer: C
A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
Incorrect Answers:
B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.
D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
DRAG DROP -
You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row followed by a properly formatted carriage return (\r) and line feed (\n).
You are implementing a pattern that batch loads the files daily into a dedicated SQL pool in Azure Synapse Analytics by using PolyBase.
You need to skip the header row when you import the files into the data warehouse. Before building the loading pattern, you need to prepare the required database objects in Azure Synapse Analytics.
Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
NOTE: Each correct selection is worth one point
Select and Place:
* Create a database scoped credential that uses Azure Active Directory Application and a Service Principal Key * Create an external data source that uses the abfs location * Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages * Create an external file format and set the First Row option
go with chat
STEP 1: Create a database scoped credential that uses Azure Active Directory Application and a Service Principal Key
Step 2: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 3: Create an external file format and set the First_Row option.
DEBATED
ExamTopics answer
Step 1: Create an external data source that uses the abfs location
Create External Data Source to reference Azure Data Lake Store Gen 1 or 2
Step 2: Create an external file format and set the First_Row option.
Create External File Format.
Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages
To use PolyBase, you must create external tables to reference your external data.
Use reject options.
Note: REJECT options don’t apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they’re specified here so that the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to import before it stops the import.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-t-sql-objects https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-as-select-transact-sql
hard GO TO LINK FOR THIS
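As a rough sketch of the objects from the first ("go with chat") sequence; every name, the tenant and client IDs, and the key are placeholders, and a database master key is assumed to exist:
~~~
-- 1. Database scoped credential that uses an Azure Active Directory application and a service principal key
CREATE DATABASE SCOPED CREDENTIAL adls_credential
WITH IDENTITY = '<client-id>@https://login.microsoftonline.com/<tenant-id>/oauth2/token',
     SECRET = '<service-principal-key>';

-- 2. External data source that uses the abfss location
CREATE EXTERNAL DATA SOURCE adls_source
WITH (
    LOCATION = 'abfss://<container>@<account>.dfs.core.windows.net',
    CREDENTIAL = adls_credential,
    TYPE = HADOOP
);

-- 3. External file format that sets the First_Row option to skip the header
CREATE EXTERNAL FILE FORMAT csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);
~~~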
HOTSPOT -
You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020.
You need to ensure that the table meets the following requirements:
✑ Minimizes the processing time to delete data that is older than 10 years
✑ Minimizes the I/O for queries that use year-to-date values
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE TABLE [dbo].[FactTransaction] ( [TransactionTypeID] int NOT NULL, [TransactionDateID] int NOT NULL, [CustomerID] int NOT NULL, [RecipientID] int NOT NULL, [Amount] money NOT NULL ) WITH ( XXXXXXXXXXXXXXXXXXXX ( YYYYYYYYYYYYYYYYYYYY RANGE RIGHT FOR VALUES (20200101, 20200201, 20200301, 20200401, 20200501, 20200601)))
Box 1: PARTITION -
RANGE RIGHT FOR VALUES is used with PARTITION.
Box 2: [TransactionDateID]
Partition on the date column.
Example: Creating a RANGE RIGHT partition function on a datetime column
The following partition function partitions a table or index into 12 partitions, one for each month of a year’s worth of values in a datetime column.
CREATE PARTITION FUNCTION [myDateRangePF1] (datetime)
AS RANGE RIGHT FOR VALUES ('20030201', '20030301', '20030401',
'20030501', '20030601', '20030701', '20030801',
'20030901', '20031001', '20031101', '20031201');
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql
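A sketch of the completed statement with the two boxes filled in; the clustered columnstore index is shown explicitly even though it is the default:
~~~
CREATE TABLE [dbo].[FactTransaction]
(
    [TransactionTypeID] int   NOT NULL,
    [TransactionDateID] int   NOT NULL,
    [CustomerID]        int   NOT NULL,
    [RecipientID]       int   NOT NULL,
    [Amount]            money NOT NULL
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,   -- the default, stated explicitly; minimizes I/O for year-to-date scans
    PARTITION ( [TransactionDateID] RANGE RIGHT FOR VALUES   -- Box 1: PARTITION, Box 2: the date key
        (20200101, 20200201, 20200301, 20200401, 20200501, 20200601) )
);
~~~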
You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics serverless SQL pool.
You execute the Transact-SQL query shown in the following exhibit.
SELECT payment_type, SUM(fare_amount) AS fare_total FROM OPENROWSET ( BULK 'csv/busfare/tripdata_2020*.csv', DATA_SOURCE = 'BusData', FORMAT = 'CSV', PARSER_VERSION = '2.0', FIRSTROW = 2 ) WITH ( payment_type INT 10, fare_amount FLOAT 11 ) AS nyc GROUP BY payment_type ORDER BY payment_type;
What do the query results include?
A. Only CSV files in the tripdata_2020 subfolder.
B. All files that have file names that begin with "tripdata_2020".
C. All CSV files that have file names that contain "tripdata_2020".
D. Only CSV files that have file names that begin with "tripdata_2020".
Correct Answer: D
hard link
DRAG DROP -
You use PySpark in Azure Databricks to parse the following JSON input.
{ "persons": [ { "name":"Keith", "age":30, "dogs":["Fido", "Fluffy"] { "name":"Donna", "age":46, "dogs": ["Spot"] }
You need to output the data in the following tabular format.
| owner | age | dog    |
|-------|-----|--------|
| Keith | 30  | Fido   |
| Keith | 30  | Fluffy |
| Donna | 46  | Spot   |
How should you complete the PySpark code? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* alias
* array_union
* createDataFrame
* explode
* select
* translate
dbutils.fs.put("/tmp/source.json", source_json, True)
source_df = spark.read.option("multiline", "true").json("/tmp/source.json")
persons = source_df.XXXXXXXX(YYYYYYYYYY("persons").alias("persons"))
persons_dogs = persons.select(col("persons.name").alias("owner"), col("persons.age").alias("age"), explode("persons.dogs").ZZZZZZZZZZ("dog"))
display(persons_dogs)
Box 1: select -
Box 2: explode -
Box 3: alias -
pyspark.sql.Column.alias returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
Reference:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html https://docs.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/explode
HOTSPOT -
You are designing an application that will store petabytes of medical imaging data.
When the data is first created, the data will be accessed frequently during the first week. After one month, the data must be accessible within 30 seconds, but files will be accessed infrequently. After one year, the data will be accessed infrequently but must be accessible within five minutes.
You need to select a storage strategy for the data. The solution must minimize costs.
Which storage tier should you use for each time frame? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
~~~
First week:
* Archive
* Cool
* Hot
After one month:
* Archive
* Cool
* Hot
After one year:
* Archive
* Cool
* Hot
~~~
First Week: Hot
After One Month: Cool
After one year: Cool
hard
You have an Azure Synapse Analytics Apache Spark pool named Pool1.
You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure and data types vary by file.
You need to load the files into the tables. The solution must maintain the source data types.
What should you do?
A. Use a Conditional Split transformation in an Azure Synapse data flow.
B. Use a Get Metadata activity in Azure Data Factory.
C. Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless SQL pool.
D. Load the data by using PySpark.
Should be D, it’s about Apache Spark pool, not serverless SQL pool.
You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an all-purpose cluster named cluster1.
You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs.
What should you do first?
A. Configure a global init script for workspace1.
B. Create a cluster policy in workspace1.
C. Upgrade workspace1 to the Premium pricing tier.
D. Create a pool in workspace1.
Answer D is correct. Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances.
You can use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly.
Databricks Pools, a managed cache of virtual machine instances that enables clusters to start and scale 4 times faster.
Reference:
https://databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html
Hard
HOTSPOT -
You are building an Azure Stream Analytics job that queries reference data from a product catalog file. The file is updated daily.
The reference data input details for the file are shown in the Input exhibit. (Click the Input tab.)
YOU'LL HAVE TO GO TO THE WEBSITE FOR THIS IMAGE
You need to configure the Stream Analytics job to pick up the new reference data.
What should you configure? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Path pattern:
* {date}/product.csv
* {date}/{time}/product.csv
* product.csv
* */product.csv
Date format:
* MM/DD/YYYY
* YYYY/MM/DD
* YYYY-DD-MM
* YYYY-MM-DD
Box 1 = {date}/product.csv, because the reference data is loaded on a daily basis, so the path pattern needs only the date, not the time.
Box 2 is a straightforward answer: YYYY-MM-DD.
THE OFFICIAL ANSWER TO THIS IS NOT CLEAR
HOTSPOT -
You have the following Azure Stream Analytics query.
WITH step1 AS (SELECT * FROM input1 PARTITION BY StateID INTO 10), step2 AS (SELECT * FROM input2 PARTITION BY StateID INTO 10) SELECT * INTO output FROM step1 PARTITION BY StateID UNION SELECT * INTO output FROM step2 PARTITION BY StateID
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
Hot Area:
The query combines two streams of partitioned data. YES/NO The stream scheme key and count must match the output scheme. YES/NO Providing 60 streaming units will optimize the performance of the query. YES/NO
DEBATED HEAVILY
False (60/40), True (reasonably confident), False (reasonably confident).
https://learn.microsoft.com/en-us/azure/stream-analytics/repartition
The first is False, because this:
“The following example query joins two streams of repartitioned data.”
It’s extracted from the link above, and it’s pointing to our query! Repartitioned and not partitioned.
Second is True, it’s explicitly written
The output scheme should match the stream scheme key and count so that each substream can be flushed independently.
Third is False,
“In general, six SUs are needed for each partition.”
In the example we have 10 partitions for step1 and 10 for step2, so it should be 120 SUs and not 60.
hard
HOTSPOT -
You are building a database in an Azure Synapse Analytics serverless SQL pool.
You have data stored in Parquet files in an Azure Data Lake Storage Gen2 container.
Records are structured as shown in the following sample.
{ "id": 123, "address_housenumber": "19c", "address_line": "Memory Lane", "applicant1_name": "Jane", "applicant2_name": "Dev" }
The records contain two applicants at most.
You need to build a table that includes only the address fields.
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
SEE IMG ON SITE
XXXXXXXX applications *** CREATE EXTERNAL TABLE * CREATE TABLE * CREATE VIEW *** WITH ( LOCATION = 'applications/', DATA_SOURCE = applications_ds, FILE_FORMAT = applications_file_format ) AS SELECT id, [address_housenumber] AS addresshousenumber, [address_line1] AS addressline1 FROM XXXXXXXX (BULK 'https://contoso1.dfs.core.windows.net/applications/year=*/*.parquet', .... *** CROSS APPLY * OPENJSON * OPENROWSET *** FORMAT = 'PARQUET') AS [r] GO
Box 1: CREATE EXTERNAL TABLE -
An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or serverless SQL pool.
Syntax:
CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }
( <column_definition> [ ,...n ] )
WITH (
LOCATION = 'folder_or_filepath',
DATA_SOURCE = external_data_source_name,
FILE_FORMAT = external_file_format_name )
Box 2. OPENROWSET -
When using serverless SQL pool, CETAS is used to create an external table and export query results to Azure Storage Blob or Azure Data Lake Storage Gen2.
Example:
AS -
SELECT decennialTime, stateName, SUM(population) AS population
FROM -
OPENROWSET(BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=*/*.parquet',
FORMAT = 'PARQUET') AS [r]
GROUP BY decennialTime, stateName
GO -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
Hard
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage Gen2 account named Account1.
You plan to access the files in Account1 by using an external table.
You need to create a data source in Pool1 that you can reference when you create the external table.
How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE EXTERNAL DATA SOURCE source1 WITH ( LOCATION = 'https://account1.[ blob, dfs, table ].core.windows.net', [ PUSHDOWN = ON, TYPE = BLOB_STORAGE, TYPE = HADOOP ] )
Box 1: DEBATED; go for dfs, I think.
Box 2: TYPE = HADOOP
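For reference, the statement completed with the options discussed above (dfs endpoint, TYPE = HADOOP); any credential clause is omitted here:
~~~
CREATE EXTERNAL DATA SOURCE source1
WITH (
    LOCATION = 'https://account1.dfs.core.windows.net',
    TYPE = HADOOP
);
~~~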
You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool named Pool1.
You need to store data in storage1. The data will be read by Pool1. The solution must meet the following requirements:
* Enable Pool1 to skip columns and rows that are unnecessary in a query.
* Automatically create column statistics.
* Minimize the size of files.
Which type of file should you use?
A. JSON
B. Parquet
C. Avro
D. CSV
Correct Answer: B 🗳️
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV files statistics is supported.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics
DRAG DROP -
You plan to create a table in an Azure Synapse Analytics dedicated SQL pool.
Data in the table will be retained for five years. Once a year, data that is older than five years will be deleted.
You need to ensure that the data is distributed evenly across partitions. The solution must minimize the amount of time required to delete old data.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
VALUES
* CustomerKey
* HASH
* ROUND ROBIN
* REPLICATE
* OrderDateKey
* SalesOrderNumber
SEE SITE FOR IMG
DISTRIBUTION = XXXXXXXXXX ([ProductKey])
PARTITION [ XXXXXXXXX] RANGE RIGHT FOR VALUES
Box 1: HASH -
Box 2: OrderDateKey -
In most cases, table partitions are created on a date column.
A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data early. Then you can switch out the partition with data for an empty partition from another table.
VERY SIMILAR TO ANOTHER QUESTION
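A sketch of the completed statement plus the metadata-only delete that partitioning enables; the non-key columns, the staging table, and the boundary values are hypothetical:
~~~
CREATE TABLE dbo.FactSales
(
    ProductKey   int   NOT NULL,
    OrderDateKey int   NOT NULL,
    SalesAmount  money NOT NULL
)
WITH (
    DISTRIBUTION = HASH([ProductKey]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES
        (20210101, 20220101, 20230101, 20240101, 20250101) )
);

-- Once a year: switch the oldest partition out to an identically structured empty table,
-- which removes the old data as a metadata-only operation instead of a slow DELETE.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_old PARTITION 1;
~~~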
HOTSPOT -
You have an Azure Data Lake Storage Gen2 service.
You need to design a data archiving solution that meets the following requirements:
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Data over five years old:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
* Move to hot storage.
Data over seven years old:
* Delete the blob.
* Move to archive storage.
* Move to cool storage.
Box 1: Move to cool storage -
Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.
HOTSPOT -
You plan to create an Azure Data Lake Storage Gen2 account.
You need to recommend a storage solution that meets the following requirements:
✑ Provides the highest degree of data resiliency
✑ Ensures that content remains available for writes if a primary data center fails
What should you include in the recommendation? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Replication mechanism:
* Change feed
* Zone-redundant storage (ZRS)
* Read-access geo-redundant storage (RA-GRS)
* Read-access geo-zone-redundant storage (RA-GZRS)
Failover process:
* Failover initiated by Microsoft
* Failover manually initiated by the customer
* Failover automatically initiated by an Azure Automation job
DEBATED HEAVILY
Zone-redundant storage (ZRS)
'Ensures that content remains available for writes if a primary data center fails.' RA-GRS and RA-GZRS provide read access only after failover. The correct answer is ZRS, as stated in the link below: "Microsoft recommends using ZRS in the primary region for Azure Data Lake Storage Gen2 workloads." https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json
Failover initiated by Microsoft.
Customer-managed account failover is not yet supported in accounts that have a hierarchical namespace (Azure Data Lake Storage Gen2). To learn more, see Blob storage features available in Azure Data Lake Storage Gen2.
You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.
CREATE TABLE [dbo].[DimProduct] ( [ProductKey] [int] IDENTITY (1, 1) NOT NULL, [ProductSourceID] [int] NOT NULL, [ProductName] [nvarchar] (100) NOT NULL, [Color] [nvarchar] (15) NULL, [SellStartDate] [date] NOT NULL, [SellEndDate] [date] NULL, [RowInsertedDateTime] [datetime] NOT NULL, [RowUpdatedDateTime] [datetime] NOT NULL, [ETLAuditID] [int] NOT NULL )
Which two columns should you add to the table? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveEndDate] [datetime] NULL,
B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
C. [ProductCategory] [nvarchar] (100) NOT NULL,
D. [EffectiveStartDate] [datetime] NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,
Correct Answer: BE
A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members.
SEE SITE FOR IMAGE EXPLANATION
https://www.examtopics.com/exams/microsoft/dp-203/view/6/
Reference:
https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/
Hard
DRAG DROP -
You have an Azure subscription.
You plan to build a data warehouse in an Azure Synapse Analytics dedicated SQL pool named pool1 that will contain staging tables and a dimensional model.
Pool1 will contain the following tables.
| Name | Number of rows | Update frequency | Description |
|------|----------------|------------------|-------------|
| Common.Date | 7,300 | New rows inserted yearly | Contains one row per date for the last 20 years. Contains columns named Year, Month, Quarter, and IsWeekend. |
| Marketing.WebSessions | 1,500,500,000 | Hourly inserts and updates | Fact table that contains counts of sessions and page views, including foreign key values for date, channel, device, and medium. |
| Staging.WebSessions | 300,000 | Hourly truncation and inserts | Staging table for web session data, including descriptive fields for channel, device, and medium. |
You need to design the table storage for pool1. The solution must meet the following requirements:
✑ Maximize the performance of data loading operations to Staging.WebSessions.
✑ Minimize query times for reporting queries against the dimensional model.
Which type of table distribution should you use for each table? To answer, drag the appropriate table distribution types to the correct tables. Each table distribution type may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* Hash
* replicated
* round-robin
Common.Date: xxxxxxxx Marketing.WebSessions: xxxxxxxxxx Staging.WebSessions: xxxxxxxxxxxx
Box 1: Replicated -
The best table storage option for a small table is to replicate it across all the Compute nodes.
Box 2: Hash -
Hash-distribution improves query performance on large fact tables.
Box 3: Round-robin -
Round-robin distribution is useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
Replicated (Because its a Dimension table)
Hash (Fact table with High volume of data)
Round-Robin (Staging table)
Hard
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a table named FactInternetSales that will be a large fact table in a dimensional model. FactInternetSales will contain 100 million rows and two columns named SalesAmount and OrderQuantity. Queries executed on FactInternetSales will aggregate the values in SalesAmount and OrderQuantity from the last year for a specific product. The solution must minimize the data size and query execution time.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE TABLE [dbo].[FactInternetSales] ( [ProductKey] int NOT NULL, [OrderDateKey] int NOT NULL, [CustomerKey] int NOT NULL, [PromotionKey] int NOT NULL, [SalesOrderNumber] nvarchar(20) NOT NULL, [OrderQuantity] smallint NOT NULL, [UnitPrice] money NOT NULL, [SalesAmount] money NOT NULL ) WITH ( [ CLUSTERED COLUMNSTORE INDEX / CLUSTERED INDEX ([OrderDateKey]) / HEAP / INDEX on [ProductKey] ], DISTRIBUTION = [ HASH([OrderDateKey]) / HASH([ProductKey]) / REPLICATE / ROUND_ROBIN ] );
Box 1: (CLUSTERED COLUMNSTORE INDEX
CLUSTERED COLUMNSTORE INDEX -
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage. You can also achieve gains up to 10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x) SP1, columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a transactional workload.
Note: Clustered columnstore index
A clustered columnstore index is the physical storage for the entire table.
To reduce fragmentation of the column segments and improve performance, the columnstore index might store some data temporarily into a clustered index called a deltastore and a B-tree list of IDs for deleted rows. The deltastore operations are handled behind the scenes. To return the correct query results, the clustered columnstore index combines query results from both the columnstore and the deltastore.
Box 2: HASH([ProductKey])
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table is designed to achieve high performance for queries on large tables.
Choose a distribution column with data that distributes evenly
Incorrect:
* Not HASH([OrderDateKey]): OrderDateKey is a date column, and you should not distribute on a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.
* A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables don’t require data movement. Replication requires extra storage, though, and isn’t practical for large tables.
* A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than the other distribution methods.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
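Putting the two boxes together, a sketch of the completed definition:
~~~
CREATE TABLE [dbo].[FactInternetSales]
(
    [ProductKey]       int          NOT NULL,
    [OrderDateKey]     int          NOT NULL,
    [CustomerKey]      int          NOT NULL,
    [PromotionKey]     int          NOT NULL,
    [SalesOrderNumber] nvarchar(20) NOT NULL,
    [OrderQuantity]    smallint     NOT NULL,
    [UnitPrice]        money        NOT NULL,
    [SalesAmount]      money        NOT NULL
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,        -- Box 1: best compression and scan performance for 100 million rows
    DISTRIBUTION = HASH([ProductKey])   -- Box 2: queries aggregate by product, and ProductKey is not a date
);
~~~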
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named Table1. Table1 contains the following:
✑ One billion rows
✑ A clustered columnstore index
✑ A hash-distributed column named Product Key
✑ A column named Sales Date that is of the date data type and cannot be null
Thirty million rows will be added to Table1 each month.
You need to partition Table1 based on the Sales Date column. The solution must optimize query performance and data loading.
How often should you create a partition?
A. once per month
B. once per year
C. once per day
D. once per week
Debated
Correct Answer: B 🗳️
Remembering that the data is split across 60 distributions and that we need a MINIMUM of 1 million rows per distribution per partition, we have:
A. once per month = 30 million / 60 = 500k records per distribution
B. once per year = 360 million / 60 = 6 million records per distribution
C. once per day = about 1 million / 60 = 16k records per distribution
D. once per week = about 7.5 million / 60 = 125k records per distribution
correct should be B
Need a minimum 1 million rows per distribution. Each table is 60 distributions. 30 millions rows is added each month. Need 2 months to get a minimum of 1 million rows per distribution in a new partition.
Note: When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributions.
Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition
You have an Azure Databricks workspace that contains a Delta Lake dimension table named Table1.
Table1 is a Type 2 slowly changing dimension (SCD) table.
You need to apply updates from a source table to Table1.
Which Apache Spark SQL operation should you use?
A. CREATE
B. UPDATE
C. ALTER
D. MERGE
Correct Answer: D 🗳️
When applying updates to a Type 2 slowly changing dimension (SCD) table in Azure Databricks, the best option is to use the MERGE operation in Apache Spark SQL. This operation allows you to combine the data from the source table with the data in the destination table, and then update or insert the appropriate records. The MERGE operation provides a powerful and flexible way to handle updates for SCD tables, as it can handle both updates and inserts in a single operation. Additionally, this operation can be performed on Delta Lake tables, which can easily handle the ACID transactions needed for handling SCD updates.
The Delta provides the ability to infer the schema for data input which further reduces the effort required in managing the schema changes. The Slowly Changing
Data(SCD) Type 2 records all the changes made to each key in the dimensional table. These operations require updating the existing rows to mark the previous values of the keys as old and then inserting new rows as the latest values. Also, Given a source table with the updates and the target table with dimensional data,
SCD Type 2 can be expressed with the merge.
Example:
// Implementing SCD Type 2 operation using the merge function
customersTable
  .as("customers")
  .merge(
    stagedUpdates.as("staged_updates"),
    "customers.customerId = mergeKey")
  .whenMatched("customers.current = true AND customers.address <> staged_updates.address")
  .updateExpr(Map(
    "current" -> "false",
    "endDate" -> "staged_updates.effectiveDate"))
  .whenNotMatched()
  .insertExpr(Map(
    "customerid" -> "staged_updates.customerId",
    "address" -> "staged_updates.address",
    "current" -> "true",
    "effectiveDate" -> "staged_updates.effectiveDate",
    "endDate" -> "null"))
  .execute()
Reference:
https://www.projectpro.io/recipes/what-is-slowly-changing-data-scd-type-2-operation-delta-table-databricks
You are designing an Azure Data Lake Storage solution that will transform raw JSON files for use in an analytical workload.
You need to recommend a format for the transformed files. The solution must meet the following requirements:
✑ Contain information about the data types of each column in the files.
✑ Support querying a subset of columns in the files.
✑ Support read-heavy analytical workloads.
✑ Minimize the file size.
What should you recommend?
A. JSON
B. CSV
C. Apache Avro
D. Apache Parquet
Correct Answer: D 🗳️
Parquet, an open-source file format for Hadoop, stores nested data structures in a flat columnar format.
Compared to a traditional approach where data is stored in a row-oriented approach, Parquet file format is more efficient in terms of storage and performance.
It is especially good for queries that read particular columns from a "wide" (with many columns) table since only needed columns are read, and IO is minimized.
Incorrect:
Not C:
The Avro format is the ideal candidate for storing data in a data lake landing zone because:
1. Data from the landing zone is usually read as a whole for further processing by downstream systems (the row-based format is more efficient in this case).
2. Downstream systems can easily retrieve table schemas from Avro files (there is no need to store the schemas separately in an external meta store).
3. Any source schema change is easily handled (schema evolution).
Reference:
https://www.clairvoyant.ai/blog/big-data-file-formats
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is less than 1 MB.
Does this meet the goal?
A. Yes
B. No
Correct Answer: A 🗳️
Variations of this question
Solution: You convert the files to compressed delimited text files.
Does this meet the goal? YES
Solution: You copy the files to a table that has a columnstore index.
Does this meet the goal? NO
Solution: You modify the files to ensure that each row is more than 1 MB.
Does this meet the goal? NO
Solution: You modify the files to ensure that each row is less than 1 MB.
Does this meet the goal? YES
PolyBase can only load rows that are smaller than 1 MB.
Note on Polybase Load: PolyBase is a technology that accesses external data stored in Azure Blob storage or Azure Data Lake Store via the T-SQL language.
Extract, Load, and Transform (ELT)
Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a data warehouse, and then transformed.
The basic steps for implementing a PolyBase ELT for dedicated SQL pool are:
Extract the source data into text files.
Land the data into Azure Blob storage or Azure Data Lake Store.
Prepare the data for loading.
Load the data into dedicated SQL pool staging tables using PolyBase.
Transform the data.
Insert the data into production tables.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-service-capacity-limits https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
You plan to create a dimension table in Azure Synapse Analytics that will be less than 1 GB.
You need to create the table to meet the following requirements:
✑ Provide the fastest query time.
✑ Minimize data movement during queries.
Which type of table should you use?
A. replicated
B. hash distributed
C. heap
D. round-robin
Correct Answer: A 🗳️
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed. 2 GB is not a hard limit.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables
You are designing a dimension table in an Azure Synapse Analytics dedicated SQL pool.
You need to create a surrogate key for the table. The solution must provide the fastest query performance.
What should you use for the surrogate key?
A. a GUID column
B. a sequence object
C. an IDENTITY column
Correct Answer: C 🗳️
Use IDENTITY to create surrogate keys by using a dedicated SQL pool in Azure Synapse Analytics.
Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
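A minimal sketch of the pattern, with hypothetical table and column names:
~~~
CREATE TABLE dbo.DimAccount
(
    AccountKey      INT IDENTITY(1, 1) NOT NULL,  -- surrogate key generated by the dedicated SQL pool
    AccountSourceID INT NOT NULL,                 -- business key from the source system
    AccountName     NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
~~~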
You have an Azure Data Lake Storage Gen2 account that contains a container named container1. You have an Azure Synapse Analytics serverless SQL pool that contains a native external table named dbo.Table1. The source data for dbo.Table1 is stored in container1. The folder structure of container1 is shown in the following exhibit.
The external data source is defined by using the following statement.
[SEE WEBSITE FOR IMAGES]
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
When selecting all the rows in dbo.Table1, data from the mydata2.csv file will be returned. YES/NO When selecting all the rows in dbo.Table1, data from the mydata3.csv file will be returned. YES/NO When selecting all the rows in dbo.Table1, data from the _mydata4.csv file will be returned. YES/NO
Yes, No, No.
Both Hadoop and native external tables will skip files with names that begin with an underscore (_) or a period (.).
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a fact table named Table1 that will store sales data from the last three years. The solution must be optimized for the following query operations:
- Show order counts by week.
- Calculate sales totals by region.
- Calculate sales totals by product.
- Find all the orders from a given month.
Which data should you use to partition Table1?
A. product
B. month
C. week
D. region
Selected Answer: B
When designing a fact table in a data warehouse, it is important to consider the types of queries that will be run against it. In this case, the queries that need to be optimized include: show order counts by week, calculate sales totals by region, calculate sales totals by product, and find all the orders from a given month.
Partitioning the table by month would be the best option in this scenario as it would allow for efficient querying of data by month, which is necessary for the query operations described above. For example, it would be easy to find all the orders from a given month by only searching the partition for that specific month.
You are designing the folder structure for an Azure Data Lake Storage Gen2 account.
You identify the following usage patterns:
- Users will query data by using Azure Synapse Analytics serverless SQL pools and Azure Synapse Analytics serverless Apache Spark pools.
- Most queries will include a filter on the current year or week.
- Data will be secured by data source.
You need to recommend a folder structure that meets the following requirements:
- Supports the usage patterns
- Simplifies folder security
- Minimizes query times
Which folder structure should you recommend?
A. \DataSource\SubjectArea\YYYY\WW\FileData_YYYY_MM_DD.parquet
B. \DataSource\SubjectArea\YYYY-WW\FileData_YYYY_MM_DD.parquet
C. DataSource\SubjectArea\WW\YYYY\FileData_YYYY_MM_DD.parquet
D. \YYYY\WW\DataSource\SubjectArea\FileData_YYYY_MM_DD.parquet
E. WW\YYYY\SubjectArea\DataSource\FileData_YYYY_MM_DD.parquet
chat GPT: Based on the given usage patterns and requirements, the recommended folder structure would be option B:
\DataSource\SubjectArea\YYYY-WW\FileData_YYYY_MM_DD.parquet
This structure allows for easy filtering of data by year and week, which aligns with the identified usage pattern of most queries filtering by the current year or week. It also organizes the data by data source and subject area, which simplifies folder security. By using a flat structure, with the data files directly under the year-week folder, query times can be minimized as the data is organized for efficient partition pruning.
Option A is similar but includes an additional level of hierarchy for the year, which is unnecessary given the requirement to filter by year-week. Options C, D, and E do not follow a consistent hierarchy, making it difficult to navigate and locate specific data files.
You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a table named table1.
You load 5 TB of data into table1.
You need to ensure that columnstore compression is maximized for table1.
Which statement should you execute?
A. DBCC INDEXDEFRAG (pool1, table1)
B. DBCC DBREINDEX (table1)
C. ALTER INDEX ALL on table1 REORGANIZE
D. ALTER INDEX ALL on table1 REBUILD
D. ALTER INDEX ALL on table1 REBUILD
This statement will rebuild all indexes on table1, which can help to maximize columnstore compression. The other options are not appropriate for this task.
DBCC INDEXDEFRAG (pool1, table1) is for defragmenting the indexes and DBCC DBREINDEX (table1) is for recreating the indexes. ALTER INDEX ALL on table1 REORGANIZE is for reorganizing the indexes.
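For reference, the selected answer as a statement; the schema name is an assumption:
~~~
-- Rebuild all indexes on table1 to get optimal columnstore compression after the 5 TB load
ALTER INDEX ALL ON dbo.table1 REBUILD;
~~~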
You have an Azure Synapse Analytics dedicated SQL pool named pool1.
You plan to implement a star schema in pool and create a new table named DimCustomer by using the following code.
CREATE TABLE dbo. [DimCustomer]( [CustomerKey] int NOT NULL, [CustomerSourceID] [int] NOT NULL, [Title] [nvarchar](8) NULL, [FirstName] [nvarchar](50) NOT NULL, [MiddleName] [nvarchar](50) NULL, [LastName] [nvarchar](50) NOT NULL, [Suffix] [nvarchar](10) NULL, [CompanyName] [nvarchar](128) NULL, [SalesPerson] [nvarchar](256) NULL, [EmailAddress] [nvarchar](50) NULL, [Phone] [nvarchar](25) NULL, [InsertedDate] [datetime] NOT NULL, [ModifiedDate] [datetime] NOT NULL, [HashKey] [varchar](100) NOT NULL, [IsCurrentRow] [bit] NOT NULL ) WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX ); GO
You need to ensure that DimCustomer has the necessary columns to support a Type 2 slowly changing dimension (SCD).
Which two columns should you add? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [HistoricalSalesPerson] [nvarchar] (256) NOT NULL
B. [EffectiveEndDate] [datetime] NOT NULL
C. [PreviousModifiedDate] [datetime] NOT NULL
D. [RowID] [bigint] NOT NULL
E. [EffectiveStartDate] [datetime] NOT NULL
DEBATED
Selected Answer: BE
The insert date and the effective date range are two different things. You can insert data now, but with either future validity or past validity (when correcting errors, for example).
So the options are B and E.
Hard
HOTSPOT
You have an Azure subscription that contains an Azure Synapse Analytics dedicated SQL pool.
You plan to deploy a solution that will analyze sales data and include the following:
- A table named Country that will contain 195 rows
- A table named Sales that will contain 100 million rows
- A query to identify total sales by country and customer from the past 30 days
You need to create the tables. The solution must maximize query performance.
How should you complete the script? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
see site for context
DISTRIBUTION =
* HASH([Customerld])
* HASH([OrderDate])
* REPLICATE
* ROUND_ROBIN
DISTRIBUTION =
* HASH([CountryCode])
* HASH([Countryld])
* REPLICATE
* ROUND_ROBIN
1. Hash(CustomerID)
2. Replicate
It is hash because it is a fact table (you can tell because there is a "total" column being created, which is numerical). Rule of thumb: never hash on a date field, so in this case you would hash on CustomerId. You want the hash column to have as many unique values as possible.
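A sketch of both tables with the selected distributions; the column lists are hypothetical:
~~~
CREATE TABLE dbo.Sales
(
    CustomerId INT   NOT NULL,
    CountryId  INT   NOT NULL,
    OrderDate  DATE  NOT NULL,
    Total      MONEY NOT NULL
)
WITH (DISTRIBUTION = HASH([CustomerId]), CLUSTERED COLUMNSTORE INDEX);  -- 100-million-row fact table

CREATE TABLE dbo.Country
(
    CountryId   INT           NOT NULL,
    CountryCode CHAR(2)       NOT NULL,
    CountryName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX ([CountryId]));  -- 195 rows: copy to every Compute node
~~~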
You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named account1 and an Azure Synapse Analytics workspace named workspace1.
You need to create an external table in a serverless SQL pool in workspace1. The external table will reference CSV files stored in account1. The solution must maximize performance.
How should you configure the external table?
A. Use a native external table and authenticate by using a shared access signature (SAS).
B. Use a native external table and authenticate by using a storage account key.
C. Use an Apache Hadoop external table and authenticate by using a shared access signature (SAS).
D. Use an Apache Hadoop external table and authenticate by using a service principal in Microsoft Azure Active Directory (Azure AD), part of Microsoft Entra.
Correct Answer: A 🗳️
Correct. Serverless SQL pools cannot use Hadoop external tables, only native ones. Storage account key authentication is never a best practice, which leaves A as the only viable answer.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
The other options provided (B, C, and D) are not the recommended configurations for maximizing performance in this scenario. Using a storage account key for authentication (option B) poses a security risk and should be avoided. Apache Hadoop external tables (options C and D) do not provide the same level of performance optimization as native external tables in Azure Synapse Analytics.
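A minimal sketch of the native external data source with SAS authentication; the names and token are placeholders, and a database master key is assumed to exist:
~~~
CREATE DATABASE SCOPED CREDENTIAL sas_credential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token-without-the-leading-question-mark>';

-- Omitting TYPE = HADOOP makes this a native external data source
CREATE EXTERNAL DATA SOURCE account1_source
WITH (
    LOCATION = 'https://account1.dfs.core.windows.net/<container>',
    CREDENTIAL = sas_credential
);
~~~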
HOTSPOT
You have an Azure Synapse Analytics serverless SQL pool that contains a database named db1. The data model for db1 is shown in the following exhibit.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the exhibit.
NOTE: Each correct selection is worth one point.
To convert the data model to a star schema, [answer choice].
* join DimGeography and DimCustomer
* join DimGeography and FactOrders
* union DimGeography and DimCustomer
* union DimGeography and FactOrders
Once the data model is converted into a star schema, there will be [answer choice] tables.
* 4
* 5
* 6
* 7
Correct answer should be join DimGeography and DimCustomer and 5 tables.
You also need to combine ProductLine and Product in order for the schema to be considered a star schema. This would result in 5 remaining tables: DimCustomer (DimCustomer JOIN DimGeography), DimStore, Date, Product (Product JOIN ProductLine) and FactOrders.
You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1.
New files are uploaded daily to storage1.
You need to recommend a solution that configures storage1 as a structured streaming source. The solution must meet the following requirements:
- Incrementally process new files as they are uploaded to storage1.
- Minimize implementation and maintenance effort.
- Minimize the cost of processing millions of files.
- Support schema inference and schema drift.
Which should you include in the recommendation?
A. COPY INTO
B. Azure Data Factory
C. Auto Loader
D. Apache Spark FileStreamSource
Correct Answer: C 🗳️
Bing explains the following:
The best option is C. Auto Loader.
Auto Loader is a feature in Azure Databricks that uses a cloudFiles data source to incrementally and efficiently process new data files as they arrive in Azure Data Lake Storage Gen2. It supports schema inference and schema evolution (drift). It also minimizes implementation and maintenance effort, as it simplifies the ETL pipeline by reducing the complexity of identifying new files for processing.
Other options do not meet the requirements because:
A. COPY INTO: does not incrementally process new files as they are uploaded, which is one of your requirements.
B. Azure Data Factory: does not natively support schema inference and schema drift. The incremental processing of new files would need to be manually implemented, which could increase implementation and maintenance effort.
D. Apache Spark FileStreamSource: requires manual setup and does not natively support schema inference or schema drift. It also may not minimize the cost of processing millions of files as efficiently as Auto Loader.
HARD
You have an Azure subscription that contains the resources shown in the following table.
| Name | Type | Description |
|------|------|-------------|
| storage1 | Azure Blob storage account | Contains publicly accessible TSV files that do NOT have a header row |
| WS1 | Azure Synapse Analytics workspace | Contains a serverless SQL pool |
You need to read the TSV files by using ad-hoc queries and the OPENROWSET function. The solution must assign a name and override the inferred data type of each column.
What should you include in the OPENROWSET function?
A. the WITH clause
B. the ROWSET_OPTIONS bulk option
C. the DATAFILETYPE bulk option
D. the DATA_SOURCE parameter
To read TSV files without a header row by using the OPENROWSET function, and to assign a name and specify the data type for each column, you should use:
A. the WITH clause
The WITH clause is used in the OPENROWSET function to explicitly define the structure of the file by specifying the column names and data types.
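A minimal sketch of such an ad-hoc query; the file path and the column names and types in the WITH clause are hypothetical:
~~~
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://storage1.blob.core.windows.net/<container>/*.tsv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIELDTERMINATOR = '\t',
        FIRSTROW = 1                     -- no header row, so data starts on the first row
     )
     WITH (
        CustomerId   INT,                -- the WITH clause names the columns and overrides the inferred types
        CustomerName VARCHAR(100),
        OrderTotal   DECIMAL(10, 2)
     ) AS [rows];
~~~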
You have an Azure Synapse Analytics dedicated SQL pool.
You plan to create a fact table named Table1 that will contain a clustered columnstore index.
You need to optimize data compression and query performance for Table1.
What is the minimum number of rows that Table1 should contain before you create partitions?
A. 100,000
B. 600,000
C. 1 million
D. 60 million
A clustered columnstore table in a dedicated SQL pool is always spread across 60 distributions, and for the best compression we need at least 1 million rows per distribution in each partition, hence option D: 60 million (1 million x 60 distributions).
You have an Azure Synapse Analytics dedicated SQL pool that contains a table named DimSalesPerson. DimSalesPerson contains the following columns:
- RepSourceID
- SalesRepID
- FirstName
- LastName
- StartDate
- EndDate
- Region
You are developing an Azure Synapse Analytics pipeline that includes a mapping data flow named Dataflow1. Dataflow1 will read sales team data from an external source and use a Type 2 slowly changing dimension (SCD) when loading the data into DimSalesPerson.
You need to update the last name of a salesperson in DimSalesPerson.
Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Update three columns of an existing row.
B. Update two columns of an existing row.
C. Insert an extra row.
D. Update one column of an existing row.
DISPUTED
Correct Answer: CD 🗳️
Selected Answer: BC
1) Insert an extra row with the updated last name and the current date as the StartDate.
2) Update two columns of an existing row: set the EndDate of the previous row for that salesperson to the current date and mark that row as no longer current (for example, by updating an active/current flag column); see the sketch below.
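A minimal T-SQL sketch of the two steps, assuming DimSalesPerson also carries an IsCurrent flag and that RepSourceID is the business key (the flag, the key choice, and all values below are hypothetical; in the scenario the work is done by the mapping data flow):
~~~
-- 1) Close off the existing row: two columns are updated.
UPDATE dbo.DimSalesPerson
SET    EndDate   = GETDATE(),
       IsCurrent = 0                        -- hypothetical current-row flag
WHERE  RepSourceID = 123 AND IsCurrent = 1; -- placeholder business key value

-- 2) Insert an extra row that carries the new last name.
INSERT INTO dbo.DimSalesPerson
    (SalesRepID, RepSourceID, FirstName, LastName, StartDate, EndDate, Region, IsCurrent)
VALUES
    (9001, 123, 'Dana', 'NewLastName', GETDATE(), NULL, 'West', 1); -- placeholder values
~~~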
HOTSPOT
You plan to use an Azure Data Lake Storage Gen2 account to implement a Data Lake development environment that meets the following requirements:
- Read and write access to data must be maintained if an availability zone becomes unavailable.
- Data that was last modified more than two years ago must be deleted automatically.
- Costs must be minimized.
What should you configure? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
For storage redundancy:
* Geo-zone-redundant storage (GZRS)
* Locally-redundant storage (LRS)
* Zone-redundant storage (ZRS)
For data deletion:
* A lifecycle management policy
* Soft delete
* Versioning
Statement 1: For Storage redundancy, you should select ZRS (Zone-redundant storage). This will maintain read and write access to data even if an availability zone becomes unavailable.
Statement 2: For data deletion, you should select A lifecycle management policy. This will allow you to automatically delete data that was last modified more than two years ago
HOTSPOT
You are developing an Azure Synapse Analytics pipeline that will include a mapping data flow named Dataflow1. Dataflow1 will read customer data from an external source and use a Type 1 slowly changing dimension (SCD) when loading the data into a table named DimCustomer in an Azure Synapse Analytics dedicated SQL pool.
You need to ensure that Dataflow1 can perform the following tasks:
- Detect whether the data of a given customer has changed in the DimCustomer table.
- Perform an upsert to the DimCustomer table.
Which type of transformation should you use for each task? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Detect whether the data of a given customer has changed in the DimCustomer table:
* Aggregate
* Derived column
* Surrogate key
Perform an upsert to the DimCustomer table:
* Alter row
* Assert
* Cast
The answer is correct. Check “Exercise - Design and implement a Type 1 slowly changing dimension with mapping data flows”, which describes the implementation of the data flow mentioned in this question.
https://learn.microsoft.com/en-us/training/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/4-exercise-design-implement-type-1-dimension
In the exercise, the ‘Derived column’ transformation is used to add the InsertedDate and ModifiedDate columns; the ModifiedDate column can be used to detect whether the customer data has changed. For the upsert, the ‘Alter row’ transformation is used. The answer is definitely correct.
DRAG DROP
You have an Azure Synapse Analytics serverless SQL pool.
You have an Azure Data Lake Storage account named adls1 that contains a public container named container1. The container1 container contains a folder named folder1.
You need to query the top 100 rows of all the CSV files in folder1.
How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point
VALUES
* BULK
* DATA_SOURCE
* LOCATION
* OPENROWSET
SELECT TOP 100 * FROM XXXXXXXXX ( XXXXXXXXXXX 'https://adls1.dfs.core.windows.net/container1/folder1/*.csv', FORMAT = 'CSV') AS rows
OPENROWSET
BULK
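Filled in, the drag-drop answer reads:
~~~
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://adls1.dfs.core.windows.net/container1/folder1/*.csv',
    FORMAT = 'CSV'
) AS rows;
~~~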
You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.
Which format should you use for the tables in DB1?
A. Parquet
B. ORC
C. JSON
D. HIVE
Selected Answer: A
Parquet is supported by serverless SQL pool
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-parquet-files
You have an Azure Data Lake Storage Gen2 account named storage1.
You plan to implement query acceleration for storage1.
Which two file types support query acceleration? Each correct answer presents a complete solution.
NOTE: Each correct selection is worth one point.
A. JSON
B. Apache Parquet
C. XML
D. CSV
E. Avro
Selected Answer: AD
Query acceleration supports CSV and JSON formatted data as input to each request.
You have an Azure subscription that contains the resources shown in the following table.
| Name | Type | Description |
|------|------|-------------|
| storage1 | Azure Blob storage account | Contains publicly accessible JSON files |
| WS1 | Azure Synapse Analytics workspace | Contains a serverless SQL pool |
You need to read the files in storage1 by using ad-hoc queries and the OPENROWSET function. The solution must ensure that each rowset contains a single JSON record.
To what should you set the FORMAT option of the OPENROWSET function?
A. JSON
B. DELTA
C. PARQUET
D. CSV
For JSON files we use the CSV format: OPENROWSET returns each JSON document as a single text row (FORMAT = 'CSV' with 0x0b terminators), and the JSON functions parse it afterwards.
The chat says D: CSV.
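A hedged sketch of the pattern (the path and the JSON property are placeholders): the 0x0b terminators make OPENROWSET return each JSON document as one row in a single column, which JSON_VALUE can then pick apart.
~~~
SELECT JSON_VALUE(doc, '$.name') AS Name          -- '$.name' is a placeholder property
FROM OPENROWSET(
    BULK 'https://storage1.blob.core.windows.net/data/*.json',  -- placeholder path
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b',
    ROWTERMINATOR = '0x0b'
) WITH (doc NVARCHAR(MAX)) AS rows;
~~~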
HOTSPOT
You are designing an Azure Data Lake Storage Gen2 container to store data for the human resources (HR) department and the operations department at your company.
You have the following data access requirements:
- After initial processing, the HR department data will be retained for seven years and rarely accessed.
- The operations department data will be accessed frequently for the first six months, and then accessed once per month.
You need to design a data retention solution to meet the access requirements. The solution must minimize storage costs.
What should you include in the storage policy for each department? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
HR:
* Archive storage after one day and delete storage after 2,555 days.
* Archive storage after 2,555 days.
* Cool storage after one day.
* Cool storage after 180 days.
* Cool storage after 180 days and delete storage after 2,555 days.
* Delete after one day.
* Delete after 180 days.
Operations:
* Archive storage after one day and delete storage after 2,555 days.
* Archive storage after 2,555 days.
* Cool storage after one day.
* Cool storage after 180 days.
* Cool storage after 180 days and delete storage after 2,555 days.
* Delete after one day.
* Delete after 180 days.
Archive storage after one day and delete storage after 2,555 days.
Cool storage after 180 days.
The answer for HR depends on the meaning of “rarely” and the duration of “initial processing”. If “rarely” means roughly once a year and initial processing completes within 24 hours, the answer is correct. If “rarely” means weekly access, the archive tier might be the wrong choice.
You have an Azure subscription that contains the Azure Synapse Analytics workspaces shown in the following table.
| Name | Primary storage account |
|------------|-------------------------|
| workspace1 | datalake1 |
| workspace2 | datalake2 |
| workspace3 | datalake1 |
Each workspace must read and write data to datalake1.
Each workspace contains an unused Apache Spark pool.
You plan to configure each Spark pool to share catalog objects that reference datalake1.
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
* The shared catalog objects can be stored in Azure Database for MySQL. YES/NO
* For the Apache Hive Metastore of each workspace, you must configure a linked service that uses user-password authentication. YES/NO
* The users of workspace1 must be assigned the Storage Blob Contributor role for datalake1. YES/NO
The chat is too confusing; using the GPT answer:
No, No, Yes
- The shared catalog objects can be stored in Azure Database for MySQL.
  Azure Synapse Analytics doesn’t support storing shared catalog objects in Azure Database for MySQL. Instead, it uses an Apache Hive Metastore as a common catalog for multiple workspaces.
- For the Apache Hive Metastore of each workspace, you must configure a linked service that uses user-password authentication.
  Azure Synapse Analytics doesn’t require user-password authentication for the Apache Hive Metastore. It typically relies on service principals or managed identities for authentication.
- The users of workspace1 must be assigned the Storage Blob Contributor role for datalake1.
  For workspace1 to read and write data to datalake1, users in workspace1 need adequate permissions. The Storage Blob Contributor role provides the necessary permissions to read and write data to Azure Data Lake Storage Gen2 (datalake1 in this case), so assigning it to the users of workspace1 would be appropriate.
CHAT SAYS
Provided answers are correct:
1. Yes:
Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore) compatible metastore as their catalog. When customers want to persist the Hive catalog metadata outside of the workspace, and share catalog objects with other computational engines outside of the workspace, such as HDInsight and Azure Databricks, they can connect to an external Hive Metastore. Only Azure SQL Database and Azure Database for MySQL are supported as an external Hive Metastore.
2. Yes:
And currently we only support User-Password authentication.
3. No:
“And currently we only support User-Password authentication.” Storage Blob Contributor is an Azure RBAC (role-based access control) role, which is not what the external metastore connection uses, since only user-password authentication is supported.
ref.
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-external-metastore
You have a data warehouse.
You need to implement a slowly changing dimension (SCD) named Product that will include three columns named ProductName, ProductColor, and ProductSize. The solution must meet the following requirements:
- Prevent changes to the values stored in ProductName.
- Retain only the current and the last values in ProductSize.
- Retain all the current and previous values in ProductColor.
Which type of SCD should you implement for each column? To answer, drag the appropriate types to the correct columns. Each type may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Values
* Type 0
* Type 1
* Type 2
* Type 3
ProductName: XXXXXXXX
Color: XXXXXXXX
Size: XXXXXXXX
DISPUTED: go with this
ProductName - Type 0, as no changes are allowed.
ProductColor - Type 2, as Type 2 inserts a new row for every change, so all current and previous values are retained.
ProductSize - Type 3, as Type 3 keeps one column for the current value and one for the previous value, so only those two are retained.
HOTSPOT
You are incrementally loading data into fact tables in an Azure Synapse Analytics dedicated SQL pool.
Each batch of incoming data is staged before being loaded into the fact tables.
You need to ensure that the incoming data is staged as quickly as possible.
How should you configure the staging tables? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Table distribution:
* HASH
* REPLICATE
* ROUND_ROBIN
Table structure:
* Clustered index
* Columnstore index
* Heap
The ROUND_ROBIN distribution spreads the data evenly across all 60 distributions in the SQL pool. This distribution type is suitable for loading data quickly into staging tables because it minimizes data movement during the loading process.
Use a HEAP table: Instead of creating a clustered index on the staging table, it is recommended to create a HEAP table. A HEAP table does not have a clustered index, which eliminates the need for maintaining the index and improves the data loading performance. It allows for faster insert operations.
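A minimal sketch of such a staging table (table and column names are placeholders):
~~~
CREATE TABLE dbo.StageSales
(
    SaleID   INT   NOT NULL,
    SaleDate DATE  NOT NULL,
    Amount   MONEY NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
~~~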
You have an Azure subscription that contains an Azure Synapse Analytics workspace named ws1 and an Azure Cosmos DB database account named Cosmos1. Cosmos1 contains a container named container1 and ws1 contains a serverless SQL pool.
You need to ensure that you can query the data in container1 by using the serverless SQL pool.
Which three actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Enable Azure Synapse Link for Cosmos1.
B. Disable the analytical store for container1.
C. In ws1, create a linked service that references Cosmos1.
D. Enable the analytical store for container1.
E. Disable indexing for container1.
Correct Answer: ACD 🗳️
A. Enable Azure Synapse Link for Cosmos1.
C. In ws1, create a linked service that references Cosmos1.
D. Enable the analytical store for container1.
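Once those three steps are done, the serverless SQL pool can query the analytical store roughly like this (the database name and key are placeholders; a server credential can be used instead of an inline key):
~~~
SELECT TOP 10 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=cosmos1;Database=mydatabase;Key=<account-key>',  -- placeholders
    container1
) AS documents;
~~~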
HOTSPOT
You have an Azure subscription that contains the resources shown in the following table.
| Name | Type | Description |
|------|------|-------------|
| Workspace1 | Azure Synapse Analytics workspace | Contains the Built-in serverless SQL pool |
| Pool1 | Azure Synapse Analytics dedicated SQL pool | Deployed to Workspace1 |
| storage1 | Storage account | Hierarchical namespace enabled |
The storage1 account contains a container named container1. The container1 container contains the following files.
Webdata <root folder>
    Monthly <folder>
    _monthly.csv
    Monthly.csv
    .testdata.csv
    testdata.csv
In Pool1, you run the following script.
CREATE EXTERNAL DATA SOURCE Ds1 WITH ( LOCATION = 'abfss://container1@storage1.dfs.core.windows.net' , CREDENTIAL = credential1, TYPE = HADOOP );
In the Built-in serverless SQL pool, you run the following script.
CREATE EXTERNAL DATA SOURCE Ds2 WITH ( LOCATION = 'https://storage1.blob.core.windows.net/container1/Webdata/', CREDENTIAL = credential2 );
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
An external table that uses Ds1 can read the_monthly.csv file. Yes/No
An external table that uses Ds1 can read the Monthly.csv file. Yes/No
An external table that uses Ds2 can read the .testdata.csv file. Yes/No
No,
Yes,
No
External tables ignore files whose names begin with an underscore (_) or a period (.), so _monthly.csv and .testdata.csv cannot be read.
You have an Azure subscription that contains an Azure Data Lake Storage Gen2 account named account1 and a user named User1.
In account1, you create a container named container1. In container1, you create a folder named folder1.
You need to ensure that User1 can list and read all the files in folder1. The solution must use the principle of least privilege.
How should you configure the permissions for each folder? To answer, drag the appropriate permissions to the correct folders. Each permission may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Values
* Execute
* Read
* Read and Write
* None
* Read and Execute
* Write
container1/: XXXXXXXXXXXXX
container1/folder1: XXXXXXXXXX
Box 1: Execute
Box 2: Read & Execute
Execute on container1 allows User1 to traverse into the folder, while Read and Execute on folder1 allow listing the folder and reading its contents.
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control#levels-of-permission
You have an Azure Data Factory pipeline named pipeline1.
You need to execute pipeline1 at 2 AM every day. The solution must ensure that if the trigger for pipeline1 stops, the next pipeline execution will occur at 2 AM, following a restart of the trigger.
Which type of trigger should you create?
A. schedule
B. tumbling
C. storage event
D. custom event
GPT Says
The scenario requires a trigger that ensures the execution of the pipeline at a specific time daily and has the capability to resume or continue executions even after a trigger stoppage or restart. For this purpose:
A. Schedule
A schedule trigger in Azure Data Factory allows you to specify a recurring execution time, such as running a pipeline at 2 AM every day. Additionally, if the trigger stops due to any reason, it will resume its schedule, and the next execution will occur at the defined time (in this case, 2 AM) following a trigger restart.
Some comments claim it might be tumbling.
HOTSPOT
You have an Azure data factory named adf1 that contains a pipeline named ExecProduct. ExecProduct contains a data flow named Product.
The Product data flow contains the following transformations:
- WeeklyData: A source that points to a CSV file in an Azure Data Lake Storage Gen2 account with 20 columns
- ProductColumns: A select transformation that selects from WeeklyData six columns named ProductID, ProductDescr, ProductSubCategory, ProductCategory, ProductStatus, and ProductLastUpdated
- ProductRows: An aggregate transformation
- ProductList: A sink that outputs data to an Azure Synapse Analytics dedicated SQL pool
The Aggregate settings for ProductRows are configured as shown in the following exhibit.
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
There will be six columns in the output of ProductRows. YES/NO
There will always be one output row for each unique value of ProductDescr. YES/NO
There will always be one output row for each unique value of ProductID. YES/NO
GO WITH THE FOLLOWING (the discussion does not explain why):
YES
NO
YES
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-aggregate
You manage an enterprise data warehouse in Azure Synapse Analytics.
Users report slow performance when they run commonly used queries. Users do not report performance changes for infrequently used queries.
You need to monitor resource utilization to determine the source of the performance issues.
Which metric should you monitor?
A. DWU limit
B. Cache hit percentage
C. Local tempdb percentage
D. Data IO percentage
B. Cache hit percentage should be correct, since the slowdown only affects commonly used queries, which should be served from the cache.
HOTSPOT
You have an Azure Synapse Analytics serverless SQL pool.
You have an Apache Parquet file that contains 10 columns.
You need to query data from the file. The solution must return only two columns.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Values
[BULK, DELTA, OPENQUERY, SINGLE_BLOB]
[(Col1 int, Col2 varchar(20)), FILEPATH(2), PARSER_VERSION = ‘2.0’, SINGLE_BLOB]
SELECT * FROM OPENROWSET ([XXXXXXXXXXXX]N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet', FORMAT = 'PARQUET') WITH [YYYYYYYYYYYY] as rows
BULK
(Col1 int, Col2 varchar(20))
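Filled in, the query would be as follows (for Parquet, the names in the WITH clause must match the column names in the file, so Col1 and Col2 stand in for the two real columns):
~~~
SELECT *
FROM OPENROWSET(
    BULK N'https://myaccount.dfs.core.windows.net/mycontainer/mysubfolder/data.parquet',
    FORMAT = 'PARQUET'
) WITH (Col1 int, Col2 varchar(20)) AS rows;
~~~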
Hard
You have an Azure Synapse Analytics workspace that contains an Apache Spark pool named SparkPool1. SparkPool1 contains a Delta Lake table named SparkTable1.
You need to recommend a solution that supports Transact-SQL queries against the data referenced by SparkTable1. The solution must ensure that the queries can use partition elimination.
What should you include in the recommendation?
A. a partitioned table in a dedicated SQL pool
B. a partitioned view in a dedicated SQL pool
C. a partitioned index in a dedicated SQL pool
D. a partitioned view in a serverless SQL pool
Selected Answer: D
D is correct.
“The OPENROWSET function is not supported in dedicated SQL pools in Azure Synapse.” so it eliminates A,B and C.
Ref: https://learn.microsoft.com/en-us/sql/t-sql/functions/openrowset-transact-sql?view=sql-server-ver16
Only the partitioned view in the serverless sql pool is correct since “External tables in serverless SQL pools do not support partitioning on Delta Lake format. Use Delta partitioned views instead of tables if you have partitioned Delta Lake data sets.”
Ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-external-tables#delta-tables-on-partitioned-folders
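A minimal sketch of such a partitioned view in the serverless SQL pool (the storage path and partition column are placeholders); filtering on a partition column lets the pool skip the non-matching folders:
~~~
CREATE OR ALTER VIEW dbo.SparkTable1View
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/<container>/path/to/SparkTable1/',
    FORMAT = 'DELTA'
) AS rows;

-- Example of a query that can benefit from partition elimination,
-- assuming the Delta table is partitioned by a column named year:
-- SELECT COUNT(*) FROM dbo.SparkTable1View WHERE year = 2023;
~~~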
HARD HARD
You are designing a sales transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will contain approximately 60 million rows per month and will be partitioned by month. The table will use a clustered column store index and round-robin distribution.
Approximately how many rows will there be for each combination of distribution and partition?
A. 1 million
B. 5 million
C. 20 million
D. 60 million
Correct Answer: A 🗳️
The table is partitioned by month, and each partition is spread across the 60 distributions of the dedicated SQL pool, so 60 million rows per month / 60 distributions = 1 million rows per distribution-partition combination.
You have an Azure Synapse Analytics workspace.
You plan to deploy a lake database by using a database template in Azure Synapse.
Which two elements are included in the template? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. relationships
B. data formats
C. linked services
D. table permissions
E. table definitions
Correct, AE. Only the table definitions and their relationships are included in the template; the rest of the options must be configured separately.
Ref: https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/create-lake-database-from-lake-database-templates
You are implementing a star schema in an Azure Synapse Analytics dedicated SQL pool.
You plan to create a table named DimProduct.
DimProduct must be a Type 3 slowly changing dimension (SCD) table that meets the following requirements:
- The values in two columns named ProductKey and ProductSourceID will remain the same.
- The values in three columns named ProductName, ProductDescription, and Color can change.
You need to add additional columns to complete the following table definition.
CREATE TABLE [dbo].[dimproduct]( [ProductKey] INT NOT NULL, [ProductSourceID] INT NOT NULL, [ProductName] NVARCHAR(100) NOT NULL, [ProductDescription] NVARCHAR(2000) NOT NULL, [Color] NVARCHAR(50) NOT NULL ) WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
Which three columns should you add? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveStartDate] [datetime] NOT NULL
B. [EffectiveEndDate] [datetime] NOT NULL
C. [OriginalProductDescription] NVARCHAR(2000) NOT NULL
D. [IsCurrentRow] [bit] NOT NULL
E. [OriginalColor] NVARCHAR(50) NOT NULL
F. [OriginalProductName] NVARCHAR(100) NULL
Selected Answer: CEF
Correct. The other three options would be needed for an SCD Type 2 table.
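A sketch of the corresponding DDL change (in practice the columns might instead be included in a CTAS rebuild of the table; the options in the question declare two of them NOT NULL, which would also require defaults on a populated table):
~~~
ALTER TABLE [dbo].[dimproduct] ADD
    [OriginalProductName]        NVARCHAR(100)  NULL,
    [OriginalProductDescription] NVARCHAR(2000) NULL,
    [OriginalColor]              NVARCHAR(50)   NULL;
~~~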
HOTSPOT -
You plan to create a real-time monitoring app that alerts users when a device travels more than 200 meters away from a designated location.
You need to design an Azure Stream Analytics job to process the data for the planned app. The solution must minimize the amount of code developed and the number of technologies used.
What should you include in the Stream Analytics job? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Input type:
* Stream
* Reference
Function:
* Aggregate
* Geospatial
* Windowing
The input type for the Stream Analytics job should be Stream, as it will be processing real-time data from devices.
The function to include in the Stream Analytics job should be Geospatial, which allows you to perform calculations on geographic data and make spatial queries, such as determining the distance between two points. This is necessary to determine if a device has traveled more than 200 meters away from a designated location.
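A hedged sketch of such a job query using the built-in geospatial functions (input name, field names, and the designated coordinates are placeholders):
~~~
SELECT
    DeviceId,
    ST_DISTANCE(
        CreatePoint(Latitude, Longitude),    -- current device position
        CreatePoint(47.6062, -122.3321)      -- designated location (placeholder)
    ) AS DistanceInMeters
INTO AlertOutput
FROM DeviceStream TIMESTAMP BY EventTime
WHERE ST_DISTANCE(
          CreatePoint(Latitude, Longitude),
          CreatePoint(47.6062, -122.3321)
      ) > 200;
~~~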
HARD
A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an Azure Stream
Analytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU).
You need to optimize performance for the Azure Stream Analytics job.
Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Implement event ordering.
B. Implement Azure Stream Analytics user-defined functions (UDF).
C. Implement query parallelization by partitioning the data output.
D. Scale the SU count for the job up.
E. Scale the SU count for the job down.
F. Implement query parallelization by partitioning the data input.
Disputed 50/50
Correct Answer: DF 🗳️
D: Scale out the query by allowing the system to process each input partition separately.
F: A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
HOWEVER
The discussion's most-liked answer (61 likes) is CF, so go with this:
C: Implement query parallelization by partitioning the data output.
F: Implement query parallelization by partitioning the data input.
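For example, a fully parallel query reads and writes per Event Hub partition (names are placeholders; PartitionId is the built-in partition column for Event Hub inputs):
~~~
SELECT PartitionId, COUNT(*) AS EventCount
INTO PartitionedOutput
FROM EventHubInput PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1);
~~~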
You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container.
Which resource provider should you enable?
A. Microsoft.Sql
B. Microsoft.Automation
C. Microsoft.EventGrid
D. Microsoft.EventHub
Correct Answer: C 🗳️
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the arrival or deletion of a file in Azure
Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers
You plan to perform batch processing in Azure Databricks once daily.
Which type of Databricks cluster should you use?
A. High Concurrency
B. automated
C. interactive
Correct Answer: B 🗳️
Automated Databricks clusters are the best for jobs and automated batch processing.
Note: Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.
Example: Scheduled batch workloads (data engineers running ETL jobs)
This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform.
The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so on) due to an existing workload (noisy neighbor) on a shared cluster.
Reference:
https://docs.microsoft.com/en-us/azure/databricks/clusters/create https://docs.databricks.com/administration-guide/cloud-configurations/aws/cmbp.html#scenario-3-scheduled-batch-workloads-data-engineers-running-etl-jobs
HOTSPOT -
You are processing streaming data from vehicles that pass through a toll booth.
You need to use Azure Stream Analytics to return the license plate, vehicle make, and hour the last vehicle passed during each 10-minute window.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
WITH LastInWindow AS ( SELECT <COUNT / MAX / MIN / TOPONE>(Time) AS LastEventTime FROM Input TIMESTAMP BY Time GROUP BY <HoppingWindow / SessionWindow / SlidingWindow / TumblingWindow>(minute, 10) ) SELECT Input.License_plate, Input.Make, Input.Time FROM Input TIMESTAMP BY Time INNER JOIN LastInWindow ON <DATEADD / DATEDIFF / DATENAME / DATEPART>(minute, Input, LastInWindow) BETWEEN 0 AND 10 AND Input.Time = LastInWindow.LastEventTime
100% correct
Box 1: MAX -
The first step on the query finds the maximum time stamp in 10-minute windows, that is the time stamp of the last event for that window. The second step joins the results of the first query with the original stream to find the event that match the last time stamps in each window.
Box 2: TumblingWindow -
Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.
Box 3: DATEDIFF -
DATEDIFF is a date-specific function that compares and returns the time difference between two DateTime fields, for more information, refer to date functions.
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
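With the three answers filled in, the completed query reads:
~~~
WITH LastInWindow AS
(
    SELECT MAX(Time) AS LastEventTime
    FROM Input TIMESTAMP BY Time
    GROUP BY TumblingWindow(minute, 10)
)
SELECT
    Input.License_plate,
    Input.Make,
    Input.Time
FROM Input TIMESTAMP BY Time
INNER JOIN LastInWindow
    ON DATEDIFF(minute, Input, LastInWindow) BETWEEN 0 AND 10
   AND Input.Time = LastInWindow.LastEventTime;
~~~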
HARD
You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2.
Pipeline1 has the activities shown in the following exhibit.
~~~
 _________________________        _________________________
| Stored procedure        |      | Set variable            |
|-------------------------| ---> |-------------------------|
| Stored procedure1       |      | (x) Set variable1       |
|_________________________|      |_________________________|
~~~
Pipeline2 has the activities shown in the following exhibit.
~~~
 _________________________        _________________________
| Execute pipeline        |      | Set variable            |
|-------------------------| ---> |-------------------------|
| Execute pipeline1       |      | (x) Set variable1       |
|_________________________|      |_________________________|
~~~
You execute Pipeline2, and Stored procedure1 in Pipeline1 fails.
What is the status of the pipeline runs?
A. Pipeline1 and Pipeline2 succeeded.
B. Pipeline1 and Pipeline2 failed.
C. Pipeline1 succeeded and Pipeline2 failed.
D. Pipeline1 failed and Pipeline2 succeeded.
Correct Answer: A 🗳️
Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed.
Consider Pipeline1:
If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.
The failure dependency means this pipeline reports success.
Note:
If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.
Reference:
https://datasavvy.me/category/azure-data-factory/
HOTSPOT -
A company plans to use Platform-as-a-Service (PaaS) to create the new data pipeline process. The process must meet the following requirements:
Ingest:
✑ Access multiple data sources.
✑ Provide the ability to orchestrate workflow.
✑ Provide the capability to run SQL Server Integration Services packages.
Store:
✑ Optimize storage for big data workloads.
✑ Provide encryption of data at rest.
✑ Operate with no size limits.
Prepare and Train:
✑ Provide a fully-managed and interactive workspace for exploration and visualization.
✑ Provide the ability to program in R, SQL, Python, Scala, and Java.
Provide seamless user authentication with Azure Active Directory.
Model & Serve:
✑ Implement native columnar storage.
✑ Support for the SQL language
✑ Provide support for structured streaming.
You need to build the data integration pipeline.
Which technologies should you use? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Ingest
- Logic Apps
- Azure Data Factory
- Azure Automation
Store
- Azure Data Lake Storage
- Azure Blob storage
- Azure files
Prepare and Train
- HDInsight Apache Spark cluster
- Azure Databricks
- HDInsight Apache Storm cluster
Model and Serve
- HDInsight Apache Kafka cluster
- Azure Synapse Analytics
- Azure Data Lake Storage
Ingest: Azure Data Factory -
Azure Data Factory pipelines can execute SSIS packages.
In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow, and data movement: Azure Data Factory, Oozie on HDInsight, and SQL Server Integration Services (SSIS).
Store: Data Lake Storage -
Data Lake Storage Gen1 provides unlimited storage.
Note: Data at rest includes information that resides in persistent storage on physical media, in any digital format. Microsoft Azure offers a variety of data storage solutions to meet different needs, including file, disk, blob, and table storage. Microsoft also provides encryption to protect Azure SQL Database, Azure Cosmos
DB, and Azure Data Lake.
Prepare and Train: Azure Databricks
Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration.
With Azure Databricks, you can set up your Apache Spark environment in minutes, autoscale and collaborate on shared projects in an interactive workspace.
Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and scikit-learn.
Model and Serve: Azure Synapse Analytics
Azure Synapse Analytics/ SQL Data Warehouse stores data into relational tables with columnar storage.
Azure SQL Data Warehouse connector now offers efficient and scalable structured streaming write support for SQL Data Warehouse. Access SQL Data
Warehouse from Azure Databricks using the SQL Data Warehouse connector.
Note: As of November 2019, Azure SQL Data Warehouse is now Azure Synapse Analytics.
Reference:
https://docs.microsoft.com/bs-latn-ba/azure/architecture/data-guide/technology-choices/pipeline-orchestration-data-movement https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
DRAG DROP -
You have the following table named Employees.
You need to calculate the employee_type value based on the hire_date value.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* CASE
* ELSE
* OVER
* PARTITION BY
* ROW_NUMBER
SELECT *, XXXXXXXXXX WHEN hire_date >= '2019-01-01' THEN 'New' XXXXXXXXXX 'Standard' END AS employee_type FROM employees
Box 1: CASE -
CASE evaluates a list of conditions and returns one of multiple possible result expressions.
CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT, UPDATE,
DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING.
Syntax: Simple CASE expression:
CASE input_expression -
WHEN when_expression THEN result_expression [ …n ]
[ ELSE else_result_expression ]
END -
Box 2: ELSE -
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/language-elements/case-transact-sql
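Putting the two answers into the statement gives:
~~~
SELECT *,
       CASE
           WHEN hire_date >= '2019-01-01' THEN 'New'
           ELSE 'Standard'
       END AS employee_type
FROM employees;
~~~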
HARD HARD
DRAG DROP -
You have an Azure Synapse Analytics workspace named WS1.
You have an Azure Data Lake Storage Gen2 container that contains JSON-formatted files in the following format.
{ "id": "66532691-ab20-11ea-8b1d-936b3ec64e54", "context": { "data": { "eventTime": "2020-06-10T13:43:34.5532", "samplingRate": "100.0", "isSynthetic": "false" }, "session": { "isFirst": "false", "id": "38619c14-7a23-4687-8268-95862c5326b1" }, "custom": { "dimensions": [ { "customerInfo": { "ProfileType": "ExpertUser", "RoomName": "", "CustomerName": "diamond", "UserName": "XXXX@yahoo.com" } }, { "customerInfo": { "ProfileType": "Novice", "RoomName": "", "CustomerName": "topaz", "UserName": "xxxx@outlook.com" } } ] } } }
You need to use the serverless SQL pool in WS1 to read the files.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* opendatasource
* openjson
* openquery
* openrowset
SELECT * FROM [XXXXXXXXXX]( BULK 'https://contoso.blob.core.windows.net/contosodw', FORMAT = 'CSV', FIELDTERMINATOR = '0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b' ) AS q WITH ( id VARCHAR(50), contextdateventTime VARCHAR(50) '$.context.data.eventTime', contextdatasamplingRate VARCHAR(50) '$.context.data.samplingRate', contextdataisSynthetic VARCHAR(50) '$.context.data.isSynthetic', contextsessionisFirst VARCHAR(50) '$.context.session.isFirst', contextsession VARCHAR(50) '$.context.session.id', contextcustomdimensions VARCHAR(MAX) '$.context.custom.dimensions' ) AS q1 CROSS APPLY [XXXXXXXXXX] (contextcustomdimensions) WITH ( ProfileType VARCHAR(50) '$.customerInfo.ProfileType', RoomName VARCHAR(50) '$.customerInfo.RoomName', CustomerName VARCHAR(50) '$.customerInfo.CustomerName', UserName VARCHAR(50) '$.customerInfo.UserName' ) AS q2;
Box 1: openrowset -
The easiest way to see to the content of your CSV file is to provide file URL to OPENROWSET function, specify csv FORMAT.
Box 2: openjson -
You can access your JSON files from the Azure File Storage share by using the mapped drive, as shown in the following example:
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file https://docs.microsoft.com/en-us/sql/relational-databases/json/import-json-documents-into-sql-server
DRAG DROP -
You have an Apache Spark DataFrame named temperatures. A sample of the data is shown in the following table.
(see the original exhibit for the sample temperature data)
You need to produce the following table by using a Spark SQL query.
How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all.
You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
- CAST
- COLLATE
- CONVERT
- FLATTEN
- PIVOT
- UNPIVOT
SELECT * FROM ( SELECT YEAR (Date) Year, MONTH (Date) Month, Temp FROM temperatures WHERE date BETWEEN DATE '2019-01-01' AND DATE '2021-08-31' ) [XXXXXXXXXX] ( AVG ( [XXXXXXXXXX] (Temp AS DECIMAL(4, 1))) FOR Month in ( 1 JAN, 2 FEB, 3 MAR, 4 APR, 5 MAY, 6 JUN, 7 JUL, 8 AUG, 9 SEP, 10 OCT, 11 NOV, 12 DEC ) ) ORDER BY Year ASC
Box 1: PIVOT -
PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output. And PIVOT runs aggregations where they’re required on any remaining column values that are wanted in the final output.
Incorrect Answers:
UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.
Box 2: CAST -
If you want to convert an integer value to a DECIMAL data type in SQL Server use the CAST() function.
Example:
SELECT -
CAST(12 AS DECIMAL(7,2) ) AS decimal_value;
Here is the result:
decimal_value
12.00
Reference:
https://learnsql.com/cookbook/how-to-convert-an-integer-to-a-decimal-in-sql-server/ https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot
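With both boxes filled in, the Spark SQL query from the question reads:
~~~
SELECT * FROM (
    SELECT YEAR(Date) Year, MONTH(Date) Month, Temp
    FROM temperatures
    WHERE Date BETWEEN DATE '2019-01-01' AND DATE '2021-08-31'
)
PIVOT (
    AVG(CAST(Temp AS DECIMAL(4, 1)))
    FOR Month IN (1 JAN, 2 FEB, 3 MAR, 4 APR, 5 MAY, 6 JUN,
                  7 JUL, 8 AUG, 9 SEP, 10 OCT, 11 NOV, 12 DEC)
)
ORDER BY Year ASC;
~~~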
You have an Azure Data Factory that contains 10 pipelines.
You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering when using the monitoring experience in Data Factory.
What should you add to each pipeline?
A. a resource tag
B. a correlation ID
C. a run group ID
D. an annotation
Correct Answer: D 🗳️
Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By adding annotations, you can easily filter and search for specific factory resources.
Reference:
https://www.cathrinewilhelmsen.net/annotations-user-properties-azure-data-factory/
HOTSPOT -
The following code segment is used to create an Azure Databricks cluster.
{ "num_workers": null, "autoscale": { "min workers": 2, "max_workers": 8 }, "cluster_name": "MyCluster", "spark_version": "latest-stable-scala2.11", "spark_conf": { "spark.databricks.cluster.profile": "serverless", "spark.databricks.repl.allowedLanguages": "sql,python,r" }, "node_type_id": "Standard_DS13_v2", "ssh public_keys": [], "custom_tags": { "ResourceClass": "Serverless" }, "spark_env_vars": { "PYSPARK_PYTHON": "/databricks/python3/bin/python3" }, "autotermination_minutes": 90, "enable_elastic_disk": true, "init_scripts": [] }
For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.
Hot Area:
The Databricks cluster supports multiple concurrent users. YES/NO
The Databricks cluster minimizes costs when running scheduled jobs that execute notebooks. YES/NO
The Databricks cluster supports the creation of a Delta Lake table. YES/NO
1. Yes
The spark_conf setting "spark.databricks.cluster.profile": "serverless" corresponds to the High Concurrency cluster mode (node type Standard_DS13_v2), and High Concurrency clusters support multiple concurrent users.
ref: https://adatis.co.uk/databricks-cluster-sizing/
2. NO
recommended: New Job Cluster.
When you run a job on a new cluster, the job is treated as a data engineering (job) workload subject to the job workload pricing. When you run a job on an existing cluster, the job is treated as a data analytics (all-purpose) workload subject to all-purpose workload pricing.
ref: https://docs.microsoft.com/en-us/azure/databricks/jobs
Scheduled batch workload- Launch new cluster via job
ref: https://docs.databricks.com/administration-guide/capacity-planning/cmbp.html#plan-capacity-and-control-cost
3.YES
Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns.
ref: https://docs.databricks.com/delta/index.html
You are designing a statistical analysis solution that will use custom proprietary Python functions on near real-time data from Azure Event Hubs.
You need to recommend which Azure service to use to perform the statistical analysis. The solution must minimize latency.
What should you recommend?
A. Azure Synapse Analytics
B. Azure Databricks
C. Azure Stream Analytics
D. Azure SQL Database
My answer will be B
Stream Analytics supports “extending SQL language with JavaScript and C# user-defined functions (UDFs)”. There is no mention of Python support; hence Stream Analytics is not correct.
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
Azure Databricks supports near real-time data from Azure Event Hubs. And includes support for R, SQL, Python, Scala, and Java. So I will go for option B.
HOTSPOT -
You have an enterprise data warehouse in Azure Synapse Analytics that contains a table named FactOnlineSales. The table contains data from the start of 2009 to the end of 2012.
You need to improve the performance of queries against FactOnlineSales by using table partitions. The solution must meet the following requirements:
✑ Create four partitions based on the order date.
✑ Ensure that each partition contains all the orders placed during a given calendar year.
How should you complete the T-SQL command? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
CREATE TABLE [dbo] . FactOnlineSales
([OnlineSalesKey] [int] NOT NULL,
[OrderDateKey] [datetime] NOT NULL,
[StoreKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar] (20) NOT NULL,
[SalesQuantity] [int] NOT NULL,
[SalesAmount] [money] NOT NULL,
[UnitPrice] [money] NULL)
WITH (CLUSTERED COLUMNSTORE INDEX)
PARTITION ([OrderDateKey] RANGE [RIGHT / LEFT] FOR VALUES
( [ 20090101,20121231 / 20100101,20110101,20120101 / 20090101,20100101,20110101,20120101 ])
RANGE LEFT and RANGE RIGHT both create four partitions here, but they differ in which side of each boundary value a partition includes.
For example, with RANGE LEFT and the values 20100101, 20110101, 20120101, the partitions are: datecol <= 20100101; 20100101 < datecol <= 20110101; 20110101 < datecol <= 20120101; datecol > 20120101.
With RANGE RIGHT and the same values, the partitions are: datecol < 20100101; 20100101 <= datecol < 20110101; 20110101 <= datecol < 20120101; datecol >= 20120101.
RANGE RIGHT is therefore the better fit for calendar years running from January 1st to December 31st.
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15
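A sketch of the completed DDL (the hotspot shows the PARTITION clause separately, but in dedicated SQL pool syntax it sits inside the WITH options); RANGE RIGHT with the three boundary values 20100101, 20110101, and 20120101 yields four partitions, one per calendar year from 2009 through 2012:
~~~
CREATE TABLE [dbo].[FactOnlineSales]
(
    [OnlineSalesKey]   [int]           NOT NULL,
    [OrderDateKey]     [datetime]      NOT NULL,
    [StoreKey]         [int]           NOT NULL,
    [ProductKey]       [int]           NOT NULL,
    [CustomerKey]      [int]           NOT NULL,
    [SalesOrderNumber] [nvarchar](20)  NOT NULL,
    [SalesQuantity]    [int]           NOT NULL,
    [SalesAmount]      [money]         NOT NULL,
    [UnitPrice]        [money]         NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([OrderDateKey] RANGE RIGHT FOR VALUES ('20100101', '20110101', '20120101'))
);
~~~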
You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.
Which two columns should you add to the table? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. [EffectiveStartDate] [datetime] NOT NULL,
B. [CurrentProductCategory] [nvarchar] (100) NOT NULL,
C. [EffectiveEndDate] [datetime] NULL,
D. [ProductCategory] [nvarchar] (100) NOT NULL,
E. [OriginalProductCategory] [nvarchar] (100) NOT NULL,
Correct Answer: BE 🗳️
A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.
This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members.
Reference:
https://k21academy.com/microsoft-azure/azure-data-engineer-dp203-q-a-day-2-live-session-review/
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a hopping window that uses a hop size of 10 seconds and a window size of 10 seconds.
Does this meet the goal?
A. Yes
B. No
The majority believe the answer should be A (Yes): a hopping window whose hop size equals its window size behaves exactly like a tumbling window.
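For reference, these two GROUP BY clauses produce the same non-overlapping 10-second windows (the input name is a placeholder):
~~~
-- Tumbling window
SELECT COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY TumblingWindow(second, 10);

-- Hopping window with hop size = window size behaves identically
SELECT COUNT(*) AS TweetCount
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY HoppingWindow(second, 10, 10);
~~~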
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You are designing an Azure Stream Analytics solution that will analyze Twitter data.
You need to count the tweets in each 10-second window. The solution must ensure that each tweet is counted only once.
Solution: You use a hopping window that uses a hop size of 5 seconds and a window size 10 seconds.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B 🗳️
Instead use a tumbling window. Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.
If the hop size were equal to the window size this would be true, but because the hop size is smaller, the windows overlap each other and each tweet can be counted more than once.
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
HOTSPOT -
You are building an Azure Stream Analytics job to identify how much time a user spends interacting with a feature on a webpage.
The job receives events based on user actions on the webpage. Each row of data represents an event. Each event has a type of either ‘start’ or ‘end’.
You need to calculate the duration between start and end events.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
SELECT [user], feature, [DATEADD ( / DATEDIFF ( / DATEPART ( ] second, [ISFIRST / LAST / TOPONE ] (Time) OVER (PARTITION BY [user], feature LIMIT DURATION (hour, 1) WHEN Event = 'start'), Time) as duration FROM input TIMESTAMP BY Time WHERE Event = 'end'
Box 1: DATEDIFF -
DATEDIFF function returns the count (as a signed integer value) of the specified datepart boundaries crossed between the specified startdate and enddate.
Syntax: DATEDIFF ( datepart , startdate, enddate )
Box 2: LAST -
The LAST function can be used to retrieve the last event within a specific condition. In this example, the condition is an event of type Start, partitioning the search by PARTITION BY user and feature. This way, every user and feature is treated independently when searching for the Start event. LIMIT DURATION limits the search back in time to 1 hour between the End and Start events.
Example:
SELECT -
[user],
feature,
DATEDIFF(
second,
LAST(Time) OVER (PARTITION BY [user], feature LIMIT DURATION(hour,
1) WHEN Event = ‘start’),
Time) as duration -
FROM input TIMESTAMP BY Time -
WHERE -
Event = ‘end’
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-stream-analytics-query-patterns
You are creating an Azure Data Factory data flow that will ingest data from a CSV file, cast columns to specified types of data, and insert the data into a table in an Azure Synapse Analytics dedicated SQL pool. The CSV file contains three columns named username, comment, and date.
The data flow already contains the following:
✑ A source transformation.
✑ A Derived Column transformation to set the appropriate types of data.
✑ A sink transformation to land the data in the pool.
You need to ensure that the data flow meets the following requirements:
✑ All valid rows must be written to the destination table.
✑ Truncation errors in the comment column must be avoided proactively.
✑ Any rows containing comment values that will cause truncation errors upon insert must be written to a file in blob storage.
Which two actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. To the data flow, add a sink transformation to write the rows to a file in blob storage.
B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.
C. To the data flow, add a filter transformation to filter out rows that will cause truncation errors.
D. Add a select transformation to select only the rows that will cause truncation errors.
Correct Answer: AB 🗳️
A. To the data flow, add a sink transformation to write the rows to a file in blob storage.
This action ensures that the rows causing truncation errors, identified by the Conditional Split, are written to a file in blob storage. This meets the requirement of storing rows that would otherwise cause truncation errors upon insertion.
B. To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.
The Conditional Split helps identify rows that may cause truncation errors based on specified conditions (in this case, the comment column). This separation allows handling these problematic rows separately.
B: Example:
1. This conditional split transformation defines the maximum length of “title” to be five. Any row that is less than or equal to five will go into the GoodRows stream.
Any row that is larger than five will go into the BadRows stream.
A: - Now we need to log the rows that failed. Add a sink transformation to the BadRows stream for logging. Here, we’ll “auto-map” all of the fields so that we have logging of the complete transaction record. This is a text-delimited CSV file output to a single file in Blob Storage. We’ll call the log file “badrows.csv”.
- The completed data flow is shown below. We are now able to split off error rows to avoid the SQL truncation errors and put those entries into a log file.
Meanwhile, successful rows can continue to write to our target database.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-data-flow-error-rows
DRAG DROP -
You need to create an Azure Data Factory pipeline to process data for the following three departments at your company: Ecommerce, retail, and wholesale. The solution must ensure that data can also be processed for the entire company.
How should you complete the Data Factory data flow script? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values:
* all, ecommerce, retail, wholesale
* dept == ‘ecommerce’, dept == ‘retail’, dept == ‘wholesale’
* dept == ‘ecommerce’, dept == ‘wholesale’, dept == ‘retail’
* disjoint: false
* disjoint: true
* ecommerce, retail, wholesale, all
CleanData split ( [XXXXXXXXXX] [XXXXXXXXXX] ) ~> SplitByDept@ ( [XXXXXXXXXX])
The conditional split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language. The transformation evaluates expressions, and based on the results, directs the data row to the specified stream.
Box 1: dept==’ecommerce’, dept==’retail’, dept==’wholesale’
First we put the condition. The order must match the stream labeling we define in Box 3.
Box 2: THIS IS DISPUTED
Majority say disjoint: true
I think disjoint: false as the arguments and sources are more compelling
disjoint is false because each row goes to the first matching condition only; all remaining rows that match none of the conditions go to the last output stream, all.
Box 3: ecommerce, retail, wholesale, all
Label the streams -
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split
TOUGH QUESTION
DRAG DROP -
You have an Azure Data Lake Storage Gen2 account that contains a JSON file for customers. The file contains two attributes named FirstName and LastName.
You need to copy the data from the JSON file to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values.
You create the following components:
✑ A destination table in Azure Synapse
✑ An Azure Blob storage container
✑ A service principal
Which five actions should you perform in sequence next in is Databricks notebook? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Actions
1. Mount the Data Lake Storage onto DBFS.
2. Write the results to a table in Azure Synapse.
3. Perform transformations on the file.
4. Specify a temporary folder to stage the data.
5. Write the results to Data Lake Storage.
6. Read the file into a data frame.
7. Drop the data frame.
8. Perform transformations on the data frame.
Disputed and needs more research; go with this if nothing else:
To accomplish the task in an Azure Databricks notebook, the logical sequence of actions would be:
Step 1. Mount the Data Lake Storage onto DBFS: This allows access to the JSON file stored in Azure Data Lake Storage using the Databricks File System.
Step 2. Read the file into a data frame: Use Spark to read the JSON file into a DataFrame for processing.
Step 3. Perform transformations on the data frame: Apply transformations to concatenate the FirstName and LastName fields to create a new column.
Step 4. Specify a temporary folder to stage the data: Before writing the data to Azure Synapse, it is a common practice to stage it in a temporary folder.
Step 5. Write the results to a table in Azure Synapse: Finally, write the transformed DataFrame to the destination table in Azure Synapse Analytics.
These steps would ensure the JSON file data is properly transformed and loaded into Azure Synapse Analytics for further use.
HOTSPOT -
You build an Azure Data Factory pipeline to move data from an Azure Data Lake Storage Gen2 container to a database in an Azure Synapse Analytics dedicated
SQL pool.
Data in the container is stored in the following folder structure./in/{YYYY}/{MM}/{DD}/{HH}/{mm}
The earliest folder is /in/2021/01/01/00/00. The latest folder is /in/2021/01/15/01/45.
You need to configure a pipeline trigger to meet the following requirements:
✑ Existing data must be loaded.
✑ Data must be loaded every 30 minutes.
✑ Late-arriving data of up to two minutes must be included in the load for the time at which the data should have arrived.
How should you configure the pipeline trigger? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Type:
* Event
* On-demand
* Schedule
* Tumbling window
Additional properties:
* Prefix: /in/, Event: Blob created
* Recurrence: 30 minutes, Start time: 2021-01-01T00:00
* Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes
* Recurrence: 32 minutes, Start time: 2021-01-15T01:45
Box 1: Tumbling window -
To be able to use the Delay parameter we select Tumbling window.
Box 2: Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes. The recurrence is 30 minutes (not 32), and the start time must be 2021-01-01T00:00 so that the existing data is loaded.
The amount of time to delay the start of data processing for the window. The pipeline run is started after the expected execution time plus the amount of delay.
The delay defines how long the trigger waits past the due time before triggering a new run. The delay doesn’t alter the window startTime.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger
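For illustration only, the trigger could be defined roughly as follows, shown here as a Python dict that mirrors the tumbling-window trigger JSON described in the linked doc; the trigger and pipeline names are placeholders.
~~~
# Assumed names; mirrors the tumbling-window trigger JSON shape from the ADF docs.
trigger_definition = {
    "name": "Trigger30Min",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Minute",
            "interval": 30,                       # data must be loaded every 30 minutes
            "startTime": "2021-01-01T00:00:00Z",  # covers the earliest existing folder
            "delay": "00:02:00",                  # include late data of up to two minutes
            "maxConcurrency": 10,
        },
        "pipeline": {
            "pipelineReference": {"referenceName": "LoadToSqlPool", "type": "PipelineReference"},
            "parameters": {"windowStart": "@trigger().outputs.windowStartTime"},
        },
    },
}
~~~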
HOTSPOT -
You are designing a near real-time dashboard solution that will visualize streaming data from remote sensors that connect to the internet. The streaming data must be aggregated to show the average value of each 10-second interval. The data will be discarded after being displayed in the dashboard.
The solution will use Azure Stream Analytics and must meet the following requirements:
✑ Minimize latency from an Azure Event hub to the dashboard.
✑ Minimize the required storage.
✑ Minimize development effort.
What should you include in the solution? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:
Azure Stream Analytics input type:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI
Azure Stream Analytics output type:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI
Aggregation query location:
* Azure Event Hub
* Azure SQL Database
* Azure Stream Analytics
* Microsoft Power BI
Azure Stream Analytics input type: Azure Event Hub
Azure Stream Analytics output type: Microsoft Power BI
Aggregation query location: Azure Stream Analytics
Hard question
DRAG DROP -
You have an Azure Stream Analytics job that is a Stream Analytics project solution in Microsoft Visual Studio. The job accepts data generated by IoT devices in the JSON format.
You need to modify the job to accept data generated by the IoT devices in the Protobuf format.
Which three actions should you perform from Visual Studio in sequence? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Actions
- Change the Event Serialization Format to Protobuf in the input.json file of the job and reference the DLL.
- Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution.
- Add .NET deserializer code for Protobuf to the custom deserializer project.
- Add .NET deserializer code for Protobuf to the Stream Analytics project.
- Add an Azure Stream Analytics Application project to the solution.
1. Add an Azure Stream Analytics Custom Deserializer Project (.NET) project to the solution.
2. Add .NET deserializer code for Protobuf to the custom deserializer project.
Popular belief in the discussion is that the final step is:
3. Change the Event Serialization Format to Protobuf in the input.json file of the job and reference the DLL.
You have an Azure Storage account and a data warehouse in Azure Synapse Analytics in the UK South region.
You need to copy blob data from the storage account to the data warehouse by using Azure Data Factory. The solution must meet the following requirements:
✑ Ensure that the data remains in the UK South region at all times.
✑ Minimize administrative effort.
Which type of integration runtime should you use?
A. Azure integration runtime
B. Azure-SSIS integration runtime
C. Self-hosted integration runtime
Correct Answer: A 🗳️
Incorrect Answers:
A: An Azure integration runtime can be pinned to the UK South region, so data movement stays in that region with no infrastructure to manage.
C: A self-hosted integration runtime is used for on-premises (or private-network) data stores and would add administrative effort.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
HOTSPOT -
You have an Azure SQL database named Database1 and two Azure event hubs named HubA and HubB. The data consumed from each source is shown in the following table.
You need to implement Azure Stream Analytics to calculate the average fare per mile by driver.
How should you configure the Stream Analytics input for each source? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
HubA:
* Stream
* Reference
HubB:
* Stream
* Reference
Database1:
* Stream
* Reference
HubA: Stream -
HubB: Stream -
Database1: Reference -
Reference data (also known as a lookup table) is a finite data set that is static or slowly changing in nature, used to perform a lookup or to augment your data streams. For example, in an IoT scenario, you could store metadata about sensors (which don’t change often) in reference data and join it with real time IoT data streams. Azure Stream Analytics loads reference data in memory to achieve low latency stream processing
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
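As a sketch (not from the question), this is roughly how a stream input and a reference input differ in a Stream Analytics job definition, shown as Python dicts; resource names and credentials are placeholders, and the exact property layout may vary.
~~~
# Sketch only: <placeholder> values are assumptions, not from the question.
huba_input = {
    "name": "HubA",
    "properties": {
        "type": "Stream",                       # continuously arriving events
        "datasource": {
            "type": "Microsoft.ServiceBus/EventHub",
            "properties": {"serviceBusNamespace": "<namespace>", "eventHubName": "huba"},
        },
        "serialization": {"type": "Json", "properties": {"encoding": "UTF8"}},
    },
}

database1_input = {
    "name": "Database1",
    "properties": {
        "type": "Reference",                    # static / slowly changing lookup data
        "datasource": {
            "type": "Microsoft.Sql/Server/Database",
            "properties": {"server": "<server>", "database": "Database1"},
        },
    },
}
~~~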
You have an Azure Stream Analytics job that receives clickstream data from an Azure event hub.
You need to define a query in the Stream Analytics job. The query must meet the following requirements:
✑ Count the number of clicks within each 10-second window based on the country of a visitor.
✑ Ensure that each click is NOT counted more than once.
How should you define the Query?
~~~
A. SELECT Country, Avg() AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SlidingWindow(second, 10)
B. SELECT Country, Count() AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10)
C. SELECT Country, Avg() AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, HoppingWindow(second, 10, 2)
D. SELECT Country, Count() AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SessionWindow(second, 5, 10)
~~~
Correct Answer: B 🗳️
Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them, as in query B. The key differentiators of a Tumbling window are that the windows repeat, do not overlap, and an event cannot belong to more than one tumbling window.
Incorrect Answers:
A: Sliding windows, unlike Tumbling or Hopping windows, output events only for points in time when the content of the window actually changes, that is, when an event enters or exits the window. Every window has at least one event and, as with Hopping windows, events can belong to more than one sliding window.
C: Hopping window functions hop forward in time by a fixed period. It may be easy to think of them as Tumbling windows that can overlap, so events can belong to more than one Hopping window result set. To make a Hopping window the same as a Tumbling window, specify the hop size to be the same as the window size.
D: Session windows group events that arrive at similar times, filtering out periods of time where there is no data.
Reference:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
Hard
HOTSPOT -
You are building an Azure Analytics query that will receive input data from Azure IoT Hub and write the results to Azure Blob storage.
You need to calculate the difference in the number of readings per sensor per hour.
How should you complete the query? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
SELECT sensorId, growth = reading - [ LAG / LAST / LEAD ](reading) OVER (PARTITION BY sensorId [ LIMIT DURATION / OFFSET / WHEN ](hour, 1)) FROM input
Box 1: LAG -
The LAG analytic operator allows one to look up a "previous" event in an event stream, within certain constraints. It is very useful for computing the rate of growth of a variable, detecting when a variable crosses a threshold, or when a condition starts or stops being true.
Box 2: LIMIT DURATION -
Example: Compute the rate of growth, per sensor:
SELECT sensorId,
growth = reading -
LAG(reading) OVER (PARTITION BY sensorId LIMIT DURATION(hour, 1))
FROM input -
Reference:
https://docs.microsoft.com/en-us/stream-analytics-query/lag-azure-stream-analytics
You need to schedule an Azure Data Factory pipeline to execute when a new file arrives in an Azure Data Lake Storage Gen2 container.
Which type of trigger should you use?
A. on-demand
B. tumbling window
C. schedule
D. event
Correct Answer: D 🗳️
Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption of, and reaction to events. Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger
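For illustration only: a storage-event trigger that fires on blob creation, shown as a Python dict following the JSON shape in the linked doc; the scope, container path, and pipeline names are placeholders.
~~~
# Illustrative only: every <placeholder> is an assumption, not from the question.
event_trigger = {
    "name": "TriggerOnNewFile",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
                     "Microsoft.Storage/storageAccounts/<account>",
            "events": ["Microsoft.Storage.BlobCreated"],
            "blobPathBeginsWith": "/input-container/blobs/",
            "ignoreEmptyBlobs": True,
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "CopyNewFile", "type": "PipelineReference"}}
        ],
    },
}
~~~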
You have two Azure Data Factory instances named ADFdev and ADFprod. ADFdev connects to an Azure DevOps Git repository.
You publish changes from the main branch of the Git repository to ADFdev.
You need to deploy the artifacts from ADFdev to ADFprod.
What should you do first?
A. From ADFdev, modify the Git configuration.
B. From ADFdev, create a linked service.
C. From Azure DevOps, create a release pipeline.
D. From Azure DevOps, update the main branch.
Correct Answer: C 🗳️
In Azure Data Factory, continuous integration and delivery (CI/CD) means moving Data Factory pipelines from one environment (development, test, production) to another.
Note: The following is a guide for setting up an Azure Pipelines release that automates the deployment of a data factory to multiple environments.
1. In Azure DevOps, open the project that’s configured with your data factory.
2. On the left side of the page, select Pipelines, and then select Releases.
3. Select New pipeline, or, if you have existing pipelines, select New and then New release pipeline.
4. In the Stage name box, enter the name of your environment.
5. Select Add artifact, and then select the git repository configured with your development data factory. Select the publish branch of the repository for the Default branch. By default, this publish branch is adf_publish.
6. Select the Empty job template.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment