Az Cloud Academy Certification test #3 Flashcards

1
Q

An application running on a third-party device has been programmed to send a stream of events to an IoT Hub within Azure. The device sends a temperature reading from a gauge. You have set up an Azure Stream Analytics streaming job to use this IoT Hub as a source. Your company needs to retain the average temperature over the last 30 seconds. The result should be recorded into an Azure Data Lake every 10 seconds. Which window function should be used?
SessionWindow(second,30,10)

HoppingWindow(second,30,10)

SlidingWindow(second,30,10)

TumblingWindow(second,30,10)

A

HoppingWindow(second,30,10)

Explanation
The window functions are as follows (the period can be seconds, minutes, hours, and so on):

HoppingWindow(period, n1, n2) performs the aggregation over the last n1 periods and runs every n2 periods.
SlidingWindow(period, n) performs the aggregation over the last n periods and runs for every event.
TumblingWindow(period, n) performs the aggregation over the last n periods and runs every n periods.
SessionWindow(period, n1, n2) creates a session that starts with the first event and is extended if another event occurs within n1 periods, up to a maximum of n2 periods.
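
For illustration, here is a minimal sketch of the Stream Analytics query implied by the correct answer; the input and output aliases (iothubinput, datalakeoutput) and the Temperature field are hypothetical and would need to match the job's actual configuration.

SELECT
    AVG(Temperature) AS AverageTemperature,
    System.Timestamp() AS WindowEnd
INTO
    datalakeoutput
FROM
    iothubinput TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    HoppingWindow(second, 30, 10)   -- 30-second window, result emitted every 10 seconds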

2
Q

As a data engineer managing a company's Azure workloads, you need to upload 22 TB of data in Azure Storage into an Azure Synapse dedicated SQL pool. The 22 TB of data is Hive data in Optimized Row Columnar (ORC) format. After starting the upload, Azure displayed Java out-of-memory errors. Which of the steps below could you take to complete the upload without generating similar errors?

Use compressed delimited text files

Export only a subset of the columns

Colocate your storage layer and your dedicated SQL pool

None of these options will prevent an error.

A

Export only a subset of the columns

Explanation
When exporting data into the ORC file format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns. The other options are also easy ways to increase performance when loading data into an Azure Synapse dedicated SQL pool, but for this particular error type the best-suited solution is to export only a subset of the columns, leaving out the large text columns that trigger the problem.
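
As a hedged illustration of the workaround, the sketch below loads only the columns that are actually needed instead of every column in the ORC files; the external table and column names are hypothetical.

-- Hypothetical names: ext.SensorReadingsOrc is an external table over the ORC files.
-- Selecting only the required columns (and leaving out wide text columns) avoids the
-- Java out-of-memory errors described above.
INSERT INTO dbo.StageSensorReadings (DeviceId, ReadingTime, Temperature)
SELECT DeviceId, ReadingTime, Temperature
FROM ext.SensorReadingsOrc;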

3
Q

You are designing an Azure Data Factory pipeline that will deploy HDInsight clusters to process data stored in an Azure Data Lake data store. As part of your design, you need to control and limit the access HDInsight has to Azure Data Lake, so that it can only read and process the intended data. Which of the following should you implement to control HDInsight clusters' access to Azure Data Lake Store?
Create a user-assigned managed identity to access Data Lake Storage Gen2.

Create a system-assigned managed identity to access Data Lake Storage Gen2.

Create a registered service principal for HDInsight clusters in Azure Active Directory.

Create an Azure Active Directory role for the HDInsight clusters.

A

Create a user-assigned managed identity to access Data Lake Storage Gen2.

Explanation
Your HDInsight cluster's ability to access files in Data Lake Storage Gen2 is controlled through managed identities. A managed identity is an identity registered in Azure Active Directory (Azure AD) whose credentials are managed by Azure. With managed identities, you don't need to register service principals in Azure AD or maintain credentials such as certificates.

Azure services have two types of managed identities: system-assigned and user-assigned.

HDInsight only uses user-assigned managed identities to access Data Lake Storage Gen2. A user-assigned managed identity is created as a standalone Azure resource. Azure creates an identity in the Azure AD tenant that’s trusted by the subscription in use. After the identity is created, the identity can be assigned to one or more Azure service instances.

4
Q

Your data engineering team must set up a pipeline that sends application logs from Azure Databricks to a Log Analytics workspace using the Log4j appender within the Azure Databricks Monitoring Library. Which step or option is not involved in the setup of this pipeline?

Build the spark-listeners-1.0-SNAPSHOT.jar and the spark-listeners-loganalytics-1.0-SNAPSHOT.jar JAR files.

Create a log4j.properties configuration file for your application.

Include the spark-listeners-loganalytics project in your application code, and import com.microsoft.pnp.logging.Log4jconfiguration to your application code.

Create Dropwizard gauges or counters in your application code.

A

Create Dropwizard gauges or counters in your application code.

Explanation
All of the options except one are steps for sending your Azure Databricks application logs to Azure Log Analytics using the Log4j appender. The option that says “Create Dropwizard gauges or counters in your application code” applies only when the chosen method of sending logs is based on the Dropwizard Metrics Library, and the question clearly states that you have already decided to use the Log4j appender.

5
Q

You are reviewing the recent metrics of an Azure Stream Analytics job, and notice an increase in the number of Late Input Events. Which of the following would you adjust to optimize the job performance?

The job’s Out of Order Tolerance

The job’s Late Arrival Tolerance

The job’s Early Arrival Tolerance

The job’s Start Time

A

The job’s Late Arrival Tolerance

Explanation
Stream Analytics jobs have several Event ordering options. Two can be configured in the Azure portal: the Out of order events setting (out-of-order tolerance), and the Events that arrive late setting (late arrival tolerance). The early arrival tolerance is fixed and cannot be adjusted. These time policies are used by Stream Analytics to provide strong guarantees.

6
Q

As part of the Data team, you are assigned the task of choosing the right Integration Runtime (IR) for Azure Data Factory for your latest project. The critical project network and capability requirements are: The IR must support copy activities between Azure-hosted data stores and on-premises data stores in private networks. The IR must be able to monitor compute jobs run on HDInsight and Azure Machine Learning. Which of the Integration Runtimes below (if any) would meet the project’s requirements?

Only Self-hosted Integration Runtime

Only Azure-SSIS Integration Runtime

Azure Integration Runtime and Azure-SSIS Integration Runtime

Self-hosted Integration Runtime and Azure-SSIS Integration Runtime

A

Only Self-hosted Integration Runtime

Explanation
In this scenario, the copy activity is between Azure cloud data stores and a data store in an on-premises private network. Azure Integration Runtime does not support this.

Private network support for data movement and activity dispatch is not available for Azure-SSIS Integration Runtime. This leaves the Self-Hosted Integration Runtime as a suitable solution.

7
Q

You have an Azure Data Lake Storage account which contains a very large amount of data. Various pipelines are triggered to analyze the data that arrived in the previous day, week, and month, and the ADLS store needs a data archival policy that meets the following requirements: New data will be requested and updated thousands of times in the first 30 days. After 30 days, data will be accessed occasionally and should be available immediately. After 180 days, data will not be accessed. Which actions should be taken to meet these requirements in the most cost-effective way? (Choose 2 options)
Data will first be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and move to the archive tier after 180 days.

Data will be stored in the cool tier for the first 30 days, move to the archive tier after 30 days, and deleted after 180 days.

Data will be stored in the cool tier for the first 180 days, and be deleted after 180 days.

Data will be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and be deleted after 180 days.

A

Data will first be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and move to the archive tier after 180 days.

Data will be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and be deleted after 180 days.

Explanation
Azure storage offers different access tiers, allowing you to store blob object data in the most cost-effective manner. Available access tiers include:

Hot - Optimized for storing data that is accessed frequently.
Cool - Optimized for storing data that is infrequently accessed and stored for at least 30 days.
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
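
A hedged sketch of an Azure Storage lifecycle management policy that implements the first accepted answer (hot on ingest, cool after 30 days, archive after 180 days); the rule name and blob filter are illustrative. The second accepted answer would use a delete action with the 180-day threshold instead of tierToArchive.

{
  "rules": [
    {
      "name": "archivalPolicy",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 180 }
          }
        }
      }
    }
  ]
}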

8
Q

There are several statements below about the benefits of partitioning tables in a dedicated SQL pool. Statement 1: Using partitions to maintain data will avoid transactional logging. Statement 2: Partition switching can be used to quickly remove or replace a section of a table. Statement 3: Deleting data row-by-row using delete statements is faster than deleting an entire partition. Which choice below correctly identifies the statements as true or false?

All the statements are true.

Statement 1 - True
Statement 2 - True
Statement 3 - False

Statement 1 - False
Statement 2 - True
Statement 3 - True

Statement 1 - True
Statement 2 - False
Statement 3 - True

A

Statement 1 - True
Statement 2 - True
Statement 3 - False

Explanation
The primary benefit of partitioning in a dedicated SQL pool is to improve the efficiency and performance of loading data by use of partition deletion, switching and merging. In most cases, data is partitioned on a date column that is closely tied to the order in which the data is loaded into the SQL pool.

One of the greatest benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance. Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table might contain just data for the past 36 months.

At the end of every month, the oldest month of sales data is deleted from the table. This data could be deleted by using a delete statement, but deleting a month of data row by row generates a large amount of transaction logging and is slow; switching the oldest partition out of the table removes the same data almost instantly, which is why statement 3 is false.
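
A minimal T-SQL sketch of that partition-switch pattern, with hypothetical table names; the oldest partition is switched into a matching table and then truncated instead of deleting rows one by one.

-- Assumes dbo.FactSales and dbo.FactSales_Old have identical schemas,
-- distributions, and partition boundaries (a requirement for switching).
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old PARTITION 1;

-- The switched-out data can now be removed with minimal logging.
TRUNCATE TABLE dbo.FactSales_Old;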

9
Q

There are two Azure Data Factory (ADF) pipelines. The first pipeline (pipeline A) should be triggered when a new file is saved into the storage account. The second pipeline (pipeline B) should be triggered every 4 hours. Statement 1: Pipeline A should use an Event trigger. Statement 2: Pipeline B should use a Schedule trigger. Which choice below identifies the correct statement(s)?

Only statement 1 is correct.

Statements 1 and 2 are correct.

Only statement 2 is correct.

Neither statement 1 nor 2 is correct.

A

Statements 1 and 2 are correct.

Explanation
Both statements are correct.

Triggers represent a unit of processing that determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of triggers:

Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule.
Tumbling window trigger: A trigger that operates on a periodic interval, while also retaining state.
Event-based trigger: A trigger that responds to an event.
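
As a hedged illustration for pipeline A, a storage event trigger definition might look roughly like the sketch below; the trigger name, container path, storage account scope, and pipeline name are hypothetical placeholders.

{
  "name": "NewFileEventTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/uploads/blobs/",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PipelineA",
          "type": "PipelineReference"
        }
      }
    ]
  }
}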

10
Q

A company is migrating three on-premises Microsoft SQL Server databases to Azure. The company would like to minimize the cost of running the service in Azure. They have analyzed the usage of the databases before migration as follows: Database 1: Used predominantly during the first week of the month, with heavy analytics and querying during working hours (8:00 am-6:00 pm). Database 2: Used throughout the month for querying, although data is uploaded nightly. Database 3: Used by the data science team to train their machine learning models within R and Python; this will be updated to be used within Azure Databricks once the migration has taken place. The training of the models will be performed daily. The data volumes can be handled easily by Azure SQL Database. How should the Azure SQL Databases be implemented?
Each having a set number of DTUs

Database 3 as an Azure SQL Data Warehouse and databases 1 and 2 on an Azure SQL VM

Convert database 3 to an Azure Data Lake (Gen 2) and databases 1 and 2 as Cosmos DB

Implement all as Azure SQL Databases included in a single elastic pool

A

Implement all as Azure SQL Databases included in a single elastic pool

Explanation
Because the three databases are used at different times, an elastic pool should be used so that the pooled resources are available to each database when required. Azure SQL Data Warehouse and Azure SQL VMs might provide better overall performance, but they would not minimize cost, which is a stated requirement.

11
Q

The following JSON is an example of a Parquet dataset on Azure Blob Storage.

{
    "name": "ParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "compressionCodec": "LZO"
        }
    }
}

Which of the JSON dataset properties is configured incorrectly?
"type": "Parquet"
"schema": [ < physical schema, optional, retrievable during authoring > ],
"type": "AzureBlobStorageLocation",
"compressionCodec": "LZO"

A

"compressionCodec": "LZO"

Explanation
The compressionCodec property specifies the compression codec to use when writing to Parquet files. When reading from Parquet files, Data Factory automatically determines the compression codec based on the file metadata.

Supported types are "none", "gzip", "snappy" (default), and "lzo".

Note that the Copy activity currently does not support LZO when reading or writing Parquet files.
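
One way the dataset fragment could be corrected is to switch to a supported codec, for example snappy (the default):

"typeProperties": {
    "location": {
        "type": "AzureBlobStorageLocation",
        "container": "containername",
        "folderPath": "folder/subfolder"
    },
    "compressionCodec": "snappy"
}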

12
Q

The following are two statements about partitioning tables in a dedicated SQL pool: Statement 1: A query that applies a filter to partitioned data can limit the scan to only the qualifying partitions. Statement 2: Partitioning is only supported on hash-distributed data. Which statements, if any, are correct?

Only statement 1 is correct.
Only statement 2 is correct.
Both statements 1 and 2 are correct.

Both statements 1 and 2 are incorrect.

A

Only statement 1 is correct.

Explanation
The primary benefit of partitioning in a dedicated SQL pool is to improve the efficiency and performance of loading data by use of partition deletion, switching and merging. In most cases, data is partitioned on a date column that is closely tied to the order in which the data is loaded into the SQL pool.

One of the greatest benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance.

Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table might contain just data for the past 36 months. At the end of every month, the oldest month of sales data is deleted from the table. This data could be deleted by using a delete statement to delete the data for the oldest month. More directly to the statements in this question: statement 1 is correct because a query that filters on the partitioning column can use partition elimination and scan only the qualifying partitions, and statement 2 is incorrect because partitioning is supported on all distribution types (hash, round-robin, and replicated) and all table types (clustered columnstore, clustered index, and heap).
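
To illustrate why statement 2 is incorrect, here is a sketch (with hypothetical table and column names) of a partitioned table that uses a round-robin rather than a hash distribution:

CREATE TABLE dbo.FactSales
(
    OrderDateKey  int    NOT NULL,
    ProductKey    int    NOT NULL,
    SalesAmount   money  NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- partitioning is not limited to hash-distributed tables
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301) )
);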

13
Q

An electronics company utilizes Azure Data Lake Storage (ADLS) Generation 1 for Big Data Analytics. As part of the data analytics team, your new assignment is to plan and design the migration of ADLS Generation 1 to ADLS Generation 2. Only a small number of existing pipelines are connected to the current data lakes, but your team requires that the migration results in no downtime for any related applications and that the process requires minimal administration. Which of the following migration methods would best meet these requirements?

Lift and Shift

Incremental Copy

Dual Pipeline

Bidirectional Sync

A

Dual Pipeline

Explanation
The need for no downtime disqualifies Lift and Shift and Incremental Copy from consideration.
Dual Pipelines are ideal in situations where your workloads and applications can’t afford any downtime, and you can ingest into both storage accounts.

Bidirectional sync is ideal for complex scenarios that involve a large number of pipelines and dependencies where a phased approach might make more sense. However, bidirectional sync can require detailed planning and considerable administrative effort, so this leaves Dual Pipeline as the most suitable pattern.

14
Q

A colleague has loaded a CSV into an Azure Databricks workspace that you share. You need to read the data from the CSV to process the data using Scala. The CSV holds the column names as the first row of the file. Which piece of code should you use to read the data, given that you do not have a preset schema?

val sparkDF = spark.read.format("csv")
  .option("header","true")
  .option("inferSchema","true")
  .load("state_data.csv")

val sparkDF = spark.read.format("csv")
  .option("header","true")
  .load("/FileStore/tables/state_data.csv")

val sparkDF = spark.read.format("csv")
  .option("header","true")
  .option("inferSchema","true")
  .load("/FileStore/tables/state_data.csv")

val sparkDF = spark.read.format("csv")
  .option("inferSchema","true")
  .load("/FileStore/tables/state_data.csv")

A

val sparkDF = spark.read.format("csv")
  .option("header","true")
  .option("inferSchema","true")
  .load("/FileStore/tables/state_data.csv")

Explanation
To load the CSV you should specify the following:

File type = spark.read.format("csv")

First row as header = .option("header","true")

Infer the column types from the data = .option("inferSchema","true")

Filename in the FileStore = .load("/FileStore/tables/state_data.csv")

15
Q

You are the data engineer for a very large e-commerce website with a global user base. You have a data pipeline that gathers clickstream data with Azure Event Hubs and sends it to Azure Stream Analytics. You need to design a query that will: aggregate the number of clicks into distinct periods of time, divide the numbers based on the user region, and count each click once and only once. Which of the following functions should be used in your query?

a tumbling window function

a session window function

a sliding window function

a hopping window function

A

a tumbling window function

Explanation
Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them. The key differentiators of tumbling windows are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window.
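
A minimal sketch of such a query, assuming hypothetical input/output aliases and a Region field on each click event:

SELECT
    Region,
    COUNT(*) AS ClickCount,
    System.Timestamp() AS WindowEnd
INTO
    clickoutput
FROM
    clickstreaminput TIMESTAMP BY ClickTime
GROUP BY
    Region,
    TumblingWindow(minute, 1)   -- distinct, non-overlapping 1-minute windows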

16
Q

While configuring a data workflow, you are deciding how to load data into an Azure Synapse Analytics staging table. You would like the data to load into the staging table as quickly as possible. Which distribution type(s) would be optimal for this scenario?

Hash-distributed

Round Robin

Either Round Robin or Replicated

Either Replicated or Hash-Distributed

A

Round Robin

Explanation
Synapse Analytics offers three distribution choices: round-robin, hash-distributed, and replicated. A round-robin table has a very simple distribution pattern: rows are distributed evenly across the data distributions. This is why it's the fastest distribution type for loading data into a staging table; it doesn't perform any optimization.
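
A hedged sketch of a staging table definition with hypothetical names; round-robin distribution, often combined with a heap, keeps the load path as simple and fast as possible.

CREATE TABLE stg.Sales
(
    SaleId      bigint,
    ProductKey  int,
    SaleAmount  money
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP   -- skip columnstore/index maintenance during the initial load
);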

17
Q

You have configured an Azure Stream Analytics job and want to check its progress periodically using the metric graphing feature available in the Azure portal. You need to monitor the following metrics: Streaming units (percentage), Late input events (count), Early input events (count), Input Event Bytes (bytes), Runtime errors (count), and Out-of-Order Events (count). You prefer to create the minimum number of graphs, for optimal efficiency. How many graphs will you need to create?

5

1

3

2

A

3

Explanation
All of the metrics on the same graph have to be the same unit of measure. There are three units of measure in the collection of metrics listed in this question - count, percentage, and number of bytes. Therefore, the correct answer is three.

18
Q

A company has to implement an application that would generate PDF files. The application would only need to store the PDFs and JSON metadata related to the PDF files. The PDF files would then be distributed over the web to various users. The PDF files could grow large in size. Which Azure data and storage solutions are recommended for the application?

Create virtual machines and store the PDFs and JSON metadata on the virtual machines.
Create an Azure SQL database and store the PDFs and JSON metadata in a table on the database.
Create an Azure Cosmos DB to store the metadata and the PDF files.
Use Blob storage to store the PDFs and the relevant JSON metadata in Azure Cosmos DB.

A

Use Blob storage to store the PDFs and the relevant JSON metadata in Azure Cosmos DB.
Explanation
There are different types of Blob storage available:

Block blobs are ideal for storing text or binary files, such as documents and media files.
Append blobs are similar to block blobs in that they are made up of blocks, but they are optimized for append operations, so they are useful for logging scenarios.
Page blobs can be up to 1 TB in size, and are more efficient for frequent read/write operations
The JSON metadata can be stored in Azure Cosmos DB and the PDF files in Blob storage.

19
Q

Your company is being audited, and an external accountant needs access to review and download specific files from the blob storage and file storage services within one specific Azure storage account. You currently use Azure Active Directory to control access to the Azure storage account in question. However, you have been told you need to provide the accountant with immediate access to the blob and file services without being given any further information about the accountant. How can you provide the necessary access, but also limit it to the blobs and files in question?

Provide the accountant with read-only access to the specific Azure Blob and File storage services with a service-level shared access signature token. Allow all read requests but limit write requests to LIST and GET. Specify the HTTPS protocol is required to accept requests.

Assign the accountant a guest role in Azure Active Directory with read-only access to the specific Azure Blob and File services in the Azure Storage account.

Assign the accountant a contributor role access to the entire storage account using Azure AD role-based access control (RBAC).

Provide the accountant with read-only access to the specific Azure Blob and File storage services with an account-level shared access signature token. Allow all read requests but limit write requests to LIST and GET. Specify the HTTPS protocol is required to accept requests.

A

Provide the accountant with read-only access to the specific Azure Blob and File storage services with an account-level shared access signature token. Allow all read requests but limit write requests to LIST and GET. Specify the HTTPS protocol is required to accept requests.

Explanation
In this case, an account-level SAS is required because the accountant needs access to two separate services in the account. You do not have the necessary information to create a guest or contributor account to control the accountant's access, but you can add controls to require that requests are sent over HTTPS, and you can also restrict the specific read/write actions allowed.

20
Q

You plan to use a U-SQL script within Azure Data Factory (ADF) to transform data in an Azure Storage blob container, then load the transformed data into an Azure Synapse Analytics table. You want to manage the entire data flow within a single schedule. Which choice below is the most reasonable option to correctly complete the task?
Create one ADF pipeline for the U-SQL job to transform the data and then copy the output of the U-SQL job to Azure Synapse Analytics.

Create two separate ADF pipelines. The first will use the U-SQL job to transform the data, and the second will copy the output of the U-SQL job to Azure Synapse Analytics.

Create one ADF pipeline for the U-SQL job to transform the data, and copy the output to Azure Synapse Analytics with Copy Data Wizard.

Create one ADF pipeline for the U-SQL job to transform the data with Copy Data Wizard, and then copy the output to Azure Synapse Analytics with Copy Data Wizard.

A

Create one ADF pipeline for the U-SQL job to transform the data and then copy the output of the U-SQL job to Azure Synapse Analytics.

Explanation
To manage the entire data flow in a single schedule, we would need to define both the data transformation activity and the data movement activity within a single pipeline.

Copy Data Wizard cannot create data transformation activities; it can only create data movement activities. This makes it an invalid choice when you need to create both data movement and data transformation activities in the same pipeline.

21
Q

There are two Azure Data Factory pipelines. The first pipeline (pipeline A) should run each time an existing executable file is updated in an Azure blob storage account. The second pipeline (pipeline B) should be triggered every day at 5:00 am UTC. Which of the following statements regarding these Azure Data Factory pipelines is correct? Statement 1: Pipeline A should use on-demand execution. Statement 2: Pipeline B should use a Schedule trigger.

Only statement 1 is correct.

Statements 1 and 2 are correct.

Only statement 2 is correct.

Neither statement 1 nor 2 is correct.

A

Only statement 2 is correct.

Explanation
Triggers represent a unit of processing that determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of triggers:

Schedule triggers: A trigger that invokes a pipeline on a wall-clock schedule.
Tumbling window triggers: A trigger that operates on a periodic interval, while also retaining state.
Event-based triggers: A trigger that responds to an event.
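
For pipeline B, a schedule trigger that fires daily at 5:00 am UTC might look roughly like the sketch below; the trigger name, start time, and pipeline name are hypothetical.

{
  "name": "DailyFiveAmTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T05:00:00Z",
        "timeZone": "UTC",
        "schedule": { "hours": [ 5 ], "minutes": [ 0 ] }
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PipelineB",
          "type": "PipelineReference"
        }
      }
    ]
  }
}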

22
Q

An e-commerce website generates daily sales transactions. This data is loaded into a dedicated SQL pool fact table for analysis using Azure Synapse Analytics. The main objective of the solution is to achieve high performance for queries run on large tables because the minimum table size is 3 GB. This table will have a very high number of insert, update, and delete operations. Which method is best suited for this scenario?
Use a hash distribution with a clustered columnstore index.

Use a round-robin distribution with a clustered columnstore index.

Use a hash distribution without a clustered columnstore index.

Use a round-robin distribution without clustered columnstore index.

A

Use a hash distribution with a clustered columnstore index.

Explanation
There are certain points that should be considered before selecting an appropriate distribution method. The first is the size of the table: a hash distribution is recommended when the table is larger than 2 GB, and here the minimum table size is 3 GB. In addition, the table has a very high number of insert, update, and delete operations, which rules out the round-robin distribution.
Here the data is loaded into a fact table and high query performance is expected. Performance improves when two hash-distributed tables are joined on the same distribution column, and a clustered columnstore index gives the best compression and query performance on large fact tables, so a hash distribution with a clustered columnstore index is the answer.
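
A sketch of the chosen design with hypothetical table and column names: a hash-distributed fact table with a clustered columnstore index.

CREATE TABLE dbo.FactDailySales
(
    TransactionId  bigint  NOT NULL,
    ProductKey     int     NOT NULL,
    SaleDateKey    int     NOT NULL,
    SalesAmount    money   NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),   -- spreads the large table evenly and enables co-located joins
    CLUSTERED COLUMNSTORE INDEX        -- best compression and scan performance for large fact tables
);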

23
Q

A data team is designing several data processing solutions with complex Azure Data Factory (ADF) pipelines. These pipelines are dependent upon the successful completion of HDInsight processing jobs that will be initiated simultaneously with the ADF pipelines. In order for the ADF pipeline to complete successfully, the HDInsight job output must be available. Which ADF control flow activity would work best when the pipeline must confirm and evaluate the HDInsight output before proceeding to the next task?

A wait activity

A webhook activity

A validation activity

An if condition activity

A

A validation activity

Explanation
A validation activity ensures a pipeline only continues execution if a reference dataset exists, meets specified criteria, or a timeout has been reached. This allows ADF to evaluate the HDInsight output dataset and confirm that it is available before the pipeline proceeds to the next task.

The Wait activity simply pauses the pipeline for a specified period of time before continuing with subsequent activities, so it cannot confirm that the HDInsight output actually exists. The Webhook activity calls an endpoint and waits for a callback, and the If Condition activity branches between two sets of activities based on an expression; neither evaluates whether a reference dataset exists or meets specified criteria the way the Validation activity does.
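
A hedged sketch of what the Validation activity definition might look like inside the pipeline JSON; the activity name, dataset reference, and property values are hypothetical and should be checked against the current ADF schema.

{
  "name": "ConfirmHDInsightOutput",
  "type": "Validation",
  "typeProperties": {
    "dataset": {
      "referenceName": "HDInsightOutputDataset",
      "type": "DatasetReference"
    },
    "timeout": "0.02:00:00",
    "sleep": 60,
    "minimumSize": 1
  }
}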

24
Q

A Data Engineer is designing a dimensional model data solution. Within this solution, a slowly changing dimension (SCD) is used to effectively manage changes to dimension members. When a change is detected in the source, the dimension table data should be overwritten with the latest value without any record of the previous value. Which SCD type will be most suitable here?
Type 1

Type 2

Type 3

None of the listed types are suitable.

A

Type 1

Explanation
If you have data that may change at some unpredictable time (not fixed or scheduled), the resulting value updates in a dimension table can be managed using an SCD.

The SCD type is determined by how the value in the table is updated. In this question, the requirement is to overwrite the existing value with the latest one, without keeping any record of the previous value. This is the defining property of a Type 1 SCD.

Type 2 and Type 3 do not simply overwrite the existing value: Type 2 preserves history by adding a new row, and Type 3 preserves the previous value in an additional column.

The comparison between SCD types can be found in the reference links.
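
A minimal T-SQL sketch of a Type 1 update using MERGE, with hypothetical dimension and staging table names; the matched branch simply overwrites the attribute and keeps no history.

MERGE dim.Customer AS tgt
USING stg.Customer AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED THEN
    UPDATE SET tgt.City = src.City   -- Type 1: overwrite in place, no record of the old value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, City)
    VALUES (src.CustomerKey, src.City);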

25
Q

You are designing a data table in Azure Table Storage for optimal performance when handling a large number of read requests and a much smaller number of write requests. As part of your design preparation, you have a list of the most common queries to expect for the table. Most queries will be point queries, with exact matches for the partition and row of the desired item. There will be row range scan queries, which include an exact match for the item partition and a partial match for the item row. Almost all queries include two key properties, a group ID, which corresponds to many items, and an item ID, which is specific to one item. With this information, which of the following design choices is optimal for your Azure Storage table?

Use the group ID as the partition key and item ID as the row key

Use the item ID as the partition key and the group ID as the row key

Concatenate the group ID and item ID into a composite partition key, and leave an empty string as the row key

Use the group ID as the partition key, and leave an empty string as the row key

A

Use the group ID as the partition key and item ID as the row key

Explanation
In this situation, using the group ID as the partition key and the item ID as the row key is ideal. It will allow for point queries to complete quickly, and row range scans to resolve without crossing partition servers.

Using the item ID as the partition key would create a large number of partition servers, which is ideal for scaling large numbers of point queries but will require crossing many partitions to complete row range scans.

Concatenating the two IDs will also create a very large number of partition servers.

Using the group ID as the partition key with an empty row key will not work well for either query type.
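
For reference, a hedged sketch of the two query shapes as OData filter expressions; the key values are hypothetical.

Point query (exact partition key and row key, served by a single partition server):
    PartitionKey eq 'group-42' and RowKey eq 'item-1001'

Row range scan (exact partition key, partial row key match, still within one partition):
    PartitionKey eq 'group-42' and RowKey ge 'item-1000' and RowKey lt 'item-2000'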