Az Cloud Academy Certification test #3 Flashcards
An application running on a third-party device has been programmed to send a stream of events to an IoT Hub within Azure. The device sends a temperature from a gauge. You have set up an Azure Stream Analytics streaming job to use this IoT Hub as a source. Your company needs to retain the average temperature over the last 30 seconds. The result should be recorded into an Azure Data Lake every 10 seconds. Which window function should be used?
SessionWindow(second,30,10)
HoppingWindow(second,30,10)
SlidingWindow(second,30,10)
TumblingWindow(second,30,10)
HoppingWindow(second,30,10)
Explanation
The window functions are as follows (the period can be seconds, minutes, hours, and so on):
HoppingWindow(period, n1, n2) performs the aggregation over the last n1 periods and runs every n2 periods (see the sample query after this list).
SlidingWindow(period, n) performs the aggregation over the last n periods and runs for every event.
TumblingWindow(period, n) performs the aggregation over the last n periods and runs every n periods.
SessionWindow(period, n1, n2) creates a session that starts with the first event and is extended if another event occurs within n1 periods, up to a maximum window of n2 periods.
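For illustration, a Stream Analytics query using this hopping window might look like the sketch below. The input, output, and column names (IoTHubInput, DataLakeOutput, Temperature) are assumptions, not names given in the scenario:

SELECT
    System.Timestamp() AS WindowEnd,
    AVG(Temperature) AS AvgTemperature
INTO
    DataLakeOutput
FROM
    IoTHubInput TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    HoppingWindow(second, 30, 10)  -- 30-second window, advancing (hopping) every 10 seconds

Every 10 seconds this emits the average temperature over the preceding 30 seconds, which is the overlapping-window behavior the scenario requires.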
As a data engineer managing a company’s Azure workloads, you need to upload 22 TB of data in Azure Storage into an Azure Synapse dedicated SQL pool. The 22 TB of data is Hive data in Optimized Row Columnar (ORC) format. After starting the upload, Azure displayed Java out-of-memory errors. Which of the steps below could you take to complete the upload without generating similar errors?
Use compressed delimited text files
Export only a subset of the columns
Colocate your storage layer and your dedicated SQL pool
None of these options will prevent an error.
Export only a subset of the columns
Explanation
When exporting data into an ORC file format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns. The other options are also common ways to improve performance when loading data into an Azure Synapse dedicated SQL pool, but for this particular error type the best-suited solution is to export only a subset of the columns, leaving out the large text columns that cause the problem.
You are designing an Azure Data Factory pipeline that will deploy HDInsight clusters to process data stored in an Azure Data Lake data store. As part of your design, you need to control and limit the access HDInsight has to Azure Data Lake, to limit the data it reads and processes. Which of the following should you implement to control HDInsight clusters’ access to Azure Data Lake Store?
Create a user-assigned managed identity to access Data Lake Storage Gen2.
Create a system-assigned managed identity to access Data Lake Storage Gen2.
Create a registered service principal for HDInsight clusters in Azure Active Directory.
Create an Azure Active Directory role for the HDInsight clusters.
Create a user-assigned managed identity to access Data Lake Storage Gen2.
Explanation
Your HDInsight cluster’s ability to access files in Data Lake Storage Gen2 is controlled through managed identities. A managed identity is an identity registered in Azure Active Directory (Azure AD) whose credentials are managed by Azure. With managed identities, you don’t need to register service principals in Azure AD or maintain credentials such as certificates.
Azure services have two types of managed identities: system-assigned and user-assigned.
HDInsight only uses user-assigned managed identities to access Data Lake Storage Gen2. A user-assigned managed identity is created as a standalone Azure resource. Azure creates an identity in the Azure AD tenant that’s trusted by the subscription in use. After the identity is created, the identity can be assigned to one or more Azure service instances.
Your data engineering team must set up a pipeline that sends application logs from Azure Databricks to a Log Analytics workspace using the Log4j appender within the Azure Databricks Monitoring Library. Which step or option is not involved in the setup of this pipeline?
Build the spark-listeners-1.0-SNAPSHOT.jar and the spark-listeners-loganalytics-1.0-SNAPSHOT.jar JAR files.
Create a log4j.properties configuration file for your application.
Include the spark-listeners-loganalytics project in your application code, and import com.microsoft.pnp.logging.Log4jconfiguration to your application code.
Create Dropwizard gauges or counters in your application code.
Create Dropwizard gauges or counters in your application code.
Explanation
All of the options except one are steps for sending your Azure Databricks application logs to Azure Log Analytics using the Log4j appender. The option that says “Create Dropwizard gauges or counters in your application code” applies only when the chosen method is based on the Dropwizard Metrics Library, and the question clearly states that the Log4j appender is being used.
You are reviewing the recent metrics of an Azure Stream Analytics job, and notice an increase in the number of Late Input Events. Which of the following would you adjust to optimize the job performance?
The job’s Out of Order Tolerance
The job’s Late Arrival Tolerance
The job’s Early Arrival Tolerance
The job’s Start Time
The job’s Late Arrival Tolerance
Explanation
Stream Analytics jobs have several Event ordering options. Two can be configured in the Azure portal: the Out of order events setting (out-of-order tolerance), and the Events that arrive late setting (late arrival tolerance). The early arrival tolerance is fixed and cannot be adjusted. These time policies are used by Stream Analytics to provide strong guarantees.
As part of the Data team, you are assigned the task of choosing the right Integration Runtime (IR) for Azure Data Factory for your latest project. The critical project network and capability requirements are: The IR must support copy activities between Azure-hosted data stores and on-premises data stores in private networks. The IR must be able to monitor compute jobs run on HDInsight and Azure Machine Learning. Which of the Integration Runtimes below (if any) would meet the project’s requirements?
Only Self-hosted Integration Runtime
Only Azure-SSIS Integration Runtime
Azure Integration Runtime and Azure-SSIS Integration Runtime
Self-hosted Integration Runtime and Azure-SSIS Integration Runtime
Only Self-hosted Integration Runtime
Explanation
In this scenario, the copy activity is between Azure cloud data stores and a data store in an on-premises private network. Azure Integration Runtime does not support this.
Private network support for data movement and activity dispatch is not available for Azure-SSIS Integration Runtime. This leaves the Self-Hosted Integration Runtime as a suitable solution.
You have an Azure Data Lake Storage account that contains a very large amount of data. Various pipelines are triggered to analyze the data that arrived that day, week, and month, and the ADLS store needs a data archival policy that meets the following requirements: New data will be requested and updated thousands of times in the first 30 days. After 30 days, data will be accessed occasionally and should be available immediately. After 180 days, data will not be accessed. Which actions should be taken to meet these requirements in the most cost-effective way? (Choose 2 options)
Data will first be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and move to the archive tier after 180 days.
Data will be stored in the cool tier for the first 30 days, move to the archive tier after 30 days, and deleted after 180 days.
Data will be stored in the cool tier for the first 180 days, and be deleted after 180 days.
Data will be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and be deleted after 180 days.
Data will first be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and move to the archive tier after 180 days.
Data will be stored in the hot tier for the first 30 days, move to the cool tier after 30 days, and be deleted after 180 days.
Explanation
Azure storage offers different access tiers, allowing you to store blob object data in the most cost-effective manner. Available access tiers include:
Hot - Optimized for storing data that is accessed frequently.
Cool - Optimized for storing data that is infrequently accessed and stored for at least 30 days.
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
There are several statements below about the benefits of partitioning tables in a dedicated SQL pool. Statement 1: Using partitions to maintain data will avoid transactional logging. Statement 2: Partition switching can be used to quickly remove or replace a section of a table. Statement 3: Deleting data row-by-row using delete statements is faster than deleting an entire partition. Which choice below correctly identifies the statements as true or false?
All the statements are true.
Statement 1 - True; Statement 2 - True; Statement 3 - False
Statement 1 - False; Statement 2 - True; Statement 3 - True
Statement 1 - True; Statement 2 - False; Statement 3 - True
Statement 1 - True; Statement 2 - True; Statement 3 - False
Explanation
The primary benefit of partitioning in a dedicated SQL pool is to improve the efficiency and performance of loading data by use of partition deletion, switching and merging. In most cases, data is partitioned on a date column that is closely tied to the order in which the data is loaded into the SQL pool.
One of the greatest benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance. Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table might contain just data for the past 36 months.
At the end of every month, the oldest month of sales data is deleted from the table. This data could be deleted with a delete statement that removes the rows for the oldest month, but deleting a large amount of data row by row is slow and generates extensive transaction logging. A faster approach is to switch out or truncate the oldest partition, which is why statement 3 is false.
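As a rough sketch of partition switching in a dedicated SQL pool (the table names and partition number are hypothetical), the oldest month can be switched out into an empty, identically partitioned staging table and then truncated, which is a metadata operation rather than a logged row-by-row delete:

-- Move the oldest partition out of the fact table into a staging table with the same structure
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Staging PARTITION 1;

-- Discard the switched-out rows without per-row transaction logging
TRUNCATE TABLE dbo.FactSales_Staging;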
There are two Azure Data Factory (ADF) pipelines. The first pipeline (pipeline A) should be triggered when a new file is saved into the storage account. The second pipeline (pipeline B) should be triggered every 4 hours. Statement 1: Pipeline A should use an Event trigger. Statement 2: Pipeline B should use a Schedule trigger. Which choice below identifies the correct statement(s)?
Only statement 1 is correct.
Statements 1 and 2 are correct.
Only statement 2 is correct.
Neither statement 1 nor 2 is correct.
Statements 1 and 2 are correct.
Explanation
Both statements are correct.
Triggers represent a unit of processing that determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of triggers:
Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule.
Tumbling window trigger: A trigger that operates on a periodic interval, while also retaining state.
Event-based trigger: A trigger that responds to an event.
A company is migrating three on-premises Microsoft SQL Server databases to Azure. The company would like to minimize the cost of running the service in Azure. They have analyzed the usage of the databases before migration as shown below: Database 1: Used predominantly during the first week of the month, with heavy analytics and querying during working hours (8:00 am-6:00 pm). Database 2: Used throughout the month for querying, although data is uploaded nightly. Database 3: Used by the data science team to train their machine learning models within R and Python; this will be updated to use Azure Databricks once the migration has taken place. The training of the models will be performed daily. The data volumes can be handled easily by Azure SQL Database. How should the Azure SQL Databases be implemented?
Each having a set number of DTUs
Database 3 as an Azure SQL Data Warehouse and databases 1 and 2 on an Azure SQL VM
Convert database 3 to an Azure Data Lake (Gen 2) and databases 1 and 2 as Cosmos DB
Implement all as Azure SQL Databases included in a single elastic pool
Implement all as Azure SQL Databases included in a single elastic pool
Explanation
As the usage of the three databases is distributed across different times, an elastic pool should be used to maximize the capacity available to each database when required. An Azure SQL Data Warehouse or Azure SQL VMs might provide better overall performance, but would not minimize costs, which is a stated requirement.
The following JSON is an example of a Parquet dataset on Azure Blob Storage.
{
    "name": "ParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "compressionCodec": "LZO"
        }
    }
}
Which of the JSON dataset properties is configured incorrectly?
"type": "Parquet"
"schema": [ < physical schema, optional, retrievable during authoring > ],
"type": "AzureBlobStorageLocation",
"compressionCodec": "LZO"
"compressionCodec": "LZO"
Explanation
The compressionCodec property specifies the compression codec to use when writing to Parquet files. When reading from Parquet files, Data Factory automatically determines the compression codec based on the file metadata.
Supported types are "none", "gzip", "snappy" (default), and "lzo".
Note that the Copy activity currently does not support LZO when reading or writing Parquet files.
The following are two statements about Partitioning tables in a dedicated SQL pool: Statement 1: A query that applies a filter to partitioned data can limit the scan to only the qualifying partitions. Statement 2: Partitioning is only supported on hash distributed data. Which statements, if any, are correct?
Only statement 1 is correct.
Only statement 2 is correct.
Both statements 1 and 2 are correct.
Both statements 1 and 2 are incorrect.
Only statement 1 is correct.
Explanation
The primary benefit of partitioning in a dedicated SQL pool is to improve the efficiency and performance of loading data by use of partition deletion, switching and merging. In most cases, data is partitioned on a date column that is closely tied to the order in which the data is loaded into the SQL pool.
One of the greatest benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance.
Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table might contain just data for the past 36 months. At the end of every month, the oldest month of sales data is deleted from the table. This data could be deleted by using a delete statement to delete the data for the oldest month. Regarding the statements: a query that filters on the partitioning column can limit the scan to only the qualifying partitions (partition elimination), so statement 1 is true; partitioning is supported on all distribution types, including both hash-distributed and round-robin tables, so statement 2 is false.
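A minimal sketch of a partitioned table definition in a dedicated SQL pool is shown below; the table name, column names, and boundary values are hypothetical. It deliberately uses round-robin distribution to illustrate that partitioning is not limited to hash-distributed tables:

CREATE TABLE dbo.FactSales
(
    SaleDate  date           NOT NULL,
    Region    nvarchar(20)   NOT NULL,
    Amount    decimal(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- partitioning also works on hash-distributed tables
    PARTITION ( SaleDate RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01') )
);

A query that filters on SaleDate can then scan only the qualifying partitions, which is the partition elimination described in statement 1.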
An electronics company utilizes Azure Data Lake Storage (ADLS) Generation 1 for Big Data Analytics. As part of the data analytics team, your new assignment is to plan and design the migration of ADLS Generation 1 to ADLS Generation 2. Only a small number of existing pipelines are connected to the current data lakes, but your team requires that the migration results in no downtime for any related applications and that the process requires minimal administration. Which of the following migration methods would best meet these requirements?
Lift and Shift
Incremental Copy
Dual Pipeline
Bidirectional Sync
Dual Pipeline
Explanation
The need for no downtime disqualifies Lift and Shift and Incremental Copy from consideration.
Dual Pipelines are ideal in situations where your workloads and applications can’t afford any downtime, and you can ingest into both storage accounts.
Bidirectional sync is ideal for complex scenarios that involve a large number of pipelines and dependencies where a phased approach might make more sense. However, bidirectional sync can require detailed planning and considerable administrative effort, so this leaves Dual Pipeline as the most suitable pattern.
A colleague has loaded a CSV into an Azure Databricks workspace that you share. You need to read the data from the CSV to process the data using Scala. The CSV holds the column names as the first row of the file. Which piece of code should you use to read the data, for which you do not have a preset schema?
val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("state_data.csv")
val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .load("/FileStore/tables/state_data.csv")
val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/state_data.csv")
val sparkDF = spark.read.format("csv")
  .option("inferSchema", "true")
  .load("/FileStore/tables/state_data.csv")
val sparkDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/state_data.csv")
Explanation
To load the CSV you should specify the following:
File type = spark.read.format("csv")
First row as header = .option("header", "true")
Infer the column types from the data = .option("inferSchema", "true")
Full path of the file in the FileStore = .load("/FileStore/tables/state_data.csv")
You are the data engineer for a very large e-commerce website with a global user base. You have a data pipeline that gathers clickstream data with Azure Event Hubs and sends it to Azure Stream Analytics. You need to design a query that will: aggregate the number of clicks into distinct periods of time; divide the counts based on the user region; and count each click once and only once. Which of the following functions should be used in your query?
a tumbling window function
a session window function
a sliding window function
a hopping window function
a tumbling window function
Explanation
Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them. The key differentiators of tumbling windows are that they repeat, do not overlap, and an event cannot belong to more than one tumbling window.
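A Stream Analytics query meeting these requirements might look like the sketch below; the input name, timestamp column, and Region field are assumptions used for illustration:

SELECT
    Region,
    COUNT(*) AS ClickCount,
    System.Timestamp() AS WindowEnd
FROM
    ClickStreamInput TIMESTAMP BY ClickTime
GROUP BY
    Region,
    TumblingWindow(second, 10)  -- contiguous, non-overlapping 10-second windows

Because tumbling windows do not overlap, each click falls into exactly one window and is therefore counted once and only once.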