Az Cloud Academy Certification test #2 Flashcards
An HDInsight cluster with eight D4v2 nodes is running two apps including the one you are going to run. You determined that 6 GB of executor-memory will be sufficient for an I/O intensive job. Since this is an I/O intensive job, you set the number of cores for each executor to 4. What are the memory constraint and CPU constraint for this job?
Memory constraint = 16 (rounded) and CPU constraint = 16
Memory constraint = 8 (rounded) and CPU constraint = 16
Memory constraint = 16 (rounded) and CPU constraint = 8
Memory constraint = 8 (rounded) and CPU constraint = 8
A: Memory constraint = 16 (rounded) and CPU constraint = 16
Explanation
Memory constraint and CPU constraint can be calculated based on the values given in the question.
For the memory constraint, the following formula can be used:
Memory constraint = (total YARN memory / executor memory) / # of apps
Here, we first calculate the total YARN memory, which can be found using the following equation:
Total YARN memory = nodes in cluster * YARN memory per node
In this question, total YARN memory = 8 nodes * 25 GB per node (the approximate YARN memory available on a D4v2 node) = 200 GB, so based on the equation above, the memory constraint = (200 GB / 6 GB) / 2 = 16 (rounded).
For CPU constraint:
CPU constraint = (total YARN cores / # of cores per executor) / # of apps
Here, we need to calculate the total YARN cores first, using the following formula:
YARN cores = nodes in cluster * # of cores per node * 2
YARN cores = 8 nodes * 8 cores per D4v2 node * 2 = 128
So, based on the equation specified earlier, the CPU constraint = (128 / 4) / 2 = 16.
You are reviewing your existing Azure Synapse Analytics hash-distributed tables and considering ways to optimize querying costs. Tables with the highest cost include multiple fact tables that are often joined in table queries. How could you update the table distribution column to optimize query costs in this scenario?
Distribute the tables on the same join column
Choose a column that’s often in GROUP BY clauses
Choose a distribution key that will increase the number of partitions within the fact tables
Enable FORCE PLAN for these fact tables
Distribute the tables on the same join column
Explanation
Ideally, you should choose a distribution column that will spread the rows fairly evenly among the data distributions. If too many of the rows are on the same data distribution, then it will be a hot spot that reduces the advantages of Synapse Analytics’ massively parallel architecture. For example, if you were to choose a date column for the hash key, then all of the rows for a particular date would end up on the same distribution. So a query on that date would only run on that one distribution, which would make the query take much longer than if it were to run across all 60 distributions in parallel.
Here are some characteristics of a good distribution column:
- It has many unique values, so the rows will be spread out over the 60 distributions.
- It's frequently used in JOINs. If two fact tables are often joined together, then distribute both tables on the same join column. That way, rows from the two tables that have the same value in the join column will be stored on the same distribution, so they can be joined together easily. If you don't have frequent joins, then choose a column that's often in GROUP BY clauses. A sketch of this pattern is shown after this list.
- It's not used in WHERE clauses, as this would limit query matches to only a few distributions.
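As a minimal sketch (the table and column names are hypothetical), distributing two frequently joined fact tables on the same join column in a dedicated SQL pool might look like this:

CREATE TABLE dbo.FactSales
(
    OrderDateKey int   NOT NULL,
    CustomerKey  int   NOT NULL,
    ProductKey   int   NOT NULL,
    SalesAmount  money NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),   -- same hash column as FactInventory below
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.FactInventory
(
    ProductKey   int NOT NULL,
    WarehouseKey int NOT NULL,
    UnitsInStock int NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),   -- joins on ProductKey can stay distribution-local
    CLUSTERED COLUMNSTORE INDEX
);

Because both tables hash on ProductKey, rows with the same ProductKey value land on the same distribution, so a join on that column avoids data movement.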
You have to choose a real-time stream processing solution in Azure that meets the following requirements: The solution must support Event Hubs, IoT Hub, Kafka, and HDFS as input data sources. The solution must support custom code as an input data format. The solution needs built-in support for temporal processing. Cosmos DB should be a supported sink. Which is the most suitable solution?
Azure Stream Analytics
Apache Spark in Azure Databricks
HDInsight with Storm
Azure Functions
Apache Spark in Azure Databricks
Explanation
Apache Spark in Azure Databricks supports inputs from Event Hubs, IoT Hub, Kafka, HDFS, Storage Blobs, and Azure Data Lake Store. It supports data in any format with custom code. It has built-in support for temporal processing, and its supported sinks include HDFS, Kafka, Storage Blobs, Azure Data Lake Store, and Cosmos DB.
You are configuring data security settings for separate Azure SQL databases. Database A stores social security numbers, which you want to prevent any users or applications from viewing. The social security numbers appear in one column within a single table of Database A. Database B stores credit card information, including credit card numbers, which only privileged database administrators should be able to see. The credit card numbers appear in columns within several tables in Database B. How should you configure the data encryption settings for these databases to meet these requirements?
Enable ‘Always Encrypted’ for Database A, and Dynamic Data Masking (DDM) for Database B.
Enable ‘Always Encrypted’ for Database A and Database B.
Enable Dynamic Data Masking (DDM) for Database A, and ‘Always Encrypted’ for Database B.
Enable Dynamic Data Masking (DDM) for Database A and Database B.
Enable ‘Always Encrypted’ for Database A, and Dynamic Data Masking (DDM) for Database B.
Explanation
‘Always Encrypted’ prevents any users or applications from viewing or decrypting data, so in cases where data should be stored but never accessed by anyone except the customer, this feature should be enabled.
Dynamic Data Masking allows only privileged users to view specific data.
How often the data appears within a database does not affect which encryption feature you enable, only how you apply it, so it is not a factor in answering this question.
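As a rough sketch of how each feature is applied (the table, column, key, and role names are hypothetical, and the Always Encrypted column encryption key CEK_Auto1 is assumed to already exist):

-- Database A: protect social security numbers with Always Encrypted.
CREATE TABLE dbo.Customers
(
    CustomerId int IDENTITY(1,1) PRIMARY KEY,
    SSN char(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (
            COLUMN_ENCRYPTION_KEY = CEK_Auto1,               -- assumed to exist
            ENCRYPTION_TYPE = DETERMINISTIC,
            ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
        ) NOT NULL
);

-- Database B: mask credit card numbers with Dynamic Data Masking.
ALTER TABLE dbo.Payments
    ALTER COLUMN CreditCardNumber ADD MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX-",4)');

-- Privileged administrators can be granted the right to see unmasked values.
GRANT UNMASK TO DbaAdminRole;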
Your company is being audited, and an external accountant needs access to review a blob container in the Blob service within one specific Azure storage account. You currently use Azure Active Directory to control access to the blob storage resources in question. However, you have been told you need to provide the accountant with immediate access to review the blob container in the storage account without any further information. How can you provide necessary access, but also limit it to the container in question?
Provide the accountant with read-only access to the specific Azure Blob container with a service-level shared access signature token set to expire at the end of the business day. Specify that the HTTPS protocol is required to accept requests.
Assign the accountant a guest role in Azure Active Directory with read-only access to the specific Azure Blob storage service in the Azure Storage account.
Provide the accountant with read-only access to the specific Azure Blob container with a user-delegation shared access signature token to expire at the end of the business day. Allow all read requests but limit write requests to LIST and GET. Specify the HTTPS protocol is required to accept requests.
Provide the accountant with contributor role access to the storage account using Azure AD role-based access control (RBAC).
Provide the accountant with read-only access to the specific Azure Blob container with a service-level shared access signature token set to expire at the end of the business day. Specify that the HTTPS protocol is required to accept requests.
Explanation
In this case, Azure Storage’s Shared Access Signature (SAS) is the best tool to provide limited, authorized access to the necessary blob resources. Remember, SAS allows two levels of access: service-level, which limits access to one type of storage within the Azure storage account, such as Blob, Table, Queue or File storage, and account level, which provides access to all storage types in a single account. The service level also allows you to limit access to specific containers, or even specific blobs, and control the actions that can be performed on the blobs by selecting approved common permission types such as read, write, list, or process.
You cannot provide a user delegation SAS in this case because you do not know whether the accountant has Azure AD credentials, which are required for this type of SAS.
You want to modify user-defined functions using T-SQL. The functions do not require a specific user-defined type, and you have ALTER permission on the schema. However, your ALTER FUNCTION requests fail. Which of the following is the most likely cause?
You are trying to change a scalar-valued function to a table-valued function.
You must have ALTER permission on both the schema and on the function.
You need EXECUTE permission on the user-defined type.
The request altered related dependent functions and stored procedures.
You are trying to change a scalar-valued function to a table-valued function.
Explanation
You can modify user-defined functions in SQL Server by using SQL Server Management Studio or Transact-SQL. Modifying a user-defined function in this way will not change the function's permissions, nor will it affect any dependent functions, stored procedures, or triggers.
Changing a user-defined function requires ALTER permission on the function or on the schema, not both. If the function specifies a user-defined type, you need EXECUTE permission on that type, but in this scenario EXECUTE permission is not required.
When you use ALTER FUNCTION, there are certain limitations; see the sketch after the list below.
ALTER FUNCTION cannot be used to perform any of the following actions:
Change a scalar-valued function to a table-valued function, or vice versa.
Change an inline function to a multistatement function, or vice versa.
Change a Transact-SQL function to a CLR function, or vice versa.
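As a minimal sketch (the function name and logic are hypothetical), ALTER FUNCTION can change the body of an existing scalar-valued function, but it could not be used to turn this function into a table-valued function; that would require dropping and recreating it:

-- Assumes dbo.ufn_OrderTotalWithTax already exists as a scalar-valued function.
ALTER FUNCTION dbo.ufn_OrderTotalWithTax (@Subtotal money, @TaxRate decimal(5, 4))
RETURNS money
AS
BEGIN
    -- Only the body changes; the function remains scalar-valued,
    -- and its existing permissions are preserved.
    RETURN @Subtotal * (1 + @TaxRate);
END;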
An electronics company utilizes Azure Data Lake Storage (ADLS) Generation 1 for Big Data Analytics. Your new assignment is to plan and design the migration of ADLS Generation 1 to ADLS Generation 2. You have decided to use the “incremental copy” migration pattern. Your team has drafted the steps involved in this migration pattern. They are as follows:
1 - Incrementally copy new data from Gen1.
2 - Start moving data from Gen1 to Gen2.
3 - Decommission Gen1.
4 - Point workloads to Gen2.
However, you find that the order of steps is incorrect. Which choice lists the steps in the correct order?
2 - 1 - 4 - 3
2 - 4 - 1 - 3
3 - 1 - 2 - 4
1 - 3 - 2 - 4
2 - 1 - 4 - 3
Explanation
For the incremental copy pattern, downtime is expected only during the cutover period, so decommissioning will be the last step. Also, workloads will be pointed to Gen2 right before this decommissioning. The steps involved are given below in the correct order.
Start moving data from Gen1 to Gen2.
Incrementally copy new data from Gen1.
After all of the data is copied, stop all writes to Gen1, and point workloads to Gen2.
Decommission Gen1.
A dedicated SQL pool containing user information is accidentally dropped while it is in a paused state. Upon investigation, you find that no user-defined restore point was created before the SQL pool was dropped. As a result of this accident, which statement regarding a snapshot of the dedicated SQL pool is correct?
Azure Synapse Analytics creates a snapshot of the SQL pool with all restore points for the last 14 days.
Azure Synapse Analytics creates a final snapshot automatically before drop action.
Azure Synapse Analytics creates an automatic snapshot with a default restore point of the date the SQL pool is created.
Azure Synapse Analytics does not create a final snapshot.
Azure Synapse Analytics does not create a final snapshot.
Explanation
The SQL pool was in a paused state during this action. Usually, when you drop a dedicated SQL pool, a final snapshot is created and saved for seven days. You can restore the dedicated SQL pool to the final restore point created at deletion. If the dedicated SQL pool is dropped in a paused state, no snapshot is taken. In this scenario, make sure to create a user-defined restore point before dropping the dedicated SQL pool.
Which of the following statements regarding Azure Stream Analytics’ event ordering feature to manage late-arrived data is incorrect?
Early events are sent to the output if the sender’s clock is running too fast.
The System.Timestamp value is altered during event ingestion for very old events.
If input events are infrequent and sparse, the output can be delayed by that amount of time.
The event ordering feature allows Azure Stream Analytics to process data from any historical date.
The event ordering feature allows Azure Stream Analytics to process data from any historical date.
Explanation
While old events may seem harmless at first, they may be dropped because of the application of the late arrival tolerance. If the events are too old, the System.Timestamp value is altered during event ingestion. Due to this behavior, Azure Stream Analytics is currently better suited for near-real-time event processing scenarios than for historical event processing scenarios. You can set the 'Events that arrive late' setting to the largest possible value (20 days) to work around this behavior in some cases. So, the statement that the event ordering feature allows Azure Stream Analytics to process data from any historical date is incorrect.
You are outlining the disaster recovery plan for your Azure SQL databases using active geo-replication. You plan to deploy your primary and secondary databases in different regions to ensure greater availability in the event of a regional failure. Both your primary and secondary databases' firewalls should allow the same client IP address ranges, so that if the primary database fails, the re-routed client requests will successfully reach the secondary database. However, you do not want to allow more traffic than necessary to these or other databases in your production environment. Which design choice below is recommended by Azure to ensure client requests are successfully received in the event of an unplanned failover?
Make sure your database-level firewall’s IP address rules are the same for your primary and secondary databases.
Make sure your server-level firewall’s IP address rules and database-level firewall’s IP address rules are the same for your primary and secondary databases.
Make sure your server-level firewall’s IP address rules and VNet rules are the same for your primary and secondary databases.
Make sure your database-level firewall’s IP address rules and VNet rules are the same for your primary and secondary databases.
Make sure your database-level firewall’s IP address rules are the same for your primary and secondary databases.
Explanation
The key concepts for this question are how databases are hosted on servers within Azure SQL database, understanding the different types of firewalls available for Azure SQL databases, and how requests are handled by the Azure SQL Database service.
First, a single Azure SQL Database server can host multiple separate databases, which helps explain the different kinds of database firewalls. There are server-level firewalls, database-level firewalls, and finally VNet rules, which are also server-level. VNet rules allow requests from specific subnets within a virtual network to reach your databases. Server-level firewall rules allow or deny requests to all databases within a single Azure SQL Database server. Database-level firewall rules allow or deny requests to a single database within a single Azure SQL Database server.
For example, an Azure SQL Database server may have one set of VNet rules allowing access from other servers in different subnets in its virtual network. It can also have one set of server-level firewall rules, allowing access to all 5 databases the server hosts, in this example. It can then have five separate sets of database-level firewall rules, one for each database that is hosted on the single server in this example.
The final process to understand is how requests are handled. When an Azure SQL database receives a request, the request's IP address is evaluated against the database-level IP address rules first, then against the server-level firewall's IP address rules. If the IP address is allowed by the database-level firewall, the request successfully reaches its specific database. Only if it fails is it then reviewed against the server-level firewall's IP address rules. A short sketch of creating matching database-level rules on both databases follows.
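As a minimal sketch (the rule name and IP range are hypothetical), the same database-level firewall rule can be created with T-SQL in the user database on both the primary and the secondary, so a failed-over client is still allowed through:

-- Run in the user database on BOTH the primary and the secondary database.
EXECUTE sp_set_database_firewall_rule
    @name = N'ClientAppRange',            -- hypothetical rule name
    @start_ip_address = '203.0.113.10',   -- hypothetical client IP range start
    @end_ip_address = '203.0.113.20';     -- hypothetical client IP range end

-- List the database-level rules to verify they match on both databases.
SELECT name, start_ip_address, end_ip_address
FROM sys.database_firewall_rules;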
A survey team collects data from a particular tollgate and uses Azure Stream Analytics for data analysis. Here is the sample input:

╔═══════════════╤═══════╤══════════════════════════════╗
║ License_plate │ Make  │ Time                         ║
╠═══════════════╪═══════╪══════════════════════════════╣
║ DXE 5291      │ Make1 │ 2015-07-27T00:00:05.0000000Z ║
╟───────────────┼───────┼──────────────────────────────╢
║ YZK 5704      │ Make3 │ 2015-07-27T00:02:17.0000000Z ║
╟───────────────┼───────┼──────────────────────────────╢
║ RMV 8282      │ Make1 │ 2015-07-27T00:05:01.0000000Z ║
╟───────────────┼───────┼──────────────────────────────╢
║ YHN 6970      │ Make2 │ 2015-07-27T00:06:00.0000000Z ║
╟───────────────┼───────┼──────────────────────────────╢
║ VFE 1616      │ Make2 │ 2015-07-27T00:09:31.0000000Z ║
╟───────────────┼───────┼──────────────────────────────╢
║ QYF 9358      │ Make1 │ 2015-07-27T00:12:02.0000000Z ║
╟───────────────┼───────┼──────────────────────────────╢
║ MDR 6128      │ Make4 │ 2015-07-27T00:13:45.0000000Z ║
╚═══════════════╧═══════╧══════════════════════════════╝

Which of the following SQL queries will give an output every time two consecutive cars with the same make pay the toll within a period of 90 seconds?
SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
LAG(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
LAG(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
LAG(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make
SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
ISFIRST(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
ISFIRST(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
ISFIRST(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make
SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
tumblingwindow(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
tumblingwindow(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
tumblingwindow(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make
SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
hoppingwindow(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
hoppingwindow(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
hoppingwindow(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make
SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
LAG(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
LAG(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
LAG(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make
Explanation
Correlating events in the same stream can be done by looking at past events using the LAG function. The LAG function can look into the input stream one event back and retrieve the Make value, comparing that with the Make value of the current event.
A data architect is currently designing an optimized Spark on HDInsight cluster that will process data stored in Azure Data Lake Storage in the North Central US region. The design should optimize parallel processing as much as possible. Which design choices would maximize parallel processing in an HDInsight cluster? (Choose 2 answers)
Use large VM instances to host worker nodes.
Use small VM instances to host worker nodes.
Opt for the lowest number of YARN containers per node.
Opt for the highest possible number of cores per YARN container.
Opt for the highest possible number of cores per YARN container.
Explanation
Azure documentation outlines how to optimize parallel processing and overall performance in an HDInsight cluster:
In the physical layer, a larger cluster will enable you to run more YARN containers.
Use smaller YARN containers. Reduce the size of each YARN container to create more containers with the same amount of resources.
Depending on your workload, there will always be a minimum YARN container size that is needed. If you pick too small a container, your jobs will run into out-of-memory issues.
Increase cores per YARN container. Increase the number of cores allocated to each container to increase the number of parallel tasks that run in each container. This works for applications like Spark which run multiple tasks per container.
A data engineer has two tables of structured data that need to be stored in Azure Synapse Analytics. The Sales table consists of 3 billion rows and 4 columns. The three main columns used for analysis are OrderDate, CustomerReference, and ProductID. Approximately 10 million rows will be added each day. The Products table will hold product information such as color, size, category, subcategory, list price, manufacturing cost, and supplier. Most of the analytics would require joins between the two tables. How should the distribution of the tables be set up within Azure Synapse Analytics?
Sales – hash (OrderDate), Products – round_robin
Sales – replicated, Products – round_robin
Sales – hash (Productid), Products – replicated
Sales – hash (Productid), Products – hash (Productid)
Sales – hash (Productid), Products – hash (Productid)
Explanation
As joins between the two tables are likely in the queries, it is better for performance to distribute both tables in the same way, using a hash of the join key (ProductID). A sketch of such a join is shown below.
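As an illustrative sketch (assuming both tables are hash-distributed on ProductID per the chosen answer, and that SalesAmount and Category are hypothetical columns), a typical analytic join can then be satisfied without shuffling rows between distributions; prefixing the statement with EXPLAIN in a dedicated SQL pool should show a plan without a SHUFFLE_MOVE step for this join:

SELECT
    p.Category,
    SUM(s.SalesAmount) AS TotalSales      -- SalesAmount is a hypothetical measure column
FROM dbo.Sales AS s
JOIN dbo.Products AS p
    ON s.ProductID = p.ProductID          -- join column matches the hash distribution column
GROUP BY p.Category;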
A Data Engineer is given a set of 10,000 CSV documents stored in Azure Data Lake that hold rows of historic sales data, which needs to be accessed from Azure Synapse Analytics in a one-off analysis process. What is the preferred method to access the sales rows?
Import the sales rows in the Azure SQL Database using Azure Data Factory
Import the sales rows in the Azure SQL Database using SQL Integration Services
Access the sales rows using Polybase
Access the sales rows using a linked service
Access the sales rows using Polybase
Explanation
As the CSV data is to be used in a one-off analysis, it does not need to be imported into Azure Synapse Analytics. The data should be read and analyzed from the source, which is Azure Data Lake. In Azure Synapse Analytics, there is no linked service feature equivalent to the linked servers available in Microsoft SQL Server. The most efficient way to access the data is by using a PolyBase external table.
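As a minimal sketch (the storage account, credential, folder path, and column schema are all hypothetical, and a database scoped credential is assumed to already exist), a PolyBase external table over the CSV files might be defined like this in a dedicated SQL pool:

-- External data source pointing at the Azure Data Lake account (hypothetical names).
CREATE EXTERNAL DATA SOURCE HistoricSalesLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://sales@mydatalake.dfs.core.windows.net',
    CREDENTIAL = DataLakeCredential        -- assumed database scoped credential
);

-- File format describing the CSV layout.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- External table over the folder containing the 10,000 CSV files (hypothetical schema).
CREATE EXTERNAL TABLE dbo.HistoricSales
(
    OrderDate date,
    ProductID int,
    Quantity  int,
    Amount    money
)
WITH (
    LOCATION = '/historic-sales/',
    DATA_SOURCE = HistoricSalesLake,
    FILE_FORMAT = CsvFormat
);

-- The external table can then be queried in place, without importing the data.
SELECT COUNT(*) FROM dbo.HistoricSales;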
A Data Engineer is setting up an Azure Data Factory pipeline to move data from an on-premises Microsoft SQL Server database to Azure Synapse Analytics. The source table will be loaded by an internal process between 1:00 am and 3:00 am. The loading of the data should be started at 3:30 am, and the transfer should be performed in the most efficient manner. What object types will be required to set up this Azure Data Factory pipeline?
Azure Data Lake Storage, Linked Service (x2), Polybase, Dataset (x2), Pipeline, Trigger
Linked Service, Dataset, Pipeline, Trigger
Linked Service (x2), Polybase, Dataset (x2), Pipeline (x2)
Azure Data Lake Storage, Linked Service (x2), Polybase, Dataset (x2), Pipeline
Azure Data Lake Storage, Linked Service (x2), Polybase, Dataset (x2), Pipeline, Trigger
Explanation
The most efficient way to transfer this data is to use PolyBase within the Copy Data activity that moves data from Microsoft SQL Server to Azure Synapse Analytics. The Copy Data activity will use a data lake as staging storage to perform the load. The objects to be created would be:
Linked Service to Microsoft SQL Server
Linked Service to Azure Synapse Analytics
Dataset to source table (linked service = Microsoft SQL Server)
Dataset to destination table (linked service = Azure Synapse Analytics)
Azure Data Lake Storage connection
Pipeline holding the Copy Data process
Trigger to start the import at 3:30 am