Az Cloud Academy Certification test #2 Flashcards

1
Q

An HDInsight cluster with eight D4v2 nodes is running two apps including the one you are going to run. You determined that 6 GB of executor-memory will be sufficient for an I/O intensive job. Since this is an I/O intensive job, you set the number of cores for each executor to 4. What are the memory constraint and CPU constraint for this job?
Memory constraint = 16 (rounded) and CPU constraint = 16
Memory constraint = 8 (rounded) and CPU constraint = 16
Memory constraint = 16 (rounded) and CPU constraint = 8
Memory constraint = 8 (rounded) and CPU constraint = 8

A

A: Memory constraint = 16 (rounded) and CPU constraint = 16

Explanation
Memory constraint and CPU constraint can be calculated based on the values given in the question.

For memory constraint the following formula can be used:

Memory constraint = (total YARN memory / executor memory) / # of apps

But first, we need to calculate the total YARN memory, which can be found by using the following equation:

Total YARN memory = nodes * YARN memory per node

So, in this question, total YARN memory = 8 nodes * 25 GB = 200 GB. Based on the equation above, memory constraint = (200 GB / 6 GB) / 2 = 16 (rounded).

For CPU constraint:

CPU constraint = (total YARN cores / # of cores per executor) / # of apps

But first, we need to calculate the total YARN cores by using the following formula:

YARN cores = nodes in cluster * # of cores per node * 2

YARN cores = 8 nodes * 8 cores per D4v2 * 2 = 128

So based on the equation specified earlier, the CPU constraint = (128/4)/2=16.
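
A quick way to sanity-check this arithmetic is to run it as plain T-SQL (a throwaway sketch; the variable names and the 25 GB / 8-core figures are simply the values quoted in the explanation above):

DECLARE @nodes int = 8,
        @yarnMemoryPerNodeGB int = 25,
        @executorMemoryGB int = 6,
        @coresPerNode int = 8,
        @coresPerExecutor int = 4,
        @apps int = 2;

SELECT
    -- (8 * 25 / 6) / 2 = 16.67, rounded down to 16
    FLOOR((@nodes * @yarnMemoryPerNodeGB) / (@executorMemoryGB * 1.0) / @apps) AS MemoryConstraint,
    -- (8 * 8 * 2 / 4) / 2 = 16
    (@nodes * @coresPerNode * 2) / @coresPerExecutor / @apps AS CpuConstraint;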

2
Q

You are reviewing your existing Azure Synapse Analytics hash-distributed tables and considering ways to optimize querying costs. Tables with the highest cost include multiple fact tables that are often joined in table queries. How could you update the table distribution column to optimize query costs in this scenario?

Distribute the tables on the same join column

Choose a column that’s often in GROUP BY clauses

Choose a distribution key that will increase the number of partitions within the fact tables

Enable FORCE PLAN for these fact tables

A

Distribute the tables on the same join column

Explanation
Ideally, you should choose a distribution column that will spread the rows fairly evenly among the data distributions. If too many of the rows are on the same data distribution, then it will be a hot spot that reduces the advantages of Synapse Analytics’ massively parallel architecture. For example, if you were to choose a date column for the hash key, then all of the rows for a particular date would end up on the same distribution. So a query on that date would only run on that one distribution, which would make the query take much longer than if it were to run across all 60 distributions in parallel.

Here are some characteristics of a good distribution column. It has many unique values so the rows will be spread out over the 60 distributions. It’s frequently used in JOINs. If two fact tables are often joined together, then distribute both of the tables on the same join column. That way, rows from the two tables that have the same value in the join column will be stored on the same distribution, so they can be joined together easily. If you don’t have frequent joins, then choose a column that’s often in GROUP BY clauses. It’s not used in WHERE clauses, as this would limit query matches to only a few distributions.

3
Q

You have to choose a real-time stream processing solution in Azure that meets the following requirements: The solution must support Event Hubs, IoT Hub, Kafka, and HDFS as input data sources. The solution must support custom code as an input data format. The solution needs built-in support for temporal processing. Cosmos DB should be a supported sink. Which is the most suitable solution?

Azure Stream Analytics
Apache Spark in Azure Databricks
HDInsight with Storm
Azure Functions

A

Apache Spark in Azure Databricks

Explanation
Apache Spark in Azure Databricks supports inputs from Event Hubs, IoT Hub, Kafka, HDFS, Storage Blobs, Azure Data Lake Store. It supports data in any format with custom code. It has built-in support for temporal processing and supported sinks are HDFS, Kafka, Storage Blobs, Azure Data Lake Store, Cosmos DB.

4
Q

You are configuring data security settings for separate Azure SQL databases. Database A stores social security numbers, which you want to prevent any users or applications from viewing. The social security numbers appear in one column within a single table of Database A. Database B stores credit card information, including credit card numbers, which only privileged database administrators should be able to see. The credit card numbers appear in columns within several tables in Database B. How should you configure the data encryption settings for these databases to meet these requirements?

Enable ‘Always Encrypted’ for Database A, and Dynamic Data Masking (DDM) for Database B.

Enable ‘Always Encrypted’ for Database A and Database B.

Enable Dynamic Data Masking (DDM) for Database A, and ‘Always Encrypted’ for Database B.

Enable Dynamic Data Masking (DDM) for Database A and Database B.

A

Enable ‘Always Encrypted’ for Database A, and Dynamic Data Masking (DDM) for Database B.

Explanation
‘Always Encrypted’ prevents any users or applications from viewing or decrypting data, so in cases where data should be stored but never accessed by anyone except the customer, this feature should be enabled.

Dynamic Data Masking allows only privileged users to view specific data.

How often data appears within a database would not affect the encryption feature you enable, only how you apply it, which is not a factor in answering this question.
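
As a rough illustration of the Database B requirement, a credit card column can be masked in place with Dynamic Data Masking and the UNMASK permission granted only to the privileged administrators. The table, column, and user names below are hypothetical:

-- Hypothetical table and column: show only the last four digits to non-privileged users
ALTER TABLE dbo.Payments
ALTER COLUMN CreditCardNumber ADD MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX-",4)');

-- Only the privileged database administrators see the unmasked values
GRANT UNMASK TO CreditCardAdmins;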

5
Q

Your company is being audited, and an external accountant needs access to review a blob container in the Blob service within one specific Azure storage account. You currently use Azure Active Directory to control access to the blob storage resources in question. However, you have been told you need to provide the accountant with immediate access to review the blob container in the storage account without any further information. How can you provide necessary access, but also limit it to the container in question?

Provide the accountant with read-only access to the specific Azure Blob container with a service-level shared access signature token to expire at the end of the business day. Specify the HTTPS protocol is required to accept requests.

Assign the accountant a guest role in Azure Active Directory with read-only access to the specific Azure Blob storage service in the Azure Storage account.

Provide the accountant with read-only access to the specific Azure Blob container with a user-delegation shared access signature token to expire at the end of the business day. Allow all read requests but limit write requests to LIST and GET. Specify the HTTPS protocol is required to accept requests.

Provide the accountant with contributor role access to the storage account using Azure AD role-based access control (RBAC).

A

Provide the accountant with read-only access to the specific Azure Blob container with a service-level shared access signature token to expire at the end of the business day. Specify the HTTPS protocol is required to accept requests.

Explanation
In this case, Azure Storage’s Shared Access Signature (SAS) is the best tool to provide limited, authorized access to the necessary blob resources. Remember, SAS allows two levels of access: service-level, which limits access to one type of storage within the Azure storage account, such as Blob, Table, Queue or File storage, and account level, which provides access to all storage types in a single account. The service level also allows you to limit access to specific containers, or even specific blobs, and control the actions that can be performed on the blobs by selecting approved common permission types such as read, write, list, or process.

You cannot provide a user delegation SAS in this case because you do not know whether the accountant has Azure AD credentials, which are required for this type of SAS.

6
Q

You want to modify user-defined functions using T-SQL. The functions do not require a specific user-defined type, and you have ALTER permission on the schema. However, your ALTER FUNCTION requests fail. What is the most likely cause?

You are trying to change a scalar-valued function to a table-valued function.

You must have ALTER permission on both the schema and on the function.

You need EXECUTE permission on the user-defined type.

The request altered related dependent functions and stored procedures.

A

You are trying to change a scalar-valued function to a table-valued function.

Explanation
You can modify user-defined functions in SQL Server by using SQL Server Management Studio or Transact-SQL. Modifying user-defined functions as described below will not change the functions’ permissions, nor will it affect any dependent functions, stored procedures, or triggers.

Changing a user-defined function requires ALTER permission on the function or on the schema, not both. If the function specifies a user-defined type, you need EXECUTE permission on the type, but in this scenario EXECUTE permission is not required.

When you use ALTER FUNCTION, there are certain limitations.

ALTER FUNCTION cannot be used to perform any of the following actions:
Change a scalar-valued function to a table-valued function, or vice versa.
Change an inline function to a multistatement function, or vice versa.
Change a Transact-SQL function to a CLR function, or vice-versa.
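
For example, changing the logic of an existing scalar-valued function with ALTER FUNCTION succeeds, but rewriting that same function to return a table has to be done by dropping and re-creating it. The function and table names below are made up for illustration:

-- Allowed: modify the body of an existing scalar-valued function
ALTER FUNCTION dbo.GetOrderTotal (@OrderID int)
RETURNS money
AS
BEGIN
    RETURN (SELECT SUM(LineTotal) FROM dbo.OrderLines WHERE OrderID = @OrderID);
END;

-- Not allowed: ALTER FUNCTION dbo.GetOrderTotal (...) RETURNS TABLE ...
-- A scalar-valued function cannot be changed to a table-valued one; use DROP FUNCTION and CREATE FUNCTION instead.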

7
Q

An electronics company utilizes Azure Data Lake Storage (ADLS) Generation 1 for Big Data Analytics. Your new assignment is to plan and design the migration of ADLS Generation 1 to ADLS Generation 2. You have decided to use the “incremental copy” migration pattern. Your team has drafted the steps involved in this migration pattern. They are as follows:

  1. Incrementally copy new data from Gen1.
  2. Start moving data from Gen1 to Gen2.
  3. Decommission Gen1
  4. Point workloads to Gen2

However, you find that the order of steps is incorrect. Which choice lists the steps in the correct order?

2 - 1 - 4 - 3

2 - 4 - 1 - 3

3 - 1 - 2 - 4

1 - 3 - 2 - 4

A

2 - 1 - 4 - 3

Explanation
For incremental copy patterns, the downtime is expected only during the cutover period. So decommissioning will be the last step. Also, workloads will be pointed to Gen 2 right before this decommissioning. The steps involved are given below in the right order.

Start moving data from Gen1 to Gen2.
Incrementally copy new data from Gen1.
After all of the data is copied, stop all writes to Gen1, and point workloads to Gen2.
Decommission Gen1.

8
Q

A dedicated SQL pool containing user information is accidentally dropped while it is in a paused state. Upon investigation, you find that no user-defined restore point was created before the SQL pool was dropped. As a result of this accident, which statement regarding a snapshot of the dedicated SQL pool is correct?

Azure Synapse Analytics creates a snapshot of the SQL pool with all restore points for the last 14 days.

Azure Synapse Analytics creates a final snapshot automatically before drop action.

Azure Synapse Analytics creates an automatic snapshot with a default restore point of the date the SQL pool is created.

Azure Synapse Analytics does not create a final snapshot.

A

Azure Synapse Analytics does not create a final snapshot.

Explanation
The SQL pool was in a paused state during this action. Usually, when you drop a dedicated SQL pool, a final snapshot is created and saved for seven days. You can restore the dedicated SQL pool to the final restore point created at deletion. If the dedicated SQL pool is dropped in a paused state, no snapshot is taken. In this scenario, make sure to create a user-defined restore point before dropping the dedicated SQL pool.

9
Q

Which of the following statements regarding Azure Stream Analytics’ event ordering feature to manage late-arrived data is incorrect?

Early events are sent to the output if the sender’s clock is running too fast.

The System.Timestamp value is altered during event ingestion for very old events.

If input events are infrequent and sparse, the output can be delayed by that amount of time.

The event ordering feature allows Azure Stream Analytics to process data from any historical date.

A

The event ordering feature allows Azure Stream Analytics to process data from any historical date.

Explanation
While old events may seem harmless at first, because of the application of the late arrival tolerance, old events may be dropped. If the events are too old, the System.Timestamp value is altered during event ingestion. Due to this behavior, Azure Stream Analytics is currently more suited for near-real-time event processing scenarios than for historical event processing scenarios. You can set the ‘Events that arrive late’ setting to the largest possible value (20 days) to work around this behavior in some cases. So the statement that the event ordering feature allows Azure Stream Analytics to process data from any historical date is the incorrect one.

10
Q

You are outlining the disaster recovery plan for your Azure SQL databases using active geo-replication. You plan to deploy your primary and secondary databases in different regions to ensure greater availability in the event of a regional failure. Both your primary and secondary databases’ firewalls should allow the same client IP address ranges, so that in the event the primary database fails, the re-routed client requests will successfully reach the secondary database. However, you do not want to allow more traffic than necessary to these or other databases in your production environment. What design choice below is recommended by Azure to ensure client requests are successfully received in the event of an unplanned failover?

Make sure your database-level firewall’s IP address rules are the same for your primary and secondary databases.

Make sure your server-level firewall’s IP address rules and database-level firewall’s IP address rules are the same for your primary and secondary databases.

Make sure your server-level firewall’s IP address rules and VNet rules are the same for your primary and secondary databases.

Make sure your database-level firewall’s IP address rules and VNet rules are the same for your primary and secondary databases.

A

Make sure your database-level firewall’s IP address rules are the same for your primary and secondary databases.

Explanation
The key concepts for this question are how databases are hosted on servers within Azure SQL database, understanding the different types of firewalls available for Azure SQL databases, and how requests are handled by the Azure SQL Database service.

First, a single Azure SQL Database server can host multiple separate databases. This helps explain the different kinds of database firewalls. There are server-level firewalls, database-level firewalls and finally VNet rules, which are also server-level. VNet rules pertain to requests from different subnets within the same virtual network to reach your database. Server-level firewalls allow or deny requests to all databases within a single Azure SQL Database server. Database-level firewalls allow or deny requests to a single database within a single Azure SQL Database server.

For example, an Azure SQL Database server may have one set of VNet rules allowing access from other servers in different subnets in its virtual network. It can also have one set of server-level firewall rules, allowing access to all 5 databases the server hosts, in this example. It can then have five separate sets of database-level firewall rules, one for each database that is hosted on the single server in this example.

The final process to understand is how requests are handled. When an Azure SQL Database receives a request, the request’s IP address is evaluated against the database-level IP address rules first. If the IP address is allowed by a database-level firewall rule, the request successfully reaches its specific database. Only if it fails is it then reviewed against the server-level firewall’s IP address rules.
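
For example, a database-level IP rule can be created with the sp_set_database_firewall_rule procedure; to follow the recommendation above, the same statement would be run in both the primary and the secondary database so their rules stay identical (the rule name and address range below are placeholders):

-- Run in BOTH the primary and the secondary database so the database-level rules match
EXECUTE sp_set_database_firewall_rule
    @name = N'ClientAppRange',
    @start_ip_address = '203.0.113.0',
    @end_ip_address = '203.0.113.255';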

11
Q

A survey team collects data from a particular tollgate and uses Azure Stream Analytics for data analysis. Here is the sample input:

License_plate | Make  | Time
DXE 5291      | Make1 | 2015-07-27T00:00:05.0000000Z
YZK 5704      | Make3 | 2015-07-27T00:02:17.0000000Z
RMV 8282      | Make1 | 2015-07-27T00:05:01.0000000Z
YHN 6970      | Make2 | 2015-07-27T00:06:00.0000000Z
VFE 1616      | Make2 | 2015-07-27T00:09:31.0000000Z
QYF 9358      | Make1 | 2015-07-27T00:12:02.0000000Z
MDR 6128      | Make4 | 2015-07-27T00:13:45.0000000Z

Which of the following SQL queries will give an output every time two consecutive cars with the same make pay the toll within a period of 90 seconds?

SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
LAG(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
LAG(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
LAG(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make

SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
IsFirst(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
Isfirst(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
Isfirst(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make

SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
tumblingwindow(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
tumblingwindow(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
tumblingwindow(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make

SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
hoppingwindow(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
hoppingwindows(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
hoppingwindows(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make

A

SELECT
Make,
Time,
License_plate AS Current_car_license_plate,
LAG(License_plate, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_license_plate,
LAG(Time, 1) OVER (LIMIT DURATION(second, 90)) AS First_car_time
FROM
Input TIMESTAMP BY Time
WHERE
LAG(Make, 1) OVER (LIMIT DURATION(second, 90)) = Make

Explanation
Correlating events in the same stream can be done by looking at past events using the LAG function. The LAG function can look into the input stream one event back and retrieve the Make value, comparing that with the Make value of the current event.

12
Q

A data architect is currently designing an optimized Spark on HDInsight cluster that will process data stored in Azure Data Lake Storage in the North Central US region. The design should optimize parallel processing as much as possible. Which design choices would maximize parallel processing in an HDInsight cluster? (Choose 2 answers)
Use large VM instances to host worker nodes.

Use small VM instances to host worker nodes.

Opt for the lowest number of YARN containers per node.

Opt for the highest possible number of cores per YARN container.

A

Use large VM instances to host worker nodes.

Opt for the highest possible number of cores per YARN container.

Explanation
Azure documentation outlines how to optimize parallel processing and overall performance in an HDInsight cluster:

In the physical layer, a larger cluster will enable you to run more YARN containers.
Use smaller YARN containers. Reduce the size of each YARN container to create more containers with the same amount of resources.
Depending on your workload, there will always be a minimum YARN container size that is needed. If you pick too small a container, your jobs will run into out-of-memory issues.
Increase cores per YARN container. Increase the number of cores allocated to each container to increase the number of parallel tasks that run in each container. This works for applications like Spark which run multiple tasks per container.

13
Q

A data engineer has two tables of structured data that need to be stored in Azure Synapse Analytics. The Sales table consists of 3 billion rows and 4 columns. The three main columns used for analysis are OrderDate, CustomerReference, and ProductID. Approximately 10 million rows will be added each day. The Products table will hold product information such as color, size, category, subcategory, list price, manufacturing cost, and supplier. Most of the analytics would require joins between the two tables. How should the distribution of the tables be set up within Azure Synapse Analytics?

Sales – hash (OrderDate), Products – round_robin

Sales – replicated, Products – round_robin

Sales – hash (Productid), Products – replicated

Sales – hash (Productid), Products – hash (Productid)

A

Sales – hash (Productid), Products – hash (Productid)

Explanation
As joins are likely in the queries, it is better for performance to distribute both tables in the same way, using a hash of the join key (ProductID).
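
A minimal sketch of what this could look like in the dedicated SQL pool is shown below; the column data types, the fourth Sales column, and the clustered columnstore index are assumptions, since the question does not specify them:

CREATE TABLE dbo.Sales
(
    OrderDate         date NOT NULL,
    CustomerReference int  NOT NULL,
    ProductID         int  NOT NULL,
    Quantity          int  NOT NULL  -- hypothetical fourth column; the question does not name it
)
WITH ( DISTRIBUTION = HASH(ProductID), CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dbo.Products
(
    ProductID int NOT NULL,
    Color     varchar(20),
    Size      varchar(20),
    Category  varchar(50),
    ListPrice money
)
WITH ( DISTRIBUTION = HASH(ProductID), CLUSTERED COLUMNSTORE INDEX );

-- Joins on ProductID can now be resolved within each distribution, avoiding data movement.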

14
Q

A Data Engineer is given a set of 10,000 CSV documents stored in Azure Data Lake that hold rows of historic sales data. The rows need to be accessed from Azure Synapse Analytics for a one-off analysis. What is the preferred method to access the sales rows?

Import the sales rows in the Azure SQL Database using Azure Data Factory

Import the sales rows in the Azure SQL Database using SQL Integration Services

Access the sales rows using Polybase

Access the sales rows using a linked service

A

Access the sales rows using Polybase

Explanation
As the CSV data is to be used in a one-off analysis, it does not need to be imported into Azure Synapse Analytics. The data should be read and analyzed from the source, which is Azure Data Lake. In the Azure environment, there are no linked servers (as there are in Microsoft SQL Server). The most efficient way to access the data is by using a PolyBase external table.
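
A minimal sketch of the PolyBase objects involved, assuming the CSV files sit under a '/historic-sales/' path in the Data Lake; the object names, columns, and storage URL are illustrative, and depending on how the pool authenticates to the Data Lake a database scoped credential may also be needed:

-- External data source pointing at the Data Lake (hypothetical account and container)
CREATE EXTERNAL DATA SOURCE SalesDataLake
WITH ( TYPE = HADOOP, LOCATION = 'abfss://sales@mydatalake.dfs.core.windows.net' );

-- File format describing the CSV layout (header row skipped)
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', FIRST_ROW = 2 ) );

-- The external table reads the files in place; nothing is imported into the pool
CREATE EXTERNAL TABLE dbo.HistoricSales
(
    SaleDate  date,
    ProductID int,
    Amount    money
)
WITH ( LOCATION = '/historic-sales/', DATA_SOURCE = SalesDataLake, FILE_FORMAT = CsvFormat );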

15
Q

A Data Engineer is setting up an Azure Data Factory pipeline to move data from an on-premises Microsoft SQL Server database to Azure Synapse Analytics. The source table will be loaded by an internal process between 1:00 am and 3:00 am. The loading of the data should be started at 3:30 am. The transfer should be performed in the most efficient manner. What object types will be required to set up this Azure Data Factory pipeline?

Azure Data Lake Storage, Linked Service (x2), Polybase, Dataset (x2), Pipeline, Trigger

Linked Service, Dataset, Pipeline, Trigger

Linked Service (x2), Polybase, Dataset (x2), Pipeline (x2)

Azure Data Lake Storage, Linked Service (x2), Polybase, Dataset (x2), Pipeline

A

Azure Data Lake Storage, Linked Service (x2), Polybase, Dataset (x2), Pipeline, Trigger

Explanation
The most efficient way to transfer this data is to use a PolyBase external table within the Copy Data activity that moves the data from Microsoft SQL Server to Azure Synapse Analytics. The copy activity will use a data lake as staging storage to perform the load. The objects to be created would be:

Linked Service to Microsoft SQL Server
Linked Service to Azure Synapse Analytics
Dataset to source table (linked service = Microsoft SQL Server)
Dataset to destination table (linked service = Azure Synapse Analytics)
Azure Data Lake Storage connection
Pipeline holding the Copy Data process
Trigger to start the import at 3:30 am

16
Q

You are designing a data table in Azure Table Storage for optimal performance when handling a large number of read requests and a much smaller number of write requests. As part of your design preparation, you have a list of the most common queries to expect for the table. All queries will be point queries, with exact matches for the partition and row of the desired item. The data includes a userID, which corresponds to a single row in the data table; userLocation, which applies to all users in a specific area and corresponds to multiple rows in the table; and purchaseID, which corresponds to multiple entities in a row. The primary entity in queries is the userID. With this information, which of the following design choices is optimal for scalability and handling requests in this situation?

Use the userID as the partition key with an empty string as the row key.

Use the userLocation as the partition key and the User ID as the row key.

Use the userLocation as the partition key and the purchaseID as the row key.

Use the purchaseID as the partition key and the userID as the row key.

A

Use the userID as the partition key with an empty string as the row key.

Explanation
Designing the userID as the partition key with an empty row key makes the most sense in this situation, where scalability is the leading requirement, the userID is the key property in the table, and all queries are point queries.

The userID as a partition key would create many single-row partitions, which offers the highest scalability and would quickly address queries because point queries are exact, and require no scans across multiple partitions.

17
Q

An electronics company needs to migrate an Azure Data Lake Storage account from ADLS Gen1 to Gen2 for a data analysis solution. The following are the key properties defining the current setup of ADLS Gen1:

Data organization with file and folder support
Ecosystem based on Azure Databricks 3.1 for big data analytics
Encryption of data at rest using a customer-managed key
Traffic accepted only from specific VMs in an integrated Virtual Network
VNet integration utilizes network service endpoint security for authentication to ADLS storage

Which of the above properties requires refactoring or re-architecting to complete the migration successfully? (Choose 2 answers)

Data Organization with file and folder support

Ecosystem based on Azure Databricks 3.1 for big data analytics

Encryption of data at rest using customer managed key

VNet integration utilizes network service endpoint security for authentication to ADLS storage

A

Ecosystem based on Azure Databricks 3.1 for big data analytics

VNet integration utilizes network service endpoint security for authentication to ADLS storage

Explanation
The two configurations that need refactoring/re-architecting are:

Ecosystem based on Azure Databricks 3.1 for big data analytics
VNet integration utilizes network service endpoint security for authentication to ADLS storage
VNet support for ADLS Gen2 comes with private endpoints and service endpoints and does not use network service endpoint security for authentication to ADLS storage.

The minimum Azure Databricks version supported by ADLS Gen2 is 5.1, so the Databricks 3.1 ecosystem needs to be updated as well.

18
Q

You have an Azure Synapse workspace backed by Azure Data Lake Storage (ADLS) Gen2. Files and folders are arranged as follows in the ADLS account: a folder named ‘/webdata/’ contains two subfolders named ‘/month/’ and ‘/week/’. The ‘/webdata/’ folder contains two files named ‘mydata1.csv’ and ‘mydata2.csv’. The ‘/month/’ subfolder contains a file named ‘mydata3.csv’ and the ‘/week/’ subfolder contains a file named ‘mydata4.csv’. You ran the following code to create an external table:

CREATE EXTERNAL TABLE userlocation_external_table (
    decennialTime varchar(20),
    stateName varchar(100),
    countyName varchar(100)
)
WITH ( LOCATION = '/webdata/' )

After some time you query this external table. Which files will be returned?

No files

mydata2.csv and mydata1.csv
mydata3.csv and mydata4.csv
mydata3.csv, mydata4.csv, mydata2.csv and mydata1.csv

A

mydata3.csv, mydata4.csv, mydata2.csv and mydata1.csv

Explanation
In the CREATE statement, the location is the ‘/webdata/’ folder, which makes the query treat all the folders and files under that folder as a single rowset. So all of the files will show up in the result of the query. To run a T-SQL query over a set of files within a folder or set of folders while treating them as a single entity or rowset, provide a path to a folder or a pattern (using wildcards) over a set of files or folders.
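
In a serverless SQL pool the same idea can also be expressed ad hoc with OPENROWSET and a wildcard path, which treats every matching file, including files in subfolders, as one rowset. The storage account and container names below are placeholders:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/webdata/**',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE  -- assumes the CSV files carry a header row
) AS webdata_rows;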

19
Q

You need to transfer large amounts of key-value data to Azure for further processing and analysis, and you are reviewing the range of data storage and transfer methods to find which one best suits your use case. Here are the key factors to consider:

You will be gathering roughly 25 TB of key-value data a month, to be batched in once on a monthly basis.
The source data is gathered from multiple IoT sensors and stored on a local on-premises data drive.
You are managing the data transfer yourself, for a small university study, and have little experience in managing data flow.
Your current network bandwidth is 1 Gbps.
The data will need minimal reformatting before you upload it and perform SQL queries in Azure Synapse Analytics.
You have identified Azure Blob storage as the best destination for the initial upload of raw data.

What service should you use to batch the data into Azure storage, given the above criteria?

Azure Data Factory

Azure Data Box

Azure Storage REST APIs

AzCopy

A

AzCopy

Explanation
The key criteria for this question are the size of the data, the frequency of the batching, the amount of bandwidth, the number of data sources and formats, the final format for analysis, and the existing skillset.

For this reason, AzCopy is the best response because it is best for transferring single types of data, such as key-value data in this case. You can use AzCopy from a Windows or Linux command-line to easily copy data to and from Azure Blob, File, and Table storage with optimal performance. AzCopy supports concurrency and parallelism, and the ability to resume copy operations when interrupted. You can also use AzCopy to copy data from AWS to Azure. For programmatic access, the Microsoft Azure Storage Data Movement Library is the core framework that powers AzCopy. It is provided as a .NET Core library.

Azure Data Factory could be used, but it is less appropriate in this use case because it is most often used for different types of data rather than one type of data. It is a managed service best suited for regularly transferring files between a number of Azure services, on-premises, or a combination of the two. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores. It can process and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.

25 TB of raw data per month is a large amount of data on a periodic basis. The network bandwidth is large enough to use network transfers, so Azure Data Box is not required. The raw data is in one format, being uploaded from a single source, and needs minimal processing, so Azure Data Factory may be too much tooling for a simple task. The end purpose is data analysis, and not to support a running application, so Azure Storage REST APIs are not essential.

20
Q

As a data engineer for a non-profit fundraising service, you are managing the migration of fundraising data to Azure Table storage. The organization manages events for scouting organizations across the United States. Your database organizes multiple scouting troops from different scouting organizations by zip code. Each scouting organization can register a number of scouts, ranging from 200 to 2000 scouts per organization per zip code. Each organization divides registered scouts in a single zip code into troops, with 20-50 scouts each per troop. Each troop is assigned a troop leader. The number of scouting organizations and total scouts varies heavily from zipcode to zipcode. Considering your database design, your top priority is scalability to avoid hotspotting throughout the year. Other key priorities are to allow for efficient queries and bulk transactions as much as possible. Which design for your partition and row key will prioritize scalability first, and allow for efficient queries and bulk transactions as much as possible?

Concatenate the scout name and the troop leader as the partition key, and concatenate the zip code and scouting organization as the row key.

Concatenate the zip code and the scouting organization as the partition key, and use the troop leader as the row key.

Concatenate the zip code and the troop leader name as the partition key, and concatenate the scouting organization and scout name as the row key.

Create separate entities with duplicate information using the zip code as the partition key. Concatenate the troop leader name and scouting organization in the first entity’s row key. Concatenate the troop leader name and scout name in the second entity’s row key.

A

Concatenate the zip code and the troop leader name as the partition key, and concatenate the scouting organization and scout name as the row key.

Explanation
The goal of this design is scalability, efficient queries, and avoiding over-partitioning, which is ideal for bulk transactions in a single partition. If the troop leader were not part of the partition key, every time you had to update the scouts assigned to a troop leader you would cross partition server boundaries.

Using the scout name in the partition key would over-partition, so that is not ideal. Concatenating the zip code and scouting organization as the partition key would create smaller partitions than the zip code or scouting organization alone and allow for bulk transactions, but it would still create very unevenly sized partitions, which would likely lead to hotspotting.

Concatenating the troop leader name and the zip code will create small, relatively even partition sizes to minimize hotspotting, allow for scaling, and still allow some bulk transactions.

Creating duplicate entities is ideal in cases when you are running many queries, and have several dominant queries that you cannot deprioritize. This does not apply in this case.

21
Q

You are configuring an Azure SQL database with geo-replication enabled, and now want to configure Azure SQL Database auditing to monitor database events and ensure compliance. While the majority of databases on the Azure SQL database instance will focus on managing application data, one database will focus on processing environment logs as a way to maximize resource costs. The logs database’s events will be very different from the other databases’ events. The primary issue is auditing event types or categories for a specific database that differ from the rest of the databases on the server. What is the most effective way to enable auditing for this Azure SQL Database?
Enable server-level auditing on the primary database.

Enable database-level auditing on the primary and the secondary database.

Enable server-level auditing for the server and database-level auditing on logs database.

Enable database-level auditing on the primary database.

A

Enable server-level auditing for the server and database-level auditing on logs database.

Explanation
There are mainly two scenarios when both server-level auditing and database-level auditing are enabled:

You want to use a different storage account, retention period, or Log Analytics workspace for a specific database.
You want to audit event types or categories for a specific database that differ from the rest of the databases on the server. For example, you might have table inserts that need to be audited only for a specific database.

With geo-replicated databases, when you enable auditing on the primary database, the secondary database will have an identical auditing policy. It is also possible to set up auditing on the secondary database by enabling auditing on the secondary server, independently from the primary database, but this is not necessary in this case.

22
Q

You are setting up authentication to the Azure Databricks REST API using Azure Active Directory (Azure AD) tokens. You have created the service principal, and an Azure AD access token is associated with it. You are executing the curl command with the resource parameter set to the identifier for the AzureDatabricks login application. Which of the following parameters should also be part of this curl request? (Choose 3 answers)

Tenant ID
Subscription ID
Client ID
Application secret

A

Tenant ID

Client ID

Application secret

Explanation
In this scenario, you will be using the client credentials flow to get an access token with the AzureDatabricks login application as the resource. More details about the client credentials flow can be found in the links attached. Here is the syntax of the curl command, which gives you the expected response including the access token:

curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
-d 'grant_type=client_credentials&client_id=<client-id>&resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d&client_secret=<application-secret>' \
https://login.microsoftonline.com/<tenant-id>/oauth2/token

The curl command used here will include the Tenant ID, application secret, client ID, but not subscription ID.

23
Q

A company needs a storage solution for many different kinds of data from a variety of data sources, such as their Line of Business applications, analytic streams from their web applications, and data captured from IoT sensors. Here are some further requirements:

Unstructured data storage is preferred.
There are no compliance requirements.
The solution should integrate well with Hadoop.

Which of the below Azure services would be best suited for data storage in this case?

Azure Synapse Analytics

Azure Table Storage

Azure Data Lake Store

Azure Cosmos DB

A

Azure Data Lake Store

Explanation
Azure Data Lake Storage is the only solution that meets all of the requirements in this scenario. If a solution is required for unstructured data, then Azure Synapse Analytics (formerly SQL Data Warehouse) is clearly not ideal. Of the remaining choices, neither Cosmos DB nor Table Storage supports Hadoop, so that makes Azure Data Lake Store the correct choice.

24
Q

You have created an Azure Synapse Analytics managed table backed by Parquet in Spark, and you query it from a serverless SQL pool. The following command was used:

CREATE TABLE mytestdb.myparquettable(id int, name string, birthdate date) USING Parquet

Then you add a row to the table with the following values:

id  | name | birthdate
123 | abc  | 01-01-2001

What will be the output of the following query, which is run after a few minutes?

SELECT id FROM mytestdb.dbo.myparquettable WHERE birthdate = '01-01-2001'

ERROR

NULL

123

ABC

A

123

Explanation
This CREATE TABLE command creates the table myparquettable in the database mytestdb. Table names are converted to lowercase. After a short delay, you can see the table in your serverless SQL pool, so the query will function like any other SQL query run on a table, just after a few minutes of delay. The query will return 123 as a result.

25
Q

As your company’s database administrator and owner of a Cosmos DB account, you need to create a new Cosmos DB database to support an application currently being developed. You also need to grant access to a member of your IT staff who will be testing the new application. The developer will need to create containers and modify the Cosmos DB database settings to fine-tune them. Additionally, you will need to create the necessary credentials for the application, which will be hosted on Azure App Service web apps, to connect with the database, and will upload, modify and read data to fulfill expected requests. To simplify the testing process, you would like to create a set of application credentials that persists while the test resources themselves may be continuously created and deleted throughout the development process. You also want to provide access to the IT staff member following general security best practices. How should you proceed?
Create the database using the primary or secondary read-write master key. Assign the IT staff member Account Contributor role through Azure Active Directory. Create a user-assigned managed identity for the application hosted on Azure App Service web apps.

Create the database using the primary read-write master key. Provide access to the IT staff member using the secondary read-write master key. Create a user-assigned managed identity for the application hosted on Azure App Service web apps.

Create the database using the primary or secondary read-write master key. Assign the IT staff member Account Contributor role through Azure Active Directory. Create a system-assigned managed identity for the application hosted on Azure App Service web apps.

Create the database using the primary or secondary read-write master key. Provide access to the IT staff member using the primary read-only master key. Create a system-assigned managed identity for the application hosted on Azure App Service web apps.

A

Create the database using the primary or secondary read-write master key. Assign the IT staff member Account Contributor role through Azure Active Directory. Create a user-assigned managed identity for the application hosted on Azure App Service web apps.

Explanation
The master keys are essentially the root access keys for the Cosmos DB account owner, and can be used to create resources, but should not be shared. Assigning permissions via RBAC is the best course of action in this case, and creating a user-assigned managed identity means the credentials will persist and can be repeatedly assigned to new and different resources in the dev/test environment.

26
Q

What are the 3 main parts of an Azure AD Access Token?

A

Header
Payload
Signature

27
Q

Name 4 claims found in an Azure AD access token and explain what they mean.

A

Aud
Iss
Idp
Roles

The table below shows the claims that are in most ID tokens by default (except where noted). However, your app can use optional claims to request more claims in the ID token. Optional claims can range from the groups claim to information about the user’s name.

aud (String, an App ID GUID): Identifies the intended recipient of the token. In id_tokens, the audience is your app’s Application ID, assigned to your app in the Azure portal. This value should be validated. The token should be rejected if it fails to match your app’s Application ID.

iss (String, an issuer URI): Identifies the issuer, or “authorization server”, that constructs and returns the token. It also identifies the Azure AD tenant for which the user was authenticated. If the token was issued by the v2.0 endpoint, the URI will end in /v2.0. The GUID that indicates that the user is a consumer user from a Microsoft account is 9188040d-6c67-4c5b-b112-36a304b66dad. Your app should use the GUID portion of the claim to restrict the set of tenants that can sign in to the app, if applicable.

iat (int, a Unix timestamp): “Issued At” indicates when the authentication for this token occurred.

idp (String, usually an STS URI): Records the identity provider that authenticated the subject of the token. This value is identical to the value of the iss claim unless the user account is not in the same tenant as the issuer - guests, for instance. If the claim isn’t present, it means that the value of iss can be used instead. For personal accounts being used in an organizational context (for instance, a personal account invited to an Azure AD tenant), the idp claim may be ‘live.com’ or an STS URI containing the Microsoft account tenant 9188040d-6c67-4c5b-b112-36a304b66dad.

nbf (int, a Unix timestamp): The “nbf” (not before) claim identifies the time before which the JWT must not be accepted for processing.

exp (int, a Unix timestamp): The “exp” (expiration time) claim identifies the expiration time on or after which the JWT must not be accepted for processing. It’s important to note that in certain circumstances, a resource may reject the token before this time. For example, if a change in authentication is required or a token revocation has been detected.

c_hash (String): The code hash is included in ID tokens only when the ID token is issued with an OAuth 2.0 authorization code. It can be used to validate the authenticity of an authorization code. To understand how to do this validation, see the OpenID Connect specification.

at_hash (String): The access token hash is included in ID tokens only when the ID token is issued from the /authorize endpoint with an OAuth 2.0 access token. It can be used to validate the authenticity of an access token. To understand how to do this validation, see the OpenID Connect specification. This is not returned on ID tokens from the /token endpoint.

aio (Opaque String): An internal claim used by Azure AD to record data for token reuse. Should be ignored.

preferred_username (String): The primary username that represents the user. It could be an email address, phone number, or a generic username without a specified format. Its value is mutable and might change over time. Since it is mutable, this value must not be used to make authorization decisions. It can be used for username hints, however, and in human-readable UI as a username. The profile scope is required in order to receive this claim. Present only in v2.0 tokens.

email (String): The email claim is present by default for guest accounts that have an email address. Your app can request the email claim for managed users (those from the same tenant as the resource) using the email optional claim. On the v2.0 endpoint, your app can also request the email OpenID Connect scope - you don’t need to request both the optional claim and the scope to get the claim.

name (String): The name claim provides a human-readable value that identifies the subject of the token. The value isn’t guaranteed to be unique, it can be changed, and it’s designed to be used only for display purposes. The profile scope is required to receive this claim.

nonce (String): The nonce matches the parameter included in the original /authorize request to the IDP. If it does not match, your application should reject the token.

oid (String, a GUID): The immutable identifier for an object in the Microsoft identity system, in this case, a user account. This ID uniquely identifies the user across applications - two different applications signing in the same user will receive the same value in the oid claim. The Microsoft Graph will return this ID as the id property for a given user account. Because the oid allows multiple apps to correlate users, the profile scope is required to receive this claim. Note that if a single user exists in multiple tenants, the user will contain a different object ID in each tenant - they’re considered different accounts, even though the user logs into each account with the same credentials. The oid claim is a GUID and cannot be reused.

roles (Array of strings): The set of roles that were assigned to the user who is logging in.

rh (Opaque String): An internal claim used by Azure to revalidate tokens. Should be ignored.

sub (String): The principal about which the token asserts information, such as the user of an app. This value is immutable and cannot be reassigned or reused. The subject is a pairwise identifier - it is unique to a particular application ID. If a single user signs into two different apps using two different client IDs, those apps will receive two different values for the subject claim. This may or may not be wanted depending on your architecture and privacy requirements.

tid (String, a GUID): Represents the tenant that the user is signing in to. For work and school accounts, the GUID is the immutable tenant ID of the organization that the user is signing in to. For sign-ins to the personal Microsoft account tenant (services like Xbox, Teams for Life, or Outlook), the value is 9188040d-6c67-4c5b-b112-36a304b66dad.

unique_name (String): Only present in v1.0 tokens. Provides a human-readable value that identifies the subject of the token. This value is not guaranteed to be unique within a tenant and should be used only for display purposes.

uti (String): Token identifier claim, equivalent to jti in the JWT specification. A unique, per-token identifier that is case-sensitive.

ver (String, either 1.0 or 2.0): Indicates the version of the id_token.

hasgroups (Boolean): If present, always true, denoting the user is in at least one group. Used in place of the groups claim for JWTs in implicit grant flows if the full groups claim would extend the URI fragment beyond the URL length limits (currently 6 or more groups). Indicates that the client should use the Microsoft Graph API to determine the user’s groups (https://graph.microsoft.com/v1.0/users/{userID}/getMemberObjects).

groups:src1 (JSON object): For token requests that are not limited in length (see hasgroups above) but still too large for the token, a link to the full groups list for the user will be included. For JWTs it is emitted as a distributed claim, for SAML as a new claim in place of the groups claim.

Example JWT value:
“groups”: “src1”
“_claim_sources”: { “src1”: { “endpoint”: “https://graph.microsoft.com/v1.0/users/{userID}/getMemberObjects” } }

For more info, see Groups overage claim.


28
Q

What’s the difference between an ID Token and an Access Token?

A

Access tokens are defined in OAuth, ID tokens are defined in OpenID Connect.

Access tokens are what the OAuth client uses to make requests to an API. The access token is meant to be read and validated by the API. An ID token contains information about what happened when a user authenticated, and is intended to be read by the OAuth client. The ID token may also contain information about the user such as their name or email address, although that is not a requirement of an ID token.

Here are some further differences between ID tokens and access tokens:

ID tokens are meant to be read by the OAuth client. Access tokens are meant to be read by the resource server.

ID tokens are JWTs. Access tokens can be JWTs but may also be a random string.

ID tokens should never be sent to an API. Access tokens should never be read by the client.

29
Q

What is a JWT Refresh token?

A

An OAuth Refresh Token is a string that the OAuth client can use to get a new access token without the user’s interaction.

A refresh token must not allow the client to gain any access beyond the scope of the original grant. The refresh token exists to enable authorization servers to use short lifetimes for access tokens without needing to involve the user when the token expires.

30
Q

What is JWT?

A

JSON Web Token (JWT) is an open standard (RFC 7519) based on JSON for creating a token used to send data between applications or services while guaranteeing that it is valid and secure.

The most common use case for JWTs is handling authentication in mobile or web applications. For this, when users want to authenticate they send their sign-in data to the server, which generates the JWT and sends it back to the client application; on every subsequent request the client sends this token, which the server uses to verify that the user is correctly authenticated and to know who they are.

This is not the only use case for JWTs, though: they can be used to transfer any data between services of our application and to ensure that it is always valid. For example, if we have an email-sending service, another service could send it a request with a JWT along with the email content or any other necessary data, and we can be sure that the data was not altered in any way.