DP-203 Flashcards
You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.
You need to alter the table to meet the following requirements:
✑ Ensure that users can identify the current manager of employees.
✑ Support creating an employee reporting hierarchy for your entire company.
✑ Provide fast lookup of the managers’ attributes such as name and job title.
Which column should you add to the table?
A. [ManagerEmployeeID] [smallint] NULL
B. [ManagerEmployeeKey] [smallint] NULL
C. [ManagerEmployeeKey] [int] NULL
D. [ManagerName] varchar NULL
C. [ManagerEmployeeKey] [int] NULL
Explanation: We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is an int column.
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable(
EmployeeID int,
EmployeeName string,
EmployeeStartDate date)
USING Parquet
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
EmployeeName: Alice
EmployeeID: 24
EmployeeStartDate: 2020-01-25
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
SELECT EmployeeID
FROM mytestdb.dbo.myParquetTable
WHERE EmployeeName = 'Alice';
What will be returned by the query?
A. 24
B. an error
C. a null value
An error
Explanation: Table names are converted to lower case and must be queried using the lower-case name.
Which role works with Azure Cognitive Services, Cognitive Search, and the Bot Framework?
(Azure Databricks)
A data engineer
A data scientist
An AI engineer
An AI engineer
Azure Databricks encapsulates which Apache Storage technology?
(Azure Databricks)
Apache HDInsight
Apache Hadoop
Apache Spark
Apache Spark
Azure Databricks is an Apache Spark-based analytics platform optimized for Microsoft Azure.
Which security features does Azure Databricks not support?
Azure Active Directory
Shared Access Keys
Role-based access
Shared Access Keys
Shared Access Keys are a security feature used within Azure storage accounts.
Which of the following Azure Databricks components provides support for R, SQL, Python, Scala, and Java?
MLlib
GraphX
Spark Core API
Spark Core API
Which Notebook format is used in Databricks?
DBC
.notebook
.spark
DBC
DBC is the supported Databricks notebook archive format. There is no .notebook or .spark file format available.
Which browsers are recommended for best use with Databricks Notebook?
Chrome and Firefox
Microsoft Edge and IE 11
Safari and Microsoft Edge
Chrome and Firefox
Microsoft Edge and IE 11 are not recommended because of faulty rendering of iFrames, but Safari is also an acceptable browser.
How do you connect your Spark cluster to Azure Blob storage?
By calling the .connect() function on the Spark Cluster.
By mounting it
By calling the .connect() function on the Azure Blob
By mounting it
Mounts require Azure credentials such as SAS keys and give access to a virtually infinite store for your data. The .connect() function is not a valid method.
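For illustration, a minimal PySpark sketch of mounting a Blob storage container from a Databricks notebook. The container name mycontainer, storage account mystorageacct, mount point /mnt/mydata, and the sas_token variable are hypothetical placeholders; dbutils is available in Databricks notebooks without an import.

# Mount the container at /mnt/mydata, authenticating with a SAS token
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageacct.blob.core.windows.net",
    mount_point="/mnt/mydata",
    extra_configs={
        "fs.azure.sas.mycontainer.mystorageacct.blob.core.windows.net": sas_token
    }
)

# Files in the container can now be read through the mount point
df = spark.read.csv("/mnt/mydata/sales.csv", header=True)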
How does Spark connect to databases like MySQL, Hive and other data stores?
JDBC
ODBC
Using the REST API Layer
JDBC
JDBC stands for Java Database Connectivity and is a Java API for connecting to databases such as MySQL and Hive. ODBC is not an option, and there is no REST API layer for this purpose.
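For illustration, a minimal PySpark sketch of reading a MySQL table over JDBC. The host, database, table, user, and the db_password variable are hypothetical, and the matching JDBC driver must already be attached to the cluster.

# Read the orders table from a MySQL database over JDBC
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://myserver.example.com:3306/salesdb")
      .option("dbtable", "orders")
      .option("user", "spark_reader")
      .option("password", db_password)
      .load())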
How do you specify parameters when reading data?
Using .option() during your read allows you to pass key/value pairs specifying aspects of your read
Using .parameter() during your read allows you to pass key/value pairs specifying aspects of your read
Using .keys() during your read allows you to pass key/value pairs specifying aspects of your read
Using .option()
Using .option() during your read allows you to pass key/value pairs specifying aspects of your read. For instance, options for reading CSV data include header, delimiter, and inferSchema.
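For example, a short PySpark sketch of a CSV read that passes several options; the file path is a hypothetical placeholder.

# Use the first row as headers, infer column types, and treat ';' as the delimiter
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", ";")
      .csv("/mnt/mydata/products.csv"))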
By default, how are corrupt records dealt with when using spark.read.json()?
They appear in a column called “_corrupt_record”
They get deleted automatically
They throw an exception and exit the read operation
They appear in a column called “_corrupt_record”
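For example, a short PySpark sketch assuming a hypothetical file path; in the default PERMISSIVE mode, rows that fail to parse keep their raw text in _corrupt_record and have nulls in the data columns.

df = spark.read.json("/mnt/mydata/events.json")
df.printSchema()  # malformed rows add a _corrupt_record string column to the inferred schema
df.show()         # corrupt rows show nulls in the data columns and the raw line in _corrupt_record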
What is the recommended storage format to use with Spark?
JSON
XML
Apache Parquet
Apache Parquet
Apache Parquet is a highly optimized solution for data storage and is the recommended option for storage.
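For example, a minimal PySpark sketch of writing an existing DataFrame named df to Parquet and reading it back; the paths are hypothetical placeholders.

# Parquet stores the schema with the data and uses efficient columnar compression
df.write.mode("overwrite").parquet("/mnt/mydata/products_parquet")
products = spark.read.parquet("/mnt/mydata/products_parquet")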
You need to develop a pipeline for processing data. The pipeline must meet the following requirements:
Scale up and down resources for cost reduction
Use an in-memory data processing engine to speed up ETL and machine learning operations.
Use streaming capabilities
Provide the ability to code in SQL, Python, Scala, and R
Integrate workspace collaboration with Git
What should you use?
HDInsight Spark Cluster
Azure Stream Analytics
HDInsight Hadoop Cluster
Azure SQL Data Warehouse
HDInsight Kafka Cluster
HDInsight Storm Cluster
HDInsight Spark Cluster
You plan to perform batch processing in Azure Databricks once daily.
Which type of Databricks cluster should you use?
job
interactive
High Concurrency
Job
You are a data engineer implementing a lambda architecture on Microsoft Azure. You use an open-source big data solution to collect, process, and maintain data. The analytical data store performs poorly.
You must implement a solution that meets the following requirements:
Provide data warehousing
Reduce ongoing management activities
Deliver SQL query responses in less than one second
You need to create an HDInsight cluster to meet the requirements.
Which type of cluster should you create?
Apache HBase
Apache Hadoop
Interactive Query
Apache Spark
Apache Spark
Apache Spark for Azure HDInsight is a processing framework that runs large-scale data analytics applications.
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch processing and stream processing methods, and minimizing the latency involved in querying big data.
You plan to perform batch processing in Azure Databricks once daily.
Which type of Databricks cluster should you use?
High Concurrency
interactive
automated
Automated
You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.
Your company plans to create an event processing engine to handle streaming data from Twitter.
The data engineering team uses Azure Event Hubs to ingest the streaming data.
You need to implement a solution that uses Azure Databricks to receive the streaming data from the Azure Event Hubs.
Which three actions should you recommend be performed in sequence?
A. Create and configure a Notebook that consumes the streaming data.
B. Import data from Blob storage
C. Use Environment variables to define the Apache Spark connection.
D. Configure the JDBC or ODBC connector
E. Deploy Azure Databricks service
F. Deploy a Spark cluster and then attach the required libraries to the cluster.
E-F-A
You are developing a solution using a Lambda architecture on Microsoft Azure.
The data at rest layer must meet the following requirements:
Data storage:
- Serve as a repository for high volumes of large files in various formats.
- Implement optimized storage for big data analytics workloads.
- Ensure that data can be organized using a hierarchical structure.
Batch processing:
- Use a managed solution for in-memory computation processing.
- Natively support Scala, Python, and R programming languages.
- Provide the ability to resize and terminate the cluster automatically.
Analytical data store:
- Support parallel processing.
- Use columnar storage.
- Support SQL-based languages.
You need to identify the correct technologies to build the Lambda architecture.
Which technologies should you use?
Data Storage:
A. Azure SQL Database
B. Azure Blob Storage
C. Azure Cosmos DB
D. Azure Data Lake
Batch Processing:
A. HDInsight Spark
B. HDInsight Hadoop
C. Azure Databricks
D. HDInsight Interactive Query
Analytical data store:
A. HDInsight HBase
B. Azure SQL Data Warehouse
C. Azure Analysis Services
D. Azure Cosmos DB
D-A-B
A key mechanism that allows Azure Data Lake Storage Gen2 to provide file system performance at object storage scale and prices is the addition of a hierarchical namespace.
Apache Spark is an open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications.
HDInsight is a managed Hadoop service. Use it to deploy and manage Hadoop clusters in Azure. For batch processing, you can use Spark, Hive, Hive LLAP, or MapReduce.
SQL Data Warehouse is a cloud-based Enterprise Data Warehouse (EDW) that uses Massively Parallel Processing (MPP).
SQL Data Warehouse stores data into relational tables with columnar storage.
You create an Azure Databricks cluster and specify an additional library to install.
When you attempt to load the library to a notebook, the library is not found.
You need to identify the cause of the issue.
What should you review?
workspace logs
notebook logs
global init scripts logs
cluster event logs
global init scripts logs
Init scripts are shell scripts that run during the startup of each cluster node before the Spark driver or worker JVM starts. Databricks customers use init scripts for various purposes such as installing custom libraries, launching background processes, or applying enterprise security policies.
You need to collect application metrics, streaming query events, and application log messages for an Azure Databricks cluster.
Which type of library and workspace should you implement?
Library:
A. Azure Databricks Monitoring Library
B. Azure Management Monitoring Library
C. PyTorch
D. TensorFlow
Workspace:
A. Azure Databricks
B. Azure Log Analytics
C. Azure Machine Learning
A-B
You can send application logs and metrics from Azure Databricks to a Log Analytics workspace. It uses the Azure Databricks Monitoring Library, which is available on GitHub.
You have an Azure Databricks resource.
You need to log actions that relate to compute changes triggered by the Databricks resources.
Which Databricks services should you log?
workspace
SSH
DBFS
clusters
jobs
Clusters
An Azure Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads.
Your company analyzes images from security cameras and sends alerts to security teams that respond to unusual activity. The solution uses Azure Databricks.
You need to send Apache Spark level events, Spark Structured Streaming metrics, and application metrics to Azure Monitor.
Which three actions should you perform in sequence?
A. Create a data source in Azure Monitor
B. Configure the Databricks cluster to use the databricks monitoring library.
C. Deploy Grafana to an Azure VM.
D. Build the spark-listeners-loganalytics-1.0-SNAPSHOT.jar JAR file.
E. Create Dropwizard counters in the application code.
B-D-E
You can send application logs and metrics from Azure Databricks to a Log Analytics workspace.
Spark uses a configurable metrics system based on the Dropwizard Metrics Library.
Prerequisites: Configure your Azure Databricks cluster to use the monitoring library.
You plan to build a structured streaming solution in Azure Databricks. The solution will count new events in five-minute intervals and report only events that arrive during the interval. The output will be sent to a Delta Lake table.
Which output mode should you use?
complete
update
append
Append
Append Mode: Only new rows appended in the result table since the last trigger are written to external storage. This is applicable only for the queries where existing rows in the Result Table are not expected to change.
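A minimal PySpark sketch of this pattern, assuming a streaming DataFrame named events with an eventTime timestamp column and hypothetical storage paths. Append mode with an aggregation requires a watermark so Spark knows when a window is final.

from pyspark.sql.functions import window, col

# Count events per five-minute window; the watermark lets finalized windows be appended
counts = (events
          .withWatermark("eventTime", "5 minutes")
          .groupBy(window(col("eventTime"), "5 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("delta")
         .option("checkpointLocation", "/mnt/mydata/checkpoints/event_counts")
         .start("/mnt/mydata/tables/event_counts"))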
You have an Azure Data Lake Storage Gen2 account that contains JSON files for customers. The files contain two attributes named FirstName and LastName.
You need to copy the data from the JSON files to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values.
You create the following components:
- A destination table in Azure Synapse
- An Azure Blob storage container
- A service principal
Which five actions should you perform in sequence next in a Databricks notebook?
A. Specify a temporary folder to stage the data
B. Write the results to a table in Azure Synapse
C. Write the results to Data Lake Storage.
D. Drop the data frame
E. Perform transformations on the data frame
F. Mount the Data Lake Storage onto DBFS
G. Perform transformations on the file
H. Read the file into a data frame
H-E-A-B-D
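For illustration, a minimal PySpark sketch of the sequence, using the Azure Synapse connector that ships with Databricks. The storage paths, container names, synapse_jdbc_url variable, and dbo.Customers table are hypothetical placeholders.

from pyspark.sql.functions import concat_ws

# H: read the JSON files into a data frame
df = spark.read.json("abfss://customers@mydatalake.dfs.core.windows.net/json/")

# E: perform transformations - add a column that concatenates FirstName and LastName
df = df.withColumn("FullName", concat_ws(" ", df["FirstName"], df["LastName"]))

# A + B: specify a temporary staging folder and write the results to the Azure Synapse table
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", synapse_jdbc_url)
   .option("tempDir", "wasbs://staging@mystorageacct.blob.core.windows.net/temp")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.Customers")
   .save())

# D: drop the data frame once the load completes
del df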
You are designing an Azure Databricks interactive cluster.
You need to ensure that the cluster meets the following requirements:
- Enable auto-termination
- Retain cluster configuration indefinitely after cluster termination.
What should you recommend?
Start the cluster after it is terminated.
Pin the cluster
Clone the cluster after it is terminated.
Terminate the cluster manually at process completion.
Pin the cluster
To keep an interactive cluster configuration even after it has been terminated for more than 30 days, an administrator can pin a cluster to the cluster list.
You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day.
You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times.
What should you include in the solution?
Partition by DateTime fields.
Sink to Azure Queue storage.
Include a watermark column.
Use a JSON format for physical data storage.
Sink to Azure Queue storage
The Databricks ABS-AQS connector uses Azure Queue Storage (AQS) to provide an optimized file source that lets you find new files written to an Azure Blob storage (ABS) container without repeatedly listing all of the files. This provides two major advantages:
- Lower latency: no need to list nested directory structures on ABS, which is slow and resource intensive.
- Lower costs: no more costly LIST API requests made to ABS.
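For illustration, a minimal PySpark sketch of reading from the ABS-AQS file source; the queue name, connection-string variable, and schema are hypothetical, and option names may differ slightly between Databricks Runtime versions.

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("eventId", StringType()),
    StructField("eventTime", TimestampType()),
])

# Discover new Blob storage files through an Azure Queue Storage queue instead of listing the container
events = (spark.readStream
          .format("abs-aqs")
          .option("fileFormat", "json")
          .option("queueName", "events-queue")
          .option("connectionString", queue_connection_string)
          .schema(schema)
          .load())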
You are designing an Azure Databricks cluster that runs user-defined local processes.
You need to recommend a cluster configuration that meets the following requirements:
- Minimize query latency.
- Maximize the number of users that can run queries on the cluster at the same time.
- Reduce overall costs without compromising other requirements.
Which cluster type should you recommend?
(Azure Databricks)
Standard with Auto Termination
High Concurrency with Autoscaling
High Concurrency with Auto Termination
Standard with Autoscaling
High Concurrency with Autoscaling.
The key benefits of High Concurrency clusters are that they provide fine-grained sharing for maximum resource utilization and minimum query latencies.
You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java.
Which service should you recommend using to process the streaming data?
(Azure Databricks)
Azure Event Hubs
Azure Data Factory
Azure Stream Analytics
Azure Databricks
Azure Databricks
You need to implement an Azure Databricks cluster that automatically connects to Azure Data Lake Storage Gen2 by using Azure Active Directory (Azure AD) integration.
How should you configure the new cluster?
Tier:
A. Premium
B. Standard
Advanced option to enable:
A. Azure Data Lake Storage credential passthrough.
B. Table Access Control
Premium; Azure Data Lake Storage credential passthrough.
Credential passthrough requires an Azure Databricks Premium Plan
You create an Azure Databricks cluster and specify an additional library to install.
When you attempt to load the library to a notebook, the library is not found.
You need to identify the cause of the issue.
What should you review?
A. notebook logs
B. cluster event logs
C. global init scripts logs
D. workspace logs
cluster event logs
Cluster event logs capture cluster lifecycle events such as creation, termination, and configuration edits. A cluster also provides Apache Spark driver and worker logs, which you can use for debugging, and cluster init-script logs, which are valuable for debugging init scripts.
You are planning a streaming data solution that will use Azure Databricks. The solution will stream sales transaction data from an online store. The solution has the following specifications:
- The output data will contain items purchased, quantity, line total sales amount, and line total tax amount.
- Line total sales amount and line total tax amount will be aggregated in Databricks.
- Sales transactions will never be updated. Instead, new rows will be added to adjust a sale.
You need to recommend an output mode for the dataset that will be processed by using Structured Streaming. The solution must minimize duplicate data.
What should you recommend?
Update
Complete
Append
Append
Which Azure Data Factory component contains the transformation logic or the analysis commands of the Azure Data Factory’s work?
Linked Services
Datasets
Activities
Pipelines
Activities
Activities contain the transformation logic or the analysis commands of Azure Data Factory's work.
Linked Services are objects that are used to define the connection to data stores or compute resources in Azure.
Datasets represent data structures within the data store that is being referenced by the Linked Service object.
Pipelines are a logical grouping of activities.