ITExams Flashcards

1
Q

A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).

Which of the following code blocks creates this SQL UDF?

A.
CREATE FUNCTION combine_nyc
RETURNS STRING
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;

B.
CREATE UDF combine_nyc
RETURNS STRING
CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;

C.
CREATE UDF combine_nyc
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;

D.
CREATE FUNCTION combine_nyc
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;

E.
CREATE UDF combine_nyc
RETURNS STRING
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;

A

A.
CREATE FUNCTION combine_nyc
RETURNS STRING
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;

(ITExams says E, but I say A)

https://docs.databricks.com/en/udf/index.html
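
For what it's worth, option A as printed would still need a parameter list before it actually runs; a minimal sketch of a working version, run from a Python notebook (assumes a Databricks environment where spark is predefined and the stores table from the question exists):

# Assumes a Databricks notebook where `spark` is already defined.
spark.sql("""
    CREATE OR REPLACE FUNCTION combine_nyc(city STRING)
    RETURNS STRING
    RETURN CASE WHEN city = 'brooklyn' THEN 'new york' ELSE city END
""")

# Apply the SQL UDF at scale to the string column `city` in table `stores`.
spark.sql("SELECT combine_nyc(city) AS city FROM stores").show()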

2
Q

Question 1

Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?

A. DROP
B. IGNORE
C. MERGE
D. APPEND
E. INSERT

A

C. MERGE
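
A minimal sketch of how MERGE avoids writing duplicates into a Delta table, run from Python (the updates source table and the transaction_id key are hypothetical):

# Only rows whose key is not already present in the target are inserted.
spark.sql("""
    MERGE INTO transactions AS t
    USING updates AS u
    ON t.transaction_id = u.transaction_id
    WHEN NOT MATCHED THEN INSERT *
""")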

3
Q

Question 3

A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.

Which of the following approaches could be used by the data engineering team to complete this task?

A. They could submit a feature request with Databricks to add this functionality.
B. They could wrap the queries using PySpark and use Python’s control flow system to determine when to run the final query.
C. They could only run the entire program on Sundays.
D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.
E. They could redesign the data model to separate the data used in the final query into a new table.

A

B. They could wrap the queries using PySpark and use Python’s control flow system to determine when to run the final query.
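
A sketch of what that wrapper could look like (the daily_sales and weekly_summary tables are hypothetical placeholders for the analyst's queries):

from datetime import date

# Queries that run every day.
spark.sql("SELECT COUNT(*) FROM daily_sales").show()

# Python's date.weekday() returns Monday=0 ... Sunday=6, so 6 means Sunday.
if date.today().weekday() == 6:
    # The final query, executed only on Sundays.
    spark.sql("SELECT COUNT(*) FROM weekly_summary").show()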

4
Q

Question 4

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location “/transactions/raw”.

Today, the data engineer runs the following command to complete this task:

COPY INTO transactions
FROM "/transactions/raw"
FILEFORMAT = PARQUET;

After running the command today, the data engineer notices that the number of records in table transactions has not changed.

Which of the following describes why the statement might not have copied any new records into the table?

A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
B. The names of the files to be copied were not included with the FILES keyword.
C. The previous day’s file has already been copied into the table.
D. The PARQUET file format does not support COPY INTO.
E. The COPY INTO statement requires the table to be refreshed to view the copied rows.

A

C. The previous day’s file has already been copied into the table.

https://docs.databricks.com/en/ingestion/copy-into/index.html

“The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and idempotent operation; files in the source location that have already been loaded are skipped.”

Bing Copilot:

One possible reason why the statement might not have copied any new records into the table is C. The previous day’s file has already been copied into the table.

According to the web search result 1, the COPY INTO command can load data from cloud object storage into a table in your Databricks workspace. The COPY INTO command supports idempotent loading, which means that it only loads files that have not been previously loaded into the target table1. This feature helps avoid data duplication and ensures exactly-once semantics1.

Therefore, if the previous day’s file has already been copied into the table, running the COPY INTO command again will not load any new records into the table. The COPY INTO command will skip the file that has already been loaded and return a message like this:

+---------+---------+
|numFiles |numCopied|
+---------+---------+
|1        |0        |
+---------+---------+

The other options are incorrect for the following reasons:

A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword. This is not a valid keyword for the COPY INTO command. The correct way to specify the format of the files is to use the FILEFORMAT clause, which is already present in the statement1.
B. The names of the files to be copied were not included with the FILES keyword. This is not a valid keyword for the COPY INTO command. The correct way to specify the files to be copied is to use the FROM clause, which is already present in the statement1.
D. The PARQUET file format does not support COPY INTO. This is false, as PARQUET is one of the supported file formats for COPY INTO1.
E. The COPY INTO statement requires the table to be refreshed to view the copied rows. This is false, as COPY INTO does not require any additional steps to view the copied rows. The table will be automatically updated with the new data after running COPY INTO1.
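
If re-loading an already-ingested file is actually what's wanted (for example because the file was overwritten in place), the idempotency check can be bypassed; a sketch from Python, assuming the documented 'force' copy option:

spark.sql("""
    COPY INTO transactions
    FROM '/transactions/raw'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('force' = 'true')  -- re-load files even if they were loaded before
""")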

5
Q

Question 5

A data engineer needs to create a table in Databricks using data from their organization’s existing SQLite database.

They run the following command:

CREATE TABLE jdbc_customer360
USING ___
OPTIONS (
url "jdbc:sqllite:/customers.db",
dbtable "customer360"
)

Which of the following lines of code fills in the above blank to successfully complete the task?

A. org.apache.spark.sql.jdbc
B. autoloader
C. DELTA
D. sqlite
E. org.apache.spark.sql.sqlite

A

According to ITExams: E. org.apache.spark.sql.sqlite

https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html

Bing Copilot:

The correct answer is A. org.apache.spark.sql.jdbc.

According to the web search results, org.apache.spark.sql.jdbc is the built-in JDBC data source that can be used to connect to various relational databases, such as SQLite, MySQL, PostgreSQL, etc123. To create a table in Databricks using data from a JDBC data source, you need to specify the USING clause with the value org.apache.spark.sql.jdbc and provide the OPTIONS clause with the connection properties, such as url, dbtable, user, password, etc123.

The other options are incorrect for the following reasons:

B. autoloader. This is not a valid data source for creating a table in Databricks. Autoloader is a feature that enables you to incrementally load files from cloud storage into Delta Lake tables using Structured Streaming4.
C. DELTA. This is not a valid data source for creating a table in Databricks using data from an SQLite database. DELTA is the data source for creating tables in Delta Lake format, which provides ACID transactions, scalable metadata handling, and unified batch and streaming processing5.
D. sqlite. This is not a valid data source for creating a table in Databricks. There is no built-in data source for SQLite in Spark SQL. You need to use the JDBC data source with the appropriate driver and connection properties6.
E. org.apache.spark.sql.sqlite. This is not a valid data source for creating a table in Databricks. There is no such package or class in Spark SQL that supports SQLite as a data source. You need to use the JDBC data source with the appropriate driver and connection properties6.
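
A sketch of the completed statement run from Python; the driver option and the assumption that a SQLite JDBC driver jar is attached to the cluster are mine, not part of the question:

spark.sql("""
    CREATE TABLE jdbc_customer360
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url 'jdbc:sqlite:/customers.db',
      dbtable 'customer360',
      driver 'org.sqlite.JDBC'
    )
""")

# The table can then be read like any other table.
df = spark.table("jdbc_customer360")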

6
Q

Question 6

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.
Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?

A. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INNER JOIN SELECT * FROM april_transactions;

B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions;

C. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
OUTER JOIN SELECT * FROM april_transactions;

D. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INTERSECT SELECT * from april_transactions;

E. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
MERGE SELECT * FROM april_transactions;

A

B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions;

7
Q

Question 7

A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.

Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?

A. if day_of_week = 1 and review_period:
B. if day_of_week = 1 and review_period = "True":
C. if day_of_week == 1 and review_period == "True":
D. if day_of_week == 1 and review_period:
E. if day_of_week = 1 & review_period: = "True":

A

ITExams says C but I say D:

D. if day_of_week == 1 and review_period:
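
A tiny sketch to confirm the syntax (the variable values are just for illustration):

day_of_week = 1
review_period = True

# `==` compares, `=` assigns; review_period is already a boolean, so no string comparison is needed.
if day_of_week == 1 and review_period:
    print("running the final block")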

8
Q

Question 8

A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data. They run the following command:

DROP TABLE IF EXISTS my_table;

While the object no longer appears when they run SHOW TABLES, the data files still exist.

Which of the following describes why the data files still exist and the metadata files were deleted?

A. The table’s data was larger than 10 GB
B. The table’s data was smaller than 10 GB
C. The table was external
D. The table did not have a location
E. The table was managed

A

C. The table was external

Bing Copilot:

The most likely reason why the data files still exist and the metadata files were deleted is C. The table was external.

According to the web search result 2, an external table is a table that is created using the EXTERNAL keyword and points to an existing location in the file system. An external table does not manage the data files, but only stores the metadata information in the metastore database1. Therefore, when you drop an external table, only the metadata information is removed, but the data files are not deleted1.

On the other hand, a managed table is a table that is created without the EXTERNAL keyword and stores both the data files and the metadata information in a default location managed by Spark1. A managed table manages the data files, and deletes them when the table is dropped1.

Therefore, if you drop a table and only see the metadata files deleted, but not the data files, it means that the table was external and not managed by Spark. To delete both the metadata and the data files of an external table, you need to use the PURGE option in the DROP TABLE command1. For example:

DROP TABLE IF EXISTS my_table PURGE;
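
A minimal sketch of the managed vs. external difference (table names and the path are hypothetical):

# Managed table: Databricks controls the storage location; DROP TABLE deletes data and metadata.
spark.sql("CREATE TABLE managed_demo (id INT)")

# External table: data lives at an explicit LOCATION; DROP TABLE removes only the metadata.
spark.sql("CREATE TABLE external_demo (id INT) LOCATION '/tmp/external_demo'")
spark.sql("DROP TABLE external_demo")
# The files under /tmp/external_demo still exist after the drop.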

9
Q

Question 9

A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.

Which of the following data entities should the data engineer create?

A. Database
B. Function
C. View
D. Temporary view
E. Table

A

E. Table because it must be saved to a physical location

10
Q

Question 10

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.

Which of the following tools can the data engineer use to solve this problem?

A. Unity Catalog
B. Data Explorer
C. Delta Lake
D. Delta Live Tables
E. Auto Loader

A

C according to ITExams but that is wrong. D. Delta Live Tables is correct.

Bing Copilot:

One possible tool that the data engineer can use to solve this problem is D. Delta Live Tables.

According to the web search result 1, Delta Live Tables is a feature that enables you to build reliable and scalable data pipelines using SQL or Python. Delta Live Tables allows you to define data quality constraints on the contents of a dataset using expectations. Expectations are clauses that apply data quality checks on each record passing through a query and specify an action to take when a record fails the expectation, such as warn, drop, or fail1.

By using Delta Live Tables, the data engineer can monitor the quality level of the source data and handle any invalid records according to their needs. Delta Live Tables also provides data quality metrics, such as the number of records that violate an expectation, that can be viewed by querying the Delta Live Tables event log1. Additionally, Delta Live Tables supports idempotent loading, which means that it only loads files that have not been previously loaded into the target table, ensuring exactly-once semantics1.

The other options are not suitable tools for solving this problem, for the following reasons:

A. Unity Catalog is a centralized metadata layer that provides a consistent view of all your data assets across different data sources2. Unity Catalog is not a tool for monitoring data quality or handling invalid records.
B. Data Explorer is a web-based interface that allows you to easily discover, explore, and visualize your data using SQL, Python, R, and Scala3. Data Explorer is not a tool for monitoring data quality or handling invalid records.
C. Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform4. Delta Lake provides features such as ACID transactions, schema enforcement, and time travel, but it does not provide built-in data quality checks or actions.
E. Auto Loader is a feature that enables you to incrementally load files from cloud storage into Delta Lake tables using Structured Streaming5. Auto Loader can help with data ingestion and processing, but it does not provide built-in data quality checks or actions.

Example:

CREATE OR REFRESH LIVE TABLE count_verification(
  CONSTRAINT no_rows_dropped EXPECT (a_count == b_count)
) AS SELECT * FROM
  (SELECT COUNT(*) AS a_count FROM LIVE.tbla),
  (SELECT COUNT(*) AS b_count FROM LIVE.tblb)

https://docs.databricks.com/en/delta-live-tables/expectations.html#language-python
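
The Python flavour of the same idea, assuming it runs inside a Delta Live Tables pipeline (the dataset names and the constraint are hypothetical):

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Silver orders with a basic quality gate")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # drop records that fail the expectation
def orders_silver():
    # Reads an upstream DLT dataset named orders_bronze (hypothetical).
    return dlt.read("orders_bronze").where(col("order_id").isNotNull())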

11
Q

Question 11

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Production mode using the Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
B. All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
E. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

A

According to ITExams: E. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

But I say C along with Bing Copilot:

The correct answer is C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

According to the documentation1, a Delta Live Tables pipeline can run in two modes: triggered or continuous. In triggered mode, the pipeline runs once and then stops. In continuous mode, the pipeline runs continuously until it is manually stopped or encounters an error. The pipeline mode can be specified in the pipeline configuration or in the UI when creating or editing a pipeline.

When a pipeline runs in continuous mode, it updates all datasets at regular intervals based on the refresh interval setting. The refresh interval can be specified in seconds, minutes, hours, or days. The default refresh interval is 15 minutes. The refresh interval can be changed in the pipeline configuration or in the UI.

When a pipeline runs in production mode, it uses ephemeral clusters to execute the updates. Ephemeral clusters are clusters that are created and terminated by Delta Live Tables as needed. They are not visible in the Clusters page of the workspace UI. Ephemeral clusters help reduce costs and optimize resource utilization by only using compute resources when they are needed.

Therefore, if a Delta Live Tables pipeline is configured to run in production mode and continuous mode, it will update all datasets at set intervals until the pipeline is shut down, and it will use ephemeral clusters that will be deployed for the update and terminated when the pipeline is stopped.

12
Q

Question 12

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

A. Checkpointing and Write-ahead Logs
B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
C. Replayable Sources and Idempotent Sinks
D. Write-ahead Logs and Idempotent Sinks
E. Checkpointing and Idempotent Sinks

A

According to ITExams:

E. Checkpointing and Idempotent Sinks

Bing Copilot:

The correct answer is A. Checkpointing and Write-ahead Logs.

According to the documentation1, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. Checkpointing is a mechanism to save the state of a streaming query to a reliable storage system, such as HDFS or S3. Write-ahead logs are files that record the offsets and data of each batch before processing, so that they can be replayed in case of a failure. Together, these two techniques ensure that Structured Streaming can reliably track the exact progress of the processing and handle any kind of failure by restarting and/or reprocessing.

Replayable sources and idempotent sinks are also important concepts in Structured Streaming, but they are not used to record the offset range of the data being processed in each trigger. Replayable sources are sources that can replay the same data multiple times, such as Kafka or file-based sources. Idempotent sinks are sinks that can handle duplicate writes without changing the final result, such as file sinks or some database sinks. These features help Structured Streaming achieve end-to-end exactly-once semantics by ensuring that the same data is not processed or written more than once.

“In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.” (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
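
A small sketch showing where the checkpoint (offsets, write-ahead log, state) is configured; the rate test source and the paths are placeholders:

query = (spark.readStream
    .format("rate")                       # built-in source that generates rows, for illustration
    .load()
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/rate_demo")  # progress is recorded here
    .outputMode("append")
    .start("/tmp/delta/rate_demo"))
# After a failure or restart, the query resumes from the offsets recorded in the checkpoint.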

13
Q

Question 13
Which of the following describes the relationship between Gold tables and Silver tables?

A. Gold tables are more likely to contain aggregations than Silver tables.
B. Gold tables are more likely to contain valuable data than Silver tables.
C. Gold tables are more likely to contain a less refined view of data than Silver tables.
D. Gold tables are more likely to contain more data than Silver tables.
E. Gold tables are more likely to contain truthful data than Silver tables.

A

According to ITExams:

C. Gold tables are more likely to contain a less refined view of data than Silver tables.

This is for sure not correct.

Bing Copilot:

The correct answer is A. Gold tables are more likely to contain aggregations than Silver tables.

According to the web search results, Gold tables are the final layer of the medallion lakehouse architecture, which provide business-level aggregates often used for reporting and dashboarding1234. Gold tables can be fact and dimension tables, aggregate tables, or curated data sets5. Gold tables represent data that has been transformed into knowledge, rather than just information2.

Silver tables are the intermediate layer of the medallion lakehouse architecture, which provide a more refined view of the data from the Bronze layer. Silver tables are matched, merged, conformed and cleansed (“just-enough”) so that they can provide an “Enterprise view” of all its key business entities, concepts and transactions1. Silver tables enable self-service analytics for ad-hoc reporting, advanced analytics and ML1. Silver tables are also at the event/measurement level, but they are highly-refined and are ready for queries, reporting, dashboards etc5.

Therefore, Gold tables are more likely to contain aggregations than Silver tables, as they are designed to provide business-level insights and summaries from the data in the Silver layer.

After asking about option B:

Option B is incorrect. Gold tables are not more likely to contain valuable data than Silver tables. Both Gold and Silver tables contain valuable data, but they serve different purposes and audiences. Gold tables provide business-level aggregates that are often used for reporting and dashboarding, while Silver tables provide a more refined view of the data from the Bronze layer that enable self-service analytics for ad-hoc reporting, advanced analytics and ML. The value of the data depends on the use case and the user’s needs. Therefore, option B is not a valid way to describe the relationship between Gold and Silver tables.
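
A sketch of the Silver-to-Gold aggregation step that answer A describes (table and column names are hypothetical):

from pyspark.sql import functions as F

# Silver: cleansed, event-level records; Gold: business-level aggregates for reporting.
gold = (spark.table("sales_silver")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_sales")))

gold.write.format("delta").mode("overwrite").saveAsTable("sales_gold")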

14
Q

Question 14

Which of the following describes the relationship between Bronze tables and raw data?

A. Bronze tables contain less data than raw data files.
B. Bronze tables contain more truthful data than raw data.
C. Bronze tables contain aggregates while raw data is unaggregated.
D. Bronze tables contain a less refined view of data than raw data.
E. Bronze tables contain raw data with a schema applied.

A

C according to ITExams, but that is clearly wrong.

Bing Copilot:

The correct answer is E. Bronze tables contain raw data with a schema applied.

According to the web search results, Bronze tables are the first layer of the medallion lakehouse architecture, which store raw data ingested from various sources, such as RDBMS data, JSON files, IoT data, etc123. The table structures in this layer correspond to the source system table structures “as-is”, along with any additional metadata columns that capture the load date/time, process ID, etc1. The focus in this layer is quick Change Data Capture and the ability to provide an historical archive of source (cold storage), data lineage, auditability, reprocessing if needed without rereading the data from the source system1.

Bronze tables are stored in Delta Lake format, which is a storage layer that brings reliability and performance to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing4. Delta Lake also enables schema enforcement and evolution for diverse data types4. Therefore, Bronze tables contain raw data with a schema applied, either implicitly or explicitly.

The other options are incorrect for the following reasons:

A. Bronze tables contain less data than raw data files. This is false, as Bronze tables contain all the data from the raw data files, plus some additional metadata columns1.
B. Bronze tables contain more truthful data than raw data. This is false, as Bronze tables do not perform any transformations or validations on the raw data. They simply store the data as it is from the source systems1.
C. Bronze tables contain aggregates while raw data is unaggregated. This is false, as Bronze tables do not perform any aggregations on the raw data. They store the data at the event or measurement level4.
D. Bronze tables contain a less refined view of data than raw data. This is false, as Bronze tables provide a more refined view of data than raw data by applying a schema and storing the data in a reliable and performant format4.

15
Q

Question 15

Which of the following tools is used by Auto Loader to process data incrementally?

A. Checkpointing
B. Spark Structured Streaming
C. Data Explorer
D. Unity Catalog
E. Databricks SQL

A

B according to ITExams, and also according to Bing Copilot:

The correct answer is B. Spark Structured Streaming.

According to the web search results, Auto Loader provides a Structured Streaming source called cloudFiles123. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine4. Structured Streaming allows you to express your streaming computation the same way you would express a batch computation on static data4. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive4.

The other options are incorrect for the following reasons:

A. Checkpointing is not a tool, but a mechanism to save the state of a streaming query to a reliable storage system4. Auto Loader uses checkpointing to track ingestion progress and ensure exactly-once semantics, but it is not the main tool to process data incrementally2.

C. Data Explorer is a web-based interface that allows you to easily discover, explore, and visualize your data using SQL, Python, R, and Scala5. Data Explorer is not related to Auto Loader or incremental data processing.

D. Unity Catalog is a centralized metadata layer that provides a consistent view of all your data assets across different data sources5. Unity Catalog is not related to Auto Loader or incremental data processing.

E. Databricks SQL is a service that enables you to run SQL queries on your data lake with low latency and high concurrency5. Databricks SQL supports Auto Loader as a data source, but it is not the tool that Auto Loader uses to process data incrementally3.
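
A sketch of Auto Loader as a Structured Streaming source (the paths, file format, and target table are placeholders):

stream = (spark.readStream
    .format("cloudFiles")                                      # the Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")
    .load("/mnt/raw/events")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .toTable("events_bronze"))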

16
Q

Question 1 ( Exam A )

A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame.
Which of the following describes how a data lakehouse could alleviate this issue?

A. Both teams would autoscale their work as data size evolves
B. Both teams would use the same source of truth for their work
C. Both teams would reorganize to report to the same department
D. Both teams would be able to collaborate on projects in real-time
E. Both teams would respond more quickly to ad-hoc requests

A

B. Both teams would use the same source of truth for their work

17
Q

Question 2 ( Exam A )

Which of the following describes a scenario in which a data team will want to utilize cluster pools?

A. An automated report needs to be refreshed as quickly as possible.
B. An automated report needs to be made reproducible.
C. An automated report needs to be tested to identify errors.
D. An automated report needs to be version-controlled across multiple collaborators.
E. An automated report needs to be runnable by all stakeholders.

A

E according to ITExams but Bing Copilot says:

The correct answer is A. An automated report needs to be refreshed as quickly as possible.

According to the web search results, cluster pools are a set of idle, ready-to-use instances that can reduce cluster start and auto-scaling times123. Cluster pools can help improve the performance and reliability of automated reports that run on a regular schedule or on demand. By using cluster pools, the data team can avoid the overhead of provisioning new instances every time a report needs to be refreshed. Cluster pools can also help save costs by using spot instances or terminating idle instances after a certain period of time23.

The other options are incorrect for the following reasons:

B. An automated report needs to be made reproducible. This is not a scenario in which cluster pools are useful, as cluster pools do not affect the reproducibility of the report. Reproducibility depends on factors such as the data sources, the code logic, the runtime environment, and the version control system.
C. An automated report needs to be tested to identify errors. This is not a scenario in which cluster pools are useful, as cluster pools do not help with testing or debugging the report. Testing and debugging depend on tools such as notebooks, logging, monitoring, and error handling.
D. An automated report needs to be version-controlled across multiple collaborators. This is not a scenario in which cluster pools are useful, as cluster pools do not help with version control or collaboration. Version control and collaboration depend on tools such as Git, Databricks Repos, and Databricks Workspace .
E. An automated report needs to be runnable by all stakeholders. This is not a scenario in which cluster pools are useful, as cluster pools do not affect the accessibility or security of the report. Accessibility and security depend on factors such as permissions, roles, authentication, and encryption .

18
Q

Question 3 ( Exam A )

Which of the following is hosted completely in the control plane of the classic Databricks architecture?

A. Worker node
B. JDBC data source
C. Databricks web application
D. Databricks Filesystem
E. Driver node

A

E according to ITExams but Bing Copilot says:

The correct answer is C. Databricks web application.

According to the web search results, the Databricks web application is the user interface that allows you to create and manage workspaces, notebooks, clusters, jobs, and other resources12. The Databricks web application is hosted completely in the control plane of the classic Databricks architecture, which is the management plane where Databricks runs the workspace application and manages notebooks, configuration and clusters1.

The other options are incorrect for the following reasons:

A. Worker node is a compute resource that runs tasks as part of a cluster2. Worker nodes are hosted in the data plane of the classic Databricks architecture, which is where your data is processed by clusters of compute resources2.
B. JDBC data source is an external data source that can be connected to Databricks clusters using the JDBC protocol3. JDBC data sources are not hosted in the control plane or the data plane of the classic Databricks architecture, but in your own or third-party cloud account.
D. Databricks Filesystem (DBFS) is a distributed file system that lets you access data stored in cloud storage as if they were local files2. DBFS is hosted in the data plane of the classic Databricks architecture, as it stores your data at rest in your own cloud account2.
E. Driver node is a compute resource that coordinates distributed execution and runs driver programs as part of a cluster2. Driver nodes are hosted in the data plane of the classic Databricks architecture, as they handle your data processing by communicating with worker nodes2.

19
Q

Question 4 ( Exam A )

Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta Lake?

A. The ability to manipulate the same data using a variety of languages
B. The ability to collaborate in real time on a single notebook
C. The ability to set up alerts for query failures
D. The ability to support batch and streaming workloads
E. The ability to distribute complex data operations

A

D. The ability to support batch and streaming workloads

Bing Copilot agrees:

The correct answer is D. The ability to support batch and streaming workloads.

According to the web search results, Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform1. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale1. Delta Lake is the default storage format for all operations on Azure Databricks1.

The other options are incorrect for the following reasons:

A. The ability to manipulate the same data using a variety of languages. This is not a benefit of Delta Lake, but a benefit of Apache Spark, which is the unified analytics engine that powers the Databricks Lakehouse Platform2. Apache Spark supports multiple languages, such as Python, Scala, SQL, R, and Java2.
B. The ability to collaborate in real time on a single notebook. This is not a benefit of Delta Lake, but a benefit of Databricks Workspace, which is the collaborative environment that allows you to create and manage workspaces, notebooks, clusters, jobs, and other resources3. Databricks Workspace enables real-time collaboration on notebooks with features such as comments, revisions, and access control3.
C. The ability to set up alerts for query failures. This is not a benefit of Delta Lake, but a benefit of Databricks SQL Analytics, which is a service that enables you to run SQL queries on your data lake with low latency and high concurrency3. Databricks SQL Analytics allows you to set up alerts for query failures using email or webhooks3.
E. The ability to distribute complex data operations. This is not a benefit of Delta Lake, but a benefit of Apache Spark, which is the unified analytics engine that powers the Databricks Lakehouse Platform2. Apache Spark can distribute complex data operations across multiple nodes in a cluster using parallel processing and fault tolerance techniques2.
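
A sketch of using a single Delta table for both a batch write and an incremental streaming read (names, paths, and the availableNow trigger are my assumptions):

# Batch write into a Delta table.
spark.range(100).write.format("delta").mode("append").saveAsTable("events_demo")

# Streaming read of the same table, processed incrementally.
query = (spark.readStream.table("events_demo")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events_demo")
    .trigger(availableNow=True)          # process what is currently available, then stop
    .toTable("events_demo_copy"))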

20
Q

Question 5 ( Exam A )

Which of the following describes the storage organization of a Delta table?

A. Delta tables are stored in a single file that contains data, history, metadata, and other attributes.
B. Delta tables store their data in a single file and all metadata in a collection of files in a separate location.
C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.
D. Delta tables are stored in a collection of files that contain only the data stored within the table.
E. Delta tables are stored in a single file that contains only the data stored within the table.

A

C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.

22
Q

Question 6 ( Exam A )

Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the existing Delta table my_table and save the updated table?

A. SELECT * FROM my_table WHERE age > 25;
B. UPDATE my_table WHERE age > 25;
C. DELETE FROM my_table WHERE age > 25;
D. UPDATE my_table WHERE age <= 25;
E. DELETE FROM my_table WHERE age <= 25;

A

C. DELETE FROM my_table WHERE age > 25;

23
Q

Question 7 ( Exam A )

A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted.

Which of the following explains why the data files are no longer present?

A. The VACUUM command was run on the table
B. The TIME TRAVEL command was run on the table
C. The DELETE HISTORY command was run on the table
D. The OPTIMIZE command was run on the table
E. The HISTORY command was run on the table

A

C according to ITExams but A according to Bing Copilot:

The most likely explanation for why the data files are no longer present is A. The VACUUM command was run on the table.

According to the web search results, the VACUUM command can remove data files that are no longer referenced by a Delta table and are older than the retention threshold12. The default retention threshold for data files is 7 days1. Therefore, if the VACUUM command was run on the table, it might have deleted the data files that were needed for time travel to the older version.

The other options are incorrect for the following reasons:

B. The TIME TRAVEL command was run on the table. This is not a valid command in Delta Lake. The correct way to query a table at a specific point in time is to use the VERSION AS OF or TIMESTAMP AS OF clauses3.
C. The DELETE HISTORY command was run on the table. This is not a valid command in Delta Lake. The correct way to retrieve the history of a table is to use the DESCRIBE HISTORY command3. This command does not delete any data files or affect time travel.
D. The OPTIMIZE command was run on the table. This command can improve the performance of a table by coalescing small files into larger ones3. However, this command does not delete any data files or affect time travel.
E. The HISTORY command was run on the table. This is not a valid command in Delta Lake. The correct way to retrieve the history of a table is to use the DESCRIBE HISTORY command3. This command does not delete any data files or affect time travel.
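
For reference, a sketch of the commands involved (the retention value and version number are only illustrative):

# Query an older version while its data files still exist.
spark.sql("SELECT * FROM my_table VERSION AS OF 3").show()

# VACUUM removes data files that are no longer referenced and older than the retention
# threshold, which is what breaks time travel to versions that depended on those files.
spark.sql("VACUUM my_table RETAIN 168 HOURS")

# Restore is only possible while the needed files are still present.
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 3")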

24
Q

Question 8 ( Exam A )

Which of the following Git operations must be performed outside of Databricks Repos?

A. Commit
B. Pull
C. Push
D. Clone
E. Merge

A

D according to ITExams but none according to Bing Copilot:

None of the Git operations listed must be performed outside of Databricks Repos. Databricks Repos supports common Git operations such as commit, pull, push, clone, and merge within the Databricks UI or API123. You can access the Git dialog from a notebook or from the Databricks Repos browser and perform these operations using buttons or commands2. You can also use the Repos REST API to integrate data projects into CI/CD pipelines3.

However, some Git operations are not supported by Databricks Repos, such as:

  • Forking a repository
  • Creating or deleting a remote repository
  • Managing pull requests
  • Configuring Git hooks
  • Using SSH keys for authentication

These operations must be performed outside of Databricks Repos using your Git provider’s interface or tools2.

25
Q

Question 9 ( Exam A )

Which of the following data lakehouse features results in improved data quality over a traditional data lake?

A. A data lakehouse provides storage solutions for structured and unstructured data.
B. A data lakehouse supports ACID-compliant transactions.
C. A data lakehouse allows the use of SQL queries to examine data.
D. A data lakehouse stores data in open formats.
E. A data lakehouse enables machine learning and artificial Intelligence workloads.

A

C according to ITExams but B according to Bing Copilot:

The most likely answer is B. A data lakehouse supports ACID-compliant transactions.

According to the web search results, ACID-compliant transactions are a data management feature that ensure data integrity and consistency by enforcing four properties: atomicity, consistency, isolation, and durability12. A data lakehouse supports ACID-compliant transactions by using a metadata layer, such as Delta Lake, that tracks which files are part of different table versions and enables features like schema enforcement, data validation, and time travel12. These features help improve data quality over a traditional data lake, which lacks such capabilities and can result in data silos, data corruption, or data loss12.

The other options are not directly related to data quality, but rather to other benefits of a data lakehouse, such as:

A. A data lakehouse provides storage solutions for structured and unstructured data. This is a benefit of a data lakehouse that enables flexibility, scalability, and cost-efficiency by using cloud object storage to store a broader range of data types32.
C. A data lakehouse allows the use of SQL queries to examine data. This is a benefit of a data lakehouse that enables performance and reliability for BI and advanced analytics by using new query engine designs that provide high-performance SQL execution on data lakes12.
D. A data lakehouse stores data in open formats. This is a benefit of a data lakehouse that enables compatibility and accessibility for data science and machine learning tools by using open file formats, such as Parquet, that can be easily read by popular tools like pandas, TensorFlow, PyTorch, etc12.
E. A data lakehouse enables machine learning and artificial intelligence workloads. This is a benefit of a data lakehouse that enables innovation and insights for organizations by using the same data for both BI and ML projects without needing to access multiple systems32.

26
Q

Question 10 ( Exam A )

A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their project using Databricks Repos.

Which of the following is an advantage of using Databricks Repos over the Databricks Notebooks versioning?

A. Databricks Repos automatically saves development progress
B. Databricks Repos supports the use of multiple branches
C. Databricks Repos allows users to revert to previous versions of a notebook
D. Databricks Repos provides the ability to comment on specific changes
E. Databricks Repos is wholly housed within the Databricks Lakehouse Platform

A

B according to both ITExams and Bing Copilot:

One advantage of using Databricks Repos over the Databricks Notebooks versioning is B. Databricks Repos supports the use of multiple branches.

According to the web search results, Databricks Repos is a visual Git client and API in Azure Databricks that supports common Git operations such as cloning, committing, pushing, pulling, branch management, and visual comparison of diffs12. Using Databricks Repos, you can develop code in notebooks or other files and follow data science and engineering code development best practices using Git for version control, collaboration, and CI/CD12.

Databricks Notebooks versioning is a legacy feature that allows you to link a notebook to a remote Git repository and sync changes using the Revision History menu3. However, this feature has some limitations compared to Databricks Repos, such as:

  • It does not support the use of multiple branches. You can only link a notebook to one branch in a repository. Databricks recommends using a separate branch for each notebook3.
  • It does not support the use of arbitrary files. You can only link notebooks to a remote Git repository. Databricks Repos supports any file type that can be stored in a Git repository2.
  • It does not support the use of IPYNB notebooks. You can only link notebooks in the DBC or SOURCE format. Databricks Repos supports IPYNB notebooks as well as other formats2.
  • It does not support the use of the Repos REST API. You can only sync changes using the UI. Databricks Repos provides an API that you can integrate with your CI/CD pipeline2.
Therefore, if you need to use multiple branches or any of the other features mentioned above, you should use Databricks Repos instead of Databricks Notebooks versioning.

27
Q

Question 11 ( Exam A )

A data engineer has left the organization. The data team needs to transfer ownership of the data engineer’s Delta tables to a new data engineer. The new data engineer is the lead engineer on the data team.

Assuming the original data engineer no longer has access, which of the following individuals must be the one to transfer ownership of the Delta tables in Data Explorer?

A. Databricks account representative
B. This transfer is not possible
C. Workspace administrator
D. New lead data engineer
E. Original data engineer

A

D according to ITExam but C according to Bing Copilot:

The individual who must be the one to transfer ownership of the Delta tables in Data Explorer is C. Workspace administrator.

According to the web search result 1, only a workspace administrator can transfer ownership of a SQL warehouse, query, dashboard, or alert using the workspace UI, Databricks CLI, or Permissions REST API. A SQL warehouse is a logical grouping of one or more clusters that can access a set of tables and views1. A SQL warehouse can contain Delta tables, which are tables stored in Delta Lake format that provide ACID transactions, scalable metadata handling, and unified batch and streaming processing2.

Therefore, to transfer ownership of the data engineer’s Delta tables, the data team needs to ask a workspace administrator to assign a new owner to the SQL warehouse that contains those tables. The workspace administrator can follow the steps described in the web search result 1 to perform this task using the UI or the API. The new owner must have the Allow cluster creation entitlement1.

28
Q

Question 12 ( Exam A )

A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.

Which of the following commands could the data engineering team use to access sales in PySpark?

A. SELECT * FROM sales
B. There is no way to share data between PySpark and SQL.
C. spark.sql("sales")
D. spark.delta.table("sales")
E. spark.table("sales")

A

D according to ITExams but that is clearly wrong. According to Bing Copilot:

The command that the data engineering team could use to access sales in PySpark is E. spark.table(“sales”).

According to the web search result 2, Delta tables can be accessed from PySpark using the spark.table() function, which returns a DataFrame that represents the table. For example, the following code loads the Delta table sales into a DataFrame called df:

df = spark.table("sales")

The other options are incorrect for the following reasons:

A. SELECT * FROM sales. This is a SQL query, not a PySpark command. To execute SQL queries from PySpark, you need to use the spark.sql() function, which returns a DataFrame that contains the result of the query1.

B. There is no way to share data between PySpark and SQL. This is false, as PySpark and SQL can both access the same data sources, such as Delta tables, using different APIs2. You can also register DataFrames as temporary views and query them using SQL1.

C. spark.sql(“sales”). This is not a valid SQL query, as it does not specify any columns or clauses. To load a table using spark.sql(), you need to use a valid query, such as SELECT * FROM sales1.

D. spark.delta.table(“sales”). This is not a valid PySpark function, as there is no delta attribute in the SparkSession object. To load a Delta table using PySpark, you need to use either spark.table() or spark.read.format(“delta”).load()2.

https://sparkbyexamples.com/spark/spark-spark-table-vs-spark-read-table/
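
A sketch of the kind of PySpark test the question describes (the column names and rules are hypothetical):

df = spark.table("sales")

# Simple data-quality checks expressed in Python rather than SQL.
assert df.filter(df.customer_id.isNull()).count() == 0, "sales contains null customer_id values"
assert df.filter(df.amount < 0).count() == 0, "sales contains negative amounts"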

29
Q

Question 13 ( Exam A )

Which of the following commands will return the location of database customer360?

A. DESCRIBE LOCATION customer360;
B. DROP DATABASE customer360;
C. DESCRIBE DATABASE customer360;
D. ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');
E. USE DATABASE customer360;

A

C according to both ITExams and Bing Copilot:

The correct answer is C. DESCRIBE DATABASE customer360.

According to the web search result 1, the DESCRIBE DATABASE command returns the metadata of an existing database, such as the database name, description, location, and properties. For example, the following command returns the location of the database customer360:

DESCRIBE DATABASE customer360;

The output of this command will look something like this:

Database Name: customer360
Description:
Location: dbfs:/customer/customer360
Properties: ()

The other options are incorrect for the following reasons:

A. DESCRIBE LOCATION customer360. This is not a valid command in Spark SQL. There is no LOCATION keyword in the DESCRIBE statement1.
B. DROP DATABASE customer360. This command will delete the database customer360 and all its tables, not return its location1.
D. ALTER DATABASE customer360 SET DBPROPERTIES (‘location’ = ‘/user’). This command will change the location property of the database customer360 to ‘/user’, not return its current location1.
E. USE DATABASE customer360. This command will set the current database to customer360, not return its location1.

30
Q

Question 14 ( Exam A )

A data engineer wants to create a new table containing the names of customers that live in France.
They have written the following command:

CREATE TABLE customersInFrance
___ AS
SELECT id,
firstName,
lastName
FROM customerLocations
WHERE country = 'France'

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).
Which of the following lines of code fills in the above blank to successfully complete the task?

A. There is no way to indicate whether a table contains PII.
B. "COMMENT PII"
C. TBLPROPERTIES PII
D. COMMENT "Contains PII"
E. PII

A

C according to ITExams, but that is clearly not correct, since TBLPROPERTIES requires one or more key-value pairs in parentheses.

Bing Copilot:

The correct answer is D. COMMENT “Contains PII”.

According to the web search result 2, a table property is a key-value pair that can be initialized when creating a table or a view using the TBLPROPERTIES clause. A table property can be used to tag tables with information not tracked by SQL, such as user-defined metadata. For example, the following command creates a table with two user-defined properties:

CREATE TABLE test (c1 INT) TBLPROPERTIES ('key1' = 'value1', 'key2' = 'value2');

According to the web search result 1, a table comment is a string literal that can be initialized when creating a table or a view using the COMMENT clause. A table comment can be used to provide a description of the table or its contents. For example, the following command creates a table with a comment:

CREATE TABLE test (c1 INT) COMMENT 'This is a test table';

Therefore, to indicate that a table contains PII, it is more appropriate to use a comment than a property, as a comment is meant to describe the table, while a property is meant to store additional information. The syntax for adding a comment to a table is COMMENT followed by a string literal enclosed in quotation marks. Hence, the correct line of code to fill in the blank is COMMENT “Contains PII”.

https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html

CREATE TABLE person_names
COMMENT "Only the names of persons"
AS SELECT first_name, last_name
FROM persons;
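
Putting that together for the question's table (assumes the customerLocations table from the question exists; a TBLPROPERTIES tag could be added alongside the comment if a machine-readable flag is also wanted):

spark.sql("""
    CREATE TABLE customersInFrance
    COMMENT "Contains PII"
    AS SELECT id, firstName, lastName
    FROM customerLocations
    WHERE country = 'France'
""")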

31
Q

Question 15 ( Exam A )

Which of the following benefits is provided by the array functions from Spark SQL?

A. An ability to work with data in a variety of types at once
B. An ability to work with data within certain partitions and windows
C. An ability to work with time-related data in specified intervals
D. An ability to work with complex, nested data ingested from JSON files
E. An ability to work with an array of tables for procedural automation

A

B according to ITExams but that is clearly wrong. Bing Copilot:

One of the benefits of using the array functions from Spark SQL is D. An ability to work with complex, nested data ingested from JSON files.

According to the web search result 3, Spark SQL provides built-in standard array functions that can be used to manipulate and transform array columns. These functions can be useful when working with complex, nested data ingested from JSON files, as JSON data often contains arrays of objects or values. For example, you can use the array function to create an array column from multiple columns, the explode function to split an array column into multiple rows, or the array_contains function to check if an array column contains a specific value1. You can also use other array functions to perform operations such as sorting, filtering, joining, removing, or aggregating array elements1.
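
A sketch with hypothetical nested data showing two of the array functions mentioned:

from pyspark.sql import functions as F

# Hypothetical records similar to what arrives from nested JSON ingestion.
df = spark.createDataFrame(
    [("order1", ["book", "pen"]), ("order2", ["laptop"])],
    ["order_id", "items"],
)

df.select("order_id", F.explode("items").alias("item")).show()              # one row per array element
df.select("order_id", F.array_contains("items", "pen").alias("has_pen")).show()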

32
Q

Question 16 ( Exam A )

Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?

A. DROP
B. IGNORE
C. MERGE
D. APPEND
E. INSERT

A

C. MERGE
