ITExams Flashcards
A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF?
A.
CREATE FUNCTION combine_nyc
RETURNS STRING
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;
B.
CREATE UDF combine_nyc
RETURNS STRING
CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;
C.
CREATE UDF combine_nyc
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;
D.
CREATE FUNCTION combine_nyc
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;
E.
CREATE UDF combine_nyc
RETURNS STRING
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;
A.
CREATE FUNCTION combine_nyc
RETURNS STRING
RETURN CASE
WHEN city="brooklyn" THEN "new york"
ELSE city
END;
(ITExams says E, but I say A)
https://docs.databricks.com/en/udf/index.html
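As a worked reference (not part of the exam question), here is a minimal PySpark sketch of defining and applying such a SQL UDF. Note that a complete CREATE FUNCTION statement also needs a parameter list, which the exam options leave out; the stores table and city column are the ones from the question, everything else is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SQL UDF: CREATE FUNCTION with a parameter list, RETURNS for the return type,
# and RETURN for the body -- the combination that only option A gets right.
spark.sql("""
    CREATE OR REPLACE FUNCTION combine_nyc(city STRING)
    RETURNS STRING
    RETURN CASE WHEN city = 'brooklyn' THEN 'new york' ELSE city END
""")

# Apply the UDF at scale to the question's stores table.
spark.sql("SELECT combine_nyc(city) AS city FROM stores").show()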
Question 1
Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?
A. DROP
B. IGNORE
C. MERGE
D. APPEND
E. INSERT
C. MERGE
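As a hedged illustration of why MERGE is the answer: the WHEN NOT MATCHED branch inserts only the source rows whose key is not already in the target, so re-running the write never creates duplicates. The table names and the transaction_id key below are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Insert only the source rows whose key does not already exist in the target,
# so repeated runs never write duplicate records into the Delta table.
spark.sql("""
    MERGE INTO transactions AS t
    USING daily_sales_updates AS s
    ON t.transaction_id = s.transaction_id
    WHEN NOT MATCHED THEN INSERT *
""")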
Question 3
A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.
Which of the following approaches could be used by the data engineering team to complete this task?
A. They could submit a feature request with Databricks to add this functionality.
B. They could wrap the queries using PySpark and use Python’s control flow system to determine when to run the final query.
C. They could only run the entire program on Sundays.
D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.
E. They could redesign the data model to separate the data used in the final query into a new table.
B. They could wrap the queries using PySpark and use Python’s control flow system to determine when to run the final query.
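A minimal sketch of answer B, assuming the analyst's queries have been wrapped in spark.sql calls; the query strings are stand-ins and the Sunday check uses Python's datetime module.

import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the queries that should run every day.
daily_queries = [
    "SELECT 'first query' AS step",
    "SELECT 'second query' AS step",
]
for q in daily_queries:
    spark.sql(q).show()

# Python's control flow decides whether the final query runs.
# datetime.date.weekday() returns 0 for Monday, so Sunday is 6.
if datetime.date.today().weekday() == 6:
    spark.sql("SELECT 'final Sunday-only query' AS step").show()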
Question 4
A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location “/transactions/raw”.
Today, the data engineer runs the following command to complete this task:
COPY INTO transactions
FROM "/transactions/raw"
FILEFORMAT = PARQUET;
After running the command today, the data engineer notices that the number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied any new records into the table?
A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
B. The names of the files to be copied were not included with the FILES keyword.
C. The previous day’s file has already been copied into the table.
D. The PARQUET file format does not support COPY INTO.
E. The COPY INTO statement requires the table to be refreshed to view the copied rows.
C. The previous day’s file has already been copied into the table.
https://docs.databricks.com/en/ingestion/copy-into/index.html
“The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and idempotent operation; files in the source location that have already been loaded are skipped.”
Bing Copilot:
One possible reason why the statement might not have copied any new records into the table is C. The previous day’s file has already been copied into the table.
According to the web search result 1, the COPY INTO command can load data from cloud object storage into a table in your Databricks workspace. The COPY INTO command supports idempotent loading, which means that it only loads files that have not been previously loaded into the target table. This feature helps avoid data duplication and ensures exactly-once semantics.
Therefore, if the previous day’s file has already been copied into the table, running the COPY INTO command again will not load any new records into the table. The COPY INTO command will skip the file that has already been loaded and return output like this:
+----------+-----------+
| numFiles | numCopied |
+----------+-----------+
|        1 |         0 |
+----------+-----------+
The other options are incorrect for the following reasons:
A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword. FORMAT_OPTIONS is an optional clause for passing format-specific options; the file format itself is specified with the FILEFORMAT clause, which is already present in the statement, so this is not the reason no records were copied.
B. The names of the files to be copied were not included with the FILES keyword. FILES is an optional clause for restricting the load to specific files; when it is omitted, all files under the FROM location are considered, so this is not the reason either.
D. The PARQUET file format does not support COPY INTO. This is false, as PARQUET is one of the supported file formats for COPY INTO.
E. The COPY INTO statement requires the table to be refreshed to view the copied rows. This is false, as COPY INTO does not require any additional steps to view the copied rows. The table will be automatically updated with the new data after running COPY INTO.
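To make the idempotent behaviour concrete, a sketch run through PySpark using the statement from the question; the COPY_OPTIONS ('force' = 'true') clause at the end is the documented way to reload files that have already been ingested, should that ever be the intent.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Re-running the same COPY INTO is idempotent: files already loaded are skipped,
# which is why today's run reported no new rows.
spark.sql("""
    COPY INTO transactions
    FROM '/transactions/raw'
    FILEFORMAT = PARQUET
""")

# To deliberately reload files that were already ingested:
spark.sql("""
    COPY INTO transactions
    FROM '/transactions/raw'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('force' = 'true')
""")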
Question 5
A data engineer needs to create a table in Databricks using data from their organization’s existing SQLite database.
They run the following command:
CREATE TABLE jdbc_customer360
USING ___
OPTIONS (
url "jdbc:sqllite:/customers.db",
dbtable "customer360"
)
Which of the following lines of code fills in the above blank to successfully complete the task?
A. org.apache.spark.sql.jdbc
B. autoloader
C. DELTA
D. sqlite
E. org.apache.spark.sql.sqlite
According to ITExams: E. org.apache.spark.sql.sqlite
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html
Bing Copilot:
The correct answer is A. org.apache.spark.sql.jdbc.
According to the web search results, org.apache.spark.sql.jdbc is the built-in JDBC data source that can be used to connect to various relational databases, such as SQLite, MySQL, PostgreSQL, etc. To create a table in Databricks using data from a JDBC data source, you need to specify the USING clause with the value org.apache.spark.sql.jdbc and provide the OPTIONS clause with the connection properties, such as url, dbtable, user, password, etc.
The other options are incorrect for the following reasons:
B. autoloader. This is not a valid data source for creating a table in Databricks. Autoloader is a feature that enables you to incrementally load files from cloud storage into Delta Lake tables using Structured Streaming.
C. DELTA. This is not a valid data source for creating a table in Databricks using data from an SQLite database. DELTA is the data source for creating tables in Delta Lake format, which provides ACID transactions, scalable metadata handling, and unified batch and streaming processing.
D. sqlite. This is not a valid data source for creating a table in Databricks. There is no built-in data source for SQLite in Spark SQL. You need to use the JDBC data source with the appropriate driver and connection properties.
E. org.apache.spark.sql.sqlite. This is not a valid data source for creating a table in Databricks. There is no such package or class in Spark SQL that supports SQLite as a data source. You need to use the JDBC data source with the appropriate driver and connection properties.
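For reference, a sketch of the completed statement (answer A) run through PySpark. It assumes the standard jdbc:sqlite URL prefix and that a SQLite JDBC driver is installed on the cluster; the table name and database path are the ones quoted in the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# USING the generic JDBC data source; connection details go in OPTIONS.
spark.sql("""
    CREATE TABLE jdbc_customer360
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url 'jdbc:sqlite:/customers.db',
      dbtable 'customer360'
    )
""")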
Question 6
A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.
Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?
A. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INNER JOIN SELECT * FROM april_transactions;
B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions;
C. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
OUTER JOIN SELECT * FROM april_transactions;
D. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INTERSECT SELECT * from april_transactions;
E. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
MERGE SELECT * FROM april_transactions;
B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions;
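B is correct because UNION (shorthand for UNION DISTINCT) removes duplicate rows across its inputs, while UNION ALL keeps them. Since the question states there are no duplicates between the two months, either would return the same rows here, but UNION is the only offered option with valid syntax and it guarantees a duplicate-free result. A quick PySpark check of the difference on throwaway data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two tiny inputs that share one identical row (id = 2).
spark.sql("SELECT 1 AS id UNION ALL SELECT 2").createOrReplaceTempView("march")
spark.sql("SELECT 2 AS id UNION ALL SELECT 3").createOrReplaceTempView("april")

# UNION deduplicates: rows 1, 2, 3
spark.sql("SELECT * FROM march UNION SELECT * FROM april").show()

# UNION ALL keeps the duplicate: rows 1, 2, 2, 3
spark.sql("SELECT * FROM march UNION ALL SELECT * FROM april").show()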
Question 7
A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.
Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?
A. if day_of_week = 1 and review_period:
B. if day_of_week = 1 and review_period = "True":
C. if day_of_week == 1 and review_period == "True":
D. if day_of_week == 1 and review_period:
E. if day_of_week = 1 & review_period: = "True":
ITExams says C but I say D:
D. if day_of_week == 1 and review_period:
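A quick check of why D (and not C) is the right choice when review_period holds the boolean True; the values below are just examples.

day_of_week = 1
review_period = True

# Option C compares a boolean to the string "True", which is never equal.
print(review_period == "True")                        # False
print(day_of_week == 1 and review_period == "True")   # False

# Option D relies on the truthiness of the boolean itself.
if day_of_week == 1 and review_period:
    print("final block runs")                         # this prints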
Question 8
A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data. They run the following command:
DROP TABLE IF EXISTS my_table;
While the object no longer appears when they run SHOW TABLES, the data files still exist.
Which of the following describes why the data files still exist and the metadata files were deleted?
A. The table’s data was larger than 10 GB
B. The table’s data was smaller than 10 GB
C. The table was external
D. The table did not have a location
E. The table was managed
C. The table was external
Bing Copilot:
The most likely reason why the data files still exist and the metadata files were deleted is C. The table was external.
According to the web search result 2, an external table is a table that is created using the EXTERNAL keyword (or, on Databricks, by specifying a LOCATION) and points to an existing location in the file system. An external table does not manage the data files, but only stores the metadata information in the metastore database. Therefore, when you drop an external table, only the metadata information is removed, but the data files are not deleted.
On the other hand, a managed table is a table that is created without the EXTERNAL keyword or a LOCATION clause and stores both the data files and the metadata information in a default location managed by Spark. A managed table manages the data files, and deletes them when the table is dropped.
Therefore, if you drop a table and only the metadata is removed while the data files remain, the table was external and not managed by Spark. Dropping an external table never deletes the underlying data files; to remove them as well, you have to delete the files from the storage location yourself.
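If the leftover files do need to be removed, a minimal sketch, assuming a Databricks notebook (where spark and dbutils are predefined) and a made-up storage path:

# Removes only the metadata for an external table.
spark.sql("DROP TABLE IF EXISTS my_table")

# Delete the underlying files yourself; the path is hypothetical and the
# second argument makes the delete recursive.
dbutils.fs.rm("dbfs:/mnt/data/my_table/", True)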
Question 9
A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.
Which of the following data entities should the data engineer create?
A. Database
B. Function
C. View
D. Temporary view
E. Table
E. Table because it must be saved to a physical location
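A small sketch of the distinction, with invented names: a temporary view lives only in the current session and writes nothing to storage, while a table created from the same query is persisted to a physical location and is visible to other engineers in other sessions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Session-scoped: gone when the session ends, nothing written to storage.
spark.sql("CREATE OR REPLACE TEMP VIEW customer_orders_tmp AS SELECT 1 AS id")

# Persisted to a physical location and usable from other sessions.
spark.sql("CREATE TABLE customer_orders AS SELECT 1 AS id")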
Question 10
A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.
Which of the following tools can the data engineer use to solve this problem?
A. Unity Catalog
B. Data Explorer
C. Delta Lake
D. Delta Live Tables
E. Auto Loader
C according to ITExams but that is wrong. D. Delta Live Tables is correct.
Bing Copilot:
One possible tool that the data engineer can use to solve this problem is D. Delta Live Tables.
According to the web search result 1, Delta Live Tables is a feature that enables you to build reliable and scalable data pipelines using SQL or Python. Delta Live Tables allows you to define data quality constraints on the contents of a dataset using expectations. Expectations are clauses that apply data quality checks on each record passing through a query and specify an action to take when a record fails the expectation, such as warn, drop, or fail.
By using Delta Live Tables, the data engineer can monitor the quality level of the source data and handle any invalid records according to their needs. Delta Live Tables also provides data quality metrics, such as the number of records that violate an expectation, that can be viewed by querying the Delta Live Tables event log. Additionally, Delta Live Tables supports idempotent loading, which means that it only loads files that have not been previously loaded into the target table, ensuring exactly-once semantics.
The other options are not suitable tools for solving this problem, for the following reasons:
A. Unity Catalog is a centralized governance layer that provides a consistent view of all your data assets across different data sources. Unity Catalog is not a tool for monitoring data quality or handling invalid records.
B. Data Explorer is a web-based interface for discovering and exploring data objects such as catalogs, schemas, and tables, and for managing permissions. Data Explorer is not a tool for monitoring data quality or handling invalid records.
C. Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform. Delta Lake provides features such as ACID transactions, schema enforcement, and time travel, but it does not provide built-in data quality checks or actions.
E. Auto Loader is a feature that enables you to incrementally load files from cloud storage into Delta Lake tables using Structured Streaming. Auto Loader can help with data ingestion and processing, but it does not provide built-in data quality checks or actions.
Example:
CREATE OR REFRESH LIVE TABLE count_verification(
CONSTRAINT no_rows_dropped EXPECT (a_count == b_count)
) AS SELECT * FROM
(SELECT COUNT(*) AS a_count FROM LIVE.tbla),
(SELECT COUNT(*) AS b_count FROM LIVE.tblb)
https://docs.databricks.com/en/delta-live-tables/expectations.html#language-python
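Since the linked page also covers the Python API, a hedged sketch of the same idea with a Python expectation; the dataset name, source table, and constraint are invented, and the code only runs inside a Delta Live Tables pipeline (the dlt module is not available in a plain notebook).

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Ingested orders with a basic quality gate")
@dlt.expect_or_drop("valid_city", "city IS NOT NULL")  # drop records that fail the check
def clean_orders():
    # Hypothetical bronze dataset defined elsewhere in the same pipeline.
    return dlt.read("raw_orders").withColumn("ingested_at", F.current_timestamp())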
Question 11
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Production mode using the Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
B. All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
E. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
According to ITExams: E. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
But I say C along with Bing Copilot:
The correct answer is C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
According to the documentation, a Delta Live Tables pipeline can run in two modes: triggered or continuous. In triggered mode, the pipeline runs once and then stops. In continuous mode, the pipeline runs continuously until it is manually stopped or encounters an error. The pipeline mode can be specified in the pipeline configuration or in the UI when creating or editing a pipeline.
When a pipeline runs in continuous mode, it updates all datasets at regular intervals based on the refresh interval setting. The refresh interval can be specified in seconds, minutes, hours, or days. The default refresh interval is 15 minutes. The refresh interval can be changed in the pipeline configuration or in the UI.
When a pipeline runs in production mode, it uses ephemeral clusters to execute the updates. Ephemeral clusters are clusters that are created and terminated by Delta Live Tables as needed. They are not visible in the Clusters page of the workspace UI. Ephemeral clusters help reduce costs and optimize resource utilization by only using compute resources when they are needed.
Therefore, if a Delta Live Tables pipeline is configured to run in production mode and continuous mode, it will update all datasets at set intervals until the pipeline is shut down, and it will use ephemeral clusters that will be deployed for the update and terminated when the pipeline is stopped.
Question 12
In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?
A. Checkpointing and Write-ahead Logs
B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
C. Replayable Sources and Idempotent Sinks
D. Write-ahead Logs and Idempotent Sinks
E. Checkpointing and Idempotent Sinks
According to ITExams:
E. Checkpointing and Idempotent Sinks
Bing Copilot:
The correct answer is A. Checkpointing and Write-ahead Logs.
According to the documentation, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. Checkpointing is a mechanism to save the state of a streaming query to a reliable storage system, such as HDFS or S3. Write-ahead logs are files that record the offsets and data of each batch before processing, so that they can be replayed in case of a failure. Together, these two techniques ensure that Structured Streaming can reliably track the exact progress of the processing and handle any kind of failure by restarting and/or reprocessing.
Replayable sources and idempotent sinks are also important concepts in Structured Streaming, but they are not used to record the offset range of the data being processed in each trigger. Replayable sources are sources that can replay the same data multiple times, such as Kafka or file-based sources. Idempotent sinks are sinks that can handle duplicate writes without changing the final result, such as file sinks or some database sinks. These features help Structured Streaming achieve end-to-end exactly-once semantics by ensuring that the same data is not processed or written more than once.
“In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.” (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
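A sketch of where this shows up in practice: the checkpointLocation passed to the stream writer is where Structured Streaming persists the offset ranges and write-ahead logs, so a restarted query resumes exactly where it left off. Paths, schema, and formats below are placeholders, and the Delta output assumes a Delta-enabled runtime.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical streaming read from a directory of JSON files.
stream_df = (
    spark.readStream
    .format("json")
    .schema("id INT, city STRING")
    .load("/tmp/demo/source")
)

# The checkpoint directory stores progress (offsets processed per trigger),
# which is what allows recovery after a failure or shutdown.
query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/demo/_checkpoints/transactions")
    .start("/tmp/demo/target")
)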
Question 13
Which of the following describes the relationship between Gold tables and Silver tables?
A. Gold tables are more likely to contain aggregations than Silver tables.
B. Gold tables are more likely to contain valuable data than Silver tables.
C. Gold tables are more likely to contain a less refined view of data than Silver tables.
D. Gold tables are more likely to contain more data than Silver tables.
E. Gold tables are more likely to contain truthful data than Silver tables.
According to ITExams:
C. Gold tables are more likely to contain a less refined view of data than Silver tables.
This is for sure not correct.
Bing Copilot:
The correct answer is A. Gold tables are more likely to contain aggregations than Silver tables.
According to the web search results, Gold tables are the final layer of the medallion lakehouse architecture, which provide business-level aggregates often used for reporting and dashboarding. Gold tables can be fact and dimension tables, aggregate tables, or curated data sets. Gold tables represent data that has been transformed into knowledge, rather than just information.
Silver tables are the intermediate layer of the medallion lakehouse architecture, which provide a more refined view of the data from the Bronze layer. Silver tables are matched, merged, conformed and cleansed (“just-enough”) so that they can provide an “Enterprise view” of all its key business entities, concepts and transactions. Silver tables enable self-service analytics for ad-hoc reporting, advanced analytics and ML. Silver tables are also at the event/measurement level, but they are highly-refined and are ready for queries, reporting, dashboards, etc.
Therefore, Gold tables are more likely to contain aggregations than Silver tables, as they are designed to provide business-level insights and summaries from the data in the Silver layer.
After asking about option B:
Option B is incorrect. Gold tables are not more likely to contain valuable data than Silver tables. Both Gold and Silver tables contain valuable data, but they serve different purposes and audiences. Gold tables provide business-level aggregates that are often used for reporting and dashboarding, while Silver tables provide a more refined view of the data from the Bronze layer that enable self-service analytics for ad-hoc reporting, advanced analytics and ML. The value of the data depends on the use case and the user’s needs. Therefore, option B is not a valid way to describe the relationship between Gold and Silver tables.
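To make "Gold tables are more likely to contain aggregations" concrete, a sketch that builds a gold-level summary from a hypothetical silver table; every table and column name here is invented.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Gold layer: business-level aggregate derived from the refined silver table.
spark.sql("""
    CREATE OR REPLACE TABLE gold_daily_sales AS
    SELECT order_date,
           region,
           SUM(amount) AS total_sales,
           COUNT(*)    AS num_orders
    FROM silver_cleaned_orders
    GROUP BY order_date, region
""")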
Question 14
Which of the following describes the relationship between Bronze tables and raw data?
A. Bronze tables contain less data than raw data files.
B. Bronze tables contain more truthful data than raw data.
C. Bronze tables contain aggregates while raw data is unaggregated.
D. Bronze tables contain a less refined view of data than raw data.
E. Bronze tables contain raw data with a schema applied.
C according to ITExams, but that is clearly wrong.
Bing Copilot:
The correct answer is E. Bronze tables contain raw data with a schema applied.
According to the web search results, Bronze tables are the first layer of the medallion lakehouse architecture, which store raw data ingested from various sources, such as RDBMS data, JSON files, IoT data, etc. The table structures in this layer correspond to the source system table structures “as-is”, along with any additional metadata columns that capture the load date/time, process ID, etc. The focus in this layer is quick Change Data Capture and the ability to provide an historical archive of source (cold storage), data lineage, auditability, reprocessing if needed without rereading the data from the source system.
Bronze tables are stored in Delta Lake format, which is a storage layer that brings reliability and performance to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake also enables schema enforcement and evolution for diverse data types. Therefore, Bronze tables contain raw data with a schema applied, either implicitly or explicitly.
The other options are incorrect for the following reasons:
A. Bronze tables contain less data than raw data files. This is false, as Bronze tables contain all the data from the raw data files, plus some additional metadata columns.
B. Bronze tables contain more truthful data than raw data. This is false, as Bronze tables do not perform any transformations or validations on the raw data. They simply store the data as it is from the source systems.
C. Bronze tables contain aggregates while raw data is unaggregated. This is false, as Bronze tables do not perform any aggregations on the raw data. They store the data at the event or measurement level.
D. Bronze tables contain a less refined view of data than raw data. This is false, as Bronze tables provide a more refined view of data than raw data by applying a schema and storing the data in a reliable and performant format.
Question 15
Which of the following tools is used by Auto Loader to process data incrementally?
A. Checkpointing
B. Spark Structured Streaming
C. Data Explorer
D. Unity Catalog
E. Databricks SQL
B according to ITExams, and also according to Bing Copilot:
The correct answer is B. Spark Structured Streaming.
According to the web search results, Auto Loader provides a Structured Streaming source called cloudFiles. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Structured Streaming allows you to express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.
The other options are incorrect for the following reasons:
A. Checkpointing is not a tool, but a mechanism to save the state of a streaming query to a reliable storage system. Auto Loader uses checkpointing to track ingestion progress and ensure exactly-once semantics, but it is not the main tool it uses to process data incrementally.
C. Data Explorer is a web-based interface for discovering and exploring data objects such as catalogs, schemas, and tables, and for managing permissions. Data Explorer is not related to Auto Loader or incremental data processing.
D. Unity Catalog is a centralized governance layer that provides a consistent view of all your data assets across different data sources. Unity Catalog is not related to Auto Loader or incremental data processing.
E. Databricks SQL is a service that enables you to run SQL queries on your data lake with low latency and high concurrency. Databricks SQL can query tables populated by Auto Loader, but it is not the tool that Auto Loader uses to process data incrementally.
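A sketch of what that looks like: Auto Loader is exposed as the cloudFiles Structured Streaming source, and the stream plus its checkpoint is what makes ingestion incremental. Paths, the file format, and the target table name are placeholders, and the snippet assumes a Databricks runtime where Auto Loader and Delta are available.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader: the cloudFiles source discovers and reads only new files.
raw_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                            # format of incoming files
    .option("cloudFiles.schemaLocation", "/tmp/demo/_schemas/raw")  # schema tracking location
    .load("/tmp/demo/landing")
)

# Structured Streaming plus the checkpoint keeps track of which files have
# already been processed, so each run only ingests new data.
(
    raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/demo/_checkpoints/raw")
    .trigger(availableNow=True)
    .toTable("bronze_raw_events")
)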