Data Pipelines with Delta Live Tables and Spark SQL Flashcards

Question 1

Q

Which of the following correctly describes how code from one library notebook can be referenced by code from another library notebook? Select one response.

A) Within a DLT Pipeline, code in any notebook library can reference tables and views created in any other notebook library.
B) Within a DLT Pipeline, code in a notebook library can reference tables and views created in another notebook library as long as one notebook library references the other notebook library.
C) Within a DLT Pipeline, code in a notebook library can reference tables and views created in another notebook library that is running on the same cluster.
D) Within a DLT Pipeline, code in a notebook library can reference tables and views created in another notebook library as long as the referenced notebook library is installed on the other notebook library’s cluster.
E) Within a DLT Pipeline, code in notebook libraries cannot reference tables and views created in a different notebook library.

Answer

A

A) Within a DLT Pipeline, code in any notebook library can reference tables and views created in any other notebook library.

Question 2

Q

A data engineer needs to query a Delta Live Table (DLT) in a notebook. The notebook is not attached to a DLT pipeline.

Which of the following correctly describes the form of results that the query returns? Select one response.

A) Live queries outside of DLT will return snapshot results from DLT tables only if they were defined as a batch table.
B) Queries outside of DLT will return snapshot results from DLT tables only if they were defined as a streaming table
C) Queries outside of DLT will return the most recent version from DLT tables, regardless of how they were defined.
D) Queries outside of DLT will return the most recent version from DLT tables only if they were defined as a streaming table
E) Queries outside of DLT will return snapshot results from DLT tables, regardless of how they were defined.

Answer

A

E) Queries outside of DLT will return snapshot results from DLT tables, regardless of how they were defined.

Question 3

Q

A data engineer is using the code below to create a new table transactions_silver from the table transaction_bronze. However, when running the code, an error is thrown.

CREATE OR REFRESH STREAMING LIVE TABLE transactions_silver
(CONSTRAINT valid_date EXPECT (order_timestamp > “2022-01-01”) ON VIOLATION DROP ROW)
FROM LIVE.transactions_bronze

Which of the following statements correctly identifies the error and the stage at which the error was thrown? Select one response.

A) A SELECT statement needs to be added to create the columns for the transactions_silver table. The error will be detected during the Setting Up Tables stage.
B) LIVE.orders_bronze needs to be changed to STREAM(LIVE.orders_bronze). The error will be detected during the Setting Up Tables stage.
C) LIVE.orders_bronze needs to be changed to STREAM(LIVE.orders_bronze). The error will be detected during the Initializing stage.
D) The EXPECT statement needs to be changed to EXPECT (order_timestamp is NOT NULL). The error will be detected during the Setting Up Tables stage.
E) A SELECT statement needs to be added to create the columns for the transactions_silver table. The error will be detected during the Initializing stage.

Answer

A

E) A SELECT statement needs to be added to create the columns for the transactions_silver table. The error will be detected during the Initializing stage.

Question 4

Q

A data engineer needs to examine how data is flowing through tables within their pipeline.

Which of the following correctly describes how they can accomplish this? Select one response.

A) The data engineer can combine the flow definitions for all of the tables into one query.
B) The data engineer can view the flow definition of each table in the pipeline from the Pipeline Events log.
C) The data engineer can query the flow definition for the direct successor of the table and then combine the results.
D) The data engineer can query the flow definition for each table and then combine the results.
E) The data engineer can query the flow definition for the direct predecessor of each table and then combine the results.

Answer

A

E) The data engineer can query the flow definition for the direct predecessor of each table and then combine the results.

Question 5

Q

A data engineer has built and deployed a DLT pipeline. They want to see the output for each individual task.

Which of the following describes how to explore the output for each task in the pipeline? Select one response.

A) They can display the output for each individual command from within the notebook using the %run command.
B) They can go to the Pipeline Details page and click on the individual tables in the resultant Directed Acyclic Graph (DAG).
C) They can run the commands connected to each task from within the DLT notebook after deploying the pipeline.
D) They can go to the Job Runs page and click on the individual tables in the job run history.
E) They can specify a folder for the task run details during pipeline configuration.

Answer

A

B) They can go to the Pipeline Details page and click on the individual tables in the resultant Directed Acyclic Graph (DAG).

Question 6

Q

A data engineer has created the following query to create a streaming live table from transactions.

Code block:
CREATE OR REFRESH STREAMING LIVE TABLE transactions
AS SELECT timestamp(transaction_timestamp) AS transaction_timestamp, * EXCEPT (transaction_timestamp, source)
________________________

Which of the following lines of code correctly fills in the blank? Select two responses.

A) FROM STREAMING LIVE.transactions
B) FROM LIVE.transactions
C) FROM STREAMING LIVE (transactions)
D) FROM DELTA STREAM(LIVE.transactions)
E) FROM STREAM(LIVE.transactions)

Answer

A

B) FROM LIVE.transactions
E) FROM STREAM(LIVE.transactions)

Question 7

Q

A data engineer has a Delta Live Tables (DLT) pipeline that uses a change data capture (CDC) data source. They need to write a quality enforcement rule that ensures that values in the column operation do not contain null values. If the constraint is violated, the associated records cannot be included in the dataset.

Which of the following constraints does the data engineer need to use to enforce this rule? Select two responses.

A) CONSTRAINT valid_operation EXCEPT (operation) ON VIOLATION DROP ROW
B) CONSTRAINT valid_operation EXCEPT (operation) ON VIOLATION DROP ROW
C) CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW
D) CONSTRAINT valid_operation ON VIOLATION FAIL UPDATE
E) CONSTRAINT valid_operation EXCEPT (operation != null) ON VIOLATION FAIL UPDATE
CONSTRAINT valid_operation EXPECT (operation IS NOT NULL)

Answer

A

C) CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW

Question 8

Q

Which of the following are advantages of using a Delta Live Tables (DLT) pipeline over a traditional ETL pipeline in Databricks? Select two responses.

A) DLT provides granular observability into pipeline operations and automatic error handling.
B) DLT leverages additional metadata over other open source formats such as JSON, CSV, and Parquet.
C) DLT has built-in quality controls and data quality monitoring.
D) DLT decouples compute and storage costs regardless of scale.
E) DLT automates data management through physical data optimizations and schema evolution.

Answer

A

A) DLT provides granular observability into pipeline operations and automatic error handling.
C) DLT has built-in quality controls and data quality monitoring.

Question 9

Q

Which of the following correctly describes how to access contents of the table directory? Select one response.

A) The contents of the table directory can be viewed through the event log
B) The contents of the table directory can be viewed through the Auto Loader directory.
C) The contents of the table directory can be viewed through the flow definition’s output dataset.
D) The contents of the table directory can be viewed through the checkpointing directory.
E) The contents of the table directory can be viewed through the metastore.

Answer

A

E) The contents of the table directory can be viewed through the metastore.

Question 10

Q

A data engineer is creating a live streaming table to be used by other members of their team. They want to indicate that the table contains silver quality data.

Which of the following describes how the data engineer can clarify this to other members of their team? Select two responses.

A) COMMENT “This is a silver table”
B) WHEN QUALITY = SILVER THEN PASS
C) EXPECT QUALITY = SILVER
D) TBLPROPERTIES (“quality” = “silver”)
E) None of these answer choices are correct.

Answer

A

A) COMMENT “This is a silver table”
D) TBLPROPERTIES (“quality” = “silver”)

Question 11

Q

A data engineer wants to query metrics on the latest update made to their pipeline. The pipeline has multiple data sources. Despite the input data sources having low data retention, the data engineer needs to retain the results of the query indefinitely.

Which of the following statements identifies the type of table that needs to be used and why? Select one response.

A) Live table; live tables only support reading from “append-only” streaming sources.
B) Live table; live tables retain the results of a query for up to 30 days.
C) Streaming live table; streaming live tables record live metrics on the query.
D) Streaming live table; streaming live tables can preserve data indefinitely.
E) Streaming live table; streaming live tables are always “correct”, meaning their contents will match their definition after any update.

Answer

A

D) Streaming live table; streaming live tables can preserve data indefinitely.

Question 12

Q

Which of the following are guaranteed when processing a change data capture (CDC) feed with APPLY CHANGES INTO? Select three responses.

A) APPLY CHANGES INTO assumes by default that rows will contain inserts and updates.
B) APPLY CHANGES INTO automatically quarantines late-arriving data in a separate table.
C) APPLY CHANGES INTO defaults to creating a Type 1 SCD table.
D) APPLY CHANGES INTO supports insert-only and append-only data.
E) APPLY CHANGES INTO automatically orders late-arriving records using a user-provided sequencing key.

Answer

A

A) APPLY CHANGES INTO assumes by default that rows will contain inserts and updates.
C) APPLY CHANGES INTO defaults to creating a Type 1 SCD table.
E) APPLY CHANGES INTO automatically orders late-arriving records using a user-provided sequencing key.

Question 13

Q

A data engineer needs to add a file path to their DLT pipeline. They want to use the file path throughout the pipeline as a parameter for various statements and functions.

Which of the following options can be specified during the configuration of a DLT pipeline in order to allow this? Select one response.

A) They can add a key-value pair in the Configurations field and then perform a string substitution of the file path.
B) They can add a widget to the notebook and then perform a string substitution of the file path.
C) They can specify the file path in the job scheduler when deploying the pipeline.
D) They can add a parameter when scheduling the pipeline job and then perform a variable substitution of the file path.
E) They can set the variable in a notebook command and then perform a variable substitution of the file path.

Answer

A

A) They can add a key-value pair in the Configurations field and then perform a string substitution of the file path.

Question 14

Q

A data engineer has built and deployed a DLT pipeline. They want to perform an update that writes a batch of data to the output directory.

Which of the following statements about performing this update is true? Select one response.

A) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will not be reported for the current run.
B) All newly arriving data will be continuously processed through their pipeline. Metrics will be reported for the current run if specified during pipeline deployment.
C) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will always be reported for the current run.
D) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will be reported if specified during pipeline deployment.
E) All newly arriving data will be continuously processed through their pipeline. Metrics will always be reported for the current run.

Answer

A

C) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will always be reported for the current run.

Question 15

Q

Which of the following data quality metrics are captured through row_epectations in a pipeline’s event log? Select three responses.

A) Flow progress
B) Dataset
C) Failed records
D) Update ID
E) Name

Answer

A

B) Dataset
C) Failed records
E) Name

Question 16

Q

A data engineer needs to review the events related to their pipeline and the pipeline’s configurations.

Which of the following approaches can the data engineer take to accomplish this? Select one response.

A) The data engineer can query events of type user_action from the checkpoint directory.
B) The data engineer can query events of type user_action from the configured storage location.
C) The data engineer can query events of type user_action from the event log.
D) The data engineer can select events of type user_action in the output table of the pipeline.
E) The data engineer can select events of type user_action in the resultant DAG.

Answer

A

C) The data engineer can query events of type user_action from the event log.

Question 17

Q

A data engineer has configured and deployed a DLT pipeline that contains an error. The error is thrown at the third stage of the pipeline, but since DLT resolves the order of tables in the pipeline at different steps, they are not sure if the first stage succeeded.

Which of the following is considered a good practice to determine this? Select one response.

A) The data engineer can fix the tables from the Directed Acyclic Graph (DAG), starting at the dataset containing the error.
B) The data engineer can fix one table at a time, starting at their earliest dataset.
C) The data engineer can fix the tables using iterative logic, starting at their earliest dataset.
D) The data engineer can fix the tables using iterative logic, starting at the dataset containing the error.
E) The data engineer can fix the tables from the Directed Acyclic Graph (DAG), starting at their earliest dataset.

Answer

A

B) The data engineer can fix one table at a time, starting at their earliest dataset.

Question 18

Q

A data engineer has a Delta Live Tables (DLT) pipeline that uses a change data capture (CDC) data source. They need to write a quality enforcement rule that ensures that records containing the values INSERT or UPDATE in the operation column cannot contain a null value in the name column. The operation column can contain one of three values: INSERT, UPDATE, and DELETE. If the constraint is violated, then the entire transaction needs to fail.

Which of the following constraints can the data engineer use to enforce this rule? Select one response.

A) CONSTRAINT valid_operation EXPECT (valid_operation IS NOT NULL and valid_operation = “INSERT”) ON VIOLATION DROP ROW
B) CONSTRAINT valid_name EXPECT (name IS NOT NULL or operation = “DELETE”) ON VIOLATION DROP ROW
C) CONSTRAINT valid_id_not_null EXPECT (valid_id IS NOT NULL or operation = “INSERT”)ON VIOLATION FAIL UPDATE
D) CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW
E) CONSTRAINT valid_name EXPECT (name IS NOT NULL or operation = “DELETE”) ON VIOLATION FAIL UPDATE

Answer

A

E) CONSTRAINT valid_name EXPECT (name IS NOT NULL or operation = “DELETE”) ON VIOLATION FAIL UPDATE

Question 19

Q

A data engineer is running a Delta Live Tables (DLT) notebook. They notice that several commands display the following message:

This Delta Live Tables query is syntactically valid, but you must create a pipeline in order to define and populate your table.

Which of the following statements explains this message? Select one response.

A) DLT queries must be connected to a pipeline using the pipeline scheduler.
B) DLT does not support the execution of Python commands.
C) DLT notebooks must be run at scheduled intervals using the job scheduler.
D) DLT is not intended for interactive execution in a notebook.
E) DLT does not support the execution of Python and SQL notebooks within a single pipeline.

Answer

A

D) DLT is not intended for interactive execution in a notebook.

Question 20

Q

A data engineer needs to ensure the table updated_history, which is derived from the table history, contains all records from history. Each record in both tables contains a value for the column user_id.

Which of the following approaches can the data engineer use to create a new data object from updated_history and history containing the records with matching user_id values in both tables? Select one response.

A) The data engineer can create a new temporary view by querying the history and updated_history tables.
B) The data engineer can create a new view by joining the history and updated_history tables.
C) The data engineer can merge the history and updated_history tables on user_id.
D) The data engineer can create a new common table expression from the history table that queries the updated_history table.
E) The data engineer can create a new dynamic view by querying the history and updated_history tables.
Score: 0.00

Answer

A

B) The data engineer can create a new view by joining the history and updated_history tables.

Question 21

Q

Which of the following correctly describes how Auto Loader ingests data? Select one response.

A) Auto Loader automatically detects new data files during manual or scheduled updates.
B) Auto Loader writes new data files in incremental batches.
D) Auto Loader only detects new data files during scheduled update intervals.
D) Auto Loader incrementally ingests new data files in batches.
E) Auto Loader automatically writes new data files continuously as they land.

Answer

A

D) Auto Loader incrementally ingests new data files in batches.

Question 22

Q

Which of the following statements accurately describes the difference in behavior between live views and live tables? Select one response.

A) Live tables can be used to enforce data quality, while views do not have the same guarantees in schema enforcement.
B) Metrics for live tables can be collected and reported, while data quality metrics for views are abstracted to the user.
C) The results of live tables are stored to disk, while the results of views can only be referenced from within the DLT pipeline in which they are defined.
D) Live tables can be used with a stream as its source, while live views are incompatible with structured streaming.
E) The results of live tables can be viewed through a Directed Acyclic Graph (DAG), while the results for live views cannot.

Answer

A

C) The results of live tables are stored to disk, while the results of views can only be referenced from within the DLT pipeline in which they are defined.

Question 23

Q

A data engineer is configuring a new DLT pipeline and is unsure what mode to choose. They are working with a small batch of unchanging data and need to minimize the costs associated with the pipeline.

Which of the following modes do they need to use and why? Select one response

A) Triggered; triggered pipelines update once and cannot be updated again for 24 hours.
B) Continuous; continuous pipelines run at set intervals and then shut down until the next manual or scheduled update.
C) Triggered; triggered pipelines run once and then shut down until the next manual or scheduled update.
D) Continuous; continuous pipelines ingest new data as it arrives.
E) Triggered; triggered pipelines update once and cannot be updated again until they are manually run.

Answer

A

C) Triggered; triggered pipelines run once and then shut down until the next manual or scheduled update.

Question 24

Q

A data engineer needs to identify the cloud provider and region of origin for each event within their DLT pipeline.

Which of the following approaches allows the data engineer to view this information? Select one response.

A) The data engineer can view this information in the Task Details page for each task in the pipeline.
B) The data engineer can view the event details for the pipeline from the resultant Directed Acyclic Graph (DAG).
C) The data engineer can use a utility command in Python to list information about each update made to a particular data object.
D) The data engineer can use a SELECT query to directly query the cloud_details field of the event.
E) The data engineer can load the contents of the event log into a view and display the view.

Answer

A

E) The data engineer can load the contents of the event log into a view and display the view.

Question 25

Q

A data engineer wants to query metrics on the latest update made to their pipeline. They need to be able to see the event type and timestamp for each update.

Which of the following approaches allows the data engineer to complete this task? Select one response.

A) The data engineer can query the update ID from the events log where the event type is create_update.
B) The data engineer can query the update ID from the events log where the action type is user_action.
C) The data engineer can view the update ID from the Pipeline Details page in the user_action table.
D) The data engineer can view the update ID from the Pipeline Details page in the create_update table.
E) The data engineer can query the update ID from the events log where the event type is last_update.

Answer

A

A) The data engineer can query the update ID from the events log where the event type is create_update.