Data Pipelines with Delta Live Tables and Spark SQL Flashcards
Which of the following correctly describes how code from one notebook library can be referenced by code from another notebook library? Select one response.
A) Within a DLT Pipeline, code in any notebook library can reference tables and views created in any other notebook library.
B) Within a DLT Pipeline, code in a notebook library can reference tables and views created in another notebook library as long as one notebook library references the other notebook library.
C) Within a DLT Pipeline, code in a notebook library can reference tables and views created in another notebook library that is running on the same cluster.
D) Within a DLT Pipeline, code in a notebook library can reference tables and views created in another notebook library as long as the referenced notebook library is installed on the other notebook library’s cluster.
E) Within a DLT Pipeline, code in notebook libraries cannot reference tables and views created in a different notebook library.
A) Within a DLT Pipeline, code in any notebook library can reference tables and views created in any other notebook library.
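For example (a minimal sketch with hypothetical table names and source path): one notebook library in the pipeline can define a table, and any other notebook library in the same pipeline can reference it directly through the LIVE keyword, because all notebook libraries in a pipeline share one DLT schema.

-- Notebook library 1: define a bronze table (path and column names are illustrative)
CREATE OR REFRESH STREAMING LIVE TABLE orders_bronze
AS SELECT * FROM cloud_files("/path/to/orders", "json");

-- Notebook library 2: reference the table defined in notebook library 1
CREATE OR REFRESH STREAMING LIVE TABLE orders_silver
AS SELECT * FROM STREAM(LIVE.orders_bronze)
WHERE order_id IS NOT NULL;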
A data engineer needs to query a Delta Live Table (DLT) in a notebook. The notebook is not attached to a DLT pipeline.
Which of the following correctly describes the form of results that the query returns? Select one response.
A) Live queries outside of DLT will return snapshot results from DLT tables only if they were defined as a batch table.
B) Queries outside of DLT will return snapshot results from DLT tables only if they were defined as a streaming table.
C) Queries outside of DLT will return the most recent version from DLT tables, regardless of how they were defined.
D) Queries outside of DLT will return the most recent version from DLT tables only if they were defined as a streaming table.
E) Queries outside of DLT will return snapshot results from DLT tables, regardless of how they were defined.
E) Queries outside of DLT will return snapshot results from DLT tables, regardless of how they were defined.
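For instance, in a notebook that is not attached to the pipeline, a standard query returns a point-in-time snapshot of the table (the database and table names below are hypothetical):

-- Outside DLT there is no LIVE keyword; query the pipeline's target database directly.
-- The result is a snapshot as of the most recent completed update, not a continuously updating view.
SELECT * FROM sales_db.transactions_silver;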
A data engineer is using the code below to create a new table transactions_silver from the table transactions_bronze. However, when running the code, an error is thrown.
CREATE OR REFRESH STREAMING LIVE TABLE transactions_silver
(CONSTRAINT valid_date EXPECT (order_timestamp > "2022-01-01") ON VIOLATION DROP ROW)
FROM LIVE.transactions_bronze
Which of the following statements correctly identifies the error and the stage at which the error was thrown? Select one response.
A) A SELECT statement needs to be added to create the columns for the transactions_silver table. The error will be detected during the Setting Up Tables stage.
B) LIVE.transactions_bronze needs to be changed to STREAM(LIVE.transactions_bronze). The error will be detected during the Setting Up Tables stage.
C) LIVE.transactions_bronze needs to be changed to STREAM(LIVE.transactions_bronze). The error will be detected during the Initializing stage.
D) The EXPECT statement needs to be changed to EXPECT (order_timestamp is NOT NULL). The error will be detected during the Setting Up Tables stage.
E) A SELECT statement needs to be added to create the columns for the transactions_silver table. The error will be detected during the Initializing stage.
E) A SELECT statement needs to be added to create the columns for the transactions_silver table. The error will be detected during the Initializing stage.
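A corrected version of the statement is sketched below. Per answer E, the missing SELECT clause is the error, caught during the Initializing stage; the STREAM() wrapper is included here because a streaming table conventionally reads its upstream table that way.

CREATE OR REFRESH STREAMING LIVE TABLE transactions_silver
(CONSTRAINT valid_date EXPECT (order_timestamp > "2022-01-01") ON VIOLATION DROP ROW)
AS SELECT *  -- the SELECT clause that was missing from the original statement
FROM STREAM(LIVE.transactions_bronze);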
A data engineer needs to examine how data is flowing through tables within their pipeline.
Which of the following correctly describes how they can accomplish this? Select one response.
A) The data engineer can combine the flow definitions for all of the tables into one query.
B) The data engineer can view the flow definition of each table in the pipeline from the Pipeline Events log.
C) The data engineer can query the flow definition for the direct successor of the table and then combine the results.
D) The data engineer can query the flow definition for each table and then combine the results.
E) The data engineer can query the flow definition for the direct predecessor of each table and then combine the results.
E) The data engineer can query the flow definition for the direct predecessor of each table and then combine the results.
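As a sketch, assuming the pipeline's event log has already been registered as a view named event_log_raw (an assumed name), the flow definition of each table records its direct predecessors, and the per-table results can be combined to trace how data flows through the pipeline:

-- Each flow_definition event lists a dataset and the input datasets it reads from.
SELECT details:flow_definition.output_dataset AS output_dataset,
       details:flow_definition.input_datasets AS input_datasets
FROM event_log_raw
WHERE event_type = 'flow_definition';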
A data engineer has built and deployed a DLT pipeline. They want to see the output for each individual task.
Which of the following describes how to explore the output for each task in the pipeline? Select one response.
A) They can display the output for each individual command from within the notebook using the %run command.
B) They can go to the Pipeline Details page and click on the individual tables in the resultant Directed Acyclic Graph (DAG).
C) They can run the commands connected to each task from within the DLT notebook after deploying the pipeline.
D) They can go to the Job Runs page and click on the individual tables in the job run history.
E) They can specify a folder for the task run details during pipeline configuration.
B) They can go to the Pipeline Details page and click on the individual tables in the resultant Directed Acyclic Graph (DAG).
A data engineer has created the following query to create a streaming live table from transactions.
Code block:
CREATE OR REFRESH STREAMING LIVE TABLE transactions
AS SELECT timestamp(transaction_timestamp) AS transaction_timestamp, * EXCEPT (transaction_timestamp, source)
________________________
Which of the following lines of code correctly fills in the blank? Select two responses.
A) FROM STREAMING LIVE.transactions
B) FROM LIVE.transactions
C) FROM STREAMING LIVE (transactions)
D) FROM DELTA STREAM(LIVE.transactions)
E) FROM STREAM(LIVE.transactions)
B) FROM LIVE.transactions
E) FROM STREAM(LIVE.transactions)
A data engineer has a Delta Live Tables (DLT) pipeline that uses a change data capture (CDC) data source. They need to write a quality enforcement rule that ensures the column operation does not contain null values. If the constraint is violated, the associated records cannot be included in the dataset.
Which of the following constraints does the data engineer need to use to enforce this rule? Select two responses.
A) CONSTRAINT valid_operation EXCEPT (operation) ON VIOLATION DROP ROW
B) CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION FAIL UPDATE
C) CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW
D) CONSTRAINT valid_operation ON VIOLATION FAIL UPDATE
E) CONSTRAINT valid_operation EXCEPT (operation != null) ON VIOLATION FAIL UPDATE
B) CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION FAIL UPDATE
C) CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW
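Both correct forms keep violating records out of the target dataset: DROP ROW silently discards them, while FAIL UPDATE stops the update when a violation occurs. A minimal sketch with hypothetical table names:

-- Drop any record whose operation column is null:
CREATE OR REFRESH STREAMING LIVE TABLE cdc_clean
(CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW)
AS SELECT * FROM STREAM(LIVE.cdc_raw);

-- Alternatively, fail the entire update on any violation:
-- (CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION FAIL UPDATE)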
Which of the following are advantages of using a Delta Live Tables (DLT) pipeline over a traditional ETL pipeline in Databricks? Select two responses.
A) DLT provides granular observability into pipeline operations and automatic error handling.
B) DLT leverages additional metadata over other open source formats such as JSON, CSV, and Parquet.
C) DLT has built-in quality controls and data quality monitoring.
D) DLT decouples compute and storage costs regardless of scale.
E) DLT automates data management through physical data optimizations and schema evolution.
A) DLT provides granular observability into pipeline operations and automatic error handling.
C) DLT has built-in quality controls and data quality monitoring.
Which of the following correctly describes how to access contents of the table directory? Select one response.
A) The contents of the table directory can be viewed through the event log.
B) The contents of the table directory can be viewed through the Auto Loader directory.
C) The contents of the table directory can be viewed through the flow definition’s output dataset.
D) The contents of the table directory can be viewed through the checkpointing directory.
E) The contents of the table directory can be viewed through the metastore.
E) The contents of the table directory can be viewed through the metastore.
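For example, after the pipeline has run, the table's storage location can be looked up through the metastore from any notebook (database and table names are hypothetical); the Location field points at the table directory inside the pipeline's storage location:

DESCRIBE EXTENDED sales_db.transactions_silver;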
A data engineer is creating a streaming live table to be used by other members of their team. They want to indicate that the table contains silver quality data.
Which of the following describes how the data engineer can clarify this to other members of their team? Select two responses.
A) COMMENT "This is a silver table"
B) WHEN QUALITY = SILVER THEN PASS
C) EXPECT QUALITY = SILVER
D) TBLPROPERTIES ("quality" = "silver")
E) None of these answer choices are correct.
A) COMMENT "This is a silver table"
D) TBLPROPERTIES ("quality" = "silver")
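Both can be declared together when the table is defined, for example (a sketch with hypothetical table names):

CREATE OR REFRESH STREAMING LIVE TABLE transactions_silver
COMMENT "This is a silver table"
TBLPROPERTIES ("quality" = "silver")
AS SELECT * FROM STREAM(LIVE.transactions_bronze);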
A data engineer wants to query metrics on the latest update made to their pipeline. The pipeline has multiple data sources. Despite the input data sources having low data retention, the data engineer needs to retain the results of the query indefinitely.
Which of the following statements identifies the type of table that needs to be used and why? Select one response.
A) Live table; live tables only support reading from “append-only” streaming sources.
B) Live table; live tables retain the results of a query for up to 30 days.
C) Streaming live table; streaming live tables record live metrics on the query.
D) Streaming live table; streaming live tables can preserve data indefinitely.
E) Streaming live table; streaming live tables are always “correct”, meaning their contents will match their definition after any update.
D) Streaming live table; streaming live tables can preserve data indefinitely.
Which of the following are guaranteed when processing a change data capture (CDC) feed with APPLY CHANGES INTO? Select three responses.
A) APPLY CHANGES INTO assumes by default that rows will contain inserts and updates.
B) APPLY CHANGES INTO automatically quarantines late-arriving data in a separate table.
C) APPLY CHANGES INTO defaults to creating a Type 1 SCD table.
D) APPLY CHANGES INTO supports insert-only and append-only data.
E) APPLY CHANGES INTO automatically orders late-arriving records using a user-provided sequencing key.
A) APPLY CHANGES INTO assumes by default that rows will contain inserts and updates.
C) APPLY CHANGES INTO defaults to creating a Type 1 SCD table.
E) APPLY CHANGES INTO automatically orders late-arriving records using a user-provided sequencing key.
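A representative statement is sketched below (table, key, and column names are hypothetical). It shows the user-provided sequencing key that orders late-arriving records and relies on the default behavior of creating a Type 1 SCD table.

-- The target table must be declared before APPLY CHANGES INTO can write to it.
CREATE OR REFRESH STREAMING LIVE TABLE customers_silver;

APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_cdc_clean)
KEYS (customer_id)                           -- key used to match incoming change records
APPLY AS DELETE WHEN operation = "DELETE"    -- treat these change events as deletes
SEQUENCE BY sequence_num                     -- sequencing key that orders late-arriving records
COLUMNS * EXCEPT (operation, sequence_num);  -- exclude CDC metadata; defaults to a Type 1 SCD table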
A data engineer needs to add a file path to their DLT pipeline. They want to use the file path throughout the pipeline as a parameter for various statements and functions.
Which of the following options can be specified during the configuration of a DLT pipeline in order to allow this? Select one response.
A) They can add a key-value pair in the Configurations field and then perform a string substitution of the file path.
B) They can add a widget to the notebook and then perform a string substitution of the file path.
C) They can specify the file path in the job scheduler when deploying the pipeline.
D) They can add a parameter when scheduling the pipeline job and then perform a variable substitution of the file path.
E) They can set the variable in a notebook command and then perform a variable substitution of the file path.
A) They can add a key-value pair in the Configurations field and then perform a string substitution of the file path.
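For example, if a hypothetical key named source is added in the Configurations field of the pipeline settings (say, source = /mnt/raw/orders), it can be substituted as a string throughout the pipeline's notebooks:

-- ${source} is replaced at runtime with the value set in the pipeline's Configurations field.
CREATE OR REFRESH STREAMING LIVE TABLE orders_bronze
AS SELECT * FROM cloud_files("${source}", "json");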
A data engineer has built and deployed a DLT pipeline. They want to perform an update that writes a batch of data to the output directory.
Which of the following statements about performing this update is true? Select one response.
A) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will not be reported for the current run.
B) All newly arriving data will be continuously processed through their pipeline. Metrics will be reported for the current run if specified during pipeline deployment.
C) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will always be reported for the current run.
D) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will be reported if specified during pipeline deployment.
E) All newly arriving data will be continuously processed through their pipeline. Metrics will always be reported for the current run.
C) With each triggered update, all newly arriving data will be processed through their pipeline. Metrics will always be reported for the current run.
Which of the following data quality metrics are captured through row_expectations in a pipeline’s event log? Select three responses.
A) Flow progress
B) Dataset
C) Failed records
D) Update ID
E) Name
B) Dataset
C) Failed records
E) Name
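These metrics come from exploding the expectations array inside flow_progress events. A sketch of such a query, assuming the event log has been registered as a view named event_log_raw; each row_expectations record carries the expectation name, the dataset it applies to, and passed/failed record counts:

SELECT row_expectations.dataset AS dataset,
       row_expectations.name AS expectation,
       SUM(row_expectations.failed_records) AS total_failed_records
FROM (
  SELECT explode(
           from_json(details:flow_progress.data_quality.expectations,
                     "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>")
         ) AS row_expectations
  FROM event_log_raw
  WHERE event_type = 'flow_progress'
)
GROUP BY row_expectations.dataset, row_expectations.name;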