Delta Live Tables Flashcards
What are some of the benefits of using Delta Live Tables?
- Easy to build and manage reliable data pipelines, delivering high-quality data on Delta Lake
- Helps engineering teams simplify ETL development and management with:
  - Declarative pipeline development
  - Automatic data testing
  - Deep visibility for monitoring and recovery
What are some of the components necessary to create a new DLT pipeline?
- Pipeline Name
- Pipeline mode (triggered, continuous)
- Paths to notebooks to be run
- Storage location (where pipeline logs, tables and other execution data are stored)
- Target schema name
- Cluster mode (fixed size, autoscaling)
- Workers
- Configuration: key-value parameters to be used by the pipeline, referenced by notebook code (see the sketch below)
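For example, a minimal sketch of how notebook code might read one of these configuration values, assuming a key named "source" (the same key used in the Auto Loader sample later in this deck) and the DLT Python API:

import dlt

# "source" is a key set under Configuration in the pipeline settings
source_path = spark.conf.get("source")

@dlt.table(comment="Raw orders read from the configured source path")
def orders_raw():
    # The configured value is used like any other variable in the notebook
    return spark.read.format("json").load(f"{source_path}/orders")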
Scenarios for using DLT pipelines
- Regulatory reasons: capture audit information about tables created (lineage). Be able to track data through its transformations
- Simplify data pipeline deployment and testing: same code can be used on different data through different environments (dev, staging, prod)
- Build and run batch and streaming pipelines in one place
Difference between triggered and continuous pipelines
Triggered pipelines run once and then shut down until the next manual or scheduled update
Continuous pipelines run continuously, ingesting new data as it arrives
Explain the Notebooks, Storage Location and Target parameters resources for a DLT pipeline
- Notebooks: standard Databricks notebooks are used; however, the SQL syntax is specialized to DLT tables, and they can only be run by pipelines
- Storage Location: allows the user to specify where to store logs, tables and other information related to the pipeline execution. If not specified, DLT will automatically create a directory
- Target: if not specified, tables will not be registered to a metastore, but will still be available in the DBFS
What is Auto Loader?
- Incrementally and efficiently processes new data files as they arrive in cloud storage, without any additional setup
- Uses checkpoints to automatically reprocess data that was not loaded
- Supports schema inference and schema evolution
- Supports two file detection modes (see the sketch below):
  - Directory listing
  - File notification
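A hedged sketch of the two detection modes outside DLT, with placeholder paths; cloudFiles.useNotifications selects file notification mode, while omitting it (or setting it to "false") keeps the default directory listing mode:

# Placeholder paths; schemaLocation is required when the schema is inferred
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/path/to/schemas/orders")
      .option("cloudFiles.useNotifications", "true")  # "false" or omitted = directory listing
      .load("/path/to/landing/orders"))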
What are the use cases of Auto Loader?
- No file state management:
  - The source incrementally processes files as they land in cloud storage
  - No need to manage any state information about which files have arrived
- Scalable:
  - The source efficiently tracks newly arriving files
- Easy to use:
  - Automatically sets up the notification and message queue services required for incremental file processing
Other features of Auto Loader
- The schema of incoming data can be inferred and stored
- Checkpoints can be used to recover from failures by storing the position of the most recently processed data
- Schema evolution options allow the stream to stop if incoming data fails schema validation, and control whether or not the schema evolves (see the sketch below)
- When data is ingested from JSON without a schema being provided or inferred, all fields are stored as strings
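A hedged sketch of these options with placeholder paths: cloudFiles.schemaLocation persists the inferred schema, cloudFiles.schemaEvolutionMode controls how new columns are handled, and the writeStream checkpoint is what enables recovery after a failure:

stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/path/to/schemas/orders")
          .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # stop on new columns, add them, pick them up on restart
          .load("/path/to/landing/orders"))

(stream.writeStream
       .option("checkpointLocation", "/path/to/checkpoints/orders")  # enables failure recovery
       .trigger(availableNow=True)
       .toTable("orders_bronze"))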
Sample Auto Loader syntax (SQL, Python)
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT current_timestamp() processing_time, input_file_name() source_file, *
FROM cloud_files("${source}/orders", "json", map("cloudFiles.inferColumnTypes", "true"))
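A rough Python equivalent of the SQL above, assuming the same "source" configuration key and the DLT Python API:

import dlt
from pyspark.sql import functions as F

source = spark.conf.get("source")

@dlt.table(name="orders_bronze")
def orders_bronze():
    # Incremental ingestion with Auto Loader, adding processing metadata columns
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.inferColumnTypes", "true")
            .load(f"{source}/orders")
            .select(F.current_timestamp().alias("processing_time"),
                    F.input_file_name().alias("source_file"),
                    "*"))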
What are the two types of persistent tables that can be created with DLT?
Materialised Views/Live Tables: the current result of any query, recomputed with each refresh
Streaming Tables: designed for incremental, near-real-time data processing. Support reading from "append-only" streaming sources; each input record is read only once
CREATE OR REFRESH LIVE TABLE orders_by_date
AS SELECT date(order_timestamp) AS order_date, count(*) AS total_daily_orders
FROM LIVE.orders_silver
GROUP BY date(order_timestamp)
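For comparison, a minimal Python sketch of the same materialized view; dlt.read() is the Python counterpart of the LIVE. prefix in SQL:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="orders_by_date")
def orders_by_date():
    # Aggregate daily order counts from the upstream silver table
    return (dlt.read("orders_silver")
            .groupBy(F.to_date("order_timestamp").alias("order_date"))
            .agg(F.count("*").alias("total_daily_orders")))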
Three expectation modes for constraints on Delta Live Tables
- Retain invalid records (default):
  CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
- Drop invalid records:
  CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL AND current_page_title IS NOT NULL) ON VIOLATION DROP ROW
- Fail/halt execution on invalid records:
  CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE
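In the Python API the same three behaviours are expressed as decorators; a minimal sketch, where events_raw is a placeholder upstream streaming table:

import dlt

@dlt.table
@dlt.expect("valid_timestamp", "timestamp > '2012-01-01'")  # retain invalid records
@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")  # drop invalid records
@dlt.expect_or_fail("valid_count", "count > 0")  # fail the update on invalid records
def events_clean():
    return dlt.read_stream("events_raw")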
What is CDC (Change Data Capture) data?
- CDC is a set of techniques used to identify and capture changes made to data in a database or data source, such as inserts, updates, and deletes, and then propagate those changes to other systems or downstream processes.
- The main idea behind CDC is to efficiently track and capture changes without having to reprocess the entire dataset, thus reducing overhead and improving performance.
How can APPLY CHANGES INTO be used for CDC?
- Performs incremental/streaming ingestion of CDC data
- Specify a primary key for the table
- Assumes rows contain only inserts and updates; deletes can optionally be applied with APPLY AS DELETE WHEN
- Can ignore certain columns with EXCEPT
For example:
CREATE OR REFRESH STREAMING TABLE customers_silver;
APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_bronze_clean)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY timestamp
COLUMNS * EXCEPT (operation, source_file, _rescued_data)
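A hedged Python sketch of the same CDC flow; dlt.create_streaming_table creates the target and dlt.apply_changes mirrors the SQL clauses above:

import dlt
from pyspark.sql.functions import col, expr

# Declare the target streaming table, then apply the change feed into it
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze_clean",
    keys=["customer_id"],
    sequence_by=col("timestamp"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "source_file", "_rescued_data"],
)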
What is the events log and how can it be queried?
A managed Delta Lake table, updated on each pipeline run, with important fields stored as nested JSON
event_log_path = f"{DA.paths.storage_location}/system/events"
event_log = spark.read.format("delta").load(event_log_path)
event_log.createOrReplaceTempView("event_log_raw")
display(event_log)
How can audit logging be performed on the event log?
The JSON payload for each run can be queried; the user action and user name are available under user_action:action and user_action:user_name
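A hedged example against the temp view created above, assuming user actions are logged as user_action events with these fields nested under the details column:

display(spark.sql("""
    SELECT timestamp,
           details:user_action:action,
           details:user_action:user_name
    FROM event_log_raw
    WHERE event_type = 'user_action'
"""))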