Delta Live Tables Flashcards
What are some of the benefits of using Delta Live Tables?
- Easy to build and manage reliable data pipelines, delivering high-quality data on Delta Lake
- Helps engineering teams simplify ETL development and management with:
- Declarative pipeline development
- Automatic data testing
- Deep visibility for monitoring and recovery
What are some of the components necessary to create a new DLT pipeline?
- Pipeline Name
- Pipeline mode (triggered, continuous)
- Paths to notebooks to be run
- Storage Location (where pipeline logs and tables are stored)
- Target schema name
- Cluster mode (fixed size, autoscaling)
- Workers
- Configuration: key-value parameters to be used by the pipeline (referenced by Notebook code)
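A minimal sketch of how a configuration parameter is referenced from notebook SQL, assuming a hypothetical key my_param defined in the pipeline's Configuration settings (DLT substitutes ${key} before the query runs):
CREATE OR REFRESH LIVE TABLE config_check
-- "${my_param}" is replaced by the configured value at pipeline runtime (hypothetical key)
AS SELECT "${my_param}" AS pipeline_parameter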
Scenarios for using DLT pipelines
- Regulatory reasons: capture audit information about tables created (lineage). Be able to track data through its transformations
- Simplify data pipeline deployment and testing: same code can be used on different data through different environments (dev, staging, prod)
- Build and run batch and streaming pipelines in one place
Difference between triggered and continuous pipelines
- Triggered pipelines run once and then shut down until the next manual or scheduled update
- Continuous pipelines run continuously, ingesting new data as it arrives
Explain the Notebooks, Storage Location and Target parameters resources for a DLT pipeline
- Notebooks: standard Databricks Notebooks are used; however, the SQL syntax is specialized for DLT tables, and the notebooks can only be run by pipelines
- Storage Location: allows the user to specify where to store logs, tables and other information related to the pipeline execution. If not specified, DLT will automatically create a directory
- Target: if not specified, tables will not be registered to a metastore, but will still be available in DBFS
What is Auto Loader?
- Incrementally and efficiently processes new data files as they arrive in cloud storage, without any additional setup
- Uses checkpoints to automatically reprocess data that was not loaded
- Supports schema evolution and schema inference
- Supports two file detection modes
- Directory listing
- File notification
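A hedged sketch of selecting the file notification mode via the cloudFiles.useNotifications option (table name and path are illustrative):
CREATE OR REFRESH STREAMING TABLE invoices_bronze
AS SELECT *
-- cloudFiles.useNotifications = "true" switches Auto Loader from the default directory listing mode to file notification mode
FROM cloud_files("${source}/invoices", "json", map("cloudFiles.useNotifications", "true"))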
What are the use cases of Auto Loader?
- No file state management:
- The source incrementally processes files as they land in cloud storage
- No need to manage any state information about which files have arrived
- Scalable:
- Source will efficiently track new files
- Easy to use
- Automatically set up notification and message queue services required for incrementally processing files
Other features of Auto Loader
- The schema of incoming data can be inferred and stored
- Checkpoints can be used to recover from failures by storing the offset of the most recently processed byte
- Schema evolution options allow the stream to stop if incoming data fails schema validation, and control whether or not the schema is evolved (see the option sketch after the sample below)
- When data is ingested from JSON without a schema provided or inferred, all fields are stored as strings
Sample Auto Loader syntax (SQL)
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT current_timestamp() processing_time, input_file_name() source_file, *
FROM cloud_files("${source}/orders", "json", map("cloudFiles.inferColumnTypes", "true"))
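A hedged variant of the sample above illustrating the schema evolution options mentioned earlier; cloudFiles.schemaEvolutionMode is an Auto Loader option, and the table name here is illustrative:
CREATE OR REFRESH STREAMING TABLE orders_bronze_rescued
AS SELECT current_timestamp() processing_time, *
-- "rescue" keeps the declared schema and routes unexpected fields to _rescued_data;
-- "failOnNewColumns" would instead stop the stream when the incoming schema changes
FROM cloud_files("${source}/orders", "json", map("cloudFiles.schemaEvolutionMode", "rescue"))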
What are the two types of persistent tables that can be created with DLT?
Materialised Views/Live Tables: the current result of any query, recomputed with each refresh
Streaming Tables: designed for incremental, near-real-time data processing; support reading from append-only streaming sources and read each input record only once
For example (materialised view / live table):
CREATE OR REFRESH LIVE TABLE orders_by_date
AS SELECT date(order_timestamp) AS order_date, count(*) AS total_daily_orders
FROM LIVE.orders_silver
GROUP BY date(order_timestamp)
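For comparison, a minimal streaming table sketch that reads incrementally from the orders_bronze table defined earlier (the selected column names are assumptions for illustration):
CREATE OR REFRESH STREAMING TABLE orders_silver
-- STREAM(LIVE.x) reads another pipeline dataset incrementally; each record is processed once
AS SELECT order_id, order_timestamp, customer_id
FROM STREAM(LIVE.orders_bronze)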
Three expectation (constraint) violation behaviours on Delta Live Tables
- Retain invalid records (default; violations are recorded in metrics):
CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
- Drop invalid records:
CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL AND current_page_title IS NOT NULL) ON VIOLATION DROP ROW
- Fail/halt execution on invalid records:
CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE
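A hedged sketch of where a constraint clause sits in a full table definition (table and column names are illustrative):
CREATE OR REFRESH STREAMING TABLE orders_clean
-- constraints are declared in parentheses between the table name and AS SELECT;
-- multiple constraints are comma-separated
(CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW)
AS SELECT *
FROM STREAM(LIVE.orders_bronze)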
What is CDC (Change Data Capture) data?
- CDC is a set of techniques used to identify and capture changes made to data in a database or data source, such as inserts, updates, and deletes, and then propagate those changes to other systems or downstream processes.
- The main idea behind CDC is to efficiently track and capture changes without having to reprocess the entire dataset, thus reducing overhead and improving performance.
How can APPLY CHANGES INTO be used for CDC?
- Performs incremental/streaming ingestion of CDC data
- Specify a primary key for the table
- Assumes rows contain only inserts and updates by default; deletes can optionally be applied with APPLY AS DELETE WHEN
- Can ignore certain columns with EXCEPT
For example:
CREATE OR REFRESH STREAMING TABLE customers_silver;
APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_bronze_clean)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY timestamp
COLUMNS * EXCEPT (operation, source_file, _rescued_data)
What is the event log and how can it be queried?
A managed Delta Lake table that records each pipeline run, with important fields stored as nested JSON
event_log_path = f"{DA.paths.storage_location}/system/events"
event_log = spark.read.format("delta").load(event_log_path)
event_log.createOrReplaceTempView("event_log_raw")
display(event_log)
How can audit logging be performed on the event log?
The JSON stored in each event's details field can be queried; for user actions, the action and user name are available under user_action:action and user_action:user_name
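A hedged SQL sketch, assuming the event_log_raw temp view from the previous card and that user actions are recorded in the details column of events with event_type = 'user_action':
SELECT timestamp,
       details:user_action:action,
       details:user_action:user_name
FROM event_log_raw
WHERE event_type = 'user_action'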