Delta Live Tables Flashcards

1
Q

What are some of the benefits of using Delta Live Tables?

A
  • Easy to build and manage reliable data pipelines, delivering high-quality data on Delta Lake
  • Helps engineering teams simplify ETL development and management with:
    • Declarative pipeline development
    • Automatic data testing
    • Deep visibility for monitoring and recovery
2
Q

What are some of the components necessary to create a new DLT pipeline?

A
  • Pipeline Name
  • Pipeline mode (triggered, continuous)
  • Paths to notebooks to be run
  • Storage Location (where pipeline logs, tables and other metadata are stored)
  • Target schema name
  • Cluster mode (fixed size, autoscaling)
  • Workers
  • Configuration: key-value parameters to be used by the pipeline and referenced by Notebook code (see the sketch after this list)
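For orientation, a minimal sketch of how these components map onto a pipeline settings object, written here as a Python dict whose key names approximate the pipeline's JSON spec. The name, paths and values are illustrative, not taken from the cards:

# Illustrative pipeline settings; key names approximate the DLT pipeline JSON spec.
pipeline_settings = {
    "name": "orders_dlt_pipeline",                        # Pipeline Name
    "continuous": False,                                  # Pipeline mode: False = triggered, True = continuous
    "libraries": [{"notebook": {"path": "/Repos/etl/orders_dlt"}}],  # Paths to notebooks to be run
    "storage": "dbfs:/pipelines/orders",                  # Storage Location
    "target": "orders_prod",                              # Target schema name
    "clusters": [{
        "label": "default",
        "autoscale": {"min_workers": 1, "max_workers": 4} # Cluster mode and workers
    }],
    "configuration": {"source": "dbfs:/mnt/landing"},     # Key-value parameters read by notebook code
}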
3
Q

Scenarios for using DLT pipelines

A
  • Regulatory reasons: capture audit information about tables created (lineage). Be able to track data through its transformations
  • Simplify data pipeline deployment and testing: same code can be used on different data through different environments (dev, staging, prod)
  • Build and run batch and streaming pipelines in one place
4
Q

Difference between triggered and continuous pipelines

A

Triggered pipelines run once and then shut down until the next manual or scheduled update

Continuous pipelines run continuously, ingesting new data as it arrives

5
Q

Explain the Notebooks, Storage Location and Target parameters resources for a DLT pipeline

A
  • Notebooks: standard Databricks Notebooks are used; however, the SQL syntax is specialized to DLT tables, and these notebooks can only be run by pipelines
  • Storage Location: allows the user to specify where to store logs, tables and other information related to the pipeline execution. If not specified, DLT will automatically create a directory
  • Target: if not specified, tables will not be registered to a metastore, but will still be available in the DBFS
6
Q

What is Auto Loader?

A
  • Incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup
  • Automatically reprocesses data that was not loaded, using checkpoints
  • Supports schema evolution and schema inference
  • Supports two file detection modes
    • Directory listing
    • File notification
7
Q

What are the use cases of Auto Loader?

A
  1. No file state management:
    • The source incrementally processes files as they land in cloud storage
    • No need to manage any state information about which files have arrived
  2. Scalable:
    • The source efficiently tracks new files
  3. Easy to use:
    • Automatically sets up the notification and message queue services required for incremental file processing
8
Q

Other features of Auto Loader

A
  • The schema of incoming data can be inferred and stored (see the sketch after this list)
  • Checkpoints can be used to recover from failures by storing the offset of the most recently processed byte
  • Schema evolution options control whether the stream stops when incoming data fails schema validation, and whether the schema evolves or not
  • When data is ingested from JSON without the schema being provided or inferred, all fields are stored as strings
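As a rough illustration of these features outside DLT, a minimal Auto Loader sketch in Python. All paths and the target table name are placeholders; the option names are standard Auto Loader options, but treat the exact combination as an assumption rather than course code:

# Minimal Auto Loader stream: the inferred schema is persisted at schemaLocation,
# and progress is tracked at checkpointLocation so failures can be recovered.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")   # where the inferred schema is stored
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")    # evolve the schema instead of failing
    .load("/tmp/landing/orders")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")      # enables recovery from failures
    .trigger(availableNow=True)
    .toTable("orders_bronze"))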
9
Q

Sample Auto Loader syntax (SQL, Python)

A

CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT current_timestamp() processing_time, input_file_name() source_file, *
FROM cloud_files("${source}/orders", "json", map("cloudFiles.inferColumnTypes", "true"))
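
A Python sketch of the same ingestion, assuming a DLT Python notebook where dlt and spark are available and the source key is set in the pipeline configuration. It mirrors the SQL above rather than reproducing any exact course code:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="orders_bronze")
def orders_bronze():
    source = spark.conf.get("source")                      # pipeline configuration parameter
    return (
        spark.readStream.format("cloudFiles")              # Auto Loader source
            .option("cloudFiles.format", "json")
            .option("cloudFiles.inferColumnTypes", "true")
            .load(f"{source}/orders")
            .select(
                F.current_timestamp().alias("processing_time"),
                F.input_file_name().alias("source_file"),
                "*",
            )
    )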

10
Q

What are the two types of persistent tables that can be created with DLT?

A

Materialised Views/Live Tables: materialise the current result of any query with each refresh

Streaming Tables: designed for incremental, near-real-time data processing. Support reading from "append-only" streaming sources; each input record is read only once

CREATE OR REFRESH LIVE TABLE orders_by_date
AS SELECT date(order_timestamp) AS order_date, count(*) AS total_daily_orders
FROM LIVE.orders_silver
GROUP BY date(order_timestamp)
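
The same distinction sketched in Python (hypothetical table names; assumes a DLT Python notebook with dlt available):

import dlt
from pyspark.sql import functions as F

# Materialised view / live table: recomputed from the full query result on each refresh.
@dlt.table(name="orders_by_date")
def orders_by_date():
    return (
        dlt.read("orders_silver")
            .groupBy(F.to_date("order_timestamp").alias("order_date"))
            .agg(F.count("*").alias("total_daily_orders"))
    )

# Streaming table: reads the append-only source incrementally, each record only once.
@dlt.table(name="orders_silver_copy")
def orders_silver_copy():
    return dlt.read_stream("orders_silver")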

11
Q

Three violation behaviours for expectations (constraints) on Delta Live Tables

A
  1. Retain invalid records:
    CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
  2. Drop invalid records:
    CONSTRAINT valid_current_page EXPECT
    (current_page_id IS NOT NULL AND current_page_title IS NOT NULL)
    ON VIOLATION DROP ROW
  3. Fail/halt execution on invalid records (see the Python sketch after this list):
    CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE
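The same three behaviours expressed with the Python expectation decorators. Table and column names are hypothetical, and combining all three on one table is only for illustration:

import dlt

@dlt.table(name="events_clean")
@dlt.expect("valid_timestamp", "timestamp > '2012-01-01'")           # retain invalid records
@dlt.expect_or_drop(
    "valid_current_page",
    "current_page_id IS NOT NULL AND current_page_title IS NOT NULL"
)                                                                    # drop invalid records
@dlt.expect_or_fail("valid_count", "count > 0")                      # fail the update
def events_clean():
    return dlt.read("events_raw")                                    # hypothetical upstream table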
12
Q

What is CDC (Change Data Capture) data?

A
  • CDC is a set of techniques used to identify and capture changes made to data in a database or data source, such as inserts, updates, and deletes, and then propagate those changes to other systems or downstream processes.
  • The main idea behind CDC is to efficiently track and capture changes without having to reprocess the entire dataset, thus reducing overhead and improving performance.
13
Q

How can APPLY CHANGES INTO be used for CDC?

A
  • Performs incremental/streaming ingestion of CDC data
  • Specify a primary key for the table
  • Assumes by default that rows contain only inserts and updates, but deletes can optionally be applied
  • Can ignore certain columns with EXCEPT
    For example:

CREATE OR REFRESH STREAMING TABLE customers_silver;

APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_bronze_clean)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY timestamp
COLUMNS * EXCEPT (operation, source_file, _rescued_data)
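
The Python equivalent uses dlt.apply_changes(). A sketch assuming a DLT Python notebook; the parameter names follow the documented API, but verify against your runtime:

import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze_clean",
    keys=["customer_id"],                                  # primary key
    sequence_by=col("timestamp"),                          # ordering column for the changes
    apply_as_deletes=expr("operation = 'DELETE'"),         # optionally apply deletes
    except_column_list=["operation", "source_file", "_rescued_data"],  # columns to ignore
)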

14
Q

What is the event log and how can it be queried?

A

A managed Delta Lake table, written to on each pipeline run, with important fields stored as nested JSON

event_log_path = f"{DA.paths.storage_location}/system/events"

event_log = spark.read.format('delta').load(event_log_path)
event_log.createOrReplaceTempView("event_log_raw")

display(event_log)

15
Q

How can audit logging be performed on the event log?

A

The JSON for each run can be queried, with the user action and user name available under user_action:action and user_action:user_name
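
A sketch of such a query against the event_log_raw view registered on the previous card, assuming the user_action fields sit under the event log's details column (field paths may differ slightly between runtime versions):

display(spark.sql("""
    SELECT timestamp, details:user_action:action, details:user_action:user_name
    FROM event_log_raw
    WHERE event_type = 'user_action'
"""))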

16
Q

How can data lineage be examined from the event log?

A

The JSON for each run can be queried for "flow_definition", returning the "output_dataset" and "input_dataset" fields
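
A sketch of a lineage query over the same event_log_raw view, assuming the flow_definition fields sit under the details column (the input field typically appears as input_datasets, plural, in the event log):

display(spark.sql("""
    SELECT details:flow_definition.output_dataset, details:flow_definition.input_datasets
    FROM event_log_raw
    WHERE event_type = 'flow_definition'
"""))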

17
Q

How are existing Delta Live/Streaming Tables queried?

A

Within another DLT query, reference the table through the LIVE schema:

AS SELECT *
FROM LIVE.customers_bronze

or, to read it incrementally as a stream:

AS SELECT *
FROM STREAM(LIVE.customers_bronze)