Delta Live Tables Flashcards
What are Delta Live Tables?
Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.
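A minimal Python sketch of this declarative style (the source table name is an assumption; `spark` is supplied by the pipeline runtime):

```python
import dlt
from pyspark.sql.functions import col

# You declare the dataset you want; Delta Live Tables handles orchestration,
# cluster management, monitoring, and error handling around it.
@dlt.table(comment="Trips with a positive distance, kept up to date by the pipeline.")
def cleaned_trips():
    # The source table below is a placeholder for any table the pipeline can read.
    return (
        spark.read.table("samples.nyctaxi.trips")
        .filter(col("trip_distance") > 0)
    )
```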
How do you manage data quality in a DLT pipeline?
You can manage data quality with Delta Live Tables expectations, which allow you to define expected data quality and specify how to handle records that fail those expectations.
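A hedged sketch of expectations in Python; the constraint names, columns, and upstream dataset `raw_events` are illustrative:

```python
import dlt

@dlt.table(comment="Events that satisfy the declared data quality expectations.")
@dlt.expect("valid_timestamp", "event_ts IS NOT NULL")   # warn: record the violation, keep the row
@dlt.expect_or_drop("positive_amount", "amount > 0")     # drop rows that fail this expectation
@dlt.expect_or_fail("valid_id", "id IS NOT NULL")        # fail the update on any violation
def validated_events():
    return dlt.read("raw_events")
```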
What are DLT Datasets?
Delta Live Tables datasets are the streaming tables, materialized views, and views maintained as the results of declarative queries.
What are Streaming Tables?
A streaming table is a Delta table with extra support for streaming or incremental data processing. Streaming tables allow you to process a growing dataset, handling each row only once. Because most datasets grow continuously over time, streaming tables are good for most ingestion workloads. Streaming tables are optimal for pipelines that require data freshness and low latency. Streaming tables can also be useful for massive scale transformations, as results can be incrementally calculated as new data arrives, keeping results up to date without needing to fully recompute all source data with each update. Streaming tables are designed for data sources that are append-only.
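A sketch of a streaming table ingesting files with Auto Loader; the landing path and file format below are assumptions:

```python
import dlt

@dlt.table(comment="Append-only orders ingested incrementally; each row is processed once.")
def orders_raw():
    # Returning a streaming DataFrame makes this a streaming table;
    # Auto Loader ("cloudFiles") discovers newly arrived files incrementally.
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/landing/orders/")  # placeholder path
    )
```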
What are Views?
All views in Databricks compute results from source datasets as they are queried, leveraging caching optimizations when available.
Delta Live Tables does not publish views to the catalog, so views can be referenced only within the pipeline in which they are defined.
Views are useful as intermediate queries that should not be exposed to end users or systems.
Databricks recommends using views to enforce data quality constraints or transform and enrich datasets that drive multiple downstream queries.
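A small sketch of a view used as an intermediate, quality-enforcing step; the upstream dataset and constraint are assumptions:

```python
import dlt

# Views are not published to the catalog, so this dataset is visible
# only to other queries inside the same pipeline.
@dlt.view(comment="Intermediate cleansing step shared by downstream tables.")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
def customers_clean():
    return dlt.read("customers_raw")
```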
What is a DLT Pipeline?
A pipeline is the main unit used to configure and run data processing workflows with Delta Live Tables.
A pipeline contains materialized views and streaming tables declared in Python or SQL source files.
Delta Live Tables infers the dependencies between these tables, ensuring updates occur in the correct order. For each dataset, Delta Live Tables compares the current state with the desired state and proceeds to create or update datasets using efficient processing methods.
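A hedged sketch of dependency inference: because the silver table reads the bronze table with dlt.read, Delta Live Tables updates bronze first (source and column names are illustrative):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events copied from a hypothetical source table.")
def bronze_events():
    return spark.read.table("examples.raw.events")  # placeholder source

@dlt.table(comment="Cleaned events derived from bronze_events.")
def silver_events():
    # dlt.read() creates the edge bronze_events -> silver_events
    # in the pipeline's dependency graph.
    return dlt.read("bronze_events").where(col("event_type").isNotNull())
```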
How are pipelines updated?
Pipelines deploy infrastructure and recompute data state when you start an update. An update does the following:
- Starts a cluster with the correct configuration.
- Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
- Creates or updates tables and views with the most recent data available.
Pipelines can be run continuously or on a schedule depending on your use case’s cost and latency requirements.
Which data sources are supported for DLT?
Delta Live Tables supports all data sources available in Databricks.
Databricks recommends using streaming tables for most ingestion use cases.
For files arriving in cloud object storage, Databricks recommends Auto Loader.
You can directly ingest data with Delta Live Tables from most message buses.
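A sketch of ingesting directly from a message bus, using Kafka as an example; the broker address and topic are placeholders:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw records read directly from a Kafka topic into a streaming table.")
def kafka_events_raw():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder broker
        .option("subscribe", "events")                        # placeholder topic
        .option("startingOffsets", "earliest")
        .load()
        .select(
            col("key").cast("string"),
            col("value").cast("string"),
            col("timestamp"),
        )
    )
```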
What are the limitations of DLT?
- All tables created and updated by Delta Live Tables are Delta tables.
- Delta Live Tables tables can only be defined once, meaning they can only be the target of a single operation in all Delta Live Tables pipelines.
- Identity columns are not supported with tables that are the target of APPLY CHANGES INTO and might be recomputed during updates for materialized views. For this reason, Databricks recommends using identity columns in Delta Live Tables only with streaming tables.
- A Databricks workspace is limited to 100 concurrent pipeline updates.
What are the different product editions of a DLT pipeline?
Select the Delta Live Tables product edition with the best features for your pipeline requirements. The following product editions are available:
- **Core** to run streaming ingest workloads. Select the Core edition if your pipeline doesn’t require advanced features such as change data capture (CDC) or Delta Live Tables expectations.
- **Pro** to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data (see the CDC sketch after the note below).
- **Advanced** to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions and includes data quality constraints with Delta Live Tables expectations.
Note: If your pipeline includes features not supported by the selected product edition, such as expectations, you will receive an error message explaining the reason for the error. You can then edit the pipeline to select the appropriate edition.
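For the CDC workloads mentioned in the Pro and Advanced editions, changes are typically applied with the APPLY CHANGES API; a minimal Python sketch, where the source change feed, key, and sequencing column are assumptions:

```python
import dlt
from pyspark.sql.functions import col

# Target streaming table that APPLY CHANGES keeps in sync with the change feed.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc",          # hypothetical change-feed dataset in this pipeline
    keys=["customer_id"],            # key used to match change records to target rows
    sequence_by=col("sequence_num"), # ordering column to resolve out-of-order changes
    stored_as_scd_type=1,            # keep only the latest version of each row
)
```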
What are the benefits of serverless pipelines besides the ease of configuration?
In addition to simplifying configuration, serverless pipelines have the following features:
- Incremental refresh for materialized views: updates for materialized views are refreshed incrementally whenever possible. Incremental refresh has the same results as full recomputation. The update uses a full refresh if results cannot be computed incrementally. See Incremental refresh for materialized views.
- Stream pipelining: to improve utilization, throughput, and latency for streaming data workloads such as data ingestion, microbatches are pipelined. In other words, instead of running microbatches sequentially like standard Spark Structured Streaming, serverless DLT pipelines run microbatches concurrently, improving compute resource utilization. Stream pipelining is enabled by default in serverless DLT pipelines.
- Vertical autoscaling: serverless DLT pipelines add to the horizontal autoscaling provided by Databricks enhanced autoscaling by automatically allocating the most cost-efficient instance types that can run your Delta Live Tables pipeline without failing because of out-of-memory errors.
What is Enhanced Autoscaling and what parameters does it use to add or remove nodes?
Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.
Enhanced autoscaling uses two metrics to decide on scaling up or scaling down:
- Task slot utilization: the average ratio of the number of busy task slots to the total task slots available in the cluster.
- Task queue size: the number of tasks waiting to be executed in task slots.
In DLT Pipelines with Unity Catalog support, who manages the lifecycle of the tables created?
When Delta Live Tables is configured to persist data to Unity Catalog, the lifecycle of the table is managed by the Delta Live Tables pipeline. Because the pipeline manages the table lifecycle and permissions:
When a table is removed from the Delta Live Tables pipeline definition, the corresponding materialized view or streaming table entry is removed from Unity Catalog on the next pipeline update. The actual data is retained for a period so that it can be recovered if deleted by mistake. The data can be recovered by adding the materialized view or streaming table back into the pipeline definition.
Deleting the Delta Live Tables pipeline results in deleting all tables defined in that pipeline. Because of this, the Delta Live Tables UI prompts you to confirm before you delete a pipeline.
Internal backing tables, including those used to support APPLY CHANGES INTO, are not directly accessible by users.