Delta Lake Flashcards
What is Delta Lake?
- Open storage format
- Optimised storage layer that provides the foundation for storing data and tables in Databricks Lakehouse Platform
- Open source software that extends Parquet data files with:
  - A file-based transaction log for ACID transactions
  - Scalable metadata handling
- Default storage format for all operations on Databricks
- All tables on Databricks are Delta tables unless otherwise specified
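A minimal sketch (the table name flashcards_demo is illustrative, not from the source) showing that a plain CREATE TABLE defaults to Delta; DESCRIBE DETAIL reports the storage format:
%sql
-- No USING clause: Databricks defaults to the Delta format
CREATE OR REPLACE TABLE flashcards_demo (id INT, note STRING);
-- The format column of the result should report delta
DESCRIBE DETAIL flashcards_demo;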
How does Delta Lake implement ACID transactions?
Atomicity
- Each transaction treated as single unit
- Entire statement or none is executed
- Prevents data loss and corruption
Consistency
- Ensures transactions only make changes to tables in predefined, predictable ways
- Ensures that corruption or errors in your data do not create unintended consequences
Isolation
- Ensures that concurrent transactions don’t interfere with or affect one another
Durability
- Ensures that changes to your data made by successfully executed transactions will be saved
Why strive to achieve ACID transactions?
- Ensure the highest possible data reliability and integrity
- Ensure data never falls into an inconsistent state
Features and benefits of Delta Lake (7)
- Schema Enforcement
- ACID Transactions
- Schema Evolution
- Time Travel
- Unified Batch and Streaming
- Scalable Metadata Handling
- Optimistic concurrency control
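A hedged sketch of the Time Travel feature (the table name my_table, version number, and timestamp are placeholders):
%sql
-- Every write creates a new table version; the history is queryable
DESCRIBE HISTORY my_table;
-- Query an earlier snapshot by version number or by timestamp
SELECT * FROM my_table VERSION AS OF 2;
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-01';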
External vs Managed tables
Managed Tables
- Data is stored internally in Databricks-managed storage (DBFS).
- Databricks controls both the metadata and the data. If you drop the table, both the table metadata and the actual data are removed
External Tables
- Data is stored outside Databricks (e.g. cloud object storage)
- The table metadata is still stored within the Databricks metastore
- When dropping a table, only the metadata in the Databricks metastore is removed; the data remains in the external location
When to use Managed Tables
- Simpler management: Databricks fully manages both your data and metadata
- It is the default option if a location is not specified when creating a table
- Data is managed in a unified way, especially for Delta format
When to use External Tables
- When you want to store your data outside Databricks and still use it within Databricks
- Useful for datasets that need to be accessed from multiple systems or environments
- Better for data sharing (through cloud storage access)
- Store data independently of the Databricks cluster (persistence)
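A hedged sketch of creating an external table (the cloud storage path and table name are placeholders); dropping it later removes only the metastore entry, not the underlying files:
%sql
-- LOCATION points at storage we own, so the table is external
CREATE TABLE sales_external (id INT, amount DOUBLE)
LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/ext/sales';
-- Removes only the metadata; the data files stay in the external location
DROP TABLE sales_external;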
How to create a managed table (sql)
We create a managed table by not specifying a location. We can specify the schema (db) to use
USE ${da.schema_name}_default_location;
CREATE OR REPLACE TABLE managed_table (width INT, length INT, height INT);
INSERT INTO managed_table
VALUES (3, 2, 1);
SELECT * FROM managed_table;
How do we find the location of a managed table? (sql & python)
By default, managed tables in a schema without the location specified will be created in the dbfs:/user/hive/warehouse/<schema_name>.db/ directory.
%sql
DESCRIBE EXTENDED managed_table;
%python
tbl_location = spark.sql("DESCRIBE DETAIL managed_table").first().location
print(tbl_location)
files = dbutils.fs.ls(tbl_location)
display(files)
What are CTAS statements and why do we use them?
CREATE TABLE AS SELECT statements allow us to create and populate Delta tables using data retrieved from an input query.
- They automatically infer schema information from query results and do NOT support manual schema declaration
- Useful for external data ingestion from sources with well-defined schemas, such as Parquet files and tables
- Do not support specifying additional file options
- Not ideal for files such as CSV (create a TEMP VIEW with a manual schema first, then use CTAS; see the sketch after the syntax example below)
CTAS syntax (sql)
CREATE TABLE sales AS
SELECT * FROM parquet.`${variable}.path`;
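As noted in the previous card, CTAS can't take file options or a manual schema, so for CSV a common pattern is to declare a temporary view that carries them and then CTAS from that view. A hedged sketch (the path, schema, and view name are illustrative):
%sql
-- The temp view holds the manual schema and the file options CTAS can't express
CREATE OR REPLACE TEMP VIEW sales_csv_vw (order_id BIGINT, amount DOUBLE, order_date DATE)
USING CSV
OPTIONS (path = '/path/to/sales-csv', header = 'true', delimiter = ',');

-- CTAS then infers its schema from the view and writes a Delta table
CREATE OR REPLACE TABLE sales AS
SELECT * FROM sales_csv_vw;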
What are the two types of constraints that can be enforced at table creation, and how are they added (sql)?
Constraints enforce the quality and integrity of data added to a table through automatic verification. When a constraint is violated, the transaction will fail with error.
There are two types:
- NOT NULL (column values cannot be null), e.g.:
ALTER TABLE people10m ALTER COLUMN middleName SET NOT NULL;
- CHECK (a boolean expression that must hold for every row), e.g.:
ALTER TABLE people10m ADD CONSTRAINT dateWithinRange CHECK (birthDate > '1900-01-01');
Existing constraints can be viewed with DESCRIBE EXTENDED table;
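A hedged end-to-end sketch (table, column, and constraint names are illustrative): NOT NULL declared in the column list at creation, a CHECK constraint added afterwards, and a violating insert failing the whole transaction:
%sql
-- NOT NULL can be declared directly in the column definition
CREATE OR REPLACE TABLE orders (
  order_id BIGINT NOT NULL,
  order_date DATE
);
-- CHECK constraints are added with ALTER TABLE
ALTER TABLE orders ADD CONSTRAINT validDate CHECK (order_date > '2020-01-01');
-- This row violates the CHECK constraint, so the transaction fails with an error
INSERT INTO orders VALUES (1, DATE '2019-05-05');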
What are two optional metadata fields we can add to the table at creation and how are they added?
- current_timestamp(): records the timestamp when the logic is executed
- input_file_name(): records the source data file for each record in the table
E.g.
CREATE OR REPLACE TABLE users_pii
COMMENT "Contains PII"
LOCATION "${da.paths.working_dir}/tmp/users_pii"
PARTITIONED BY (first_touch_date)
AS
SELECT *,
cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date,
current_timestamp() updated,
input_file_name() source_file
FROM parquet.`${da.paths.datasets}/ecommerce/raw/users-historical/`
;
Difference between DEEP CLONE and SHALLOW CLONE
DEEP CLONE fully copies data and metadata from a source table to a target. Occurs incrementally, so executing this command again can sync changes from source to target
SHALLOW CLONE just copies the Delta transaction logs, so the data doesn’t actually move. Good option to quickly create a copy of a table without risking modifying the current table.
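A hedged sketch of both clone types (sales, sales_backup, and sales_dev are placeholder names):
%sql
-- Deep clone: copies data files and metadata; re-running it syncs new source changes incrementally
CREATE OR REPLACE TABLE sales_backup DEEP CLONE sales;
-- Shallow clone: copies only the Delta transaction log; data files stay with the source table
CREATE OR REPLACE TABLE sales_dev SHALLOW CLONE sales;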
Benefits of overwriting a table as opposed to deleting and recreating a table and what are the two ways of overwriting a table with Spark SQL?
- Overwriting is much faster because it doesn’t need to list the directory recursively or delete any files
- The old version can still be accessed via time travel
- It’s an atomic operation. Concurrent queries can still read the table while you are overwriting
- Due to ACID guarantees, if overwriting fails, the table will be in its previous state
Can be accomplished with CREATE OR REPLACE TABLE (CRAS) or INSERT OVERWRITE
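Hedged sketches of both approaches (the table name, source path, and version number are placeholders):
%sql
-- Option 1: CRAS fully replaces the table contents and can also change the schema
CREATE OR REPLACE TABLE events AS
SELECT * FROM parquet.`/path/to/new-events`;

-- Option 2: INSERT OVERWRITE replaces the data but only with rows matching the current schema
INSERT OVERWRITE events
SELECT * FROM parquet.`/path/to/new-events`;

-- The previous version is still reachable via time travel
SELECT * FROM events VERSION AS OF 1;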