Databricks Lakehouse Platform Flashcards
Describe the relationship between the data lakehouse and the data warehouse
The data lakehouse combines the benefits of data lakes and data warehouses and provides:
- Open, direct access to data stored in standard data formats.
- Indexing protocols optimized for machine learning and data science.
- Low query latency and high reliability for BI and advanced analytics.
- Support for both batch and streaming analytics.
What key technologies does Databricks Lakehouse rely on?
- Delta Lake: an optimized storage layer that supports ACID transactions and schema enforcement.
- Unity Catalog: a unified, fine-grained governance solution for data and AI.
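For example, a minimal PySpark sketch (the catalog, schema, table, and column names are hypothetical) of Delta Lake behaviour on Databricks, where Delta is the default table format and schema enforcement rejects mismatched writes:

```python
# Delta Lake is the default table format on Databricks, so saveAsTable
# creates a Delta table with ACID transactions and schema enforcement.
df = spark.createDataFrame(
    [(1, "sensor-a", 21.4), (2, "sensor-b", 19.8)],
    "id INT, device STRING, temperature DOUBLE",
)
df.write.mode("append").saveAsTable("main.default.readings")

# Appending data with a mismatched schema is rejected (schema enforcement)
# unless schema evolution is explicitly enabled.
bad = spark.createDataFrame(
    [(3, "sensor-c", "not-a-number")],
    "id INT, device STRING, temperature STRING",
)
# bad.write.mode("append").saveAsTable("main.default.readings")  # raises an exception
```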
What is a data lakehouse used for?
Data lakehouses often use a data design pattern that incrementally improves, enriches, and refines data as it moves through layers of staging and transformation.
- Data ingestion: batch or streaming data arrives from a variety of sources and in a variety of formats, generally in a raw state. Convert it to Delta tables and verify the data.
- Data processing: refine, cleanse, add features, and integrate the data for ML and analytics use cases, or prepare it for specific business needs.
- Data serving: serve clean, enriched data to end users. Data lineage can be traced back to a single source of truth with Unity Catalog.
What are ACID guarantees on Databricks?
Databricks uses Delta Lake by default for all reads and writes and builds upon the ACID guarantees it provides:
- Atomicity: all transactions either succeed or fail completely.
- Consistency: how a given state of the data is observed by simultaneous operations.
- Isolation: how simultaneous operations potentially conflict with one another.
- Durability: committed changes are permanent.
Who are the intended users of the data at each stage of the medallion architecture?
- Bronze (raw, data ingestion): data engineers, data operations, compliance and audit teams
- Silver (cleaned and validated): data engineers, data analysts, data scientists
- Gold (aggregated): BI developers/analysts, data scientists, executives and decision-makers, operational teams
Features of data at the Bronze Layer
- Maintains raw state in original formats
- Appended incrementally and grows over time
- Consumed only by downstream data enrichment workloads, not by analysts or other end users
- Single source of truth
- Enables reprocessing and auditing by retaining all historical data
- Can be a combination of streaming and batch transactions from sources such as cloud object storage, message buses, and federated systems
- Very minimal validation performed
- Best practice is to store all fields as string (to avoid data loss)
- Metadata fields can be added, such as ingestion timestamps and source file names
- Leave cleansing and deduplication for Silver Layer
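For illustration, a hedged sketch of bronze ingestion with Auto Loader (the source path, checkpoint location, and table names are hypothetical):

```python
from pyspark.sql import functions as F

raw = (
    spark.readStream.format("cloudFiles")          # Auto Loader
    .option("cloudFiles.format", "json")
    # Keep fields as strings rather than inferring types, to avoid data loss.
    .option("cloudFiles.inferColumnTypes", "false")
    .load("s3://landing-bucket/orders/")
)

bronze = (
    raw
    # Add ingestion metadata such as arrival timestamp and source file name.
    .withColumn("ingest_ts", F.current_timestamp())
    .withColumn("source_file", F.col("_metadata.file_path"))
)

# Append incrementally to the bronze table; cleansing and deduplication
# are deferred to the silver layer.
(bronze.writeStream
    .option("checkpointLocation", "s3://landing-bucket/_checkpoints/orders_bronze/")
    .trigger(availableNow=True)
    .toTable("main.bronze.orders_raw"))
```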
Features of data at the Silver Layer
- Read from one or more bronze or silver tables and write to silver
- Best practice is not to ingest directly to silver (store in bronze first)
- Should always include at least one validated, non-aggregated representation of each record
- Perform cleansing, deduplication, and normalization
- Enhance data quality by correcting errors and inconsistencies
- Structure data into a consumable format for downstream workloads
- Enforce and evolve schemas
- Add data quality checks and enforcement
- Handle out-of-order and late-arriving data
- Join data
- Start data modelling
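A hedged sketch of a bronze-to-silver transformation (table and column names are hypothetical), casting the string fields captured at bronze, applying basic quality checks, and deduplicating:

```python
from pyspark.sql import functions as F

bronze = spark.read.table("main.bronze.orders_raw")

silver = (
    bronze
    # Enforce a schema by casting the string fields stored in bronze.
    .select(
        F.col("order_id").cast("bigint").alias("order_id"),
        F.col("order_ts").cast("timestamp").alias("order_ts"),
        F.col("amount").cast("decimal(10,2)").alias("amount"),
        F.upper(F.col("country_code")).alias("country_code"),  # normalisation
        "source_file",
    )
    # Data quality checks: drop records that fail validation.
    .where(F.col("order_id").isNotNull() & (F.col("amount") >= 0))
    # Deduplicate on the business key, keeping one validated row per order.
    .dropDuplicates(["order_id"])
)

silver.write.mode("overwrite").saveAsTable("main.silver.orders")
```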
Features of the Gold Layer
- Highly refined views of the data that drive downstream analytics, dashboards, ML and applications
- Highly aggregated and filtered for time periods or geographic regions
- Meaningful datasets that align to business needs
- Optimised for performance in queries and dashboards
- Data modelling occurs for reporting and analytics
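A hedged sketch of a gold-level aggregation for reporting (names are hypothetical), aggregating validated silver data by time period and region:

```python
from pyspark.sql import functions as F

silver = spark.read.table("main.silver.orders")

gold = (
    silver
    .groupBy(
        F.date_trunc("month", "order_ts").alias("order_month"),
        "country_code",
    )
    .agg(
        F.sum("amount").alias("total_revenue"),
        F.countDistinct("order_id").alias("order_count"),
    )
)

gold.write.mode("overwrite").saveAsTable("main.gold.monthly_revenue_by_country")
```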
Databricks Architecture: Control Plane
The backend services that Databricks manages in your account. Includes:
- Web application
- Compute orchestration
- Unity Catalog
- Queries and code
Databricks Architecture: Compute Plane
Where your data is processed. There are two types of compute:
1. Serverless compute: resources run in your Databricks account.
2. Classic compute: resources run in your own cloud account.
Databricks Architecture: Workspace storage bucket
A workspace will have an attached cloud storage bucket that contains:
1. Workspace system data: notebook revisions, job run details, command results, logs
2. DBFS: a distributed file system available in Databricks environments, accessible under dbfs:/
3. Unity Catalog workspace catalog: default workspace catalog. All users can create assets in the default schema
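For example, DBFS paths can be listed from a notebook with the built-in dbutils file system utilities:

```python
# List the DBFS root using the dbfs:/ scheme.
display(dbutils.fs.ls("dbfs:/"))
```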
Difference between all-purpose compute and jobs-compute
- All-purpose compute is provisioned to analyse data interactively in notebooks. Generally longer running and can be shared by multiple users.
- Jobs compute is provisioned to run automated jobs. The job scheduler automatically creates job compute when a job is configured to run on new compute, then terminates the compute when the job is complete.
In what scenario would restarting compute be useful?
- Restarting a compute will update it with the latest images. Long-running clusters should be scheduled to restart periodically to refresh the images.
How can multiple languages be used in the one notebook?
One language is set for the notebook, but other languages can be used with language magics (e.g. %python and %sql)
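For example, in a notebook whose default language is Python, a cell can switch to SQL for that cell only (the table name is hypothetical):

```python
# Cell 1 — default language (Python)
df = spark.read.table("main.silver.orders")
```

```sql
%sql
-- Cell 2 — switched to SQL with a language magic on the first line
SELECT country_code, COUNT(*) AS orders
FROM main.silver.orders
GROUP BY country_code
```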
How can a notebook be run from another notebook?
Using the magic command %run {path/to/notebook}
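For example (the path and variable name are hypothetical), %run must be in a cell by itself, and the variables and functions defined in the referenced notebook become available in the calling notebook:

```python
%run ./includes/setup
```

```python
# In a later cell: catalog_name is assumed to be defined by ./includes/setup
print(catalog_name)
```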