Databricks Lakehouse Platform Flashcards
Describe the relationship between the data lakehouse and the data warehouse
The data lakehouse combines the benefits of data lakes and data warehouses and provides:
- Open, direct access to data stored in standard data formats.
- Indexing protocols optimized for machine learning and data science.
- Low query latency and high reliability for BI and advanced analytics.
- Support for both batch and streaming analytics.
What key technologies does Databricks Lakehouse rely on?
- Delta Lake: an optimized storage layer that supports ACID transactions and schema enforcement.
- Unity Catalog: a unified, fine-grained governance solution for data and AI.
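A minimal PySpark sketch of how these fit together in practice: Delta gives transactional writes and schema enforcement, while a Unity Catalog three-level name (catalog.schema.table) identifies the governed table. The table name and the `spark` session are assumptions of a Databricks notebook environment, not anything prescribed above.

```python
from pyspark.sql import functions as F

# Hypothetical governed table: catalog.schema.table resolved by Unity Catalog.
table_name = "main.default.orders_demo"

# Each write is a single ACID transaction recorded in the Delta log.
orders = spark.range(3).withColumn("amount", F.lit(10.0))
orders.write.format("delta").mode("append").saveAsTable(table_name)

# Schema enforcement: an append whose types do not match the table schema is
# rejected and leaves the table untouched (unless schema evolution is enabled).
bad_orders = spark.range(3).withColumn("amount", F.lit("ten"))  # string, not double
# bad_orders.write.format("delta").mode("append").saveAsTable(table_name)  # would fail, table unchanged
```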
What is a data lakehouse used for?
Data lakehouses often use a data design pattern that incrementally improves, enriches, and refines data as it moves through layers of staging and transformation.
- Data ingestion: batch or streaming data arrives from a variety of sources and in a variety of formats, generally raw. Convert it to Delta tables and validate the data.
- Data processing: refine, cleanse, and integrate the data and add features. Use it for ML and analytics use cases, or prepare it for business needs.
- Data serving: serve clean, enriched data to end users. Unity Catalog lets you trace data lineage back to a single source of truth.
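A compressed end-to-end sketch of that flow, with hypothetical table and path names; the bronze, silver, and gold cards below expand each step.

```python
# Ingestion: land raw files in a bronze Delta table.
raw = spark.read.json("/Volumes/main/landing/orders/")          # hypothetical path
raw.write.format("delta").mode("append").saveAsTable("main.bronze.orders_raw")

# Processing: cleanse and refine into silver.
(spark.table("main.bronze.orders_raw")
    .dropDuplicates(["order_id"])
    .write.format("delta").mode("overwrite").saveAsTable("main.silver.orders"))

# Serving: aggregate into gold for BI and downstream consumers.
(spark.table("main.silver.orders")
    .groupBy("order_date").count()
    .write.format("delta").mode("overwrite").saveAsTable("main.gold.daily_order_counts"))
```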
What are ACID guarantees on Databricks?
Databricks uses Delta Lake by default for all reads and writes and builds upon the ACID guarantees provided by the Delta Lake protocol:
- Atomicity: all transactions either succeed or fail completely.
- Consistency: how a given state of the data is observed by simultaneous operations.
- Isolation: how simultaneous operations potentially conflict with one another.
- Durability: committed changes are permanent.
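A small sketch of what those guarantees look like from a notebook, using a hypothetical table: each successful write is one atomic, durable commit in the Delta transaction log, and readers query a consistent snapshot.

```python
table_name = "main.default.events_demo"   # hypothetical table

# Atomicity / durability: each successful write appears as exactly one committed
# version in the table history; a failed write adds nothing.
spark.range(5).write.format("delta").mode("append").saveAsTable(table_name)
spark.sql(f"DESCRIBE HISTORY {table_name}").select("version", "operation").show()

# Consistency / isolation: a query against an earlier committed version is
# unaffected by writes committed afterwards (Delta time travel).
spark.sql(f"SELECT COUNT(*) FROM {table_name} VERSION AS OF 0").show()
```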
Who are the intended users of the data at each stage of the medallion architecture?
- Bronze (raw, data ingestion): data engineers, data operations, compliance and audit teams
- Silver (cleaned and validated): data engineers, data analysts, data scientists
- Gold (aggregated): BI developers/analysts, data scientists, executives and decision-makers, operational teams
Features of data at the Bronze Layer
- Maintains raw state in original formats
- Appended incrementally and grows over time
- Consumed only by data enrichment workloads, not by analysts or other end users
- Single source of truth
- Enables reprocessing and auditing by retaining all historical data
- Can be a combination of streaming and batch transactions from sources such as cloud object storage, message buses, and federated systems
- Very minimal validation performed
- Best practice is to store all fields as string (to avoid data loss)
- Metadata fields can be added, such as timestamps, file names
- Leave cleansing and deduplication for Silver Layer
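A minimal bronze-ingestion sketch using Auto Loader, assuming hypothetical volume paths and table names; fields are kept as strings and only metadata columns are added, leaving cleansing to silver.

```python
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream.format("cloudFiles")                 # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "false")       # keep every field as string
    .load("/Volumes/main/landing/orders/")                # hypothetical landing path
    .withColumn("ingest_time", F.current_timestamp())     # metadata, not cleansing
    .withColumn("source_file", F.col("_metadata.file_path"))
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/main/checkpoints/orders_bronze")
    .trigger(availableNow=True)                           # incremental, append-only
    .toTable("main.bronze.orders_raw"))
```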
Features of data at the Silver Layer
- Read from one or more bronze or silver tables and write to silver
- Best practice is not to ingest directly to silver (store in bronze first)
- Should always include at least one validated, non-aggregated representation of each record
- Perform cleansing, deduplication, and normalization
- Enhance data quality by correcting errors and inconsistencies
- Structure data into a consumable format for downstream workloads
- Enforce and evolve schemas
- Add data quality checks and enforcement
- Handle out-of-order and late-arriving data
- Join data
- Start data modelling
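A silver-layer sketch under the same hypothetical names: read the bronze table, apply a basic quality check, cast string fields to an enforced schema, deduplicate, and join reference data.

```python
from pyspark.sql import functions as F

silver = (
    spark.table("main.bronze.orders_raw")
    .where(F.col("order_id").isNotNull())                        # basic data quality check
    .withColumn("order_ts", F.to_timestamp("order_ts"))          # cast from bronze strings
    .withColumn("amount", F.col("amount").cast("decimal(10,2)"))
    .dropDuplicates(["order_id"])                                 # one validated row per record
    .join(spark.table("main.silver.customers"), "customer_id", "left")
)

silver.write.format("delta").mode("overwrite").saveAsTable("main.silver.orders")
```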
Features of data at the Gold Layer
- Highly refined views of the data that drive downstream analytics, dashboards, ML and applications
- Highly aggregated and filtered for time periods or geographic regions
- Meaningful datasets that align to business needs
- Optimised for performance in queries and dashboards
- Data modelling occurs for reporting and analytics
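A gold-layer sketch: aggregate the hypothetical silver table into a business-level view that dashboards and ML workloads can query directly.

```python
from pyspark.sql import functions as F

daily_revenue = (
    spark.table("main.silver.orders")
    .groupBy(F.to_date("order_ts").alias("order_date"), "region")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("order_id").alias("orders"))
)

daily_revenue.write.format("delta").mode("overwrite").saveAsTable("main.gold.daily_revenue_by_region")
```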
Databricks Architecture: Control Plane
The backend services that Databricks manages in your account. Includes:
- Web application
- Compute orchestration
- Unity Catalog
- Queries and code
Databricks Architecture: Compute Plane
Where your data is processed. There are two types of compute:
1. Serverless compute: resources run in your Databricks account.
2. Classic compute: resources run in your own cloud account.
Databricks Architecture: Workspace storage bucket
A workspace will have an attached cloud storage bucket that contains:
1. Workspace system data: notebook revisions, job run details, command results, logs
2. DBFS: a distributed file system in the Databricks environment, accessible under dbfs:/
3. Unity Catalog workspace catalog: default workspace catalog. All users can create assets in the default schema
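A quick sketch, run inside a Databricks notebook, of browsing the DBFS root backed by the workspace storage bucket; dbutils is only available in the Databricks environment.

```python
# List the top-level paths in the DBFS root and print path and size for each.
for f in dbutils.fs.ls("dbfs:/"):
    print(f.path, f.size)
```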
Difference between all-purpose compute and jobs-compute
- All-purpose compute is provisioned to analyse data in notebooks. Generally longer running
- Jobs compute is provisioned to run automated jobs. The job scheduler automatically creates job compute when a job is configured to run on new compute, then terminates the compute when the job is complete.
In what scenario would restarting compute be useful?
- Restarting a compute will update it with the latest images. Long-running clusters should be scheduled to restart periodically to refresh the images.
How can multiple languages be used in one notebook?
One language is set for the notebook, but other languages can be used with language magics (e.g. %python and %sql)
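A sketch of two hypothetical cells in a notebook whose default language is Python; the second cell switches to SQL for that cell only via a language magic. The table name is a placeholder.

```python
# --- Cell 1 (Python, the notebook's default language) ---
display(spark.table("main.gold.daily_revenue_by_region"))   # hypothetical table

# --- Cell 2 (SQL, selected with a language magic on the first line) ---
# %sql
# SELECT region, SUM(revenue) AS revenue
# FROM main.gold.daily_revenue_by_region
# GROUP BY region
```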
How can a notebook be run from another notebook?
Using the magic command %run followed by the notebook path, e.g. %run ./path/to/notebook
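A hypothetical example where ./includes/setup is a notebook defining shared variables and helpers; %run must be the only content of its cell, and everything the called notebook defines becomes available in the calling notebook.

```python
# Cell containing only the magic command:
# %run ./includes/setup

# Later cells can then use anything defined in ./includes/setup, e.g. a variable
# that notebook is assumed to define:
# print(catalog_name)
```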