Databricks Lakehouse Platform Flashcards

1
Q

Describe the relationship between the data lakehouse and the data warehouse

A

The data lakehouse combines the benefits of data lakes and data warehouses and provides:

  • Open, direct access to data stored in standard data formats.
  • Indexing protocols optimized for machine learning and data science.
  • Low query latency and high reliability for BI and advanced analytics.
  • Support for both batch and streaming analytics.
2
Q

What key technologies does Databricks Lakehouse rely on?

A
  1. Delta Lake: an optimized storage layer that supports ACID transactions and schema enforcement.
  2. Unity Catalog: a unified, fine-grained governance solution for data and AI.
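
A minimal sketch of how the two fit together, assuming a hypothetical Unity Catalog catalog named main and a group named analysts:

    # Tables created under Unity Catalog's three-level namespace are Delta tables
    # by default: writes are ACID transactions and the declared schema is enforced.
    spark.sql("CREATE SCHEMA IF NOT EXISTS main.lakehouse_demo")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.lakehouse_demo.events (
            event_id   BIGINT,
            event_type STRING,
            event_ts   TIMESTAMP
        )
    """)

    # Unity Catalog adds fine-grained governance on top of the same objects.
    spark.sql("GRANT SELECT ON TABLE main.lakehouse_demo.events TO `analysts`")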
3
Q

What is a data lakehouse used for?

A

Data lakehouses often use a data design pattern that incrementally improves, enriches, and refines data as it moves through layers of staging and transformation.

  1. Data ingestion: batch or streaming data arrives from a variety of sources and formats, generally in a raw state; it is converted to Delta tables and validated on arrival.
  2. Data processing: refine, cleanse, enrich, and integrate the data, either for ML and analytics use cases or to prepare it for business needs.
  3. Data serving: serve clean, enriched data to end users, with Unity Catalog tracking lineage back to a single source of truth (a minimal end-to-end sketch follows below).
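
A compact sketch of that flow, with hypothetical table and column names:

    # 1. Data ingestion: land raw data as a bronze Delta table.
    (spark.read.json("/landing/orders/")
        .write.format("delta").mode("append").saveAsTable("bronze.orders"))

    # 2. Data processing: cleanse, deduplicate, and validate into a silver table.
    silver = (spark.table("bronze.orders")
        .dropDuplicates(["order_id"])
        .filter("order_id IS NOT NULL"))
    silver.write.format("delta").mode("append").saveAsTable("silver.orders")

    # 3. Data serving: publish an aggregated, business-ready gold table.
    (silver.groupBy("customer_id").count()
        .write.format("delta").mode("overwrite").saveAsTable("gold.orders_per_customer"))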
4
Q

What are ACID guarantees on Databricks?

A

Delta Lake is used for all reads and writes on Databricks and provides the following ACID guarantees:
- Atomicity: all transactions either succeed or fail completely.
- Consistency: how a given state of the data is observed by simultaneous operations.
- Isolation: how simultaneous operations potentially conflict with one another.
- Durability: committed changes are permanent.
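
A small sketch of these guarantees in practice, using a hypothetical table name:

    # Atomicity and durability: each write either commits fully or not at all.
    spark.range(3).write.format("delta").mode("overwrite").saveAsTable("demo.acid_example")
    spark.range(5).write.format("delta").mode("append").saveAsTable("demo.acid_example")

    # Every committed transaction is recorded as a version in the Delta log.
    spark.sql("DESCRIBE HISTORY demo.acid_example").show()

    # Consistency and isolation: readers see a committed snapshot, and earlier
    # versions remain queryable.
    spark.read.option("versionAsOf", 0).table("demo.acid_example").show()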

5
Q

Who are the intended users of the data at each stage of the medallion architecture?

A
  • Bronze (raw, data ingestion): data engineers, data operations, compliance and audit teams
  • Silver (cleaned and validated): data engineers, data analysts, data scientists
  • Gold (aggregated): BI developers/analysts, data scientists, executives and decision-makers, operational teams
6
Q

Features of data at the Bronze Layer

A
  • Maintains raw state in original formats
  • Appended incrementally and grows over time
  • Consumed by downstream enrichment workloads rather than analysts or other end users
  • Single source of truth
  • Enables reprocessing and auditing by retaining all historical data
  • Can be a combination of streaming and batch transactions from sources such as cloud object storage, message buses, and federated systems
  • Very minimal validation performed
  • Best practice is to store all fields as string (to avoid data loss)
  • Metadata fields can be added, such as timestamps, file names
  • Leave cleansing and deduplication to the Silver layer (see the ingestion sketch below)
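
A minimal sketch of bronze ingestion under these guidelines, using Auto Loader with hypothetical paths and table names:

    from pyspark.sql import functions as F

    bronze = (
        spark.readStream.format("cloudFiles")                           # incremental ingestion
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/checkpoints/bronze_clickstream/schema")
        .option("cloudFiles.inferColumnTypes", "false")                 # keep every field as string
        .load("/landing/clickstream/")
        .withColumn("ingest_ts", F.current_timestamp())                 # metadata: ingestion time
        .withColumn("source_file", F.col("_metadata.file_path"))        # metadata: source file
    )

    (bronze.writeStream
        .option("checkpointLocation", "/checkpoints/bronze_clickstream")
        .toTable("bronze.clickstream"))                                  # append-only Delta table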
7
Q

Features of data at the Silver Layer

A
  • Read from one or more bronze or silver tables and write to silver
  • Best practice is not to ingest directly to silver (store in bronze first)
  • Should always include at least one validated, non-aggregated representation of each record
  • Perform cleansing, deduplication, and normalization
  • Enhance data quality by correcting errors and inconsistencies
  • Structure data into a consumable format for downstream workloads
  • Enforce and evolve schemas
  • Add data quality checks and enforcement
  • Handle out-of-order and late-arriving data
  • Join data
  • Start data modelling
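
A minimal sketch of a bronze-to-silver step covering several of these points, with hypothetical names:

    from pyspark.sql import functions as F

    silver = (
        spark.table("bronze.clickstream")
        .withColumn("event_ts", F.to_timestamp("event_ts"))             # enforce types / schema
        .withColumn("user_id", F.col("user_id").cast("bigint"))
        .dropDuplicates(["event_id"])                                    # deduplicate
        .filter("event_id IS NOT NULL AND event_ts IS NOT NULL")         # basic quality checks
    )

    silver.write.format("delta").mode("append").saveAsTable("silver.clickstream")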
8
Q

Features of the Gold Layer

A
  • Highly refined views of the data that drive downstream analytics, dashboards, ML and applications
  • Highly aggregated and filtered for time periods or geographic regions
  • Meaningful datasets that align to business needs
  • Optimised for performance in queries and dashboards
  • Data modelling occurs for reporting and analytics
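
A minimal sketch of a gold-layer aggregate aligned to a business question (hypothetical names): daily activity per region, ready for dashboards and BI queries:

    from pyspark.sql import functions as F

    gold = (
        spark.table("silver.clickstream")
        .groupBy(F.to_date("event_ts").alias("event_date"), "region")    # aggregate by day and region
        .agg(F.count(F.lit(1)).alias("events"),
             F.countDistinct("user_id").alias("users"))
    )

    gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_activity_by_region")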
9
Q

Databricks Architecture: Control Plane

A

The backend services that Databricks manages in your Databricks account. Includes:
- Web application
- Compute orchestration
- Unity Catalog
- Queries and code

10
Q

Databricks Architecture: Compute Plane

A

Where your data is processed. There are two types of compute:
1. Serverless compute: resources run in your Databricks account.
2. Classic compute: resources run in your own cloud account.

11
Q

Databricks Architecture: Workspace storage bucket

A

A workspace will have an attached cloud storage bucket that contains:
1. Workspace system data: notebook revisions, job run details, command results, logs
2. DBFS: a distributed file system in the Databricks environment, accessible under dbfs:/
3. Unity Catalog workspace catalog: default workspace catalog. All users can create assets in the default schema
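
A small sketch of browsing the DBFS root from a notebook (the written path is hypothetical):

    # List the top-level directories in the workspace's DBFS root.
    display(dbutils.fs.ls("dbfs:/"))

    # Write and read back a small file to show it lives under dbfs:/.
    dbutils.fs.put("dbfs:/tmp/example.txt", "hello lakehouse", overwrite=True)
    print(dbutils.fs.head("dbfs:/tmp/example.txt"))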

12
Q

Difference between all-purpose compute and jobs-compute

A
  • All-purpose compute is provisioned to analyse data interactively in notebooks; it is generally longer running.
  • Jobs compute is provisioned to run automated jobs. The job scheduler automatically creates the compute when your job is scheduled to run, then terminates the compute when the job is complete.
13
Q

In what scenario would restarting compute be useful?

A
  • Restarting a compute resource updates it with the latest images. Long-running clusters should be scheduled to restart periodically so they pick up fresh images.
14
Q

How can multiple languages be used in the one notebook?

A

One language is set for the notebook, but other languages can be used with language magics (e.g. %python and %sql)
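
For example, in a notebook whose default language is Python, a %sql magic switches a single cell to SQL (two cells shown back to back; the view name is hypothetical):

    # Cell 1 (default notebook language: Python)
    spark.range(3).createOrReplaceTempView("numbers")

    %sql
    -- Cell 2: the language magic above switches this cell to SQL
    SELECT id FROM numbers ORDER BY id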

15
Q

How can a notebook be run from another notebook?

A

Using the magic command %run {path/to/notebook}
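
For example (path and variable names are hypothetical), if a notebook at ./setup defines a variable catalog_name, %run inlines those definitions into the calling notebook. Note that %run must sit in a cell by itself:

    %run ./setup

    # In a later cell: objects defined in ./setup are now in scope.
    print(catalog_name)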

16
Q

Databricks Runtime Versions

A
  1. Standard: Apache Spark + other components for optimised big data analytics
  2. Machine Learning: adds popular ML libraries such as TensorFlow and Keras
  3. Photon: an optional add-on that optimises Spark queries
17
Q

How do Databricks Repos enable CI/CD workflows?

A
  • Databricks Repos have native integration with popular Git providers
  • Git automation (e.g. on a merge to main) can call the Databricks Repos API to bring a repo in a production folder up to the latest version; a Databricks job can then run against that production repo (see the sketch below).
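
A hedged sketch of that hook, calling the Repos API update endpoint after a merge (the workspace URL, token, and repo ID are placeholders):

    import requests

    host = "https://<workspace-url>"
    repo_id = "<repo-id>"

    # Fast-forward the production repo checkout to the latest commit on main.
    resp = requests.patch(
        f"{host}/api/2.0/repos/{repo_id}",
        headers={"Authorization": "Bearer <personal-access-token>"},
        json={"branch": "main"},
    )
    resp.raise_for_status()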
18
Q

What are the Git operations available via Databricks Repos?

A
  • Clone, push to and pull from remote Git Repos
  • Create and manage branches for development
  • Create and edit notebooks and other files
  • Visually compare notebook differences upon commit
19
Q

What are some limitations of Notebooks relative to Repos?

A
  • Version control is tied to individual notebooks, whereas repos are version controlled at the repo-level (i.e. multiple files, entire codebase)
  • Notebook version control is built into the Databricks UI and is not directly integrated with Git systems.