Azure Databricks Flashcards
What is medallion architecture?
A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).
In medallion architecture, what is the bronze layer?
The landing zone for raw data.
In medallion architecture, what is the silver layer?
Cleansed and conformed data.
In the Silver layer of the lakehouse, the data from the Bronze layer is combined, organized, and cleaned up to create a comprehensive view of the important aspects of the business, such as customers, stores, transactions, and reference tables. This helps to ensure that the Silver layer provides a unified and reliable representation of the enterprise’s key information.
T or F: From a data modeling perspective, the Silver layer tends to use data models closer to third normal form (3NF).
True
In medallion architecture, what is the gold layer?
Curated business-level tables. Uses more denormalized, read-optimized data models with fewer joins (flattened data).
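The Bronze ⇒ Silver ⇒ Gold flow described in the cards above can be sketched in plain Python. This is only an illustration, not Spark or Delta Lake code; the sample records, the field names (`store`, `amount`), and the helpers `to_silver` and `to_gold` are all hypothetical.

```python
from collections import defaultdict

# Bronze: raw records exactly as they landed (hypothetical sample data).
bronze = [
    {"store": " North ", "amount": "10.50"},
    {"store": "north", "amount": "4.50"},
    {"store": "South", "amount": "not-a-number"},  # malformed row
    {"store": "South", "amount": "7.00"},
]

def to_silver(rows):
    """Cleanse and conform: normalize store names, parse amounts, drop bad rows."""
    silver_rows = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        silver_rows.append({"store": row["store"].strip().lower(), "amount": amount})
    return silver_rows

def to_gold(rows):
    """Aggregate into a flattened, read-optimized table: total sales per store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because the Bronze records are kept unchanged, the Silver and Gold results can be rebuilt at any time by rerunning the two functions.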
What are the benefits of a lakehouse architecture? (5 items)
Simple data model
Easy to understand and implement
Enables incremental ETL
Can recreate your tables from raw data at any time
ACID transactions, time travel
How does medallion architecture relate to the concept of a data mesh?
Medallion architecture is compatible with a data mesh. Bronze and silver tables can be joined together in a “one-to-many” fashion, meaning that the data in a single upstream table can be used to generate multiple downstream tables.
What is Databricks?
Databricks offers a powerful workspace that integrates various components and tools to simplify the end-to-end data lifecycle. It provides capabilities for data ingestion, data preparation, data exploration, machine learning, and visualization.
What programming languages does Databricks support?
Through its notebooks (similar in style to Jupyter notebooks), the platform supports multiple programming languages, including Python, Scala, R, and SQL.
T or F: Databricks has Apache Spark capabilities
T
T or F: You can manage Spark clusters through APIs with Databricks.
T
What do Databricks alerts do?
Alerts monitor data streams or queries and trigger an action when a result crosses a threshold. For example, you could set up an alert to monitor certain data streams and automatically create support tickets if those streams or queries exceed certain thresholds.
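The threshold idea behind an alert can be sketched in a few lines of plain Python. This does not use any Databricks alert API; the function `check_alert`, the sample values, and the threshold of 10 are hypothetical and serve only to show the comparison an alert performs.

```python
def check_alert(value, threshold):
    """Return a ticket description if value exceeds threshold, else None."""
    if value > threshold:
        return f"ALERT: value {value} exceeded threshold {threshold}"
    return None

# Hypothetical query results, e.g. failed-record counts from three runs.
results = [3, 12, 7]

# Keep only the runs that crossed the threshold; each becomes one "ticket".
tickets = [
    ticket
    for ticket in (check_alert(value, threshold=10) for value in results)
    if ticket is not None
]
print(tickets)
```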
What are Apache Spark clusters?
Spark is a distributed data processing solution that makes use of clusters to scale processing across multiple compute nodes. Each Spark cluster has a driver node to coordinate processing jobs, and one or more worker nodes on which the processing occurs.
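The driver and worker roles can be illustrated with a small analogy in plain Python. This is not Spark code: a thread pool stands in for the worker nodes, the main program stands in for the driver, and the partitions and the `count_words` helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Work done on one 'worker': count the words in its partition of lines."""
    return sum(len(line.split()) for line in partition)

# Hypothetical data set split into partitions, one per worker.
partitions = [
    ["the quick brown fox", "jumps over"],
    ["the lazy dog"],
    ["and runs away", "very fast indeed"],
]

# The 'driver' hands partitions to workers and combines their partial results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, partitions))

total = sum(partial_counts)
print(partial_counts, total)
```

In a real cluster the workers are separate machines and Spark handles the partitioning, scheduling, and data movement; the sketch only shows the split-then-combine pattern.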
What is the Databricks File System (DBFS)?
The nodes in a Spark cluster have access to a shared, distributed file system in which they can access and operate on data files. The Databricks File System (DBFS) enables you to mount cloud storage and use it to work with and persist file-based data.
What is a notebook?
An interactive document that combines runnable code, its output, and explanatory text. Writing code in notebooks is one of the most common ways for data analysts, data scientists, data engineers, and developers to work with Spark.