Azure Databricks Flashcards
What is medallion architecture?
A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).
In medallion architecture, what is the bronze layer?
The landing zone for raw data.
In medallion architecture, what is the silver layer?
Cleansed and conformed data.
In the Silver layer of the lakehouse, the data from the Bronze layer is combined, organized, and cleaned up to create a comprehensive view of the important aspects of the business, such as customers, stores, transactions, and reference tables. This helps to ensure that the Silver layer provides a unified and reliable representation of the enterprise’s key information.
T or F: From a data modeling perspective, the Silver layer has more third-normal-form-like data models.
True
In medallion architecture, what is the gold layer?
Curated business-level tables. The Gold layer uses more denormalized, read-optimized data models with fewer joins (flattened data).
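The Bronze ⇒ Silver ⇒ Gold flow described in the cards above can be sketched in a few lines. This is a minimal illustration in plain Python, not Databricks code: in a real lakehouse each layer would be a table processed with Spark, and the sample records and the helper names `to_silver` and `to_gold` are hypothetical.

```python
from collections import defaultdict

# Bronze: raw records exactly as they landed (strings, duplicates, bad rows).
bronze = [
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},  # duplicate
    {"order_id": "2", "customer": "bob", "amount": "4.00"},
    {"order_id": "3", "customer": "Alice", "amount": "not-a-number"},  # invalid
    {"order_id": "4", "customer": "Bob", "amount": "6.00"},
]


def to_silver(rows):
    """Cleanse and conform: cast types, normalize names, drop bad rows and duplicates."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that fail validation
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # skip duplicate orders
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "customer": row["customer"].strip().title(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Curate a flattened, read-optimized view: total spend per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because the Bronze records are kept untouched, the Silver and Gold results can be rebuilt from them at any time by rerunning the two functions.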
What are the benefits of a lakehouse architecture? (5 items)
Simple data model
Easy to understand and implement
Enables incremental ETL
Can recreate your tables from raw data at any time
ACID transactions, time travel
How does medallion architecture relate to the concept of a data mesh?
The medallion architecture is compatible with a data mesh. Bronze and Silver tables can be joined together in a “one-to-many” fashion, meaning that the data in a single upstream table can be used to generate multiple downstream tables.
What is Databricks?
Databricks offers a powerful workspace that integrates various components and tools to simplify the end-to-end data lifecycle. It provides capabilities for data ingestion, data preparation, data exploration, machine learning, and visualization.
What programming languages does Databricks support?
In its notebooks, the platform supports multiple programming languages, including Python, Scala, R, and SQL.
T or F: Databricks has Apache Spark capabilities
True
T or F: You can manage Spark clusters through APIs with Databricks.
True
What do Databricks alerts do?
Alerts monitor data and trigger an action when a threshold is crossed. For example, you could set up an alert to monitor certain data streams and automatically create support tickets if those data streams or queries exceed certain thresholds.
What are Apache Spark clusters?
Spark is a distributed data processing solution that makes use of clusters to scale processing across multiple compute nodes. Each Spark cluster has a driver node to coordinate processing jobs, and one or more worker nodes on which the processing occurs.
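As a rough analogy for the driver and worker roles, the sketch below splits a dataset into partitions, processes each partition in parallel, and combines the partial results. It uses threads in a single Python process as stand-ins for worker nodes. It is not Spark code, and the names `partition`, `worker_task`, and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Driver-side step: split the dataset into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def worker_task(chunk):
    """Worker-side step: process one partition independently."""
    return sum(x * x for x in chunk)


def run_job(data, num_workers=3):
    """Driver: hand one partition to each worker, then combine the partial results."""
    partitions = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, partitions))
    return sum(partial_results)


total = run_job(list(range(1, 11)))
print(total)
```

In a real cluster the partitions live on separate machines, so adding worker nodes lets the same job scale to data that does not fit on one node.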
What is the Databricks File System (DBFS)?
The nodes in a Spark cluster have access to a shared, distributed file system in which they can access and operate on data files. The Databricks File System (DBFS) enables you to mount cloud storage and use it to work with and persist file-based data.
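What is a notebook?
A notebook is an interactive document that combines runnable code cells with text and visualizations. Writing code in notebooks is one of the most common ways for data analysts, data scientists, data engineers, and developers to work with Spark.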
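What is a Hive metastore?
Hive is an open source technology used to define a relational abstraction layer of tables over file-based data, so that the tables can be queried using SQL syntax. The Hive metastore stores those table definitions, including the locations of the underlying files.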
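The idea of a table definition layered over file-based data can be illustrated with a toy lookup structure. This sketch is an assumption-laden stand-in, not Hive itself: it covers only how a table name resolves to a file location and a column schema, and it does not implement SQL. The `metastore` dictionary, the `read_table` helper, and the sample file are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical file-based data, written to a temporary directory for the demo.
data_dir = tempfile.mkdtemp()
stores_path = os.path.join(data_dir, "stores.csv")
with open(stores_path, "w", newline="") as f:
    csv.writer(f).writerows([["1", "Seattle"], ["2", "Austin"]])

# Toy "metastore": table name -> file location and column schema.
metastore = {
    "stores": {
        "location": stores_path,
        "columns": [("store_id", int), ("city", str)],
    }
}


def read_table(name):
    """Resolve a table name through the metastore and return typed rows."""
    entry = metastore[name]
    with open(entry["location"], newline="") as f:
        return [
            {col: cast(value) for (col, cast), value in zip(entry["columns"], raw)}
            for raw in csv.reader(f)
        ]


rows = read_table("stores")
print(rows)
```

The caller refers to the data only by table name. The file path and the column types stay in the catalog, which is the separation a metastore provides.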
What is Delta Lake?
Delta Lake builds on the relational table schema abstraction over files in the data lake to add support for SQL semantics commonly found in relational database systems, such as DDL and DML statements.
How is Databricks used as a data workflow tool?
Databricks can move data from one source to another and transform it along the way. The tools that integrate best with Databricks are therefore those that can receive that data flow. For example, Databricks can send data to a data store such as Azure Synapse Analytics for further analytics, or to Azure Data Lake.
How is Azure Active Directory used in Databricks?
Azure Active Directory controls access: each user or group can be granted various levels of permissions to your resources or your data.
What is an Azure Databricks workspace?
An Azure Databricks workspace is an analytics platform based on Apache Spark. Databricks integrates Spark into Azure and makes working with Spark seamless for collaborators such as data engineers and data scientists.
What is a SQL warehouse?
SQL Warehouses are relational compute resources with endpoints that enable client applications to connect to an Azure Databricks workspace and use SQL to work with data in tables.