Unity Catalog on Databricks Flashcards
What is Unity Catalog?
Unity Catalog is a fine-grained governance solution for data and AI on the Databricks platform. It helps simplify security and governance of your data and AI assets by providing a central place to administer and audit access to data and AI assets.
Unity Catalog provides **centralized access control, auditing, lineage, and data discovery** capabilities across Databricks workspaces.
What is a metastore?
The metastore is the top-level container for metadata in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.
Physical storage for any given metastore is, by default, isolated from storage for any other metastore in your account.
What are volumes?
Volumes are logical volumes of unstructured, non-tabular data in cloud object storage. Volumes can be either managed, with Unity Catalog managing the full lifecycle and layout of the data in storage, or external, with Unity Catalog managing access to the data from within Databricks, but not managing access to the data in cloud storage from other clients.
What are tables?
Tables are collections of data organized by rows and columns. Tables can be either managed, with Unity Catalog managing the full lifecycle of the table, or external, with Unity Catalog managing access to the data from within Databricks, but not managing access to the data in cloud storage from other clients.
What are views, functions, and models?
Views are saved queries against one or more tables.
Functions are units of saved logic that return a scalar value or set of rows.
Models are AI models packaged with MLflow and registered in Unity Catalog as functions.
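As a sketch, views and SQL functions can be created with standard DDL (the catalog, schema, and object names below are hypothetical):

```sql
-- A view: a saved query against one or more tables.
CREATE VIEW main.default.active_users AS
SELECT id, name
FROM main.default.users
WHERE active = true;

-- A SQL function: a unit of saved logic returning a scalar value.
CREATE FUNCTION main.default.to_fahrenheit(celsius DOUBLE)
RETURNS DOUBLE
RETURN celsius * 9 / 5 + 32;
```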
What are the securable objects in Unity Catalog apart from database objects and AI assets?
Storage credentials, which encapsulate a long-term cloud credential that provides access to cloud storage
External locations, which contain a reference to a storage credential and a cloud storage path. External locations can be used to create external tables or to assign a managed storage location for managed tables and volumes.
Connections, which represent credentials that give read-only access to an external database in a database system like MySQL using Lakehouse Federation.
Clean rooms, which represent a Databricks-managed environment where multiple participants can collaborate on projects without sharing underlying data with each other.
Shares, which are Delta Sharing objects that represent a read-only collection of data and AI assets that a data provider shares with one or more recipients.
Recipients, which are Delta Sharing objects that represent an entity that receives shares from a data provider.
Providers, which are Delta Sharing objects that represent an entity that shares data with a recipient.
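For example, an external location ties a cloud storage path to a storage credential so it can back external tables and volumes. A minimal sketch, assuming a storage credential named my_aws_credential already exists (all names and paths here are hypothetical):

```sql
-- An external location: a storage credential plus a cloud storage path.
CREATE EXTERNAL LOCATION my_landing_zone
URL 's3://my-bucket/landing'
WITH (STORAGE CREDENTIAL my_aws_credential);
```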
How is access granted at each level of the metastore?
You can grant and revoke access to securable objects at any level in the hierarchy, including the metastore itself.
Access to an object implicitly grants the same access to all children of that object, unless access is revoked.
You can use typical ANSI SQL commands to grant and revoke access to objects in Unity Catalog.
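A minimal sketch of such grants, using hypothetical catalog, table, and group names:

```sql
-- Grant a group the right to see and use a catalog.
GRANT USE CATALOG ON CATALOG sales TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA sales.default TO `data-analysts`;

-- Grant and revoke access to a specific table.
GRANT SELECT ON TABLE sales.default.orders TO `data-analysts`;
REVOKE SELECT ON TABLE sales.default.orders FROM `interns`;

-- Inspect the current grants on an object.
SHOW GRANTS ON TABLE sales.default.orders;
```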
What is the default access level of non-admin users in a workspace on Unity Catalog?
When a workspace is created, non-admin users have access only to the automatically-provisioned Workspace catalog, which makes this catalog a convenient place for users to try out the process of creating and accessing database objects in Unity Catalog.
Workspace admins and account admins have additional privileges by default. Metastore admin is an optional role, required if you want to manage table and volume storage at the metastore level, and convenient if you want to manage data centrally across multiple workspaces in a region.
What are managed tables?
Managed tables are fully managed by Unity Catalog, which means that Unity Catalog manages both **the governance and the underlying data files** for each managed table.
Managed tables are stored in a Unity Catalog **managed location** in your cloud storage. Managed tables always use the Delta Lake format. You can store managed tables at the metastore, catalog, or schema levels.
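A minimal example (hypothetical names); because the table is managed, no LOCATION is specified and Unity Catalog chooses the storage path:

```sql
-- A managed Delta table; Unity Catalog owns both governance and data files.
CREATE TABLE main.default.customers (
  id   BIGINT,
  name STRING
);

-- For managed tables, dropping the table also deletes the underlying data files.
-- DROP TABLE main.default.customers;
```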
What are external tables?
External tables are tables whose access from Databricks is managed by Unity Catalog, but whose data lifecycle and file layout are managed using your cloud provider and other data platforms. Typically you use external tables to register large amounts of your existing data in Databricks, or if you also require write access to the data using tools outside of Databricks. External tables are supported in multiple data formats. Once an external table is registered in a Unity Catalog metastore, you can manage and audit Databricks access to it—and work with it—just like you can with managed tables.
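A sketch of registering an external table (hypothetical names and path; the path must fall under a defined Unity Catalog external location):

```sql
-- An external table: Unity Catalog governs access, but the data files
-- live at a path you manage, and dropping the table does not delete them.
CREATE TABLE main.default.events_ext (
  event_id BIGINT,
  payload  STRING
)
USING DELTA
LOCATION 's3://my-bucket/events';
```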
What are managed volumes?
Managed volumes are fully managed by Unity Catalog, which means that Unity Catalog manages access to the volume’s storage location in your cloud provider account. When you create a managed volume, it is automatically stored in the managed storage location assigned to the containing schema.
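For example (hypothetical names):

```sql
-- A managed volume; its files land in the containing schema's
-- managed storage location.
CREATE VOLUME main.default.raw_files;

-- Files in the volume are then addressed by path, e.g.:
-- /Volumes/main/default/raw_files/upload.csv
```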
What are external volumes?
External volumes represent existing data in storage locations that are managed outside of Databricks, but registered in Unity Catalog to control and audit access from within Databricks.
When you create an external volume in Databricks, you specify its location, which must be on a path that is defined in a Unity Catalog external location.
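A sketch (hypothetical names; the path must be covered by a Unity Catalog external location):

```sql
-- An external volume over pre-existing files in cloud storage.
CREATE EXTERNAL VOLUME main.default.landing
LOCATION 's3://my-bucket/landing';
```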
How can you define storage locations at different levels of Unity Catalog?
Your organization may require that data of certain types be stored within specific accounts or buckets in your cloud tenant.
Unity Catalog gives you the ability to **configure storage locations at the metastore, catalog, or schema level** to satisfy such requirements. The system evaluates the hierarchy of storage locations from schema to catalog to metastore.
For example, let’s say your organization has a company compliance policy that requires production data relating to human resources to reside in the bucket s3://mycompany-hr-prod. In Unity Catalog, you can achieve this by setting a location at the catalog level: create a catalog called, for example, hr_prod, and assign it the location s3://mycompany-hr-prod/unity-catalog. This means that managed tables or volumes created in the hr_prod catalog (for example, using CREATE TABLE hr_prod.default.table …) **store their data in s3://mycompany-hr-prod/unity-catalog**. Optionally, you can provide schema-level locations to organize data within the hr_prod catalog at a more granular level.
If storage isolation is not required for some catalogs, you can optionally set a storage location at the metastore level. This location serves as a default location for managed tables and volumes in catalogs and schemas that don’t have assigned storage.
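Using the hr_prod scenario above, the catalog-level location could be assigned at creation time (a sketch; the bucket path is the document's example):

```sql
-- Managed tables and volumes in this catalog default to the given location.
CREATE CATALOG hr_prod
MANAGED LOCATION 's3://mycompany-hr-prod/unity-catalog';

-- Optionally, a schema-level location for more granular organization:
CREATE SCHEMA hr_prod.payroll
MANAGED LOCATION 's3://mycompany-hr-prod/unity-catalog/payroll';
```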
What is workspace-catalog binding?
By default, catalog owners (and metastore admins, if they are defined for the account) can make a catalog accessible to users in multiple workspaces attached to the same Unity Catalog metastore. If you use workspaces to isolate user data access, however, you might want to limit catalog access to specific workspaces in your account, to ensure that certain kinds of data are processed only in those workspaces. You might want separate production and development workspaces, for example, or a separate workspace for processing personal data. This is known as **workspace-catalog binding**.
What is Lakehouse Federation?
Lakehouse Federation is the query federation platform for Databricks. The term query federation describes a collection of features that enable users and systems to run queries against multiple siloed data sources without needing to migrate all data to a unified system.
Databricks uses Unity Catalog to manage query federation. You use Unity Catalog to configure read-only connections to popular external database systems and create foreign catalogs that mirror external databases. Unity Catalog’s data governance and data lineage tools ensure that data access is managed and audited for all federated queries made by the users in your Databricks workspaces.
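A hedged sketch of the two-step setup described above, federating to MySQL (the host, secret scope, and database names are hypothetical):

```sql
-- Step 1: a read-only connection to the external database system.
CREATE CONNECTION mysql_conn TYPE mysql
OPTIONS (
  host 'mysql.example.com',
  port '3306',
  user secret('my_scope', 'mysql_user'),
  password secret('my_scope', 'mysql_password')
);

-- Step 2: a foreign catalog that mirrors a database in that system.
CREATE FOREIGN CATALOG mysql_cat
USING CONNECTION mysql_conn
OPTIONS (database 'my_db');
```

Once the foreign catalog exists, its tables can be queried with regular three-level names (for example, mysql_cat.my_db.some_table), and access is governed and audited like any other Unity Catalog object.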