Unity Catalog on Databricks Flashcards

1
Q

What is Unity Catalog?

A

Unity Catalog is a fine-grained governance solution for data and AI on the Databricks platform. It helps simplify security and governance of your data and AI assets by providing a central place to administer and audit access to data and AI assets.

Unity Catalog provides **centralized access control, auditing, lineage, and data discovery** capabilities across Databricks workspaces.

2
Q

What is a metastore?

A

The metastore is the top-level container for metadata in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.

Physical storage for any given metastore is, by default, isolated from storage for any other metastore in your account.

3
Q

What are volumes?

A

Volumes are logical volumes of unstructured, non-tabular data in cloud object storage. Volumes can be either managed, with Unity Catalog managing the full lifecycle and layout of the data in storage, or external, with Unity Catalog managing access to the data from within Databricks, but not managing access to the data in cloud storage from other clients.

4
Q

What are tables?

A

Tables are collections of data organized by rows and columns. Tables can be either managed, with Unity Catalog managing the full lifecycle of the table, or external, with Unity Catalog managing access to the data from within Databricks, but not managing access to the data in cloud storage from other clients.

5
Q

What are views, functions, and models?

A

Views are saved queries against one or more tables.

Functions are units of saved logic that return a scalar value or set of rows.

Models are AI models packaged with MLflow and registered in Unity Catalog as functions.

6
Q

What are the securable objects in Unity Catalog apart from database objects and AI assets?

A

Storage credentials, which encapsulate a long-term cloud credential that provides access to cloud storage.

External locations, which contain a reference to a storage credential and a cloud storage path. External locations can be used to create external tables or to assign a managed storage location for managed tables and volumes.

Connections, which represent credentials that give read-only access to an external database in a database system like MySQL using Lakehouse Federation.

Clean rooms, which represent a Databricks-managed environment where multiple participants can collaborate on projects without sharing underlying data with each other.

Shares, which are Delta Sharing objects that represent a read-only collection of data and AI assets that a data provider shares with one or more recipients.

Recipients, which are Delta Sharing objects that represent an entity that receives shares from a data provider.

Providers, which are Delta Sharing objects that represent an entity that shares data with a recipient.

7
Q

How is access granted at each level of the metastore?

A

You can grant and revoke access to securable objects at any level in the hierarchy, including the metastore itself.

Access to an object implicitly grants the same access to all children of that object, unless access is revoked.

You can use typical ANSI SQL commands to grant and revoke access to objects in Unity Catalog.
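
As a rough illustration of the SQL commands mentioned above, the sketch below builds GRANT and REVOKE statements as plain strings. The helper names, the table `main.default.sales`, and the group `analysts` are hypothetical. In a Databricks notebook the resulting string would typically be passed to `spark.sql(...)`, which is assumed here and not executed.

```python
def quote_principal(principal: str) -> str:
    """Wrap a user or group name in backticks, doubling any embedded backticks."""
    return "`" + principal.replace("`", "``") + "`"


def grant_stmt(privilege: str, securable_type: str, name: str, principal: str) -> str:
    """Build a GRANT statement for one privilege on one securable object."""
    return f"GRANT {privilege} ON {securable_type} {name} TO {quote_principal(principal)}"


def revoke_stmt(privilege: str, securable_type: str, name: str, principal: str) -> str:
    """Build the matching REVOKE statement."""
    return f"REVOKE {privilege} ON {securable_type} {name} FROM {quote_principal(principal)}"


# Hypothetical object and group names, for illustration only.
grant_sql = grant_stmt("SELECT", "TABLE", "main.default.sales", "analysts")
revoke_sql = revoke_stmt("SELECT", "TABLE", "main.default.sales", "analysts")
print(grant_sql)
print(revoke_sql)
```

The same helpers work for other securable types, for example `grant_stmt("USE CATALOG", "CATALOG", "main", "analysts")`.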

8
Q

What is the default access level of non-admin users in a workspace on Unity Catalog?

A

When a workspace is created, non-admin users have access only to the automatically-provisioned Workspace catalog, which makes this catalog a convenient place for users to try out the process of creating and accessing database objects in Unity Catalog.

Workspace admins and account admins have additional privileges by default. Metastore admin is an optional role, required if you want to manage table and volume storage at the metastore level, and convenient if you want to manage data centrally across multiple workspaces in a region.

9
Q

What are managed tables?

A

Managed tables are fully managed by Unity Catalog, which means that Unity Catalog manages both **the governance and the underlying data files** for each managed table.
Managed tables are stored in a Unity Catalog-**managed location** in your cloud storage. Managed tables always use the Delta Lake format. You can store managed tables at the metastore, catalog, or schema levels.

10
Q

What are external tables?

A

External tables are tables whose access from Databricks is managed by Unity Catalog, but whose data lifecycle and file layout are managed using your cloud provider and other data platforms. Typically you use external tables to register large amounts of your existing data in Databricks, or if you also require write access to the data using tools outside of Databricks. External tables are supported in multiple data formats. Once an external table is registered in a Unity Catalog metastore, you can manage and audit Databricks access to it—and work with it—just like you can with managed tables.

11
Q

What are managed volumes?

A

Managed volumes are fully managed by Unity Catalog, which means that Unity Catalog manages access to the volume’s storage location in your cloud provider account. When you create a managed volume, it is automatically stored in the managed storage location assigned to the containing schema.

12
Q

What are external volumes?

A

External volumes represent existing data in storage locations that are managed outside of Databricks, but registered in Unity Catalog to control and audit access from within Databricks.
When you create an external volume in Databricks, you specify its location, which must be on a path that is defined in a Unity Catalog external location.

13
Q

How can you define storage locations at different levels of Unity Catalog?

A

Your organization may require that data of certain types be stored within specific accounts or buckets in your cloud tenant.

Unity Catalog gives you the ability to **configure storage locations at the metastore, catalog, or schema level** to satisfy such requirements. The system evaluates the hierarchy of storage locations from schema to catalog to metastore.

For example, let's say your organization has a compliance policy that requires production data relating to human resources to reside in the bucket s3://mycompany-hr-prod. In Unity Catalog, you can meet this requirement by setting a location at the catalog level: create a catalog called, for example, hr_prod, and assign the location s3://mycompany-hr-prod/unity-catalog to it. This means that managed tables or volumes created in the hr_prod catalog (for example, using CREATE TABLE hr_prod.default.table …) store their data in s3://mycompany-hr-prod/unity-catalog. Optionally, you can choose to provide schema-level locations to organize data within the hr_prod catalog at a more granular level.

If storage isolation is not required for some catalogs, you can optionally set a storage location at the metastore level. This location serves as a default location for managed tables and volumes in catalogs and schemas that don't have assigned storage.
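
The schema-to-catalog-to-metastore evaluation order described above can be sketched as a simple fallback lookup. This is only a model of the rule, not how Unity Catalog implements it; the function name and the metastore bucket `s3://hypothetical-metastore-root` are assumptions, while the hr_prod location comes from the example above.

```python
from typing import Optional


def resolve_managed_location(
    schema_location: Optional[str],
    catalog_location: Optional[str],
    metastore_location: Optional[str],
) -> str:
    """Return the first configured location, checking schema, then catalog, then metastore."""
    for location in (schema_location, catalog_location, metastore_location):
        if location:
            return location
    raise ValueError("no managed storage location is configured at any level")


# Catalog-level location from the hr_prod example, with no schema-level override.
hr_location = resolve_managed_location(
    schema_location=None,
    catalog_location="s3://mycompany-hr-prod/unity-catalog",
    metastore_location="s3://hypothetical-metastore-root",  # assumed value
)
print(hr_location)
```

A catalog and schema with no location of their own would fall through to the metastore-level location, and the lookup fails if none is set at any level.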

14
Q

What is workspace-catalog binding?

A

By default, catalog owners (and metastore admins, if they are defined for the account) can make a catalog accessible to users in multiple workspaces attached to the same Unity Catalog metastore. If you use workspaces to isolate user data access, however, you might want to limit catalog access to specific workspaces in your account, to ensure that certain kinds of data are processed only in those workspaces. You might want separate production and development workspaces, for example, or a separate workspace for processing personal data. This is known as **workspace-catalog binding**.
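
To make the idea concrete, here is a small model of binding as an allow-list per catalog. The catalog and workspace names are hypothetical, and the dictionary is only a stand-in for configuration that Databricks stores itself.

```python
from typing import Dict, Optional, Set

# Hypothetical bindings: catalog name -> workspaces allowed to access it.
# None stands for a catalog that is not restricted to specific workspaces.
bindings: Dict[str, Optional[Set[str]]] = {
    "prod_catalog": {"prod-workspace"},
    "dev_catalog": {"dev-workspace", "test-workspace"},
    "shared_reference": None,
}


def is_catalog_accessible(catalog: str, workspace: str) -> bool:
    """Check whether a catalog can be reached from the given workspace."""
    allowed = bindings[catalog]
    return allowed is None or workspace in allowed


print(is_catalog_accessible("prod_catalog", "dev-workspace"))
print(is_catalog_accessible("shared_reference", "dev-workspace"))
```

Binding only decides where a catalog is reachable. Users in a bound workspace still need the usual privileges on the catalog and its objects.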

15
Q

What is Lakehouse Federation?

A

Lakehouse Federation is the query federation platform for Databricks. The term query federation describes a collection of features that enable users and systems to run queries against multiple siloed data sources without needing to migrate all data to a unified system.

Databricks uses Unity Catalog to manage query federation. You use Unity Catalog to configure read-only connections to popular external database systems and create foreign catalogs that mirror external databases. Unity Catalog’s data governance and data lineage tools ensure that data access is managed and audited for all federated queries made by the users in your Databricks workspaces.

16
Q

What is Delta Sharing?

A

Delta Sharing is a secure data sharing platform that lets you share data and AI assets with users outside your organization, whether or not those users use Databricks. Although Delta Sharing is available as an open-source implementation, in Databricks it requires Unity Catalog to take full advantage of extended functionality.

Databricks Marketplace, an open forum for exchanging data products, is built on top of Delta Sharing, and as such, you must have a Unity Catalog-enabled workspace to be a Marketplace provider.

17
Q

What are catalogs?

A

Catalogs are the highest level in the data hierarchy (catalog > schema > table/view/volume) managed by the Unity Catalog metastore. They are intended as the primary unit of data isolation in a typical Databricks data governance model.

Catalogs can be stored at the metastore level, or you can **configure a catalog to be stored separately from the rest of the parent metastore**. If your workspace was enabled for Unity Catalog automatically, there is no metastore-level storage, and you must specify a storage location when you create a catalog.

18
Q

What are schemas?

A

Schemas, also known as databases, are logical groupings of tabular data (tables and views), non-tabular data (volumes), functions, and machine learning models. They give you a way to organize and control access to data that is more granular than catalogs. Typically they represent a single use case, project, or team sandbox.

Schemas can be stored in the same physical storage as the parent catalog, or you can configure a schema to be stored separately from the rest of the parent catalog.

Metastore admins, parent catalog owners, and schema owners can manage access to schemas.

19
Q

What is the difference between the centralized and distributed data governance models?

A

In the centralized governance model, your governance administrators are owners of the metastore and can take ownership of any object and grant and revoke permissions.

In a distributed governance model, the catalog or a set of catalogs is the data domain. The owner of that catalog can create and own all assets and manage governance within that domain. The owners of any given domain can operate independently of the owners of other domains.

20
Q

What are external locations and how are they used?

A

External locations allow Unity Catalog to read and write data on your cloud tenant on behalf of users. External locations are defined as a path to cloud storage, combined with a storage credential that can be used to access that location.

You can use external locations to register external tables and external volumes in Unity Catalog. The content of these entities is physically located on a sub-path in an external location that is referenced when a user creates the volume or the table.

For increased data isolation, you can bind storage credentials and external locations to specific workspaces.

You should use external locations to do the following:

Register external tables and volumes using the CREATE EXTERNAL VOLUME or CREATE TABLE commands.

Explore existing files in cloud storage before you create an external table or volume at a specific prefix. The READ FILES privilege is a precondition.

Register a location as managed storage for catalogs and schemas instead of the metastore root bucket. The CREATE MANAGED STORAGE privilege is a precondition.

More recommendations for using external locations:

Avoid path overlap conflicts: never create external volumes or tables at the root of an external location.
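
The recommendation above can be expressed as a simple path check. The sketch below is a plain string comparison written for illustration, not a Databricks API; the bucket `s3://hypothetical-bucket/landing` and the function names are assumptions.

```python
def normalize(url: str) -> str:
    """Drop trailing slashes so equivalent paths compare equal."""
    return url.rstrip("/")


def is_under_location(object_path: str, location_url: str) -> bool:
    """True if object_path is the location itself or a sub-path of it."""
    obj, loc = normalize(object_path), normalize(location_url)
    return obj == loc or obj.startswith(loc + "/")


def is_valid_external_path(object_path: str, location_url: str) -> bool:
    """Accept only strict sub-paths: inside the location, but not at its root."""
    inside = is_under_location(object_path, location_url)
    at_root = normalize(object_path) == normalize(location_url)
    return inside and not at_root


location = "s3://hypothetical-bucket/landing"  # assumed external location URL
print(is_valid_external_path("s3://hypothetical-bucket/landing/tables/orders", location))
print(is_valid_external_path("s3://hypothetical-bucket/landing/", location))
```

Appending `/` before the prefix comparison keeps a sibling path such as `landing-other` from being treated as part of `landing`.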

21
Q

Who can grant privileges on an object to other principals?

A

Each securable object in Unity Catalog has an owner. The principal that creates an object becomes its initial owner.

An object’s owner has all privileges on the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges on the securable object to other principals.

Only owners of a securable object have the permission to grant privileges on that object to other principals. Therefore, it is best practice to configure ownership on all objects to the group responsible for administration of grants on the object. Both the owner and metastore admins can transfer ownership of a securable object to a group.

Additionally, if the object is contained within a catalog (like a table or view), the catalog and schema owner can change the ownership of the object.

22
Q

What does Databricks set up when it automatically enables a workspace for Unity Catalog?

A
  • An automatically-provisioned Unity Catalog metastore (unless a Unity Catalog metastore already existed for the workspace region).
  • Default privileges for workspace admins, such as the ability to create a catalog or an external database connection.
  • No metastore admin (unless an existing Unity Catalog metastore was used and a metastore admin was already assigned).
  • No metastore-level storage for managed tables and managed volumes (unless an existing Unity Catalog metastore with metastore-level storage was used).
  • A workspace catalog, which, when originally provisioned, is named after your workspace.
  • All users in your workspace can create assets in the **default** schema in this catalog. By default, this catalog is bound to your workspace, which means that it can only be accessed through your workspace. Automatic provisioning of the workspace catalog at workspace creation is rolling out gradually across accounts.
23
Q

What are the requirements for a compute resource to be able to connect to Unity Catalog?

A

To run Unity Catalog workloads, compute resources must comply with certain security requirements. Non-compliant compute resources cannot access data or other objects in Unity Catalog.

SQL warehouses always comply with Unity Catalog requirements, but some cluster access modes do not.

Clusters must use the single user or shared access mode; the no isolation shared mode is not supported.
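
The rule above amounts to a small lookup. The mode labels below are hypothetical shorthand for the access modes named in this card, not values from the Databricks Clusters API.

```python
# Hypothetical labels for the cluster access modes discussed in this card.
UNITY_CATALOG_COMPATIBLE_MODES = {"single_user", "shared"}


def can_access_unity_catalog(compute_type: str, access_mode: str = "") -> bool:
    """SQL warehouses always qualify; clusters qualify only in a compatible mode."""
    if compute_type == "sql_warehouse":
        return True
    return access_mode in UNITY_CATALOG_COMPATIBLE_MODES


print(can_access_unity_catalog("cluster", "shared"))
print(can_access_unity_catalog("cluster", "no_isolation_shared"))
```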

24
Q

What are the default privileges assigned to users in a workspace enabled for Unity Catalog?

A

Some workspaces have default user (non-admin) privileges upon launch:

If your workspace launched with an automatically-provisioned workspace catalog, all workspace users can create objects in the workspace catalog’s default schema.

All workspace users receive the USE CATALOG privilege on the workspace catalog. Workspace users also receive the USE SCHEMA, CREATE TABLE, CREATE VOLUME, CREATE MODEL, CREATE FUNCTION, and CREATE MATERIALIZED VIEW privileges on the default schema in the catalog.

If your workspace was enabled for Unity Catalog manually, it has a main catalog provisioned automatically.

Workspace users have the USE CATALOG privilege on the main catalog, which doesn’t grant the ability to create or select from any objects in the catalog, but is a prerequisite for working with any objects in the catalog. The user who created the metastore owns the main catalog by default and can both transfer ownership and grant access to other users.

If metastore storage is added after the metastore is created, no main catalog is provisioned.

Other workspaces have no catalogs created by default and no non-admin user privileges enabled by default. A workspace admin must create the first catalog and grant users access to it and the objects in it.

25
Q

What are the default privileges assigned to admins in a workspace enabled for Unity Catalog?

A

Some workspaces have default workspace admin privileges upon launch:

If your workspace was enabled for Unity Catalog automatically:

Workspace admins can create new catalogs and objects in new catalogs, and grant access to them.

There is no metastore admin by default.

Workspace admins own the workspace catalog (if there is one) and can grant access to that catalog and any objects in that catalog.

If your workspace was enabled for Unity Catalog manually:

Workspace admins have no special Unity Catalog privileges by default.

Metastore admins must exist and can create any Unity Catalog object and can take ownership of any Unity Catalog object.

26
Q

How can you retain the Hive metastore after enabling your workspace for Unity Catalog?

A

If your workspace has a Hive metastore that contains data that you want to continue to use, and you choose not to follow the recommendation to upgrade the tables managed by the Hive metastore to the Unity Catalog metastore, you can continue to work with data in the Hive metastore alongside data in the Unity Catalog metastore.

The Hive metastore is represented in Unity Catalog interfaces as a catalog named hive_metastore. In order to continue working with data in your Hive metastore without having to update queries to specify the hive_metastore catalog, you can set the workspace’s default catalog to hive_metastore.
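
The effect of the default catalog on unqualified names can be sketched as follows. The `qualify` function is a simplified model of name resolution written for this card, and the table `sales.orders` is hypothetical.

```python
def qualify(name: str, default_catalog: str, default_schema: str = "default") -> str:
    """Expand a one- or two-part name into catalog.schema.object form."""
    parts = name.split(".")
    if len(parts) == 3:
        return name  # already fully qualified
    if len(parts) == 2:
        return f"{default_catalog}.{name}"  # schema.object
    if len(parts) == 1:
        return f"{default_catalog}.{default_schema}.{name}"  # bare object name
    raise ValueError(f"unexpected name: {name!r}")


# A legacy query refers to sales.orders without naming a catalog.
print(qualify("sales.orders", default_catalog="hive_metastore"))
print(qualify("sales.orders", default_catalog="main"))
```

With `hive_metastore` as the default, existing two-part names keep pointing at the Hive metastore tables, while Unity Catalog objects can still be reached through their full three-part names.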

27
Q

How are privileges inherited in Unity Catalog?

A

Securable objects in Unity Catalog are hierarchical, and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema. For example, if you give a user the SELECT privilege on a catalog, then that user will be able to select (read) all tables and views in that catalog.
Privileges that are granted on a Unity Catalog metastore are not inherited.

Owners of an object are automatically granted all privileges on that object. In addition, object owners can grant privileges on the object itself and on all of its child objects.

This means that owners of a schema do not automatically have all privileges on the tables in the schema, but they can grant themselves privileges on the tables in the schema.
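
A toy model can illustrate the downward inheritance described above, including the rule that metastore-level grants are not inherited. The hierarchy, principals, and grants below are hypothetical, and the sketch deliberately ignores ownership and any prerequisite privileges such as USE CATALOG and USE SCHEMA.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical hierarchy: child -> parent. The metastore is the root.
parents: Dict[str, Optional[str]] = {
    "metastore": None,
    "sales": "metastore",            # catalog
    "sales.eu": "sales",             # schema
    "sales.eu.orders": "sales.eu",   # table
}

# Hypothetical direct grants: (principal, object) -> privileges.
grants: Dict[Tuple[str, str], Set[str]] = {
    ("alice", "sales"): {"SELECT"},
    ("bob", "metastore"): {"CREATE CATALOG"},
}


def has_privilege(principal: str, privilege: str, obj: str) -> bool:
    """Look for the privilege on the object or an ancestor, stopping at the catalog."""
    current: Optional[str] = obj
    while current is not None:
        # Metastore grants apply to the metastore itself, never to its children.
        if current == "metastore" and obj != "metastore":
            return False
        if privilege in grants.get((principal, current), set()):
            return True
        current = parents[current]
    return False


print(has_privilege("alice", "SELECT", "sales.eu.orders"))
print(has_privilege("bob", "CREATE CATALOG", "sales"))
```

Here the catalog-level SELECT reaches every table under `sales`, including tables created later, whereas the metastore-level grant stays on the metastore.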

28
Q

Who is the initial metastore admin when a workspace is created?

A

If an account admin creates the metastore manually, that account admin is the metastore’s initial owner and metastore admin. All metastores created before November 8, 2023 were created manually by an account admin.

If the metastore was provisioned as part of automatic Unity Catalog enablement, the metastore was created without a metastore admin.

Workspace admins in that case are automatically granted privileges that make the metastore admin optional.

If needed, account admins can assign the metastore admin role to a user, service principal, or group. Groups are strongly recommended.

29
Q

Can the owner of a view, function, or model transfer ownership of that object to anyone?

A

To prevent privilege escalations, only a metastore admin can transfer ownership of a view, function, or model to any user, service principal, or group in the account.

Current owners are restricted to transferring ownership to their own username or to a group that they are a member of.
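
The transfer rule for views, functions, and models can be written as a short check. The function and the principal names are hypothetical; the sketch assumes the caller is either a metastore admin or the current owner of the object.

```python
from typing import Set


def can_transfer_ownership(
    actor: str,
    actor_groups: Set[str],
    new_owner: str,
    is_metastore_admin: bool,
) -> bool:
    """Metastore admins may pick any new owner; owners only themselves or their groups."""
    if is_metastore_admin:
        return True
    return new_owner == actor or new_owner in actor_groups


# Hypothetical owner who belongs to one group.
print(can_transfer_ownership("dana", {"data-eng"}, "data-eng", is_metastore_admin=False))
print(can_transfer_ownership("dana", {"data-eng"}, "finance", is_metastore_admin=False))
```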
