[WIP] Missing examtopics Qs Flashcards

The examtopics questions that were not in Martin's .html but were in the examtopics pdf, excluding those that repeat the previous set.

1
Q

91) Which data lakehouse feature results in improved data quality over a traditional data lake?

A. A data lakehouse stores data in open formats.
B. A data lakehouse allows the use of SQL queries to examine data.
C. A data lakehouse provides storage solutions for structured and unstructured data.
D. A data lakehouse supports ACID-compliant transactions

A

D. A data lakehouse supports ACID-compliant transactions

The other answers are unrelated to maintaining data quality.

To expand on the use of ACID in lakehouses, quoting from the Databricks Certified Data Engineer Associate Study Guide:
> While data lakes are excellent solutions for storing massive volumes of diverse data, they often encounter several challenges related to data inconsistency, and performance issues.
> The primary factor behind these limitations is the absence of ACID transactions support in the lake. ACID, an acronym for Atomicity, Consistency, Isolation, and Durability, represents fundamental rules that ensure operations on the data are reliably executed. This absence made it difficult to ensure data integrity, leading to issues like partially committed data or failed transactions.

> What makes Delta Lake an innovative solution is its ability to overcome such challenges posed by traditional data lakes. Delta Lake provides ACID transaction guarantees for data manipulation operations in the lake. It offers transactional capabilities that enable performing data operations in an atomic and consistent manner. This ensures that there is no partially committed data; either all operations within a transaction are completed successfully, or none of them are. These capabilities allow you to build reliable data lakes that ensure data integrity, consistency, and durability.

Source: Databricks Certified Data Engineer Associate Study Guide, chapter 1, page 6
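The "no partially committed data" guarantee described above (atomicity, the "A" in ACID) can be illustrated with a minimal sqlite3 sketch. This is not Delta Lake, just the general transaction concept; the table and values are made up for illustration:

```python
import sqlite3

# Illustrative table; the schema and rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

try:
    with conn:  # opens a transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO events (id, payload) VALUES (1, 'ok')")
        # This second insert violates the primary key and aborts the transaction:
        conn.execute("INSERT INTO events (id, payload) VALUES (1, 'dup')")
except sqlite3.IntegrityError:
    pass

# All-or-nothing: the first insert was rolled back along with the failed one.
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 0
```

Without the transaction, the first row would have survived the failure, leaving exactly the kind of partially committed data the study guide describes.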

2
Q

92) In which scenario will a data team want to utilize cluster pools?

A. An automated report needs to be version-controlled by all stakeholders
B. An automated report needs to be runnable by all stakeholders
C. An automated report needs to be refreshed as quickly as possible
D. An automated report needs to be made reproducible

A

C. An automated report needs to be refreshed as quickly as possible

By elimination: cluster pools are sets of warmed-up (ready-to-use) instances, and they serve to reduce cluster startup and auto-scaling times. Whether or not cluster pools are used therefore has nothing to do with version control (A), with who can run the report (B), or with reproducibility (D).

In the Databricks documentation, as of its latest update (2024-10-29), serverless compute is recommended as a better alternative in terms of startup times and scalability; keep in mind, though, that it has different restrictions than cluster pools. Copying the note from the Databricks docs:

> Note
> If your workload supports serverless compute, Databricks recommends using serverless compute instead of pools to take advantage of always-on, scalable compute. See Connect to serverless compute.

Source: https://docs.databricks.com/en/compute/pool-index.html

3
Q

93) What is hosted completely in the control plane of the classic Databricks architecture?

A. Worker node
B. Databricks web application
C. Driver node
D. Databricks Filesystem

A

B. Databricks web application

Quoting from the documentation (https://docs.databricks.com/en/getting-started/overview.html):
> The control plane includes the backend services that Databricks manages in your Databricks account. The web application is in the control plane.

The cluster worker/driver nodes handle computation; everything related to compute lives in the compute plane. The Databricks Filesystem (DBFS) is described in the documentation as part of the workspace storage bucket (an S3 bucket + prefix associated with the workspace for this purpose).

Source: https://docs.databricks.com/en/getting-started/overview.html

4
Q

94) A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their project using Databricks Repos.

Which advantage of Databricks Repos justifies choosing it over the built-in notebook versioning?

A. Databricks Repos allows users to revert to previous versions of a notebook
B. Databricks Repos is wholly housed within the Databricks Data Intelligence Platform
C. Databricks Repos provides the ability to comment on specific changes
D. Databricks Repos supports the use of multiple branches

A

D. Databricks Repos supports the use of multiple branches

Databricks Notebooks have built-in versioning and allow commenting^1. Moreover, they work natively with the Databricks Data Intelligence Platform^2. Nothing indicates that they provide branch functionality (Databricks Repos does).

Sources:
1 https://docs.databricks.com/en/notebooks/index.html
2 https://www.databricks.com/product/collaborative-notebooks

5
Q

95) What is a benefit of the Databricks Lakehouse Architecture embracing open source technologies?

A. Avoiding vendor lock-in
B. Simplified governance
C. Ability to scale workloads
D. Cloud-specific integrations

A

A. Avoiding vendor lock-in

Open source technologies can be studied and maintained by anyone rather than by a single company, so they naturally work against vendor lock-in.
The other options are unrelated to whether open-source technologies are used.

6
Q

96) A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which location can the data engineer review their permissions on the table?

A. Jobs
B. Dashboards
C. Catalog Explorer
D. Repos

A

C. Catalog Explorer

In Catalog Explorer, the Permissions tab shows the permissions granted on the table. Page 15 of the Databricks Certified Data Engineer Associate Study Guide shows a screenshot of Catalog Explorer.

Note: Catalog Explorer is the newer name for what was previously called Data Explorer; they are the same tool. At 4:38 of video 6.36 you can see the Data Explorer being used to grant and review permissions.

7
Q

97) A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.

Which Git operation does the data engineer need to run to accomplish this task?

A. Clone
B. Pull
C. Merge
D. Push

A

B. Pull

A pull fetches the new commits from the central repository and merges them into the local copy (fetch + merge). Clone would create a whole new copy, merge alone does not contact the remote, and push goes in the opposite direction (local to remote).
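The scenario can be reproduced with plain git in a throwaway directory. This is a sketch of the question's setup, not of Databricks Repos itself; the repo names and commit messages are made up:

```shell
#!/bin/sh
# Shared "central" repo, a colleague who pushes changes, and an engineer
# whose clone falls behind and syncs with `git pull`.
set -e
work=$(mktemp -d)
cd "$work"

git init -q --bare --initial-branch=main central.git

# Colleague clones, commits, and pushes the first change.
git clone -q central.git colleague
git -C colleague -c user.email=c@example.com -c user.name=colleague \
    commit -q --allow-empty -m "first change"
git -C colleague push -q origin HEAD:main

# Engineer clones now, so their copy includes only "first change".
git clone -q central.git engineer

# Meanwhile the colleague pushes another change to the central repo...
git -C colleague -c user.email=c@example.com -c user.name=colleague \
    commit -q --allow-empty -m "newer change"
git -C colleague push -q origin HEAD:main

# ...so the engineer syncs with pull (= fetch + merge).
git -C engineer pull -q origin main
git -C engineer log -1 --pretty=%s
```

After the pull, the engineer's log shows "newer change" at the tip, which is exactly what the Databricks Repos "Pull" button does against the linked remote.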

8
Q

98) Which file format is used for storing Delta Lake Tables?

A. CSV
B. Parquet
C. JSON
D. Delta

A

B. Parquet

Quoting from the Databricks documentation:
> Delta Lake is the optimized storage layer that provides the foundation for tables in a lakehouse on Databricks. Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

As a note, you can load data from CSV, Parquet, JSON, and other formats, but internally the data ends up stored as Parquet files, which is what the question is getting at.

Source: https://docs.databricks.com/en/delta/index.html

9
Q

99) A data architect has determined that a table of the following format is necessary:

| employeeId | startDate  | avgRating |
| ---------- | -----------| --------- |
| a1         | 2009-01-06 | 5.5       |
| a2         | 2018-11-21 | 7.1       |

Which SQL DDL code block creates an empty Delta table in the above format, regardless of whether a table with this name already exists?

A. CREATE OR REPLACE TABLE table_name (employeeid STRING, startDate DATE, avgRating FLOAT)
B. CREATE OR REPLACE TABLE table_name WITH COLUMNS (employeeId STRING, startDate DATE, avgRating FLOAT) USING DELTA
C. CREATE TABLE IF NOT EXISTS table_name (employeeId STRING, startDate DATE, avgRating FLOAT)
D. CREATE TABLE table_name AS SELECT employeeId STRING, startDate DATE, avgRating FLOAT

A

A. CREATE OR REPLACE TABLE table_name (employeeid STRING, startDate DATE, avgRating FLOAT)

We need:
- CREATE OR REPLACE to replace any existing table (IF NOT EXISTS does not execute the command if the table already exists, and a bare CREATE TABLE fails in that case)
- by syntax, what follows table_name must be the schema in parentheses
- USING DELTA at the end is optional, since Delta is the default format. Answer B, however, puts WITH COLUMNS before specifying the schema, which is not valid syntax.

Source: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html
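Spelled out, the correct statement looks like this (table_name is the question's placeholder; this is a sketch of the Databricks SQL syntax, not runnable outside a Databricks workspace):

```sql
-- Replaces any existing table with this name; creates it if absent.
CREATE OR REPLACE TABLE table_name (
  employeeId STRING,
  startDate  DATE,
  avgRating  FLOAT
)
USING DELTA;  -- optional: DELTA is already the default format on Databricks
```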
