ExamTopics Flashcards

1
Q

Question #5

Which of the following describes the storage organization of a Delta table?

A. Delta tables are stored in a single file that contains data, history, metadata, and other attributes.
B. Delta tables store their data in a single file and all metadata in a collection of files in a separate location.
C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.
D. Delta tables are stored in a collection of files that contain only the data stored within the table.
E. Delta tables are stored in a single file that contains only the data stored within the table.

A

C. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.

2
Q

Which of the following Git operations must be performed outside of Databricks Repos?

A. Commit
B. Pull
C. Push
D. Clone
E. Merge

A

Outdated: all of these operations are now possible from within Databricks Repos. In earlier versions of the product, E (Merge) was the accepted answer.

According to https://learn.microsoft.com/en-us/azure/databricks/repos/:

In Databricks Repos, you can use Git functionality to:

  • Clone, push to, and pull from a remote Git repository.
  • Create and manage branches for development work, including merging, rebasing, and resolving conflicts.
  • Create notebooks—including IPYNB notebooks—and edit them and other files.
  • Visually compare differences upon commit and resolve merge conflicts.
3
Q

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:

(spark.table("sales")
  .withColumn("avg_price", col("sales") / col("units"))
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .outputMode("complete")
  .____
  .table("new_sales")
)

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

A. trigger("5 seconds")
B. trigger()
C. trigger(once="5 seconds")
D. trigger(processingTime="5 seconds")
E. trigger(continuous="5 seconds")

A

D. trigger(processingTime="5 seconds")
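
For context, a hedged sketch of what the completed streaming write might look like in a notebook. It assumes a running Spark session, an existing Delta table named sales, and a hypothetical checkpoint path; append mode is used in this sketch because the projection has no aggregation (complete mode requires one).

from pyspark.sql.functions import col

checkpointPath = "/tmp/checkpoints/new_sales"  # hypothetical checkpoint location

(spark.readStream.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .trigger(processingTime="5 seconds")  # fire one micro-batch every 5 seconds
    .table("new_sales")
)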

4
Q

A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW
What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
B. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
E. Records that violate the expectation cause the job to fail.

A

C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
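
As a rough illustration, the same drop-on-violation behavior can be expressed with the Python Delta Live Tables API; this is a hedged sketch in which the source table events_raw and the target name are hypothetical.

import dlt

@dlt.table(name="events_clean")
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")
def events_clean():
    # rows that fail the expectation are dropped and counted in the pipeline event log
    return dlt.read_stream("events_raw")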

5
Q

Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?

A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
C. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.
D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.
E. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.

A

B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.

Bing Copilot:

The correct answer is B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally. According to the documentation, streaming tables allow you to process a growing dataset, handling each row only once. Streaming tables are optimal for pipelines that require data freshness and low latency. Streaming tables can also be useful for massive scale transformations, as results can be incrementally calculated as new data arrives, keeping results up to date without needing to fully recompute all source data with each update.

The other options are incorrect for the following reasons:

A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static. This is not true, as streaming tables can be followed by either static or streaming steps in the pipeline. The choice of streaming or static depends on the nature of the data source and the desired latency of the results.
C. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used. This is not true, as streaming tables are a distinct type of dataset in DLT that have different processing semantics than static tables. Streaming tables are useful for handling append-only sources that need to be processed in real-time.
D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations. This is not true, as streaming tables are not suitable for data sources that have updates, deletions, or aggregations. For these cases, materialized views should be used instead, as they can handle complex transformations and maintain data consistency.
E. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static. This is not true, as streaming tables can be preceded by either static or streaming steps in the pipeline. The choice of streaming or static depends on the nature of the data source and the desired latency of the results.

6
Q

A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?

A. Unity Catalog
B. Delta Lake
C. Databricks SQL
D. Data Explorer
E. Auto Loader

A

E. Auto Loader
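
A hedged sketch of pointing Auto Loader at the shared directory so each run ingests only files it has not seen before; the paths and target table name are hypothetical.

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")  # where the inferred schema is tracked
    .load("/mnt/shared/incoming/")  # source files stay in place; only new ones are read
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")  # records which files were already processed
    .trigger(availableNow=True)  # process everything new, then stop
    .table("orders_bronze")
)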

7
Q

Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?

A
(spark.readStream.load(rawSalesLocation)
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .table("newSales")
)

B
(spark.read.load(rawSalesLocation)
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .table("newSales")
)

C
(spark.table("sales")
  .withColumn("avgPrice", col("sales") / col("units"))
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .table("newSales")
)

D
(spark.table("sales")
  .filter(col("units") > 0)
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .table("newSales")
)

E
(spark.table("sales")
  .groupBy("store")
  .agg(sum("sales"))
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .outputMode("complete")
  .table("newSales")
)

A

E, because it performs an aggregation and rewrites the whole result (complete output mode) rather than just appending, which is the kind of business-level summary that belongs in a Gold table.

8
Q

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.
Which of the following approaches can the data engineer take to identify the table that is dropping the records?

A. They can set up separate expectations for each table when developing their DLT pipeline.
B. They cannot determine which table is dropping the records.
C. They can set up DLT to notify them via email when records are dropped.
D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
E. They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.

A

D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

To identify the table in a Delta Live Tables (DLT) pipeline where data is being dropped due to quality concerns, the data engineer can navigate to the DLT pipeline page, click on each table in the pipeline, and view the data quality statistics. These statistics often include information about records dropped, violations of expectations, and other data quality metrics. By examining the data quality statistics for each table in the pipeline, the data engineer can determine at which table the data is being dropped.

9
Q

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which of the following approaches can the data engineer use to set up the new task?

A. They can clone the existing task in the existing Job and update it to run the new notebook.
B. They can create a new task in the existing Job and then add it as a dependency of the original task.
C. They can create a new task in the existing Job and then add the original task as a dependency of the new task.
D. They can create a new job from scratch and add both tasks to run concurrently.
E. They can clone the existing task to a new Job and then edit it to run the new notebook.

A

B. They can create a new task in the existing Job and then add it as a dependency of the original task.

Making the new task a dependency of the original task means that the new task must run before the original task.

10
Q

An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For the first week following the project’s release, the manager wants the query results to be updated every minute. However, the manager is concerned that the compute resources used for the query will be left running and cost the organization a lot of money beyond the first week of the project’s release.
Which of the following approaches can the engineering team use to ensure the query does not cost the organization any money beyond the first week of the project’s release?

A. They can set a limit to the number of DBUs that are consumed by the SQL Endpoint.
B. They can set the query’s refresh schedule to end after a certain number of refreshes.
C. They cannot ensure the query does not cost the organization money beyond the first week of the project’s release.
D. They can set a limit to the number of individuals that are able to manage the query’s refresh schedule.
E. They can set the query’s refresh schedule to end on a certain date in the query scheduler.

A

C. They cannot ensure the query does not cost the organization money beyond the first week of the project’s release.

11
Q

A data analysis team has noticed that their Databricks SQL queries are running too slowly when connected to their always-on SQL endpoint. They claim that this issue is present when many members of the team are running small queries simultaneously. They ask the data engineering team for help. The data engineering team notices that each of the team’s queries uses the same SQL endpoint.
Which of the following approaches can the data engineering team use to improve the latency of the team’s queries?

A. They can increase the cluster size of the SQL endpoint.
B. They can increase the maximum bound of the SQL endpoint’s scaling range.
C. They can turn on the Auto Stop feature for the SQL endpoint.
D. They can turn on the Serverless feature for the SQL endpoint.
E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to “Reliability Optimized.”

A

A. They can increase the cluster size of the SQL endpoint.

(This is similar to question 40 in the official exam dump.)

12
Q

A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

A. They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints.
B. They can set up the dashboard’s SQL endpoint to be serverless.
C. They can turn on the Auto Stop feature for the SQL endpoint.
D. They can reduce the cluster size of the SQL endpoint.
E. They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint.

A

C

Bing Copilot agrees:

The correct answer is C. They can turn on the Auto Stop feature for the SQL endpoint.

According to the documentation, the Auto Stop feature determines whether the SQL endpoint stops if it’s idle for the specified number of minutes. Idle SQL endpoints continue to accumulate DBU and cloud instance charges until they are stopped. The default is 120 minutes; set it to 0 to disable the auto stop.

The other options are not valid because:

A. Ensuring the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints will not affect the running time of the SQL endpoint, but only ensure consistency and avoid errors.
B. Setting up the dashboard’s SQL endpoint to be serverless will not minimize the running time of the SQL endpoint, but only enable it to scale up and down automatically based on demand.
D. Reducing the cluster size of the SQL endpoint will not minimize the running time of the SQL endpoint, but only affect its performance and capacity.
E. Ensuring the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint will not minimize the running time of the SQL endpoint, but only create unnecessary complexity and confusion.

“Warehouse settings

[…]

  • Auto Stop determines whether the warehouse stops if it’s idle for the specified number of minutes. Idle SQL warehouses continue to accumulate DBU and cloud instance charges until they are stopped.

– Pro and classic SQL warehouses: The default is 45 minutes, which is recommended for typical use. The minimum is 10 minutes.

– Serverless SQL warehouses: The default is 10 minutes, which is recommended for typical use. The minimum is 5 minutes when you use the UI. Note that you can create a serverless SQL warehouse using the SQL warehouses API, in which case you can set the Auto Stop value as low as 1 minute.”

https://docs.databricks.com/en/sql/admin/create-sql-warehouse.html

13
Q

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has its Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100.
Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?

A. They can set up an Alert with a custom template.
B. They can set up an Alert with a new email alert destination.
C. They can set up an Alert with a new webhook alert destination.
D. They can set up an Alert with one-time notifications.
E. They can set up an Alert without notifications.

A

C. They can set up an Alert with a new webhook alert destination.

14
Q

A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job’s current run. The data engineer asks a tech lead for help in identifying why this might be the case.
Which of the following approaches can the tech lead use to identify why the notebook is running slowly as part of the Job?

A. They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.
B. They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.
C. They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.
D. There is no way to determine why a Job task is running slowly.
E. They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.

A

C. They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.

15
Q

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.
Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?

A. They can use endpoints available in Databricks SQL
B. They can use jobs clusters instead of all-purpose clusters
C. They can configure the clusters to be single-node
D. They can use clusters that are from a cluster pool
E. They can configure the clusters to autoscale for larger data sizes

A

D. They can use clusters that are from a cluster pool

Bing Copilot:

The correct answer is D. They can use clusters that are from a cluster pool.

According to the documentation, Databricks pools are a set of idle, ready-to-use instances. When cluster nodes are created using the idle instances, cluster start and auto-scaling times are reduced. If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to accommodate the cluster’s request.

The other options are not valid because:

A. Using endpoints available in Databricks SQL will not improve the start up time for the clusters used for the Job, but only enable the data engineer to run SQL queries and visualize results in a web-based interface.
B. Using jobs clusters instead of all-purpose clusters will not improve the start up time for the clusters used for the Job, but only allow the data engineer to run scheduled or triggered tasks on isolated clusters.
C. Configuring the clusters to be single-node will not improve the start up time for the clusters used for the Job, but only limit the scalability and performance of the clusters.
E. Configuring the clusters to autoscale for larger data sizes will not improve the start up time for the clusters used for the Job, but only enable the clusters to dynamically add or remove nodes based on the workload.

16
Q

A new data engineering team, with group name team, has been assigned to an ELT project. The new data engineering team will need full privileges on the database customers to fully manage the project.
Which of the following commands can be used to grant full permissions on the database to the new data engineering team?

A. GRANT USAGE ON DATABASE customers TO team;
B. GRANT ALL PRIVILEGES ON DATABASE team TO customers;
C. GRANT SELECT PRIVILEGES ON DATABASE customers TO teams;
D. GRANT SELECT CREATE MODIFY USAGE PRIVILEGES ON DATABASE customers TO team;
E. GRANT ALL PRIVILEGES ON DATABASE customers TO team;

A

E. GRANT ALL PRIVILEGES ON DATABASE customers TO team;

17
Q

A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group team.
Which of the following commands can be used to grant the necessary permission on the entire database to the new team?

A. GRANT VIEW ON CATALOG customers TO team;
B. GRANT CREATE ON DATABASE customers TO team;
C. GRANT USAGE ON CATALOG team TO customers;
D. GRANT CREATE ON DATABASE team TO customers;
E. GRANT USAGE ON DATABASE customers TO team;

A

E. GRANT USAGE ON DATABASE customers TO team;

Bing Copilot:

The correct answer is E. GRANT USAGE ON DATABASE customers TO team;

According to the documentation, the USAGE privilege allows a principal to access an object. For example, to access a table, the principal must have the USAGE privilege on the database and schema that contain the table, as well as the SELECT privilege on the table.

The other options are not valid because:

A. GRANT VIEW ON CATALOG customers TO team; is not a valid syntax, as there is no VIEW privilege or CATALOG object in Databricks SQL.
B. GRANT CREATE ON DATABASE customers TO team; will grant the team the privilege to create new objects in the customers database, but not to see what tables already exist.
C. GRANT USAGE ON CATALOG team TO customers; is not a valid syntax, as there is no CATALOG object in Databricks SQL. Moreover, this command would grant the customers database the privilege to access the team group, which is not the intended outcome.
D. GRANT CREATE ON DATABASE team TO customers; will grant the customers database the privilege to create new objects in the team group, which is not a valid securable object or the intended outcome.
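
A brief, hedged sketch of the USAGE-plus-SELECT pattern described above, run from a notebook with spark.sql; the customers.orders table is hypothetical.

spark.sql("GRANT USAGE ON DATABASE customers TO `team`")
spark.sql("GRANT SELECT ON TABLE customers.orders TO `team`")  # hypothetical table, shown to illustrate the pairing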

18
Q

A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.

Which of the following Git operations does the data engineer need to run to accomplish this task?

A. Merge
B. Push
C. Pull
D. Commit
E. Clone

A

C. Pull

19
Q

Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?

A. Cloud-specific integrations
B. Simplified governance
C. Ability to scale storage
D. Ability to scale workloads
E. Avoiding vendor lock-in

A

E. Avoiding vendor lock-in

20
Q

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which of the following locations can the data engineer review their permissions on the table?

A. Databricks Filesystem
B. Jobs
C. Dashboards
D. Repos
E. Data Explorer

A

E. Data Explorer

21
Q

Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

A. When they are working interactively with a small amount of data
B. When they are running automated reports to be refreshed as quickly as possible
C. When they are working with SQL within Databricks SQL
D. When they are concerned about the ability to automatically scale with larger data
E. When they are manually running reports with a large amount of data

A

A. When they are working interactively with a small amount of data

22
Q

A data engineer has been given a new record of data:

id STRING = 'a1'
rank INTEGER = 6
rating FLOAT = 9.4

Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?

A. INSERT INTO my_table VALUES ('a1', 6, 9.4)
B. my_table UNION VALUES ('a1', 6, 9.4)
C. INSERT VALUES ('a1', 6, 9.4) INTO my_table
D. UPDATE my_table VALUES ('a1', 6, 9.4)
E. UPDATE VALUES ('a1', 6, 9.4) my_table

A

A. INSERT INTO my_table VALUES ('a1', 6, 9.4)

23
Q

A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.

Which of the following keywords can be used to compact the small files?

A. REDUCE
B. OPTIMIZE
C. COMPACTION
D. REPARTITION
E. VACUUM

A

B. OPTIMIZE
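
A quick, hedged example of running the compaction from a notebook; the table name is hypothetical, and the ZORDER clause is shown only as a common companion option.

spark.sql("OPTIMIZE my_table")
spark.sql("OPTIMIZE my_table ZORDER BY (customer_id)")  # optional: also co-locate data by a column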

24
Q

In which of the following file formats is data from Delta Lake tables primarily stored?

A. Delta
B. CSV
C. Parquet
D. JSON
E. A proprietary, optimized format specific to Databricks

A

C. Parquet

25
Q

Which of the following is stored in the Databricks customer’s cloud account?

A. Databricks web application
B. Cluster management metadata
C. Repos
D. Data
E. Notebooks

A

D. Data

26
Q

Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?

A. None of these
B. Data lake
C. Data warehouse
D. All of these
E. Data lakehouse

A

E. Data lakehouse

27
Q

A data architect has determined that a table of the following format is necessary:

|employeeId|startDate |avgRating|
|1 |2021-01-01|5.5 |

Which of the following code blocks uses SQL DDL commands to create an empty Delta table in the above format regardless of whether a table already exists with this name?

A
CREATE TABLE IF NOT EXISTS table_name (
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT
)

B
CREATE OR REPLACE TABLE table_name AS
SELECT
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT
USING DELTA

C
CREATE OR REPLACE TABLE table_name WITH COLUMNS (
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT
) USING DELTA

D
CREATE TABLE table_name AS
SELECT
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT

E
CREATE OR REPLACE TABLE table_name (
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT
)

A

E
CREATE OR REPLACE TABLE table_name (
  employeeId STRING,
  startDate DATE,
  avgRating FLOAT
)

28
Q

A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.

Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?

A. It is not possible to use SQL in a Python notebook
B. They can attach the cell to a SQL endpoint rather than a Databricks cluster
C. They can simply write SQL syntax in the cell
D. They can add %sql to the first line of the cell
E. They can change the default language of the notebook to SQL

A

D. They can add %sql to the first line of the cell
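
For illustration, the single SQL cell in an otherwise-Python notebook might look like the sketch below; my_table is hypothetical.

%sql
-- this cell runs as SQL; every other cell keeps the notebook's default Python
SELECT count(*) AS row_count FROM my_table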

29
Q

Which of the following SQL keywords can be used to convert a table from a long format to a wide format?

A. TRANSFORM
B. PIVOT
C. SUM
D. CONVERT
E. WHERE

A

B. PIVOT

Bing Copilot:

The correct answer is B. PIVOT.

According to the web search results, the SQL Server PIVOT operator can be used to rotate a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output. This can be useful for converting a table from a long format to a wide format, where each row has a single value for each variable, to a format where each variable has its own column and multiple values per row.

The other options are not valid because:

A. TRANSFORM is not a valid SQL keyword in SQL Server.
C. SUM is an aggregate function that returns the sum of values in a group, not a keyword that can change the table format.
D. CONVERT is a function that converts an expression of one data type to another, not a keyword that can change the table format.
E. WHERE is a clause that filters the rows based on a condition, not a keyword that can change the table format.
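
A hedged sketch of long-to-wide reshaping with PIVOT in Spark SQL, submitted via spark.sql; the sales_long table and its columns (store_id, quarter, revenue) are hypothetical.

wide_df = spark.sql("""
    SELECT *
    FROM sales_long
    PIVOT (
        SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
    )
""")
wide_df.show()  # one row per store_id, one column per quarter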

30
Q

Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?

A. Parquet files can be partitioned
B. CREATE TABLE AS SELECT statements cannot be used on files
C. Parquet files have a well-defined schema
D. Parquet files have the ability to be optimized
E. Parquet files will become Delta tables

A

D. Parquet files have the ability to be optimized

Bing Copilot:

The correct answer is D. Parquet files have the ability to be optimized.

According to the web search results, Parquet is a columnar storage format that is designed to bring efficient compression and encoding schemes, as well as support complex nested data structures. Parquet files can be optimized by using the OPTIMIZE command, which improves the performance of subsequent read operations by coalescing small files into larger ones, and sorting the data within each file by a specified column.

The other options are not valid because:

A. Parquet files can be partitioned, but so can CSV files. Partitioning is a technique that divides a table into smaller subsets based on the values of a partition column, such as date or region. Partitioning can improve query performance and reduce costs by scanning only the relevant subsets of data.

B. CREATE TABLE AS SELECT statements can be used on files, as long as the files are registered as external tables in Databricks SQL. An external table is a table that references data stored in an external system, such as Amazon S3 or Azure Data Lake Storage.

C. Parquet files have a well-defined schema, but so do CSV files. A schema is a description of the structure and types of data in a table or file. Parquet files are self-describing, meaning they embed the schema within the data itself. CSV files can also have a schema, either inferred from the data or specified by the user.

E. Parquet files will not become Delta tables unless they are explicitly converted to Delta format. Delta tables are tables that store data in Delta Lake, which is an open-source storage layer that brings ACID transactions, time travel, and schema enforcement to big data workloads. Delta tables can be created from existing Parquet files by using the CONVERT TO DELTA command.

31
Q

A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.

Which of the following relational objects should the data engineer create?

A. Spark SQL Table
B. View
C. Database
D. Temporary view
E. Delta Table

A

D. Temporary view

32
Q

A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.

Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?

A. SELECT * FROM sales
B. spark.delta.table
C. spark.sql
D. There is no way to share data between PySpark and SQL.
E. spark.table

A

C. spark.sql
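
A minimal sketch of wrapping the analyst's SQL with spark.sql and testing the result in PySpark; the query and column names are hypothetical.

result_df = spark.sql("SELECT customer_id, spend FROM sales WHERE spend IS NOT NULL")

# simple Python-side data quality checks on the returned DataFrame
assert result_df.filter("spend < 0").count() == 0
assert result_df.select("customer_id").distinct().count() == result_df.count()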

33
Q

Which of the following commands will return the number of null values in the member_id column?

A. SELECT count(member_id) FROM my_table;
B. SELECT count(member_id) - count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
D. SELECT null(member_id) FROM my_table;
E. SELECT count_null(member_id) FROM my_table;

A

C. SELECT count_if(member_id IS NULL) FROM my_table;

34
Q

A data engineer needs to apply custom logic to identify employees with more than 5 years of experience in array column employees in table stores. The custom logic should create a new column exp_employees that is an array of all of the employees with more than 5 years of experience for each row. In order to apply this custom logic at scale, the data engineer wants to use the FILTER higher-order function.

Which of the following code blocks successfully completes this task?

A.
SELECT
  store_id,
  employees,
  FILTER(employees, i -> i.years_exp > 5) AS exp_employees
FROM stores;

B.
SELECT
  store_id,
  employees,
  FILTER(exp_employees, years_exp > 5) AS exp_employees
FROM stores;

C.
SELECT
  store_id,
  employees,
  FILTER(employees, years_exp > 5) AS exp_employees
FROM stores;

D.
SELECT
  store_id,
  employees,
  CASE WHEN employees.years_exp > 5 THEN employees
    ELSE NULL
  END AS exp_employees
FROM stores;

E.
SELECT
  store_id,
  employees,
  FILTER(exp_employees, i -> i.years_exp > 5) AS exp_employees
FROM stores;

A

A.
SELECT
  store_id,
  employees,
  FILTER(employees, i -> i.years_exp > 5) AS exp_employees
FROM stores;

35
Q

A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.

They have the following incomplete code block:

____(f"SELECT customer_id, spend FROM {table_name}")

Which of the following can be used to fill in the blank to successfully complete the task?

A. spark.delta.sql
B. spark.delta.table
C. spark.table
D. dbutils.sql
E. spark.sql

A

E. spark.sql
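
Filled in, the block might read as follows; the value assigned to table_name is hypothetical.

table_name = "customer_spend"  # hypothetical

df = spark.sql(f"SELECT customer_id, spend FROM {table_name}")
df.show()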

36
Q

A data engineer has created a new database using the following command:

CREATE DATABASE IF NOT EXISTS customer360;

In which of the following locations will the customer360 database be located?

A. dbfs:/user/hive/database/customer360
B. dbfs:/user/hive/warehouse
C. dbfs:/user/hive/customer360
D. More information is needed to determine the correct response
E. dbfs:/user/hive/database

A

D. More information is needed to determine the correct response

37
Q

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

A. The table was managed
B. The table’s data was smaller than 10 GB
C. The table’s data was larger than 10 GB
D. The table was external
E. The table did not have a location

A

A. The table was managed

38
Q

In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?

A. When the location of the data needs to be changed
B. When the target table is an external table
C. When the source table can be deleted
D. When the target table cannot contain duplicate records
E. When the source is not a Delta table

A

D. When the target table cannot contain duplicate records
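
A hedged sketch of why MERGE INTO helps here: an upsert keyed on customer_id cannot introduce duplicate keys, whereas a plain INSERT INTO of the same batch could. Table and column names are hypothetical.

spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")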

39
Q

A data engineer is working with two tables. Each of these tables is displayed below in its entirety.

The table sales:

|customer_id|spend|units|
|a1|28.94|7|
|a3|874.12|23|
|a4|8.99|1|

The table favorite_stores:

|customer_id|store_id|
|a1|s1|
|a2|s1|
|a4|s2|

The data engineer runs the following query to join these tables together:

SELECT sales.customer_id, sales.spend, favorite_stores.store_id
FROM sales
LEFT JOIN favorite_stores
ON sales.customer_id = favorite_stores.customer_id;

What is returned?

A

|customer_id|spend|store_id|
|a1|28.94|s1|
|a3|874.12|NULL|
|a4|8.99|s2|

The LEFT JOIN keyword returns all records from the left table, even if there are no matches in the right table.

40
Q

A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.

They run the following command:

CREATE TABLE new_table
___
OPTIONS (
  header = "true",
  delimiter = "|"
)
LOCATION "path/to/csv"

Which of the following lines of code fills in the above blank to successfully complete the task?

A. None of these lines of code are needed to successfully complete the task
B. USING CSV
C. FROM CSV
D. USING DELTA
E. FROM “path/to/csv”

A

B. USING CSV
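
Putting the pieces together, the completed statement might be submitted from a notebook as sketched below; single-quoted option values are used here, and the location string is taken from the question as written.

spark.sql("""
    CREATE TABLE new_table
    USING CSV
    OPTIONS (
        header = 'true',
        delimiter = '|'
    )
    LOCATION 'path/to/csv'
""")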

41
Q

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

(spark.readStream
  .table("sales")
  .withColumn("avg_price", col("sales") / col("units"))
  .writeStream
  .option("checkpointLocation", checkpointPath)
  .outputMode("complete")
  .___
  .table("new_sales")
)

If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?

A. processingTime(1)
B. trigger(availableNow=True)
C. trigger(parallelBatch=True)
D. trigger(processingTime="once")
E. trigger(continuous="once")

A

B. trigger(availableNow=True)

Bing Copilot:

The correct answer is B. trigger(availableNow=True).

This option sets a trigger that processes all available data in multiple batches then terminates the query. This is useful for incremental batch processing, where the query only needs to run on the new data that has arrived since the last execution.

42
Q

A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.

Which of the following describes why Auto Loader inferred all of the columns to be of the string type?

A. There was a type mismatch between the specific schema and the inferred schema
B. JSON data is a text-based format
C. Auto Loader only works with string data
D. All of the fields had at least one null value
E. Auto Loader cannot infer the schema of ingested data

A

B. JSON data is a text-based format

Bing Copilot:

The correct answer is B. JSON data is a text-based format.

By default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don’t encode data types (JSON and CSV), Auto Loader infers all columns as strings (including nested fields in JSON files). This means that even if some fields only contain float or boolean values, they will be treated as strings by Auto Loader.

The other options are incorrect for the following reasons:

A. There was no specific schema provided by the data engineer, so there was no type mismatch between the specific schema and the inferred schema.
C. Auto Loader can work with different data types, but it requires either enabling inferSchema or using cloudFiles.inferColumnTypes with Auto Loader.
D. Having null values in some fields does not affect the type inference of Auto Loader. Null values are handled by the option cloudFiles.rescuedDataColumn.
E. Auto Loader can infer the schema of ingested data, but it uses a conservative approach for JSON and CSV formats to avoid type conflicts.
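
A hedged sketch of the cloudFiles.inferColumnTypes option mentioned above, which makes Auto Loader infer real types for JSON columns instead of defaulting to string; the paths are hypothetical.

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")
    .option("cloudFiles.inferColumnTypes", "true")  # without this, JSON fields are inferred as string
    .load("/mnt/raw/events/")
)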

43
Q

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The table is configured to run in Development mode using the Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

A. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.
C. All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

A

According to ExamTopics:

B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.

I think:

E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

Because the similar question 36 from the official exam dump has D as the given official answer:

Question 36

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The table is configured to run in Development mode using the Triggered Pipeline Mode. Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

A. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional testing.

D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

E. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline.

44
Q

Which of the following data workloads will utilize a Gold table as its source?

A. A job that enriches data by parsing its timestamps into a human-readable format
B. A job that aggregates uncleaned data to create standard summary statistics
C. A job that cleans data by removing malformatted records
D. A job that queries aggregated data designed to feed into a dashboard
E. A job that ingests raw data from a streaming source into the Lakehouse

A

D. A job that queries aggregated data designed to feed into a dashboard

45
Q

Which of the following must be specified when creating a new Delta Live Tables pipeline?

A. A key-value pair configuration
B. The preferred DBU/hour cost
C. A path to cloud storage location for the written data
D. A location of a target database for the written data
E. At least one notebook library to be executed

A

E. At least one notebook library to be executed

Correct. See the Course “Databricks Certified Data Engineer Associate Exam Guide” with Ankit Mistry, video 71.

46
Q

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS
SELECT customer_id
FROM STREAM(LIVE.customers)
WHERE loyalty_level = 'high';

Which of the following describes why the STREAM function is included in the query?

A. The STREAM function is not needed and will cause an error.
B. The table being created is a live table.
C. The customers table is a streaming live table.
D. The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.
E. The data in the customers table has been updated since its last run.

A

C. The customers table is a streaming live table.

47
Q

Which of the following describes the type of workloads that are always compatible with Auto Loader?

A. Streaming workloads
B. Machine learning workloads
C. Serverless workloads
D. Batch workloads
E. Dashboard workloads

A

A. Streaming workloads

48
Q

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

A. None of these changes will need to be made
B. The pipeline will need to stop using the medallion-based multi-hop architecture
C. The pipeline will need to be written entirely in SQL
D. The pipeline will need to use a batch source in place of a streaming source
E. The pipeline will need to be written entirely in Python

A

ExamTopics gives B as the answer, but I suggest A.

Bing Copilot:

The correct answer is A. None of these changes will need to be made.

This is because:

  • Delta Live Tables supports the medallion-based multi-hop architecture, which segments data lake data into raw, bronze, silver, and gold layers. Delta Live Tables can help manage the data quality, schema evolution, and orchestration of each layer.
  • Delta Live Tables allows you to write your data transformations using either SQL or Python. You can mix and match SQL and Python in the same pipeline, depending on your preference and use case.
  • Delta Live Tables can handle both batch and streaming sources. You can use the STREAMING LIVE TABLE keyword to define a streaming table that processes a growing dataset, handling each row only once.

49
Q

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

transactions_df = (spark.read
  .schema(schema)
  .format("delta")
  .table("transactions")
)

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

A. Replace predict with a stream-friendly prediction function
B. Replace schema(schema) with option("maxFilesPerTrigger", 1)
C. Replace "transactions" with the path to the location of the Delta table
D. Replace format("delta") with format("stream")
E. Replace spark.read with spark.readStream

A

E. Replace spark.read with spark.readStream

50
Q

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
B. Records that violate the expectation cause the job to fail.
C. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
E. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

A

B. Records that violate the expectation cause the job to fail.

51
Q

Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?

A. Silver tables contain a less refined, less clean view of data than Bronze data.
B. Silver tables contain aggregates while Bronze data is unaggregated.
C. Silver tables contain more data than Bronze tables.
D. Silver tables contain a more refined and cleaner view of data than Bronze tables.
E. Silver tables contain less data than Bronze tables.

A

D. Silver tables contain a more refined and cleaner view of data than Bronze tables.

52
Q

A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.

Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?

A. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to “Reliability Optimized.”
B. They can turn on the Auto Stop feature for the SQL endpoint.
C. They can increase the cluster size of the SQL endpoint.
D. They can turn on the Serverless feature for the SQL endpoint.
E. They can increase the maximum bound of the SQL endpoint’s scaling range.

A

D. They can turn on the Serverless feature for the SQL endpoint.

Bing Copilot:

The correct answer is D. They can turn on the Serverless feature for the SQL endpoint.

This is because:

The Serverless feature for Databricks SQL enables instant compute and eliminates the need to wait for clusters to start or stop. Serverless SQL endpoints use compute clusters that are managed by Databricks and run in the serverless data plane. This means that the queries can run faster and more efficiently, without wasting time or resources on cluster management.
The other options are incorrect for the following reasons:
A. Changing the Spot Instance Policy to “Reliability Optimized” does not affect the query performance, as it only affects the cluster availability and cost. Moreover, the Serverless feature does not support spot instances, as it uses on-demand instances only.
B. Turning on the Auto Stop feature for the SQL endpoint does not reduce the query time, as it only stops the cluster after a period of inactivity. This can save costs, but it does not improve performance.
C. Increasing the cluster size of the SQL endpoint can improve the query performance, but it does not eliminate the cluster start-up time, which is the main issue in this scenario. Moreover, increasing the cluster size can also increase the costs and complexity of cluster management.
E. Increasing the maximum bound of the SQL endpoint’s scaling range can improve the query performance, but it does not eliminate the cluster start-up time, which is the main issue in this scenario. Moreover, increasing the scaling range can also increase the costs and complexity of cluster management.

53
Q

A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.

Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?

A. pyspark.sql.types.DateType
B. datetime
C. pyspark.sql.types.TimestampType
D. Cron syntax
E. There is no way to represent and submit this information programmatically

A

D. Cron syntax
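
As a rough illustration, the Jobs API represents a schedule with a Quartz cron expression; the sketch below uses hypothetical values and, as written, would fire once a day at 06:00 UTC.

job_schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",  # sec min hour day-of-month month day-of-week
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}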

54
Q

Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?

A. Manually programming in an alert system in each cell of the Notebook
B. Setting up an Alert in the Job page
C. Setting up an Alert in the Notebook
D. There is no way to notify the Job owner in the case of Job failure
E. MLflow Model Registry Webhooks

A

B. Setting up an Alert in the Job page

55
Q

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

A. They can schedule the query to refresh every 1 day from the SQL endpoint’s page in Databricks SQL.
B. They can schedule the query to refresh every 12 hours from the SQL endpoint’s page in Databricks SQL.
C. They can schedule the query to refresh every 1 day from the query’s page in Databricks SQL.
D. They can schedule the query to run every 1 day from the Jobs UI.
E. They can schedule the query to run every 12 hours from the Jobs UI.

A

C. They can schedule the query to refresh every 1 day from the query’s page in Databricks SQL.

https://learn.microsoft.com/en-us/azure/databricks/sql/user/queries/schedule-query

56
Q

In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?

A. When another task needs to be replaced by the new task
B. When another task needs to fail before the new task begins
C. When another task has the same dependency libraries as the new task
D. When another task needs to use as little compute resources as possible
E. When another task needs to successfully complete before the new task begins

A

E. When another task needs to successfully complete before the new task begins

57
Q

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to a data analytics dashboard for a retail use case. The job has a Databricks SQL query that returns the number of store-level records where sales is equal to zero. The data engineer wants their entire team to be notified via a messaging webhook whenever this value is greater than 0.

Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of stores with $0 in sales is greater than zero?

A. They can set up an Alert with a custom template.
B. They can set up an Alert with a new email alert destination.
C. They can set up an Alert with one-time notifications.
D. They can set up an Alert with a new webhook alert destination.
E. They can set up an Alert without notifications.

A

D. They can set up an Alert with a new webhook alert destination.

58
Q

A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

A. They can turn on the Auto Stop feature for the SQL endpoint.
B. They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint.
C. They can reduce the cluster size of the SQL endpoint.
D. They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints.
E. They can set up the dashboard’s SQL endpoint to be serverless.

A

E. They can set up the dashboard’s SQL endpoint to be serverless.

Bing Copilot agrees:

The correct answer is E. They can set up the dashboard’s SQL endpoint to be serverless.

This is because:

According to the web search results, the Serverless feature for Databricks SQL enables instant compute and eliminates the need to wait for clusters to start or stop. Serverless SQL endpoints use compute clusters that are managed by Databricks and run in the serverless data plane. This means that the queries can run faster and more efficiently, without wasting time or resources on cluster management.

The other options are incorrect for the following reasons:

A. Turning on the Auto Stop feature for the SQL endpoint does not reduce the total running time of the SQL endpoint, as it only stops the cluster after a period of inactivity. This can save costs, but it does not improve performance.

B. Ensuring the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint does not reduce the total running time of the SQL endpoint, as it only creates more complexity and inconsistency in the dashboard configuration. Moreover, it may cause errors or conflicts if the queries use different warehouses or databases.

C. Reducing the cluster size of the SQL endpoint does not reduce the total running time of the SQL endpoint, as it only reduces the parallelism and throughput of the query execution. This may actually degrade the performance and increase the latency of the queries.

D. Ensuring the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints does not reduce the total running time of the SQL endpoint, as it only ensures that all queries use the same cluster and warehouse. This may improve consistency and simplicity, but it does not affect performance.

59
Q

A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which of the following approaches can be used to identify the owner of new_table?

A. Review the Permissions tab in the table’s page in Data Explorer
B. All of these options can be used to identify the owner of the table
C. Review the Owner field in the table’s page in Data Explorer
D. Review the Owner field in the table’s page in the cloud storage solution
E. There is no way to identify the owner of the table

A

C. Review the Owner field in the table’s page in Data Explorer

https://docs.databricks.com/en/data/explore-tables.html

60
Q

A new data engineering team, with group name team, has been assigned to an ELT project. The new data engineering team will need full privileges on the table sales to fully manage the project.

Which of the following commands can be used to grant full permissions on the table sales to the new data engineering team?

A. GRANT ALL PRIVILEGES ON TABLE sales TO team;
B. GRANT SELECT CREATE MODIFY ON TABLE sales TO team;
C. GRANT SELECT ON TABLE sales TO team;
D. GRANT USAGE ON TABLE sales TO team;
E. GRANT ALL PRIVILEGES ON TABLE team TO sales;

A

A. GRANT ALL PRIVILEGES ON TABLE sales TO team;