Manage Data with Delta Lake Flashcards

1
Q

A data engineer needs to atomically append new rows to an existing Delta table.

Which of the following approaches is considered best practice to efficiently modify the table? Select one response.

A) The data engineer can use UPDATE to update the existing tables in one batch.
B) The data engineer can use APPEND to update the existing tables in one batch.
C) The data engineer can use INSERT INTO to incrementally update the existing tables.
D) The data engineer can use INSERT OVERWRITE to incrementally update the existing tables.
E) The data engineer can use INSERT ONLY to incrementally update the existing tables.

A

C) The data engineer can use INSERT INTO to incrementally update the existing tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Which of the following statements about managed and external tables are true? Select two responses.

A) When dropping an external table, the underlying data and metadata are also deleted.
B) External tables will always specify a LOCATION during table creation.
C) Managed tables are specified with the CREATE MANAGED TABLE command in SQL.
D) When dropping a managed table, only the underlying metadata stays intact.
E) When moving a managed table to a new database, the table’s data must be written to the new location.

A

B) External tables will always specify a LOCATION during table creation.

E) When moving a managed table to a new database, the table’s data must be written to the new location.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

A data engineer is trying to optimize the result set returned in their query by compacting their data files to avoid small files.

Which keyword do they need to use in their query to improve the query’s performance while keeping the data files as even as possible with respect to size on disk? Select one response.

A) OPTIMIZE
B) VACUUM
C) ZORDER
D) COMPACT
E) DATASKIPPING

A

A) OPTIMIZE

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

A data engineer is using the code shown below to replace data from the table sales with data from a new query. However, the query isn’t running as expected.

Code block:

INSERT INTO sales

SELECT *, current_timestamp() FROM parquet ${da.paths.datasets}/ecommerce/raw/sales-historical

Which of the following statements correctly explains why the query is not running as expected? Select one response.

A) None of the provided answer choices explain why the query is running incorrectly.
B) The source file path is formatted incorrectly. Double-quotes need to be used in place of back-ticks.
C) INSERT OVERWRITE needs to be used instead of INSERT INTO.
D) APPEND needs to be used instead of INSERT INTO.
E) MERGE INTO needs to be used instead of INSERT INTO.

A

C) INSERT OVERWRITE needs to be used instead of INSERT INTO.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

A data engineer needs to create a table with additional metadata columns. The columns need to specify the timestamp at which the table creation query was executed and the source data file for each record in the table.

Which of the following built-in Spark SQL commands can the data engineer use in their query to add these columns? Select two responses.

A) from_unixtime()
B) input_file_block_start()
C) from_utc_timestamp()
D) current_timestamp()
E) input_file_name()

A

D) current_timestamp()
E) input_file_name()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

A data engineer needs to create the new database clients at a location represented by the variable path. The database will only contain JSON files.

Which of the following commands does the data engineer need to run to complete this task? Select two responses.

A) CREATE DATABASE clients LOCATION ‘${path}’;
B) CREATE SCHEMA clients LOCATION ‘${path}’;
C) CREATE DATABASE clients DELTA json. ‘${path}’;
D) CREATE SCHEMA IF NOT EXISTS clients json. ‘${path}’;
E) CREATE DATABASE IF NOT EXISTS clients ‘${path}’;

A

A) CREATE DATABASE clients LOCATION ‘${path}’;
B) CREATE SCHEMA clients LOCATION ‘${path}’;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

A data engineer wants to review the changes that other team members have made to the table cities, including the operations that were performed on the table and the time at which they were performed. The variable path represents the file path to the table.

Which of the following commands does the data engineer need to use? Select one response.

A) display(dbutils.fs.ls(f”{path}”))
B) DESCRIBE DETAIL cities;
C) DESCRIBE HISTORY cities;
D) DESCRIBE EXTENDED cities;
E) SELECT * FROM cities VERSION AS OF 1;

A

C) DESCRIBE HISTORY cities;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

A data engineer is trying to improve the performance of their query by colocating records on a common filter column to reduce the number of files that need to be read. The data engineer notices that the column user_id, which contains only unique values, is used in all of their query predicates.

Which optimization technique does the data engineer need to use? Select one response.

A) The data engineer needs to use COLOCATE to colocate records on user_id.
B) The data engineer needs to use DATASKIPPING to colocate records on user_id.
C) The data engineer needs to use OPTIMIZE to colocate records on user_id.
D) The data engineer needs to use ZORDER to colocate records on user_id.
E) The data engineer needs to use VACUUM to colocate records on user_id.

A

D) The data engineer needs to use ZORDER to colocate records on user_id.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

A data engineer is trying to create the generated column date in their table. However, when they run their query to create the table, they notice an error in the following line of code.

Line of code:

date DATE GENERATED ALWAYS AS (
cast(cast(transaction_timestamp/1e6) AS DATE)))

Which of the following statements correctly describes the error? Select one response.

A) transaction_timestamp needs to be converted to datetime format before it is cast as a date.
B) The ALWAYS keyword needs to be removed to account for improperly formatted data.
C) transaction_timestamp needs to be cast as a timestamp before it is cast as a date.
D) transaction_timestamp needs to be converted to an integer before it is cast as a date.
E) The DATE GENERATED ALWAYS AS command already casts transaction_timestamp to a date, so the AS DATE cast needs to be removed.

A

C) transaction_timestamp needs to be cast as a timestamp before it is cast as a date.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Which of the following conditions must be met for data to be loaded incrementally with the COPY INTO command? Select three responses.

A) The data cannot contain duplicate records.
B) COPY INTO must target an existing Delta table.
C) The source file must specify the file’s format.
D) The schema for the data must be defined.
E) The data must be in JSON or CSV format.

A

B) COPY INTO must target an existing Delta table.
C) The source file must specify the file’s format.
E) The data must be in JSON or CSV format.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

A data engineer wants to create an empty Delta table called student if it hasn’t already been created.

Which of the following will create a new table named student regardless of whether another table with the same name has already been created? Select one response.

A) CREATE TABLE IF NOT EXISTS student AS SELECT * FROM student
B) CREATE OR REPLACE TABLE student (id INT, name STRING, age INT);
C) OVERWRITE TABLE student (id INT, name STRING, age INT);
D) CREATE TABLE IF NOT EXISTS student (id INT, name STRING, age INT);
E) CREATE TABLE student (id INT, name STRING, age INT);

A

B) CREATE OR REPLACE TABLE student (id INT, name STRING, age INT);

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

A data engineer needs to query a Delta table to extract rows that all meet the same condition. However, they notice that the query is running slowly, and that the data files used in the query are extremely small.

Which of the following techniques can the data engineer use to improve the performance of the query? Select one response.

A) They can perform vacuuming and data skipping in the query using the VACUUM and DATASKIPPING commands.
B) They can perform file compaction and Z-order indexing in the query using the OPTIMIZE and ZORDER commands.
C) They can perform file compaction and vacuuming in the query using the COMPACT and VACUUM commands.
D) They can perform file compaction and Z-order indexing in the query using the COMPACT and ZORDER commands.
E) They can perform data skipping and file compaction in the query using the DATASKIPPING and OPTIMIZE commands

A

B) They can perform file compaction and Z-order indexing in the query using the OPTIMIZE and ZORDER commands.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Which of the following statements about vacuuming with Delta Lake is true? Select one response.

A) Delta table data files are vacuumed according to their modification timestamps on the storage system.
B) VACUUM will not vacuum any directories that begin with an underscore except for _delta_log.
C) Running VACUUM on a Delta table eliminates the ability to time travel back to a version older than the specified data retention period
D) Delta table metadata files will not be vacuumed unless the auto retention check is turned off.
E) On Delta tables, Databricks automatically triggers VACUUM operations as data is written.

A

C) Running VACUUM on a Delta table eliminates the ability to time travel back to a version older than the specified data retention period

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Which of the following table modifications can be made with a MERGE INTO statement? Select three responses.

A) Write a stream of schema changes into a table
B) Write raw data from a file location into a schema
C) Write data that generates multiple downstream tables
D) Write streaming aggregates in Update Mode
E) Write data to a table with automatic deduplication

A

A) Write a stream of schema changes into a table
D) Write streaming aggregates in Update Mode
E) Write data to a table with automatic deduplication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

A data engineer needs to undo changes made to the table foods. They need to ensure that the second version of the table does not include the changes before restoring the table back to that state.

Lines of code:

1) SELECT * FROM foods VERSION AS OF 2;
2) REFRESH TABLE foods;
3) SELECT * FROM foods WHERE version_number() == 2;
4) RESTORE TABLE foods TO VERSION AS OF 2;

A) 3,2
B) 1,4
C) 4
D) 2
E) 1,2

A

B) 1,4

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

A data engineer has a collection of tables. They need to manually remove old data files from the tables and remove access to previous versions of the tables.

Which of the following approaches allows the data engineer to do this? Select one response.

A) They need to disable the last commit version in session and enable vacuum duration check. Then they need to Z-order the table.
B) They need to disable the retention duration check and enable the last commit version in session. Then they need to vacuum the table.
C) They need to enable the retention duration check and disable vacuum logging. Then they need to Z-order the table.
D) They need to disable the retention duration check and enable vacuum logging. Then they need to vacuum the table.
E) They need to enable the retention duration check and vacuum logging. Then they need to optimize the table.

A

D) They need to disable the retention duration check and enable vacuum logging. Then they need to vacuum the table.

17
Q

data engineer wants to make changes to a very large table. They want to test their changes on a similar data object before modifying or copying the original table’s associated data.

Which of the following keywords can be used to create a similar data object that can be used for testing while meeting the above requirements? Select one response.

A) SHALLOW CLONE
B) UPDATE
C) CLONE
D) COPY
E) DEEP CLONE

A

A) SHALLOW CLONE

18
Q

Which of the following pieces of information about a table are located within the schema directory of the table? Select three responses.

A) Creation date
B) Owner
C) Catalog name
D) Last modification date
E) Location

A

B) Owner
C) Catalog name
E) Location

19
Q

Which of the following SQL commands can be used to remove a schema (database) at a specified location? Select two responses.

A) DROP DATABASE
B) DELETE SCHEMA
C) DELETE DATABASE
D) DROP SCHEMA
E) REMOVE DATABASE

A

A) DROP DATABASE
D) DROP SCHEMA

20
Q

The code block shown below should add a constraint to the table transaction_dates where only records from after ‘2022-10-01’ can be added to the table. The column date represents the date the records were created.

Code block:
__1__ transaction_dates __2__ valid_date __3__;

Which of the following correctly fills in the numbered blanks within the code block to complete this task? Select one response.
A)
1. ALTER TABLE
2. ADD CONSTRAINT
3. CHECK (date > ‘2022-10-01’)

B)
1. ALTER TABLE
2. ADD CONSTRAINT
3. WHERE (date > ‘2022-10-01’)

C)
1. ALTER TABLE
2. CONSTRAINT
3. UPDATE WHEN (date > ‘2022-10-01’)

D)
1. ALTER TABLE
2. DROP CONSTRAINT
3. NOT NULL (date > ‘2022-10-01’)

E)
1. ALTER TABLE
2. CONSTRAINT
3. (date > ‘2022-10-01’)

A

A)
1. ALTER TABLE
2. ADD CONSTRAINT
3. CHECK (date > ‘2022-10-01’)

21
Q

A data engineer is working with the table products. They want to identify the location of products and read its metadata, including the table’s format and the date that the table was created at.

Which of the following commands do they need to use? Select one response.

A) DESCRIBE TABLE products;
B) DESCRIBE HISTORY products;
C) SHOW TABLES products;
D) DESCRIBE DETAIL products;
E) DESCRIBE TABLE EXTENDED products;

A

D) DESCRIBE DETAIL products;

22
Q

Which of the following problems are solved by the guarantee of ACID transactions? Select two responses.

A) ACID transactions are guaranteed to either succeed or fail completely, so jobs will never fail mid way.
B) ACID transactions guarantee that appends will not fail due to conflict, even when writing from multiple sources at the same time.
C) ACID transactions guarantee the use of proprietary storage formats.
D) ACID transactions combine compute and storage scaling to reduce costs.
E) ACID transactions support the creation of interactive visualization queries.

A

A) ACID transactions are guaranteed to either succeed or fail completely, so jobs will never fail mid way.
B) ACID transactions guarantee that appends will not fail due to conflict, even when writing from multiple sources at the same time.

23
Q

Which of the following describes a feature of Delta Lake that is unavailable in a traditional data warehouse? Select two responses.

A) Built-in monitoring for scheduled queries
B) Centralized repository to share features
C) Combined batch and streaming analytics
D) Auto Loader for data ingestion of raw files
E) Experiment tracking and model management

A

C) Combined batch and streaming analytics
D) Auto Loader for data ingestion of raw files

24
Q

A data engineer needs to update the table people. The engineer only wants records to be updated if the existing row has a NULL value in the column email and the new row does not.

They have the following incomplete code block:

MERGE INTO people a
USING people_update b
ON a.people_id = b.people_id
_____
WHEN NOT MATCHED THEN DELETE

Which of the following statements correctly fills in the blank? Select one response.

A) WHEN MATCHED AND a.email IS NULL THEN INSERT (email = b.email)
B) UPDATE SET email = b.email WHEN a.email IS NULL
C) UPDATE SET email = b.email WHEN MATCHED AND a.email IS NULL
D) WHEN MATCHED AND a.email IS NULL THEN UPDATE SET email = b.email
E) INSERT (email = b.email) WHEN MATCHED AND a.email IS NULL

A

D) WHEN MATCHED AND a.email IS NULL THEN UPDATE SET email = b.email

25
Q

Which of the following validates that the temporary view trees has exactly six records? Select one response.

A) assert spark(“trees”).count() == 6
B) assert spark.view(“trees”).count() == 6
C) assert spark.temp(“trees”).count() == 6
D) assert spark.table(“trees”).count() == 6
E) assert spark.load(“trees”).count() == 6

A

D) assert spark.table(“trees”).count() == 6