Manage Data with Delta Lake Flashcards
A data engineer needs to atomically append new rows to an existing Delta table.
Which of the following approaches is considered best practice to efficiently modify the table? Select one response.
A) The data engineer can use UPDATE to update the existing tables in one batch.
B) The data engineer can use APPEND to update the existing tables in one batch.
C) The data engineer can use INSERT INTO to incrementally update the existing tables.
D) The data engineer can use INSERT OVERWRITE to incrementally update the existing tables.
E) The data engineer can use INSERT ONLY to incrementally update the existing tables.
C) The data engineer can use INSERT INTO to incrementally update the existing tables.
Which of the following statements about managed and external tables are true? Select two responses.
A) When dropping an external table, the underlying data and metadata are also deleted.
B) External tables will always specify a LOCATION during table creation.
C) Managed tables are specified with the CREATE MANAGED TABLE command in SQL.
D) When dropping a managed table, only the underlying metadata stays intact.
E) When moving a managed table to a new database, the table’s data must be written to the new location.
B) External tables will always specify a LOCATION during table creation.
E) When moving a managed table to a new database, the table’s data must be written to the new location.
A data engineer is trying to optimize the result set returned in their query by compacting their data files to avoid small files.
Which keyword do they need to use in their query to improve the query’s performance while keeping the data files as even as possible with respect to size on disk? Select one response.
A) OPTIMIZE
B) VACUUM
C) ZORDER
D) COMPACT
E) DATASKIPPING
A) OPTIMIZE
A data engineer is using the code shown below to replace data from the table sales with data from a new query. However, the query isn’t running as expected.
Code block:
INSERT INTO sales
SELECT *, current_timestamp() FROM parquet ${da.paths.datasets}/ecommerce/raw/sales-historical
Which of the following statements correctly explains why the query is not running as expected? Select one response.
A) None of the provided answer choices explain why the query is running incorrectly.
B) The source file path is formatted incorrectly. Double-quotes need to be used in place of back-ticks.
C) INSERT OVERWRITE needs to be used instead of INSERT INTO.
D) APPEND needs to be used instead of INSERT INTO.
E) MERGE INTO needs to be used instead of INSERT INTO.
C) INSERT OVERWRITE needs to be used instead of INSERT INTO.
A data engineer needs to create a table with additional metadata columns. The columns need to specify the timestamp at which the table creation query was executed and the source data file for each record in the table.
Which of the following built-in Spark SQL commands can the data engineer use in their query to add these columns? Select two responses.
A) from_unixtime()
B) input_file_block_start()
C) from_utc_timestamp()
D) current_timestamp()
E) input_file_name()
D) current_timestamp()
E) input_file_name()
A data engineer needs to create the new database clients at a location represented by the variable path. The database will only contain JSON files.
Which of the following commands does the data engineer need to run to complete this task? Select two responses.
A) CREATE DATABASE clients LOCATION ‘${path}’;
B) CREATE SCHEMA clients LOCATION ‘${path}’;
C) CREATE DATABASE clients DELTA json. ‘${path}’;
D) CREATE SCHEMA IF NOT EXISTS clients json. ‘${path}’;
E) CREATE DATABASE IF NOT EXISTS clients ‘${path}’;
A) CREATE DATABASE clients LOCATION ‘${path}’;
B) CREATE SCHEMA clients LOCATION ‘${path}’;
A data engineer wants to review the changes that other team members have made to the table cities, including the operations that were performed on the table and the time at which they were performed. The variable path represents the file path to the table.
Which of the following commands does the data engineer need to use? Select one response.
A) display(dbutils.fs.ls(f”{path}”))
B) DESCRIBE DETAIL cities;
C) DESCRIBE HISTORY cities;
D) DESCRIBE EXTENDED cities;
E) SELECT * FROM cities VERSION AS OF 1;
C) DESCRIBE HISTORY cities;
A data engineer is trying to improve the performance of their query by colocating records on a common filter column to reduce the number of files that need to be read. The data engineer notices that the column user_id, which contains only unique values, is used in all of their query predicates.
Which optimization technique does the data engineer need to use? Select one response.
A) The data engineer needs to use COLOCATE to colocate records on user_id.
B) The data engineer needs to use DATASKIPPING to colocate records on user_id.
C) The data engineer needs to use OPTIMIZE to colocate records on user_id.
D) The data engineer needs to use ZORDER to colocate records on user_id.
E) The data engineer needs to use VACUUM to colocate records on user_id.
D) The data engineer needs to use ZORDER to colocate records on user_id.
A data engineer is trying to create the generated column date in their table. However, when they run their query to create the table, they notice an error in the following line of code.
Line of code:
date DATE GENERATED ALWAYS AS (
cast(cast(transaction_timestamp/1e6) AS DATE)))
Which of the following statements correctly describes the error? Select one response.
A) transaction_timestamp needs to be converted to datetime format before it is cast as a date.
B) The ALWAYS keyword needs to be removed to account for improperly formatted data.
C) transaction_timestamp needs to be cast as a timestamp before it is cast as a date.
D) transaction_timestamp needs to be converted to an integer before it is cast as a date.
E) The DATE GENERATED ALWAYS AS command already casts transaction_timestamp to a date, so the AS DATE cast needs to be removed.
C) transaction_timestamp needs to be cast as a timestamp before it is cast as a date.
Which of the following conditions must be met for data to be loaded incrementally with the COPY INTO command? Select three responses.
A) The data cannot contain duplicate records.
B) COPY INTO must target an existing Delta table.
C) The source file must specify the file’s format.
D) The schema for the data must be defined.
E) The data must be in JSON or CSV format.
B) COPY INTO must target an existing Delta table.
C) The source file must specify the file’s format.
E) The data must be in JSON or CSV format.
A data engineer wants to create an empty Delta table called student if it hasn’t already been created.
Which of the following will create a new table named student regardless of whether another table with the same name has already been created? Select one response.
A) CREATE TABLE IF NOT EXISTS student AS SELECT * FROM student
B) CREATE OR REPLACE TABLE student (id INT, name STRING, age INT);
C) OVERWRITE TABLE student (id INT, name STRING, age INT);
D) CREATE TABLE IF NOT EXISTS student (id INT, name STRING, age INT);
E) CREATE TABLE student (id INT, name STRING, age INT);
B) CREATE OR REPLACE TABLE student (id INT, name STRING, age INT);
A data engineer needs to query a Delta table to extract rows that all meet the same condition. However, they notice that the query is running slowly, and that the data files used in the query are extremely small.
Which of the following techniques can the data engineer use to improve the performance of the query? Select one response.
A) They can perform vacuuming and data skipping in the query using the VACUUM and DATASKIPPING commands.
B) They can perform file compaction and Z-order indexing in the query using the OPTIMIZE and ZORDER commands.
C) They can perform file compaction and vacuuming in the query using the COMPACT and VACUUM commands.
D) They can perform file compaction and Z-order indexing in the query using the COMPACT and ZORDER commands.
E) They can perform data skipping and file compaction in the query using the DATASKIPPING and OPTIMIZE commands
B) They can perform file compaction and Z-order indexing in the query using the OPTIMIZE and ZORDER commands.
Which of the following statements about vacuuming with Delta Lake is true? Select one response.
A) Delta table data files are vacuumed according to their modification timestamps on the storage system.
B) VACUUM will not vacuum any directories that begin with an underscore except for _delta_log.
C) Running VACUUM on a Delta table eliminates the ability to time travel back to a version older than the specified data retention period
D) Delta table metadata files will not be vacuumed unless the auto retention check is turned off.
E) On Delta tables, Databricks automatically triggers VACUUM operations as data is written.
C) Running VACUUM on a Delta table eliminates the ability to time travel back to a version older than the specified data retention period
Which of the following table modifications can be made with a MERGE INTO statement? Select three responses.
A) Write a stream of schema changes into a table
B) Write raw data from a file location into a schema
C) Write data that generates multiple downstream tables
D) Write streaming aggregates in Update Mode
E) Write data to a table with automatic deduplication
A) Write a stream of schema changes into a table
D) Write streaming aggregates in Update Mode
E) Write data to a table with automatic deduplication
A data engineer needs to undo changes made to the table foods. They need to ensure that the second version of the table does not include the changes before restoring the table back to that state.
Lines of code:
1) SELECT * FROM foods VERSION AS OF 2;
2) REFRESH TABLE foods;
3) SELECT * FROM foods WHERE version_number() == 2;
4) RESTORE TABLE foods TO VERSION AS OF 2;
A) 3,2
B) 1,4
C) 4
D) 2
E) 1,2
B) 1,4