Databricks - Transform data with PySpark Flashcards
A data engineer needs to create the Spark DataFrame item_purchasesDF by joining the previously created table sales (loaded as the DataFrame sales_df) with the Spark DataFrame items_df wherever the two have matching values in the sales_id column of both data objects.
Code block:
sales_df = (spark
.table("sales")
.withColumn("item", explode("items"))
)
items_df = spark.table("item_lookup")
item_purchasesDF = (sales_df ______________________)
Which of the following lines of code correctly fills in the blank? Select one response.
A) .join(items_df, sales.sales_id == items_df.sales_id, how = "cross")
B) .innerJoin(items_df, sales_df.sales_id == items_df.sales_id)
C) .merge(items_df, sales_df, on = "item_id")
D) .join(items_df, sales.sales_id == items_df.sales_id)
E) .outerJoin(items_df, sales.sales_id == items_df.sales_id)
D) .join(items_df, sales.sales_id == items_df.sales_id)
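For reference, a minimal sketch of the completed pipeline, assuming the sales and item_lookup tables exist and that items is an array column; in Python the left side of the join is referenced as sales_df:
from pyspark.sql.functions import explode

sales_df = (spark
  .table("sales")
  .withColumn("item", explode("items"))
)
items_df = spark.table("item_lookup")

# .join() defaults to an inner join; the condition matches sales_id across both DataFrames
item_purchasesDF = (sales_df
  .join(items_df, sales_df.sales_id == items_df.sales_id)
)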
A data engineer has a table high_temps with the following schema, where avg_high_temp represents the monthly average high temperatures for each unique year-month combination.
year string
month string
avg_high_temp string
They need to reformat the data with years as the primary record key and months as the columns. The existing average high temperature value for each year-month combination needs to be in the month columns.
How can the data engineer accomplish this? Select one response.
A) The data engineer can rotate the data from long to wide format using the .transform() clause.
B) The data engineer can rotate the data from wide to long format using the .transform() clause.
C) The data engineer can rotate the data from wide to long format using the .pivot() function.
D) The data engineer can rotate the data from long to wide format using the .pivot() function.
E) The data engineer can rotate the data from long to wide format using the .groupBy() clause.
D) The data engineer can rotate the data from long to wide format using the .pivot() function.
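A minimal sketch of the long-to-wide pivot, assuming the high_temps table from the question is available; carrying the existing value over with first() is an assumption about the aggregation:
from pyspark.sql.functions import first

high_temps_df = spark.table("high_temps")

# Rotate from long to wide: one row per year, one column per month,
# each cell holding the existing avg_high_temp value for that year-month pair
wide_df = (high_temps_df
  .groupBy("year")
  .pivot("month")
  .agg(first("avg_high_temp"))
)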
Which of the following statements about querying tables defined against external sources is true? Select one response.
A) None of these statements about external table behavior are true.
B) When defining tables or queries against external data sources, the storage path, external location, and storage credential are displayed for users who have been granted USAGE access to the table.
C) When defining tables or queries against external data sources, older cached versions of the table are automatically added to the event log.
D) When defining tables or queries against external data sources, older cached versions of the table are automatically deleted.
E) When defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed.
E) When defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed.
A data engineer needs to query a JSON file whose location is represented by the variable path.
Which of the following commands do they need to use? Select one response.
A) SELECT * FROM path LOCATION ${path};
B) SELECT * FROM json.`${path}`;
C) DISPLAY TABLE json.`${path}`;
D) RETURN json.`${path}`;
E) SHOW TABLE json.`${path}`;
B) SELECT * FROM json.`${path}`;
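A minimal sketch of the direct-file query from Python, assuming path holds the JSON file location (the example path is hypothetical); note the backticks around the substituted path:
path = "/mnt/datasets/events.json"  # hypothetical location of the JSON file

# Query the file directly; json.`<path>` reads the file in place without registering a table
df = spark.sql(f"SELECT * FROM json.`{path}`")
df.show()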
A data engineer has a query that directly updates the files underlying the external table emails.
Which of the following correctly describes how to retrieve the number of rows in the updated table? Select one response.
A) REFRESH TABLE emails;
SELECT COUNT() FROM emails AS OF VERSION 1;
B) REFRESH TABLE emails;
SELECT COUNT() FROM emails WHEN UPDATED = TRUE;
C) REFRESH TABLE emails;
SELECT DISTINCT_COUNT() FROM emails AS OF VERSION 1;
D) REFRESH TABLE emails;
SELECT DISTINCT_COUNT() FROM emails;
E) REFRESH TABLE emails;
SELECT COUNT(*) FROM emails;
E) REFRESH TABLE emails;
SELECT COUNT(*) FROM emails;
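A minimal sketch of the answer, assuming the external table emails exists and its underlying files were just modified outside of Spark:
# Invalidate any cached metadata/data so the next query re-reads the updated files
spark.sql("REFRESH TABLE emails")

# Count the rows in the refreshed table
spark.sql("SELECT COUNT(*) AS row_count FROM emails").show()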
A senior data engineer has registered the Python function create_users to be used by the rest of their team. They have written the function in Python 3.6. Another data engineer wants to know what the expected inputs are.
Which of the following is considered a best practice to do this? Select one response.
A) The data engineer can use type hints to clarify the input and return types of the function.
B) The data engineer can add the input and output types to the table using @options().
C) The data engineer can declare the function with @udf(“string”) syntax to specify the input and return types of the function.
D) The data engineer can add a return string line to the end of their UDF to specify the input and return types of the function.
E) The data engineer can add a comment to the table properties to clarify the input and return types of the function.
A) The data engineer can use type hints to clarify the input and return types of the function.
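A minimal sketch of documenting a function's expected inputs with Python type hints; the signature and body of create_users shown here are assumptions:
from pyspark.sql.functions import udf

def create_users(first_name: str, last_name: str) -> str:
    # The type hints make the expected input and return types visible to other engineers
    return f"{first_name}.{last_name}".lower()

# The typed function can then be registered as a UDF as usual
create_users_udf = udf(create_users)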
A data engineer needs to create and register a user-defined function (UDF) CREATE_USER using the Python function createUser and apply it to array column username in the table users. They have the following lines of code.
Lines of code:
1 - spark.sql("SELECT createUser(username) AS user FROM users")
2 - spark.sql("SELECT CREATE_USER(username) AS user FROM users")
3 - spark.udf.register("CREATE_USER", createUser)
4 - spark.udf.register(createUser(username))
In what order do the lines of code above need to be run in a Python session in order to accomplish this? Select one response.
A) 3,2
B) 2
C) 4,2
D) 4,1
E) 3,1
A) 3,2
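A minimal sketch of the registration-then-query order (line 3, then line 2), assuming a hypothetical createUser implementation and an existing users table with a username column:
def createUser(username):
    # Hypothetical transformation applied to each username value
    return f"user_{username}"

# Line 3: register the Python function under the SQL name CREATE_USER
spark.udf.register("CREATE_USER", createUser)

# Line 2: call the registered UDF by its SQL name from a SQL query
user_df = spark.sql("SELECT CREATE_USER(username) AS user FROM users")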
A data engineer has a DataFrame events_df that has been registered against an external JSON file. The nested JSON fields have already been converted into struct types. The data engineer now needs to flatten the struct fields back into individual columns for the field event_type. The events_df DataFrame has the following schema:
date string
month string
event_type StructType<id string, size int>
Which of the following approaches allows the data engineer to retrieve id within event_type? Select one response.
A) They can use from_json() to parse the columns for id.
B) They can index the DataFrame by id.
C) They can use event_type.* to pull out id into its own column.
D) They can use . syntax to access id in event_type.
E) They can use : syntax to access id in event_type.
D) They can use . syntax to access id in event_type.
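A minimal sketch of flattening the struct with . syntax, assuming events_df has the schema shown above:
from pyspark.sql.functions import col

# event_type is a struct column, so its subfields are addressed with dot notation
flattened_df = events_df.select(
    "date",
    "month",
    col("event_type.id").alias("id"),
    col("event_type.size").alias("size"),
)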
A data engineer has created a custom Python function that returns an input variable of type double after it has been incremented by 1.
Which of the following identifies the most efficient type of UDF the engineer can use in terms of complexity, as well as the correct method to specify the type of UDF? Select one response.
A) The data engineer can specify that the UDF has row-at-a-time execution using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
B) The data engineer can specify that the UDF is vectorized using type hints.
C) The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.VECTORIZED) syntax.
D) The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
E) The data engineer can specify that the UDF has row-at-a-time execution using @pandas_udf("double", PandasUDFType.VECTORIZED) syntax.
D) The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
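A minimal sketch of the vectorized UDF described in the answer (the function name plus_one is hypothetical); on Spark 3.x the same UDF is usually declared with type hints alone, but the PandasUDFType.SCALAR form still works:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Vectorized UDF: operates on a whole pandas Series per batch rather than row by row
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(values: pd.Series) -> pd.Series:
    return values + 1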
A data engineer is registering a table in Databricks using the table users from an external SQL database. One of their colleagues gives them the following code to register the table. However, when the data engineer runs the code, they notice an error.
Code block:
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
url = "jdbc:sqlite:${DA.paths.ecommerce_db}"
)
Which of the following correctly identifies why running the code is resulting in an error? Select one response.
A) CREATE TABLE needs to be changed to CREATE JDBC TABLE.
B) The line dbtable = "users" needs to be added to OPTIONS.
C) A username and password need to be added to OPTIONS.
D) USING JDBC needs to be changed to USING SQL.
E) None of these responses correctly identify the cause of the error.
B) The line dbtable = "users" needs to be added to OPTIONS.
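A minimal sketch of the corrected statement run from Python; the JDBC URL is copied from the question and depends on the course environment:
spark.sql("""
  CREATE TABLE IF NOT EXISTS users_jdbc
  USING JDBC
  OPTIONS (
    url = "jdbc:sqlite:${DA.paths.ecommerce_db}",
    dbtable = "users"
  )
""")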
A data engineer is using the following code block to create a table using an external CSV file as its source. They need to specify that the fields are separated by | and that there is a header.
Code block:
CREATE TABLE IF NOT EXISTS sales_csv
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV _____
LOCATION "${dbfs:/mnt/datasets}"
Which of the following correctly fills in the blank for the table options? Select one response.
A) VALUES (
header = "true",
delimiter = "|")
B) KEYS (
header = "true",
delimiter = "|")
C) TBLPROPERTIES (
header = "true",
delimiter = "|")
D) OPTIONS (
header = "true",
delimiter = "|")
E) COMMENTS (
header = "true",
delimiter = "|")
D) OPTIONS (
header = "true",
delimiter = "|")
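The completed statement as a minimal sketch run from Python; the schema and location are copied from the question, and the location value depends on the environment:
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales_csv
    (order_id LONG, email STRING, transactions_timestamp LONG,
     total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE,
     unique_items INTEGER, items STRING)
  USING CSV
  OPTIONS (
    header = "true",
    delimiter = "|"
  )
  LOCATION "${dbfs:/mnt/datasets}"
""")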
A data engineer needs to extract the calendar date and time in human readable format from a DataFrame containing the timestamp column user_last_touch_timestamp.
Which of the following lines of code correctly fills in the blank by adding the column end_date of type date in human readable format? Select one response.
A) .withColumn("end_date", date_format("user_last_touch_timestamp", "MMM d, yyyy"))
B) .withColumn(date_format("end_date"), user_last_touch_timestamp, "HH:mm:ss")
C) .withColumn(date_time("end_date"), user_last_touch_timestamp, "MMM d, yyyy")
D) .withColumn("end_date", CAST(user_last_touch_timestamp) as date_format)
E) .withColumn(date_time("end_date"), user_last_touch_timestamp, "HH:mm:ss")
A) .withColumn("end_date", date_format("user_last_touch_timestamp", "MMM d, yyyy"))
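A minimal sketch applying the answer to a hypothetical DataFrame df containing the user_last_touch_timestamp column:
from pyspark.sql.functions import date_format

# Render the timestamp as a human-readable date string such as "Jun 3, 2024"
df_with_end_date = df.withColumn(
    "end_date",
    date_format("user_last_touch_timestamp", "MMM d, yyyy")
)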
A data engineer is using the following query to confirm that each unique string value in the phone_number column in the usersDF DataFrame is associated with at most one user_id. They want the query to return true if each phone_number is associated with at most 1 user_id. When they run the query, they notice that the query is not returning the expected result.
Code block:
from pyspark.sql.functions import countDistinct
(usersDF
  .groupBy("phone_number")
  .agg(countDistinct("user_id").alias("unique_user_ids"))
)
Which of the following explains why the query is not returning the expected result? Select one response.
A) .groupBy("phone_number") needs to be changed to count(*).when(user_id != null).
B) .groupBy("phone_number") needs to be changed to .countDistinct(phone_number).
C) A .dropDuplicates() statement needs to be added after the .agg() function.
D) A .merge statement on row_count == count(phone_number) needs to be added after the groupBy() function.
E) A .select(max("unique_user_ids") <= 1) function needs to be added after the .agg() function.
E) A .select(max("unique_user_ids") <= 1) function needs to be added after the .agg() function.
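A minimal sketch of the corrected validation, assuming usersDF is available; it returns a single boolean that is true only when no phone_number maps to more than one user_id:
from pyspark.sql.functions import countDistinct, max as max_

at_most_one = (usersDF
  .groupBy("phone_number")
  .agg(countDistinct("user_id").alias("unique_user_ids"))
  # Global aggregate over the per-phone counts: true if the largest count is <= 1
  .select((max_("unique_user_ids") <= 1).alias("at_most_one_user_id"))
)
at_most_one.show()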
A data engineer needs a reference to the results of a query that can be referenced across multiple queries within the scope of the environment session. The data engineer does not want the reference to exist outside of the scope of the environment session.
Which of the following approaches accomplishes this without explicitly dropping the data object? Select one response.
A) They can store the results of their query within a view.
B) They can store the results of their query within a temporary view.
C) They can store the results of their query within a table.
D) They can store the results of their query within a common table expression (CTE).
E) They can store the results of their query within a reusable user-defined function (UDF).
B) They can store the results of their query within a temporary view.
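A minimal sketch of a session-scoped temporary view; the orders table and the filter are hypothetical stand-ins for the data engineer's query:
# The temporary view lives only for the current Spark session and never needs an explicit DROP
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW recent_orders AS
  SELECT * FROM orders
  WHERE order_date >= '2024-01-01'
""")

spark.sql("SELECT COUNT(*) FROM recent_orders").show()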
Which of the following identify performance challenges associated with Python UDFs? Select three responses.
A) Python UDFs have interprocess communication overhead between the executor and a Python interpreter running on each worker node.
B) Python UDFs cannot be optimized by the Catalyst Optimizer.
C) Python UDFs cannot use vector or scalar transformations.
D) Python UDFs cannot contain column aggregations or data optimizations.
E) Python UDFs have to deserialize row data from Spark’s native binary format to pass to the UDF, and the results are serialized back into Spark’s native format.
A) Python UDFs have interprocess communication overhead between the executor and a Python interpreter running on each worker node.
B) Python UDFs cannot be optimized by the Catalyst Optimizer.
E) Python UDFs have to deserialize row data from Spark’s native binary format to pass to the UDF, and the results are serialized back into Spark’s native format.