Transform Data with PySpark (Databricks Q) Flashcards
A data engineer has created a DataFrame exploded_eventsDF from the table exploded_events, which is derived from the table defined here:
CREATE TABLE events (user_id string, event_name string, item_id string, events struct<coupon:string, event_id:string, event_revenue:double>);
They are using the following code, which applies multiple array transformations, to return a new DataFrame showing the unique collection of values in the event_name and items columns.
Code block:
from pyspark.sql.functions import array_distinct, collect_set, flatten
exploded_eventsDF
    .groupby("user_id")
    .agg(collect_set("event_name"), _____)
Which of the following lines of code fills in the blank to create the column event_history as a unique collection of events? Select one response.
flatten(array_distinct(events[event_id])).alias("event_history")
flatten(extract(events.event_id)).alias("event_history")
array_distinct(flatten(collect_set("events.event_id"))).alias("event_history")
array_distinct(extract(collect_set(events.event_id))).alias("event_history")
flatten(collect_set(explode(events:event_id))).alias("event_history")
array_distinct(flatten(collect_set("events.event_id"))).alias("event_history")
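For reference, a minimal sketch of the completed aggregation, assuming exploded_eventsDF exposes user_id, event_name, and an array-of-structs column events, so that events.event_id resolves to an array of IDs per row:

from pyspark.sql.functions import array_distinct, collect_set, flatten

event_historyDF = (exploded_eventsDF
    .groupby("user_id")
    .agg(collect_set("event_name").alias("event_names"),
         array_distinct(flatten(collect_set("events.event_id"))).alias("event_history"))
)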
A data engineer has created the Spark DataFrame item_purchasesDF below, which joins the previously created table sales (loaded into the DataFrame sales_df) with the Spark DataFrame items_df wherever the two have matching values in the sales_id column.
Code block:
sales_df = (spark
    .table("sales")
    .withColumn("item", explode("items"))
)

items_df = spark.table("item_lookup")

item_purchasesDF = (sales_df
    ______________________
)
Which of the following lines of code correctly fills in the blank? Select one response.
.merge(items_df, sales_df, on = "item_id")
.outerJoin(items_df, sales.sales_id == items_df.sales_id)
.join(items_df, sales.sales_id == items_df.sales_id, how = "cross")
.innerJoin(items_df, sales_df.sales_id == items_df.sales_id)
.join(items_df, sales.sales_id == items_df.sales_id)
.join(items_df, sales.sales_id == items_df.sales_id)
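A minimal sketch of the completed join, assuming both DataFrames expose a sales_id column:

item_purchasesDF = (sales_df
    .join(items_df, sales_df.sales_id == items_df.sales_id)
)

Note that .join() performs an inner join by default; innerJoin() and outerJoin() are not DataFrame methods, and merge() belongs to pandas, not PySpark.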
Which of the following statements about the difference between views and temporary views are correct? Select two responses.
Temporary views do not contain a preserved schema. Views are tied to a system-preserved temporary schema global_temp.
Temporary views have names that must be qualified. Views have names that must be unique.
Temporary views reside in the third layer of Unity Catalog’s three-level namespace. Views lie in the metastore.
Temporary views are session-scoped and dropped when the Spark session ends. Views can be accessed after the session ends.
Temporary views skip persisting the definition in the underlying metastore. Views have metadata that can be accessed in the view’s directory.
Temporary views are session-scoped and dropped when the Spark session ends. Views can be accessed after the session ends.
Temporary views skip persisting the definition in the underlying metastore. Views have metadata that can be accessed in the view’s directory.
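A minimal sketch of the distinction, assuming a table named events already exists:

# Dropped automatically when the Spark session ends; nothing is written to the metastore
spark.sql("CREATE OR REPLACE TEMP VIEW events_temp_vw AS SELECT * FROM events")

# The view definition is persisted in the metastore and remains queryable in later sessions
spark.sql("CREATE OR REPLACE VIEW events_vw AS SELECT * FROM events")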
A data engineer is using the following code block to create a table using an external CSV file as its source. They need to specify that the fields are separated by | and that there is a header.
Code block:
CREATE TABLE IF NOT EXISTS sales_csv
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
_____
LOCATION "${dbfs:/mnt/datasets}"
Which of the following correctly fills in the blank for the table options? Select one response.
COMMENTS (
header = "true",
delimiter = "|"
)
OPTIONS (
header = "true",
delimiter = "|"
)
VALUES (
header = "true",
delimiter = "|"
)
KEYS (
header = "true",
delimiter = "|"
)
TBLPROPERTIES (
header = "true",
delimiter = "|"
)
OPTIONS (
header = "true",
delimiter = "|"
)
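A minimal sketch of the full statement with the OPTIONS clause filled in; the LOCATION value is taken verbatim from the question and would need a real storage path in practice:

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_csv
      (order_id LONG, email STRING, transactions_timestamp LONG,
       total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE,
       unique_items INTEGER, items STRING)
    USING CSV
    OPTIONS (header = "true", delimiter = "|")
    LOCATION "${dbfs:/mnt/datasets}"
""")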
A data engineer has created a custom Python function that takes an input variable of type double and returns it incremented by 1.
Which of the following identifies the most efficient type of UDF the engineer can use in terms of complexity, as well as the correct method to specify the type of UDF? Select one response.
The data engineer can specify that the UDF is vectorized using type hints.
The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
The data engineer can specify that the UDF has row-at-a-time execution using @pandas_udf("double", PandasUDFType.VECTORIZED) syntax.
The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.VECTORIZED) syntax.
The data engineer can specify that the UDF has row-at-a-time execution using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
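A minimal sketch of the vectorized (pandas) UDF described in the question, written in the Spark 3.x type-hint style; the @pandas_udf("double", PandasUDFType.SCALAR) decorator is the older, pre-type-hint equivalent:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def plus_one(values: pd.Series) -> pd.Series:
    # Operates on whole pandas Series at a time rather than row by row
    return values + 1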
Which of the following statements about querying tables defined against external sources is true? Select one response.
When defining tables or queries against external data sources, older cached versions of the table are automatically added to the event log.
When defining tables or queries against external data sources, older cached versions of the table are automatically deleted.
None of these statements about external table behavior are true.
When defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed.
When defining tables or queries against external data sources, the storage path, external location, and storage credential are displayed for users who have been granted USAGE access to the table.
When defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed.
Which of the following identifies a performance challenge associated with Python UDFs? Select three responses.
Python UDFs cannot contain column aggregations or data optimizations.
Python UDFs cannot use vector or scalar transformations.
Python UDFs cannot be optimized by the Catalyst Optimizer.
Python UDFs have to deserialize row data from Spark’s native binary format to pass to the UDF, and the results are serialized back into Spark’s native format.
Python UDFs have interprocess communication overhead between the executor and a Python interpreter running on each worker node.
Python UDFs cannot be optimized by the Catalyst Optimizer.
Python UDFs have to deserialize row data from Spark’s native binary format to pass to the UDF, and the results are serialized back into Spark’s native format.
Python UDFs have interprocess communication overhead between the executor and a Python interpreter running on each worker node.
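A minimal sketch of the overhead being described, using a hypothetical DataFrame df with a numeric column price:

from pyspark.sql.functions import udf, col

@udf("double")
def plus_one_py(value):
    # Each row is deserialized from Spark's binary format, handed to the Python
    # interpreter on the worker, and the result serialized back; Catalyst cannot
    # optimize inside this function.
    return value + 1.0 if value is not None else None

df.select(plus_one_py(col("price")))   # Python UDF: serialization + interprocess overhead
df.select(col("price") + 1.0)          # Built-in expression: fully optimized by Catalyst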
A data engineer has a DataFrame events_df that has been registered against an external JSON file. They need to access the field date within events_df. The events_df DataFrame has the following schema:
date string
month string
event_type string
Which of the following approaches can the data engineer use to accomplish this? Select one response.
They can use . syntax to access date in events_df.
They can index the query by subfield using events[date] syntax.
They can use date.* to pull out the subfields of events_df into their own columns.
They can use from_json() to parse the column for date.
They can use : syntax to access date in events_df.
They can use : syntax to access date in events_df.
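For context, in Databricks SQL the colon syntax extracts fields from columns that hold JSON-formatted strings. A minimal sketch, assuming the raw JSON is held in a hypothetical string column named value:

events_df.selectExpr("value:date", "value:month", "value:event_type")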
Which of the following lines of code counts null values in the column email from the DataFrame usersDF? Select two responses.
usersDF.selectExpr("count_if(email IS NULL)")
usersDF.distinct()
usersDF.drop()
usersDF.where(col("email").isNull()).count()
usersDF.count().dropna()
usersDF.selectExpr("count_if(email IS NULL)")
usersDF.where(col("email").isNull()).count()
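Both approaches can be verified with a small sketch, assuming usersDF has a string column email:

from pyspark.sql.functions import col

usersDF.where(col("email").isNull()).count()                            # returns a Python int
usersDF.selectExpr("count_if(email IS NULL) AS missing_emails").show()  # returns a one-row DataFrame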
A data engineer has a table high_temps with the following schema, where avg_high_temp represents the monthly average high temperatures for each unique year-month combination.
year string
month string
avg_high_temp string
They need to reformat the data with years as the primary record key and months as the columns. The existing average high temperature value for each year-month combination needs to be in the month columns.
How can the data engineer accomplish this? Select one response.
The data engineer can rotate the data from wide to long format using the .pivot() function.
The data engineer can rotate the data from long to wide format using the .pivot() function.
The data engineer can rotate the data from long to wide format using the .groupBy() clause.
The data engineer can rotate the data from long to wide format using the .transform() clause.
The data engineer can rotate the data from wide to long format using the .transform() clause.
The data engineer can rotate the data from long to wide format using the .pivot() function.
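A minimal sketch of the long-to-wide rotation, assuming the table is loaded into a DataFrame first:

from pyspark.sql.functions import first

high_temps_df = spark.table("high_temps")

wide_df = (high_temps_df
    .groupBy("year")
    .pivot("month")
    .agg(first("avg_high_temp"))
)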
A data engineer is registering a table in Databricks using the table users from an external SQL database. One of their colleagues gives them the following code to register the table. However, when the data engineer runs the code, they notice an error.
Code block:
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
url = "jdbc:sqlite:${DA.paths.ecommerce_db}"
)
Which of the following correctly identifies why running the code is resulting in an error? Select one response.
A username and password need to be added to OPTIONS.
USING JDBC needs to be changed to USING SQL.
None of these responses correctly identify the cause of the error.
CREATE TABLE needs to be changed to CREATE JDBC TABLE.
The line dbtable = "users" needs to be added to OPTIONS.
The line dbtable = "users" needs to be added to OPTIONS.
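A minimal sketch of the corrected statement, keeping the connection URL placeholder from the question and adding the missing dbtable option:

spark.sql("""
    CREATE TABLE IF NOT EXISTS users_jdbc
    USING JDBC
    OPTIONS (
      url = "jdbc:sqlite:${DA.paths.ecommerce_db}",
      dbtable = "users"
    )
""")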
A data engineer has a DataFrame events_df that has been registered against an external JSON file. The nested JSON fields have already been converted into struct types. The data engineer now needs to flatten the struct fields back into individual columns for the field event_type. The events_df DataFrame has the following schema:
date string
month string
event_type StructType<id string, size int>
Which of the following approaches allows the data engineer to retrieve id within event_type? Select one response.
They can use event_type.* to pull out id into its own column.
They can index the DataFrame by id.
They can use : syntax to access id in event_type.
They can use . syntax to access id in event_type.
They can use from_json() to parse the columns for id.
They can use . syntax to access id in event_type.
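A minimal sketch of both access patterns against the schema shown:

from pyspark.sql.functions import col

events_df.select(col("event_type.id"), col("event_type.size"))   # dot syntax into the struct
events_df.select("date", "month", "event_type.*")                # flatten every struct field into its own column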
A data engineer needs a reference to the results of a query that can be referenced across multiple queries within the scope of the environment session. The data engineer does not want the reference to exist outside of the scope of the environment session.
Which of the following approaches accomplishes this without explicitly dropping the data object? Select one response.
They can store the results of their query within a temporary view.
They can store the results of their query within a table.
They can store the results of their query within a view.
They can store the results of their query within a common table expression (CTE).
They can store the results of their query within a reusable user-defined function (UDF).
They can store the results of their query within a temporary view.
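A minimal sketch of registering query results as a temporary view from the DataFrame API, assuming the results are already in a DataFrame named results_df:

results_df.createOrReplaceTempView("query_results")

# Reusable across multiple queries in the same session; gone when the session ends
spark.sql("SELECT count(*) FROM query_results").show()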
A data engineer needs to query a JSON file whose location is represented by the variable path.
Which of the following commands do they need to use? Select one response.
SHOW TABLE json.`${path}`;
DISPLAY TABLE json.`${path}`;
RETURN json.`${path}`;
SELECT * FROM path LOCATION ${path};
SELECT * FROM json.`${path}`;
SELECT * FROM json.`${path}`;
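A minimal sketch of the same direct-path query issued from Python, with a hypothetical file location:

path = "/mnt/datasets/events.json"  # hypothetical path
events_df = spark.sql(f"SELECT * FROM json.`{path}`")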
A senior data engineer has registered the Python function create_users to be used by the rest of their team. They have written the function in Python 3.6. Another data engineer wants to know what the expected inputs are.
Which of the following is considered a best practice to do this? Select one response.
The data engineer can declare the function with @udf(“string”) syntax to specify the input and return types of the function.
The data engineer can add the input and output types to the table using @options().
The data engineer can add a comment to the table properties to clarify the input and return types of the function.
The data engineer can use type hints to clarify the input and return types of the function.
The data engineer can add a return string line to the end of their UDF to specify the input and return types of the function.
The data engineer can use type hints to clarify the input and return types of the function.
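A minimal sketch of type hints on a Python function; the real create_users signature is not shown in the question, so the arguments here are hypothetical:

def create_user(first_name: str, last_name: str) -> str:
    """Build a lowercase username from the given names."""
    return f"{first_name[0]}{last_name}".lower()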