Transform Data with PySpark (Databricks Q) Flashcards
A data engineer has created a DataFrame exploded_eventsDF from the table exploded_events, which is derived from the table defined here:
CREATE TABLE events (user_id string, event_name string, item_id string, events struct<coupon:string, event_id:string, event_revenue:double>);
They are using the following code, which applies multiple array transformations, to return a new DataFrame showing the unique collection of values in the event_name and items columns.
Code block:
from pyspark.sql.functions import array_distinct, collect_set, flatten
exploded_eventsDF
    .groupby("user_id")
    .agg(collect_set("event_name"), _____)
Which of the following lines of code fills in the blank to create the column event_history as a unique collection of events? Select one response.
flatten(array_distinct(events[event_id])).alias("event_history")
flatten(extract(events.event_id)).alias("event_history")
array_distinct(flatten(collect_set("events.event_id"))).alias("event_history")
array_distinct(extract(collect_set(events.event_id))).alias("event_history")
flatten(collect_set(explode(events:event_id))).alias("event_history")
array_distinct(flatten(collect_set("events.event_id"))).alias("event_history")
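For reference, a minimal sketch of the completed aggregation, assuming exploded_eventsDF exposes user_id, event_name, and an array-of-structs column events, so that events.event_id resolves to an array of IDs per row:

from pyspark.sql.functions import array_distinct, collect_set, flatten

event_historyDF = (exploded_eventsDF
    .groupby("user_id")
    .agg(collect_set("event_name").alias("event_names"),
         array_distinct(flatten(collect_set("events.event_id"))).alias("event_history"))
)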
A data engineer has created the Spark DataFrame item_purchasesDF below, which joins the previously created table sales (loaded into the DataFrame sales_df) with the Spark DataFrame items_df wherever the two have matching values in the sales_id column.
Code block:
sales_df = (spark
    .table("sales")
    .withColumn("item", explode("items"))
)

items_df = spark.table("item_lookup")

item_purchasesDF = (sales_df
    ______________________
)
Which of the following lines of code correctly fills in the blank? Select one response.
.merge(items_df, sales_df, on = "item_id")
.outerJoin(items_df, sales.sales_id == items_df.sales_id)
.join(items_df, sales.sales_id == items_df.sales_id, how = "cross")
.innerJoin(items_df, sales_df.sales_id == items_df.sales_id)
.join(items_df, sales.sales_id == items_df.sales_id)
.join(items_df, sales.sales_id == items_df.sales_id)
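A minimal sketch of the completed join, assuming both DataFrames expose a sales_id column:

item_purchasesDF = (sales_df
    .join(items_df, sales_df.sales_id == items_df.sales_id)
)

Note that .join() performs an inner join by default; innerJoin() and outerJoin() are not DataFrame methods, and merge() belongs to pandas, not PySpark.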
Which of the following statements about the difference between views and temporary views are correct? Select two responses.
Temporary views do not contain a preserved schema. Views are tied to a system-preserved temporary schema global_temp.
Temporary views have names that must be qualified. Views have names that must be unique.
Temporary views reside in the third layer of Unity Catalog’s three-level namespace. Views lie in the metastore.
Temporary views are session-scoped and dropped when the Spark session ends. Views can be accessed after the session ends.
Temporary views skip persisting the definition in the underlying metastore. Views have metadata that can be accessed in the view’s directory.
Temporary views are session-scoped and dropped when the Spark session ends. Views can be accessed after the session ends.
Temporary views skip persisting the definition in the underlying metastore. Views have metadata that can be accessed in the view’s directory.
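A minimal sketch of the distinction, assuming a table named events already exists:

# Dropped automatically when the Spark session ends; nothing is written to the metastore
spark.sql("CREATE OR REPLACE TEMP VIEW events_temp_vw AS SELECT * FROM events")

# The view definition is persisted in the metastore and remains queryable in later sessions
spark.sql("CREATE OR REPLACE VIEW events_vw AS SELECT * FROM events")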
A data engineer is using the following code block to create a table using an external CSV file as its source. They need to specify that the fields are separated by | and that there is a header.
Code block:
CREATE TABLE IF NOT EXISTS sales_csv
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
_____
LOCATION "${dbfs:/mnt/datasets}"
Which of the following correctly fills in the blank for the table options? Select one response.
COMMENTS (
header = "true",
delimiter = "|"
)
OPTIONS (
header = "true",
delimiter = "|"
)
VALUES (
header = "true",
delimiter = "|"
)
KEYS (
header = "true",
delimiter = "|"
)
TBLPROPERTIES (
header = "true",
delimiter = "|"
)
OPTIONS (
header = "true",
delimiter = "|"
)
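A minimal sketch of the full statement with the OPTIONS clause filled in; the LOCATION value is taken verbatim from the question and would need a real storage path in practice:

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_csv
      (order_id LONG, email STRING, transactions_timestamp LONG,
       total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE,
       unique_items INTEGER, items STRING)
    USING CSV
    OPTIONS (header = "true", delimiter = "|")
    LOCATION "${dbfs:/mnt/datasets}"
""")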
A data engineer has created a custom Python function that takes an input variable of type double and returns it incremented by 1.
Which of the following identifies the most efficient type of UDF the engineer can use in terms of complexity, as well as the correct method to specify the type of UDF? Select one response.
The data engineer can specify that the UDF is vectorized using type hints.
The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
The data engineer can specify that the UDF has row-at-a-time execution using @pandas_udf("double", PandasUDFType.VECTORIZED) syntax.
The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.VECTORIZED) syntax.
The data engineer can specify that the UDF has row-at-a-time execution using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
The data engineer can specify that the UDF is vectorized using @pandas_udf("double", PandasUDFType.SCALAR) syntax.
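A minimal sketch of the vectorized (pandas) UDF described in the question, written in the Spark 3.x type-hint style; the @pandas_udf("double", PandasUDFType.SCALAR) decorator is the older, pre-type-hint equivalent:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def plus_one(values: pd.Series) -> pd.Series:
    # Operates on whole pandas Series at a time rather than row by row
    return values + 1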
Which of the following statements about querying tables defined against external sources is true? Select one response.
When defining tables or queries against external data sources, older cached versions of the table are automatically added to the event log.
When defining tables or queries against external data sources, older cached versions of the table are automatically deleted.
None of these statements about external table behavior are true.
When defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed.
When defining tables or queries against external data sources, the storage path, external location, and storage credential are displayed for users who have been granted USAGE access to the table.
When defining tables or queries against external data sources, the performance guarantees associated with Delta Lake and Lakehouse cannot be guaranteed.
Which of the following identifies a performance challenge associated with Python UDFs? Select three responses.
Python UDFs cannot contain column aggregations or data optimizations.
Python UDFs cannot use vector or scalar transformations.
Python UDFs cannot be optimized by the Catalyst Optimizer.
Python UDFs have to deserialize row data from Spark’s native binary format to pass to the UDF, and the results are serialized back into Spark’s native format.
Python UDFs have interprocess communication overhead between the executor and a Python interpreter running on each worker node.
Python UDFs cannot be optimized by the Catalyst Optimizer.
Python UDFs have to deserialize row data from Spark’s native binary format to pass to the UDF, and the results are serialized back into Spark’s native format.
Python UDFs have interprocess communication overhead between the executor and a Python interpreter running on each worker node.
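A minimal sketch of the overhead being described, using a hypothetical DataFrame df with a numeric column price:

from pyspark.sql.functions import udf, col

@udf("double")
def plus_one_py(value):
    # Each row is deserialized from Spark's binary format, handed to the Python
    # interpreter on the worker, and the result serialized back; Catalyst cannot
    # optimize inside this function.
    return value + 1.0 if value is not None else None

df.select(plus_one_py(col("price")))   # Python UDF: serialization + interprocess overhead
df.select(col("price") + 1.0)          # Built-in expression: fully optimized by Catalyst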
A data engineer has a DataFrame events_df that has been registered against an external JSON file. They need to access the field date within events_df. The events_df DataFrame has the following schema:
date string
month string
event_type string
Which of the following approaches can the data engineer use to accomplish this? Select one response.
They can use . syntax to access date in events_df.
They can index the query by subfield using events[date] syntax.
They can use date.* to pull out the subfields of events_df into their own columns.
They can use from_json() to parse the column for date.
They can use : syntax to access date in events_df.
They can use : syntax to access date in events_df.
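For context, in Databricks SQL the colon syntax extracts fields from columns that hold JSON-formatted strings. A minimal sketch, assuming the raw JSON is held in a hypothetical string column named value:

events_df.selectExpr("value:date", "value:month", "value:event_type")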
Which of the following lines of code counts null values in the column email from the DataFrame usersDF? Select two responses.
usersDF.selectExpr("count_if(email IS NULL)")
usersDF.distinct()
usersDF.drop()
usersDF.where(col("email").isNull()).count()
usersDF.count().dropna()
usersDF.selectExpr("count_if(email IS NULL)")
usersDF.where(col("email").isNull()).count()
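Both approaches can be verified with a small sketch, assuming usersDF has a string column email:

from pyspark.sql.functions import col

usersDF.where(col("email").isNull()).count()                            # returns a Python int
usersDF.selectExpr("count_if(email IS NULL) AS missing_emails").show()  # returns a one-row DataFrame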
A data engineer has a table high_temps with the following schema, where avg_high_temp represents the monthly average high temperatures for each unique year-month combination.
year string
month string
avg_high_temp string
They need to reformat the data with years as the primary record key and months as the columns. The existing average high temperature value for each year-month combination needs to be in the month columns.
How can the data engineer accomplish this? Select one response.
The data engineer can rotate the data from wide to long format using the .pivot() function.
The data engineer can rotate the data from long to wide format using the .pivot() function.
The data engineer can rotate the data from long to wide format using the .groupBy() clause.
The data engineer can rotate the data from long to wide format using the .transform() clause.
The data engineer can rotate the data from wide to long format using the .transform() clause.
The data engineer can rotate the data from long to wide format using the .pivot() function.
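A minimal sketch of the long-to-wide rotation, assuming the table is loaded into a DataFrame first:

from pyspark.sql.functions import first

high_temps_df = spark.table("high_temps")

wide_df = (high_temps_df
    .groupBy("year")
    .pivot("month")
    .agg(first("avg_high_temp"))
)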
A data engineer is registering a table in Databricks using the table users from an external SQL database. One of their colleagues gives them the following code to register the table. However, when the data engineer runs the code, they notice an error.
Code block:
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
url = "jdbc:sqlite:${DA.paths.ecommerce_db}"
)
Which of the following correctly identifies why running the code is resulting in an error? Select one response.
A username and password need to be added to OPTIONS.
USING JDBC needs to be changed to USING SQL.
None of these responses correctly identify the cause of the error.
CREATE TABLE needs to be changed to CREATE JDBC TABLE.
The line dbtable = "users" needs to be added to OPTIONS.
The line dbtable = "users" needs to be added to OPTIONS.
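A minimal sketch of the corrected statement, keeping the connection URL placeholder from the question and adding the missing dbtable option:

spark.sql("""
    CREATE TABLE IF NOT EXISTS users_jdbc
    USING JDBC
    OPTIONS (
      url = "jdbc:sqlite:${DA.paths.ecommerce_db}",
      dbtable = "users"
    )
""")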
A data engineer has a DataFrame events_df that has been registered against an external JSON file. The nested JSON fields have already been converted into struct types. The data engineer now needs to flatten the struct fields back into individual columns for the field event_type. The events_df DataFrame has the following schema:
date string
month string
event_type StructType<id string, size int>
Which of the following approaches allows the data engineer to retrieve id within event_type? Select one response.
They can use event_type.* to pull out id into its own column.
They can index the DataFrame by id.
They can use : syntax to access id in event_type.
They can use . syntax to access id in event_type.
They can use from_json() to parse the columns for id.
They can use . syntax to access id in event_type.
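A minimal sketch of both access patterns against the schema shown:

from pyspark.sql.functions import col

events_df.select(col("event_type.id"), col("event_type.size"))   # dot syntax into the struct
events_df.select("date", "month", "event_type.*")                # flatten every struct field into its own column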
A data engineer needs a reference to the results of a query that can be referenced across multiple queries within the scope of the environment session. The data engineer does not want the reference to exist outside of the scope of the environment session.
Which of the following approaches accomplishes this without explicitly dropping the data object? Select one response.
They can store the results of their query within a temporary view.
They can store the results of their query within a table.
They can store the results of their query within a view.
They can store the results of their query within a common table expression (CTE).
They can store the results of their query within a reusable user-defined function (UDF).
They can store the results of their query within a temporary view.
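A minimal sketch of registering query results as a temporary view from the DataFrame API, assuming the results are already in a DataFrame named results_df:

results_df.createOrReplaceTempView("query_results")

# Reusable across multiple queries in the same session; gone when the session ends
spark.sql("SELECT count(*) FROM query_results").show()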
A data engineer needs to query a JSON file whose location is represented by the variable path.
Which of the following commands do they need to use? Select one response.
SHOW TABLE json.`${path}`;
DISPLAY TABLE json.`${path}`;
RETURN json.`${path}`;
SELECT * FROM path LOCATION ${path};
SELECT * FROM json.`${path}`;
SELECT * FROM json.`${path}`;
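A minimal sketch of the same direct-path query issued from Python, with a hypothetical file location:

path = "/mnt/datasets/events.json"  # hypothetical path
events_df = spark.sql(f"SELECT * FROM json.`{path}`")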
A senior data engineer has registered the Python function create_users to be used by the rest of their team. They have written the function in Python 3.6. Another data engineer wants to know what the expected inputs are.
Which of the following is considered a best practice to do this? Select one response.
The data engineer can declare the function with @udf(“string”) syntax to specify the input and return types of the function.
The data engineer can add the input and output types to the table using @options().
The data engineer can add a comment to the table properties to clarify the input and return types of the function.
The data engineer can use type hints to clarify the input and return types of the function.
The data engineer can add a return string line to the end of their UDF to specify the input and return types of the function.
The data engineer can use type hints to clarify the input and return types of the function.
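A minimal sketch of type hints on a Python function; the real create_users signature is not shown in the question, so the arguments here are hypothetical:

def create_user(first_name: str, last_name: str) -> str:
    """Build a lowercase username from the given names."""
    return f"{first_name[0]}{last_name}".lower()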