DataFrame API Flashcards
The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.
Code block:
transactionsDf.join(itemsDf, "itemId", how="broadcast")
A. The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
B. broadcast is not a valid join type.
C. Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
D. The join method should be replaced by the broadcast method.
E. The syntax is wrong, how= should be removed from the code block.
B. broadcast is not a valid join type.
Correct! The code block should read transactionsDf.join(broadcast(itemsDf), "itemId"). This implies an inner join (the default in DataFrame.join()); since the question does not specify a join type, that is a valid choice.
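A minimal sketch of the corrected code, assuming transactionsDf and itemsDf already exist and share an itemId column:
from pyspark.sql.functions import broadcast
# Broadcast the smaller DataFrame (itemsDf) so every executor receives a full copy;
# the join type defaults to "inner" when it is not specified.
joinedDf = transactionsDf.join(broadcast(itemsDf), "itemId")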
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?
A.
itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
B.
itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
C.
itemsDf.merge(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
D.
itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
E.
itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))
D.
The issue with all answers that have "broadcast" as the very last argument is that "broadcast" is not a valid join type. While the option with "right_outer" is a valid statement, it is not a broadcast join. The option where broadcast() is wrapped around the equality condition is not valid Spark code; broadcast() needs to be wrapped around the name of the small DataFrame that should be broadcast.
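A hedged sketch of answer D, assuming both DataFrames exist with the columns named in the question:
from pyspark.sql.functions import broadcast
# transactionsDf is the small DataFrame here, so it is the one wrapped in broadcast().
# The join condition pairs itemsDf.itemId with transactionsDf.storeId.
joinedDf = itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)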
Which of the following code blocks efficiently converts DataFrame transactionsDf from 12 into 24 partitions?
A. transactionsDf.coalesce(24)
B. transactionsDf.repartition()
C. transactionsDf.repartition("itemId", 24)
D. transactionsDf.repartition(24, boost=True)
E. transactionsDf.repartition(24)
E. transactionsDf.repartition(24)
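A short sketch of why repartition() is the right tool here (the sample DataFrame is made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Hypothetical DataFrame with 12 partitions.
df = spark.range(100).repartition(12)
# coalesce() can only reduce the number of partitions, so coalesce(24) would leave it at 12.
print(df.coalesce(24).rdd.getNumPartitions())    # 12
# repartition(24) performs a full shuffle and produces the requested 24 partitions.
print(df.repartition(24).rdd.getNumPartitions())  # 24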
Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
A. DataFrame.repartition(6)
B. DataFrame.repartition(12)
C. DataFrame.coalesce(6)
D. DataFrame.coalesce(6, shuffle=True)
E. DataFrame.coalesce(6).shuffle()
A.
DataFrame.repartition(6)
Correct. repartition() always triggers a full shuffle (different from coalesce()).
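A quick sketch contrasting the two operators in the query plan (the DataFrame is made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(100).repartition(12)  # hypothetical DataFrame with 12 partitions
# repartition(6) shows an Exchange step in the plan, i.e. a full shuffle.
df.repartition(6).explain()
# coalesce(6) shows a Coalesce step instead: existing partitions are merged without a full shuffle.
df.coalesce(6).explain()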
The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.
Code block:
transactionsDf.coalesce(14, ("storeId", "transactionDate"))
A. Operator coalesce needs to be replaced by repartition
B. Operator coalesce needs to be replaced by repartition, and the parentheses around the column names need to be replaced by square brackets.
C. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.
D. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.
E. The parentheses around the column names need to be removed, and .select() needs to be appended to the code block.
D.
Correct code block:
transactionsDf.repartition(14, "storeId", "transactionDate").count()
Since we do not know how many partitions DataFrame transactionsDf has, we cannot safely use coalesce: it would not make any change if the current number of partitions were already smaller than 14. So, we need to use repartition.
In the Spark documentation, the call structure for repartition is shown as DataFrame.repartition(numPartitions, *cols). The * operator means that any argument after numPartitions is interpreted as a column. Therefore, the parentheses around the column names need to be removed.
Finally, the question specifies that after execution the DataFrame should have been divided. So, indirectly, this question asks us to append an action to the code block. Since .select() is a transformation, the only possible choice here is .count().
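A hedged sketch of the corrected block, assuming transactionsDf exists with columns storeId and transactionDate:
# Columns go into repartition() as separate arguments (the *cols parameter), not inside a tuple or list.
repartitionedDf = transactionsDf.repartition(14, "storeId", "transactionDate")
# count() is an action; without it the repartitioning only exists in the lazy plan and nothing is actually divided.
repartitionedDf.count()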
Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?
A. transactionsDf.dropna(thresh=2)
B. transactionsDf.dropna("any")
C. transactionsDf.dropna(thresh=4)
D. transactionsDf.drop.na("", 2)
E. transactionsDf.drop.na("", 4)
C.
transactionsDf.dropna(thresh=4)
Correct. Note that when the thresh keyword argument is given, the how keyword argument is ignored. Also, figuring out which value to set for thresh can be difficult, especially when under pressure in the exam. Here, I recommend you use your notes to create a "simulation" of what different values for thresh would do to a DataFrame, as sketched below.
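A runnable simulation of the thresh logic (the 6-column sample data is made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Row 1 has 6 non-null values, row 2 has 4, row 3 has only 3 (i.e. missing data in 3 columns).
df = spark.createDataFrame(
    [(1, "a", "b", "c", "d", "e"),
     (2, "a", "b", "c", None, None),
     (3, "a", "b", None, None, None)],
    ["c1", "c2", "c3", "c4", "c5", "c6"])
# thresh=4 keeps only rows with at least 4 non-null values, so row 3 is dropped.
df.dropna(thresh=4).show()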
The code block displayed below contains an error. The code block is intended to return all columns of DataFrame transactionsDf except for columns predError, productId, and value. Find the error.
(A sample table of transactionsDf is shown here; it shows that all columns are still present.)
Code block:
transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))
A. The select operator should be replaced by the drop operator.
B. The column names in the select operator should not be strings and wrapped in the col operator, so they should be expressed like select(~col(predError), ~col(productId), ~col(value)).
C. transactionsDf.drop("predError", "productId", "value")
D. The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value wrapped in the col operator, so they should be expressed like drop(col(predError), col(productId), col(value)).
E. The select operator should be replaced with the deselect operator.
C. transactionsDf.drop("predError", "productId", "value")
The code block shown below should return a column that indicates through boolean values whether rows in DataFrame transactionsDf have values greater than or equal to 20 and smaller than or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__((__2__.__3__) __4__ (__5__))
A.
1. select
2. col("storeId")
3. between(20, 30)
4. &
5. col("productId") == 2
B.
1. select
2. col("storeId")
3. between(20, 30)
4. and
5. col("productId") == 2
C.
1. where
2. col("storeId")
3. geq(20).leq(30)
4. &
5. col("productId") == 2
D.
1. select
2. col("storeId")
3. between(20, 30)
4. &&
5. col("productId") == 2
E.
1. select
2. "storeId"
3. between(20, 30)
4. &&
5. col("productId") == 2
A. Correct code block:
transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2))
Although this question may make you think that it asks for a filter or where statement, it does not. It explicitly asks to return a column of booleans, and that should point you to the select statement.
Another trick here is the rarely used between() method. It exists and resolves to ((storeId >= 20) AND (storeId <= 30)) in SQL. geq() and leq() do not exist.
Another riddle here is how to chain the two conditions. The only valid answer here is &. Operators like && or and are not valid. Other boolean operators that would be valid in Spark are | and ~.
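A minimal sketch of the correct answer, assuming transactionsDf exists with columns storeId and productId:
from pyspark.sql.functions import col
# select() returns a single boolean column; between(20, 30) is inclusive on both ends,
# and & chains the two conditions (each wrapped in its own parentheses).
boolDf = transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2))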
Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
(The sample shows all columns in the DataFrame; associateId is not among them.)
A. transactionsDf.withColumnRemoved("predError", "productId")
B. transactionsDf.dropColumns(["predError", "productId", "associateId"])
C. transactionsDf.drop("predError", "productId", "associateId")
D. transactionsDf.dropColumns("predError", "productId", "associateId")
E. transactionsDf.drop(col("predError", "productId"))
C. transactionsDf.drop("predError", "productId", "associateId")
The key here is to understand that columns that are passed to DataFrame.drop() are ignored if they do not exist in the DataFrame. So, passing column name associateId to transactionsDf.drop() does not have any effect.
Passing a list to transactionsDf.drop() is not valid. The documentation shows the call structure as DataFrame.drop(*cols). The * means that all arguments passed to DataFrame.drop() are read as columns. However, since a list of columns, for example ["predError", "productId", "associateId"], is not a column, Spark will run into an error.
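A short sketch of the drop() behavior described above, assuming transactionsDf exists:
# Column names that are not in the DataFrame (here: associateId) are silently ignored.
reducedDf = transactionsDf.drop("predError", "productId", "associateId")
# Passing a list instead of separate string arguments raises an error, because drop(*cols)
# expects each argument to be a column name or Column:
# transactionsDf.drop(["predError", "productId", "associateId"])  # error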
Which of the following code blocks returns a new DataFrame with only columns predError and value of every second row of DataFrame transactionsDf?
Entire DataFrame transactionsDf:
(The sample shows that the DataFrame has six rows.)
A. transactionsDf.filter(col("transactionsId") isin([2,4,6])).select(["predError", "value"])
B. transactionsDf.select(col("transactionsId") isin([2,4,6])).select("predError", "value")
C. transactionsDf.filter("transactionsId" % 2 == 0).select("predError", "value")
D. transactionsDf.select("transactionsId" % 2 == 0).select("predError", "value")
E. transactionsDf.filter(col("transactionsId") % 2 == 0).select("predError", "value")
E. transactionsDf.filter(col("transactionsId") % 2 == 0).select("predError", "value")
This is not an easy question to solve. You need to know that % is the modulo operator in Python. The expression col("transactionsId") % 2 == 0 is true for every second row, since the transactionsId values run from 1 to 6 and the even values are exactly every second row.
The other answers are wrong because they apply the % operator to a plain Python string, use isin without proper method syntax, or use select where filter is needed.
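A hedged sketch of the correct answer, assuming transactionsDf exists with an integer transactionsId column:
from pyspark.sql.functions import col
# transactionsId % 2 == 0 keeps the rows with even IDs, i.e. every second row when the IDs run from 1 to 6.
everySecondDf = transactionsDf.filter(col("transactionsId") % 2 == 0).select("predError", "value")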
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
Sample of DataFrame itemsDf:
(The sample shows various supplier names, some of which contain the letter X.)
A. itemsDf.filter(col('supplier').not_contains('X')).select('supplier').distinct()
B. itemsDf.select(~col('supplier').contains('X')).distinct()
C. itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()
D. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()
E. itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()
D. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()
Key to managing this question is understanding which operator to use to get the opposite of an operation: the ~ (not) operator. In addition, you should know that there is no unique() method on a DataFrame.
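A minimal sketch of the correct answer, assuming itemsDf exists with a supplier column:
from pyspark.sql.functions import col
# ~ negates the boolean column returned by contains(); distinct() removes duplicate supplier names
# (there is no unique() method on a DataFrame).
suppliersDf = itemsDf.filter(~col("supplier").contains("X")).select("supplier").distinct()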
Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?
A. from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
B. transactionsDf.persist()
C. transactionsDf.clear_persist()
D. transactionsDf.storage_level('MEMORY_ONLY')
E. from pyspark import StorageLevel
transactionsDf.cache(StorageLevel.MEMORY_ONLY)
A. from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed.
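A short sketch of the correct answer, assuming transactionsDf exists:
from pyspark import StorageLevel
# MEMORY_ONLY keeps partitions in memory only; partitions that do not fit are recomputed from the lineage when needed.
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
# An action such as count() materializes the cached data.
transactionsDf.count()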
The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.
A sample of DataFrame itemsDf is below.
(The sample shows that the attributes column holds arrays; they are not exploded.)
Code block:
itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")
A. The alias() method needs to be called after the select() method.
B. explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
C. The split() method should be used inside the select() method instead of the explode() method.
D. Since itemId is the index, it does not need to be an argument to the select() method.
E. The explode() method expects a Column object rather than a string.
B. explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
The correct code block looks like this:
from pyspark.sql.functions import explode
itemsAttributesDf = itemsDf.select("itemId", explode("attributes").alias("attribute"))
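A runnable illustration with made-up sample rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
# Hypothetical sample: attributes holds an array per row.
itemsDf = spark.createDataFrame([(1, ["red", "small"]), (2, ["blue"])], ["itemId", "attributes"])
# explode() emits one row per array element, keeping the row's itemId:
# (1, red), (1, small), (2, blue)
itemsDf.select("itemId", explode("attributes").alias("attribute")).show()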
In which order should the code blocks shown below be run in order to create a table of all values in column attributes next to the respective values in column supplier in DataFrame itemsDf?
1. itemsDf.createOrReplaceView("itemsDf")
2. spark.sql("SELECT 'supplier', explode('Attributes') FROM itemsDf")
3. spark.sql("SELECT supplier, explode(attributes) FROM itemsDf")
4. itemsDf.createOrReplaceTempView("itemsDf")
A. 4, 3
B. 1, 3
C. 2
D. 4, 2
E. 1, 2
A. 4, 3
When initially reading this question, you may think that “next to the respective values” indicates there is a join operation involved. However, once inspecting the answer possibilities, you may notice that this question is really about the explode() function.
If this is the case, then we can make the assumption that, in every row, column supplier contains single values and column attributes contains arrays.
Given that attributes (the actual column name) is capitalized and the column names are wrapped in single quotes in the answer spark.sql("SELECT 'supplier', explode('Attributes') FROM itemsDf"), this answer cannot be correct.
Now, how about the command that generates a view which can be accessed via SQL? Even though you might think DataFrame.createOrReplaceView() exists in PySpark, it actually does not. The only valid command here is DataFrame.createOrReplaceTempView().
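A minimal sketch of the correct sequence, assuming itemsDf and an active SparkSession spark exist:
# Block 4: a temp view has to be registered before the DataFrame can be queried with SQL.
itemsDf.createOrReplaceTempView("itemsDf")
# Block 3: column names in SQL are not quoted, and explode() works in Spark SQL as well.
resultDf = spark.sql("SELECT supplier, explode(attributes) FROM itemsDf")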
The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__(__4__)
A.
1. filter
2. col("storeId") == 25
3. collect
4. 5
B.
1. filter
2. storeId == 25
3. head
4. 5
C.
1. filter
2. col("storeId") == 25
3. take
4. 5
D.
1. filter
2. col("storeId") == 25
3. toLocalIterator
4. 5
E.
1. filter
2. "storeId" == 25
3. collect
4. 5
C.
The correct code block is:
transactionsDf.filter(col("storeId") == 25).take(5)
The options with collect will not work because collect() does not take any arguments, yet in both cases the argument 5 is given.
The option with toLocalIterator will not work because the only argument to toLocalIterator is prefetchPartitions, which is a boolean, so passing 5 here does not make sense.
The option using head will not work because the expression passed to filter is not valid: storeId is a bare name, not a column reference. It would work if the expression were col("storeId") == 25.
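A hedged sketch of the correct code block, assuming transactionsDf exists with a storeId column:
from pyspark.sql.functions import col
# take(5) returns at most 5 matching rows as a Python list of Row objects;
# collect() would return all matching rows and accepts no row-limit argument.
rows = transactionsDf.filter(col("storeId") == 25).take(5)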