DataFrame API Flashcards
The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.
Code block:
transactionsDf.join(itemsDf, "itemId", how="broadcast")
A. The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
B. broadcast is not a valid join type.
C. Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
D. The join method should be replaced by the broadcast method.
E. The syntax is wrong, how= should be removed from the code block.
B. broadcast is not a valid join type.
Correct! The code block should read transactionsDf.join(broadcast(itemsDf), "itemId"). This implies an inner join (the default in DataFrame.join()); since the question does not specify a join type, that is a valid choice.
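A minimal sketch of the corrected code, assuming transactionsDf and itemsDf already exist and share an itemId column:
from pyspark.sql.functions import broadcast
# Broadcast the smaller DataFrame (itemsDf) so every executor receives a full copy;
# the join type defaults to "inner" when it is not specified.
joinedDf = transactionsDf.join(broadcast(itemsDf), "itemId")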
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?
A.
itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
B.
itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
C.
itemsDf.merge(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
D.
itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
E.
itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))
D.
The issue with all answers that have "broadcast" as the very last argument is that "broadcast" is not a valid join type. While the option with "right_outer" is a valid statement, it is not a broadcast join. The option where broadcast() is wrapped around the equality condition is not valid Spark code; broadcast() needs to be wrapped around the name of the small DataFrame that should be broadcast.
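A hedged sketch of answer D, assuming both DataFrames exist with the columns named in the question:
from pyspark.sql.functions import broadcast
# transactionsDf is the small DataFrame here, so it is the one wrapped in broadcast().
# The join condition pairs itemsDf.itemId with transactionsDf.storeId.
joinedDf = itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)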
Which of the following code blocks efficiently converts DataFrame transactionsDf from 12 into 24 partitions?
A. transactionsDf.coalesce(24)
B. transactionsDf.repartition()
C. transactionsDf.repartition("itemId", 24)
D. transactionsDf.repartition(24, boost=True)
E. transactionsDf.repartition(24)
E. transactionsDf.repartition(24)
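A short sketch of why repartition() is the right tool here (the sample DataFrame is made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Hypothetical DataFrame with 12 partitions.
df = spark.range(100).repartition(12)
# coalesce() can only reduce the number of partitions, so coalesce(24) would leave it at 12.
print(df.coalesce(24).rdd.getNumPartitions())    # 12
# repartition(24) performs a full shuffle and produces the requested 24 partitions.
print(df.repartition(24).rdd.getNumPartitions())  # 24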
Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
A. DataFrame.repartition(6)
B. DataFrame.repartition(12)
C. DataFrame.coalesce(6)
D. DataFrame.coalesce(6, shuffle=True)
E. DataFrame.coalesce(6).shuffle()
A.
DataFrame.repartition(6)
Correct. repartition() always triggers a full shuffle (different from coalesce()).
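A quick sketch contrasting the two operators in the query plan (the DataFrame is made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(100).repartition(12)  # hypothetical DataFrame with 12 partitions
# repartition(6) shows an Exchange step in the plan, i.e. a full shuffle.
df.repartition(6).explain()
# coalesce(6) shows a Coalesce step instead: existing partitions are merged without a full shuffle.
df.coalesce(6).explain()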
The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.
Code block:
transactionsDf.coalesce(14, ("storeId", "transactionDate"))
A. Operator coalesce needs to be replaced by repartition
B. Operator coalesce needs to be replaced by repartition, and the parentheses around the column names need to be replaced by square brackets.
C. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.
D. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.
E. The parentheses around the column names need to be removed, and .select() needs to be appended to the code block.
D.
Correct code block:
transactionsDf.repartition(14, "storeId", "transactionDate").count()
Since we do not know how many partitions DataFrame transactionsDf has, we cannot safely use coalesce: it would not make any change if the current number of partitions were already smaller than 14. So, we need to use repartition.
In the Spark documentation, the call structure for repartition is shown as DataFrame.repartition(numPartitions, *cols). The * operator means that any argument after numPartitions is interpreted as a column. Therefore, the parentheses around the column names need to be removed.
Finally, the question specifies that after execution the DataFrame should have been divided. So, indirectly, this question asks us to append an action to the code block. Since .select() is a transformation, the only possible choice here is .count().
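A hedged sketch of the corrected block, assuming transactionsDf exists with columns storeId and transactionDate:
# Columns go into repartition() as separate arguments (the *cols parameter), not inside a tuple or list.
repartitionedDf = transactionsDf.repartition(14, "storeId", "transactionDate")
# count() is an action; without it the repartitioning only exists in the lazy plan and nothing is actually divided.
repartitionedDf.count()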
Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?
A. transactionsDf.dropna(thresh=2)
B. transactionsDf.dropna("any")
C. transactionsDf.dropna(thresh=4)
D. transactionsDf.drop.na("", 2)
E. transactionsDf.drop.na("", 4)
C.
transactionsDf.dropna(thresh=4)
Correct. Note that when the thresh keyword argument is given, the how keyword argument is ignored. Also, figuring out which value to set for thresh can be difficult, especially when under pressure in the exam. Here, I recommend you use your notes to create a "simulation" of what different values for thresh would do to a DataFrame, as sketched below.
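A runnable simulation of the thresh logic (the 6-column sample data is made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Row 1 has 6 non-null values, row 2 has 4, row 3 has only 3 (i.e. missing data in 3 columns).
df = spark.createDataFrame(
    [(1, "a", "b", "c", "d", "e"),
     (2, "a", "b", "c", None, None),
     (3, "a", "b", None, None, None)],
    ["c1", "c2", "c3", "c4", "c5", "c6"])
# thresh=4 keeps only rows with at least 4 non-null values, so row 3 is dropped.
df.dropna(thresh=4).show()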
The code block displayed below contains an error. The code block is intended to return all columns of DataFrame transactionsDf except for columns predError, productId, and value. Find the error.
(A sample table of transactionsDf is shown here; it shows that all columns are still present.)
Code block:
transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))
A. The select operator should be replaced by the drop operator.
B. The column names in the select operator should not be strings and wrapped in the col operator, so they should be expressed like select(~col(predError), ~col(productId), ~col(value)).
C. transactionsDf.drop("predError", "productId", "value")
D. The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value wrapped in the col operator, so they should be expressed like drop(col(predError), col(productId), col(value)).
E. The select operator should be replaced with the deselect operator.
C. transactionsDf.drop("predError", "productId", "value")
The code block shown below should return a column that indicates through boolean values whether rows in DataFrame transactionsDf have values greater than or equal to 20 and smaller than or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__((__2__.__3__) __4__ (__5__))
A.
1. select
2. col("storeId")
3. between(20, 30)
4. &
5. col("productId") == 2
B.
1. select
2. col("storeId")
3. between(20, 30)
4. and
5. col("productId") == 2
C.
1. where
2. col("storeId")
3. geq(20).leq(30)
4. &
5. col("productId") == 2
D.
1. select
2. col("storeId")
3. between(20, 30)
4. &&
5. col("productId") == 2
E.
1. select
2. "storeId"
3. between(20, 30)
4. &&
5. col("productId") == 2
A. Correct code block:
transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2))
Although this question may make you think that it asks for a filter or where statement, it does not. It explicitly asks to return a column of booleans, and that should point you to the select statement.
Another trick here is the rarely used between() method. It exists and resolves to ((storeId >= 20) AND (storeId <= 30)) in SQL. geq() and leq() do not exist.
Another riddle here is how to chain the two conditions. The only valid answer here is &. Operators like && or and are not valid. Other boolean operators that would be valid in Spark are | and ~.
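A minimal sketch of the correct answer, assuming transactionsDf exists with columns storeId and productId:
from pyspark.sql.functions import col
# select() returns a single boolean column; between(20, 30) is inclusive on both ends,
# and & chains the two conditions (each wrapped in its own parentheses).
boolDf = transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2))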
Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
(The sample shows all columns in the DataFrame; associateId is not among them.)
A. transactionsDf.withColumnRemoved("predError", "productId")
B. transactionsDf.dropColumns(["predError", "productId", "associateId"])
C. transactionsDf.drop("predError", "productId", "associateId")
D. transactionsDf.dropColumns("predError", "productId", "associateId")
E. transactionsDf.drop(col("predError", "productId"))
C. transactionsDf.drop("predError", "productId", "associateId")
The key here is to understand that columns that are passed to DataFrame.drop() are ignored if they do not exist in the DataFrame. So, passing column name associateId to transactionsDf.drop() does not have any effect.
Passing a list to transactionsDf.drop() is not valid. The documentation shows the call structure as DataFrame.drop(*cols). The * means that all arguments passed to DataFrame.drop() are read as columns. However, since a list of columns, for example ["predError", "productId", "associateId"], is not a column, Spark will run into an error.
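A short sketch of the drop() behavior described above, assuming transactionsDf exists:
# Column names that are not in the DataFrame (here: associateId) are silently ignored.
reducedDf = transactionsDf.drop("predError", "productId", "associateId")
# Passing a list instead of separate string arguments raises an error, because drop(*cols)
# expects each argument to be a column name or Column:
# transactionsDf.drop(["predError", "productId", "associateId"])  # error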
Which of the following code blocks returns a new DataFrame with only columns predError and value of every second row of DataFrame transactionsDf?
Entire DataFrame transactionsDf:
(The sample shows that the DataFrame has six rows.)
A. transactionsDf.filter(col("transactionsId") isin([2,4,6])).select(["predError", "value"])
B. transactionsDf.select(col("transactionsId") isin([2,4,6])).select("predError", "value")
C. transactionsDf.filter("transactionsId" % 2 == 0).select("predError", "value")
D. transactionsDf.select("transactionsId" % 2 == 0).select("predError", "value")
E. transactionsDf.filter(col("transactionsId") % 2 == 0).select("predError", "value")
E. transactionsDf.filter(col("transactionsId") % 2 == 0).select("predError", "value")
This is not an easy question to solve. You need to know that % is the modulo operator in Python. The expression col("transactionsId") % 2 == 0 is true for every second row, since the transactionsId values run from 1 to 6 and the even values are exactly every second row.
The other answers are wrong because they apply the % operator to a plain Python string, use isin without proper method syntax, or use select where filter is needed.
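A hedged sketch of the correct answer, assuming transactionsDf exists with an integer transactionsId column:
from pyspark.sql.functions import col
# transactionsId % 2 == 0 keeps the rows with even IDs, i.e. every second row when the IDs run from 1 to 6.
everySecondDf = transactionsDf.filter(col("transactionsId") % 2 == 0).select("predError", "value")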
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
Sample of DataFrame itemsDf:
(The sample shows various supplier names, some of which contain the letter X.)
A. itemsDf.filter(col('supplier').not_contains('X')).select('supplier').distinct()
B. itemsDf.select(~col('supplier').contains('X')).distinct()
C. itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()
D. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()
E. itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()
D. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()
Key to managing this question is understanding which operator to use to get the opposite of an operation: the ~ (not) operator. In addition, you should know that there is no unique() method on a DataFrame.
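A minimal sketch of the correct answer, assuming itemsDf exists with a supplier column:
from pyspark.sql.functions import col
# ~ negates the boolean column returned by contains(); distinct() removes duplicate supplier names
# (there is no unique() method on a DataFrame).
suppliersDf = itemsDf.filter(~col("supplier").contains("X")).select("supplier").distinct()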
Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?
A. from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
B. transactionsDf.persist()
C. transactionsDf.clear_persist()
D. transactionsDf.storage_level('MEMORY_ONLY')
E. from pyspark import StorageLevel
transactionsDf.cache(StorageLevel.MEMORY_ONLY)
A. from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
Correct. Note that the storage level MEMORY_ONLY means that all partitions that do not fit into memory will be recomputed when they are needed.
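A short sketch of the correct answer, assuming transactionsDf exists:
from pyspark import StorageLevel
# MEMORY_ONLY keeps partitions in memory only; partitions that do not fit are recomputed from the lineage when needed.
transactionsDf.persist(StorageLevel.MEMORY_ONLY)
# An action such as count() materializes the cached data.
transactionsDf.count()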
The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.
A sample of DataFrame itemsDf is below.
(The sample shows that the attributes column holds arrays; they are not exploded.)
Code block:
itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")
A. The alias() method needs to be called after the select() method.
B. explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
C. The split() method should be used inside the select() method instead of the explode() method.
D. Since itemId is the index, it does not need to be an argument to the select() method.
E. The explode() method expects a Column object rather than a string.
B. explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
The correct code block looks like this:
from pyspark.sql.functions import explode
itemsAttributesDf = itemsDf.select("itemId", explode("attributes").alias("attribute"))
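A runnable illustration with made-up sample rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
# Hypothetical sample: attributes holds an array per row.
itemsDf = spark.createDataFrame([(1, ["red", "small"]), (2, ["blue"])], ["itemId", "attributes"])
# explode() emits one row per array element, keeping the row's itemId:
# (1, red), (1, small), (2, blue)
itemsDf.select("itemId", explode("attributes").alias("attribute")).show()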
In which order should the code blocks shown below be run in order to create a table of all values in column attributes next to the respective values in column supplier in DataFrame itemsDf?
1. itemsDf.createOrReplaceView("itemsDf")
2. spark.sql("SELECT 'supplier', explode('Attributes') FROM itemsDf")
3. spark.sql("SELECT supplier, explode(attributes) FROM itemsDf")
4. itemsDf.createOrReplaceTempView("itemsDf")
A. 4, 3
B. 1, 3
C. 2
D. 4, 2
E. 1, 2
A. 4, 3
When initially reading this question, you may think that “next to the respective values” indicates there is a join operation involved. However, once inspecting the answer possibilities, you may notice that this question is really about the explode() function.
If this is the case, then we can make the assumption that, in every row, column supplier contains single values and column attributes contains arrays.
Given that attributes (the actual column name) is capitalized and the column names are wrapped in single quotes in the answer spark.sql("SELECT 'supplier', explode('Attributes') FROM itemsDf"), this answer cannot be correct.
Now, how about the command that generates a view which can be accessed via SQL? Even though you might think DataFrame.createOrReplaceView() exists in PySpark, it actually does not. The only valid command here is DataFrame.createOrReplaceTempView().
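A minimal sketch of the correct sequence, assuming itemsDf and an active SparkSession spark exist:
# Block 4: a temp view has to be registered before the DataFrame can be queried with SQL.
itemsDf.createOrReplaceTempView("itemsDf")
# Block 3: column names in SQL are not quoted, and explode() works in Spark SQL as well.
resultDf = spark.sql("SELECT supplier, explode(attributes) FROM itemsDf")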
The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__(__4__)
A.
1. filter
2. col("storeId") == 25
3. collect
4. 5
B.
1. filter
2. storeId == 25
3. head
4. 5
C.
1. filter
2. col("storeId") == 25
3. take
4. 5
D.
1. filter
2. col("storeId") == 25
3. toLocalIterator
4. 5
E.
1. filter
2. "storeId" == 25
3. collect
4. 5
C.
The correct code block is:
transactionsDf.filter(col("storeId") == 25).take(5)
The options with collect will not work because collect() does not take any arguments, yet in both cases the argument 5 is given.
The option with toLocalIterator will not work because the only argument to toLocalIterator is prefetchPartitions, which is a boolean, so passing 5 here does not make sense.
The option using head will not work because the expression passed to filter is not valid: storeId is a bare name, not a column reference. It would work if the expression were col("storeId") == 25.
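A hedged sketch of the correct code block, assuming transactionsDf exists with a storeId column:
from pyspark.sql.functions import col
# take(5) returns at most 5 matching rows as a Python list of Row objects;
# collect() would return all matching rows and accepts no row-limit argument.
rows = transactionsDf.filter(col("storeId") == 25).take(5)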