Rows Flashcards
How do you filter a DataFrame, using where, to rows where a value is greater than a threshold
.where(year($"birthdate") > 1980)
How do you filter, using filter, to a specific month
.filter(month('birthdate) === 1)
Using a SQL expression, how do you filter a dataframe
.where("day(birthdate) > 15")
How do you check for inequality when filtering
=!=
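Example: a minimal sketch of the filtering cards above, assuming a local SparkSession. The object name, the people DataFrame, and its rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{month, to_date, year}

object FilterExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("filter-examples").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; names and dates are illustrative only.
    val people = Seq(
      ("Ana", "1975-01-20"),
      ("Ben", "1985-01-05"),
      ("Cy", "1992-07-30")
    ).toDF("name", "birthdate")
      .withColumn("birthdate", to_date($"birthdate")) // parse the strings into DateType

    people.where(year($"birthdate") > 1980).show()   // Column expression with where
    people.filter(month($"birthdate") === 1).show()  // filter behaves the same as where
    people.where("day(birthdate) > 15").show()       // SQL expression passed as a string
    people.where($"name" =!= "Ana").show()           // inequality on a Column

    spark.stop()
  }
}
```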
How do you make sure your data frame only has unique values taking all columns into account
.distinct
How can you remove duplicates taking only one column into account
.dropDuplicates("column")
How can you remove duplicates taking only a specific set of columns into account
.dropDuplicates(List("column1", "column2"))
If dropDuplicates does not have any columns passed in, which columns are taken into account to determine distinct values, or does it fail?
All columns, equivalent to .distinct
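Example: a small sketch comparing distinct and dropDuplicates, assuming a local SparkSession. The orders DataFrame and the object name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DedupExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dedup-examples").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data containing one fully duplicated row.
    val orders = Seq(
      ("c1", "i1"),
      ("c1", "i1"),
      ("c1", "i2"),
      ("c2", "i1")
    ).toDF("customer_id", "item_id")

    orders.distinct().show()                                    // unique across all columns
    orders.dropDuplicates().show()                              // no columns given: all columns are used
    orders.dropDuplicates("customer_id").show()                 // one row per customer_id; which row is kept is not guaranteed
    orders.dropDuplicates(Seq("customer_id", "item_id")).show() // unique per pair of columns

    spark.stop()
  }
}
```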
How can you filter out null values from a dataframe
.where($"column_object".isNotNull)
How can you drop rows with all null values
.na.drop(how = "all")
How do you remove a row where any value in the row is null
.na.drop("any")
How do you remove a row only if both of two specific columns are null
.na.drop("all", Seq("column_a", "column_b"))
How do you remove a row with a null value in either of two columns
.na.drop("any", Seq("column_a", "column_b"))
How can you replace all null values with "nope"
.na.fill("nope")
True/False
.na.fill("Nope") will only replace nulls where the column type is string
True
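Example: a sketch of the null-handling cards above, assuming a local SparkSession. The rows DataFrame is made up, with one string column and one integer column so the effect of na.fill on each type is visible.

```scala
import org.apache.spark.sql.SparkSession

object NullHandlingExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-examples").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: None becomes null in the DataFrame.
    val rows = Seq[(Option[String], Option[Int])](
      (Some("a"), Some(1)),
      (Some("b"), None),
      (None, None)
    ).toDF("column_a", "column_b")

    rows.where($"column_a".isNotNull).show()                // keep rows with a non-null column_a
    rows.na.drop("all").show()                              // drop rows where every column is null
    rows.na.drop("any").show()                              // drop rows where at least one column is null
    rows.na.drop("all", Seq("column_a", "column_b")).show() // drop only if both listed columns are null
    rows.na.drop("any", Seq("column_a", "column_b")).show() // drop if either listed column is null
    rows.na.fill("nope").show()                             // fills string columns only; column_b keeps its nulls

    spark.stop()
  }
}
```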
What is the default sorting order
ascending
How can you sort
.sort("column_name")
.orderBy("column_name")
.sort(expr("column_name"))
How do you sort using an expression
.sort(expr("column_name"))
Sort by month in a column where the type is date, and sort in descending order
.orderBy(expr("month(birthdate)").desc)
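Example: a sketch of the sorting cards above, assuming a local SparkSession. The people DataFrame and the object name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, expr, to_date}

object SortExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sort-examples").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; names and dates are illustrative only.
    val people = Seq(
      ("Ana", "1975-03-20"),
      ("Ben", "1985-01-05"),
      ("Cy", "1992-07-30")
    ).toDF("name", "birthdate")
      .withColumn("birthdate", to_date($"birthdate"))

    people.sort("name").show()                              // column name, ascending by default
    people.orderBy($"name").show()                          // orderBy is an alias of sort
    people.sort(expr("name")).show()                        // sort by an expression
    people.orderBy(expr("month(birthdate)").desc).show()    // descending via the Column method
    people.orderBy(desc("name")).show()                     // descending via the desc function

    spark.stop()
  }
}
```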
If you have two columns, one for customer id and one for items ids, how can you show how many items are associated with each customer
.groupBy("customer_id").agg(count("item_id").alias("total"))
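Example: a sketch of counting items per customer, assuming a local SparkSession. The orders DataFrame and the object name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object GroupCountExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("group-count").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: one row per (customer, item) association.
    val orders = Seq(
      ("c1", "i1"),
      ("c1", "i2"),
      ("c2", "i1")
    ).toDF("customer_id", "item_id")

    // One output row per customer_id, with the number of non-null item_id values.
    orders
      .groupBy("customer_id")
      .agg(count("item_id").alias("total"))
      .show()

    spark.stop()
  }
}
```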
What is it called when an operation causes Spark to move data across the cluster
A Shuffle
When one input partition contributes to multiple output partitions, what is it called
Wide Transformation or Wide Dependencies
What are some examples of wide transformations
groupBy, join, distinct, repartition, orderBy (coalesce is narrow: it merges existing partitions without a shuffle)
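Example: one way to check whether an operation shuffles is to print its physical plan and look for an Exchange node. This is a sketch assuming a local SparkSession and a made-up orders DataFrame; the exact plan text varies by Spark version, so no output is shown.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExplainExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("shuffle-explain").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val orders = Seq(
      ("c1", "i1"),
      ("c1", "i2"),
      ("c2", "i1")
    ).toDF("customer_id", "item_id")

    // Wide: rows with the same key must be brought together, so the plan
    // is expected to contain an Exchange (shuffle) step.
    orders.groupBy("customer_id").count().explain()

    // Wide: repartition redistributes rows across a new set of partitions.
    orders.repartition(4).explain()

    // For comparison: coalesce merges existing partitions and typically
    // shows a Coalesce step with no Exchange.
    orders.coalesce(1).explain()

    spark.stop()
  }
}
```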
For the year SQL function, what can be passed in
Column object only
For the max sql function, what can be passed in
String or Column object
How can you get the max of a column and the min of another column
.agg(
max("column_name"),
min("other_column"))
How do you count rows where there is a value in the column “price”
.select(count("price"))
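Example: a sketch of the aggregation cards above, assuming a local SparkSession. The items DataFrame is made up and includes one null price to show that count on a column skips nulls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, max, min}

object AggExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("agg-examples").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: item i2 has no price (null).
    val items = Seq[(String, Option[Double])](
      ("i1", Some(9.5)),
      ("i2", None),
      ("i3", Some(3.0))
    ).toDF("item_id", "price")

    items.agg(max("price"), min("item_id")).show() // max of one column, min of another
    items.agg(max($"price")).show()                // max also accepts a Column object
    items.select(count("price")).show()            // counts only rows where price is not null
    items.select(count(lit(1))).show()             // counts every row, for comparison

    spark.stop()
  }
}
```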
What is a way to pull the common aggregate values for each column in a dataframe
.describe()
.describe takes what as input
column names
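Example: a sketch of describe with and without column names, assuming a local SparkSession and a made-up items DataFrame.

```scala
import org.apache.spark.sql.SparkSession

object DescribeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("describe-example").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val items = Seq(
      ("i1", 9.5),
      ("i2", 4.0),
      ("i3", 3.0)
    ).toDF("item_id", "price")

    items.describe().show()        // count, mean, stddev, min, max for numeric and string columns
    items.describe("price").show() // same statistics, restricted to the named column

    spark.stop()
  }
}
```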
The desc function takes what as input
A column name (string)