Spark Dataframe commands Flashcards
Describe a dataframe in your own words
A dataframe is a distributed collection of data organized into named columns, like a table with rows and columns
How do you read a table from Hive into a Spark dataframe with a select statement - Spark 2.x
spark.sql("select * from db.mytable")
How do you read a table from Hive into a Spark dataframe without a select statement - Spark 2.x
spark.table("db.myTable")
How to display a dataframe
df.show()
How to display exactly 100 rows of a dataframe
df.show(100)
Why do we pass True/False in show
show's truncate parameter defaults to True, which cuts long column values to 20 characters; passing truncate=False prints the full column contents. So df.show(truncate=False) expands the columns and True compresses them
Select specific columns
df.select('col1', 'col2', 'col3')
Can I pass a list of columns within the select statement
Yes. df.select(['col1', 'col2', 'col3'])
How do I change the column name without using withColumnRenamed?
df.selectExpr("col1", "col2 as test_column")
How do I pull specific rows from a dataframe - for example, rows where a certain boolean column is true
df.filter("col1 = true")
import statement to import functions
from pyspark.sql import functions as func
Get the total number of records in a dataframe
df.count()
How do I get the count of distinct values in a column?
df.select("col1").distinct().count() (or df.dropDuplicates(["col1"]).count() - note the subset must be passed as a list)
What is the difference between df.dropDuplicates() and df.dropDuplicates(["col1"])
dropDuplicates() removes rows that are duplicated across all columns; dropDuplicates(["col1"]) keeps only one row per distinct value of col1 (the subset of columns is passed as a list)
How do I see the schema of a dataframe
df.printSchema()
How do I see the column names along with the datatypes
df.printSchema() (or df.dtypes, which returns a Python list of (name, type) tuples)
How do I retrieve the columns to a python list
df.columns
df.columns() - Is this correct and what will be the output
No. columns is a property, not a method, so the parentheses shouldn't be present. Calling df.columns() raises a TypeError ('list' object is not callable)
How do I drop a column from a dataframe?
df.drop("col1")
df.drop(["col1","col2"]) - Is this correct and why
No. drop takes column names (or Columns) as separate arguments, not a list, so passing a list raises a TypeError. Use df.drop("col1", "col2"), or unpack a list with df.drop(*["col1", "col2"])
Groupby syntax with count
df.groupBy("col1").agg(func.count("col2"))
Order the rows in a dataframe on a certain column.
df.orderBy(func.asc("col1"))
Groupby on multiple columns syntax with count
df.groupBy("col1", "col2").agg(func.count("col2"))
Order the rows in a dataframe on multiple columns.
df.orderBy(func.asc("col1"), func.desc("col2"))
case expression general syntax
case when col1 = 'Y' then 'True' when col1 = 'N' then 'False' else 'NA' end
drop multiple columns
df.drop("col1", "col2")